scholarly journals Overlap graph-based generation of haplotigs for diploids and polyploids

2018 ◽  
Author(s):  
Jasmijn A. Baaijens ◽  
Alexander Schönhuth

AbstractHaplotype aware genome assembly plays an important role in genetics, medicine, and various other disciplines, yet generation of haplotype-resolved de novo assemblies remains a major challenge. Beyond distinguishing between errors and true sequential variants, one needs to assign the true variants to the different genome copies. Recent work has pointed out that the enormous quantities of traditional NGS read data have been greatly underexploited in terms of haplotig computation so far, which reflects that methodology for reference independent haplotig computation has not yet reached maturity. We present POLYTE (POLYploid genome fitTEr) as a new approach to de novo generation of haplotigs for diploid and polyploid genomes. Our method follows an iterative scheme where in each iteration reads or contigs are joined, based on their interplay in terms of an underlying haplotype-aware overlap graph. Along the iterations, contigs grow while preserving their haplotype identity. Benchmarking experiments on both real and simulated data demonstrate that POLYTE establishes new standards in terms of error-free reconstruction of haplotype-specific sequence. As a consequence, POLYTE outperforms state-of-the-art approaches in various relevant aspects, where advantages become particularly distinct in polyploid settings. POLYTE is freely available as part of the HaploConduct package at https://github.com/HaploConduct/HaploConduct, implemented in Python and C++.

2019 ◽  
Vol 35 (21) ◽  
pp. 4281-4289 ◽  
Author(s):  
Jasmijn A Baaijens ◽  
Alexander Schönhuth

Abstract Motivation Haplotype-aware genome assembly plays an important role in genetics, medicine and various other disciplines, yet generation of haplotype-resolved de novo assemblies remains a major challenge. Beyond distinguishing between errors and true sequential variants, one needs to assign the true variants to the different genome copies. Recent work has pointed out that the enormous quantities of traditional NGS read data have been greatly underexploited in terms of haplotig computation so far, which reflects that methodology for reference independent haplotig computation has not yet reached maturity. Results We present POLYploid genome fitTEr (POLYTE) as a new approach to de novo generation of haplotigs for diploid and polyploid genomes of known ploidy. Our method follows an iterative scheme where in each iteration reads or contigs are joined, based on their interplay in terms of an underlying haplotype-aware overlap graph. Along the iterations, contigs grow while preserving their haplotype identity. Benchmarking experiments on both real and simulated data demonstrate that POLYTE establishes new standards in terms of error-free reconstruction of haplotype-specific sequence. As a consequence, POLYTE outperforms state-of-the-art approaches in various relevant aspects, where advantages become particularly distinct in polyploid settings. Availability and implementation POLYTE is freely available as part of the HaploConduct package at https://github.com/HaploConduct/HaploConduct, implemented in Python and C++. Supplementary information Supplementary data are available at Bioinformatics online.


2018 ◽  
Author(s):  
Avantika Lal ◽  
Keli Liu ◽  
Robert Tibshirani ◽  
Arend Sidow ◽  
Daniele Ramazzotti

AbstractCancer is the result of mutagenic processes that can be inferred from tumor genomes by analyzing rate spectra of point mutations, or “mutational signatures”. Here we present SparseSignatures, a novel framework to extract signatures from somatic point mutation data. Our approach incorporates DNA replication error as a background, employs regularization to reduce noise in non-background signatures, uses cross-validation to identify the number of signatures, and is scalable to large datasets. We show that SparseSignatures outperforms current state-of-the-art methods on simulated data using standard metrics. We then apply SparseSignatures to whole genome sequences of 147 tumors from pancreatic cancer, discovering 8 signatures in addition to the background.


2020 ◽  
Vol 21 (1) ◽  
Author(s):  
Mikko Rautiainen ◽  
Tobias Marschall

Abstract Genome graphs can represent genetic variation and sequence uncertainty. Aligning sequences to genome graphs is key to many applications, including error correction, genome assembly, and genotyping of variants in a pangenome graph. Yet, so far, this step is often prohibitively slow. We present GraphAligner, a tool for aligning long reads to genome graphs. Compared to the state-of-the-art tools, GraphAligner is 13x faster and uses 3x less memory. When employing GraphAligner for error correction, we find it to be more than twice as accurate and over 12x faster than extant tools.Availability: Package manager: https://anaconda.org/bioconda/graphalignerand source code: https://github.com/maickrau/GraphAligner


2019 ◽  
Author(s):  
Priyanka Ghosh ◽  
Sriram Krishnamoorthy ◽  
Ananth Kalyanaraman

AbstractDe novo genome assembly is a fundamental problem in the field of bioinformatics, that aims to assemble the DNA sequence of an unknown genome from numerous short DNA fragments (aka reads) obtained from it. With the advent of high-throughput sequencing technologies, billions of reads can be generated in a matter of hours, necessitating efficient parallelization of the assembly process. While multiple parallel solutions have been proposed in the past, conducting a large-scale assembly at scale remains a challenging problem because of the inherent complexities associated with data movement, and irregular access footprints of memory and I/O operations. In this paper, we present a novel algorithm, called PaKman, to address the problem of performing large-scale genome assemblies on a distributed memory parallel computer. Our approach focuses on improving performance through a combination of novel data structures and algorithmic strategies for reducing the communication and I/O footprint during the assembly process. PaKman presents a solution for the two most time-consuming phases in the full genome assembly pipeline, namely, k-mer counting and contig generation.A key aspect of our algorithm is its graph data structure, which comprises fat nodes (or what we call “macro-nodes”) that reduce the communication burden during contig generation. We present an extensive performance and qualitative evaluation of our algorithm, including comparisons to other state-of-the-art parallel assemblers. Our results demonstrate the ability to achieve near-linear speedups on up to 8K cores (tested); outperform state-of-the-art distributed memory and shared memory tools in performance while delivering comparable (if not better) quality; and reduce time to solution significantly. For instance, PaKman is able to generate a high-quality set of assembled contigs for complex genomes such as the human and wheat genomes in a matter of minutes on 8K cores.


2019 ◽  
Author(s):  
Mikko Rautiainen ◽  
Tobias Marschall

AbstractGenome graphs can represent genetic variation and sequence uncertainty. Aligning sequences to genome graphs is key to many applications, including error correction, genome assembly, and genotyping of variants in a pan-genome graph. Yet, so far this step is often prohibitively slow. We present GraphAligner, a tool for aligning long reads to genome graphs. Compared to state-of-the-art tools, GraphAligner is 12x faster and uses 5x less memory, making it as efficient as aligning reads to linear reference genomes. When employing GraphAligner for error correction, we find it to be almost 3x more accurate and over 15x faster than extant tools.Availability Package managerhttps://anaconda.org/bioconda/graphaligner and source code: https://github.com/maickrau/GraphAligner


2021 ◽  
Vol 17 (6) ◽  
pp. e1009119
Author(s):  
Avantika Lal ◽  
Keli Liu ◽  
Robert Tibshirani ◽  
Arend Sidow ◽  
Daniele Ramazzotti

Cancer is the result of mutagenic processes that can be inferred from tumor genomes by analyzing rate spectra of point mutations, or “mutational signatures”. Here we present SparseSignatures, a novel framework to extract signatures from somatic point mutation data. Our approach incorporates a user-specified background signature, employs regularization to reduce noise in non-background signatures, uses cross-validation to identify the number of signatures, and is scalable to large datasets. We show that SparseSignatures outperforms current state-of-the-art methods on simulated data using a variety of standard metrics. We then apply SparseSignatures to whole genome sequences of pancreatic and breast tumors, discovering well-differentiated signatures that are linked to known mutagenic mechanisms and are strongly associated with patient clinical features.


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
A. S. Boev ◽  
A. S. Rakitko ◽  
S. R. Usmanov ◽  
A. N. Kobzeva ◽  
I. V. Popov ◽  
...  

AbstractRecent advances in DNA sequencing open prospects to make whole-genome analysis rapid and reliable, which is promising for various applications including personalized medicine. However, existing techniques for de novo genome assembly, which is used for the analysis of genomic rearrangements, chromosome phasing, and reconstructing genomes without a reference, require solving tasks of high computational complexity. Here we demonstrate a method for solving genome assembly tasks with the use of quantum and quantum-inspired optimization techniques. Within this method, we present experimental results on genome assembly using quantum annealers both for simulated data and the $$\phi $$ ϕ X 174 bacteriophage. Our results pave a way for a significant increase in the efficiency of solving bioinformatics problems with the use of quantum computing technologies and, in particular, quantum annealing might be an effective method. We expect that the new generation of quantum annealing devices would outperform existing techniques for de novo genome assembly. To the best of our knowledge, this is the first experimental study of de novo genome assembly problems both for real and synthetic data on quantum annealing devices and quantum-inspired techniques.


2015 ◽  
Author(s):  
Kristoffer Sahlin ◽  
Mattias Frånberg ◽  
Lars Arvestad

Insert size distributions from paired read protocols are used for inference in bioinformatic applications such as genome assembly and structural variation detection. However, many of the models that are being used are subject to bias. This bias arises when we assume that all insert sizes within a distribution are equally likely to be observed, when in fact, size matters. These systematic errors exist in popular software even when the assumptions made about data are true. We have previously shown that bias occurs for scaffolders in genome assembly. Here, we generalize the theory and demonstrate that it is applicable in other contexts. We provide examples of bias in state-of the-art software and improve them using our model. One key application of our theory is structural variation detection using read pairs. We show that an incorrect null-hypothesis is commonly used in popular tools and can be corrected using our theory. Furthermore, we approximate the smallest size of indels that are possible to discover given an insert size distribution. Two other applications are inference of insert size distribution on \emph{de novo} genome assemblies and error correction of genome assemblies using mated reads. Our theory is implemented in a tool called GetDistr (\url{https://github.com/ksahlin/GetDistr}).


Author(s):  
Axel Andersson ◽  
Ferran Diego ◽  
Fred A. Hamprecht ◽  
Carolina Wählby

In Situ Transcriptomics (IST) is a set of image-based transcriptomics approaches that enables localisation of gene expression directly in tissue samples. IST techniques produce multiplexed image series in which fluorescent spots are either present or absent across imaging rounds and colour channels. A spot’s presence and absence form a type of barcoded pattern that labels a particular type of mRNA. Therefore, the expression of a gene can be determined by localising the fluorescent spots and decode the barcode that they form. Existing IST algorithms usually do this in two separate steps: spot localisation and barcode decoding. Although these algorithms are efficient, they are limited by strictly separating the localisation and decoding steps. This limitation becomes apparent in regions with low signal-to-noise ratio or high spot densities. We argue that an improved gene expression decoding can be obtained by combining these two steps into a single algorithm. This allows for an efficient decoding that is less sensitive to noise and optical crowding.We present IST Decoding by Deconvolution (ISTDECO), a principled decoding approach combining spectral and spatial deconvolution into a single algorithm. We evaluate ISTDECO on simulated data, as well as on two real IST datasets, and compare with state-of-the-art. ISTDECO achieves state-of-the-art performance despite high spot densities and low signal-to-noise ratios. It is easily implemented and runs efficiently using a GPU.ISTDECO implementation, datasets and demos are available online at:github.com/axanderssonuu/istdeco


2018 ◽  
Author(s):  
Igor Mandric ◽  
Alex Zelikovsky

AbstractOne of the most important steps in genome assembly is scaffolding. Increasing the length of sequencing reads allows assembling short genomes but assembly of long repeat-rich genomes remains one of the most interesting and challenging problems in bioinformatics. There is a high demand in developing computational approaches for repeat aware scaffolding. In this paper, we propose a novel repeat-aware scaffolder BATISCAF based on the optimization formulation for filtering out repeated and short contigs. Our experiments with five benchmarking datasets show that the proposed tool BATISCAF outperforms state-of-the-art tools. BATISCAF is freely available on GitHub: https://github.com/mandricigor/batiscaf.


Sign in / Sign up

Export Citation Format

Share Document