Overlap graph-based generation of haplotigs for diploids and polyploids

Mapping Intimacies ◽

10.1101/378356 ◽

2018 ◽

Author(s):

Jasmijn A. Baaijens ◽

Alexander Schönhuth

Keyword(s):

Recent Work ◽

Genome Assembly ◽

De Novo ◽

Iterative Scheme ◽

State Of The Art ◽

Simulated Data ◽

Specific Sequence ◽

New Approach ◽

Link Type ◽

Polyploid Genome

Abstract: Haplotype-aware genome assembly plays an important role in genetics, medicine, and various other disciplines, yet generation of haplotype-resolved de novo assemblies remains a major challenge. Beyond distinguishing between errors and true sequential variants, one needs to assign the true variants to the different genome copies. Recent work has pointed out that the enormous quantities of traditional NGS read data have been greatly underexploited in terms of haplotig computation so far, which reflects that methodology for reference-independent haplotig computation has not yet reached maturity. We present POLYTE (POLYploid genome fitTEr) as a new approach to de novo generation of haplotigs for diploid and polyploid genomes. Our method follows an iterative scheme where in each iteration reads or contigs are joined, based on their interplay in terms of an underlying haplotype-aware overlap graph. Along the iterations, contigs grow while preserving their haplotype identity. Benchmarking experiments on both real and simulated data demonstrate that POLYTE establishes new standards in terms of error-free reconstruction of haplotype-specific sequence. As a consequence, POLYTE outperforms state-of-the-art approaches in various relevant aspects, where advantages become particularly distinct in polyploid settings. POLYTE is freely available as part of the HaploConduct package at https://github.com/HaploConduct/HaploConduct, implemented in Python and C++.


Overlap graph-based generation of haplotigs for diploids and polyploids

Bioinformatics ◽

10.1093/bioinformatics/btz255 ◽

2019 ◽

Vol 35 (21) ◽

pp. 4281-4289 ◽

Cited By ~ 1

Author(s):

Jasmijn A Baaijens ◽

Alexander Schönhuth

Keyword(s):

Genome Assembly ◽

De Novo ◽

Iterative Scheme ◽

State Of The Art ◽

Simulated Data ◽

Supplementary Information ◽

Supplementary Data ◽

Specific Sequence ◽

New Approach ◽

Polyploid Genome

Abstract

Motivation: Haplotype-aware genome assembly plays an important role in genetics, medicine and various other disciplines, yet generation of haplotype-resolved de novo assemblies remains a major challenge. Beyond distinguishing between errors and true sequential variants, one needs to assign the true variants to the different genome copies. Recent work has pointed out that the enormous quantities of traditional NGS read data have been greatly underexploited in terms of haplotig computation so far, which reflects that methodology for reference-independent haplotig computation has not yet reached maturity.

Results: We present POLYploid genome fitTEr (POLYTE) as a new approach to de novo generation of haplotigs for diploid and polyploid genomes of known ploidy. Our method follows an iterative scheme where in each iteration reads or contigs are joined, based on their interplay in terms of an underlying haplotype-aware overlap graph. Along the iterations, contigs grow while preserving their haplotype identity. Benchmarking experiments on both real and simulated data demonstrate that POLYTE establishes new standards in terms of error-free reconstruction of haplotype-specific sequence. As a consequence, POLYTE outperforms state-of-the-art approaches in various relevant aspects, where advantages become particularly distinct in polyploid settings.

Availability and implementation: POLYTE is freely available as part of the HaploConduct package at https://github.com/HaploConduct/HaploConduct, implemented in Python and C++.

Supplementary information: Supplementary data are available at Bioinformatics online.
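
The idea of joining reads through an overlap graph can be illustrated with a toy sketch. It is not POLYTE's actual overlap criterion, which the abstract does not specify: here it is simply assumed that two reads are linked only when a suffix of one matches a prefix of the other exactly, so reads that differ at a variant position are kept apart. The function names, the reads, and the `min_len` threshold are all hypothetical.

```python
def exact_overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no exact overlap of at least min_len (>= 1) exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def build_overlap_graph(reads, min_len):
    """Directed edges (i, j) -> overlap length for every ordered read pair."""
    edges = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                k = exact_overlap(a, b, min_len)
                if k:
                    edges[(i, j)] = k
    return edges


# Reads 1 and 2 differ at a single position (A vs T), mimicking two haplotypes.
reads = ["ACGTAC", "GTACGG", "GTTCGG"]
edges = build_overlap_graph(reads, min_len=3)
print(edges)
```

In this toy input only read 0 and read 1 are linked; read 2 carries the alternative allele and therefore has no exact overlap with read 0.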


De Novo Mutational Signature Discovery in Tumor Genomes using SparseSignatures

10.1101/384834 ◽

2018 ◽

Cited By ~ 5

Author(s):

Avantika Lal ◽

Keli Liu ◽

Robert Tibshirani ◽

Arend Sidow ◽

Daniele Ramazzotti

Keyword(s):

Cross Validation ◽

De Novo ◽

State Of The Art ◽

Point Mutations ◽

Simulated Data ◽

Large Datasets ◽

Genome Sequences ◽

Mutational Signatures ◽

Mutational Signature ◽

Current State

Abstract: Cancer is the result of mutagenic processes that can be inferred from tumor genomes by analyzing rate spectra of point mutations, or “mutational signatures”. Here we present SparseSignatures, a novel framework to extract signatures from somatic point mutation data. Our approach incorporates DNA replication error as a background, employs regularization to reduce noise in non-background signatures, uses cross-validation to identify the number of signatures, and is scalable to large datasets. We show that SparseSignatures outperforms current state-of-the-art methods on simulated data using standard metrics. We then apply SparseSignatures to whole genome sequences of 147 tumors from pancreatic cancer, discovering 8 signatures in addition to the background.


GraphAligner: rapid and versatile sequence-to-graph alignment

Genome Biology ◽

10.1186/s13059-020-02157-2 ◽

2020 ◽

Vol 21 (1) ◽

Cited By ~ 1

Author(s):

Mikko Rautiainen ◽

Tobias Marschall

Keyword(s):

Genetic Variation ◽

Error Correction ◽

Genome Assembly ◽

State Of The Art ◽

Source Code ◽

The State ◽

Graph Alignment ◽

Link Type ◽

Long Reads

Abstract: Genome graphs can represent genetic variation and sequence uncertainty. Aligning sequences to genome graphs is key to many applications, including error correction, genome assembly, and genotyping of variants in a pangenome graph. Yet, so far, this step is often prohibitively slow. We present GraphAligner, a tool for aligning long reads to genome graphs. Compared to the state-of-the-art tools, GraphAligner is 13x faster and uses 3x less memory. When employing GraphAligner for error correction, we find it to be more than twice as accurate and over 12x faster than extant tools. Availability: package manager: https://anaconda.org/bioconda/graphaligner and source code: https://github.com/maickrau/GraphAligner


PaKman: Scalable Assembly of Large Genomes on Distributed Memory Machines

10.1101/523068 ◽

2019 ◽

Author(s):

Priyanka Ghosh ◽

Sriram Krishnamoorthy ◽

Ananth Kalyanaraman

Keyword(s):

Genome Assembly ◽

Large Scale ◽

Distributed Memory ◽

High Throughput Sequencing ◽

De Novo ◽

State Of The Art ◽

Fundamental Problem ◽

Parallel Computer ◽

Assembly Process ◽

Data Movement

Abstract: De novo genome assembly is a fundamental problem in bioinformatics that aims to assemble the DNA sequence of an unknown genome from numerous short DNA fragments (aka reads) obtained from it. With the advent of high-throughput sequencing technologies, billions of reads can be generated in a matter of hours, necessitating efficient parallelization of the assembly process. While multiple parallel solutions have been proposed in the past, conducting a large-scale assembly remains a challenging problem because of the inherent complexities associated with data movement, and irregular access footprints of memory and I/O operations. In this paper, we present a novel algorithm, called PaKman, to address the problem of performing large-scale genome assemblies on a distributed memory parallel computer. Our approach focuses on improving performance through a combination of novel data structures and algorithmic strategies for reducing the communication and I/O footprint during the assembly process. PaKman presents a solution for the two most time-consuming phases in the full genome assembly pipeline, namely, k-mer counting and contig generation. A key aspect of our algorithm is its graph data structure, which comprises fat nodes (or what we call “macro-nodes”) that reduce the communication burden during contig generation. We present an extensive performance and qualitative evaluation of our algorithm, including comparisons to other state-of-the-art parallel assemblers. Our results demonstrate the ability to achieve near-linear speedups on up to 8K cores (tested); outperform state-of-the-art distributed memory and shared memory tools in performance while delivering comparable (if not better) quality; and reduce time to solution significantly. For instance, PaKman is able to generate a high-quality set of assembled contigs for complex genomes such as the human and wheat genomes in a matter of minutes on 8K cores.
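
The k-mer counting phase named in the abstract can be illustrated with a serial toy sketch. This is not PaKman's distributed algorithm and says nothing about its macro-node structure; the function name `count_kmers` and the example reads are hypothetical.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring (k-mer) across a collection of reads."""
    counts = Counter()
    for read in reads:
        # A read of length n contributes n - k + 1 k-mers (none if n < k).
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


kmer_counts = count_kmers(["ACGTACG", "CGTA"], k=3)
print(kmer_counts.most_common(2))
```

A distributed assembler would have to partition such a table across processes, which is where the data movement costs described in the abstract arise.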


GraphAligner: Rapid and Versatile Sequence-to-Graph Alignment

10.1101/810812 ◽

2019 ◽

Cited By ~ 9

Author(s):

Mikko Rautiainen ◽

Tobias Marschall

Keyword(s):

Genetic Variation ◽

Error Correction ◽

Genome Assembly ◽

State Of The Art ◽

Source Code ◽

Graph Alignment ◽

Link Type ◽

Long Reads ◽

Reference Genomes ◽

Genome Graph

Abstract: Genome graphs can represent genetic variation and sequence uncertainty. Aligning sequences to genome graphs is key to many applications, including error correction, genome assembly, and genotyping of variants in a pan-genome graph. Yet, so far this step is often prohibitively slow. We present GraphAligner, a tool for aligning long reads to genome graphs. Compared to state-of-the-art tools, GraphAligner is 12x faster and uses 5x less memory, making it as efficient as aligning reads to linear reference genomes. When employing GraphAligner for error correction, we find it to be almost 3x more accurate and over 15x faster than extant tools. Availability: package manager: https://anaconda.org/bioconda/graphaligner and source code: https://github.com/maickrau/GraphAligner


De novo mutational signature discovery in tumor genomes using SparseSignatures

PLoS Computational Biology ◽

10.1371/journal.pcbi.1009119 ◽

2021 ◽

Vol 17 (6) ◽

pp. e1009119

Author(s):

Avantika Lal ◽

Keli Liu ◽

Robert Tibshirani ◽

Arend Sidow ◽

Daniele Ramazzotti

Keyword(s):

Cross Validation ◽

De Novo ◽

State Of The Art ◽

Point Mutations ◽

Simulated Data ◽

Large Datasets ◽

Genome Sequences ◽

Mutational Signatures ◽

Current State ◽

Well Differentiated

Cancer is the result of mutagenic processes that can be inferred from tumor genomes by analyzing rate spectra of point mutations, or “mutational signatures”. Here we present SparseSignatures, a novel framework to extract signatures from somatic point mutation data. Our approach incorporates a user-specified background signature, employs regularization to reduce noise in non-background signatures, uses cross-validation to identify the number of signatures, and is scalable to large datasets. We show that SparseSignatures outperforms current state-of-the-art methods on simulated data using a variety of standard metrics. We then apply SparseSignatures to whole genome sequences of pancreatic and breast tumors, discovering well-differentiated signatures that are linked to known mutagenic mechanisms and are strongly associated with patient clinical features.


Genome assembly using quantum and quantum-inspired annealing

Scientific Reports ◽

10.1038/s41598-021-88321-5 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

A. S. Boev ◽

A. S. Rakitko ◽

S. R. Usmanov ◽

A. N. Kobzeva ◽

I. V. Popov ◽

...

Keyword(s):

Genome Assembly ◽

De Novo ◽

Synthetic Data ◽

Simulated Data ◽

Optimization Techniques ◽

Quantum Annealing ◽

Whole Genome Analysis ◽

De Novo Genome Assembly ◽

New Generation ◽

Assembly Tasks

Abstract: Recent advances in DNA sequencing open prospects to make whole-genome analysis rapid and reliable, which is promising for various applications including personalized medicine. However, existing techniques for de novo genome assembly, which is used for the analysis of genomic rearrangements, chromosome phasing, and reconstructing genomes without a reference, require solving tasks of high computational complexity. Here we demonstrate a method for solving genome assembly tasks with the use of quantum and quantum-inspired optimization techniques. Within this method, we present experimental results on genome assembly using quantum annealers both for simulated data and the ϕX174 bacteriophage. Our results pave the way for a significant increase in the efficiency of solving bioinformatics problems with quantum computing technologies; in particular, quantum annealing might be an effective method. We expect that the new generation of quantum annealing devices would outperform existing techniques for de novo genome assembly. To the best of our knowledge, this is the first experimental study of de novo genome assembly problems both for real and synthetic data on quantum annealing devices and quantum-inspired techniques.


Correcting bias from stochastic insert size in read pair data — applications to structural variation detection and genome assembly

10.1101/023929 ◽

2015 ◽

Cited By ~ 1

Author(s):

Kristoffer Sahlin ◽

Mattias Frånberg ◽

Lars Arvestad

Keyword(s):

Size Distribution ◽

Genome Assembly ◽

Structural Variation ◽

De Novo ◽

State Of The Art ◽

Size Distributions ◽

Insert Size ◽

Genome Assemblies ◽

Paired Read ◽

Insert Size Distribution

Insert size distributions from paired read protocols are used for inference in bioinformatic applications such as genome assembly and structural variation detection. However, many of the models that are being used are subject to bias. This bias arises when we assume that all insert sizes within a distribution are equally likely to be observed, when in fact, size matters. These systematic errors exist in popular software even when the assumptions made about data are true. We have previously shown that bias occurs for scaffolders in genome assembly. Here, we generalize the theory and demonstrate that it is applicable in other contexts. We provide examples of bias in state-of-the-art software and improve them using our model. One key application of our theory is structural variation detection using read pairs. We show that an incorrect null hypothesis is commonly used in popular tools and can be corrected using our theory. Furthermore, we approximate the smallest size of indels that are possible to discover given an insert size distribution. Two other applications are inference of insert size distribution on de novo genome assemblies and error correction of genome assemblies using mated reads. Our theory is implemented in a tool called GetDistr (https://github.com/ksahlin/GetDistr).
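
The claim that "size matters" can be illustrated with a small worked example of length-biased sampling. This is a simplified model and not the GetDistr implementation: it assumes fragments are placed uniformly along the genome, so the chance that a fragment spans a fixed position (for example a breakpoint) is taken to be proportional to its insert size. The function names and the three-point library distribution are hypothetical.

```python
def length_biased(pmf):
    """Insert size distribution among fragments that span a fixed position.

    pmf maps insert size -> probability in the sequencing library.
    Under the uniform-placement assumption, each size x is reweighted by x.
    """
    weights = {x: x * p for x, p in pmf.items()}
    total = sum(weights.values())
    return {x: w / total for x, w in weights.items()}


def mean(pmf):
    """Expected insert size under a discrete distribution."""
    return sum(x * p for x, p in pmf.items())


library = {200: 0.25, 300: 0.5, 400: 0.25}
observed = length_biased(library)
print(mean(library), mean(observed))
```

Under these assumptions the spanning fragments are on average longer than the library mean, so a model that treats both distributions as identical would carry a systematic error of the kind the abstract describes.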


ISTDECO: In Situ Transcriptomics Decoding by Deconvolution

10.1101/2021.03.01.433040 ◽

2021 ◽

Cited By ~ 1

Author(s):

Axel Andersson ◽

Ferran Diego ◽

Fred A. Hamprecht ◽

Carolina Wählby

Keyword(s):

Gene Expression ◽

State Of The Art ◽

Signal To Noise Ratio ◽

Simulated Data ◽

Signal To Noise ◽

Tissue Samples ◽

Image Series ◽

Link Type ◽

Efficient Decoding

In Situ Transcriptomics (IST) is a set of image-based transcriptomics approaches that enables localisation of gene expression directly in tissue samples. IST techniques produce multiplexed image series in which fluorescent spots are either present or absent across imaging rounds and colour channels. A spot’s presence and absence form a type of barcoded pattern that labels a particular type of mRNA. Therefore, the expression of a gene can be determined by localising the fluorescent spots and decoding the barcode that they form. Existing IST algorithms usually do this in two separate steps: spot localisation and barcode decoding. Although these algorithms are efficient, they are limited by strictly separating the localisation and decoding steps. This limitation becomes apparent in regions with low signal-to-noise ratio or high spot densities. We argue that an improved gene expression decoding can be obtained by combining these two steps into a single algorithm. This allows for an efficient decoding that is less sensitive to noise and optical crowding. We present IST Decoding by Deconvolution (ISTDECO), a principled decoding approach combining spectral and spatial deconvolution into a single algorithm. We evaluate ISTDECO on simulated data, as well as on two real IST datasets, and compare with state-of-the-art methods. ISTDECO achieves state-of-the-art performance despite high spot densities and low signal-to-noise ratios. It is easily implemented and runs efficiently using a GPU. ISTDECO implementation, datasets and demos are available online at: github.com/axanderssonuu/istdeco


Solving scaffolding problem with repeats

10.1101/330472 ◽

2018 ◽

Author(s):

Igor Mandric ◽

Alex Zelikovsky

Keyword(s):

Genome Assembly ◽

State Of The Art ◽

High Demand ◽

Computational Approaches ◽

Link Type ◽

Optimization Formulation

Abstract: One of the most important steps in genome assembly is scaffolding. Increasing the length of sequencing reads allows assembling short genomes, but assembly of long repeat-rich genomes remains one of the most interesting and challenging problems in bioinformatics. There is a high demand for computational approaches to repeat-aware scaffolding. In this paper, we propose a novel repeat-aware scaffolder, BATISCAF, based on an optimization formulation for filtering out repeated and short contigs. Our experiments with five benchmarking datasets show that the proposed tool BATISCAF outperforms state-of-the-art tools. BATISCAF is freely available on GitHub: https://github.com/mandricigor/batiscaf.
