GraphAligner: Rapid and Versatile Sequence-to-Graph Alignment

bioRxiv ◽

10.1101/810812 ◽

2019 ◽

Cited By ~ 9

Author(s):

Mikko Rautiainen ◽

Tobias Marschall

Keyword(s):

Genetic Variation ◽

Error Correction ◽

Genome Assembly ◽

State Of The Art ◽

Source Code ◽

Graph Alignment ◽

Link Type ◽

Long Reads ◽

Reference Genomes ◽

Genome Graph

Abstract: Genome graphs can represent genetic variation and sequence uncertainty. Aligning sequences to genome graphs is key to many applications, including error correction, genome assembly, and genotyping of variants in a pan-genome graph. Yet, so far this step is often prohibitively slow. We present GraphAligner, a tool for aligning long reads to genome graphs. Compared to state-of-the-art tools, GraphAligner is 12x faster and uses 5x less memory, making it as efficient as aligning reads to linear reference genomes. When employing GraphAligner for error correction, we find it to be almost 3x more accurate and over 15x faster than extant tools. Availability: Package manager: https://anaconda.org/bioconda/graphaligner and source code: https://github.com/maickrau/GraphAligner
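To make the idea of sequence-to-graph alignment concrete, the following is a minimal sketch, not GraphAligner's actual algorithm (the abstract does not describe it). It assumes a small acyclic graph in which every node carries exactly one base, and it computes the minimum edit distance between a read and any path in the graph with a plain dynamic program over a topological order. The function name `align_to_dag` and the input layout (`labels`, `edges`) are hypothetical.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def align_to_dag(read, labels, edges):
    """Minimum edit distance between `read` and any path in a DAG.

    labels: dict node -> single base (assumed layout, one base per node)
    edges:  iterable of (u, v) pairs meaning u precedes v
    The path may start and end at any node; the whole read must be aligned.
    """
    preds = {v: [] for v in labels}
    for u, v in edges:
        preds[v].append(u)

    m = len(read)
    # Virtual start column: an empty path, so read[:j] costs j insertions.
    start = list(range(m + 1))
    columns = {}
    best = m  # aligning the read to the empty path

    # TopologicalSorter takes node -> predecessors and yields predecessors first.
    for v in TopologicalSorter(preds).static_order():
        sources = [columns[p] for p in preds[v]] + [start]
        cur = [min(s[0] for s in sources) + 1]  # delete v's base
        for j in range(1, m + 1):
            mismatch = 0 if read[j - 1] == labels[v] else 1
            cur.append(min(
                min(s[j - 1] for s in sources) + mismatch,  # match or substitute
                min(s[j] for s in sources) + 1,             # skip v's base
                cur[j - 1] + 1,                             # extra base in the read
            ))
        columns[v] = cur
        best = min(best, cur[m])
    return best


# A single-SNP bubble: A -> (C | G) -> T
labels = {0: "A", 1: "C", 2: "G", 3: "T"}
edges = [(0, 1), (0, 2), (1, 3), (2, 3)]
print(align_to_dag("AGT", labels, edges))  # 0
print(align_to_dag("AAT", labels, edges))  # 1
```

This sketch costs time proportional to read length times the number of edges and handles neither cycles nor long node labels, so it illustrates the problem statement rather than a practical aligner.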

GraphAligner: rapid and versatile sequence-to-graph alignment

Genome Biology ◽

10.1186/s13059-020-02157-2 ◽

2020 ◽

Vol 21 (1) ◽

Cited By ~ 1

Author(s):

Mikko Rautiainen ◽

Tobias Marschall

Keyword(s):

Genetic Variation ◽

Error Correction ◽

Genome Assembly ◽

State Of The Art ◽

Source Code ◽

The State ◽

Graph Alignment ◽

Link Type ◽

Long Reads

Abstract: Genome graphs can represent genetic variation and sequence uncertainty. Aligning sequences to genome graphs is key to many applications, including error correction, genome assembly, and genotyping of variants in a pangenome graph. Yet, so far, this step is often prohibitively slow. We present GraphAligner, a tool for aligning long reads to genome graphs. Compared to the state-of-the-art tools, GraphAligner is 13x faster and uses 3x less memory. When employing GraphAligner for error correction, we find it to be more than twice as accurate and over 12x faster than extant tools. Availability: Package manager: https://anaconda.org/bioconda/graphaligner and source code: https://github.com/maickrau/GraphAligner

Tapestry: validate and edit small eukaryotic genome assemblies with long reads

10.1101/2020.04.24.059402 ◽

2020 ◽

Author(s):

John W. Davey ◽

Seth J. Davis ◽

Jeremy C. Mottram ◽

Peter D. Ashton

Keyword(s):

Genome Assembly ◽

Source Code ◽

Gc Content ◽

Eukaryotic Genome ◽

Link Type ◽

Long Reads ◽

Genome Assemblies

Abstract: Summary: Small eukaryotic genome assemblies based on long reads are often close to complete, but still require validation and editing. Tapestry produces an interactive report which can be used to validate, sort and filter the contigs in a raw genome assembly, taking into account GC content, telomeres, read depths, contig alignments and read alignments. The report can be shared with collaborators and included as supplemental material in publications. Availability: Source code is freely available at https://github.com/johnomics/tapestry. The package is freely available in Bioconda (https://anaconda.org/bioconda/tapestry).
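Two of the per-contig signals named in the abstract, GC content and telomeres, are simple to compute. The sketch below is only an illustration and is not taken from Tapestry's code: the function names, the default motif `TTAGGG`, and the 1000-base end window are assumptions chosen for the example.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def gc_content(seq):
    """Fraction of G/C among unambiguous bases; 0.0 if there are none."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def telomere_hits(seq, motif="TTAGGG", window=1000):
    """Count motif occurrences (either strand) in the first and last `window` bases.

    The motif and window size are illustrative defaults, not Tapestry's settings.
    """
    seq = seq.upper()
    rc = reverse_complement(motif)

    def count(region):
        # str.count counts non-overlapping occurrences.
        return region.count(motif) + region.count(rc)

    return count(seq[:window]), count(seq[-window:])


contig = "CCCTAA" * 3 + "ACGT" * 10 + "TTAGGG" * 2
print(gc_content("GGCCAATT"))            # 0.5
print(telomere_hits(contig, window=18))  # (3, 2)
```

A contig with many hits at both ends is a candidate for a complete, telomere-to-telomere chromosome, which is the kind of judgement such a report is meant to support.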

Enabling multiscale variation analysis with genome graphs

10.1101/2021.02.03.429603 ◽

2021 ◽

Author(s):

Brice Letcher ◽

Martin Hunt ◽

Zamin Iqbal

Keyword(s):

Genetic Variation ◽

Directed Acyclic Graph ◽

Structural Variation ◽

Reference Genome ◽

Multiple Scales ◽

State Of The Art ◽

Variant Calling ◽

Variation Analysis ◽

New Algorithms ◽

Genome Graph

Abstract: Background: Standard approaches to characterising genetic variation revolve around mapping reads to a reference genome and describing variants in terms of differences from the reference; this is based on the assumption that these differences will be small and provides a simple coordinate system. However, this fails, and the coordinates break down, when there are diverged haplotypes at a locus (e.g. one haplotype contains a multi-kilobase deletion, a second contains a few SNPs, and a third is highly diverged with hundreds of SNPs). To handle these, we need to model genetic variation that occurs at different length-scales (SNPs to large structural variants) and that occurs on alternate backgrounds. We refer to these together as multiscale variation. Results: We model the genome as a directed acyclic graph consisting of successive hierarchical subgraphs (“sites”) that naturally incorporate multiscale variation, and introduce an algorithm for genotyping, implemented in the software gramtools. This enables variant calling on different sequence backgrounds. In addition to producing regular VCF files, we introduce a JSON file format based on VCF, which records variant site relationships and alternate sequence backgrounds. We show two applications. First, we benchmark gramtools against existing state-of-the-art methods in joint-genotyping 17 M. tuberculosis samples at long deletions and the overlapping small variants that segregate in a cohort of 1,017 genomes. Second, in 706 African and SE Asian P. falciparum genomes, we analyse a dimorphic surface antigen gene which possesses variation on two diverged backgrounds which appeared to not recombine. This generates the first map of variation on both backgrounds, revealing patterns of recombination that were previously unknown. Conclusions: We need new approaches to be able to jointly analyse SNP and structural variation in cohorts, and even more to handle variants on different genetic backgrounds. We have demonstrated that by modelling with a directed, acyclic and locally hierarchical genome graph, we can apply new algorithms to accurately genotype dense variation at multiple scales. We also propose a generalisation of VCF for accessing multiscale variation in genome graphs, which we hope will be of wide utility.

Graphmap2 - splice-aware RNA-seq mapper for long reads

10.1101/720458 ◽

2019 ◽

Cited By ~ 1

Author(s):

Josip Marić ◽

Ivan Sović ◽

Krešimir Križanović ◽

Niranjan Nagarajan ◽

Mile Šikić

Keyword(s):

State Of The Art ◽

The State ◽

Rna Seq ◽

Link Type ◽

Pacific Biosciences ◽

Long Reads ◽

Oxford Nanopore

Abstract: In this paper we present Graphmap2, a splice-aware mapper built on our previously developed DNA mapper Graphmap. Graphmap2 is tailored for long reads produced by Pacific Biosciences and Oxford Nanopore devices. It uses several newly developed algorithms which enable higher precision and recall of correctly detected transcripts and exon boundaries. We compared its performance with the state-of-the-art tools Minimap2 and Gmap. On both simulated and real datasets, Graphmap2 achieves higher mappability and more correctly recognized exons and their ends. In addition, we present an analysis of the potential of splice-aware mappers and long reads for the identification of previously unknown isoforms and even genes. The Graphmap2 tool is publicly available at https://github.com/lbcb-sci/graphmap2.

Overlap graph-based generation of haplotigs for diploids and polyploids

10.1101/378356 ◽

2018 ◽

Author(s):

Jasmijn A. Baaijens ◽

Alexander Schönhuth

Keyword(s):

Recent Work ◽

Genome Assembly ◽

De Novo ◽

Iterative Scheme ◽

State Of The Art ◽

Simulated Data ◽

Specific Sequence ◽

New Approach ◽

Link Type ◽

Polyploid Genome

Abstract: Haplotype-aware genome assembly plays an important role in genetics, medicine, and various other disciplines, yet generation of haplotype-resolved de novo assemblies remains a major challenge. Beyond distinguishing between errors and true sequential variants, one needs to assign the true variants to the different genome copies. Recent work has pointed out that the enormous quantities of traditional NGS read data have been greatly underexploited in terms of haplotig computation so far, which reflects that methodology for reference-independent haplotig computation has not yet reached maturity. We present POLYTE (POLYploid genome fitTEr) as a new approach to de novo generation of haplotigs for diploid and polyploid genomes. Our method follows an iterative scheme where in each iteration reads or contigs are joined, based on their interplay in terms of an underlying haplotype-aware overlap graph. Along the iterations, contigs grow while preserving their haplotype identity. Benchmarking experiments on both real and simulated data demonstrate that POLYTE establishes new standards in terms of error-free reconstruction of haplotype-specific sequence. As a consequence, POLYTE outperforms state-of-the-art approaches in various relevant aspects, where advantages become particularly distinct in polyploid settings. POLYTE is freely available as part of the HaploConduct package at https://github.com/HaploConduct/HaploConduct, implemented in Python and C++.
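The overlap graph mentioned in the abstract can be illustrated with a toy construction. This sketch is not POLYTE's method: it uses exact suffix-prefix matching as a crude stand-in for a haplotype-aware overlap criterion, and the names `longest_overlap`, `overlap_graph`, and the `min_len` threshold are hypothetical.

```python
def longest_overlap(a, b, min_len):
    """Length of the longest suffix of `a` equal to a prefix of `b`, or 0.

    Only overlaps of at least `min_len` bases are reported.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def overlap_graph(reads, min_len=3):
    """Map (i, j) -> overlap length for every ordered read pair that overlaps.

    Requiring an exact match means reads that differ inside the overlap,
    for example because they come from different haplotypes, get no edge.
    """
    edges = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                k = longest_overlap(a, b, min_len)
                if k:
                    edges[(i, j)] = k
    return edges


# Read 0 can be extended by read 1 or read 2, which disagree after "GTAC".
reads = ["ACGTAC", "GTACGG", "GTACTT"]
print(overlap_graph(reads))  # {(0, 1): 4, (0, 2): 4}
```

In the example, reads 1 and 2 share no edge with each other because they differ after the common prefix, so the graph branches at read 0; a branch of this kind is where two haplotypes would separate.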

ELECTOR: Evaluator for long reads correction methods

10.1101/512889 ◽

2019 ◽

Cited By ~ 1

Author(s):

Camille Marchet ◽

Pierre Morisse ◽

Lolita Lecompte ◽

Arnaud Lefebvre ◽

Thierry Lecroq ◽

...

Keyword(s):

Error Correction ◽

State Of The Art ◽

Error Rates ◽

Sequencing Data ◽

Third Generation Sequencing ◽

Long Reads ◽

Wide Range ◽

Downstream Processes ◽

Generation Sequencing

Abstract: Motivation: In the last few years, the error rates of third generation sequencing data have been capped above 5%, including many insertions and deletions. Thereby, an increasing number of long reads correction methods have been proposed to reduce the noise in these sequences. Whether hybrid or self-correction methods, there exist multiple approaches to correct long reads. As the quality of the error correction has a huge impact on downstream processes, developing methods that evaluate error correction tools with precise and reliable statistics is a crucial need. Since error correction is often a resource bottleneck in long reads pipelines, a key feature of assessment methods is to be efficient, in order to allow the fast comparison of different tools. Results: We propose ELECTOR, a reliable and efficient tool to evaluate long reads correction, that enables the evaluation of hybrid and self-correction methods. Our tool provides a complete and relevant set of metrics to assess the read quality improvement after correction and scales to large datasets. ELECTOR is directly compatible with a wide range of state-of-the-art error correction tools, using either simulated or real long reads. We show that ELECTOR displays a wider range of metrics than the state-of-the-art tool, LRCstats, and, importantly, also decreases the runtime needed for assessment on all the studied datasets. Availability: ELECTOR is available at https://github.com/kamimrcht/

GPU accelerated partial order multiple sequence alignment for long reads self-correction

10.1101/2020.02.14.946939 ◽

2020 ◽

Author(s):

Francesco Peverelli ◽

Lorenzo Di Tucci ◽

Marco D. Santambrogio ◽

Nan Ding ◽

Steven Hofmeyr ◽

...

Keyword(s):

Error Correction ◽

Partial Order ◽

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Consensus Sequence ◽

Pairwise Alignment ◽

Multiple Sequence ◽

Graph Alignment ◽

Correction Process ◽

Long Reads

Abstract: As third generation sequencing technologies become more reliable and widely used to solve several genome-related problems, self-correction of long reads is becoming the preferred method to reduce the error rate of Pacific Biosciences and Oxford Nanopore long reads, which is now around 10-12%. Several of these self-correction methods rely on some form of Multiple Sequence Alignment (MSA) to obtain a consensus sequence for the original reads. In particular, error-correction tools such as RACON and CONSENT use Partial Order (PO) graph alignment to accomplish this task. PO graph alignment, which is computationally more expensive than optimal global pairwise alignment between two sequences, needs to be performed several times for each read during the error correction process. GPUs have proven very effective in accelerating several compute-intensive tasks in different scientific fields. We harnessed the power of these architectures to accelerate the error correction process of existing self-correction tools, to improve the efficiency of this step of genome analysis. In this paper, we introduce a GPU-accelerated version of the PO alignment presented in the POA v2 software library, implemented on an NVIDIA Tesla V100 GPU. We obtain up to 6.5x speedup compared to 64 CPU threads run on two 2.3 GHz 16-core Intel Xeon Processors E5-2698 v3. In our implementation we focused on the alignment of smaller sequences, as the CONSENT segmentation strategy based on k-mer chaining provides an optimal opportunity to exploit the parallel-processing power of GPUs. To demonstrate this, we have integrated our kernel in the CONSENT software. This accelerated version of CONSENT provides a speedup for the whole error correction step that ranges from 1.95x to 8.5x depending on the input reads.

AKT: Ancestry and Kinship Toolkit

10.1101/047829 ◽

2016 ◽

Author(s):

Rudy Arthur ◽

Ole Schulz-Trieglaff ◽

Anthony J. Cox ◽

Jared Michael O’Connell

Keyword(s):

Data Clustering ◽

State Of The Art ◽

Source Code ◽

Statistical Genetics ◽

Data Sets ◽

Whole Genome ◽

Link Type ◽

Art Methods ◽

Invaluable Tool

Abstract: Ancestry and Kinship Toolkit (AKT) is a statistical genetics tool for analysing large cohorts of whole-genome sequenced samples. It can rapidly detect related samples, characterise sample ancestry, calculate correlation between variants, check Mendel consistency and perform data clustering. AKT brings together the functionality of many state-of-the-art methods, with a focus on speed and a unified interface. We believe it will be an invaluable tool for the curation of large WGS data-sets. Availability: The source code is available at https://illumina.github.io/
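As an illustration of what "correlation between variants" can mean, the sketch below computes the squared Pearson correlation between two variants from per-sample genotype dosages (0, 1, or 2 copies of the alternate allele). This is a common textbook definition and an assumption on our part; the abstract does not say which statistic AKT reports, and the function name `genotype_r2` is hypothetical.

```python
def genotype_r2(g1, g2):
    """Squared Pearson correlation between two genotype dosage vectors.

    g1, g2: equal-length sequences of 0/1/2 dosages, one entry per sample.
    Returns 0.0 when either variant is monomorphic (zero variance).
    """
    if len(g1) != len(g2) or not g1:
        raise ValueError("genotype vectors must be non-empty and of equal length")
    n = len(g1)
    mean1 = sum(g1) / n
    mean2 = sum(g2) / n
    cov = sum((a - mean1) * (b - mean2) for a, b in zip(g1, g2))
    var1 = sum((a - mean1) ** 2 for a in g1)
    var2 = sum((b - mean2) ** 2 for b in g2)
    if var1 == 0 or var2 == 0:
        return 0.0
    return cov * cov / (var1 * var2)


print(genotype_r2([0, 1, 2, 1], [0, 1, 2, 1]))  # 1.0
print(genotype_r2([0, 1, 2], [2, 1, 0]))        # 1.0
print(genotype_r2([0, 0, 0], [0, 1, 2]))        # 0.0
```

Because the statistic is squared, two variants whose alleles are perfectly anti-correlated score the same as two that are perfectly correlated.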

pyconsFold: A fast and easy tool for modelling and docking using distance predictions

10.1101/2021.02.08.430195 ◽

2021 ◽

Author(s):

J Lamb ◽

A Elofsson

Keyword(s):

State Of The Art ◽

Source Code ◽

Link Type ◽

Contact Distance ◽

Protein Dimers ◽

Strong Focus ◽

Distance Distributions ◽

Regular Contact ◽

Contact Predictions ◽

Viable Method

Abstract: Motivation: Contact prediction within a protein has recently become a viable method for accurate prediction of protein structure. Using predicted distance distributions has been shown in many cases to be superior to only using a binary contact annotation. Using predicted inter-protein distances has also been shown to be able to dock some protein dimers. Results: Here we present pyconsFold. Using CNS as its underlying folding mechanism and predicted contact distance, it outperforms regular contact-prediction-based modelling on our dataset of 210 proteins. It performs marginally worse than the state-of-the-art pyRosetta folding pipeline but is on average about 20 times faster per model. More importantly, pyconsFold can also be used as a fold-and-dock protocol by using predicted inter-protein contacts to simultaneously fold and dock two protein chains. Availability and implementation: pyconsFold is implemented in Python 3 with a strong focus on using as few dependencies as possible for longevity. It is available both as a pip package in Python 3 and as source code on GitHub and is published under the GPLv3. Supplementary material: Install instructions, examples and parameters can be found in the supplemental notes. Availability of data: The data underlying this article together with source code are available on GitHub, at https://github.com/johnlamb/pyconsfold.

Solving scaffolding problem with repeats

10.1101/330472 ◽

2018 ◽

Author(s):

Igor Mandric ◽

Alex Zelikovsky

Keyword(s):

Genome Assembly ◽

State Of The Art ◽

High Demand ◽

Computational Approaches ◽

Link Type ◽

Optimization Formulation

Abstract: One of the most important steps in genome assembly is scaffolding. Increasing the length of sequencing reads allows assembling short genomes, but assembly of long repeat-rich genomes remains one of the most interesting and challenging problems in bioinformatics. There is a high demand for computational approaches to repeat-aware scaffolding. In this paper, we propose a novel repeat-aware scaffolder, BATISCAF, based on the optimization formulation for filtering out repeated and short contigs. Our experiments with five benchmarking datasets show that the proposed tool BATISCAF outperforms state-of-the-art tools. BATISCAF is freely available on GitHub: https://github.com/mandricigor/batiscaf.
