MALVA: genotyping by Mapping-free ALlele detection of known VAriants

2019 ◽  
Author(s):  
Giulia Bernardini ◽  
Paola Bonizzoni ◽  
Luca Denti ◽  
Marco Previtali ◽  
Alexander Schönhuth

Abstract The amount of genetic variation discovered and characterized in human populations is huge, and is growing rapidly with the widespread availability of modern sequencing technologies. Such a great deal of variation data, which accounts for human diversity, leads to various challenging computational tasks, including variant calling and genotyping of newly sequenced individuals. The standard pipelines for addressing these problems include read mapping, which is a computationally expensive procedure. A few mapping-free tools were proposed in recent years to speed up the genotyping process. While such tools have highly efficient run-times, they focus on isolated, bi-allelic SNPs, providing limited support for multi-allelic SNPs, indels, and genomic regions with high variant density. To address these issues, we introduce MALVA, a fast and lightweight mapping-free method to genotype an individual directly from a sample of reads. MALVA is the first mapping-free tool that is able to genotype multi-allelic SNPs and indels, even in high density genomic regions, and to effectively handle a huge number of variants such as those provided by the 1000 Genomes Project. An experimental evaluation on whole-genome data shows that MALVA requires one order of magnitude less time to genotype a donor than alignment-based pipelines, providing similar accuracy. Remarkably, on indels, MALVA provides even better results than the most widely adopted variant discovery tools.

Author(s):  
Pierre Morisse ◽  
Claire Lemaitre ◽  
Fabrice Legeai

Abstract Motivation Linked-Reads technologies combine the high quality and low cost of short-read sequencing with long-range information, through the use of barcodes tagging reads which originate from a common long DNA molecule. This technology has been employed in a broad range of applications including genome assembly, phasing and scaffolding, as well as structural variant calling. However, to date, no tool or API dedicated to the manipulation of Linked-Reads data exists. Results We introduce LRez, a C++ API and toolkit which allows easy management of Linked-Reads data. LRez includes various functionalities, for computing numbers of common barcodes between genomic regions, extracting barcodes from BAM files, as well as indexing and querying BAM, FASTQ and gzipped FASTQ files to quickly fetch all reads or alignments containing a given barcode. LRez is compatible with a wide range of Linked-Reads sequencing technologies, and can thus be used in any tool or pipeline requiring barcode processing or indexing, in order to improve its performance. Availability and implementation LRez is implemented in C++, supported on Unix-based platforms, and available under AGPL-3.0 License at https://github.com/morispi/LRez, and as a bioconda module. Supplementary information Supplementary data are available at Bioinformatics Advances.
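The two barcode operations the abstract names (fetching all reads that carry a given barcode, and counting barcodes shared between two genomic regions) can be illustrated with a minimal Python sketch. This is not LRez's API or file format: the function names `index_by_barcode` and `common_barcodes`, and the in-memory `(barcode, read_name)` pairs, are assumptions made purely for illustration.

```python
from collections import defaultdict


def index_by_barcode(records):
    """Build a barcode -> list of read names index.

    `records` is an iterable of (barcode, read_name) pairs; this input
    shape is a hypothetical stand-in for parsed BAM/FASTQ records.
    """
    index = defaultdict(list)
    for barcode, read_name in records:
        index[barcode].append(read_name)
    return index


def common_barcodes(region_a_barcodes, region_b_barcodes):
    """Count the distinct barcodes observed in both regions."""
    return len(set(region_a_barcodes) & set(region_b_barcodes))


# Toy data: three reads, two of which share barcode "AAAC".
records = [("AAAC", "read1"), ("GGTT", "read2"), ("AAAC", "read3")]
index = index_by_barcode(records)
print(index["AAAC"])
print(common_barcodes(["AAAC", "GGTT"], ["AAAC", "CCCA"]))
```

A dictionary keyed by barcode makes the lookup of all reads for one barcode a single hash access, which is the general idea behind indexing by barcode; a real tool would store file offsets rather than read names.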


2015 ◽  
Author(s):  
Justin M Zook ◽  
David Catoe ◽  
Jennifer McDaniel ◽  
Lindsay Vang ◽  
Noah Spies ◽  
...  

The Genome in a Bottle Consortium, hosted by the National Institute of Standards and Technology (NIST), is creating reference materials and data for human genome sequencing, as well as methods for genome comparison and benchmarking. Here, we describe a large, diverse set of sequencing data for seven human genomes; five are current or candidate NIST Reference Materials. The pilot genome, NA12878, has been released as NIST RM 8398. We also describe data from two Personal Genome Project trios, one of Ashkenazim Jewish ancestry and one of Chinese ancestry. The data come from 12 technologies: BioNano Genomics, Complete Genomics paired-end and LFR, Ion Proton exome, Oxford Nanopore, Pacific Biosciences, SOLiD, 10X Genomics GemCode™ WGS, and Illumina exome and WGS paired-end, mate-pair, and synthetic long reads. Cell lines, DNA, and data from these individuals are publicly available. Therefore, we expect these data to be useful for revealing novel information about the human genome and improving sequencing technologies, SNP, indel, and structural variant calling, and de novo assembly.


PeerJ ◽  
2020 ◽  
Vol 8 ◽  
pp. e9338
Author(s):  
Robert Edgar

Mapping of reads to reference sequences is an essential step in a wide range of biological studies. The large size of datasets generated with next-generation sequencing technologies motivates the development of fast mapping software. Here, I describe URMAP, a new read mapping algorithm. URMAP is an order of magnitude faster than BWA with comparable accuracy on several validation tests. On a Genome in a Bottle (GIAB) variant calling test with 30× coverage 2×150 reads, URMAP achieves high accuracy (precision 0.998, sensitivity 0.982 and F-measure 0.990) with the strelka2 caller. However, GIAB reference variants are shown to be biased against repetitive regions which are difficult to map and may therefore pose an unrealistically easy challenge to read mappers and variant callers.
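As a quick consistency check on the figures quoted above, the F-measure can be recomputed from the reported precision and sensitivity. The sketch below assumes the F-measure here is the usual F1 score, i.e. the harmonic mean of the two; the function name `f_measure` is illustrative.

```python
def f_measure(precision: float, sensitivity: float) -> float:
    """Harmonic mean of precision and sensitivity (the F1 score)."""
    return 2 * precision * sensitivity / (precision + sensitivity)


# Values reported in the abstract for the GIAB variant calling test.
print(f"{f_measure(0.998, 0.982):.3f}")  # prints 0.990
```

Under that assumption the result rounds to the reported 0.990.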


2021 ◽  
Vol 0 (0) ◽  
Author(s):  
Maurilio Monsu ◽  
Matteo Comin

Abstract Sequencing technologies have provided the basis of most modern genome sequencing studies due to their high base-level accuracy and relatively low cost. One of the most demanding steps is mapping reads to the human reference genome. The reliance on a single reference human genome could introduce substantial biases in downstream analyses. Pangenomic graph reference representations offer an attractive approach for storing genetic variations. Moreover, it is possible to include known variants in the reference in order to make read mapping, variant calling, and genotyping variant-aware. Only recently, a framework for variation graphs, vg [Garrison E, Adam MN, Siren J, et al. Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nat Biotechnol 2018;36:875–9], has improved variation-aware alignment and variant calling in general. The major bottleneck of vg is the high cost of mapping reads to a variation graph. In this paper we study the problem of SNP calling on a variation graph and we present a fast read alignment tool, named VG SNP-Aware. VG SNP-Aware is able to align reads exactly to a variation graph and detect SNPs based on these aligned reads. The results show that VG SNP-Aware can efficiently map reads to a variation graph with a speedup of 40× with respect to vg and similar accuracy on SNP detection.


2021 ◽  
Vol 12 ◽  
Author(s):  
Zachary Stephens ◽  
Dragana Milosevic ◽  
Benjamin Kipp ◽  
Stefan Grebe ◽  
Ravishankar K. Iyer ◽  
...  

Long read sequencing technologies have the potential to accurately detect and phase variation in genomic regions that are difficult to fully characterize with conventional short read methods. These difficult-to-sequence regions include several clinically relevant genes with highly homologous pseudogenes, many of which are prone to gene conversions or other types of complex structural rearrangements. We present PB-Motif, a new method for identifying rearrangements between two highly homologous genomic regions using PacBio long reads. PB-Motif leverages clustering and filtering techniques to efficiently report rearrangements in the presence of sequencing errors and other systematic artifacts. Supporting reads for each high-confidence rearrangement can then be used for copy number estimation and phased variant calling. First, we demonstrate PB-Motif's accuracy with simulated sequence rearrangements of PMS2 and its pseudogene PMS2CL using simulated reads sweeping over a range of sequencing error rates. We then apply PB-Motif to 26 clinical samples, characterizing CYP21A2 and its pseudogene CYP21A1P as part of a diagnostic assay for congenital adrenal hyperplasia. We successfully identify damaging variation and patient carrier status concordant with clinical diagnosis obtained from multiplex ligation-dependent probe amplification (MLPA) and Sanger sequencing. The source code is available at: github.com/zstephens/pb-motif.


Author(s):  
Nathan D. Olson ◽  
Justin Wagner ◽  
Jennifer McDaniel ◽  
Sarah H. Stephens ◽  
Samuel T. Westreich ◽  
...  

Summary The precisionFDA Truth Challenge V2 aimed to assess the state-of-the-art of variant calling in difficult-to-map regions and the Major Histocompatibility Complex (MHC). Starting with FASTQ files, 20 challenge participants applied their variant calling pipelines and submitted 64 variant callsets for one or more sequencing technologies (~35X Illumina, ~35X PacBio HiFi, and ~50X Oxford Nanopore Technologies). Submissions were evaluated following best practices for benchmarking small variants with the new GIAB benchmark sets and genome stratifications. Challenge submissions included a number of innovative methods for all three technologies, with graph-based and machine-learning methods scoring best for short-read and long-read datasets, respectively. New methods out-performed the 2016 Truth Challenge winners, and new machine-learning approaches combining multiple sequencing technologies performed particularly well. Recent developments in sequencing and variant calling have enabled benchmarking variants in challenging genomic regions, paving the way for the identification of previously unknown clinically relevant variants.


2020 ◽  
Vol 89 (3) ◽  
pp. e467
Author(s):  
Dawid Leciej ◽  
Karl-Heinz Herzig ◽  
Olaf Thalmann

Humans are constantly exposed to health risks inherent to the environment in which they live, including those posed by non-human fauna. Zoonoses are infectious diseases caused by agents such as bacteria, parasites, or viruses that are transmitted to humans from wild animals and livestock. The close proximity of animals and humans facilitates the spread of zoonoses, so it is intriguing to hypothesize that populations accustomed to different lifestyles will also vary in the prevalence of zoonotic agents. The Neolithic era in human history is characterised by a dramatic transition in lifestyle, from hunting and gathering to farming. Thus, as the reservoir of animal species changed, humans were exposed to zoonotic agents that could potentially penetrate human populations. Due to the rapid development of sequencing technologies and methodology in ancient DNA research, it is now possible to generate complete genomes of ancient specimens and pinpoint those genomic regions or epigenetic signatures that might be influenced by past zoonotic transmissions. Unravelling such traces, particularly on a population scale, will help to overcome the lack of generalisation that hampered previous research focusing exclusively on the model fossils in human evolution, and facilitate a better understanding of the aetiology of diseases, including those caused by zoonotic agents.


2021 ◽  
Author(s):  
Parsoa Khorsand ◽  
Fereydoun Hormozdiari

Abstract Large-scale catalogs of common genetic variants (including indels and structural variants) are being created using data from second and third generation whole-genome sequencing technologies. However, the genotyping of these variants in newly sequenced samples is a nontrivial task that requires extensive computational resources. Furthermore, current approaches are mostly limited to specific types of variants and are generally prone to various errors and ambiguities when genotyping complex events. We propose an ultra-efficient approach for genotyping any type of structural variation that is not limited by the shortcomings and complexities of current mapping-based approaches. Our method, Nebula, utilizes the changes in the count of k-mers to predict the genotype of structural variants. We have shown that Nebula is not only an order of magnitude faster than mapping-based approaches for genotyping structural variants, but also has comparable accuracy to state-of-the-art approaches. Furthermore, Nebula is a generic framework not limited to any specific type of event. Nebula is publicly available at https://github.com/Parsoa/Nebula.
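The general idea of predicting a genotype from k-mer counts, without mapping reads, can be illustrated with a toy Python sketch. This is not Nebula's algorithm: the function names, the sets of reference-specific and alternate-specific k-mers, and the 0.2/0.8 thresholds are all assumptions chosen only to make the idea concrete.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurring in a collection of read strings."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def call_genotype(counts, ref_kmers, alt_kmers, low=0.2, high=0.8):
    """Call a diploid genotype from k-mer support for each allele.

    ref_kmers / alt_kmers are k-mers assumed to be unique to the
    reference and alternate allele; low / high are illustrative
    thresholds on the fraction of alternate-allele support.
    """
    ref_support = sum(counts[kmer] for kmer in ref_kmers)
    alt_support = sum(counts[kmer] for kmer in alt_kmers)
    total = ref_support + alt_support
    if total == 0:
        return "./."  # no informative k-mers observed
    alt_fraction = alt_support / total
    if alt_fraction < low:
        return "0/0"
    if alt_fraction > high:
        return "1/1"
    return "0/1"


# Two reads support the reference k-mer "CGT", one supports "CCT".
counts = count_kmers(["ACGTA", "ACGTA", "ACCTA"], k=3)
print(call_genotype(counts, ref_kmers={"CGT"}, alt_kmers={"CCT"}))  # prints 0/1
```

Because only k-mer counting is needed, the per-sample cost is a single pass over the reads, which is the intuition behind the speed advantage over mapping-based genotyping.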


2021 ◽  
Vol 3 (1) ◽  
Author(s):  
Julen Mendieta-Esteban ◽  
Marco Di Stefano ◽  
David Castillo ◽  
Irene Farabella ◽  
Marc A Marti-Renom

Abstract Chromosome conformation capture (3C) technologies measure the interaction frequency between pairs of chromatin regions within the nucleus in a cell or a population of cells. Some of these 3C technologies retrieve interactions involving non-contiguous sets of loci, resulting in sparse interaction matrices. One such 3C technology is Promoter Capture Hi-C (pcHi-C), which is tailored to probe only interactions involving gene promoters. As such, pcHi-C provides sparse interaction matrices that are suitable to characterize short- and long-range enhancer–promoter interactions. Here, we introduce a new method to reconstruct the chromatin structural (3D) organization from sparse 3C-based datasets such as pcHi-C. Our method allows for data normalization, detection of significant interactions and reconstruction of the full 3D organization of the genomic region despite the data sparseness. Specifically, it builds, with as little as 2–3% of the data from the matrix, reliable 3D models of similar accuracy to those based on dense interaction matrices. Furthermore, the method is sensitive enough to detect cell-type-specific 3D organizational features such as the formation of different networks of active gene communities.

