MALVA: genotyping by Mapping-free ALlele detection of known VAriants

Mapping Intimacies ◽

10.1101/575126 ◽

2019 ◽

Cited By ~ 1

Author(s):

Giulia Bernardini ◽

Paola Bonizzoni ◽

Luca Denti ◽

Marco Previtali ◽

Alexander Schönhuth

Keyword(s):

Variant Calling ◽

Genome Project ◽

Human Populations ◽

Genome Data ◽

Variant Discovery ◽

Sequencing Technologies ◽

Order Of Magnitude ◽

Speed Up ◽

Similar Accuracy ◽

Genomic Regions

Abstract The amount of genetic variation discovered and characterized in human populations is huge, and is growing rapidly with the widespread availability of modern sequencing technologies. Such a great deal of variation data, which accounts for human diversity, leads to various challenging computational tasks, including variant calling and genotyping of newly sequenced individuals. The standard pipelines for addressing these problems include read mapping, which is a computationally expensive procedure. A few mapping-free tools were proposed in recent years to speed up the genotyping process. While such tools have highly efficient run-times, they focus on isolated, bi-allelic SNPs, providing limited support for multi-allelic SNPs, indels, and genomic regions with high variant density. To address these issues, we introduce MALVA, a fast and lightweight mapping-free method to genotype an individual directly from a sample of reads. MALVA is the first mapping-free tool that is able to genotype multi-allelic SNPs and indels, even in high-density genomic regions, and to effectively handle a huge number of variants such as those provided by the 1000 Genomes Project. An experimental evaluation on whole-genome data shows that MALVA requires one order of magnitude less time to genotype a donor than alignment-based pipelines, providing similar accuracy. Remarkably, on indels, MALVA provides even better results than the most widely adopted variant discovery tools.


LRez: C++ API and toolkit for analyzing and managing Linked-Reads data

Bioinformatics Advances ◽

10.1093/bioadv/vbab022 ◽

2021 ◽

Author(s):

Pierre Morisse ◽

Claire Lemaitre ◽

Fabrice Legeai

Keyword(s):

Genome Assembly ◽

Low Cost ◽

Variant Calling ◽

Supplementary Information ◽

Supplementary Data ◽

High Quality ◽

Dna Molecule ◽

Sequencing Technologies ◽

Wide Range ◽

Genomic Regions

Abstract Motivation: Linked-Reads technologies combine the high quality and low cost of short-read sequencing with long-range information, through the use of barcodes tagging reads which originate from a common long DNA molecule. This technology has been employed in a broad range of applications including genome assembly, phasing and scaffolding, as well as structural variant calling. However, to date, no tool or API dedicated to the manipulation of Linked-Reads data exists. Results: We introduce LRez, a C++ API and toolkit which allows easy management of Linked-Reads data. LRez includes various functionalities for computing numbers of common barcodes between genomic regions, extracting barcodes from BAM files, as well as indexing and querying BAM, FASTQ and gzipped FASTQ files to quickly fetch all reads or alignments containing a given barcode. LRez is compatible with a wide range of Linked-Reads sequencing technologies, and can thus be used in any tool or pipeline requiring barcode processing or indexing, in order to improve its performance. Availability and implementation: LRez is implemented in C++, supported on Unix-based platforms, and available under AGPL-3.0 License at https://github.com/morispi/LRez, and as a bioconda module. Supplementary information: Supplementary data are available at Bioinformatics Advances.
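
The two barcode operations named in the abstract (fetching all reads that carry a given barcode, and counting the barcodes shared by two genomic regions) can be illustrated with a minimal sketch. This is not the LRez API: the function names `build_barcode_index` and `common_barcodes` and the toy data are hypothetical, and Python is used only for brevity.

```python
from collections import defaultdict


def build_barcode_index(reads):
    """Map each barcode to the names of the reads that carry it.

    `reads` is an iterable of (read_name, barcode) pairs (hypothetical input).
    """
    index = defaultdict(list)
    for name, barcode in reads:
        index[barcode].append(name)
    return index


def common_barcodes(region_a, region_b):
    """Count the distinct barcodes observed in both regions."""
    return len(set(region_a) & set(region_b))


# Hypothetical toy data: (read name, barcode) pairs.
reads = [("r1", "AAAC"), ("r2", "AAAC"), ("r3", "GGTT"), ("r4", "CCAG")]
index = build_barcode_index(reads)

# All reads tagged with barcode AAAC, fetched by one dictionary lookup.
reads_for_aaac = index["AAAC"]

# Barcodes seen in two hypothetical regions; duplicates count once.
shared = common_barcodes(["AAAC", "GGTT", "GGTT"], ["GGTT", "CCAG"])
```

A high number of shared barcodes between two regions suggests that reads from both originate from the same long molecules, which is the signal that scaffolding and structural variant tools built on Linked-Reads typically exploit.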


Extensive sequencing of seven human genomes to characterize benchmark reference materials

10.1101/026468 ◽

2015 ◽

Cited By ~ 9

Author(s):

Justin M Zook ◽

David Catoe ◽

Jennifer McDaniel ◽

Lindsay Vang ◽

Noah Spies ◽

...

Keyword(s):

Human Genome ◽

Reference Materials ◽

De Novo ◽

Variant Calling ◽

Genome Project ◽

Genome Comparison ◽

Personal Genome ◽

Sequencing Data ◽

Sequencing Technologies ◽

Human Genomes

The Genome in a Bottle Consortium, hosted by the National Institute of Standards and Technology (NIST), is creating reference materials and data for human genome sequencing, as well as methods for genome comparison and benchmarking. Here, we describe a large, diverse set of sequencing data for seven human genomes; five are current or candidate NIST Reference Materials. The pilot genome, NA12878, has been released as NIST RM 8398. We also describe data from two Personal Genome Project trios, one of Ashkenazim Jewish ancestry and one of Chinese ancestry. The data come from 12 technologies: BioNano Genomics, Complete Genomics paired-end and LFR, Ion Proton exome, Oxford Nanopore, Pacific Biosciences, SOLiD, 10X Genomics GemCode™ WGS, and Illumina exome and WGS paired-end, mate-pair, and synthetic long reads. Cell lines, DNA, and data from these individuals are publicly available. Therefore, we expect these data to be useful for revealing novel information about the human genome and improving sequencing technologies, SNP, indel, and structural variant calling, and de novo assembly.


URMAP, an ultra-fast read mapper

PeerJ ◽

10.7717/peerj.9338 ◽

2020 ◽

Vol 8 ◽

pp. e9338

Author(s):

Robert Edgar

Keyword(s):

Variant Calling ◽

Mapping Algorithm ◽

Sequencing Technologies ◽

Mapping Software ◽

A Genome ◽

Biological Studies ◽

Wide Range ◽

Order Of Magnitude ◽

Comparable Accuracy ◽

Validation Tests

Mapping of reads to reference sequences is an essential step in a wide range of biological studies. The large size of datasets generated with next-generation sequencing technologies motivates the development of fast mapping software. Here, I describe URMAP, a new read mapping algorithm. URMAP is an order of magnitude faster than BWA with comparable accuracy on several validation tests. On a Genome in a Bottle (GIAB) variant calling test with 30× coverage 2×150 reads, URMAP achieves high accuracy (precision 0.998, sensitivity 0.982 and F-measure 0.990) with the strelka2 caller. However, GIAB reference variants are shown to be biased against repetitive regions which are difficult to map and may therefore pose an unrealistically easy challenge to read mappers and variant callers.
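
The reported F-measure can be checked against the reported precision and sensitivity, assuming the usual definition of the F-measure as their harmonic mean, F = 2PS / (P + S). The function name `f_measure` below is illustrative and not taken from URMAP.

```python
def f_measure(precision, sensitivity):
    """Harmonic mean of precision and sensitivity (the F1 score)."""
    return 2 * precision * sensitivity / (precision + sensitivity)


# Values quoted in the abstract for URMAP with the strelka2 caller.
f = f_measure(0.998, 0.982)
print(f"{f:.3f}")  # prints 0.990
```

The result agrees with the F-measure of 0.990 quoted in the abstract, so the three figures are mutually consistent.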


Fast alignment of reads to a variation graph with application to SNP detection

Journal of Integrative Bioinformatics ◽

10.1515/jib-2021-0032 ◽

2021 ◽

Vol 0 (0) ◽

Author(s):

Maurilio Monsu ◽

Matteo Comin

Keyword(s):

Low Cost ◽

Variant Calling ◽

Read Mapping ◽

Base Level ◽

Sequencing Technologies ◽

Alignment Tool ◽

Sequencing Studies ◽

Major Bottleneck ◽

High Base ◽

Similar Accuracy

Abstract Sequencing technologies have provided the basis of most modern genome sequencing studies due to their high base-level accuracy and relatively low cost. One of the most demanding steps is mapping reads to the human reference genome. The reliance on a single reference human genome could introduce substantial biases in downstream analyses. Pangenomic graph reference representations offer an attractive approach for storing genetic variations. Moreover, it is possible to include known variants in the reference in order to make read mapping, variant calling, and genotyping variant-aware. Only recently has a framework for variation graphs, vg [Garrison E, Adam MN, Siren J, et al. Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nat Biotechnol 2018;36:875–9], improved variation-aware alignment and variant calling in general. The major bottleneck of vg is the high cost of mapping reads to a variation graph. In this paper we study the problem of SNP calling on a variation graph and we present a fast read alignment tool, named VG SNP-Aware. VG SNP-Aware is able to align reads exactly to a variation graph and detect SNPs based on these aligned reads. The results show that VG SNP-Aware can efficiently map reads to a variation graph with a speedup of 40× with respect to vg and similar accuracy on SNP detection.


PB-Motif—A Method for Identifying Gene/Pseudogene Rearrangements With Long Reads: An Application to CYP21A2 Genotyping

Frontiers in Genetics ◽

10.3389/fgene.2021.716586 ◽

2021 ◽

Vol 12 ◽

Author(s):

Zachary Stephens ◽

Dragana Milosevic ◽

Benjamin Kipp ◽

Stefan Grebe ◽

Ravishankar K. Iyer ◽

...

Keyword(s):

Phase Variation ◽

Variant Calling ◽

Error Rates ◽

Clinical Samples ◽

Sequencing Error ◽

Carrier Status ◽

Sequencing Errors ◽

Sequencing Technologies ◽

Long Reads ◽

Genomic Regions

Long-read sequencing technologies have the potential to accurately detect and phase variation in genomic regions that are difficult to fully characterize with conventional short-read methods. These difficult-to-sequence regions include several clinically relevant genes with highly homologous pseudogenes, many of which are prone to gene conversions or other types of complex structural rearrangements. We present PB-Motif, a new method for identifying rearrangements between two highly homologous genomic regions using PacBio long reads. PB-Motif leverages clustering and filtering techniques to efficiently report rearrangements in the presence of sequencing errors and other systematic artifacts. Supporting reads for each high-confidence rearrangement can then be used for copy number estimation and phased variant calling. First, we demonstrate PB-Motif's accuracy with simulated sequence rearrangements of PMS2 and its pseudogene PMS2CL using simulated reads sweeping over a range of sequencing error rates. We then apply PB-Motif to 26 clinical samples, characterizing CYP21A2 and its pseudogene CYP21A1P as part of a diagnostic assay for congenital adrenal hyperplasia. We successfully identify damaging variation and patient carrier status concordant with clinical diagnosis obtained from multiplex ligation-dependent amplification (MLPA) and Sanger sequencing. The source code is available at: github.com/zstephens/pb-motif.


precisionFDA Truth Challenge V2: Calling variants from short- and long-reads in difficult-to-map regions

10.1101/2020.11.13.380741 ◽

2020 ◽

Cited By ~ 2

Author(s):

Nathan D. Olson ◽

Justin Wagner ◽

Jennifer McDaniel ◽

Sarah H. Stephens ◽

Samuel T. Westreich ◽

...

Keyword(s):

Machine Learning ◽

Variant Calling ◽

Learning Approaches ◽

Sequencing Technologies ◽

Innovative Methods ◽

Long Reads ◽

Oxford Nanopore ◽

Long Read ◽

Recent Developments ◽

Genomic Regions

Summary The precisionFDA Truth Challenge V2 aimed to assess the state-of-the-art of variant calling in difficult-to-map regions and the Major Histocompatibility Complex (MHC). Starting with FASTQ files, 20 challenge participants applied their variant calling pipelines and submitted 64 variant callsets for one or more sequencing technologies (~35X Illumina, ~35X PacBio HiFi, and ~50X Oxford Nanopore Technologies). Submissions were evaluated following best practices for benchmarking small variants with the new GIAB benchmark sets and genome stratifications. Challenge submissions included a number of innovative methods for all three technologies, with graph-based and machine-learning methods scoring best for short-read and long-read datasets, respectively. New methods out-performed the 2016 Truth Challenge winners, and new machine-learning approaches combining multiple sequencing technologies performed particularly well. Recent developments in sequencing and variant calling have enabled benchmarking variants in challenging genomic regions, paving the way for the identification of previously unknown clinically relevant variants.


Zoonoses and their traces in ancient genomes – a possible indicator for ancient life-style changes?

Journal of Medical Science ◽

10.20883/medical.e467 ◽

2020 ◽

Vol 89 (3) ◽

pp. e467

Author(s):

Dawid Leciej ◽

Karl-Heinz Herzig ◽

Olaf Thalmann

Keyword(s):

Health Risks ◽

Animal Species ◽

Rapid Development ◽

Human Populations ◽

Sequencing Technologies ◽

Hunting And Gathering ◽

Life Style Changes ◽

Close Proximity ◽

Genomic Regions ◽

Epigenetic Signatures

Humans are constantly exposed to health risks inherent to the environment in which they live, including those posed by non-human fauna. Zoonoses are infectious diseases caused by agents such as bacteria, parasites, or viruses being transmitted to humans from wild animals and livestock. The close proximity of animals and humans facilitates the spread of zoonoses, so it is intriguing to hypothesize that populations accustomed to different lifestyles will also vary in the prevalence of zoonotic agents. The Neolithic era in human history is characterised by a dramatic transition in lifestyle, from hunting and gathering to farming. Thus, with the changes in the reservoir of animal species, humans were exposed to zoonotic agents potentially penetrating human populations. Due to the rapid development of sequencing technologies and methodology in ancient DNA research, it is now possible to generate complete genomes of ancient specimens and pinpoint those genomic regions or epigenetic signatures that might be influenced by past zoonotic transmissions. Unravelling such traces, particularly on a population scale, will help to overcome the lack of generalisation that hampered previous research focusing exclusively on the model fossils in human evolution, and facilitate a better understanding of the aetiology of diseases, including those caused by zoonotic agents.


Nebula: ultra-efficient mapping-free structural variant genotyper

Nucleic Acids Research ◽

10.1093/nar/gkab025 ◽

2021 ◽

Author(s):

Parsoa Khorsand ◽

Fereydoun Hormozdiari

Keyword(s):

Large Scale ◽

Structural Variants ◽

Sequencing Technologies ◽

Generic Framework ◽

Common Genetic Variants ◽

Order Of Magnitude ◽

Complex Events ◽

Comparable Accuracy ◽

Using Data ◽

Computational Resources

Abstract Large-scale catalogs of common genetic variants (including indels and structural variants) are being created using data from second- and third-generation whole-genome sequencing technologies. However, the genotyping of these variants in newly sequenced samples is a nontrivial task that requires extensive computational resources. Furthermore, current approaches are mostly limited to specific types of variants and are generally prone to various errors and ambiguities when genotyping complex events. We propose an ultra-efficient approach for genotyping any type of structural variation that is not limited by the shortcomings and complexities of current mapping-based approaches. Our method, Nebula, utilizes the changes in the count of k-mers to predict the genotype of structural variants. We have shown that not only is Nebula an order of magnitude faster than mapping-based approaches for genotyping structural variants, but it also has comparable accuracy to state-of-the-art approaches. Furthermore, Nebula is a generic framework not limited to any specific type of event. Nebula is publicly available at https://github.com/Parsoa/Nebula.
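
The idea of predicting a genotype from k-mer counts, without mapping reads, can be illustrated with a toy sketch. It is not Nebula's actual model, which the abstract does not describe. The signature k-mers, the 0.2 and 0.8 thresholds, and the function names below are all hypothetical choices made for illustration.

```python
from collections import Counter


def kmers(seq, k):
    """Return all overlapping substrings of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def genotype(reads, ref_kmers, alt_kmers, k):
    """Call a genotype from counts of allele-specific k-mers in raw reads.

    ref_kmers and alt_kmers are k-mers assumed to occur only on the
    reference or the alternative allele. Thresholds are illustrative.
    """
    counts = Counter(km for read in reads for km in kmers(read, k))
    ref = sum(counts[km] for km in ref_kmers)
    alt = sum(counts[km] for km in alt_kmers)
    if ref + alt == 0:
        return "./."  # no informative k-mer observed
    alt_fraction = alt / (ref + alt)
    if alt_fraction < 0.2:
        return "0/0"
    if alt_fraction > 0.8:
        return "1/1"
    return "0/1"


# Toy example: "ACGT" marks the reference allele and "ACTT" the variant.
het_call = genotype(["TTACGTAA", "GGACTTCC"], {"ACGT"}, {"ACTT"}, 4)
hom_call = genotype(["GGACTTCC", "AACTTG"], {"ACGT"}, {"ACTT"}, 4)
```

In the first call one read supports each allele, so the sketch reports a heterozygous genotype; in the second only the variant k-mer is seen, so it reports homozygous alternative. A real genotyper would also have to account for sequencing errors, coverage, and k-mers that are not unique in the genome.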


Unbiased machine learning methods to predict the limitations of variant calling in homologous genomic regions using next-generation sequencing

Molecular Genetics and Metabolism ◽

10.1016/s1096-7192(21)00467-4 ◽

2021 ◽

Vol 132 ◽

pp. S250-S252

Author(s):

Feng Li ◽

Rohan Gnanaolivu ◽

Noemi Vidal-Folch ◽

Neiladri Saha ◽

Nipun Mistry ◽

...

Keyword(s):

Machine Learning ◽

Next Generation Sequencing ◽

Variant Calling ◽

Next Generation ◽

Learning Methods ◽

Machine Learning Methods ◽

Genomic Regions ◽

Generation Sequencing


3D reconstruction of genomic regions from sparse interaction data

NAR Genomics and Bioinformatics ◽

10.1093/nargab/lqab017 ◽

2021 ◽

Vol 3 (1) ◽

Author(s):

Julen Mendieta-Esteban ◽

Marco Di Stefano ◽

David Castillo ◽

Irene Farabella ◽

Marc A Marti-Renom

Keyword(s):

3D Models ◽

Genomic Region ◽

Gene Promoters ◽

Chromosome Conformation ◽

The Matrix ◽

Chromatin Structural ◽

A Cell ◽

Similar Accuracy ◽

Genomic Regions ◽

Interaction Matrices

Abstract Chromosome conformation capture (3C) technologies measure the interaction frequency between pairs of chromatin regions within the nucleus in a cell or a population of cells. Some of these 3C technologies retrieve interactions involving non-contiguous sets of loci, resulting in sparse interaction matrices. One such 3C technology is Promoter Capture Hi-C (pcHi-C), which is tailored to probe only interactions involving gene promoters. As such, pcHi-C provides sparse interaction matrices that are suitable to characterize short- and long-range enhancer–promoter interactions. Here, we introduce a new method to reconstruct the chromatin structural (3D) organization from sparse 3C-based datasets such as pcHi-C. Our method allows for data normalization, detection of significant interactions and reconstruction of the full 3D organization of the genomic region despite the data sparseness. Specifically, it builds, with as little as 2–3% of the data from the matrix, reliable 3D models with accuracy similar to that of models based on dense interaction matrices. Furthermore, the method is sensitive enough to detect cell-type-specific 3D organizational features such as the formation of different networks of active gene communities.
