Comparison of Read Mapping and Variant Calling Tools for the Analysis of Plant NGS Data

Plants ◽  
2020 ◽  
Vol 9 (4) ◽  
pp. 439 ◽  
Author(s):  
Hanna Marie Schilbert ◽  
Andreas Rempel ◽  
Boas Pucker

High-throughput sequencing technologies have developed rapidly over the past years and have become an essential tool in plant sciences. However, the analysis of genomic data remains challenging and relies mostly on the performance of automatic pipelines. Frequently applied pipelines involve the alignment of sequence reads against a reference sequence and the identification of sequence variants. Since most benchmarking studies of bioinformatics tools for this purpose have been conducted on human datasets, there is a lack of benchmarking studies in plant sciences. In this study, we evaluated the performance of 50 different variant calling pipelines, including five read mappers and ten variant callers, on six real plant datasets of the model organism Arabidopsis thaliana. Sets of variants were evaluated based on various parameters including sensitivity and specificity. We found that all investigated tools are suitable for the analysis of NGS data in plant research. When looking at different performance metrics, BWA-MEM and Novoalign were the best mappers and GATK returned the best results in the variant calling step.
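
The sensitivity/specificity evaluation described above can be sketched as a simple set comparison between a pipeline's variant calls and a gold-standard call set (the variant tuples below are illustrative, not from the study):

```python
# Sketch: scoring a pipeline's variant calls against a gold-standard set,
# as done when benchmarking mapper/caller combinations.
def evaluate_calls(called, truth):
    """Return sensitivity and precision for a set of called variants.

    `called` and `truth` are sets of (chrom, pos, ref, alt) tuples.
    """
    tp = len(called & truth)  # true positives: correct calls
    fp = len(called - truth)  # false positives: spurious calls
    fn = len(truth - called)  # false negatives: missed variants
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return sensitivity, precision

truth = {("Chr1", 100, "A", "G"), ("Chr1", 250, "T", "C"), ("Chr2", 42, "G", "A")}
called = {("Chr1", 100, "A", "G"), ("Chr2", 42, "G", "A"), ("Chr3", 7, "C", "T")}
sens, prec = evaluate_calls(called, truth)
# sens = 2/3, prec = 2/3
```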


2019 ◽  
Vol 3 ◽  
Author(s):  
Vasselon Valentin ◽  
Rimet Frédéric ◽  
Domaizon Isabelle ◽  
Monnier Olivier ◽  
Reyjol Yorick ◽  
...  

Ecological status assessment of watercourses is based on the calculation of quality indices using the pollution sensitivity of targeted biological groups, including diatoms. The determination and quantification of diatom species is generally based on microscopic morphological identification, which requires expertise and is time-consuming and costly. In Europe, this morphological approach is legally imposed by standards and regulatory decrees under the Water Framework Directive (WFD). Over the past decade, a DNA-based molecular biology approach has been developed to identify species based on genetic criteria rather than morphological ones (i.e. DNA metabarcoding). In combination with high-throughput sequencing technologies, metabarcoding makes it possible both to identify all species present in an environmental sample and to process several hundred samples in parallel. This article presents the results of two recent studies carried out on the WFD networks of rivers of Mayotte (2013–2018) and metropolitan France (2016–2018). These studies aimed at testing the potential application of metabarcoding for biomonitoring in the context of the WFD. We discuss the various methodological developments and optimisations that have been made to make the taxonomic inventories of diatoms produced by metabarcoding more reliable, particularly in terms of species quantification. We present the results of the application of this DNA approach on more than 500 river sites, comparing them with those obtained using the standardised morphological method. Finally, we discuss the potential of metabarcoding for routine application, its limits of application, and propose some recommendations for future implementation under the WFD.


Genome ◽  
2020 ◽  
Author(s):  
Tasnim H. Beacon ◽ 
James R. Davie

The chicken model organism has advanced the areas of developmental biology, virology, immunology, oncology, epigenetic regulation of gene expression, conservation biology, and genomics of domestication. Further, the chicken model organism has aided our understanding of human disease. Through recent advances in high-throughput sequencing and bioinformatic tools, researchers have successfully identified sequences in the chicken genome that have human orthologs, improving mammalian genome annotation. In this review, we highlight the importance of the chicken as an animal model in basic and pre-clinical research. We present the importance of the chicken in poultry epigenetics and in genomic studies tracing back to the last common ancestor shared by humans and chickens in the tree of life. There are still many genes of unknown function in the chicken genome yet to be characterized. By taking advantage of recent sequencing technologies, it is possible to gain further insight into the chicken epigenome.


2013 ◽  
Vol 368 (1626) ◽  
pp. 20120504 ◽  
Author(s):  
Gkikas Magiorkinis ◽  
Robert Belshaw ◽  
Aris Katzourakis

Almost 8% of the human genome comprises endogenous retroviruses (ERVs). While they have been shown to cause specific pathologies in animals, such as cancer, their association with disease in humans remains controversial. The limited evidence is partly due to the physical and bioethical restrictions surrounding the study of transposons in humans, coupled with the major experimental and bioinformatics challenges surrounding the association of ERVs with disease in general. Two biotechnological landmarks of the past decade provide us with unprecedented research artillery: (i) the ultra-fine sequencing of the human genome and (ii) the emergence of high-throughput sequencing technologies. Here, we critically assemble research about potential pathologies of ERVs in humans. We argue that the time is right to revisit the long-standing questions of human ERV pathogenesis within a robust and carefully structured framework that makes full use of genomic sequence data. We also pose two thought-provoking research questions on potential pathophysiological roles of ERVs with respect to immune escape and regulation.


2018 ◽  
Author(s):  
Eric Olivier Audemard ◽  
Patrick Gendron ◽  
Vincent-Philippe Lavallée ◽  
Josée Hébert ◽  
Guy Sauvageau ◽  
...  

Abstract Mutations identified in each Acute Myeloid Leukemia (AML) patient are useful for prognosis and for selecting targeted therapies. Detection of such mutations by the analysis of Next-Generation Sequencing (NGS) data requires a computationally intensive read mapping step and the application of several variant calling methods. Targeted mutation identification drastically shifts the usual tradeoff between accuracy and performance by concentrating all computations on a small portion of sequence space. Here, we present km, an efficient approach leveraging k-mer decomposition of reads to identify targeted mutations. Our approach is versatile, as it can detect single-base mutations, several types of insertions and deletions, as well as fusions. We used two independent AML cohorts (The Cancer Genome Atlas and Leucegene) to show that mutation detection by km is fast, accurate and mainly limited by sequencing depth. Therefore, km enables fast diagnostics from NGS data and could be suitable for clinical applications.
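
The k-mer decomposition idea can be illustrated with a minimal sketch: count k-mers from the reads, then check that every k-mer along a target (mutated) sequence is supported (sequences and the `min_support` helper below are made up for illustration; the real km tool is considerably more sophisticated):

```python
# Sketch of k-mer-based targeted mutation detection: a target sequence is
# considered supported if every one of its k-mers occurs in the reads.
from collections import Counter

def kmer_counts(reads, k):
    """Count all k-mers occurring in the given reads."""
    counts = Counter()
    for r in reads:
        for i in range(len(r) - k + 1):
            counts[r[i:i + k]] += 1
    return counts

def min_support(target, counts, k):
    """Smallest k-mer count along the target; > 0 means full k-mer coverage."""
    return min(counts[target[i:i + k]] for i in range(len(target) - k + 1))

k = 4
reads = ["ACGTTGCA", "CGTTGCAT", "GTTGCATT"]
mutant = "ACGTTGCATT"  # hypothetical target sequence carrying the variant
assert min_support(mutant, kmer_counts(reads, k), k) >= 1  # fully covered
```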


BMC Genomics ◽  
2020 ◽  
Vol 21 (S6) ◽  
Author(s):  
Chi-Ming Leung ◽  
Dinghua Li ◽  
Yan Xin ◽  
Wai-Chun Law ◽  
Yifan Zhang ◽  
...  

Abstract Background Next-generation sequencing (NGS) enables unbiased detection of pathogens by mapping the sequencing reads of a patient sample to the known reference sequences of bacteria and viruses. However, for a new pathogen without a reference sequence of a close relative, or with a high load of mutations compared to its predecessors, read mapping fails due to a low similarity between the pathogen and reference sequence, which in turn leads to insensitive and inaccurate pathogen detection outcomes. Results We developed MegaPath, which runs fast and provides high sensitivity in detecting new pathogens. In MegaPath, we have implemented and tested a combination of polishing techniques to remove non-informative human reads and spurious alignments. MegaPath applies a global optimization to the read alignments and reassigns the reads incorrectly aligned to multiple species to a unique species. The reassignment not only significantly increased the number of reads aligned to distant pathogens, but also significantly reduced incorrect alignments. MegaPath implements an enhanced maximum-exact-match prefix seeding strategy and a SIMD-accelerated Smith-Waterman algorithm to run fast. Conclusions In our benchmarks, MegaPath demonstrated superior sensitivity by detecting eight times more reads from a low-similarity pathogen than other tools. Meanwhile, MegaPath ran much faster than the other state-of-the-art alignment-based pathogen detection tools (and comparably to the less sensitive profile-based pathogen detection tools). The running time of MegaPath is about 20 min on a typical 1 Gb dataset.
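
The read-reassignment step can be illustrated with a deliberately simplified stand-in for MegaPath's global optimization: reads that align equally well to several species are handed to the species best supported by unambiguously mapped reads (species names and the greedy tie-breaking rule below are illustrative):

```python
# Sketch: reassign multi-mapped reads to a unique species using the
# support of uniquely mapped reads as evidence.
from collections import Counter

def reassign(alignments):
    """alignments: dict mapping read_id -> set of candidate species."""
    unique = Counter()
    for species in alignments.values():
        if len(species) == 1:
            unique[next(iter(species))] += 1
    assigned = {}
    for read, species in alignments.items():
        # pick the candidate with the strongest unique-read support;
        # ties are broken alphabetically for determinism
        assigned[read] = max(species, key=lambda s: (unique[s], s))
    return assigned

aln = {"r1": {"E_coli"}, "r2": {"E_coli"}, "r3": {"E_coli", "Shigella"}}
assert reassign(aln)["r3"] == "E_coli"  # ambiguous read follows the evidence
```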


2017 ◽  
Author(s):  
Julian Garneau ◽  
Florence Depardieu ◽  
Louis-Charles Fortier ◽  
David Bikard ◽  
Marc Monot

ABSTRACT Bacteriophages are the most abundant viruses on earth and display an impressive genetic as well as morphologic diversity. Among those, the most common order of phages is the Caudovirales, whose viral particles package linear double-stranded DNA (dsDNA). In this study we investigated how the information gathered by high-throughput sequencing technologies can be used to determine the DNA termini and packaging mechanisms of dsDNA phages. The wet-lab procedures traditionally used for this purpose rely on the identification and cloning of restriction fragments, which can be delicate and cumbersome. Here, we developed a theoretical and statistical framework to analyze DNA termini and phage packaging mechanisms using next-generation sequencing data. Our methods, implemented in the PhageTerm software, work with sequencing reads in fastq format and the corresponding assembled phage genome. PhageTerm was validated on a set of phages with well-established packaging mechanisms representative of the termini diversity: 5’ cos (lambda), 3’ cos (HK97), pac (P1), headful without a pac site (T4), DTR (T7) and host fragment (Mu). In addition, we determined the termini of 9 Clostridium difficile phages and 6 phages whose sequences were retrieved from the Sequence Read Archive (SRA). A direct graphical interface is available as a Galaxy wrapper version at https://galaxy.pasteur.fr and a standalone version is accessible at https://sourceforge.net/projects/phageterm/.
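
The statistical signal exploited here can be sketched simply: fixed genome termini produce a pile-up of read start positions, so comparing each position's start count to the genome-wide average flags candidate termini (the ratio threshold and data below are illustrative, not PhageTerm's actual statistic):

```python
# Sketch: detect candidate phage termini as positions where the read-start
# count greatly exceeds the uniform expectation.
from collections import Counter

def candidate_termini(read_starts, genome_len, ratio=10.0):
    """Return positions whose start-count is >= ratio times the mean."""
    starts = Counter(read_starts)
    mean = len(read_starts) / genome_len  # expected starts per position
    return [pos for pos, n in starts.items() if n / mean >= ratio]

genome_len = 100
# 200 reads starting uniformly, plus 50 extra reads all starting at position 0
read_starts = [i % genome_len for i in range(200)] + [0] * 50
assert candidate_termini(read_starts, genome_len) == [0]  # terminus at pos 0
```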


2021 ◽  
Vol 22 (21) ◽  
pp. 12031 ◽ 
Author(s):  
Filipe Cortes-Figueiredo ◽  
Filipa S. Carvalho ◽  
Ana Catarina Fonseca ◽  
Friedemann Paul ◽  
José M. Ferro ◽  
...  

Despite a multitude of methods for the sample preparation, sequencing, and data analysis of mitochondrial DNA (mtDNA), the demand for innovation remains, particularly in comparison with nuclear DNA (nDNA) research. The Applied Biosystems™ Precision ID mtDNA Whole Genome Panel (Thermo Fisher Scientific, USA) is an innovative library preparation kit suitable for degraded samples and low DNA input. However, its bioinformatic processing occurs in the enterprise Ion Torrent Suite™ Software (TSS), yielding BAM files aligned to an unorthodox version of the revised Cambridge Reference Sequence (rCRS), with a heteroplasmy threshold level of 10%. Here, we present an alternative customizable pipeline, the PrecisionCallerPipeline (PCP), for processing samples with the correct rCRS output after Ion Torrent sequencing with the Precision ID library kit. Using 18 samples (3 original samples and 15 mixtures) derived from the 1000 Genomes Project, we achieved overall improved performance metrics in comparison with the proprietary TSS, with optimal performance at a 2.5% heteroplasmy threshold. We further validated our findings with 50 samples from an ongoing independent cohort of stroke patients: PCP found 98.31% of TSS's variants (TSS found 57.92% of PCP's variants), with a significant correlation between the levels of variants found by both pipelines.
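
The heteroplasmy-threshold filtering discussed above amounts to keeping only variants whose alternate-allele fraction meets a cutoff, e.g. 2.5% (the study's optimum) rather than the 10% default (a minimal sketch; variant values are made up):

```python
# Sketch: filter mtDNA variant calls by heteroplasmy level, i.e. the
# fraction of reads supporting the alternate allele at a position.
def filter_heteroplasmy(variants, threshold=0.025):
    """variants: list of (pos, ref, alt, alt_fraction) tuples."""
    return [v for v in variants if v[3] >= threshold]

calls = [(73, "A", "G", 0.98),   # near-homoplasmic variant, kept
         (150, "C", "T", 0.04),  # low-level heteroplasmy above 2.5%, kept
         (300, "G", "A", 0.01)]  # below threshold, filtered out
kept = filter_heteroplasmy(calls)
assert [v[0] for v in kept] == [73, 150]
```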


2016 ◽  
Vol 58 (3) ◽  
Author(s):  
Steve Hoffmann ◽  
Peter F. Stadler

Abstract The Read Mapping problem asks for the exact origin of a nucleotide sequence in a reference genome. It translates to a conceptually simple approximate string matching problem. The practical difficulty, however, arises from the typical size of the data sets produced by modern high-throughput sequencing technologies, from the biological processes involved in the derivation of the query molecule from its genomic source, and from the technical processes of the sequencing technology itself.
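
Stripped of indexing and indel handling, the approximate string matching core of read mapping fits in a few lines (a naive Hamming-distance scan, nothing like a production mapper, purely to make the problem statement concrete):

```python
# Sketch: find all positions where a read matches the reference within a
# given mismatch budget (substitutions only, no indels).
def approximate_matches(read, reference, max_mismatches=1):
    hits = []
    for i in range(len(reference) - len(read) + 1):
        window = reference[i:i + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            hits.append(i)
    return hits

ref = "ACGTACGTTACG"
assert approximate_matches("ACGT", ref, 0) == [0, 4]  # two exact hits
assert approximate_matches("ACGA", ref, 1) == [0, 4]  # one mismatch allowed
```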


2018 ◽  
Vol 35 (12) ◽  
pp. 2066-2074 ◽  
Author(s):  
Yuansheng Liu ◽  
Zuguo Yu ◽  
Marcel E Dinger ◽  
Jinyan Li

Abstract Motivation Advanced high-throughput sequencing technologies have produced massive amounts of read data, and algorithms have been specially designed to contract the size of these datasets for efficient storage and transmission. Reordering reads with regard to their positions in de novo assembled contigs or in explicit reference sequences has been proven to be one of the most effective read compression approaches. As there is usually no good prior knowledge about the reference sequence, the current focus is on the novel construction of de novo assembled contigs. Results We introduce a new de novo compression algorithm named minicom. This algorithm uses large k-minimizers to index the reads and subgroup those that have the same minimizer. Within each subgroup, a contig is constructed. Then some pairs of the contigs derived from the subgroups are merged into longer contigs according to a (w, k)-minimizer-indexed suffix–prefix overlap similarity between two contigs. This merging process is repeated after the longer contigs are formed until no pair of contigs can be merged. We compare the performance of minicom with two reference-based methods and four de novo methods on 18 datasets (13 RNA-seq datasets and 5 whole genome sequencing datasets). In the compression of single-end reads, minicom obtained the smallest file size for 22 of 34 cases with significant improvement. In the compression of paired-end reads, minicom achieved 20–80% compression gain over the best state-of-the-art algorithm. Our method also achieved a 10% size reduction of compressed files in comparison with the best algorithm under the reads-order preserving mode. These excellent performances are mainly attributed to the exploitation of the redundancy of repetitive substrings in the long contigs. Availability and implementation https://github.com/yuansliu/minicom Supplementary information Supplementary data are available at Bioinformatics online.
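
The minimizer-based subgrouping step can be sketched as bucketing each read by its lexicographically smallest k-mer (a minimal illustration; minicom's actual minimizer scheme and the subsequent contig construction are more involved):

```python
# Sketch: group reads by their k-minimizer so that reads sharing repeated
# substrings land in the same subgroup, ready for per-group compression.
from collections import defaultdict

def minimizer(read, k):
    """Lexicographically smallest k-mer of a read."""
    return min(read[i:i + k] for i in range(len(read) - k + 1))

def group_reads(reads, k):
    groups = defaultdict(list)
    for r in reads:
        groups[minimizer(r, k)].append(r)
    return dict(groups)

reads = ["GGACGTAA", "TTACGTGG", "CCCCGGGG"]
groups = group_reads(reads, 4)
# the first two reads share the minimizer "ACGT" and fall in one subgroup
assert groups["ACGT"] == ["GGACGTAA", "TTACGTGG"]
```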

