Joint variant and de novo mutation identification on pedigrees from high-throughput sequencing data

Mapping Intimacies ◽

10.1101/001958 ◽

2014 ◽

Author(s):

John G Cleary ◽

Ross Braithwaite ◽

Kurt Gaastra ◽

Brian S Hilbush ◽

Stuart Inglis ◽

...

Keyword(s):

High Throughput Sequencing ◽

De Novo ◽

Cost Optimization ◽

Snp Array ◽

Variant Calling ◽

Ground Truth ◽

Next Generation Sequencing Data ◽

Sequencing Data ◽

Mendelian Segregation ◽

High Throughput Sequencing Data

The analysis of whole-genome or exome sequencing data from trios and pedigrees has been successfully applied to the identification of disease-causing mutations. However, most methods used to identify and genotype genetic variants from next-generation sequencing data ignore the relationships between samples, resulting in significant Mendelian errors, false positives and false negatives. Here we present a Bayesian network framework that jointly analyses data from all members of a pedigree using Mendelian segregation priors, while retaining the ability to detect de novo mutations in offspring, and that scales to large pedigrees. We evaluated our method by simulations and analysis of WGS data from a 17-individual, 3-generation CEPH pedigree sequenced to 50X average depth. Compared to singleton calling, our family caller produced more high-quality variants and eliminated spurious calls as judged by common quality metrics such as Ti/Tv, Het/Hom ratios, and dbSNP/SNP array data concordance. We developed a ground truth dataset to further evaluate our calls by identifying recombination cross-overs in the pedigree and testing variants for consistency with the inferred phasing, and we show that our method significantly outperforms singleton and population variant calling in pedigrees. We identify all previously validated de novo mutations in NA12878, concurrent with a 7X precision improvement. Our results show that our method is scalable to large genomics and human disease studies and allows cost optimization by rational sequencing capacity distribution.
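
To make the idea of a Mendelian segregation prior concrete, the following is a minimal sketch for a single biallelic site in one trio. It is not the authors' implementation: the function names, the simple per-allele mutation model, and the example mutation rate `mu` are all assumptions chosen for illustration. Genotypes are coded as the number of alternate alleles (0, 1 or 2).

```python
from itertools import product

def transmit_alt(parent_gt, mu):
    """Probability that a parent passes on the alternate allele.

    parent_gt is the alt-allele count (0, 1, 2). mu is an assumed
    per-transmission probability that the passed allele mutates
    (a toy de novo model, not taken from the paper).
    """
    base = parent_gt / 2.0
    return base * (1.0 - mu) + (1.0 - base) * mu

def child_prior(mother_gt, father_gt, mu=1e-8):
    """Mendelian prior P(child genotype | parental genotypes)."""
    pm = transmit_alt(mother_gt, mu)
    pf = transmit_alt(father_gt, mu)
    return [(1 - pm) * (1 - pf),
            pm * (1 - pf) + (1 - pm) * pf,
            pm * pf]

def trio_posterior(lik_m, lik_f, lik_c, parent_prior, mu=1e-8):
    """Joint posterior over (mother, father, child) genotypes.

    lik_* are per-sample genotype likelihoods [P(reads|0), P(reads|1),
    P(reads|2)]; parent_prior is a population prior over parental
    genotypes. All 27 genotype combinations are enumerated.
    """
    joint = {}
    for m, f, c in product(range(3), repeat=3):
        joint[(m, f, c)] = (parent_prior[m] * lik_m[m]
                            * parent_prior[f] * lik_f[f]
                            * child_prior(m, f, mu)[c] * lik_c[c])
    total = sum(joint.values())
    return {gts: p / total for gts, p in joint.items()}
```

In this sketch, weak evidence for a heterozygous child is overridden when both parents are confidently homozygous reference, because a de novo event is far less likely a priori than a sequencing error. Strong evidence in the child can still overcome the prior, which is what allows de novo calls.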


PathoQC: Computationally Efficient Read Preprocessing and Quality Control for High-Throughput Sequencing Data Sets

Cancer Informatics ◽

10.4137/cin.s13890 ◽

2014 ◽

Vol 13s1 ◽

pp. CIN.S13890 ◽

Cited By ~ 1

Author(s):

Changjin Hong ◽

Solaiappan Manimaran ◽

William Evan Johnson

Keyword(s):

Quality Control ◽

High Throughput ◽

High Performance ◽

High Throughput Sequencing ◽

Next Generation Sequencing Data ◽

Data Sets ◽

Sequencing Data ◽

Computationally Efficient ◽

High Throughput Sequencing Data ◽

Downstream Analysis

Quality control and read preprocessing are critical steps in the analysis of data sets generated from high-throughput genomic screens. In the most extreme cases, improper preprocessing can negatively affect downstream analyses and may lead to incorrect biological conclusions. Here, we present PathoQC, a streamlined toolkit that seamlessly combines the benefits of several popular quality control software approaches for preprocessing next-generation sequencing data. PathoQC provides a variety of quality control options appropriate for most high-throughput sequencing applications. PathoQC is primarily developed as a module in the PathoScope software suite for metagenomic analysis. However, PathoQC is also available as an open-source Python module that can run as a stand-alone application or can be easily integrated into any bioinformatics workflow. PathoQC achieves high performance by supporting parallel computation and is an effective tool that removes technical sequencing artifacts and facilitates robust downstream analysis. The PathoQC software package is available at http://sourceforge.net/projects/PathoScope/ .


DeNovoCNN: A deep learning approach to de novo variant calling in next generation sequencing data

10.1101/2021.09.20.461072 ◽

2021 ◽

Author(s):

Gelana Khazeeva ◽

Karolis Sablauskas ◽

Bart van der Sanden ◽

Wouter Steyaert ◽

Michael Kwint ◽

...

Keyword(s):

Exome Sequencing ◽

De Novo ◽

Genetic Disorders ◽

Variant Calling ◽

Next Generation Sequencing Data ◽

Sequencing Data ◽

Accurate Identification ◽

Whole Exome ◽

De Novo Variant ◽

Generation Sequencing

De novo mutations (DNMs) are an important cause of genetic disorders. The accurate identification of DNMs from sequencing data is therefore fundamental to rare disease research and diagnostics. Unfortunately, identifying reliable DNMs remains a major challenge due to sequence errors, uneven coverage, and mapping artifacts. Here, we developed a deep convolutional neural network (CNN) DNM caller (DeNovoCNN) that encodes alignments of sequence reads for a trio as 160×164 resolution images. DeNovoCNN was trained on DNMs from whole exome sequencing (WES) of 2003 trios, achieving on average 99.2% recall and 93.8% precision. We find that DeNovoCNN has increased recall/sensitivity and precision compared to existing de novo calling approaches (GATK, DeNovoGear, Samtools) based on the Genome in a Bottle reference dataset. Sanger validations of DNMs called in both exome and genome datasets confirm that DeNovoCNN outperforms existing methods. Most importantly, we show that DeNovoCNN is robust against different exome sequencing and analysis approaches, thereby allowing it to be applied to other datasets. DeNovoCNN is freely available and can be run on existing alignment (BAM/CRAM) and variant calling (VCF) files from WES and WGS without a need for variant recalling.
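
For readers unfamiliar with the recall and precision figures quoted above, here is a small sketch of how the two metrics can be computed from a call set and a truth set. The function name and the tuple representation of a variant (chromosome, position, reference allele, alternate allele) are assumptions for illustration and are not part of DeNovoCNN.

```python
def precision_recall(called, truth):
    """Return (precision, recall) for a set of calls against a truth set.

    Each variant is any hashable key, for example a
    (chrom, pos, ref, alt) tuple.
    """
    called, truth = set(called), set(truth)
    true_positives = len(called & truth)
    # Precision: fraction of calls that are real.
    precision = true_positives / len(called) if called else 0.0
    # Recall (sensitivity): fraction of real variants that were called.
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall
```

Usage is a matter of passing two collections of variant keys, for example the calls from one tool and a validated reference set restricted to the same regions.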


Inference of viral quasispecies with a paired de Bruijn graph

Bioinformatics ◽

10.1093/bioinformatics/btaa782 ◽

2020 ◽

Author(s):

Borja Freire ◽

Susana Ladra ◽

Jose R Paramá ◽

Leena Salmela

Keyword(s):

High Throughput Sequencing ◽

De Novo ◽

Supplementary Information ◽

De Bruijn Graph ◽

Viral Quasispecies ◽

Sequencing Data ◽

De Bruijn Graphs ◽

Sequencing Errors ◽

High Throughput Sequencing Data ◽

De Bruijn

Motivation: RNA viruses exhibit a high mutation rate and thus they exist in infected cells as a population of closely related strains called viral quasispecies. The viral quasispecies assembly problem asks to characterize the quasispecies present in a sample from high-throughput sequencing data. We study the de novo version of the problem, where reference sequences of the quasispecies are not available. Current methods for assembling viral quasispecies are either based on overlap graphs or on de Bruijn graphs. Overlap graph-based methods tend to be accurate but slow, whereas de Bruijn graph-based methods are fast but less accurate. Results: We present viaDBG, which is a fast and accurate de Bruijn graph-based tool for de novo assembly of viral quasispecies. We first iteratively correct sequencing errors in the reads, which allows us to use large k-mers in the de Bruijn graph. To incorporate the paired-end information in the graph, we also adapt the paired de Bruijn graph for viral quasispecies assembly. These features enable the use of long-range information in contig construction without compromising the speed of de Bruijn graph-based approaches. Our experimental results show that viaDBG is both accurate and fast, whereas previous methods are either fast or accurate but not both. In particular, viaDBG has comparable or better accuracy than SAVAGE, while being at least nine times faster. Furthermore, the speed of viaDBG is comparable to PEHaplo but viaDBG is able to retrieve also low abundance quasispecies, which are often missed by PEHaplo. Availability and implementation: viaDBG is implemented in C++ and it is publicly available at https://bitbucket.org/bfreirec1/viadbg. All datasets used in this article are publicly available at https://bitbucket.org/bfreirec1/data-viadbg/. Supplementary information: Supplementary data are available at Bioinformatics online.
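
As background for the graph structure mentioned above, the sketch below builds a plain (unpaired) de Bruijn graph from a list of reads. It is only an illustration of the basic data structure: it does not perform error correction or use paired-end information, so it is not the paired de Bruijn graph used by viaDBG, and the function name is an assumption.

```python
from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Build a de Bruijn graph from reads.

    Nodes are (k-1)-mers; each k-mer in a read adds an edge from its
    prefix (first k-1 bases) to its suffix (last k-1 bases).
    Reads shorter than k contribute no edges.
    """
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)
```

A larger `k` makes nodes more specific and reduces spurious branching between similar strains, but it also makes the graph more sensitive to sequencing errors, which is why error correction before graph construction matters for large k-mers.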


De Novo Assembly of High-Throughput Sequencing Data with Cloud Computing and New Operations on String Graphs

2012 IEEE Fifth International Conference on Cloud Computing ◽

10.1109/cloud.2012.123 ◽

2012 ◽

Cited By ~ 5

Author(s):

Yu-Jung Chang ◽

Chien-Chih Chen ◽

Jan-Ming Ho ◽

Chuen-Liang Chen

Keyword(s):

Cloud Computing ◽

High Throughput ◽

De Novo Assembly ◽

High Throughput Sequencing ◽

De Novo ◽

Sequencing Data ◽

String Graphs ◽

High Throughput Sequencing Data


Benchmarking Variant Identification Tools for Plant Diversity Discovery

10.21203/rs.2.9666/v2 ◽

2019 ◽

Author(s):

Xing Wu ◽

Christopher Heffelfinger ◽

Hongyu Zhao ◽

Stephen L. Dellaporta

Keyword(s):

Next Generation Sequencing ◽

High Throughput Sequencing ◽

Crop Improvement ◽

Variant Calling ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Sequencing Data ◽

Variant Discovery ◽

Variant Filtering ◽

Generation Sequencing

Background: The ability to accurately and comprehensively identify genomic variations is critical for plant studies utilizing high-throughput sequencing. Most bioinformatics tools for processing next-generation sequencing data were originally developed and tested in human studies, raising questions as to their efficacy for plant research. A detailed evaluation of the entire variant calling pipeline, including alignment, variant calling, variant filtering, and imputation, was performed on different programs using both simulated and real plant genomic datasets. Results: A comparison of SOAP2, Bowtie2, and BWA-MEM found that BWA-MEM was consistently able to align the most reads with high accuracy, whereas Bowtie2 had the highest overall accuracy. Comparative results of GATK HaplotypeCaller versus SAMtools mpileup indicated that the choice of variant caller affected precision and recall differentially depending on the levels of diversity, sequence coverage and genome complexity. A cross-reference experiment of S. lycopersicum and S. pennellii reference genomes revealed the inadequacy of a single reference genome for variant discovery that includes distantly related plant individuals. A machine-learning-based variant filtering strategy outperformed the traditional hard-cutoff strategy, resulting in a higher number of true positive variants and fewer false positive variants. A 2-step imputation method, which utilized a set of high-confidence SNPs as the reference panel, showed up to 60% higher accuracy than direct LD-based imputation. Conclusions: Programs in the variant discovery pipeline perform differently on plant genomic datasets. The choice of programs depends on the goal of the study and the available resources. This study serves as important guidance for plant biologists utilizing next-generation sequencing data for diversity characterization and crop improvement.


ANGSD-wrapper: utilities for analyzing next generation sequencing data

10.7287/peerj.preprints.1472 ◽

2016 ◽

Author(s):

Arun Durvasula ◽

Paul J Hoffman ◽

Tyler V Kent ◽

Chaochih Liu ◽

Thomas J Y Kono ◽

...

Keyword(s):

High Throughput ◽

High Throughput Sequencing ◽

Molecular Ecology ◽

Principal Component ◽

Next Generation Sequencing Data ◽

Sequencing Data ◽

Genome Data ◽

High Throughput Sequencing Data ◽

Genome Wide ◽

User Friendly

High throughput sequencing has changed many aspects of population genetics, molecular ecology, and related fields, affecting both experimental design and data analysis. The software package ANGSD allows users to perform a number of population genetic analyses on high-throughput sequencing data. ANGSD uses probabilistic approaches to calculate genome-wide descriptive statistics. The package makes use of genotype likelihood estimates rather than SNP calls and is specifically designed to produce more accurate results for samples with low sequencing depth. ANGSD makes use of full genome data while handling a wide array of sampling and experimental designs. Here we present ANGSD-wrapper, a set of wrapper scripts that provide a user-friendly interface for running ANGSD and visualizing results. ANGSD-wrapper supports multiple types of analyses including estimates of nucleotide sequence diversity, neutrality tests, principal component analysis, estimation of admixture proportions for individual samples, and calculation of statistics that quantify recent introgression. ANGSD-wrapper also provides interactive graphing of ANGSD results to enhance data exploration. We demonstrate the usefulness of ANGSD-wrapper by analyzing resequencing data from populations of wild and domesticated Zea. ANGSD-wrapper is freely available from https://github.com/mojaveazure/angsd-wrapper.


Genome-wide profiling of heritable and de novo STR variations

10.1101/077727 ◽

2016 ◽

Cited By ~ 7

Author(s):

Thomas Willems ◽

Dina Zielinski ◽

Assaf Gordon ◽

Melissa Gymrek ◽

Yaniv Erlich

Keyword(s):

Tandem Repeats ◽

High Throughput Sequencing ◽

De Novo ◽

Genetic Diseases ◽

Whole Genome Sequencing Data ◽

Sequencing Data ◽

High Throughput Sequencing Data ◽

Genome Wide ◽

A Genome ◽

Short Tandem

Short tandem repeats (STRs) are highly variable elements that play a pivotal role in multiple genetic diseases, population genetics applications, and forensic casework. However, STRs have proven problematic to genotype from high-throughput sequencing data. Here, we describe HipSTR, a novel haplotype-based method for robustly genotyping, haplotyping, and phasing STRs from whole genome sequencing data and report a genome-wide analysis and validation of de novo STR mutations.


MitoFlex: an efficient, high-performance toolkit for animal mitogenome assembly, annotation, and visualization

Bioinformatics ◽

10.1093/bioinformatics/btab111 ◽

2021 ◽

Author(s):

Jun-Yu Li ◽

Wei-Xuan Li ◽

An-Tai Wang ◽

Zhang Yu

Keyword(s):

Mitochondrial Genome ◽

High Performance ◽

High Throughput Sequencing ◽

De Novo ◽

Supplementary Information ◽

Sequencing Data ◽

Protein Coding ◽

High Throughput Sequencing Data ◽

Genome Analysis Toolkit ◽

Overall Performance

Summary: MitoFlex is a Linux-based mitochondrial genome analysis toolkit, which provides a complete workflow of raw data filtering, de novo assembly, mitochondrial genome identification and annotation for animal high throughput sequencing data. The overall performance was compared between MitoFlex and its analogue MitoZ, in terms of protein coding gene recovery, memory consumption and processing speed. Availability: MitoFlex is available at https://github.com/Prunoideae/MitoFlex under GPLv3 license. Supplementary information: Supplementary data are available at Bioinformatics online.


iSVP: an integrated structural variant calling pipeline from high-throughput sequencing data

BMC Systems Biology ◽

10.1186/1752-0509-7-s6-s8 ◽

2013 ◽

Vol 7 (Suppl 6) ◽

pp. S8 ◽

Cited By ~ 21

Author(s):

Takahiro Mimori ◽

Naoki Nariai ◽

Kaname Kojima ◽

Mamoru Takahashi ◽

Akira Ono ◽

...

Keyword(s):

High Throughput ◽

High Throughput Sequencing ◽

Variant Calling ◽

Sequencing Data ◽

Structural Variant ◽

High Throughput Sequencing Data


Accuracy and Reproducibility of Somatic Point Mutation Calling in Clinical-Type Targeted Sequencing Data

10.1101/2019.12.31.891952 ◽

2019 ◽

Author(s):

Ali Karimnezhad ◽

Gareth A. Palidwor ◽

Kednapa Thavorn ◽

David J. Stewart ◽

Pearl A. Campbell ◽

...

Keyword(s):

False Positive ◽

High Throughput Sequencing ◽

Variant Calling ◽

High Sensitivity ◽

Ground Truth ◽

Targeted Sequencing ◽

Sequencing Data ◽

Sequencing Platform ◽

Ion Torrent Pgm ◽

Fresh Frozen

Background: Treating cancer depends in part on identifying the mutations driving each patient’s disease. Many clinical laboratories are adopting high-throughput sequencing for assaying patients’ tumours, applying targeted panels to formalin-fixed paraffin-embedded tumour tissues to detect clinically-relevant mutations. While there have been some benchmarking and best practices studies of this scenario, much variant-calling work focuses on whole-genome or whole-exome studies, with fresh or fresh-frozen tissue. Thus, definitive guidance on best choices for sequencing platforms, sequencing strategies, and variant calling for clinical variant detection is still being developed. Results: Because ground truth for clinical specimens is rarely known, we used the well-characterized Coriell cell lines GM12878 and GM12877 to generate data. We prepared samples to mimic as closely as possible clinical biopsies, including formalin fixation and paraffin embedding. We evaluated two well-known targeted sequencing panels, Illumina’s TruSight 170 panel and the Oncomine Focus panel. Sequencing was performed on an Illumina NextSeq500 and an Ion Torrent PGM, respectively. We performed multiple biological replicates of each assay, to test reproducibility. Finally, we applied five different public and freely-available somatic single-nucleotide variant (SNV) callers to the data: MuTect2, SAMtools, VarScan2, Pisces and VarDict. Although the TruSight 170 and Oncomine Focus panels cover different amounts of the genome, we did not observe major differences in variant calling success within the regions that each covers. We observed substantial discrepancies between the five variant callers. All had high sensitivity, detecting known SNVs, but highly varying and non-overlapping false positive detections. Harmonizing variant caller parameters or intersecting the results of multiple variant callers reduced disagreements. However, intersecting results from biological replicates was even better at eliminating false positives. Conclusions: Reproducibility and accuracy of targeted clinical sequencing results depend less on sequencing platform and panel than on downstream bioinformatics and biological variability. Differences in variant callers’ default parameters are a greater influence on algorithm disagreement than other differences between the algorithms. Contrary to typical clinical practice, we recommend analyzing replicate samples, as this greatly decreases false positive calls.
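
The intersection strategy described in this abstract can be sketched as follows. This is an illustrative helper, not code from the study: the function name, the `min_support` parameter, and the representation of a variant as a (chrom, pos, ref, alt) tuple are assumptions. Each input set may hold the calls from one variant caller or from one biological replicate.

```python
from collections import Counter

def consensus_calls(call_sets, min_support=None):
    """Keep variants reported in at least min_support of the call sets.

    With min_support left as None, a variant must appear in every
    call set, which is a strict intersection.
    """
    call_sets = [set(s) for s in call_sets]
    if min_support is None:
        min_support = len(call_sets)
    # Count in how many call sets each variant appears.
    support = Counter(v for s in call_sets for v in s)
    return {v for v, n in support.items() if n >= min_support}
```

Lowering `min_support` trades some false positive removal for sensitivity, which may be a reasonable middle ground when callers or replicates differ in coverage.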
