Performance Assessment of Variant Calling Pipelines using Human Whole Exome Sequencing and Simulated data

10.1101/359109 ◽

2018 ◽

Author(s):

Manojkumar Kumaran ◽

Umadevi Subramanian ◽

Bharanidharan Devarajan

Keyword(s):

Exome Sequencing ◽

Whole Exome Sequencing ◽

Reference Genome ◽

Variant Calling ◽

Simulated Data ◽

Variant Call ◽

Human Reference Genome ◽

Indel Detection ◽

Whole Exome ◽

Clinical Variants

Whole exome sequencing (WES) is a time-consuming technology for identifying clinical variants, and it demands accurate variant calling tools. Currently available tools compromise on accuracy when predicting specific types of variants. It is therefore important to identify the best combinations of aligner and variant caller for detecting SNVs and InDels separately. Moreover, many important aspects of InDel detection are overlooked when comparing the performance of tools; one such aspect is the detection of InDels with respect to base-pair length. To assess the performance of variant callers (especially for InDels) in combination with different aligners, 20 automated pipelines were developed and evaluated using the gold reference variant dataset (NA12878) for human whole exome sequencing from the Genome in a Bottle (GiaB) consortium. Additionally, simulated exome data from two human reference genome sequences (GRCh37 and GRCh38) were used to compare the performance of the pipelines. By analyzing various performance metrics, we observed that the BWA and Novoalign aligners performed better with the DeepVariant and SAMtools callers for detecting SNVs, and with DeepVariant and GATK for InDels. Altogether, DeepVariant with BWA and Novoalign performed best. Further, we showed that merging the top-performing pipelines yielded a more accurate variant call set. Collectively, this study should help investigators to effectively improve sensitivity and accuracy in detecting specific variants.
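The abstract does not say which performance metrics were computed or how calls were matched to the GiaB truth set. As a minimal sketch, assuming exact matching on chromosome, position, REF and ALT, and assuming precision, sensitivity and F1 as the metrics, a per-type comparison might look like the following. The names `Variant`, `score` and `score_by_type` are hypothetical and not taken from the paper.

```python
from typing import Dict, NamedTuple, Set


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str


def is_indel(v: Variant) -> bool:
    # Alleles of unequal length are treated as InDels; equal-length
    # alleles are counted as SNVs in this sketch.
    return len(v.ref) != len(v.alt)


def indel_length(v: Variant) -> int:
    # Size of the insertion or deletion in base pairs (0 for an SNV).
    return abs(len(v.alt) - len(v.ref))


def score(calls: Set[Variant], truth: Set[Variant]) -> Dict[str, float]:
    # Assumption: a call is a true positive only on an exact match.
    tp = len(calls & truth)
    fp = len(calls - truth)
    fn = len(truth - calls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {"precision": precision, "sensitivity": sensitivity, "f1": f1}


def score_by_type(calls: Set[Variant], truth: Set[Variant]) -> Dict[str, Dict[str, float]]:
    # Score SNVs and InDels separately, as the abstract recommends.
    return {
        "SNV": score({v for v in calls if not is_indel(v)},
                     {v for v in truth if not is_indel(v)}),
        "InDel": score({v for v in calls if is_indel(v)},
                       {v for v in truth if is_indel(v)}),
    }
```

Exact matching ignores differences in how the same variant can be represented, so a real benchmark would normally normalize variants first. Grouping the InDel sets by `indel_length` before calling `score` would give the length-stratified view that the abstract highlights.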

Performance assessment of variant calling pipelines using human whole exome sequencing and simulated data

BMC Bioinformatics ◽

10.1186/s12859-019-2928-9 ◽

2019 ◽

Vol 20 (1) ◽

Cited By ~ 6

Author(s):

Manojkumar Kumaran ◽

Umadevi Subramanian ◽

Bharanidharan Devarajan

Keyword(s):

Performance Assessment ◽

Exome Sequencing ◽

Whole Exome Sequencing ◽

Variant Calling ◽

Simulated Data ◽

Whole Exome

Multiple Variant Calling Pipelines in Wheat Whole Exome Sequencing

International Journal of Molecular Sciences ◽

10.3390/ijms221910400 ◽

2021 ◽

Vol 22 (19) ◽

pp. 10400

Author(s):

H. Busra Cagirici ◽

Bala Ani Akpinar ◽

Taner Z. Sen ◽

Hikmet Budak

Keyword(s):

Exome Sequencing ◽

Whole Exome Sequencing ◽

Variant Calling ◽

High Sensitivity ◽

Whole Exome ◽

The Impact ◽

Multiple Reference ◽

Reference Genomes ◽

Selection Of ◽

Variant Identification

The highly challenging hexaploid wheat (Triticum aestivum) genome is becoming ever more accessible due to the continued development of multiple reference genomes, a factor which aids the effort to better understand variation in important traits. Although the process of variant calling is relatively straightforward, selecting the best combination of computational tools for the read alignment and variant calling stages of the analysis, and efficiently filtering false variant calls, are not always easy tasks. Previous studies have analyzed the impact of methods on the quality metrics in diploid organisms. Given that variant identification in wheat largely relies on accurate mining of exome data, there is a critical need to better understand how different methods affect the analysis of whole exome sequencing (WES) data in polyploid species. This study aims to address this by performing whole exome sequencing of 48 wheat cultivars and assessing the performance of various variant calling pipelines at their suggested settings. The results show that all the pipelines require filtering to eliminate false-positive calls. The high consensus among the reference SNPs called by the best-performing pipelines suggests that filtering provides accurate and reproducible results. This study also provides detailed comparisons for high sensitivity and precision at individual and population levels for the raw and filtered SNP calls.

Novel metrics to measure coverage in whole exome sequencing datasets reveal local and global non-uniformity

10.1101/051888 ◽

2016 ◽

Author(s):

Qingyu Wang ◽

Cooduvalli S. Shashikant ◽

Matthew Jensen ◽

Naomi S. Altman ◽

Santhosh Girirajan

Keyword(s):

Exome Sequencing ◽

Whole Exome Sequencing ◽

Gc Content ◽

Variant Calling ◽

Segmental Duplications ◽

Repeat Elements ◽

Whole Exome ◽

Detailed Assessment ◽

Major Shortcoming ◽

Low Coverage

Whole Exome Sequencing (WES) is a powerful clinical diagnostic tool for discovering the genetic basis of many diseases. A major shortcoming of WES is uneven coverage of sequence reads over the exome targets, contributing to many low coverage regions, which hinders accurate variant calling. In this study, we devised two novel metrics, Cohort Coverage Sparseness (CCS) and Unevenness (UE) Scores, for a detailed assessment of the distribution of coverage of sequence reads. Employing these metrics, we revealed non-uniformity of coverage and low coverage regions in the WES data generated by three different platforms. This non-uniformity of coverage is both local (coverage of a given exon across different platforms) and global (coverage of all exons across the genome in the given platform). The low coverage regions encompassing functionally important genes were often associated with high GC content, repeat elements and segmental duplications. While a majority of the problems associated with WES are due to the limitations of the capture methods, further refinements in WES technologies have the potential to enhance its clinical applications.

MSI-WES: a simple approach for microsatellite instability testing using whole exome sequencing

Future Oncology ◽

10.2217/fon-2021-0132 ◽

2021 ◽

Author(s):

Henry O Ebili ◽

Adedeji OJ Agboola ◽

Emad Rakha

Keyword(s):

Next Generation Sequencing ◽

Microsatellite Instability ◽

Exome Sequencing ◽

Whole Exome Sequencing ◽

Methylation Status ◽

Molecular Targets ◽

Next Generation ◽

Variant Call ◽

Whole Exome ◽

Generation Sequencing

Aim: To demonstrate that MSI-WES is an accurate testing method for microsatellite instability (MSI). Materials & methods: Microsatellite-based indels were counted in the variant call format (VCF) whole exome sequencing (WES) data of 441 gastric cancer cases using Unix-based algorithms, and the counts were expressed as a fraction of the genome sequenced to obtain next-generation sequencing-based MSI indices. Results: The next-generation sequencing-based MSI indices showed a near-perfect concordance with PCR-based MSI status, and moderate to good correlations with the molecular targets of MSI index, MLH1 expression and MLH1 methylation status, at a level comparable to the strengths of correlation between PCR-based MSI status and molecular targets of MSI index/MLH1 expression and methylation. Conclusion: MSI-WES is a valid, adequate and sensitive approach for testing MSI in cancer.
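The counting step described above can be illustrated with a short sketch. It is not the authors' Unix-based implementation. It assumes that an indel counts as microsatellite-based when its position falls inside a supplied microsatellite interval, and that the index is reported as indels per megabase sequenced. Both choices, and the names `iter_indels`, `count_microsatellite_indels` and `msi_index`, are hypothetical.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# chrom -> list of 1-based inclusive (start, end) microsatellite intervals
Regions = Dict[str, List[Tuple[int, int]]]


def iter_indels(vcf_lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Yield (chrom, pos) for VCF records with at least one indel allele."""
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref = fields[0], int(fields[1]), fields[3]
        # Ignore missing, spanning-deletion and symbolic ALT alleles.
        alts = [a for a in fields[4].split(",")
                if a not in (".", "*") and not a.startswith("<")]
        if any(len(a) != len(ref) for a in alts):
            yield chrom, pos


def count_microsatellite_indels(vcf_lines: Iterable[str], regions: Regions) -> int:
    # Assumption: membership in an annotated interval defines "microsatellite-based".
    return sum(
        1
        for chrom, pos in iter_indels(vcf_lines)
        if any(start <= pos <= end for start, end in regions.get(chrom, []))
    )


def msi_index(indel_count: int, sequenced_bases: int) -> float:
    # Assumption: the index is expressed as indels per megabase sequenced.
    if sequenced_bases <= 0:
        raise ValueError("sequenced_bases must be positive")
    return indel_count / (sequenced_bases / 1_000_000)
```

The linear scan over intervals keeps the sketch readable; a sorted interval index would be the natural replacement for exome-scale inputs.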

CopyDetective: Detection threshold–aware copy number variant calling in whole-exome sequencing data

GigaScience ◽

10.1093/gigascience/giaa118 ◽

2020 ◽

Vol 9 (11) ◽

Cited By ~ 1

Author(s):

Sarah Sandmann ◽

Marius Wöste ◽

Aniek O de Graaf ◽

Birgit Burkhardt ◽

Joop H Jansen ◽

...

Keyword(s):

Exome Sequencing ◽

Whole Exome Sequencing ◽

Copy Number ◽

Detection Threshold ◽

Variant Calling ◽

Superior Performance ◽

Quality Analysis ◽

Data Set ◽

Detection Thresholds ◽

Whole Exome

Background: Copy number variants (CNVs) are known to play an important role in the development and progression of several diseases. However, detection of CNVs with whole-exome sequencing (WES) experiments is challenging. Usually, additional experiments have to be performed. Findings: We developed a novel algorithm for somatic CNV calling in matched WES data called "CopyDetective". Different from other approaches, CNV calling with CopyDetective consists of a 2-step procedure: first, quality analysis is performed, determining individual detection thresholds for every sample. Second, actual CNV calling on the basis of the previously determined thresholds is performed. Our algorithm evaluates the change in variant allele frequency of polymorphisms and reports the fraction of affected cells for every CNV. Analyzing 4 WES data sets (n = 100) we observed superior performance of CopyDetective compared with ExomeCNV, VarScan2, ControlFREEC, ExomeDepth, and CNV-seq. Conclusions: Individual detection thresholds reveal that not every WES data set is equally apt for CNV calling. Initial quality analyses that determine individual detection thresholds, as realized by CopyDetective, can and should be performed prior to actual variant calling.

Development and performance of a targeted whole exome sequencing enrichment kit for the dog (Canis Familiaris Build 3.1)

Scientific Reports ◽

10.1038/srep05597 ◽

2014 ◽

Vol 4 (1) ◽

Cited By ~ 16

Author(s):

Bart J. G. Broeckx ◽

Frank Coopman ◽

Geert E. C. Verhoeven ◽

Valérie Bavegems ◽

Sarah De Keulenaer ◽

...

Keyword(s):

Exome Sequencing ◽

Whole Exome Sequencing ◽

Reference Genome ◽

Association Studies ◽

Disease Association ◽

Illumina Hiseq ◽

Protein Coding ◽

Whole Exome ◽

And Performance ◽

Average Sequencing Depth

Whole exome sequencing is a technique that aims to selectively sequence all exons of protein-coding genes. A canine whole exome sequencing enrichment kit was designed based on the latest canine reference genome (build 3.1.72). Its performance was tested by sequencing 2 exome captures, each consisting of 4 pre-capture pooled, barcoded Illumina libraries on an Illumina HiSeq 2500. At an average sequencing depth of 102x, 83 to 86% of the target regions were completely sequenced with a minimum coverage of five and 90% of the reads mapped on the target regions. Additionally, it is shown that the reproducibility within and between captures is high and that pooling four samples per capture is a valid option. Overall, we have demonstrated the strong performance of this WES enrichment kit and are confident it will be a valuable tool in future disease association studies.
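The "completely sequenced with a minimum coverage of five" statistic can be read as the share of target regions in which every base reaches a depth of at least five. The sketch below follows that reading, which is an assumption since the abstract does not spell out the calculation. The function name and input layout are hypothetical.

```python
from typing import List, Sequence


def fully_covered_fraction(target_depths: List[Sequence[int]], min_depth: int = 5) -> float:
    """Fraction of targets whose every base has depth >= min_depth.

    target_depths holds one sequence of per-base depths per target region.
    A target with no depth values is counted as not covered.
    """
    if not target_depths:
        return 0.0
    covered = sum(1 for depths in target_depths if depths and min(depths) >= min_depth)
    return covered / len(target_depths)
```

Per-base depths of this kind are typically exported from an alignment file by a depth-reporting tool and then grouped by target interval before being passed in.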

A MT-TL1 variant identified by whole exome sequencing in an individual with intellectual disability, epilepsy, and spastic tetraparesis

European Journal of Human Genetics ◽

10.1038/s41431-021-00900-2 ◽

2021 ◽

Author(s):

Elke de Boer ◽

Charlotte W. Ockeloen ◽

Leslie Matalonga ◽

Rita Horvath ◽

...

Keyword(s):

Intellectual Disability ◽

Exome Sequencing ◽

Whole Exome Sequencing ◽

Variant Calling ◽

Maternal Line ◽

Feeding Difficulties ◽

Severe Intellectual Disability ◽

Whole Exome ◽

Mtdna Variants ◽

Spastic Tetraparesis

The genetic etiology of intellectual disability remains elusive in almost half of all affected individuals. Within the Solve-RD consortium, systematic re-analysis of whole exome sequencing (WES) data from unresolved cases with (syndromic) intellectual disability (n = 1,472 probands) was performed. This re-analysis included variant calling of mitochondrial DNA (mtDNA) variants, although mtDNA is not specifically targeted in WES. We identified a functionally relevant mtDNA variant in MT-TL1 (NC_012920.1:m.3291T > C; NC_012920.1:n.62T > C), at a heteroplasmy level of 22% in whole blood, in a 23-year-old male with severe intellectual disability, epilepsy, episodic headaches with emesis, spastic tetraparesis, brain abnormalities, and feeding difficulties. Targeted validation in blood and urine supported pathogenicity, with heteroplasmy levels of 23% and 58% in the index, and 4% and 17% in the mother, respectively. Interestingly, not all phenotypic features observed in the index have been previously linked to this MT-TL1 variant, suggesting either broadening of the m.3291T > C-associated phenotype, or presence of a co-occurring disorder. Hence, our case highlights the importance of underappreciated mtDNA variants identifiable from WES data, especially for cases with atypical mitochondrial phenotypes and their relatives in the maternal line.

BALSA: Integrated secondary analysis for whole-genome and whole-exome sequencing, accelerated by GPU

10.7287/peerj.preprints.373v1 ◽

2014 ◽

Author(s):

Ruibang Luo ◽

Yiu-Lun Wong ◽

Wai-Chun Law ◽

Lap-Kei Lee ◽

Chi-Man Liu ◽

...

Keyword(s):

Exome Sequencing ◽

Whole Exome Sequencing ◽

Secondary Analysis ◽

Variant Calling ◽

Statistical Testing ◽

Next Generation Sequencing Data ◽

Whole Genome ◽

Sequencing Data ◽

Accurate Analysis ◽

Whole Exome

This paper reports an integrated solution, called BALSA, for the secondary analysis of next generation sequencing data; it exploits the computational power of the GPU and intricate memory management to give a fast and accurate analysis. From raw reads to variants (including SNPs and Indels), BALSA, using just a single computing node with a commodity GPU board, takes 5.5 hours to process 50-fold whole genome sequencing (~750 million 100bp paired-end reads), or just 25 minutes for 210-fold whole exome sequencing. BALSA's speed is rooted in its parallel algorithms, which effectively exploit a GPU to speed up processes like alignment, realignment and statistical testing. BALSA incorporates a 16-genotype model to support the calling of SNPs and Indels and achieves competitive variant calling accuracy and sensitivity when compared to the ensemble of six popular variant callers. BALSA also supports efficient identification of somatic SNVs and CNVs; experiments showed that BALSA recovers all the previously validated somatic SNVs and CNVs, and it is more sensitive for somatic Indel detection. BALSA outputs variants in VCF format. A pileup-like SNAPSHOT format, while maintaining the same fidelity as BAM in variant calling, enables efficient storage and indexing, and facilitates app development for downstream analyses. BALSA is available at: http://sourceforge.net/p/balsa

Intersect-then-combine approach: improving the performance of somatic variant calling in whole exome sequencing data using multiple aligners and callers

Genome Medicine ◽

10.1186/s13073-017-0425-1 ◽

2017 ◽

Vol 9 (1) ◽

Cited By ~ 28

Author(s):

Maurizio Callari ◽

Stephen-John Sammut ◽

Leticia De Mattos-Arruda ◽

Alejandra Bruna ◽

Oscar M. Rueda ◽

...

Keyword(s):

Exome Sequencing ◽

Whole Exome Sequencing ◽

Variant Calling ◽

Combine Approach ◽

Sequencing Data ◽

Somatic Variant ◽

Exome Sequencing Data ◽

Whole Exome ◽

Whole Exome Sequencing Data

Improved Variant Calling Accuracy by Merging Replicates in Whole-Exome Sequencing Studies

BioMed Research International ◽

10.1155/2014/319534 ◽

2014 ◽

Vol 2014 ◽

pp. 1-7 ◽

Cited By ~ 4

Author(s):

Yanfeng Zhang ◽

Bingshan Li ◽

Chun Li ◽

Qiuyin Cai ◽

Wei Zheng ◽

...

Keyword(s):

Exome Sequencing ◽

Whole Exome Sequencing ◽

Large Scale ◽

Comprehensive Evaluation ◽

Large Population ◽

Variant Calling ◽

Population Based ◽

Sequencing Data ◽

Whole Exome ◽

Lower Depth

In large-scale population-based whole-exome sequencing (WES) studies, some samples are occasionally sequenced two or more times for a variety of reasons. To investigate how to efficiently utilize these duplicated sequencing data, we conducted a comprehensive evaluation of variant calling strategies. Ninety-two samples subjected to WES twice were selected from a large population study. These 92 duplicated samples were divided into two groups: group H, consisting of the higher sequencing depth for each subject, and group L, consisting of the lower depth for each subject. The merged samples for each subject were put in a third group, M. Using the GATK multisample toolkit, we compared variant calling accuracy among the three strategies. Hierarchical clustering analysis indicated that the two replicates for each subject showed high homogeneity. The comparative analyses on the basis of heterozygous-homozygous ratio (Hete/Homo), transition-transversion ratio (Ti/Tv), and overlapping rate with the 1000 Genomes Project consistently showed that the SNPs detected from the M group were of higher quality than the SNPs detected from the H and L groups. These results suggested that merging homogeneous duplicated exomes instead of using one of them could improve variant calling accuracy.
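Two of the quality indicators named above, Ti/Tv and Hete/Homo, are simple ratios over a set of SNP calls. The following sketch shows one way they could be computed. It assumes biallelic single-base SNPs, VCF-style genotype strings such as `0/1`, and that only non-reference homozygous calls enter the Hete/Homo denominator. The function names are hypothetical and this is not the study's own code.

```python
from typing import Iterable, Tuple

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref: str, alt: str) -> bool:
    # Transitions stay within a base class: A<->G or C<->T.
    pair = {ref.upper(), alt.upper()}
    return len(pair) == 2 and (pair <= PURINES or pair <= PYRIMIDINES)


def titv_ratio(snps: Iterable[Tuple[str, str]]) -> float:
    ti = tv = 0
    for ref, alt in snps:
        if ref.upper() == alt.upper():
            continue  # not a substitution
        if is_transition(ref, alt):
            ti += 1
        else:
            tv += 1
    return ti / tv if tv else float("inf")


def het_hom_ratio(genotypes: Iterable[str]) -> float:
    # Assumption: "hom" means homozygous for a non-reference allele.
    het = hom_alt = 0
    for gt in genotypes:
        alleles = gt.replace("|", "/").split("/")
        if "." in alleles:
            continue  # skip missing genotypes
        if len(set(alleles)) > 1:
            het += 1
        elif alleles[0] != "0":
            hom_alt += 1
    return het / hom_alt if hom_alt else float("inf")
```

Computing both ratios for each of the H, L and M call sets would give the side-by-side comparison that the abstract describes.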