NPSV: A simulation-driven approach to genotyping structural variants in whole-genome sequencing data

Michael D Linderman; Crystal Paudyal; Musab Shakeel; William Kelley; Ali Bashir; Bruce D Gelb

doi:10.1093/gigascience/giab046

GigaScience ◽

10.1093/gigascience/giab046 ◽

2021 ◽

Vol 10 (7) ◽

Author(s):

Michael D Linderman ◽

Crystal Paudyal ◽

Musab Shakeel ◽

William Kelley ◽

Ali Bashir ◽

...

Keyword(s):

Next Generation Sequencing ◽

De Novo ◽

Training Data ◽

Next Generation Sequencing Data ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Next Generation ◽

Structural Variants ◽

Sequencing Data ◽

Generation Sequencing

Abstract

Background: Structural variants (SVs) play a causal role in numerous diseases but are difficult to detect and accurately genotype (determine zygosity) in whole-genome next-generation sequencing data. SV genotypers that assume that the aligned sequencing data uniformly reflect the underlying SV or use existing SV call sets as training data can only partially account for variant and sample-specific biases.

Results: We introduce NPSV, a machine learning–based approach for genotyping previously discovered SVs that uses next-generation sequencing simulation to model the combined effects of the genomic region, sequencer, and alignment pipeline on the observed SV evidence. We evaluate NPSV alongside existing SV genotypers on multiple benchmark call sets. We show that NPSV consistently achieves or exceeds state-of-the-art genotyping accuracy across SV call sets, samples, and variant types. NPSV can specifically identify putative de novo SVs in a trio context and is robust to offset SV breakpoints.

Conclusions: Growing SV databases and the increasing availability of SV calls from long-read sequencing make stand-alone genotyping of previously identified SVs an increasingly important component of genome analyses. By treating potential biases as a “black box” that can be simulated, NPSV provides a framework for accurately genotyping a broad range of SVs in both targeted and genome-scale applications.
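The idea of training a genotyper on simulated data can be illustrated with a toy sketch. This is not NPSV's implementation: the single feature (fraction of reads supporting the alternate allele), the `bias` parameter standing in for region, sequencer, and aligner effects, and the nearest-centroid classifier are all assumptions chosen for brevity.

```python
import random
from statistics import mean

GENOTYPES = ("0/0", "0/1", "1/1")
# Idealized fraction of reads supporting the alternate allele per genotype.
EXPECTED_ALT_FRACTION = {"0/0": 0.0, "0/1": 0.5, "1/1": 1.0}


def simulate_alt_fraction(genotype, depth, bias, rng):
    """Simulate the observed alt-read fraction for one replicate.

    `bias` is a hypothetical stand-in for pipeline effects: the probability
    that a true alt-supporting read is lost (for example, mis-aligned).
    """
    p_alt = EXPECTED_ALT_FRACTION[genotype] * (1.0 - bias)
    alt_reads = sum(1 for _ in range(depth) if rng.random() < p_alt)
    return alt_reads / depth


def train_centroids(depth, bias, replicates, rng):
    """Build one centroid per genotype from simulated replicates only."""
    return {
        gt: mean(simulate_alt_fraction(gt, depth, bias, rng) for _ in range(replicates))
        for gt in GENOTYPES
    }


def genotype(observed_alt_fraction, centroids):
    """Assign the genotype whose simulated centroid is closest."""
    return min(centroids, key=lambda gt: abs(centroids[gt] - observed_alt_fraction))


rng = random.Random(0)  # fixed seed so the sketch is reproducible
centroids = train_centroids(depth=30, bias=0.4, replicates=200, rng=rng)
# With 40% of alt reads lost, a heterozygote shows roughly 0.3, not 0.5.
print(genotype(0.30, centroids))
print(genotype(0.55, centroids))
```

Under these assumptions, an observed alt fraction of 0.55 is called homozygous alternate, whereas a genotyper that expects the idealized 0.5 and 1.0 fractions would likely call it heterozygous. That gap is the kind of variant-specific bias the abstract says simulation is meant to capture.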

De Novo Genome Assembly of Next-Generation Sequencing Data

Compendium of Plant Genomes - The Brassica rapa Genome ◽

10.1007/978-3-662-47901-8_4 ◽

2015 ◽

pp. 41-51

Author(s):

Min Liu ◽

Dongyuan Liu ◽

Hongkun Zheng

Keyword(s):

Next Generation Sequencing ◽

Genome Assembly ◽

De Novo ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Sequencing Data ◽

De Novo Genome Assembly ◽

Generation Sequencing

Extraction of Mitochondrial Genome from Whole Genome Next Generation Sequencing Data and Unveiling of Forensically Relevant Markers

Russian Journal of Genetics ◽

10.1134/s1022795420080128 ◽

2020 ◽

Vol 56 (8) ◽

pp. 982-991

Author(s):

S. Rauf ◽

N. Zahra ◽

S. S. Malik ◽

S. A. e Zahra ◽

K. Sughra ◽

...

Keyword(s):

Next Generation Sequencing ◽

Mitochondrial Genome ◽

Next Generation Sequencing Data ◽

Whole Genome ◽

Next Generation ◽

Sequencing Data ◽

Generation Sequencing

An ensemble strategy that significantly improves de novo assembly of microbial genomes from metagenomic next-generation sequencing data

Nucleic Acids Research ◽

10.1093/nar/gkv002 ◽

2015 ◽

Vol 43 (7) ◽

pp. e46-e46 ◽

Cited By ~ 125

Author(s):

Xutao Deng ◽

Samia N. Naccache ◽

Terry Ng ◽

Scot Federman ◽

Linlin Li ◽

...

Keyword(s):

Next Generation Sequencing ◽

De Novo Assembly ◽

De Novo ◽

Next Generation Sequencing Data ◽

De Bruijn Graph ◽

Next Generation ◽

Sequencing Data ◽

Short Reads ◽

Ensemble Strategy ◽

Generation Sequencing

Abstract Next-generation sequencing (NGS) approaches rapidly produce millions to billions of short reads, which allow pathogen detection and discovery in human clinical, animal and environmental samples. A major limitation of sequence homology-based identification for highly divergent microorganisms is the short length of reads generated by most highly parallel sequencing technologies. Short reads require a high level of sequence similarity to annotated genes to confidently predict gene function or homology. Such recognition of highly divergent homologues can be improved by reference-free (de novo) assembly of short overlapping sequence reads into larger contigs. We describe an ensemble strategy that integrates the sequential use of various de Bruijn graph and overlap-layout-consensus assemblers with a novel partitioned sub-assembly approach. We also propose new quality metrics that are suitable for evaluating metagenome de novo assembly. We demonstrate that this new ensemble strategy, tested using in silico spike-in, clinical and environmental NGS datasets, achieved significantly better contigs than current approaches.
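For readers unfamiliar with the de Bruijn graph assemblers mentioned in this abstract, the following minimal sketch shows the basic data structure. It is not the paper's ensemble strategy: the function names, the toy reads, and the simplified walk (which checks only out-degree and ignores sequencing errors and reverse complements) are assumptions for illustration.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the list of (k-1)-mers that follow it in a read."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


def extend_unambiguous(graph, start):
    """Walk from `start` while each node has exactly one distinct successor."""
    contig, node, seen = start, start, {start}
    while True:
        successors = set(graph.get(node, []))
        if len(successors) != 1:
            break  # dead end or branch: stop extending
        node = successors.pop()
        if node in seen:
            break  # stop on cycles
        seen.add(node)
        contig += node[-1]
    return contig


# Three hypothetical overlapping short reads.
reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
graph = de_bruijn_graph(reads, k=4)
contig = extend_unambiguous(graph, "ATG")
print(contig)
```

Here the three 6-base reads are merged into one 10-base contig, which is the sense in which assembly turns short overlapping reads into longer sequences that are easier to compare against annotated genes.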

Replicate whole-genome next-generation sequencing data derived from Caucasian donor saliva samples

Data in Brief ◽

10.1016/j.dib.2021.107349 ◽

2021 ◽

pp. 107349

Author(s):

Marcus Høy Hansen ◽

Charlotte Guldborg Nyvold

Keyword(s):

Next Generation Sequencing ◽

Next Generation Sequencing Data ◽

Whole Genome ◽

Next Generation ◽

Sequencing Data ◽

Generation Sequencing

Whole‐Genome Sequencing Analysis Using Next‐Generation Sequencing Data

Current Protocols Essential Laboratory Techniques ◽

10.1002/cpet.2 ◽

2016 ◽

Vol 12 (1) ◽

Author(s):

Chi Kent Ho ◽

Xiaohui Cui ◽

Sharon Grubner ◽

Christopher A. Larson ◽

Ying Wei ◽

...

Keyword(s):

Next Generation Sequencing ◽

Whole Genome Sequencing ◽

Genome Sequencing ◽

Next Generation Sequencing Data ◽

Whole Genome ◽

Sequencing Analysis ◽

Next Generation ◽

Sequencing Data ◽

Generation Sequencing

Detection of somatic structural variants from short-read next-generation sequencing data

10.1101/840751 ◽

2019 ◽

Author(s):

Tingting Gong ◽

Vanessa M Hayes ◽

Eva KF Chan

Keyword(s):

Next Generation Sequencing ◽

Cancer Genomics ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Structural Variants ◽

Sequencing Data ◽

Short Read ◽

Factors Affecting ◽

NGS Data ◽

Generation Sequencing

Abstract Somatic structural variants (SVs) play a significant role in cancer development and evolution, but are notoriously more difficult to detect than small variants from short-read next-generation sequencing (NGS) data. This is due to a combination of challenges attributed to the purity of tumour samples, tumour heterogeneity, limitations of short-read information from NGS, and sequence alignment ambiguities. In spite of active development of SV detection tools (callers) over the past few years, each method has inherent advantages and limitations. In this review, we highlight some of the important factors affecting somatic SV detection and compare the performance of eight commonly used SV callers. In particular, we focus on the extent of change in sensitivity and precision for detecting different SV types and size ranges from samples with differing variant allele frequencies and sequencing depths of coverage. We highlight the reasons why some SV callers perform well in some settings but not others, allowing our evaluation findings to be extended beyond the eight SV callers examined in this paper. As the importance of large structural variants becomes increasingly recognised in cancer genomics, this paper provides a timely review on some of the most impactful factors influencing somatic SV detection and guidance on selecting an appropriate SV caller.

Effects of GC Bias in Next-Generation-Sequencing Data on De Novo Genome Assembly

PLoS ONE ◽

10.1371/journal.pone.0062856 ◽

2013 ◽

Vol 8 (4) ◽

pp. e62856 ◽

Cited By ~ 121

Author(s):

Yen-Chun Chen ◽

Tsunglin Liu ◽

Chun-Hui Yu ◽

Tzen-Yuh Chiang ◽

Chi-Chuan Hwang

Keyword(s):

Next Generation Sequencing ◽

Genome Assembly ◽

De Novo ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Sequencing Data ◽

De Novo Genome Assembly ◽

GC Bias ◽

Generation Sequencing

Detection of somatic structural variants from short-read next-generation sequencing data

Briefings in Bioinformatics ◽

10.1093/bib/bbaa056 ◽

2020 ◽

Cited By ~ 1

Author(s):

Tingting Gong ◽

Vanessa M Hayes ◽

Eva K F Chan

Keyword(s):

Next Generation Sequencing ◽

Cancer Genomics ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Structural Variants ◽

Sequencing Data ◽

Short Read ◽

Factors Affecting ◽

NGS Data ◽

Generation Sequencing

Abstract Somatic structural variants (SVs), which are variants that typically impact >50 nucleotides, play a significant role in cancer development and evolution but are notoriously more difficult to detect than small variants from short-read next-generation sequencing (NGS) data. This is due to a combination of challenges attributed to the purity of tumour samples, tumour heterogeneity, limitations of short-read information from NGS and sequence alignment ambiguities. In spite of active development of SV detection tools (callers) over the past few years, each method has inherent advantages and limitations. In this review, we highlight some of the important factors affecting somatic SV detection and compare the performance of seven commonly used SV callers. In particular, we focus on the extent of change in sensitivity and precision for detecting different SV types and size ranges from samples with differing variant allele frequencies and sequencing depths of coverage. We highlight the reasons why some SV callers perform well in some settings but not others, allowing our evaluation findings to be extended beyond the seven SV callers examined in this paper. As the importance of large SVs becomes increasingly recognized in cancer genomics, this paper provides a timely review on some of the most impactful factors influencing somatic SV detection that should be considered when choosing SV callers.
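A simplified model can suggest why sensitivity depends on variant allele frequency (VAF) and sequencing depth. The sketch below assumes that each read covering a breakpoint independently supports the variant with probability equal to the VAF, and that a caller needs at least `k` supporting reads. Both the independence assumption and the threshold are hypothetical and are not taken from the review.

```python
from math import comb


def prob_at_least_k_supporting_reads(depth, vaf, k):
    """Return P(X >= k) for X ~ Binomial(depth, vaf)."""
    p_fewer = sum(
        comb(depth, i) * vaf**i * (1 - vaf) ** (depth - i) for i in range(k)
    )
    return 1.0 - p_fewer


# Compare a low-VAF and a high-VAF variant at two hypothetical depths.
for depth in (30, 60):
    for vaf in (0.1, 0.5):
        p = prob_at_least_k_supporting_reads(depth, vaf, k=3)
        print(f"depth={depth} vaf={vaf}: P(>=3 supporting reads) = {p:.3f}")
```

In this model the chance of seeing enough supporting reads rises with both depth and VAF. Real callers also depend on mappability, tumour purity, and SV type, which this sketch ignores.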

Correction: PeSV-Fisher: Identification of Somatic and Non-Somatic Structural Variants Using Next Generation Sequencing Data

PLoS ONE ◽

10.1371/annotation/6444bc1a-3501-482e-8cbe-a27ebc66b153 ◽

2013 ◽

Vol 8 (11) ◽

Author(s):

Geòrgia Escaramís ◽

Cristian Tornador ◽

Laia Bassaganyas ◽

Raquel Rabionet ◽

Jose M. C. Tubio ◽

...

Keyword(s):

Next Generation Sequencing ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Structural Variants ◽

Sequencing Data ◽

Generation Sequencing

SimFFPE and FilterFFPE: improving structural variant calling in FFPE samples

GigaScience ◽

10.1093/gigascience/giab065 ◽

2021 ◽

Vol 10 (9) ◽

Cited By ~ 1

Author(s):

Lanying Wei ◽

Martin Dugas ◽

Sarah Sandmann

Keyword(s):

Next Generation Sequencing ◽

Variant Calling ◽

Real Data ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Structural Variants ◽

Sequencing Data ◽

Structural Variant ◽

FFPE Samples ◽

Generation Sequencing

Abstract

Background: Artifact chimeric reads are enriched in next-generation sequencing data generated from formalin-fixed paraffin-embedded (FFPE) samples. Previous work indicated that these reads are characterized by erroneous split-read support that is interpreted as evidence of structural variants. Thus, a large number of false-positive structural variants are detected. To our knowledge, no tool is currently available to specifically call or filter structural variants in FFPE samples. To overcome this gap, we developed 2 R packages: SimFFPE and FilterFFPE.

Results: SimFFPE is a read simulator, specifically designed for next-generation sequencing data from FFPE samples. A mixture of characteristic artifact chimeric reads, as well as normal reads, is generated. FilterFFPE is a filtration algorithm, removing artifact chimeric reads from sequencing data while keeping real chimeric reads. To evaluate the performance of FilterFFPE, we performed structural variant calling with 3 common tools (Delly, Lumpy, and Manta) with and without prior filtration with FilterFFPE. After applying FilterFFPE, the mean positive predictive value improved from 0.27 to 0.48 in simulated samples and from 0.11 to 0.27 in real samples, while sensitivity remained basically unchanged or even slightly increased.

Conclusions: FilterFFPE improves the performance of SV calling in FFPE samples. It was validated by analysis of simulated and real data.
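The two metrics reported in this abstract follow the standard definitions: positive predictive value is TP / (TP + FP) and sensitivity is TP / (TP + FN). The sketch below computes both. The counts are invented purely for illustration and are not taken from the paper.

```python
def positive_predictive_value(true_positives, false_positives):
    """Fraction of called variants that are real: TP / (TP + FP)."""
    called = true_positives + false_positives
    return true_positives / called if called else float("nan")


def sensitivity(true_positives, false_negatives):
    """Fraction of real variants that were called: TP / (TP + FN)."""
    real = true_positives + false_negatives
    return true_positives / real if real else float("nan")


# Hypothetical counts: filtering removes false positives but no true positives.
before = {"tp": 40, "fp": 60, "fn": 10}
after = {"tp": 40, "fp": 20, "fn": 10}

ppv_before = positive_predictive_value(before["tp"], before["fp"])
ppv_after = positive_predictive_value(after["tp"], after["fp"])
sens_before = sensitivity(before["tp"], before["fn"])
sens_after = sensitivity(after["tp"], after["fn"])
print(f"PPV {ppv_before:.2f} -> {ppv_after:.2f}")
print(f"sensitivity {sens_before:.2f} -> {sens_after:.2f}")
```

Removing false positives while keeping the true positives raises the positive predictive value without changing sensitivity. This matches the pattern the abstract describes, although the real effect sizes are the ones reported there.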
