MERIT: a Mutation Error Rate Identification Toolkit for Ultra-deep Sequencing Applications

Mapping Intimacies ◽

10.1101/184291 ◽

2017 ◽

Cited By ~ 1

Author(s):

Mohammad Hadigol ◽

Hossein Khiabanian

Keyword(s):

Deep Sequencing ◽

High Throughput Sequencing ◽

Clonal Evolution ◽

False Negative ◽

Error Rates ◽

Sequencing Data ◽

Genomic Context ◽

Nucleotide Incorporation ◽

Double Base

Abstract Rapid progress in high-throughput sequencing (HTS) has enabled the molecular characterization of mutational landscapes in heterogeneous populations and has improved our understanding of clonal evolution processes. Analyzing the sensitivity of detecting genomic mutations in HTS requires comprehensive profiling of sequencing artifacts. To this end, we introduce MERIT, designed for in-depth quantification of erroneous substitutions and small insertions and deletions, specifically for ultra-deep applications. MERIT incorporates an all-inclusive variant caller and considers genomic context, including the nucleotides immediately at 5′ and 3′, thereby establishing error rates for 96 possible substitutions as well as four single-base and 16 double-base indels. We apply MERIT to ultra-deep sequencing data (1,300,000×) and show a significant relationship between error rates and genomic contexts. We devise an in silico approach to determine the optimal sequencing depth, where errors occur at rates similar to those of true mutations. Finally, we assess nucleotide-incorporation fidelity of four high-fidelity DNA polymerases in clinically relevant loci, and demonstrate how fixed detection thresholds may result in substantial false positive as well as false negative calls.
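The counts quoted in the abstract (96 substitutions, four single-base and 16 double-base indels) can be reproduced with a short enumeration. The sketch below is illustrative only and is not taken from MERIT: it assumes the common convention of collapsing substitutions onto a pyrimidine reference base (C or T) so that each of the 6 substitution types is combined with 4 × 4 flanking 5′ and 3′ nucleotides. The function names and the `A[C>T]G` label format are hypothetical.

```python
from itertools import product

BASES = "ACGT"

# Assumption: substitutions are collapsed onto a pyrimidine reference base,
# giving 6 substitution types (C>A, C>G, C>T, T>A, T>C, T>G).
PYRIMIDINE_SUBS = [(ref, alt) for ref in "CT" for alt in BASES if alt != ref]


def substitution_contexts():
    """Return labels such as 'A[C>T]G': 5' base, substitution, 3' base."""
    return [
        f"{five}[{ref}>{alt}]{three}"
        for ref, alt in PYRIMIDINE_SUBS
        for five, three in product(BASES, repeat=2)
    ]


def indel_units(length):
    """Return every possible inserted or deleted sequence of a given length."""
    return ["".join(unit) for unit in product(BASES, repeat=length)]


contexts = substitution_contexts()
single_base_indels = indel_units(1)
double_base_indels = indel_units(2)

print(len(contexts), len(single_base_indels), len(double_base_indels))
```

Under these assumptions the three lists contain 6 × 16 = 96, 4 and 4² = 16 entries respectively, matching the categories named in the abstract; how MERIT itself labels or stores them is not stated there.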

Development of a User-Friendly Pipeline for Mutational Analyses of HIV Using Ultra-Accurate Maximum-Depth Sequencing

Viruses ◽

10.3390/v13071338 ◽

2021 ◽

Vol 13 (7) ◽

pp. 1338

Author(s):

Morgan E. Meissner ◽

Emily J. Julik ◽

Jonathan P. Badalamenti ◽

William G. Arndt ◽

Lauren J. Mills ◽

...

Keyword(s):

Error Rates ◽

Maximum Depth ◽

Sequencing Data ◽

Background Error ◽

High Background ◽

Immunodeficiency Virus ◽

User Friendly ◽

Viral Mutagenesis ◽

HIV-1

Human immunodeficiency virus type 2 (HIV-2) accumulates fewer mutations during replication than HIV type 1 (HIV-1). Advanced studies of HIV-2 mutagenesis, however, have historically been confounded by high background error rates in traditional next-generation sequencing techniques. In this study, we describe the adaptation of the previously described maximum-depth sequencing (MDS) technique to studies of both HIV-1 and HIV-2 for the ultra-accurate characterization of viral mutagenesis. We also present the development of a user-friendly Galaxy workflow for the bioinformatic analyses of sequencing data generated using the MDS technique, designed to improve replicability and accessibility to molecular virologists. This adapted MDS technique and analysis pipeline were validated by comparisons with previously published analyses of the frequency and spectra of mutations in HIV-1 and HIV-2 and are readily expandable to studies of viral mutation across the genomes of both viruses. Using this novel sequencing pipeline, we observed that the background error rate was reduced 100-fold over standard Illumina error rates, and 10-fold over traditional unique molecular identifier (UMI)-based sequencing. This technical advancement will allow for the exploration of novel and previously unrecognized sources of viral mutagenesis in both HIV-1 and HIV-2, which will expand our understanding of retroviral diversity and evolution.

debar, a sequence-by-sequence denoiser for COI-5P DNA barcode data

10.1101/2021.01.04.425285 ◽

2021 ◽

Author(s):

Cameron M. Nugent ◽

Tyler A. Elliott ◽

Sujeevan Ratnasingham ◽

Paul D. N. Hebert ◽

Sarah J. Adamowicz

Keyword(s):

High Throughput Sequencing ◽

DNA Barcode ◽

R Package ◽

Error Rates ◽

Real World Data ◽

Species Discovery ◽

Consensus Sequences ◽

In Silico Studies ◽

COI Sequences

Abstract DNA barcoding and metabarcoding are now widely used to advance species discovery and biodiversity assessments. High-throughput sequencing (HTS) has expanded the volume and scope of these analyses, but elevated error rates introduce noise into sequence records that can inflate estimates of biodiversity. Denoising, the separation of biological signal from instrument (technical) noise, of barcode and metabarcode data currently employs abundance-based methods which do not capitalize on the highly conserved structure of the cytochrome c oxidase subunit I (COI) region employed as the animal barcode. This manuscript introduces debar, an R package that utilizes a profile hidden Markov model to denoise indel errors in COI sequences introduced by instrument error. In silico studies demonstrated that debar recognized 95% of artificially introduced indels in COI sequences. When applied to real-world data, debar reduced indel errors in circular consensus sequences obtained with the Sequel platform by 75%, and those generated on the Ion Torrent S5 by 94%. The false correction rate was less than 0.1%, indicating that debar is receptive to the majority of true COI variation in the animal kingdom. In conclusion, the debar package improves DNA barcode and metabarcode workflows by aiding the generation of more accurate sequences, which in turn aids the characterization of species diversity.

Identification and characterization of Coronaviridae genomes from Vietnamese bats and rats based on conserved protein domains

Virus Evolution ◽

10.1093/ve/vey035 ◽

2018 ◽

Vol 4 (2) ◽

Cited By ~ 15

Author(s):

My V T Phan ◽

Tue Ngo Tri ◽

Pham Hong Anh ◽

Stephen Baker ◽

Paul Kellam ◽

...

Keyword(s):

Deep Sequencing ◽

Protein Domains ◽

Sequencing Data ◽

Essential Information ◽

Diagnostic Assays ◽

Local Diversity ◽

Virus Surveillance ◽

Deep Sequencing Technology ◽

Classification Tool

Abstract The Coronaviridae family of viruses encompasses a group of pathogens with a zoonotic potential as observed from previous outbreaks of the severe acute respiratory syndrome coronavirus and Middle East respiratory syndrome coronavirus. Accordingly, it seems important to identify and document the coronaviruses in animal reservoirs, many of which are uncharacterized and potentially missed by more standard diagnostic assays. A combination of sensitive deep sequencing technology and computational algorithms is essential for virus surveillance, especially for characterizing novel or distantly related virus strains. Here, we explore the use of profile Hidden Markov Model-defined Pfam protein domains (Pfam domains) encoded by new sequences as a Coronaviridae sequence classification tool. The encoded domains are used first in a triage to identify potential Coronaviridae sequences and then processed using a Random Forest method to classify the sequences to the Coronaviridae genus level. The application of this algorithm on Coronaviridae genomes assembled from agnostic deep sequencing data from surveillance of bats and rats in Dong Thap province (Vietnam) identified thirty-four Alphacoronavirus and eleven Betacoronavirus genomes. This collection of bat and rat coronavirus genomes provided essential information on the local diversity of coronaviruses, substantially expanded the number of full coronavirus genomes available from bats and rats, and may facilitate further molecular studies on this group of viruses.

Characterization of segmental duplications and large inversions using Linked-Reads

10.1101/394528 ◽

2018 ◽

Cited By ~ 4

Author(s):

Fatih Karaoglanoglu ◽

Camir Ricketts ◽

Marzieh Eslami Rasekh ◽

Ezgi Ebren ◽

Iman Hajirasouliha ◽

...

Keyword(s):

High Throughput Sequencing ◽

Segmental Duplications ◽

Sequencing Data ◽

Full Spectrum ◽

Genomic Structural Variation ◽

Split Read ◽

Long Read ◽

Novel Algorithms ◽

Insertion Locus

Abstract Many algorithms aimed at characterizing genomic structural variation (SV) have been developed since the inception of high-throughput sequencing. However, the full spectrum of SVs in the human genome is not yet assessed. Most of the existing methods focus on discovery and genotyping of deletions, insertions, and mobile elements. Detection of balanced SVs with no gain or loss of genomic segments (e.g., inversions) is a particularly challenging task. Long read sequencing has been leveraged to find short inversions but there is still a need to develop methods to detect large genomic inversions. Furthermore, currently there are no algorithms to predict the insertion locus of large interspersed segmental duplications. Here we propose novel algorithms to characterize large (>40 Kbp) interspersed segmental duplications and (>80 Kbp) inversions using Linked-Read sequencing data. Linked-Read sequencing provides long range information, where Illumina reads are tagged with barcodes that can be used to assign short reads to pools of larger (30-50 Kbp) molecules. Our methods rely on the split molecule sequence signature that we have previously described [11]. Similar to the split read, split molecules refer to large segments of DNA that span an SV breakpoint. Therefore, when mapped to the reference genome, the mapping of these segments would be discontinuous. We redesign our earlier algorithm, VALOR, to specifically leverage Linked-Read sequencing data to discover large inversions and characterize interspersed segmental duplications. We implement our new algorithms in a new software package, called VALOR2. Availability VALOR2 is available at https://github.com/BilkentCompGen/valor.

The landscape of actionable genomic alterations in cell-free circulating tumor DNA from 21,807 advanced cancer patients

10.1101/233205 ◽

2017 ◽

Cited By ~ 2

Author(s):

Oliver A. Zill ◽

Kimberly C. Banks ◽

Stephen R. Fairclough ◽

Stefanie A. Mortimer ◽

James V. Vowles ◽

...

Keyword(s):

Deep Sequencing ◽

Clonal Evolution ◽

Patient Treatment ◽

Large Set ◽

Sequencing Analysis ◽

Cancer Genes ◽

Sequencing Data ◽

Mutual Exclusivity ◽

Advanced Cancer Patients ◽

Driver Genes

Abstract Cell-free DNA (cfDNA) sequencing provides a non-invasive method for obtaining actionable genomic information to guide personalized cancer treatment, but the presence of multiple alterations in circulation related to treatment and tumor heterogeneity poses analytical challenges. We present the somatic mutation landscape of 70 cancer genes from cfDNA deep-sequencing analysis of 21,807 patients with treated, late-stage cancers across >50 cancer types. Patterns and prevalence of cfDNA alterations in major driver genes for non-small cell lung, breast, and colorectal cancer largely recapitulated those from tumor tissue sequencing compendia (TCGA and COSMIC), with the principal differences in alteration prevalence being due to patient treatment. This highly sensitive cfDNA sequencing assay revealed numerous subclonal tumor-derived alterations, expected as a result of clonal evolution, but leading to an apparent departure from mutual exclusivity in treatment-naïve tumors. To facilitate interpretation of this added complexity, we developed methods to identify cfDNA copy-number driver alterations and cfDNA clonality. Upon applying these methods, robust mutual exclusivity was observed among predicted truncal driver cfDNA alterations, in effect distinguishing tumor-initiating alterations from secondary alterations. Treatment-associated resistance, including both novel alterations and parallel evolution, was common in the cfDNA cohort and was enriched in patients with targetable driver alterations. Together these retrospective analyses of a large set of cfDNA deep-sequencing data reveal subclonal structures and emerging resistance in advanced solid tumors.

High-throughput single-cell DNA sequencing of AML tumors with droplet microfluidics

10.1101/203158 ◽

2017 ◽

Cited By ~ 2

Author(s):

Maurizio Pellegrino ◽

Adam Sciambi ◽

Sebastian Treusch ◽

Robert Durruthy-Durruthy ◽

Kaustubh Gokhale ◽

...

Keyword(s):

Single Cell ◽

Clonal Evolution ◽

Droplet Microfluidics ◽

Next Generation Sequencing Data ◽

Sequencing Data ◽

Single Cell Sequencing ◽

Selection For ◽

Genetic Profiles ◽

Generation Sequencing

Abstract To enable the characterization of genetic heterogeneity in tumor cell populations, we developed a novel microfluidic approach that barcodes amplified genomic DNA from thousands of individual cancer cells confined to droplets. The barcodes are then used to reassemble the genetic profiles of cells from next generation sequencing data. Using this approach, we sequenced longitudinally collected AML tumor populations from two patients and genotyped up to 62 disease-relevant loci across more than 16,000 individual cells. Targeted single-cell sequencing was able to sensitively identify tumor cells during complete remission and uncovered complex clonal evolution within AML tumors that was not observable with bulk sequencing. We anticipate that this approach will make feasible the routine analysis of heterogeneity in AML, leading to improved stratification and therapy selection for the disease.

JuLI: accurate detection of DNA fusions in clinical sequencing for precision oncology

10.1101/521039 ◽

2019 ◽

Author(s):

Hyun-Tae Shin ◽

Nayoung K. D. Kim ◽

Jae Won Yun ◽

Boram Lee ◽

Sungkyu Kyung ◽

...

Keyword(s):

High Throughput Sequencing ◽

False Negative ◽

Detection Algorithm ◽

Clinical Samples ◽

Whole Genome Sequencing Data ◽

Precision Oncology ◽

Sequencing Data ◽

Clinical Sequencing ◽

Accurate Detection ◽

High Depth

Abstract Accurate detection of genomic fusions by high-throughput sequencing in clinical samples with inadequate tumor purity and formalin-fixed paraffin-embedded (FFPE) tissue is an essential task in precision oncology. We developed the fusion detection algorithm Junction Location Identifier (JuLI) for optimization of high-depth clinical sequencing. We implemented novel filtering steps to minimize false positives and a joint calling function to increase sensitivity in clinical settings. We comprehensively validated the algorithm using high-depth sequencing data from cancer cell lines and clinical samples and whole genome sequencing data from NA12878. We showed that JuLI outperformed state-of-the-art fusion callers in cases with high-depth clinical sequencing and rescued a driver fusion from a false negative call in plasma cell-free DNA. JuLI is freely available via GitHub (https://github.com/sgilab/JuLI).

Strengths and Biases of High-Throughput Sequencing Data in the Characterization of Freshwater Ciliate Microbiomes

Microbial Ecology ◽

10.1007/s00248-016-0912-8 ◽

2016 ◽

Vol 73 (4) ◽

pp. 865-875 ◽

Cited By ~ 6

Author(s):

Vittorio Boscaro ◽

Alessia Rossi ◽

Claudia Vannini ◽

Franco Verni ◽

Sergei I. Fokin ◽

...

Keyword(s):

High Throughput ◽

High Throughput Sequencing ◽

Sequencing Data ◽

High Throughput Sequencing Data

Characterization of DNA lesions associated with cell-free DNA by targeted deep sequencing

BMC Medical Genomics ◽

10.1186/s12920-021-01040-8 ◽

2021 ◽

Vol 14 (1) ◽

Author(s):

Seung-Ho Shin ◽

Woong-Yang Park ◽

Donghyun Park

Keyword(s):

DNA Damage ◽

Deep Sequencing ◽

Error Rates ◽

DNA Lesions ◽

Sequencing Data ◽

Substitution Error ◽

Cytosine Deamination ◽

Free DNA ◽

Targeted Deep Sequencing ◽

DNA Fragment

Abstract Background: Recently, a next-generation sequencing (NGS)-based method has been used for the successful detection of circulating tumor DNA (ctDNA) in various cancer types. Thus, the use of NGS on liquid biopsies will improve cancer diagnosis and prognosis. However, the low-allelic fraction of ctDNA poses a challenge for the sensitive and specific detection of tumor variants in cell-free DNA (cfDNA). To distinguish true variants from false positives, the characteristics of errors that occur during sample preparation and sequencing need to be elucidated. Methods: We generated capture-based targeted deep sequencing data from plasma cfDNA and peripheral blood leucocyte (PBL) gDNA to profile background errors. To reveal cfDNA-associated DNA lesions, background error profiles from two sample types were compared in each nucleotide substitution class. Results: In this study, we determined the prevalence of single nucleotide substitutions in cfDNA sequencing data to identify DNA damage preferentially associated with cfDNA. On comparing sequencing errors between cfDNA and cellular genomic DNA (gDNA), we observed that the total substitution error rates in cfDNA were significantly higher than those in gDNA. When the substitution errors were divided into 12 substitution error classes, C:G>T:A substitution errors constituted the largest difference between cfDNA and gDNA samples. When the substitution error rates were estimated based on the location of DNA-fragment substitutions, the differences in error rates of most substitution classes between cfDNA and gDNA samples were observed only at the ends of the DNA fragments. In contrast, C:G>T:A substitution errors in the cfDNA samples were not particularly associated with DNA-fragment ends. All observations were verified in an independent dataset. Conclusions: Our data suggested that cytosine deamination increased in cfDNA compared to that in cellular gDNA. Such an observation might be due to the attenuation of DNA damage repair before the release of cfDNA and/or the accumulation of cytosine deamination after it. These findings can contribute to a better understanding of cfDNA-associated DNA damage, which will enable the accurate analysis of somatic variants present in cfDNA at an extremely low frequency.

Characterization of the mitochondrial genome of Arge bella Wei & Du sp. nov. (Hymenoptera: Argidae)

PeerJ ◽

10.7717/peerj.6131 ◽

2018 ◽

Vol 6 ◽

pp. e6131 ◽

Cited By ~ 3

Author(s):

Shiyu Du ◽

Gengyun Niu ◽

Tommi Nyman ◽

Meicai Wei

Keyword(s):

Mitochondrial Genome ◽

High Throughput Sequencing ◽

Complete Mitochondrial Genome ◽

Nucleotide Composition ◽

Sequencing Data ◽

Protein Coding ◽

High Throughput Sequencing Data ◽

RNA Genes ◽

Ancestral Type

We describe Arge bella Wei & Du sp. nov., a large and beautiful species of Argidae from south China, and report its mitochondrial genome based on high-throughput sequencing data. We present the gene order, nucleotide composition of protein-coding genes (PCGs), and the secondary structures of RNA genes. The nearly complete mitochondrial genome of A. bella has a length of 15,576 bp and a typical set of 37 genes (22 tRNAs, 13 PCGs, and 2 rRNAs). Three tRNAs are rearranged in the A. bella mitochondrial genome as compared to the ancestral type in insects: trnM and trnQ are shuffled, while trnW is translocated from the trnW-trnC-trnY cluster to a location downstream of trnI. All PCGs are initiated by ATN codons, and terminated with TAA, TA or T as stop codons. All tRNAs have a typical cloverleaf secondary structure, except for trnS1. H821 of rrnS and H976 of rrnL are redundant. A phylogenetic analysis based on mitochondrial genome sequences of A. bella, 21 other symphytan species, two apocritan representatives, and four outgroup taxa supports the placement of Argidae as sister to the Pergidae within the symphytan superfamily Tenthredinoidea.
