Measurement error and variant-calling in deep Illumina sequencing of HIV

2018
Author(s): Mark Howison, Mia Coetzer, Rami Kantor

ABSTRACT
Motivation: Next-generation deep sequencing of viral genomes, particularly on the Illumina platform, is increasingly applied in HIV research. Yet, there is no standard protocol or method used by the research community to account for measurement errors that arise during sample preparation and sequencing. Correctly calling high- and low-frequency variants while controlling for erroneous variant calls is an important precursor to downstream interpretation, such as studying the emergence of HIV drug-resistance mutations, which in turn has clinical applications and can improve patient care.
Results: We developed a new variant-calling pipeline, hivmmer, for Illumina sequences from HIV viral genomes. First, we validated hivmmer by comparing it to other variant-calling pipelines on real HIV plasmid data sets, which have known sequences. We found that hivmmer achieves a lower rate of erroneous variant calls, and that all methods agree on the frequency of correctly called variants. Next, we compared the methods on an HIV plasmid data set that was sequenced using an amplicon-tagging protocol called Primer ID, which is designed to reduce errors and amplification bias during library preparation. We show that the Primer ID consensus does indeed have fewer erroneous variant calls compared to the variant-calling pipelines, and that hivmmer more closely approaches this low error rate than the other pipelines. Surprisingly, the frequency estimates from the Primer ID consensus do not differ significantly from those of the variant-calling pipelines. Finally, we built a predictive model for classifying errors in the hivmmer alignment, and show that it achieves high accuracy for identifying erroneous variant calls.
Availability: hivmmer is freely available for non-commercial use from https://github.com/mhowison/hivmmer.
Contact: [email protected]
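As a toy illustration of the frequency-thresholding step such pipelines perform, the sketch below calls minority variants at one alignment position from observed amino-acid counts once their frequency clears an assumed error cutoff. The function name and the 0.5% threshold are illustrative assumptions, not hivmmer's actual error model:

```python
from collections import Counter

def call_variants(counts, error_rate=0.005):
    """Call minority variants at one alignment position from
    observed amino-acid counts, keeping only non-consensus
    residues whose frequency exceeds an assumed error cutoff
    (hypothetical threshold, not hivmmer's fitted model)."""
    total = sum(counts.values())
    consensus = max(counts, key=counts.get)
    calls = {}
    for aa, n in counts.items():
        freq = n / total
        if aa != consensus and freq > error_rate:
            calls[aa] = freq
    return consensus, calls

# Example: 10,000 reads covering one codon position
position = Counter({"K": 9500, "R": 400, "T": 60, "E": 40})
cons, variants = call_variants(position)
# "E" at 0.4% falls below the cutoff and is treated as noise.
```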

1992, Vol 14 (1), pp. 25-27
Author(s): J. S. O. Odonde

Experimental data always contain measurement errors (or noise, in signal-processing terms). This paper is concerned with the removal of outliers from a data set consisting of only a handful of points. The data set has a unimodal probability distribution function, so the mode is a reliable estimate of central tendency. The approach is nonparametric; for the data set (xi, yi) only the ordinates (yi) are used, and the abscissae (xi) are reparametrized to the index i = 1, ..., N. The data are bounded using a calculated mode and a new measure, the mean absolute deviation from the mode, which does not appear to have been reported before. The mean is then removed and low-frequency filtering is performed in the frequency domain, after which the mean is reintroduced.
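A minimal sketch of the mode-plus-deviation idea, assuming a shorth-style mode estimator and a cutoff factor k that are illustrative choices, not necessarily the paper's exact calculation:

```python
def mode_estimate(y):
    """Shorth-style mode estimate: midpoint of the shortest
    interval containing half of the sorted points (one of several
    possible mode estimators; the paper's may differ)."""
    ys = sorted(y)
    n = len(ys)
    h = n // 2 + 1
    best = min(range(n - h + 1), key=lambda i: ys[i + h - 1] - ys[i])
    return (ys[best] + ys[best + h - 1]) / 2

def bound_outliers(y, k=2.0):
    """Keep points whose distance from the mode is within k times
    the mean absolute deviation from the mode (k is an assumed
    cutoff for illustration)."""
    m = mode_estimate(y)
    mad_mode = sum(abs(v - m) for v in y) / len(y)
    return [v for v in y if abs(v - m) <= k * mad_mode]

data = [1.0, 1.1, 0.9, 1.05, 0.95, 9.0]  # one gross outlier
filtered = bound_outliers(data)          # drops the 9.0 point
```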


PeerJ, 2020, Vol 8, pp. e10121
Author(s): Idowu B. Olawoye, Simon D.W. Frost, Christian T. Happi

Next-generation sequencing technologies are becoming more accessible and affordable, with entire genome sequences of several pathogens now deciphered in a few hours. However, there is a need to analyze multiple genomes within a short time in order to provide critical information about a pathogen of interest, such as drug resistance, mutations, and the genetic relationships of isolates in an outbreak setting. Many pipelines that currently do this are stand-alone workflows with heavy computational requirements for analyzing multiple genomes. We present an automated and scalable pipeline for monomorphic bacteria, BAGEP, which performs quality control on paired-end FASTQ files, scans reads for contaminants using a taxonomic classifier, maps reads to a reference genome of choice for variant detection, detects antimicrobial resistance (AMR) genes, constructs a phylogenetic tree from core genome alignments, and provides interactive single nucleotide polymorphism (SNP) visualization across core genomes in the data set. The objective of our research was to create an easy-to-use pipeline from existing bioinformatics tools that can be deployed on a personal computer. The pipeline was built on the Snakemake framework and utilizes existing tools for each processing step: fastp for quality trimming, snippy for variant calling, Centrifuge for taxonomic classification, Abricate for AMR gene detection, snippy-core for generating whole and core genome alignments, IQ-TREE for phylogenetic tree construction, and vcfR for an interactive heatmap visualization showing SNPs at specific locations across the genomes. BAGEP was successfully tested and validated with Mycobacterium tuberculosis (n = 20) and Salmonella enterica serovar Typhi (n = 20) genomes, which are about 4.4 million and 4.8 million base pairs, respectively. Running these test data on an 8 GB RAM, 2.5 GHz quad-core laptop took 122 and 61 minutes, respectively, to complete the analysis.
BAGEP is a fast, accurate (in SNP calling), easy-to-run pipeline that can be executed on a mid-range laptop; it is freely available at: https://github.com/idolawoye/BAGEP.
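The tool chain described above can be sketched as an ordered job plan, with per-sample steps fanning out before the cohort-level steps, mirroring how a Snakemake DAG would schedule them. This is an illustrative sketch of the ordering only, not BAGEP's actual Snakefile:

```python
# Ordered tool chain as described in the abstract, one entry per step
PIPELINE = [
    ("fastp", "quality trimming of paired-end FASTQ"),
    ("centrifuge", "taxonomic classification / contaminant scan"),
    ("snippy", "reference mapping and variant calling"),
    ("abricate", "AMR gene detection"),
    ("snippy-core", "whole- and core-genome alignment"),
    ("iqtree", "phylogenetic tree from the core alignment"),
    ("vcfR", "interactive SNP heatmap (R package)"),
]

def plan(samples):
    """Return per-sample jobs followed by cohort-level jobs,
    in dependency order (a sketch of the DAG, not the real
    workflow definition)."""
    per_sample = PIPELINE[:4]   # run once per genome
    cohort = PIPELINE[4:]       # run once over all genomes
    jobs = [(s, tool) for s in samples for tool, _ in per_sample]
    jobs += [("cohort", tool) for tool, _ in cohort]
    return jobs

jobs = plan(["S1", "S2"])  # two hypothetical sample names
```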


2015
Author(s): Michael Lawrence, Melanie A Huntley, Eric Stawiski, Art Owen, Thomas D Wu, ...

The accurate identification of low-frequency variants in tumors remains an unsolved problem. To support characterization of the issues in a realistic setting, we have developed software tools and a reference dataset for diagnosing variant-calling pipelines. The dataset contains millions of variants at frequencies ranging from 0.05 to 1.0. To generate the dataset, we performed whole-genome sequencing of a mixture of two Coriell cell lines, NA19240 and NA12878, the mothers of the YRI (Y) and CEU (C) HapMap trios, respectively. The cells were mixed in three different proportions, 10Y/90C, 50Y/50C and 90Y/10C, in an effort to simulate the heterogeneity found in tumor samples. We sequenced three biological replicates of each mixture, yielding approximately 1.4 billion reads per mixture for an average of 64X coverage. Using the published genotypes as our reference, we evaluate the performance of a general variant-calling algorithm, constructed as a demonstration of our flexible toolset, and make comparisons to a standard GATK pipeline. We estimate the overall FDR to be 0.028 and the FNR (where coverage exceeds 20X) to be 0.019 in the 50Y/50C mixture. Interestingly, even with these relatively well-studied individuals, we predict over 475,000 new variants, validating in well-behaved coding regions at a rate of 0.97, that were not included in the published genotypes.
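The frequency range of the mixture design follows directly from the genotypes: a site heterozygous in one line and homozygous-reference in the other yields the low end of the spectrum. A small sketch of the expected allele-frequency arithmetic (the dosage encoding is a standard convention, not code from the paper's toolset):

```python
def expected_vaf(mix_y, gt_y, gt_c):
    """Expected ALT allele frequency in a Y/C cell-line mixture,
    given each line's genotype as ALT allele dosage (0, 1, or 2)
    and the fraction mix_y of Y cells in the mixture."""
    return (mix_y * gt_y + (1 - mix_y) * gt_c) / 2

# Site where Y (NA19240) is heterozygous and C (NA12878) is
# homozygous reference, across the three mixing proportions:
lows = [expected_vaf(m, 1, 0) for m in (0.10, 0.50, 0.90)]
# A site homozygous ALT in both lines stays at frequency 1.0
# regardless of the mixture, giving the top of the range.
top = expected_vaf(0.10, 2, 2)
```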


Genes, 2021, Vol 12 (4), pp. 507
Author(s): Bernd Timo Hermann, Sebastian Pfeil, Nicole Groenke, Samuel Schaible, Robert Kunze, ...

Detection of genetic variants in clinically relevant genomic hot-spot regions has become a promising application of next-generation sequencing technology in precision oncology. Effective personalized diagnostics requires the detection of variants with often very low frequencies, which can be achieved by targeted, short-read sequencing that provides high sequencing depths. However, while rare genetic variants can contain crucial information for early cancer detection and subsequent treatment success, an inevitable level of background noise usually limits the accuracy of low-frequency variant calling assays. To address this challenge, we developed DEEPGEN™, a variant calling assay intended for the detection of low-frequency variants within liquid biopsy samples. We processed reference samples with validated mutations of known frequencies (0%–0.5%) to determine DEEPGEN™'s performance and minimal input requirements. Our findings confirm DEEPGEN™'s effectiveness in discriminating between signal and noise down to 0.09% variant allele frequency, with an LOD(90) at 0.18%. Superior sensitivity was also confirmed by orthogonal comparison to a commercially available liquid-biopsy-based assay for cancer detection.
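An LOD(90) is the lowest allele frequency detected with at least 90% probability. The toy model below illustrates the concept with pure binomial read sampling at a fixed depth and a minimum-supporting-reads rule; the depth, read cutoff, and frequency grid are assumptions for illustration, not DEEPGEN's validation procedure (which must also model background noise):

```python
from math import comb

def p_detect(vaf, depth, min_reads):
    """Probability that a variant at allele frequency `vaf`
    yields at least `min_reads` supporting reads at this depth
    (binomial sampling only; sequencing error ignored)."""
    p_below = sum(comb(depth, k) * vaf**k * (1 - vaf)**(depth - k)
                  for k in range(min_reads))
    return 1 - p_below

def lod90(depth, min_reads):
    """Smallest allele frequency on a 0.01%..1% grid detected
    with >= 90% probability under the toy model above."""
    for i in range(1, 101):
        f = i / 10000
        if p_detect(f, depth, min_reads) >= 0.90:
            return f
    return None

# With 10,000x depth and a 5-read cutoff, detection power rises
# steeply with allele frequency:
f90 = lod90(10000, 5)
```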


2021, Vol 11 (1)
Author(s): Fatemeh Amini, Felipe Restrepo Franco, Guiping Hu, Lizhi Wang

Abstract
Recent advances in genomic selection (GS) have demonstrated the importance of not only the accuracy of genomic prediction but also the intelligence of selection strategies. The look-ahead selection algorithm, for example, has been found to significantly outperform the widely used truncation selection approach in terms of genetic gain, thanks to its strategy of selecting breeding parents that may not necessarily be elite themselves but have the best chance of producing elite progeny in the future. This paper presents the look-ahead trace-back algorithm as a new variant of the look-ahead approach, which introduces several improvements to further accelerate genetic gain, especially under imperfect genomic prediction. Perhaps an even more significant contribution of this paper is the design of opaque simulators for evaluating the performance of GS algorithms. These simulators are partially observable, explicitly capture both additive and non-additive genetic effects, and simulate uncertain recombination events more realistically. In contrast, most existing GS simulation settings are transparent, either explicitly or implicitly allowing the GS algorithm to exploit critical information that would not be available in actual breeding programs. Comprehensive computational experiments were carried out using a maize data set to compare a variety of GS algorithms under four simulators with different levels of opacity. The results reveal how differently the same GS algorithm can interact with different simulators, suggesting the need for continued research in the design of more realistic simulators. As long as GS algorithms continue to be trained in silico rather than in planta, the best way to avoid a disappointing discrepancy between their simulated and actual performances may be to make the simulator as akin as possible to the complex and opaque nature of real breeding programs.


2021, Vol 3 (1)
Author(s): Gundula Povysil, Monika Heinzl, Renato Salazar, Nicholas Stoler, Anton Nekrutenko, ...

Abstract
Duplex sequencing is currently the most reliable method for identifying ultra-low-frequency DNA variants; it groups sequence reads derived from the same DNA molecule into families carrying information on the forward and reverse strands. However, only a small proportion of reads are assembled into duplex consensus sequences (DCS), and reads with potentially valuable information are discarded at different steps of the bioinformatics pipeline, especially reads without a family. We developed a bioinformatics toolset that analyses the tag and family composition with the purpose of understanding data loss and implementing modifications to maximize the data output for variant calling. Specifically, our tools show that tags contain polymerase chain reaction and sequencing errors that contribute to data loss and lower DCS yields. Our tools also identified chimeras, which likely reflect barcode collisions. Finally, we developed a tool that re-examines variant calls from raw reads and provides summary data that categorizes the confidence level of a variant call by a tier-based system. With this tool, we can include reads without a family and check the reliability of the call, which substantially increases the sequencing depth for variant calling, a particularly important advantage for low-input samples or low-coverage regions.
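The family-grouping step that loses unpaired reads can be sketched in a few lines: reads sharing a tag and strand form a single-strand family, and only tags with agreeing consensuses on both strands yield a duplex consensus. This is a simplified illustration of the idea, with made-up tags and a majority-vote consensus, not the actual DCS pipeline:

```python
from collections import defaultdict, Counter

def duplex_consensus(reads, min_family=3):
    """Group reads by (tag, strand), take a per-position majority
    consensus for each single-strand family, then keep tags whose
    'ab' and 'ba' strand consensuses agree (simplified sketch)."""
    families = defaultdict(list)
    for tag, strand, seq in reads:
        families[(tag, strand)].append(seq)

    def consensus(seqs):
        # Majority base at each position across the family
        return "".join(Counter(col).most_common(1)[0][0]
                       for col in zip(*seqs))

    sscs = {key: consensus(seqs) for key, seqs in families.items()
            if len(seqs) >= min_family}  # small families are discarded

    dcs = {}
    for (tag, strand), seq in sscs.items():
        if strand == "ab" and sscs.get((tag, "ba")) == seq:
            dcs[tag] = seq
    return dcs

reads = (
    [("AACG", "ab", "ACGT")] * 2 + [("AACG", "ab", "ACTT")] +
    [("AACG", "ba", "ACGT")] * 3 +
    [("GGTA", "ab", "TTAA")]   # family below min_family: data loss
)
result = duplex_consensus(reads)
```

Note how the lone "GGTA" read is silently dropped; recovering information from such reads is exactly the data loss the toolset above quantifies.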


2010, Vol 55 (3), pp. 1114-1119
Author(s): Jia Liu, Michael D. Miller, Robert M. Danovich, Nathan Vandergrift, Fangping Cai, ...

ABSTRACT
Raltegravir (RAL) is highly efficacious in the treatment of HIV-1 infection. The prevalence and impact on virologic outcome of low-frequency resistant mutations among HIV-1-infected patients not previously treated with raltegravir have not been fully established. Samples from HIV treatment-experienced patients entering a clinical trial of raltegravir treatment were analyzed using a parallel allele-specific sequencing (PASS) assay that assessed six primary and six secondary integrase mutations. Patients who achieved and sustained virologic suppression (success patients, n = 36) and those who experienced virologic rebound (failure patients, n = 35) were compared. Patients who experienced treatment failure had twice as many raltegravir-associated resistance mutations prior to initiating treatment as those who achieved sustained virologic success, but the difference was not statistically significant. The frequency of nearly all detected resistance mutations was less than 1% of the viral population, and the frequencies of mutations were similar between the success and failure groups. Expansion of pre-existing mutations (one primary and five secondary) was observed in 16 treatment-failure patients in whom minority resistant mutations were detected at baseline, suggesting that these mutations might play a role in the development of drug resistance. Two or more mutations were found in 13 patients (18.3%), but multiple mutations were not present in any single viral genome by linkage analysis. Our study demonstrates that low-frequency primary RAL-resistant mutations were uncommon, while minority secondary RAL-resistant mutations were more frequently detected in patients naïve to raltegravir. Additional studies in larger populations are warranted to fully understand the clinical implications of these mutations.


2016, Vol 311 (3), pp. F539-F547
Author(s): Minhtri K. Nguyen, Dai-Scott Nguyen, Minh-Kevin Nguyen

Because changes in the plasma water sodium concentration ([Na+]pw) are clinically due to changes in the mass balance of Na+, K+, and H2O, the analysis and treatment of the dysnatremias depend on the validity of the Edelman equation in defining the quantitative interrelationship between the [Na+]pw and the total exchangeable sodium (Nae), total exchangeable potassium (Ke), and total body water (TBW) (Edelman IS, Leibman J, O'Meara MP, Birkenfeld LW. J Clin Invest 37: 1236–1256, 1958): [Na+]pw = 1.11(Nae + Ke)/TBW − 25.6. The interrelationship between [Na+]pw and Nae, Ke, and TBW in the Edelman equation was empirically determined by accounting for measurement errors in all of these variables. In contrast, linear regression analysis of the same data set using [Na+]pw as the dependent variable yields the following equation: [Na+]pw = 0.93(Nae + Ke)/TBW + 1.37. Moreover, based on the study by Boling et al. (Boling EA, Lipkind JB. 18: 943–949, 1963), the [Na+]pw is related to the Nae, Ke, and TBW by the following linear regression equation: [Na+]pw = 0.487(Nae + Ke)/TBW + 71.54. The reasons for the disparities between the slopes and y-intercepts of these three equations have been unclear. In this mathematical analysis, we demonstrate that the disparities between the slopes and y-intercepts of these three equations can be explained by how the osmotically inactive Na+ and K+ storage pool is quantitatively accounted for. Our analysis also indicates that the osmotically inactive Na+ and K+ storage pool is dynamically regulated, and that changes in the [Na+]pw can be predicted from changes in the Nae, Ke, and TBW despite dynamic changes in the osmotically inactive Na+ and K+ storage pool.
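The three fits can be compared numerically. The sketch below evaluates each regression at an illustrative (Nae + Ke)/TBW ratio of 150 mmol/L (the input values 4500 mmol, 1800 mmol, and 42 L are assumed round numbers, not data from the paper); near this physiological ratio the Edelman and same-data OLS fits give nearly identical predictions despite their very different slopes and intercepts:

```python
def na_pw(na_e, k_e, tbw, slope, intercept):
    """Predicted plasma-water [Na+] (mmol/L) from exchangeable
    Na+ and K+ (mmol) and total body water (L), for one of the
    linear fits quoted in the abstract."""
    return slope * (na_e + k_e) / tbw + intercept

FITS = {
    "Edelman":   (1.11, -25.6),
    "OLS refit": (0.93, 1.37),
    "Boling":    (0.487, 71.54),
}

# (Nae + Ke)/TBW = (4500 + 1800) / 42 = 150 mmol/L (illustrative)
predictions = {name: round(na_pw(4500, 1800, 42, m, b), 1)
               for name, (m, b) in FITS.items()}
```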


2018
Author(s): Daniel P Cooke, David C Wedge, Gerton Lunter

Haplotype-based variant callers, which consider physical linkage between variant sites, are currently among the best tools for germline variation discovery and genotyping from short-read sequencing data. However, almost all such tools were designed specifically for detecting common germline variation in diploid populations, and give sub-optimal results in other scenarios. Here we present Octopus, a versatile haplotype-based variant caller that uses a polymorphic Bayesian genotyping model capable of modeling sequencing data from a range of experimental designs within a unified haplotype-aware framework. We show that Octopus accurately calls de novo mutations in parent-offspring trios and germline variants in individuals, including SNVs, indels, and small complex replacements such as microinversions. In addition, using a carefully designed synthetic-tumour data set derived from clean sequencing data from a sample with known germline haplotypes, together with mutations observed in a large cohort of tumour samples, we show that Octopus accurately characterizes germline and somatic variation in tumours, both with and without a paired normal sample. Sequencing reads and prior information are combined to phase called genotypes of arbitrary ploidy, including those with somatic mutations. Octopus also outputs realigned evidence BAMs to aid validation and interpretation.


2021, Vol 4
Author(s): Stefano Markidis

Physics-informed neural networks (PINNs) are neural networks that encode the governing equations of a problem, such as partial differential equations (PDEs), as part of the network itself. PINNs have emerged as an essential new tool for solving various challenging problems, including computing linear systems arising from PDEs, a task for which several traditional methods exist. In this work, we first evaluate the potential of PINNs as linear solvers in the case of the Poisson equation, an omnipresent equation in scientific computing. We characterize PINN linear solvers in terms of accuracy and performance under different network configurations (depth, activation functions, input data set distribution), and we highlight the critical role of transfer learning. Our results show that low-frequency components of the solution converge quickly, an effect of the F-principle, whereas an accurate solution of the high frequencies requires an exceedingly long time. To address this limitation, we propose integrating PINNs into traditional linear solvers. We show that this integration leads to new solvers that are on par in accuracy and performance with other high-performance solvers, such as the PETSc conjugate gradient linear solver. Overall, while accuracy and computational performance still limit the direct use of PINN linear solvers, hybrid strategies combining traditional linear solver approaches with emerging deep-learning techniques are among the most promising paths toward a new class of linear solvers.
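The frequency bias described above mirrors, in reverse, the behavior of classical stationary smoothers: weighted Jacobi damps high-frequency error fast and low-frequency error slowly, which is why pairing the two families of methods is attractive. A minimal 1D Poisson toy experiment (assumed setup, not from the paper) shows the smoother's side of that complementarity:

```python
import math

def jacobi_sweep(u, omega=2/3):
    """One weighted-Jacobi sweep for the 1D Poisson operator with
    zero right-hand side and Dirichlet boundaries, so `u` is pure
    error and each sweep shows how fast that error is damped."""
    n = len(u)
    new = [0.0] * n
    for i in range(n):
        left = u[i - 1] if i > 0 else 0.0
        right = u[i + 1] if i < n - 1 else 0.0
        new[i] = (1 - omega) * u[i] + omega * (left + right) / 2
    return new

def norm(u):
    return math.sqrt(sum(v * v for v in u))

n = 63
low = [math.sin(math.pi * (i + 1) / (n + 1)) for i in range(n)]        # mode k = 1
high = [math.sin(63 * math.pi * (i + 1) / (n + 1)) for i in range(n)]  # mode k = 63
for _ in range(50):
    low, high = jacobi_sweep(low), jacobi_sweep(high)
# After 50 sweeps the k = 63 error is essentially gone, while the
# k = 1 error has barely decayed: the inverse of the F-principle
# behavior reported for PINNs, and the basis for hybrid solvers.
```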

