Long fragments achieve lower base quality in Illumina paired-end sequencing

Mapping Intimacies ◽

10.1101/397158 ◽

2018 ◽

Author(s):

Ge Tan ◽

Lennart Opitz ◽

Ralph Schlapbach ◽

Hubert Rehrauer

Keyword(s):

Fragment Length ◽

Length Distribution ◽

Error Rates ◽

Average Error ◽

Library Preparation ◽

Sequencing Data ◽

Average Error Rate ◽

Lower Base ◽

Illumina Data ◽

Paired End Sequencing

AbstractIllumina’s technology provides high quality reads of DNA fragments with error rates below 1/1000 per base. Runs typically generate a millions of reads where the vast majority of the reads has also an average error rate below 1/1000. However, some paired-end sequencing data show the presence of a subpopulation of reads where the second read has lower average qualities. We show that the fragment length is a major driver of increased error rates in the R2 reads. Fragments above 500 nt tend to yield lower base qualities and higher error rates than shorter fragments. We demonstrate the fragment length dependency of the R2 read qualities using publicly available Illumina data generated by various library protocols, in different labs and using different sequencer models. Our finding extends the understanding of the Illumina read quality and has implications on error models for Illumina reads. It also sheds a light on the importance of the fragmentation during library preparation and the resulting fragment length distribution.

Download Full-text

Localization–compensation algorithm based on the Mean kShift and the Kalman filter

Modern Physics Letters B ◽

10.1142/s0217984915400205 ◽

2015 ◽

Vol 29 (06n07) ◽

pp. 1540020

Author(s):

Dong Myung Lee ◽

Tae Wan Kim ◽

Yun-Hae Kim

Keyword(s):

Kalman Filter ◽

Error Rate ◽

Mean Shift ◽

Mobility Model ◽

Mobile Node ◽

Error Rates ◽

Average Error ◽

Compensation Algorithm ◽

Average Error Rate ◽

The Mean

In this paper, we propose a localization simulator based on the random walk/waypoint mobility model and a hybrid-type location–compensation algorithm using the Mean kShift/Kalman filter (MSKF) to enhance the precision of the estimated location value of mobile modules. From an analysis of our experimental results, the proposed algorithm using the MSKF can better compensate for the error rates, the average error rate per estimated distance moved by the mobile node ( Err _ Rate DV ) and the error rate per estimated trace value of the mobile node ( Err _ Rate TV ) than the Mean shift or Kalman filter up to a maximum of 29% in a random mobility environment for the three scenarios.

Download Full-text

Automated Classification of Selected Data Elements from Free-text Diagnostic Reports for Clinical Research

Methods of Information in Medicine ◽

10.3414/me15-02-0019 ◽

2016 ◽

Vol 55 (04) ◽

pp. 373-380 ◽

Cited By ~ 10

Author(s):

Matthias Ganzinger ◽

Karsten Senghas ◽

Stefan Riezler ◽

Petra Knaup ◽

Martin Löpprich ◽

...

Keyword(s):

Multiple Myeloma ◽

Error Rate ◽

Multiclass Classification ◽

Error Rates ◽

University Hospital ◽

Average Error ◽

Free Text ◽

Average Error Rate ◽

Data Elements

SummaryObjectives: In the Multiple Myeloma clinical registry at Heidelberg University Hospital, most data are extracted from discharge letters. Our aim was to analyze if it is possible to make the manual documentation process more efficient by using methods of natural language processing for multiclass classification of free-text diagnostic reports to automatically document the diagnosis and state of disease of myeloma patients. The first objective was to create a corpus consisting of free-text diagnosis paragraphs of patients with multiple myeloma from German diagnostic reports, and its manual annotation of relevant data elements by documentation specialists. The second objective was to construct and evaluate a framework using different NLP methods to enable automatic multiclass classification of relevant data elements from free-text diagnostic reports.Methods: The main diagnoses paragraph was extracted from the clinical report of one third randomly selected patients of the multiple myeloma research database from Heidelberg University Hospital (in total 737 selected patients). An EDC system was setup and two data entry specialists performed independently a manual documentation of at least nine specific data elements for multiple myeloma characterization. Both data entries were compared and assessed by a third specialist and an annotated text corpus was created. A framework was constructed, consisting of a self-developed package to split multiple diagnosis sequences into several subsequences, four different preprocessing steps to normalize the input data and two classifiers: a maximum entropy classifier (MEC) and a support vector machine (SVM). In total 15 different pipelines were examined and assessed by a ten-fold cross-validation, reiterated 100 times. For quality indication the average error rate and the average F1-score were conducted. For significance testing the approximate randomization test was used.Results: The created annotated corpus consists of 737 different diagnoses paragraphs with a total number of 865 coded diagnosis. The dataset is publicly available in the supplementary online files for training and testing of further NLP methods. Both classifiers showed low average error rates (MEC: 1.05; SVM: 0.84) and high F1-scores (MEC: 0.89; SVM: 0.92). However the results varied widely depending on the classified data ele -ment. Preprocessing methods increased this effect and had significant impact on the classification, both positive and negative. The automatic diagnosis splitter increased the average error rate significantly, even if the F1-score decreased only slightly.Conclusions: The low average error rates and high average F1-scores of each pipeline demonstrate the suitability of the investigated NPL methods. However, it was also shown that there is no best practice for an automatic classification of data elements from free-text diagnostic reports.

Download Full-text

Massive influence of DNA isolation and library preparation approaches on palaeogenomic sequencing data

10.1101/075911 ◽

2016 ◽

Cited By ~ 8

Author(s):

Axel Barlow ◽

Gloria G. Fortes ◽

Love Dalén ◽

Ron Pinhasi ◽

Boris Gasparyan ◽

...

Keyword(s):

Sample Preparation ◽

Dna Isolation ◽

Length Distribution ◽

Gc Content ◽

Nucleotide Composition ◽

Library Preparation ◽

Sequencing Data ◽

Dna Yield ◽

Specific Effects ◽

Laboratory Procedures

ABSTRACTThe ability to access genomic information from ancient samples has provided many important biological insights. Generating such palaeogenomic data requires specialised methodologies, and a variety of procedures for all stages of sample preparation have been proposed. However, the specific effects and biases introduced by alternative laboratory procedures is insufficiently understood. Here, we investigate the effects of three DNA isolation and two library preparation protocols on palaeogenomic data obtained from four Pleistocene subfossil bones. We find that alternative methodologies can significantly and substantially affect total DNA yield, the mean length and length distribution of recovered fragments, nucleotide composition, and the total amount of usable data generated. Furthermore, we also detect significant interaction effects between these stages of sample preparation on many of these factors. Effects and biases introduced in the laboratory can be sufficient to confound estimates of DNA degradation, sample authenticity and genomic GC content, and likely also estimates of genetic diversity and population structure. Future palaeogenomic studies need to carefully consider the effects of laboratory procedures during both experimental design and data analysis, particularly when studies involve multiple datasets generated using a mixture of methodologies.

Download Full-text

Real Randomized Benchmarking

Quantum ◽

10.22331/q-2018-08-22-85 ◽

2018 ◽

Vol 2 ◽

pp. 85 ◽

Cited By ~ 11

Author(s):

A. K. Hashagen ◽

S. T. Flammia ◽

D. Gross ◽

J. J. Wallman

Keyword(s):

Error Rate ◽

Fault Tolerant ◽

Error Rates ◽

Average Error ◽

Separate Determination ◽

The Real ◽

Fine Grained ◽

Average Error Rate ◽

Clifford Group ◽

Group A

Randomized benchmarking provides a tool for obtaining precise quantitative estimates of the average error rate of a physical quantum channel. Here we define real randomized benchmarking, which enables a separate determination of the average error rate in the real and complex parts of the channel. This provides more fine-grained information about average error rates with approximately the same cost as the standard protocol. The protocol requires only averaging over the real Clifford group, a subgroup of the full complex Clifford group, and makes use of the fact that it forms an orthogonal 2-design. It therefore allows benchmarking of fault-tolerant gates for an encoding which does not contain the full Clifford group transversally. Furthermore, our results are especially useful when considering quantum computations on rebits (or real encodings of complex computations), in which case the real Clifford group now plays the role of the complex Clifford group when studying stabilizer circuits.

Download Full-text

Development of a User-Friendly Pipeline for Mutational Analyses of HIV Using Ultra-Accurate Maximum-Depth Sequencing

Viruses ◽

10.3390/v13071338 ◽

2021 ◽

Vol 13 (7) ◽

pp. 1338

Author(s):

Morgan E. Meissner ◽

Emily J. Julik ◽

Jonathan P. Badalamenti ◽

William G. Arndt ◽

Lauren J. Mills ◽

...

Keyword(s):

Error Rates ◽

Maximum Depth ◽

Sequencing Data ◽

Background Error ◽

High Background ◽

Immunodeficiency Virus ◽

User Friendly ◽

Viral Mutagenesis ◽

Hiv 1

Human immunodeficiency virus type 2 (HIV-2) accumulates fewer mutations during replication than HIV type 1 (HIV-1). Advanced studies of HIV-2 mutagenesis, however, have historically been confounded by high background error rates in traditional next-generation sequencing techniques. In this study, we describe the adaptation of the previously described maximum-depth sequencing (MDS) technique to studies of both HIV-1 and HIV-2 for the ultra-accurate characterization of viral mutagenesis. We also present the development of a user-friendly Galaxy workflow for the bioinformatic analyses of sequencing data generated using the MDS technique, designed to improve replicability and accessibility to molecular virologists. This adapted MDS technique and analysis pipeline were validated by comparisons with previously published analyses of the frequency and spectra of mutations in HIV-1 and HIV-2 and is readily expandable to studies of viral mutation across the genomes of both viruses. Using this novel sequencing pipeline, we observed that the background error rate was reduced 100-fold over standard Illumina error rates, and 10-fold over traditional unique molecular identifier (UMI)-based sequencing. This technical advancement will allow for the exploration of novel and previously unrecognized sources of viral mutagenesis in both HIV-1 and HIV-2, which will expand our understanding of retroviral diversity and evolution.

Download Full-text

Estimating sequencing error rates using families

BioData Mining ◽

10.1186/s13040-021-00259-6 ◽

2021 ◽

Vol 14 (1) ◽

Author(s):

Kelley Paskov ◽

Jae-Yoon Jung ◽

Brianna Chrisman ◽

Nate T. Stockham ◽

Peter Washington ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Exome Sequencing ◽

Genome Sequencing ◽

Variant Calling ◽

Error Rates ◽

Sequencing Error ◽

Whole Genome ◽

Sequencing Data ◽

Sequencing Platform ◽

Whole Exome

Abstract Background As next-generation sequencing technologies make their way into the clinic, knowledge of their error rates is essential if they are to be used to guide patient care. However, sequencing platforms and variant-calling pipelines are continuously evolving, making it difficult to accurately quantify error rates for the particular combination of assay and software parameters used on each sample. Family data provide a unique opportunity for estimating sequencing error rates since it allows us to observe a fraction of sequencing errors as Mendelian errors in the family, which we can then use to produce genome-wide error estimates for each sample. Results We introduce a method that uses Mendelian errors in sequencing data to make highly granular per-sample estimates of precision and recall for any set of variant calls, regardless of sequencing platform or calling methodology. We validate the accuracy of our estimates using monozygotic twins, and we use a set of monozygotic quadruplets to show that our predictions closely match the consensus method. We demonstrate our method’s versatility by estimating sequencing error rates for whole genome sequencing, whole exome sequencing, and microarray datasets, and we highlight its sensitivity by quantifying performance increases between different versions of the GATK variant-calling pipeline. We then use our method to demonstrate that: 1) Sequencing error rates between samples in the same dataset can vary by over an order of magnitude. 2) Variant calling performance decreases substantially in low-complexity regions of the genome. 3) Variant calling performance in whole exome sequencing data decreases with distance from the nearest target region. 4) Variant calls from lymphoblastoid cell lines can be as accurate as those from whole blood. 5) Whole-genome sequencing can attain microarray-level precision and recall at disease-associated SNV sites. Conclusion Genotype datasets from families are powerful resources that can be used to make fine-grained estimates of sequencing error for any sequencing platform and variant-calling methodology.

Download Full-text

Image Processing Based on Three-Dimensional Stereo Vision

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.556-562.5017 ◽

2014 ◽

Vol 556-562 ◽

pp. 5017-5020

Author(s):

Ting Ting Wang

Keyword(s):

Image Processing ◽

Binocular Vision ◽

Stereo Vision ◽

Error Rate ◽

Three Dimensional ◽

Processing Method ◽

Stereoscopic Vision ◽

Average Error ◽

Traditional Model ◽

Average Error Rate

Three-dimensional stereo vision technology has the capability of overcoming drawbacks influencing by light, posture and occluder. A novel image processing method is proposed based on three-dimensional stereoscopic vision, which optimizes model on the basis of camera binocular vision and in improvement of adding constraints to traditional model, moreover ensures accuracy of later location and recognition. To verify validity of the proposed method, firstly marking experiments are conducted to achieve fruit location, with the result of average error rate of 0.65%; and then centroid feature experiments are achieved with error from 5.77mm to 68.15mm and reference error rate from 1.44% to 5.68%, average error rate of 3.76% while the distance changes from 300mm to 1200mm. All these data of experiments demonstrate that proposed method meets the requirements of three-dimensional imageprocessing.

Download Full-text

REscan: inferring repeat expansions and structural variation in paired-end short read sequencing data

Bioinformatics ◽

10.1093/bioinformatics/btaa753 ◽

2020 ◽

Author(s):

Russell Lewis McLaughlin

Keyword(s):

Structural Variation ◽

Sequence Data ◽

Neurological Diseases ◽

Repeat Expansion ◽

Sequencing Data ◽

Short Read ◽

Short Read Sequencing ◽

Repeat Expansions ◽

Paired End Sequencing

Abstract Motivation Repeat expansions are an important class of genetic variation in neurological diseases. However, the identification of novel repeat expansions using conventional sequencing methods is a challenge due to their typical lengths relative to short sequence reads and difficulty in producing accurate and unique alignments for repetitive sequence. However, this latter property can be harnessed in paired-end sequencing data to infer the possible locations of repeat expansions and other structural variation. Results This article presents REscan, a command-line utility that infers repeat expansion loci from paired-end short read sequencing data by reporting the proportion of reads orientated towards a locus that do not have an adequately mapped mate. A high REscan statistic relative to a population of data suggests a repeat expansion locus for experimental follow-up. This approach is validated using genome sequence data for 259 cases of amyotrophic lateral sclerosis, of which 24 are positive for a large repeat expansion in C9orf72, showing that REscan statistics readily discriminate repeat expansion carriers from non-carriers. Availabilityand implementation C source code at https://github.com/rlmcl/rescan (GNU General Public Licence v3).

Download Full-text

Maximum entropy and average error rates in digital communication systems

Proceedings of COMSIG '94 - 1994 South African Symposium on Communications and Signal Processing ◽

10.1109/comsig.1994.512353 ◽

2002 ◽

Cited By ~ 1

Author(s):

F. Solms ◽

P. van Rooyen ◽

J. Kunicki

Keyword(s):

Maximum Entropy ◽

Communication Systems ◽

Digital Communication ◽

Error Rates ◽

Average Error ◽

Digital Communication Systems

Download Full-text

Unraveling city-specific signature and identifying sample origin locations for the data from CAMDA MetaSUB challenge

Biology Direct ◽

10.1186/s13062-020-00284-1 ◽

2021 ◽

Vol 16 (1) ◽

Author(s):

Runzhi Zhang ◽

Alejandro R. Walker ◽

Susmita Datta

Keyword(s):

Machine Learning ◽

Error Rates ◽

Average Error ◽

Importance Score ◽

Learning Methods ◽

Principal Coordinates Analysis ◽

Principal Coordinates ◽

Total Variability ◽

The Common ◽

Feature Selecting

Abstract Background Composition of microbial communities can be location-specific, and the different abundance of taxon within location could help us to unravel city-specific signature and predict the sample origin locations accurately. In this study, the whole genome shotgun (WGS) metagenomics data from samples across 16 cities around the world and samples from another 8 cities were provided as the main and mystery datasets respectively as the part of the CAMDA 2019 MetaSUB “Forensic Challenge”. The feature selecting, normalization, three methods of machine learning, PCoA (Principal Coordinates Analysis) and ANCOM (Analysis of composition of microbiomes) were conducted for both the main and mystery datasets. Results Features selecting, combined with the machines learning methods, revealed that the combination of the common features was effective for predicting the origin of the samples. The average error rates of 11.93 and 30.37% of three machine learning methods were obtained for main and mystery datasets respectively. Using the samples from main dataset to predict the labels of samples from mystery dataset, nearly 89.98% of the test samples could be correctly labeled as “mystery” samples. PCoA showed that nearly 60% of the total variability of the data could be explained by the first two PCoA axes. Although many cities overlapped, the separation of some cities was found in PCoA. The results of ANCOM, combined with importance score from the Random Forest, indicated that the common “family”, “order” of the main-dataset and the common “order” of the mystery dataset provided the most efficient information for prediction respectively. Conclusions The results of the classification suggested that the composition of the microbiomes was distinctive across the cities, which could be used to identify the sample origins. This was also supported by the results from ANCOM and importance score from the RF. In addition, the accuracy of the prediction could be improved by more samples and better sequencing depth.

Download Full-text