Do Read Errors Matter for Genome Assembly?

2015 ◽  
Author(s):  
Ilan Shomorony ◽  
Thomas Courtade ◽  
David Tse

Abstract: While most current high-throughput DNA sequencing technologies generate short reads with low error rates, emerging sequencing technologies generate long reads with high error rates. A basic question of interest is the tradeoff between read length and error rate in terms of the information needed for the perfect assembly of the genome. Using an adversarial erasure error model, we make progress on this problem by establishing a critical read length, as a function of the genome and the error rate, above which perfect assembly is guaranteed. For several real genomes, including those from the GAGE dataset, we verify that this critical read length is not significantly greater than the read length required for perfect assembly from reads without errors.

2017 ◽  
Vol 31 (10) ◽  
pp. 1549-1561 ◽  
Author(s):  
Ana Carolina Proença da Fonseca ◽  
Claudio Mastronardi ◽  
Angad Johar ◽  
Mauricio Arcos-Burgos ◽  
Gilberto Paz-Filho

Author(s):  
Shinichi Morishita ◽  
Kazuki Ichikawa ◽  
Gene Myers

Abstract

Motivation: Long tandem repeat expansions of more than 1000 nt have been suggested to be associated with diseases, but remain largely unexplored in individual human genomes because read lengths have been too short. New long-read sequencing technologies can produce single reads of 10,000 nt or more that can span such repeat expansions, but these long reads have high error rates of 10%-20%, which complicate the detection of repetitive elements. Moreover, most traditional algorithms for finding tandem repeats are designed to find short tandem repeats (< 1000 nt) and cannot effectively handle the high error rate of long reads in a reasonable amount of time.

Results: Here, we report an efficient algorithm for solving this problem that takes advantage of the length of the repeat. Namely, a long tandem repeat has hundreds or thousands of approximate copies of the repeated unit, so despite the error rate, many short k-mers will be error-free in many copies of the unit. We exploited this characteristic to develop a two-step method. First, it estimates regions that could contain a tandem repeat by analyzing the k-mer frequency distributions of fixed-size windows across the target read. Second, it assembles the k-mers of a putative region into the consensus repeat unit by greedily traversing a de Bruijn graph. Experimental results indicated that the proposed algorithm largely outperformed Tandem Repeats Finder (TRF), a widely used program for finding tandem repeats, in terms of sensitivity.

Software availability: https://github.com/morisUtokyo/mTR
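
The two ideas in the abstract above can be illustrated with a minimal Python sketch: short k-mers often survive a high error rate, and a repeat unit can be recovered by a greedy walk over a de Bruijn graph. This is not the mTR implementation. The function names, the non-overlapping window scan, the `min_count` threshold, and the assumption that errors hit each base independently are all hypothetical choices made for illustration.

```python
from collections import Counter


def error_free_kmer_probability(k, error_rate):
    """Chance that a k-mer contains no error.

    Assumption (not from the source): errors hit each base independently.
    """
    return (1.0 - error_rate) ** k


def kmer_counts(seq, k):
    """Count every k-mer occurring in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def repeat_like_windows(read, k, window, min_count):
    """Start positions of non-overlapping fixed-size windows whose most
    frequent k-mer occurs at least min_count times (hypothetical threshold)."""
    hits = []
    for start in range(0, len(read) - window + 1, window):
        counts = kmer_counts(read[start:start + window], k)
        if counts and counts.most_common(1)[0][1] >= min_count:
            hits.append(start)
    return hits


def greedy_unit(region, k):
    """Greedy de Bruijn walk: start at the most frequent k-mer and keep
    following the most frequent successor until a node repeats or no
    successor exists. Returns the bases spelled along the walk."""
    counts = kmer_counts(region, k)
    if not counts:
        return ""
    node = counts.most_common(1)[0][0]
    unit, seen = [], set()
    while node not in seen:
        seen.add(node)
        unit.append(node[0])
        # Successors share the (k-1)-base suffix of the current node.
        best = max("ACGT", key=lambda b: counts[node[1:] + b])
        nxt = node[1:] + best
        if counts[nxt] == 0:  # dead end: no observed successor
            break
        node = nxt
    return "".join(unit)


if __name__ == "__main__":
    read = "ACGTT" * 20  # error-free toy repeat, 100 nt
    print(error_free_kmer_probability(5, 0.15))
    print(repeat_like_windows(read, k=4, window=50, min_count=5))
    print(greedy_unit(read, k=4))
```

On an error-free toy repeat the walk closes into a cycle whose length equals the unit length. With real noisy reads, erroneous k-mers have low counts, so following the most frequent successor tends to stay on the consensus path, although a sketch this simple does not handle ties or branching repeats.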


2021 ◽  
Vol 4 ◽  
Author(s):  
Benjamin Callahan

An important advance in DNA sequencing has been the development of long-read sequencing technologies that produce sequencing reads of tens to hundreds of kilobases in length. However, these technologies typically have high (~8%) per-base error rates. Recently, an effectively new technology, which I call highly-accurate long-read sequencing, has been developed that allows for the generation of multi-kilobase reads with extremely high per-base accuracies (>99.9%). I will present and evaluate two such technologies, PacBio HiFi and LoopSeq SLR sequencing, and discuss potential metabarcoding applications of highly-accurate long-read amplicon sequencing in general.


2021 ◽  
Author(s):  
Xiao Luo ◽  
Xiongbin Kang ◽  
Alexander Schoenhuth

Haplotype-aware diploid genome assembly is crucial in genomics, precision medicine, and many other disciplines. Long-read sequencing technologies have greatly improved genome assembly thanks to their greater read length. However, current long-read assemblers usually introduce disturbing biases or fail to capture the haplotype diversity of the diploid genome. Here, we present phasebook, a novel approach for reconstructing the haplotypes of diploid genomes from long reads de novo. Benchmarking experiments demonstrate that our method outperforms other approaches in terms of haplotype coverage by large margins, while remaining competitive, or even achieving advantages, in all other aspects relevant for genome assembly.


2019 ◽  
Vol 304 ◽  
pp. 64-73
Author(s):  
Anna Woźniak ◽  
Michał Boroń ◽  
Renata Zbieć-Piekarska ◽  
Magdalena Spólnicka ◽  
...  

The turn of the 20th and 21st centuries marks the beginning of high-throughput DNA sequencing methods, which, owing to increasing efficiency and gradual cost reduction, have revolutionized biomedical research. This article discusses the most popular next-generation sequencing technologies and their practical application in forensic genetic analysis.


2009 ◽  
Vol 1 (1) ◽  
pp. 1091-1094
Author(s):  
A R A Rahman ◽  
Shihui Foo ◽  
Sanket Goel

BMC Genomics ◽  
2012 ◽  
Vol 13 (1) ◽  
pp. 16 ◽  
Author(s):  
Michael P Mullen ◽  
Christopher J Creevey ◽  
Donagh P Berry ◽  
Matt S McCabe ◽  
David A Magee ◽  
...  

2012 ◽  
Vol 2012 ◽  
pp. 1-18 ◽  
Author(s):  
Silvio Garofalo ◽  
Marisa Cornacchione ◽  
Alfonso Di Costanzo

The introduction of DNA microarrays and DNA sequencing technologies in medical genetics and diagnostics has been a challenge that has significantly transformed medical practice and patient management. Because of the great advancements in molecular genetics and the development of simple laboratory technology to identify mutations in the causative genes, the diagnostic approach to epilepsy has also changed significantly. However, the clinical use of molecular cytogenetics and high-throughput DNA sequencing technologies, which are able to test an entire genome for genetic variants associated with the disease, is preparing a further revolution in the near future. Molecular Karyotype and Next-Generation Sequencing have the potential to identify causative genes or loci in sporadic or non-familial epilepsy cases as well, and may well represent the transition from a genetic to a genomic approach to epilepsy.

