Long-read sequencing technology indicates genome-wide effects of non-B DNA on polymerization speed and error rate

2017 ◽  
Author(s):  
Wilfried M. Guiblet ◽  
Marzia A. Cremona ◽  
Monika Cechova ◽  
Robert S. Harris ◽  
Iva Kejnovska ◽  
...  

ABSTRACT
DNA conformation may deviate from the classical B-form in ~13% of the human genome. Non-B DNA regulates many cellular processes; however, its effects on DNA polymerization speed and accuracy have not been investigated genome-wide. Such an inquiry is critical for understanding neurological diseases and cancer genome instability. Here we present the first simultaneous examination of DNA polymerization kinetics and errors in the human genome sequenced with Single-Molecule Real-Time technology. We show that polymerization speed differs between non-B and B-DNA: it decelerates at G-quadruplexes and fluctuates periodically at disease-causing tandem repeats. By analyzing polymerization kinetics profiles, we predict non-B DNA formation for a novel motif and validate it experimentally. We demonstrate that several non-B motifs affect sequencing errors (e.g., G-quadruplexes increase error rates) and that sequencing errors are positively associated with polymerase slowdown. Finally, we show that highly divergent G4 motifs have pronounced polymerization slowdown and high sequencing error rates, suggesting similar mechanisms for sequencing errors and germline mutations.
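For readers who want to locate candidate G-quadruplex regions themselves, the sketch below uses the widely cited generic G4 consensus (four runs of three or more Gs separated by loops of 1-7 bases). This is an illustrative pattern, not the exact motif set analyzed in the study:

```python
import re

# Generic G4 consensus: four G-runs (>=3 Gs) separated by 1-7 base loops.
G4_RE = re.compile(r"G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}")

def find_g4_motifs(seq):
    """Return (start, end) spans of candidate G4 motifs on the given strand."""
    return [(m.start(), m.end()) for m in G4_RE.finditer(seq.upper())]
```

Scanning the reverse complement as well would catch motifs on the opposite strand.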

2021 ◽  
Author(s):  
Barış Ekim ◽  
Bonnie Berger ◽  
Rayan Chikhi

DNA sequencing data continues to progress towards longer reads with increasingly lower sequencing error rates. We focus on the problem of assembling such reads into genomes, which poses challenges in terms of accuracy and computational resources when using cutting-edge assembly approaches, e.g. those based on overlapping reads using minimizer sketches. Here, we introduce the concept of minimizer-space sequencing data analysis, where the minimizers, rather than DNA nucleotides, are the atomic tokens of the alphabet. By projecting DNA sequences into ordered lists of minimizers, our key idea is to enumerate what we call k-min-mers, which are k-mers over a larger alphabet consisting of minimizer tokens. Our approach, mdBG or minimizer-dBG, achieves orders-of-magnitude improvement in both speed and memory usage over existing methods without much loss of accuracy. We demonstrate three use cases of mdBG: human genome assembly, metagenome assembly, and the representation of large pangenomes. For assembly, we implemented mdBG in software we call rust-mdbg, resulting in ultra-fast, low-memory, and highly contiguous assembly of PacBio HiFi reads. A human genome is assembled in under 10 minutes using 8 cores and 10 GB RAM, and 60 Gbp of metagenome reads are assembled in 4 minutes using 1 GB RAM. For pangenome graphs, we newly enable a graphical representation of a collection of 661,405 bacterial genomes as an mdBG and successfully search it (in minimizer-space) for anti-microbial resistance (AMR) genes. We expect our advances to be essential to sequence analysis, given the rise of long-read sequencing in genomics, metagenomics and pangenomics.
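The minimizer-space idea can be illustrated in a few lines: pick window minimizers from a sequence, then form k-min-mers as k-tuples of consecutive minimizer tokens. Parameter values and the hash function below are illustrative choices, not rust-mdbg's defaults:

```python
import hashlib

def _h(s):
    """Stable 64-bit hash used to rank m-mers."""
    return int.from_bytes(hashlib.blake2b(s.encode(), digest_size=8).digest(), "big")

def window_minimizers(seq, m=5, w=12):
    """In each window of w consecutive m-mers, keep the m-mer with the
    smallest hash (a standard window-minimizer scheme)."""
    mmers = [seq[i:i + m] for i in range(len(seq) - m + 1)]
    picked = []
    for start in range(len(mmers) - w + 1):
        best = min(range(start, start + w), key=lambda j: _h(mmers[j]))
        if not picked or picked[-1][0] != best:  # deduplicate repeats
            picked.append((best, mmers[best]))
    return picked  # ordered list of (position, minimizer token)

def k_min_mers(minimizers, k=3):
    """k-mers over the minimizer alphabet: runs of k consecutive tokens."""
    toks = [t for _, t in minimizers]
    return [tuple(toks[i:i + k]) for i in range(len(toks) - k + 1)]
```

An mdBG would then use these k-min-mers, rather than nucleotide k-mers, as de Bruijn graph nodes.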


2018 ◽  
Author(s):  
Satomi Mitsuhashi ◽  
Martin C Frith ◽  
Takeshi Mizuguchi ◽  
Satoko Miyatake ◽  
Tomoko Toyota ◽  
...  

Abstract
Tandemly repeated sequences are highly mutable and variable features of genomes. Tandem repeat expansions are responsible for a growing list of human diseases, even though it is hard to determine tandem repeat sequences with current DNA sequencing technology. Recent long-read technologies are promising, because the DNA reads are often longer than the repetitive regions, but are hampered by high error rates. Here, we report robust detection of human repeat expansions from careful alignments of long (PacBio and nanopore) reads to a reference genome. Our method (tandem-genotypes) is robust to systematic sequencing errors, inexact repeats with fuzzy boundaries, and low sequencing coverage. By comparing to healthy controls, we can prioritize pathological expansions within the top 10 out of 700,000 tandem repeats in the genome. This may help to elucidate the many genetic diseases whose causes remain unknown.
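The core quantity behind repeat-expansion detection is the change in repeat copy number seen in a read relative to the reference. A crude sketch of that count (tandem-genotypes itself works from alignments and tolerates inexact units, which this toy count does not):

```python
def repeat_copy_change(read_segment, unit, ref_copies):
    """Count non-overlapping occurrences of a repeat unit in the read segment
    spanning the repeat locus, and return the copy-number change relative to
    the reference. Exact-match counting only; illustrative, not the paper's
    alignment-based method."""
    read_copies, i = 0, 0
    while (j := read_segment.find(unit, i)) != -1:
        read_copies += 1
        i = j + len(unit)
    return read_copies - ref_copies
```

A positive value across many reads from one individual, but not controls, flags a candidate expansion.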


2019 ◽  
Author(s):  
Mitchell R. Vollger ◽  
Glennis A. Logsdon ◽  
Peter A. Audano ◽  
Arvis Sulovari ◽  
David Porubsky ◽  
...  

Abstract
The sequence and assembly of human genomes using long-read sequencing technologies has revolutionized our understanding of structural variation and genome organization. We compared the accuracy, continuity, and gene annotation of genome assemblies generated from either high-fidelity (HiFi) or continuous long-read (CLR) datasets from the same complete hydatidiform mole human genome. We find that the HiFi sequence data assemble an additional 10% of duplicated regions and more accurately represent the structure of tandem repeats, as validated with orthogonal analyses. As a result, an additional 5 Mbp of pericentromeric sequences are recovered in the HiFi assembly, resulting in a 2.5-fold increase in the NG50 within 1 Mbp of the centromere (HiFi 480.6 kbp, CLR 191.5 kbp). Additionally, the HiFi genome assembly was generated in significantly less time with fewer computational resources than the CLR assembly. Although the HiFi assembly has significantly improved continuity and accuracy in many complex regions of the genome, it still falls short of assembling centromeric DNA and the largest regions of segmental duplication with existing assemblers. Despite these shortcomings, our results suggest that HiFi may be the most effective stand-alone technology for de novo assembly of human genomes.
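The NG50 contiguity metric quoted above is straightforward to compute: sort contigs longest-first and report the length at which the running total first reaches 50% of the (estimated) genome size. A minimal sketch:

```python
def ngx(contig_lengths, genome_size, x=50):
    """NGx (x=50 gives NG50): length of the contig at which cumulative
    assembled sequence, sorted longest-first, first reaches x% of the
    genome size. Returns 0 if the assembly never reaches that fraction."""
    target = genome_size * x / 100
    total = 0
    for length in sorted(contig_lengths, reverse=True):
        total += length
        if total >= target:
            return length
    return 0
```

Unlike N50, NG50 uses the genome size rather than the assembly size as the denominator, so it penalizes incomplete assemblies.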


2017 ◽  
Author(s):  
Hajime Suzuki ◽  
Masahiro Kasahara

Abstract
Motivation: Pairwise alignment of nucleotide sequences has previously been carried out using the seed-and-extend strategy, where we enumerate seeds (shared patterns) between sequences and then extend the seeds by Smith-Waterman-like semi-global dynamic programming to obtain full pairwise alignments. With the advent of massively parallel short-read sequencers, algorithms and data structures for efficiently finding seeds have been extensively explored. However, recent advances in single-molecule sequencing technologies have enabled us to obtain millions of reads, each of which is orders of magnitude longer than those output by short-read sequencers, demanding a faster algorithm for the extension step, which accounts for most of the computation time in pairwise local alignment. Our goal is to design a faster extension algorithm suitable for single-molecule sequencers with high sequencing error rates (e.g., 10-15%) and with more frequent insertions and deletions than substitutions.
Results: We propose an adaptive banded dynamic programming algorithm for calculating pairwise semi-global alignment of nucleotide sequences that allows a relatively high insertion or deletion rate while keeping the band width low (e.g., 32 or 64 cells) regardless of sequence length. Our new algorithm eliminates mutual dependences between elements in a vector, allowing efficient Single-Instruction-Multiple-Data parallelization. We experimentally demonstrate that our algorithm runs approximately 5× faster than the extension alignment algorithm in NCBI BLAST+ while retaining similar sensitivity (recall). We also show that our extension algorithm is more sensitive than the extension alignment routine in DALIGNER, with comparable computation time.
Availability: The implementation of the algorithm and the benchmarking scripts are available at https://github.com/ocxtal/
Contact: [email protected]
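The extension step being accelerated is a banded semi-global DP. The sketch below uses a static band of fixed width around the main diagonal for clarity; the paper's contribution is moving the band adaptively as the alignment progresses and vectorizing the inner loop, neither of which is attempted here:

```python
NEG = float("-inf")

def banded_semiglobal(ref, query, band=32, match=1, mismatch=-1, gap=-1):
    """Semi-global alignment score, filling only DP cells with |i - j| <= band.
    Static band for illustration; an adaptive band would recenter on the
    best-scoring cell each row."""
    n, m = len(ref), len(query)
    dp = [[NEG] * (m + 1) for _ in range(n + 1)]
    for j in range(min(m, band) + 1):  # gaps at query start are penalized
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        lo, hi = max(0, i - band), min(m, i + band)
        if lo == 0:
            dp[i][0] = i * gap
        for j in range(max(1, lo), hi + 1):
            sub = dp[i - 1][j - 1] + (match if ref[i - 1] == query[j - 1] else mismatch)
            dp[i][j] = max(sub, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
    # semi-global: end gaps are free, so take the best cell on the last
    # row or last column
    return max(max(dp[n]), max(row[m] for row in dp))
```

Restricting computation to the band makes the cost linear in sequence length, which is what makes extension over very long single-molecule reads tractable.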


2020 ◽  
Vol 36 (19) ◽  
pp. 4838-4845
Author(s):  
Yan Song ◽  
Haixu Tang ◽  
Haoyu Zhang ◽  
Qin Zhang

Abstract
Motivation: Third-generation sequencing techniques, such as the Single Molecule Real Time technique from PacBio and the MinION technique from Oxford Nanopore, generate long, error-prone sequencing reads that pose new challenges for fragment assembly algorithms. In this paper, we study the overlap detection problem for error-prone reads, the first and most critical step in de novo fragment assembly. We observe that none of the state-of-the-art methods achieves ideal accuracy for overlap detection (in terms of precision and recall) due to the high sequencing error rates, especially when the overlap lengths between reads are relatively short (e.g. <2000 bases). This limitation appears inherent to these algorithms because of their use of q-gram-based seeds under the seed-extension framework.
Results: We propose the smooth q-gram, a variant of the q-gram that captures q-gram pairs within small edit distances, and design a novel algorithm for detecting overlapping reads using smooth q-gram-based seeds. We implemented the algorithm and tested it on both PacBio and Nanopore sequencing datasets. Our benchmarking results demonstrate that our algorithm outperforms existing q-gram-based overlap detection algorithms, especially for reads with relatively short overlapping lengths.
Availability and implementation: The source code of our implementation in C++ is available at https://github.com/FIGOGO/smoothq.
Supplementary information: Supplementary data are available at Bioinformatics online.
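The motivation for tolerant seeds can be sketched with a single-deletion neighborhood: two equal-length q-grams at edit distance 1 (a substitution) always share a common (q-1)-length deletion form, so indexing those forms finds near-matching seeds that exact q-gram lookup misses. This is a rough stand-in for smooth q-grams, not the paper's algorithm:

```python
from collections import defaultdict

def one_deletion_forms(qgram):
    """The q-gram itself plus every string obtained by deleting one character.
    Two q-grams differing by a single substitution share at least one form."""
    return {qgram} | {qgram[:i] + qgram[i + 1:] for i in range(len(qgram))}

def seed_matches(read_a, read_b, q=7):
    """Candidate seed positions (i, j) where q-grams of the two reads are
    identical or within edit distance 1 of each other."""
    index = defaultdict(list)
    for i in range(len(read_a) - q + 1):
        for form in one_deletion_forms(read_a[i:i + q]):
            index[form].append(i)
    hits = set()
    for j in range(len(read_b) - q + 1):
        for form in one_deletion_forms(read_b[j:j + q]):
            for i in index.get(form, ()):
                hits.add((i, j))
    return sorted(hits)
```

A real overlapper would then chain colinear seed hits and verify candidate overlaps.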


2017 ◽  
Author(s):  
Timothy J. Hackmann

Abstract
The similarity between two DNA sequences is one of the most important measures in bioinformatics, but errors introduced during sequencing make observed similarity lower than it should be. Here we develop a method to correct raw sequence similarity for sequencing errors and estimate the original sequence similarity. Our method is simple and consists of a single equation with terms for 1) raw sequence similarity and 2) error rates (e.g., from Phred quality scores). We show the importance of this correction for 16S ribosomal DNA sequences from bacterial communities, where 97% similarity is a frequent threshold for clustering sequences for analysis. At that threshold and a typical error rate of 0.2%, correcting for error increases similarity by 0.36 percentage points. This result shows that, if uncorrected, sequencing error would effectively raise similarity thresholds and generate false clusters for analysis. Our method could be used to adjust thresholds for cluster-based analyses. Alternatively, because it requires no clustering to correct sequence similarity, it could usher in a new age of analyzing ribosomal DNA sequences without clustering.
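One plausible form of such a correction, under the simplifying assumptions that errors in the two sequences are independent and that every error turns a matching position into an observed mismatch, divides the raw similarity by the probability that both reads are error-free at a position. This illustrates the idea only; the paper's exact equation may differ:

```python
def corrected_similarity(raw_similarity, err_a, err_b):
    """Estimate original similarity from observed similarity and the two
    sequences' per-base error rates. Assumed form for illustration:
    observed = true * P(both bases error-free), hence divide to invert."""
    return raw_similarity / ((1 - err_a) * (1 - err_b))
```

With a 0.2% error rate on each sequence, a raw similarity of 97% corrects upward by a few tenths of a percentage point, the same order as the 0.36 points reported in the abstract.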


2021 ◽  
Vol 12 ◽  
Author(s):  
Zachary Stephens ◽  
Dragana Milosevic ◽  
Benjamin Kipp ◽  
Stefan Grebe ◽  
Ravishankar K. Iyer ◽  
...  

Long-read sequencing technologies have the potential to accurately detect and phase variation in genomic regions that are difficult to fully characterize with conventional short-read methods. These difficult-to-sequence regions include several clinically relevant genes with highly homologous pseudogenes, many of which are prone to gene conversions or other types of complex structural rearrangements. We present PB-Motif, a new method for identifying rearrangements between two highly homologous genomic regions using PacBio long reads. PB-Motif leverages clustering and filtering techniques to efficiently report rearrangements in the presence of sequencing errors and other systematic artifacts. Supporting reads for each high-confidence rearrangement can then be used for copy number estimation and phased variant calling. First, we demonstrate PB-Motif's accuracy with simulated rearrangements of PMS2 and its pseudogene PMS2CL, using simulated reads sweeping over a range of sequencing error rates. We then apply PB-Motif to 26 clinical samples, characterizing CYP21A2 and its pseudogene CYP21A1P as part of a diagnostic assay for congenital adrenal hyperplasia. We successfully identify damaging variation and patient carrier status concordant with clinical diagnoses obtained from multiplex ligation-dependent probe amplification (MLPA) and Sanger sequencing. The source code is available at: github.com/zstephens/pb-motif.
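The clustering-and-filtering step can be caricatured as follows: represent each read as its ordered sequence of gene (G) versus pseudogene (P) marker calls, group identical motif strings, and keep motifs with enough supporting reads. This is a simplification for illustration, not PB-Motif's actual pipeline:

```python
from collections import Counter

def motif_support(reads_as_motifs, min_support=3):
    """Group reads by their gene/pseudogene marker-call sequence and keep
    rearrangement motifs supported by at least min_support reads; motifs
    below the threshold are treated as sequencing noise."""
    counts = Counter(tuple(m) for m in reads_as_motifs)
    return {motif: n for motif, n in counts.items() if n >= min_support}
```

The supporting reads behind each retained motif are what would feed copy-number estimation and phased variant calling.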


2016 ◽  
Author(s):  
Athanasios Kousathanas ◽  
Christoph Leuenberger ◽  
Vivian Link ◽  
Christian Sell ◽  
Joachim Burger ◽  
...  

ABSTRACT
While genetic diversity can be quantified accurately from high-coverage sequencing, it is often desirable to obtain such estimates from low-coverage data, either to save costs or because of low DNA quality, as is common for ancient samples. Here we introduce a method to accurately infer heterozygosity probabilistically from very low coverage sequences of a single individual. The method relaxes the infinite-sites assumption of previous methods, does not require a reference sequence, and takes into account both variable sequencing errors and potential post-mortem damage. It is thus also applicable to non-model organisms and ancient genomes. Since error rates as reported by sequencing machines are generally distorted and require recalibration, we also introduce a method to accurately infer recalibration parameters in the presence of post-mortem damage. This method also does not require knowledge of the underlying genome sequence, but instead works from haploid data (e.g., the X chromosome of mammalian males) and integrates over the unknown genotypes. Using extensive simulations we show that a few Mb of haploid data are sufficient for accurate recalibration, even at average coverages as low as 1-3x. At similar coverages, our method also produces very accurate estimates of heterozygosity down to 10^-4 within windows of about 1 Mb. We further illustrate the usefulness of our approach by inferring genome-wide patterns of diversity for several ancient human samples, finding that samples dated to 3,000-5,000 years ago showed diversity patterns comparable to modern humans. In contrast, two European hunter-gatherer samples exhibited not only considerably lower levels of diversity than modern samples, but also highly distinct distributions of diversity along their genomes. Interestingly, these distributions also differed markedly between the two samples, supporting earlier conclusions of a highly diverse and structured population in Europe prior to the arrival of farming.
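The probabilistic core of such inference is a genotype likelihood: how probable are the observed bases at a site under a heterozygous versus homozygous genotype, given a per-base error rate. The textbook sketch below ignores post-mortem damage and recalibration, which are the paper's additions:

```python
from math import comb

def site_likelihood(n_ref, n_alt, err, het):
    """Binomial likelihood of observing n_ref reference and n_alt alternate
    bases at a site. Under heterozygosity each read shows either allele with
    probability 0.5; under homozygous-reference, an alternate base appears
    only through sequencing error (rate err). Simplified illustration, not
    the paper's full model."""
    n = n_ref + n_alt
    p_alt = 0.5 if het else err
    return comb(n, n_alt) * p_alt**n_alt * (1 - p_alt)**(n - n_alt)
```

Summing such likelihoods over many sites, weighted by a heterozygosity parameter, lets one estimate that parameter even when no single site is confidently genotyped.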


2020 ◽  
Author(s):  
Wenxiong Zhou ◽  
Li Kang ◽  
Haifeng Duan ◽  
Shuo Qiao ◽  
Louis Tao ◽  
...  

Abstract
An error-correction code (ECC) sequencing approach has recently been reported to effectively reduce sequencing errors by interrogating a DNA fragment with three orthogonal degenerate sequencing-by-synthesis (SBS) reactions. However, as in other non-single-molecule SBS methods, the reaction gradually loses synchronization within a molecular colony in ECC sequencing. This phenomenon, called dephasing, causes sequencing errors and, in ECC sequencing, induces distinctive dephasing patterns. To understand the characteristic dephasing patterns of the dual-base flowgram in ECC sequencing and to develop a correction algorithm, we built a virtual sequencer in silico. Starting from first principles and the underlying sequencing chemistry, we simulated ECC sequencing results, identified the key factors of dephasing in ECC sequencing chemistry, and designed an effective dephasing algorithm. The results show that our dephasing algorithm is applicable to sequencing signals of at least 500 cycles, or 1,000-bp average read length, with an error rate low enough for subsequent parity checks and ECC deduction. Our virtual sequencer with our dephasing algorithm can be further extended to a dichromatic form of ECC sequencing, allowing for a potentially much more accurate sequencing approach.
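Why dephasing compounds with read length is easy to see in a toy model: if each strand independently lags or leads with some small probability per flow cycle, the in-phase fraction of the colony decays geometrically. The probabilities below are illustrative, not measured ECC-sequencing parameters:

```python
def in_phase_fraction(cycles, p_lag=0.01, p_lead=0.005):
    """Fraction of strands in a molecular colony still synchronized after a
    number of flow cycles, assuming independent per-cycle lagging (incomplete
    extension) and leading (over-extension) probabilities."""
    stay = 1 - p_lag - p_lead
    return stay ** cycles
```

Even 1-2% loss per cycle leaves only a small in-phase minority after 500 cycles, which is why a model-based dephasing correction is needed before parity checks can work.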


2021 ◽  
Author(s):  
Ariel Gershman ◽  
Michael E.G. Sauria ◽  
Paul W Hook ◽  
Savannah Hoyt ◽  
Roham Razaghi ◽  
...  

The completion of the first telomere-to-telomere human genome, T2T-CHM13, enables exploration of the full epigenome, removing limitations previously imposed by the missing reference sequence. Existing epigenetic studies omit unassembled and unmappable genomic regions (e.g. centromeres, pericentromeres, acrocentric chromosome arms, subtelomeres, segmental duplications, tandem repeats). Leveraging the new assembly, we measured enrichment of epigenetic marks with short reads using k-mer-assisted mapping methods, granting array-level enrichment information with which to characterize the epigenetic regulation of these satellite repeats. Using nanopore sequencing data, we generated base-level maps of the most complete human methylome produced to date. We examined methylation patterns in satellite DNA and revealed organized patterns of methylation along individual molecules. Exploring the centromeric epigenome, we discovered a distinctive dip in centromere methylation consistent with active sites of kinetochore assembly. Through long-read chromatin accessibility measurements (nanoNOMe) paired with CUT&RUN data, we found that the hypomethylated region was extremely inaccessible and coincided with CENP-A/B binding. With long reads we interrogated allele-specific, long-range epigenetic patterns in complex macro-satellite arrays such as those involved in X chromosome inactivation. Using the single-molecule measurements, we clustered reads based on methylation status alone, distinguishing epigenetically heterogeneous and homogeneous regions. This analysis provides a framework for investigating the most elusive regions of the human genome, applying both long- and short-read technology to grant new insights into epigenetic regulation.
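Clustering reads by methylation status alone can be reduced to a one-dimensional split: summarize each long read by its mean CpG methylation fraction and partition around a threshold. This is a deliberately minimal stand-in for the clustering used to separate alleles; the threshold is illustrative:

```python
def split_reads_by_methylation(read_meth_fractions, threshold=0.5):
    """Partition reads into (methylated, unmethylated) groups by their mean
    CpG methylation fraction. read_meth_fractions maps read name -> fraction
    of methylated CpG calls on that read."""
    hi = [r for r, f in read_meth_fractions.items() if f >= threshold]
    lo = [r for r, f in read_meth_fractions.items() if f < threshold]
    return hi, lo
```

Because each long read carries many CpG calls, per-read summaries separate cleanly when two epigenetic states (e.g. active and inactive X alleles) are present.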

