A de novo DNA Sequencing and Variant Calling Algorithm for Nanopores

10.1101/019448 ◽

2015 ◽

Author(s):

Tamas Szalay ◽

Jene A Golovchenko

Keyword(s):

Single Molecule ◽

Statistical Models ◽

De Novo ◽

Variant Calling ◽

High Accuracy ◽

Nanopore Sequencing ◽

M13 Bacteriophage ◽

Assembly Pipeline ◽

Calling Algorithm ◽

Novel Algorithm

The single-molecule accuracy of nanopore sequencing has been an area of rapid academic and commercial advancement, but remains insufficient for the de novo analysis of genomes. We introduce here a novel algorithm for the error correction of nanopore data, utilizing statistical models of the physical system in order to obtain high accuracy de novo sequences at a range of coverage depths. We demonstrate the technique by sequencing M13 bacteriophage DNA to 99% accuracy at moderate coverage as well as its use in an assembly pipeline by sequencing λ DNA at a range of coverages. We also show the algorithm’s ability to accurately classify sequence variants at far lower coverage than existing methods.

Haplotype-aware variant calling enables high accuracy in nanopore long-reads using deep neural networks

10.1101/2021.03.04.433952 ◽

2021 ◽

Author(s):

Kishwar Shafin ◽

Trevor Pesout ◽

Pi-Chuan Chang ◽

Maria Nattestad ◽

Alexey Kolesnikov ◽

...

Keyword(s):

De Novo ◽

Sequence Data ◽

Variant Calling ◽

High Accuracy ◽

Superior Performance ◽

Read Length ◽

Single Nucleotide Variants ◽

Single Nucleotide ◽

Short Read ◽

Long Read

Long-read sequencing has the potential to transform variant detection by reaching currently difficult-to-map regions and routinely linking together adjacent variations to enable read-based phasing. Third-generation nanopore sequence data has demonstrated a long read length, but current interpretation methods for its novel pore-based signal have unique error profiles, making accurate analysis challenging. Here, we introduce a haplotype-aware variant calling pipeline, PEPPER-Margin-DeepVariant, that produces state-of-the-art variant calling results with nanopore data. We show that our nanopore-based method outperforms the short-read-based single nucleotide variant identification method at the whole-genome scale and produces high-quality single nucleotide variants in segmental duplications and low-mappability regions where short-read-based genotyping fails. We show that our pipeline can provide highly contiguous phase blocks across the genome with nanopore reads, contiguously spanning between 85% and 92% of annotated genes across six samples. We also extend PEPPER-Margin-DeepVariant to PacBio HiFi data, providing an efficient solution with performance superior to the current WhatsHap-DeepVariant standard. Finally, we demonstrate de novo assembly polishing methods that use nanopore and PacBio HiFi reads to produce diploid assemblies with high accuracy (Q35+ nanopore-polished and Q40+ PacBio-HiFi-polished).
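
The Q35+ and Q40+ figures above can be read as expected per-base error rates. The sketch below assumes they follow the usual Phred convention, Q = -10 log10(p); the abstract does not state the convention explicitly, and the function names are illustrative rather than taken from the PEPPER-Margin-DeepVariant code.

```python
import math


def phred_to_error_rate(q: float) -> float:
    """Convert a Phred-scaled quality value Q into an error probability p."""
    return 10 ** (-q / 10)


def error_rate_to_phred(p: float) -> float:
    """Convert an error probability p (0 < p <= 1) into a Phred-scaled quality."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p)


# Assumption: the assembly qualities quoted in the abstract are Phred-scaled.
for q in (35, 40):
    p = phred_to_error_rate(q)
    print(f"Q{q}: about one error per {round(1 / p):,} bases")
```

Under this assumption, Q40 corresponds to roughly one error per 10,000 bases and Q35 to roughly one error per 3,162 bases.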

AsmMix: A pipeline for high quality diploid de novo assembly

10.1101/2021.01.15.426893 ◽

2021 ◽

Author(s):

Pei Wu ◽

Chao Liu ◽

Ou Wang ◽

Xia Zhao ◽

Fang Chen ◽

...

Keyword(s):

Single Molecule ◽

De Novo ◽

Variant Calling ◽

The Other ◽

Second Step ◽

Small Scale ◽

Mixing Process ◽

High Quality ◽

Single Molecule Sequencing ◽

Long Read

In this paper, we report a pipeline, AsmMix, which is capable of producing both contiguous and high-quality diploid genomes. The pipeline consists of two steps. In the first step, two sets of assemblies are generated: one is based on co-barcoded reads, which are highly accurate and haplotype-resolved but contain many gaps; the other is based on single-molecule sequencing reads, which are contiguous but error-prone. In the second step, those two sets of assemblies are compared and integrated into a haplotype-resolved assembly with fewer errors. We test our pipeline using a dataset of human genome NA24385, perform variant calling from those assemblies and then compare against the GIAB Benchmark. We show that the AsmMix pipeline could produce highly contiguous, accurate, and haplotype-resolved assemblies. In particular, the assembly mixing process could effectively reduce small-scale errors in the long read assembly.

Towards High Accuracy De Novo Nanopore Sequencing

Biophysical Journal ◽

10.1016/j.bpj.2017.11.1000 ◽

2018 ◽

Vol 114 (3) ◽

pp. 179a

Author(s):

Matthew T. Noakes ◽

Henry Brinkerhoff ◽

Andrew H. Laszlo ◽

Ian M. Derrington ◽

Kyle W. Langford ◽

...

Keyword(s):

De Novo ◽

High Accuracy ◽

Nanopore Sequencing

Efficient long single molecule sequencing for cost effective and accurate sequencing, haplotyping, and de novo assembly

10.1101/324392 ◽

2018 ◽

Author(s):

Ou Wang ◽

Robert Chin ◽

Xiaofang Cheng ◽

Michelle Ka Wu ◽

Qing Mao ◽

...

Keyword(s):

Single Molecule ◽

Genome Assembly ◽

De Novo ◽

Low Cost ◽

Variant Calling ◽

Cost Effective ◽

High Quality ◽

Single Molecule Sequencing ◽

Single Tube ◽

Complex Structural

Obtaining accurate sequences from long DNA molecules is very important for genome assembly and other applications. Here we describe single tube long fragment read (stLFR), a technology that enables this at a low cost. It is based on adding the same barcode sequence to sub-fragments of the original long DNA molecule (DNA co-barcoding). To achieve this efficiently, stLFR uses the surface of microbeads to create millions of miniaturized barcoding reactions in a single tube. Using a combinatorial process, up to 3.6 billion unique barcode sequences were generated on beads, enabling practically non-redundant co-barcoding with 50 million barcodes per sample. Using stLFR, we demonstrate efficient unique co-barcoding of over 8 million 20-300 kb genomic DNA fragments. Analysis of the human genome NA12878 with stLFR demonstrated high quality variant calling and phasing into contigs up to N50 34 Mb. We also demonstrate detection of complex structural variants and complete diploid de novo assembly of NA12878. These analyses were all performed using single stLFR libraries, and their construction did not significantly add to the time or cost of whole genome sequencing (WGS) library preparation. stLFR represents an easily automatable solution that enables high quality sequencing, phasing, SV detection, scaffolding, cost-effective diploid de novo genome assembly, and other long DNA sequencing applications.

Rapid de novo assembly of the European eel genome from nanopore sequencing reads

10.1101/101907 ◽

2017 ◽

Cited By ~ 6

Author(s):

Hans J. Jansen ◽

Michael Liem ◽

Susanne A. Jong-Raadsen ◽

Sylvie Dufour ◽

Finn-Arne Weltzien ◽

...

Keyword(s):

De Novo ◽

Structural Quality ◽

European Eel ◽

Nanopore Sequencing ◽

Light Weight ◽

Computational Process ◽

High Quality ◽

Oxford Nanopore ◽

Eukaryotic Genomes ◽

Novel Algorithm

We have sequenced the genome of the endangered European eel using the MinION by Oxford Nanopore, and assembled these data using a novel algorithm specifically designed for large eukaryotic genomes. For this 860 Mbp genome, the entire computational process takes two days on a single CPU. The resulting genome assembly significantly improves on a previous draft based on short reads only, both in terms of contiguity (N50 1.2 Mbp) and structural quality. This combination of affordable nanopore sequencing and light-weight assembly promises to make high-quality genomic resources accessible for many non-model plants and animals.
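
The contiguity figure above is an N50: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length. The following is a minimal sketch of that standard definition; the function name and the toy contig lengths are illustrative and do not come from the eel assembly.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig lengths.

    Contigs are visited from longest to shortest; the N50 is the length of
    the contig at which the running total first reaches half the assembly.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one contig length is required")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable: running total always reaches half")


# Toy example with hypothetical contig lengths (total length 20).
print(n50([2, 3, 4, 5, 6]))
```

In the toy example the two longest contigs (6 and 5) already cover 11 of 20 bases, so the N50 is 5.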

The complex architecture of plant transgene insertions

10.1101/282772 ◽

2018 ◽

Cited By ~ 1

Author(s):

Florian Jupe ◽

Todd P. Michael ◽

Angeline C. Rivkin ◽

Mark Zander ◽

S. Timothy Motley ◽

...

Keyword(s):

Dna Methylation ◽

Single Molecule ◽

Large Scale ◽

Genome Engineering ◽

De Novo ◽

Plant Genome ◽

Nanopore Sequencing ◽

Dna Arrays ◽

In Planta ◽

Actual Length

Over the last 35 years the soil bacterium Agrobacterium tumefaciens has been the workhorse tool for plant genome engineering. Replacement of native tumor-inducing (Ti) plasmid elements with customizable cassettes enabled insertion of a sequence of interest called Transfer DNA (T-DNA) into any plant genome. Although these T-DNA transfer mechanisms are well understood, detailed understanding of structure and epigenomic status of insertion events was limited by current technologies. To fill this gap, we analyzed transgenic Arabidopsis thaliana lines from three widely used collections (SALK, SAIL and WISC) with two single molecule technologies, optical genome mapping and nanopore sequencing. Optical maps for four randomly selected T-DNA lines revealed between one and seven insertions/rearrangements, and for the first time the actual length of individual transgene insertions from 27 to 236 kilobases. De novo nanopore sequencing-based genome assemblies for two segregating lines resolved T-DNA structures up to 36 kb into the insertions and revealed large-scale T-DNA associated translocations and exchange of chromosome arm ends. The multiple internally rearranged nature of T-DNA arrays made full assembly impossible, even with long nanopore reads. For the current TAIR10 reference genome, nanopore contigs corrected 83% of non-centromeric misassemblies. This unprecedented nucleotide-level definition of T-DNA insertions enabled the mapping of epigenome data. We identify variable small RNA transgene targeting and DNA methylation. SALK_059379 T-DNA insertions were enriched for 24nt siRNAs and contained dense cytosine DNA methylation. Transgene silencing via the RNA-directed DNA methylation pathway was confirmed by in planta assays. In contrast, SAIL_232 T-DNA insertions are predominantly targeted by 21/22nt siRNAs, with DNA methylation and silencing limited to a reporter, but not the resistance gene. With the emergence of genome editing technologies that rely on Agrobacterium for gene delivery, this study provides new insights into the structural impact of engineering plant genomes and demonstrates the utility of state-of-the-art long-range sequencing technologies to rapidly identify unanticipated genomic changes.

A reference bacterial genome dataset generated on the MinION™ portable single-molecule nanopore sequencer

10.1101/009613 ◽

2014 ◽

Author(s):

Josh Quick ◽

Aaron Quinlan ◽

Nicholas Loman

Keyword(s):

Single Molecule ◽

De Novo ◽

Sequence Data ◽

Bacterial Genome ◽

Model Organism ◽

Variant Calling ◽

Laptop Computer ◽

Early Access ◽

Dna Strands ◽

K 12

Background: The MinION™ is a new, portable single-molecule sequencer developed by Oxford Nanopore Technologies. It measures four inches in length and is powered from the USB 3.0 port of a laptop computer. By measuring the change in current produced when DNA strands translocate through and interact with a charged protein nanopore the device is able to deduce the underlying nucleotide sequence. Findings: We present a read dataset from whole-genome shotgun sequencing of the model organism Escherichia coli K-12 substr. MG1655 generated on a MinION™ device during the early-access MinION Access Program (MAP). Sequencing runs of the MinION™ are presented, one generated using R7 chemistry (released in July 2014) and one using R7.3 (released in September 2014). Conclusions: Base-called sequence data are provided to demonstrate the nature of data produced by the MinION™ platform and to encourage the development of customised methods for alignment, consensus and variant calling, de novo assembly and scaffolding. FAST5 files containing event data within the HDF5 container format are provided to assist with the development of improved base-calling methods. Datasets are provided through the GigaDB database at http://gigadb.org/dataset/100102

poreTally: run and publish de novo Nanopore assembler benchmarks

10.1101/424184 ◽

2018 ◽

Author(s):

Carlos de Lannoy ◽

Judith Risse ◽

Dick de Ridder

Keyword(s):

Best Practices ◽

De Novo ◽

Nanopore Sequencing ◽

Base Calling ◽

Novel Approach ◽

Tool Performance ◽

Assembly Pipeline ◽

Nucleic Acid Analysis ◽

Sequencing Platforms ◽

Assembly Tool

Nanopore sequencing is a novel approach to nucleic acid analysis that generates long, error-prone reads. Since device components, base calling software and best practices for sample preparation are updated frequently and extensively, the nature of the produced data also changes frequently. As a result, peer-reviewed publications on de novo assembly pipeline benchmarking efforts are quickly rendered outdated by the next major improvement to the sequencing platforms. To provide the user community with a faster, more flexible alternative to peer-reviewed benchmark papers for de novo assembly tool performance we constructed poreTally, a comprehensive benchmarking tool. poreTally automatically assembles a given read set using several often-used assembly pipelines, analyzes the resulting assemblies for correctness and continuity, and finally generates a quality report. Results can immediately be shared with peers in a Github/Gitlab repository. Furthermore, we aim to give a more inclusive overview of assembly pipeline performance than any individual research group can, by offering users the possibility to submit their results to a collective benchmarking effort. poreTally is available on Github.

PBHoover and CigarRoller: a method for confident haploid variant calling on Pacific Biosciences data and its application to heterogeneous population analysis

10.1101/360370 ◽

2018 ◽

Author(s):

Sarah Ramirez-Busby ◽

Afif Elghraoui ◽

Yeon Bin Kim ◽

Kellie Kim ◽

Faramarz Valafar

Keyword(s):

Single Molecule ◽

Error Rate ◽

De Novo ◽

Population Analysis ◽

Variant Calling ◽

High Sensitivity ◽

Sequencing Depth ◽

Smrt Sequencing ◽

Link Type ◽

Low Coverage

Motivation: Single Molecule Real-Time (SMRT) sequencing has important and underutilized advantages that amplification-based platforms lack, among them the lack of systematic error (e.g. GC-bias) and complete de novo assembly (including large repetitive regions) without scaffolding. SMRT sequencing, however, suffers from a high random error rate and low sequencing depth (older chemistries). Here, we introduce PBHoover, software that uses a heuristic calling algorithm in order to make base calls with high certainty in low coverage regions. This software is also capable of mixed population detection with high sensitivity. PBHoover’s CigarRoller attachment improves sequencing depth in low-coverage regions through CIGAR-string correction. Results: We tested both modules on 348 M. tuberculosis clinical isolates sequenced on C1 or C2 chemistries. On average, CigarRoller improved percentage of usable read count from 68.9% to 99.98% in C1 runs and from 50% to 99% in C2 runs. Using the greater depth provided by CigarRoller, PBHoover was able to make base and variant calls 99.95% concordant with Sanger calls (QV33). PBHoover also detected antibiotic-resistant subpopulations that went undetected by Sanger. Using C1 chemistry, subpopulations as small as 9% of the total colony can be detected by PBHoover. This provides the most sensitive amplification-free molecular method for heterogeneity analysis and is in line with phenotypic methods’ sensitivity. This sensitivity significantly improves with the greater depth and lower error rate of the newer chemistries. Availability and Implementation: Executables are freely available under GNU GPL v3+ at http://www.gitlab.com/LPCDRP/pbhoover and http://www.gitlab.com/LPCDRP/CigarRoller. PBHoover is also available on bioconda: https://anaconda.org/bioconda/[email protected]

MBRS-47. RAPID MOLECULAR SUBGROUPING OF MEDULLOBLASTOMA BASED ON DNA METHYLATION BY NANOPORE SEQUENCING

Neuro-Oncology ◽

10.1093/neuonc/noaa222.556 ◽

2020 ◽

Vol 22 (Supplement_3) ◽

pp. iii406-iii406

Author(s):

Julien Masliah-Planchon ◽

Elodie Girard ◽

Philipp Euskirchen ◽

Christine Bourneix ◽

Delphine Lequin ◽

...

Keyword(s):

Dna Methylation ◽

Single Molecule ◽

Nanopore Sequencing ◽

Molecular Subgroup ◽

Group Assignment ◽

Group 4 ◽

Methylation Assay ◽

Tumor Group ◽

Long Read ◽

Group 3

Medulloblastoma (MB) can be classified into four molecular subgroups (WNT group, SHH group, group 3, and group 4). The gold standard for assigning the molecular subgroup through DNA methylation profiling uses the Illumina EPIC array. However, this tool has limitations in terms of cost and turnaround time, which make it difficult to obtain results soon enough for clinical use. We present an alternative DNA methylation assay based on nanopore sequencing that allows rapid, cheaper, and reliable subgrouping of clinical MB samples. Low-depth whole-genome sequencing with long-read single-molecule nanopore technology was used to simultaneously assess copy number profile and MB subgrouping based on DNA methylation. The DNA methylation data generated by nanopore sequencing were compared to a publicly available reference cohort comprising over 2,800 brain tumors including the four subgroups of MB (Capper et al. Nature; 2018) to generate a score that estimates the confidence of a tumor group assignment. Among the 24 MB analyzed with nanopore sequencing (six WNT, nine SHH, five group 3, and four group 4), all were classified in the appropriate subgroup established by expression-based Nanostring subgrouping. In addition to the subgrouping, we also examined the genomic profile. Furthermore, all previously identified clinically relevant genomic rearrangements (mostly MYC and MYCN amplifications) were also detected with our assay. In conclusion, we confirm the full reliability of nanopore sequencing as a novel rapid and cheap assay for methylation-based MB subgrouping. We now plan to extend this technology to other embryonal tumors of the central nervous system.
