Clairvoyante: a multi-task convolutional deep neural network for variant calling in Single Molecule Sequencing

2018 ◽  
Author(s):  
Ruibang Luo ◽  
Fritz J. Sedlazeck ◽  
Tak-Wah Lam ◽  
Michael C. Schatz

Abstract: The accurate identification of DNA sequence variants is an important but challenging task in genomics. It is particularly difficult for single molecule sequencing, which has a per-nucleotide error rate of ~5%-15%. To meet this demand, we developed Clairvoyante, a multi-task five-layer convolutional neural network model for predicting variant type (SNP or indel), zygosity, alternative allele and indel length from aligned reads. For the well-characterized NA12878 human sample, Clairvoyante achieved 99.73%, 97.68% and 95.36% precision on known variants, and 98.65%, 92.57%, 87.26% F1-score for whole-genome analysis, using Illumina, PacBio, and Oxford Nanopore data, respectively. Training on a second human sample shows Clairvoyante is sample agnostic and finds variants in less than two hours on a standard server. Furthermore, we identified 3,135 variants that are missed using Illumina but supported independently by both PacBio and Oxford Nanopore reads. Clairvoyante is available open-source (https://github.com/aquaskyline/Clairvoyante), with modules to train, utilize and visualize the model.
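
The F1-scores quoted in this abstract combine precision and recall into a single number (their harmonic mean). A minimal sketch of that calculation follows; the function name and the counts are hypothetical and are not taken from the paper.

```python
def f1_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Harmonic mean of precision and recall for a variant call set."""
    if true_pos == 0:
        # No correct calls: precision and recall are both zero (or undefined).
        return 0.0
    precision = true_pos / (true_pos + false_pos)  # fraction of calls that are real
    recall = true_pos / (true_pos + false_neg)     # fraction of real variants called
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, for illustration only.
example = f1_score(true_pos=90, false_pos=10, false_neg=30)
print(f"F1 = {example:.4f}")
```

Because the harmonic mean is pulled toward the smaller of the two inputs, a caller with high precision but low recall still receives a modest F1-score.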

2021 ◽  
Author(s):  
Pei Wu ◽  
Chao Liu ◽  
Ou Wang ◽  
Xia Zhao ◽  
Fang Chen ◽  
...  

Abstract: In this paper, we report a pipeline, AsmMix, which is capable of producing both contiguous and high-quality diploid genomes. The pipeline consists of two steps. In the first step, two sets of assemblies are generated: one is based on co-barcoded reads, which are highly accurate and haplotype-resolved but contain many gaps; the other is based on single-molecule sequencing reads, which is contiguous but error-prone. In the second step, those two sets of assemblies are compared and integrated into a haplotype-resolved assembly with fewer errors. We test our pipeline using a dataset of human genome NA24385, perform variant calling from those assemblies and then compare the calls against the GIAB benchmark. We show that the AsmMix pipeline can produce highly contiguous, accurate, and haplotype-resolved assemblies. In particular, the assembly mixing process can effectively reduce small-scale errors in the long-read assembly.


2018 ◽  
Author(s):  
Ou Wang ◽  
Robert Chin ◽  
Xiaofang Cheng ◽  
Michelle Ka Wu ◽  
Qing Mao ◽  
...  

Obtaining accurate sequences from long DNA molecules is very important for genome assembly and other applications. Here we describe single tube long fragment read (stLFR), a technology that enables this at low cost. It is based on adding the same barcode sequence to sub-fragments of the original long DNA molecule (DNA co-barcoding). To achieve this efficiently, stLFR uses the surface of microbeads to create millions of miniaturized barcoding reactions in a single tube. Using a combinatorial process, up to 3.6 billion unique barcode sequences were generated on beads, enabling practically non-redundant co-barcoding with 50 million barcodes per sample. Using stLFR, we demonstrate efficient unique co-barcoding of over 8 million 20-300 kb genomic DNA fragments. Analysis of the human genome NA12878 with stLFR demonstrated high quality variant calling and phasing into contigs up to N50 34 Mb. We also demonstrate detection of complex structural variants and complete diploid de novo assembly of NA12878. These analyses were all performed using single stLFR libraries, and their construction did not significantly add to the time or cost of whole genome sequencing (WGS) library preparation. stLFR represents an easily automatable solution that enables high quality sequencing, phasing, SV detection, scaffolding, cost-effective diploid de novo genome assembly, and other long DNA sequencing applications.


2018 ◽  
Vol 1 (4) ◽  
pp. e00086
Author(s):  
S.P. Radko ◽  
L.K. Kurbatov ◽  
K.G. Ptitsyn ◽  
Y.Y. Kiseleva ◽  
E.A. Ponomarenko ◽  
...  

Transcriptome profiling is widely employed to analyze transcriptome dynamics when studying various biological processes at the cell and tissue levels. Unlike the second generation sequencers, which sequence relatively short fragments of nucleic acids, the third generation DNA/RNA sequencers developed by biotechnology companies “PacBio” and “Oxford Nanopore Technologies” allow one to sequence transcripts as single molecules and may be considered as potential molecular counters capable of measuring the number of copies of each transcript with high throughput, sensitivity, and specificity. In the present review, the features of single molecule sequencing technologies offered by “PacBio” and “Oxford Nanopore Technologies” are considered alongside their utility for transcriptome analysis, including the analysis of transcript isoforms. The prospects and limitations of the single molecule sequencing technology in application to quantitative transcriptome profiling are also discussed.


2018 ◽  
Author(s):  
Nicholas Noll ◽  
Eric Urich ◽  
Daniel Wüthrich ◽  
Vladimira Hinic ◽  
Adrian Egli ◽  
...  

Carbapenemase-producing bacteria are resistant to almost all commonly used beta-lactam and cephalosporin antibiotics and represent a growing public health crisis. Carbapenemases reside predominantly in mobile genetic elements and rapidly spread across genetic backgrounds and species boundaries. Here, we report more than one hundred finished, high quality genomes of carbapenemase-producing Enterobacteriaceae, P. aeruginosa and A. baumannii sequenced with Oxford Nanopore and Illumina technologies. We developed a number of high-throughput criteria to assess the quality of fully assembled genomes for which curated references do not exist. Using this diverse collection of closed genomes and plasmids, we demonstrate rapid movement of carbapenemase between genomic neighborhoods, sequence types, and across species boundaries with distinct patterns for different carbapenemases. Lastly, we present evidence of multiple ancestral recombination events between different Enterobacteriaceae MLSTs. Taken together, our samples suggest a hierarchical picture of genomic variation produced by the evolution of carbapenemase-producing bacteria that will require new models to adequately understand and track.


2016 ◽  
Author(s):  
Chen Yang ◽  
Justin Chu ◽  
René L Warren ◽  
Inanç Birol

Motivation: In 2014, Oxford Nanopore Technologies (ONT) announced a new sequencing platform called MinION. The distinctive features of MinION reads, notably longer read lengths and single-molecule sequencing, show potential for genome characterization. As of yet, the pre-commercial technology is exclusively available through early-access, and only a few datasets are publicly available for testing. Further, no software exists that simulates MinION platform reads with genuine ONT characteristics. Results: In this article, we introduce NanoSim, a fast and scalable read simulator that captures the technology-specific features of ONT data, and allows for adjustments upon improvement of nanopore sequencing technology. Availability: NanoSim is written in Python and R. The source files and manual are available at the Genome Sciences Centre website: http://www.bcgsc.ca/platform/bioinfo/software/nanosim


2021 ◽  
Vol 17 (6) ◽  
pp. e1009078
Author(s):  
Jingwen Ren ◽  
Mark J. P. Chaisson

It is computationally challenging to detect variation by aligning single-molecule sequencing (SMS) reads, or contigs from SMS assemblies. One approach to efficiently align SMS reads is sparse dynamic programming (SDP), where optimal chains of exact matches are found between the sequence and the genome. While straightforward implementations of SDP penalize gaps with a cost that is a linear function of gap length, biological variation is more accurately represented when gap cost is a concave function of gap length. We have developed a method, lra, that uses SDP with a concave-cost gap penalty, and used lra to align long-read sequences from PacBio and Oxford Nanopore (ONT) instruments as well as de novo assembly contigs. This alignment approach increases sensitivity and specificity for SV discovery, particularly for variants above 1 kb and when discovering variation from ONT reads, while having runtimes that are comparable (1.05-3.76×) to current methods. When applied to calling variation from de novo assembly contigs, there is a 3.2% increase in Truvari F1 score compared to minimap2+htsbox. lra is available in bioconda (https://anaconda.org/bioconda/lra) and GitHub (https://github.com/ChaissonLab/LRA).
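
To illustrate why the shape of the gap penalty matters when chaining exact matches, here is a minimal sketch with a pluggable gap cost. It uses a simple quadratic-time dynamic program rather than the sparse algorithm in lra, and the function names, the anchor format, and both cost functions (including the logarithmic form and its constants) are assumptions made for illustration.

```python
import math
from typing import Callable, List, Tuple

# Hypothetical anchor format: (query_start, target_start, length) of an exact match.
Anchor = Tuple[int, int, int]


def linear_gap(d: int) -> float:
    """Gap cost that grows in proportion to gap length."""
    return 1.0 * d


def concave_gap(d: int) -> float:
    """Gap cost that grows logarithmically (assumed form, not lra's actual cost)."""
    return 2.0 * math.log2(d + 1)


def chain(anchors: List[Anchor],
          gap_cost: Callable[[int], float]) -> Tuple[float, List[Anchor]]:
    """Return the best-scoring colinear chain of anchors and its score."""
    anchors = sorted(anchors)
    n = len(anchors)
    if n == 0:
        return 0.0, []
    score = [float(a[2]) for a in anchors]  # each anchor alone scores its length
    prev = [-1] * n
    for j in range(n):
        qj, tj, lj = anchors[j]
        for i in range(j):
            qi, ti, li = anchors[i]
            # Anchor i must end before anchor j starts on both sequences.
            if qi + li <= qj and ti + li <= tj:
                gap = abs((qj - qi) - (tj - ti))  # diagonal shift = indel size
                cand = score[i] + lj - gap_cost(gap)
                if cand > score[j]:
                    score[j], prev[j] = cand, i
    best = max(range(n), key=score.__getitem__)
    path = []
    k = best
    while k != -1:
        path.append(anchors[k])
        k = prev[k]
    return score[best], path[::-1]


# Two 30-base matches separated by a 1,000-base shift in the target.
anchors = [(0, 0, 30), (40, 1040, 30)]
print(chain(anchors, linear_gap))
print(chain(anchors, concave_gap))
```

With the linear cost, the 1,000-base gap is too expensive to bridge, so the chain contains a single anchor. With the concave cost, both anchors stay in one chain, which is the behaviour one wants around a large insertion or deletion.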


2016 ◽  
Author(s):  
Arthur C. Rand ◽  
Miten Jain ◽  
Jordan Eizenga ◽  
Audrey Musselman-Brown ◽  
Hugh E. Olsen ◽  
...  

Abstract: Chemical modifications to DNA regulate cellular state and function. The Oxford Nanopore MinION is a portable single-molecule DNA sequencer that can sequence long fragments of genomic DNA. Here we show that the MinION can be used to detect and map two chemical modifications of cytosine, 5-methylcytosine and 5-hydroxymethylcytosine. We present a probabilistic method that enables expansion of the nucleotide alphabet to include bases containing chemical modifications. Our results on synthetic DNA show that individual cytosine base modifications can be classified with accuracy up to 95% in a three-way comparison and 98% in a two-way comparison.

Statement of Significance: Nanopore-based sequencing technology can produce long reads from unamplified genomic DNA, potentially allowing the characterization of chemical modifications and non-canonical DNA nucleotides as they occur in the cell. As the throughput of nanopore sequencers improves, simultaneous detection of multiple epigenetic modifications to cytosines will become an important capability of these devices. Here we present a statistical model that allows the Oxford Nanopore Technologies MinION to be used for detecting chemical modifications to cytosine using standard DNA preparation and sequencing techniques. Our method is based on modeling the ionic current due to DNA k-mers with a variable-order hidden Markov model where the emissions are distributed according to a hierarchical Dirichlet process mixture of normal distributions. This method provides a principled way to expand the nucleotide alphabet to allow for variant calling of modified bases.


Author(s):  
Hailei Zhang ◽  
Huan Zhong ◽  
Shoudong Zhang ◽  
Xiaojian Shao ◽  
Min Ni ◽  
...  

The 5′ end of a eukaryotic mRNA transcript generally has a 7-methylguanosine (m7G) cap that protects mRNA from degradation and mediates almost all other aspects of gene expression. Some RNAs in Escherichia coli, yeast, and mammals were recently found to contain an NAD+ cap. Here, we report the development of the method NAD tagSeq for transcriptome-wide identification and quantification of NAD+-capped RNAs (NAD-RNAs). The method uses an enzymatic reaction and then a click chemistry reaction to label NAD-RNAs with a synthetic RNA tag. The tagged RNA molecules can be enriched and directly sequenced using the Oxford Nanopore sequencing technology. NAD tagSeq can allow more accurate identification and quantification of NAD-RNAs, as well as reveal the sequences of whole NAD-RNA transcripts using single-molecule RNA sequencing. Using NAD tagSeq, we found that NAD-RNAs in Arabidopsis were produced by at least several thousand genes, most of which are protein-coding genes, with the majority of these transcripts coming from <200 genes. For some Arabidopsis genes, over 5% of their transcripts were NAD capped. Gene ontology terms overrepresented in the 2,000 genes that produced the highest numbers of NAD-RNAs are related to photosynthesis, protein synthesis, and responses to cytokinin and stresses. The NAD-RNAs in Arabidopsis generally have the same overall sequence structures as the canonical m7G-capped mRNAs, although most of them appear to have a shorter 5′ untranslated region (5′ UTR). The identification and quantification of NAD-RNAs and revelation of their sequence features can provide essential steps toward understanding the functions of NAD-RNAs.

