Optimizing experimental design for genome sequencing and assembly with Oxford Nanopore Technologies

Gigabyte ◽

10.46471/gigabyte.27 ◽

2021 ◽

Vol 2021 ◽

pp. 1-26

Author(s):

John M. Sutton ◽

Joshua D. Millwood ◽

A. Case McCormack ◽

Janna L. Fierst

Keyword(s):

Experimental Design ◽

Genome Sequencing ◽

Dna Sequences ◽

De Novo ◽

Error Rates ◽

Read Length ◽

High Quality ◽

High Molecular Weight Dna ◽

Oxford Nanopore ◽

Oxford Nanopore Technologies

High-quality reference genome sequences are the core of modern genomics. Oxford Nanopore Technologies (ONT) produces inexpensive DNA sequences, but has high error rates, which make sequence assembly and analysis difficult as genome size and complexity increase. Robust experimental design is necessary for ONT genome sequencing and assembly, but few studies have addressed eukaryotic organisms. Here, we present novel results using simulated and empirical ONT and DNA libraries to identify best practices for sequencing and assembly for several model species. We find that the unique error structure of ONT libraries causes errors to accumulate and assembly statistics to plateau as sequence depth increases. High-quality assembled eukaryotic sequences require high-molecular-weight DNA extractions that increase sequence read length, and computational protocols that reduce error through pre-assembly correction and read selection. Our quantitative results will be helpful for researchers seeking guidance for de novo assembly projects.

Optimizing experimental design for genome sequencing and assembly with Oxford Nanopore Technologies

10.1101/2020.05.05.079327 ◽

2020 ◽

Author(s):

John M. Sutton ◽

Janna L. Fierst

Keyword(s):

Experimental Design ◽

Genome Sequencing ◽

Dna Sequences ◽

Genome Assembly ◽

De Novo ◽

Error Rates ◽

Oxford Nanopore ◽

Broad Array ◽

Sequencing Strategy ◽

Oxford Nanopore Technologies

Summary: High-quality reference genome sequences are the core of modern genomics. Oxford Nanopore Technologies (ONT) produces inexpensive DNA sequences in excess of 100,000 nucleotides, but error rates remain >10% and assembling these sequences, particularly for eukaryotes, is a non-trivial problem. To date there has been no comprehensive attempt to generate experimental design for ONT genome sequencing and assembly. Here, we simulate ONT and Illumina DNA sequence reads for Escherichia coli, Caenorhabditis elegans, Arabidopsis thaliana, and Drosophila melanogaster. We quantify the influence of sequencing coverage, assembly software and experimental design on de novo genome assembly and error correction to predict the optimum sequencing strategy for these organisms. We show proof of concept using real ONT data generated for the nematode Caenorhabditis remanei. ONT sequencing is inexpensive and accessible, and our quantitative results will be helpful for a broad array of researchers seeking guidance for de novo genome assembly projects.
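Sequencing coverage, one of the variables quantified in the abstract above, is conventionally the total number of sequenced bases divided by the genome size. The following is a minimal sketch of that arithmetic, not code from the paper; the function names and the example figures (a genome of roughly 4.6 Mb, about the size of E. coli, and a mean read length of 10 kb) are illustrative assumptions.

```python
import math


def expected_coverage(num_reads: int, mean_read_length: float, genome_size: int) -> float:
    """Average depth of coverage: total sequenced bases / genome size."""
    return num_reads * mean_read_length / genome_size


def reads_needed(target_coverage: float, mean_read_length: float, genome_size: int) -> int:
    """Smallest number of reads whose combined length reaches the target depth."""
    return math.ceil(target_coverage * genome_size / mean_read_length)


if __name__ == "__main__":
    # Hypothetical planning example: ~4.6 Mb genome, 10 kb mean read length.
    genome_size = 4_600_000
    print(reads_needed(30, 10_000, genome_size))
    print(expected_coverage(13_800, 10_000, genome_size))
```

The calculation ignores read-length variation and uneven coverage, so it is best read as a lower bound when planning a run.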

De novo clustering of long-read transcriptome data using a greedy, quality-value based algorithm

10.1101/463463 ◽

2018 ◽

Cited By ~ 8

Author(s):

Kristoffer Sahlin ◽

Paul Medvedev

Keyword(s):

Clustering Algorithm ◽

De Novo ◽

Substantial Improvement ◽

Error Rates ◽

Reconstruction Algorithms ◽

Long Reads ◽

Oxford Nanopore ◽

Long Read ◽

Transcript Reconstruction ◽

Oxford Nanopore Technologies

Abstract: Long-read sequencing of transcripts with PacBio Iso-Seq and Oxford Nanopore Technologies has proven to be central to the study of complex isoform landscapes in many organisms. However, current de novo transcript reconstruction algorithms from long-read data are limited, leaving the potential of these technologies unfulfilled. A common bottleneck is the dearth of scalable and accurate algorithms for clustering long reads according to their gene family of origin. To address this challenge, we develop isONclust, a clustering algorithm that is greedy (in order to scale) and makes use of quality values (in order to handle variable error rates). We test isONclust on three simulated and five biological datasets, across a breadth of organisms, technologies, and read depths. Our results demonstrate that isONclust is a substantial improvement over previous approaches in terms of overall accuracy and/or scalability to large datasets. Our tool is available at https://github.com/ksahlin/isONclust.
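The general idea of a greedy, quality-aware clustering pass can be illustrated with a toy example. This is a sketch under stated assumptions and is not the isONclust implementation: the abstract does not describe the tool's similarity measure or thresholds, so the shared k-mer fraction, the `k` and `min_shared` parameters, and all function names below are hypothetical. Reads are visited from lowest to highest expected error rate (derived from Phred scores), and each read either joins the first cluster whose representative it resembles or founds a new cluster.

```python
def mean_error(quals):
    """Mean per-base error probability from Phred scores (p = 10^(-Q/10))."""
    return sum(10 ** (-q / 10) for q in quals) / len(quals)


def kmers(seq, k):
    """Set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def greedy_cluster(reads, k=5, min_shared=0.5):
    """Greedy single-pass clustering.

    reads: list of (name, sequence, phred_scores) tuples.
    Returns a list of clusters, each a list of read names.
    """
    # Process the most reliable reads first so they become representatives.
    ordered = sorted(reads, key=lambda r: mean_error(r[2]))
    clusters = []  # each entry: (representative k-mer set, member names)
    for name, seq, _quals in ordered:
        read_kmers = kmers(seq, k)
        for rep_kmers, members in clusters:
            shared = len(read_kmers & rep_kmers) / max(1, len(read_kmers))
            if shared >= min_shared:
                members.append(name)  # join the first matching cluster
                break
        else:
            clusters.append((read_kmers, [name]))  # found a new cluster
    return [members for _, members in clusters]


if __name__ == "__main__":
    toy_reads = [
        ("r2", "ACGTACGTTGCAAGCA", [10] * 16),
        ("r1", "ACGTACGTTGCAAGCT", [30] * 16),
        ("r3", "TTTTTTTTGGGGGGGG", [20] * 16),
    ]
    print(greedy_cluster(toy_reads))
```

Each read is compared only against cluster representatives rather than against every other read, which is what keeps a greedy scheme of this kind scalable.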

Sparc: a sparsity-based consensus algorithm for long erroneous sequencing reads

PeerJ ◽

10.7717/peerj.2016 ◽

2016 ◽

Vol 4 ◽

pp. e2016 ◽

Cited By ~ 18

Author(s):

Chengxi Ye ◽

Zhanshan (Sam) Ma

Keyword(s):

Error Rate ◽

Genome Assembly ◽

De Novo ◽

Consensus Sequence ◽

Variant Calling ◽

Error Rates ◽

Consensus Algorithm ◽

High Quality ◽

Oxford Nanopore ◽

Generation Sequencing

Motivation. The third generation sequencing (3GS) technology generates long sequences of thousands of bases. However, its current error rates are estimated in the range of 15–40%, significantly higher than those of the prevalent next generation sequencing (NGS) technologies (less than 1%). Fundamental bioinformatics tasks such as de novo genome assembly and variant calling require high-quality sequences that need to be extracted from these long but erroneous 3GS sequences. Results. We describe a versatile and efficient linear complexity consensus algorithm Sparc to facilitate de novo genome assembly. Sparc builds a sparse k-mer graph using a collection of sequences from a targeted genomic region. The heaviest path which approximates the most likely genome sequence is searched through a sparsity-induced reweighted graph as the consensus sequence. Sparc supports using NGS and 3GS data together, which leads to significant improvements in both cost efficiency and computational efficiency. Experiments with Sparc show that our algorithm can efficiently provide high-quality consensus sequences using both PacBio and Oxford Nanopore sequencing technologies. With only 30× PacBio data, Sparc can reach a consensus with error rate <0.5%. With the more challenging Oxford Nanopore data, Sparc can also achieve similar error rate when combined with NGS data. Compared with the existing approaches, Sparc calculates the consensus with higher accuracy, and uses approximately 80% less memory and time. The source code is available for download at https://github.com/yechengxi/Sparc.

42 An Improved, High-quality Ovine Reference Genome Assembly

Journal of Animal Science ◽

10.1093/jas/skab235.039 ◽

2021 ◽

Vol 99 (Supplement_3) ◽

pp. 23-24

Author(s):

Kimberly M Davenport ◽

Derek M Bickhart ◽

Kim Worley ◽

Shwetha C Murali ◽

Noelle Cockett ◽

...

Keyword(s):

Genome Assembly ◽

Functional Annotation ◽

Reference Genome ◽

De Novo ◽

The United States ◽

Read Length ◽

Chromosome 11 ◽

High Quality ◽

Oxford Nanopore ◽

Long Read

Abstract: Sheep are an important agricultural species used for both food and fiber in the United States and globally. A high-quality reference genome enhances the ability to discover genetic and biological mechanisms influencing important traits, such as meat and wool quality. The rapid advances in genome assembly algorithms and emergence of increasingly long sequence read length provide the opportunity for an improved de novo assembly of the sheep reference genome. Tissue was collected postmortem from an adult Rambouillet ewe selected by USDA-ARS for the Ovine Functional Annotation of Animal Genomes project. Short-read (55x coverage), long-read PacBio (75x coverage), and Hi-C data from this ewe were retrieved from public databases. We generated an additional 50x coverage of Oxford Nanopore data and assembled the combined long-read data with canu v1.9. The assembled contigs were polished with Nanopolish v0.12.5 and scaffolded using Hi-C data with Salsa v2.2. Gaps were filled with PBsuite v15.8.24 and polished with Nanopolish v0.12.5 followed by removal of duplicate contigs with PurgeDups v1.0.1. Chromosomes were oriented by identifying centromeres and telomeres with RepeatMasker v4.1.1, indicating a need to reverse the orientation of chromosome 11 relative to Oar_rambouillet_v1.0. Final polishing was performed with two rounds of a pipeline which consisted of freebayes v1.3.1 to call variants, Merfin to validate them, and BCFtools to generate the consensus fasta. The ARS-UI_Ramb_v2.0 assembly has improved continuity (contig N50 of 43.19 Mb) with a 19-fold and 38-fold decrease in the number of scaffolds compared with Oar_rambouillet_v1.0 and Oar_v4.0. ARS-UI_Ramb_v2.0 has greater per-base accuracy and fewer insertions and deletions identified from mapped RNA sequence than previous assemblies. This significantly improved reference assembly, public at NCBI GenBank under accession number GCA_016772045, will optimize the functional annotation of the sheep genome and facilitate improved mapping accuracy of genetic variant and expression data for traits relevant to the sheep industry.
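Contig N50, the continuity statistic reported above, is the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly. A minimal sketch of the calculation follows; the function name and the optional `reference_size` argument (which turns the statistic into NG50 by measuring against a genome size instead of the assembly total) are illustrative choices, not taken from any tool named in the abstract.

```python
def n50(lengths, reference_size=None):
    """Return N50 of a list of contig lengths.

    If reference_size is given, the threshold is half of that size
    (NG50) rather than half of the summed contig lengths.
    Returns None when the contigs never reach the threshold.
    """
    if not lengths:
        raise ValueError("no contig lengths supplied")
    total = reference_size if reference_size is not None else sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return None  # assembly covers less than half of reference_size


if __name__ == "__main__":
    contigs = [2, 3, 4, 5, 6]  # toy lengths, total 20
    print(n50(contigs))                      # longest contigs 6 + 5 reach half of 20
    print(n50(contigs, reference_size=30))   # NG50 against a 30-base genome
```

Because N50 depends only on the length distribution, it says nothing about correctness; a misassembled contig can raise it.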

Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation

10.1101/071282 ◽

2016 ◽

Cited By ~ 96

Author(s):

Sergey Koren ◽

Brian P. Walenz ◽

Konstantin Berlin ◽

Jason R. Miller ◽

Nicholas H. Bergman ◽

...

Keyword(s):

Single Molecule ◽

De Novo ◽

Error Rates ◽

Celera Assembler ◽

Oxford Nanopore ◽

Long Read ◽

Reference Quality ◽

Order Of Magnitude ◽

Assembly Algorithms ◽

Oxford Nanopore Technologies

Abstract: Long-read single-molecule sequencing has revolutionized de novo genome assembly and enabled the automated reconstruction of reference-quality genomes. However, given the relatively high error rates of such technologies, efficient and accurate assembly of large repeats and closely related haplotypes remains challenging. We address these issues with Canu, a successor of Celera Assembler that is specifically designed for noisy single-molecule sequences. Canu introduces support for nanopore sequencing, halves depth-of-coverage requirements, and improves assembly continuity while simultaneously reducing runtime by an order of magnitude on large genomes versus Celera Assembler 8.2. These advances result from new overlapping and assembly algorithms, including an adaptive overlapping strategy based on tf-idf weighted MinHash and a sparse assembly graph construction that avoids collapsing diverged repeats and haplotypes. We demonstrate that Canu can reliably assemble complete microbial genomes and near-complete eukaryotic chromosomes using either PacBio or Oxford Nanopore technologies, and achieves a contig NG50 of greater than 21 Mbp on both human and Drosophila melanogaster PacBio datasets. For assembly structures that cannot be linearly represented, Canu provides graph-based assembly outputs in graphical fragment assembly (GFA) format for analysis or integration with complementary phasing and scaffolding techniques. The combination of such highly resolved assembly graphs with long-range scaffolding information promises the complete and automated assembly of complex genomes.
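MinHash, the technique underlying the overlapping strategy mentioned above, estimates how similar two k-mer sets are by comparing small sketches instead of the full sets. The sketch below shows plain, unweighted MinHash only; it does not include the tf-idf weighting described in the abstract and is not Canu's code. The function names, the choice of `k`, the number of hash functions, and the use of salted BLAKE2b as the hash family are all assumptions made for illustration.

```python
import hashlib


def kmer_set(seq, k):
    """Set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(kmer, seed):
    """Deterministic 64-bit hash of a k-mer, salted by the seed."""
    digest = hashlib.blake2b(
        kmer.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_sketch(seq, k=5, num_hashes=64):
    """For each of num_hashes hash functions, keep the minimum hash value."""
    kms = kmer_set(seq, k)
    return [min(_hash(km, seed) for km in kms) for seed in range(num_hashes)]


def estimate_jaccard(sketch_a, sketch_b):
    """Fraction of sketch positions that agree, an estimate of the Jaccard index."""
    matches = sum(a == b for a, b in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


if __name__ == "__main__":
    read_a = "ACGTACGTTGCAAGCTTAGGCTA"
    read_b = "ACGTACGTTGCAAGCTTAGGCTT"  # differs only in the last base
    print(estimate_jaccard(minhash_sketch(read_a), minhash_sketch(read_b)))
```

Two reads that share a large fraction of k-mers agree at most sketch positions, so candidate overlaps can be shortlisted from the sketches before any alignment is attempted.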

LeafGo: Leaf to Genome, a quick workflow to produce high-quality De novo genomes with Third Generation Sequencing technology

10.1101/2021.01.25.428044 ◽

2021 ◽

Author(s):

Patrick Driguez ◽

Salim Bougouffa ◽

Karen Carty ◽

Alexander Putra ◽

Kamel Jabbari ◽

...

Keyword(s):

De Novo ◽

Rapid Development ◽

Plant Genome ◽

Plant Genomics ◽

High Quality ◽

High Molecular Weight Dna ◽

Tissue Samples ◽

Sequencing Technologies ◽

The Cost ◽

New Generation

Abstract: Recent years have witnessed a rapid development of sequencing technologies. Fundamental differences and limitations among various platforms impact the time, the cost and the accuracy for sequencing whole genomes. Here we designed a complete de novo plant genome generation workflow that starts from plant tissue samples and produces high-quality draft genomes with relatively modest laboratory and bioinformatic resources within seven days. To optimize our workflow we selected different species of plants which were used to extract high molecular weight DNA, to make PacBio and ONT libraries for sequencing with the Sequel I, Sequel II and GridION platforms. We assembled high-quality draft genomes of two different Eucalyptus species, E. rudis and E. camaldulensis, to chromosome level without using additional scaffolding technologies. For the rapid production of de novo genome assembly of plant species we showed that our DNA extraction protocol followed by PacBio high fidelity sequencing, and assembly with new generation assemblers such as hifiasm produce excellent results. Our findings will be a valuable benchmark for groups planning wet- and dry-lab plant genomics research and for high throughput plant genomics initiatives.

Chromosome-level genome assembly of the female western mosquitofish (Gambusia affinis)

GigaScience ◽

10.1093/gigascience/giaa092 ◽

2020 ◽

Vol 9 (8) ◽

Cited By ~ 1

Author(s):

Feng Shao ◽

Arne Ludwig ◽

Yang Mao ◽

Ni Liu ◽

Zuogang Peng

Keyword(s):

Dna Sequences ◽

Genome Assembly ◽

Gambusia Affinis ◽

Comparative Genomic ◽

Suitable Model ◽

High Quality ◽

Western Mosquitofish ◽

Poeciliid Fish ◽

Oxford Nanopore ◽

Chromosome Level

Abstract. Background: The western mosquitofish (Gambusia affinis) is a sexually dimorphic poeciliid fish known for its worldwide biological invasion and therefore an important research model for studying invasion biology. This organism may also be used as a suitable model to explore sex chromosome evolution and reproductive development in terms of differentiation of ZW sex chromosomes, ovoviviparity, and specialization of reproductive organs. However, there is a lack of high-quality genomic data for the female G. affinis; hence, this study aimed to generate a chromosome-level genome assembly for it. Results: The chromosome-level genome assembly was constructed using Oxford Nanopore sequencing, BioNano, and Hi-C technology. G. affinis genomic DNA sequences containing 217 contigs with an N50 length of 12.9 Mb and 125 scaffolds with an N50 length of 26.5 Mb were obtained by Oxford Nanopore and BioNano, respectively, and the 113 scaffolds (90.4% of scaffolds containing 97.9% nucleotide bases) were assembled into 24 chromosomes (pseudo-chromosomes) by Hi-C. The Z and W chromosomes of G. affinis were identified by comparative genomic analysis of female and male G. affinis, and the mechanism of differentiation of the Z and W chromosomes was explored. Combined with transcriptome data from 6 tissues, a total of 23,997 protein-coding genes were predicted and 23,737 (98.9%) genes were functionally annotated. Conclusions: The high-quality female G. affinis reference genome provides a valuable omics resource for future studies of comparative genomics and functional genomics to explore the evolution of Z and W chromosomes and the reproductive developmental biology of G. affinis.

Prowler: A novel trimming algorithm for Oxford Nanopore sequence data

10.1101/2021.05.09.443332 ◽

2021 ◽

Author(s):

Simon Lee ◽

Loan T. Nguyen ◽

Ben J. Hayes ◽

Elizabeth M Ross

Keyword(s):

Sequence Data ◽

Error Rates ◽

Read Length ◽

Sequencing Analysis ◽

Sequence Alignments ◽

Lower Error ◽

Oxford Nanopore ◽

High Quality Sequence ◽

Dna Sequencing Analysis ◽

Window Approach

Motivation: Quality control (QC) tools are critical in DNA sequencing analysis because they increase the accuracy of sequence alignments and thus the reliability of results. Oxford Nanopore Technologies (ONT) QC is currently rudimentary, generally based on whole read average quality. This results in discarding reads that contain regions of high quality sequence. Here we propose Prowler, a multi-window approach inspired by algorithms used to QC short read data. Importantly, we retain the phase and read length information by optionally replacing trimmed sections with Ns. Results: Prowler was applied to mammalian and bacterial datasets, to assess effects on alignment and assembly respectively. Compared to Nanofilt, alignments of data QCed with Prowler had lower error rates and more mapped reads. Assemblies of Prowler QCed data had a lower error rate than Nanofilt QCed data; however, this came at some cost to assembly contiguity. Availability and implementation: Prowler is implemented in Python and is available at: https://github.com/ProwlerForNanopore/ProwlerTrimmer
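The contrast between whole-read filtering and window-based trimming can be illustrated with a small example. This is a toy sketch, not Prowler's algorithm or interface: the fixed non-overlapping windows, the arithmetic mean of Phred scores as the window statistic, the `window` and `min_q` defaults, and both function names are assumptions made for illustration. One function masks failing windows with Ns so that read length is preserved; the other keeps the longest run of passing windows.

```python
def _window_passes(quals, min_q):
    """A window passes if its mean Phred score reaches the threshold."""
    return sum(quals) / len(quals) >= min_q


def mask_low_quality(seq, quals, window=4, min_q=10):
    """Replace every failing window with Ns, keeping the read length unchanged."""
    out = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        if _window_passes(quals[start:start + window], min_q):
            out.append(chunk)
        else:
            out.append("N" * len(chunk))
    return "".join(out)


def longest_good_segment(seq, quals, window=4, min_q=10):
    """Return the longest stretch of consecutive passing windows (first one on ties)."""
    best = (0, 0)        # (start, end) of best segment found so far
    run_start = None     # start of the current run of passing windows
    for start in range(0, len(seq), window):
        end = min(start + window, len(seq))
        if _window_passes(quals[start:end], min_q):
            if run_start is None:
                run_start = start
            if end - run_start > best[1] - best[0]:
                best = (run_start, end)
        else:
            run_start = None
    return seq[best[0]:best[1]]


if __name__ == "__main__":
    read = "ACGTACGTACGT"
    phred = [20] * 4 + [2] * 4 + [20] * 4
    print(mask_low_quality(read, phred))
    print(longest_good_segment(read, phred))
```

A whole-read average filter would judge this toy read by a single score (mean Phred 14 here) and keep or discard it as a unit, whereas the windowed functions isolate the poor middle stretch.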

Complete Circular Genome Sequences of Brachyspira hyodysenteriae Isolates of the Four Different Sequence Types Causing Swine Dysentery in Switzerland

Microbiology Resource Announcements ◽

10.1128/mra.00847-21 ◽

2021 ◽

Vol 10 (39) ◽

Author(s):

Ana B. García-Martín ◽

Sarah Schmitt ◽

Friederike Zeeh ◽

Vincent Perreten

Keyword(s):

High Throughput Sequencing ◽

De Novo ◽

Hybrid Assembly ◽

Swine Dysentery ◽

Content Type ◽

Brachyspira Hyodysenteriae ◽

Oxford Nanopore ◽

Sequencing Platforms ◽

Sequence Types ◽

Oxford Nanopore Technologies

The complete genomes of four Brachyspira hyodysenteriae isolates of the four different sequence types (STs) (ST6, ST66, ST196, and ST197) causing swine dysentery in Switzerland were generated by whole-genome sequencing and de novo hybrid assembly of reads obtained from second (Illumina) and third (Oxford Nanopore Technologies and Pacific Biosciences) high-throughput sequencing platforms.

High-Quality Genome Resource of the Pathogen of Botryosphaeria dothidea Causing Kiwifruit Soft Rot

PhytoFrontiers™ ◽

10.1094/phytofr-07-20-0006-a ◽

2021 ◽

pp. PHYTOFR-07-20-0

Author(s):

Kuan Liang ◽

Jianbin Lan ◽

Baoquan Wang ◽

Yuanyuan Liu ◽

Qi Lu ◽

...

Keyword(s):

De Novo ◽

Gc Content ◽

Soft Rot ◽

Read Length ◽

Comparative Genomic ◽

Secretory Proteins ◽

Botryosphaeria Dothidea ◽

High Quality ◽

Total Size ◽

High Quality Genome

Kiwifruit soft rot caused by the fungal pathogen Botryosphaeria dothidea is a serious disease in kiwifruit-growing regions worldwide. In this study, we reported the high-quality genome sequence of the highly virulent B. dothidea strain PTZ1 using PacBio Sequel techniques. In total, 100.87 million clean reads with mean read length of 9,871 bp were obtained. De novo assembly resulted in 28 contigs with a total size of 44.45 Mb. The GC content of the genome was 54.59%. Furthermore, genes related to specific virulence of the strain were identified, including 259 fungal cytochrome P450s, 550 carbohydrate-active enzymes, 860 secretory proteins, and 1,182 pathogen–host interactions related proteins. The genome is a useful resource to serve as a reference to facilitate the analysis of B. dothidea isolates and comparative genomic studies of the necrotroph pathogens. Copyright © 2021 The Author(s). This is an open access article distributed under the CC BY-NC-ND 4.0 International license.
