Rapid, raw-read reference and identification (R4IDs): A flexible platform for rapid generic species ID using long-read sequencing technology

Mapping Intimacies ◽

10.1101/281048 ◽

2018 ◽

Cited By ~ 2

Author(s):

Joe Parker ◽

Andrew Helmstetter ◽

James Crowe ◽

John Iacona ◽

Dion Devey ◽

...

Keyword(s):

DNA Sequencing ◽

Species Identification ◽

Sequence Data ◽

Vascular Plant ◽

Reference Sequence ◽

Read Length ◽

Reference Database ◽

Sequencing Technology ◽

Long Read ◽

Suitable Reference

Abstract: The versatility of current DNA sequencing platforms and the development of portable nanopore sequencers mean that it has never been easier to collect genetic data for identifying unknown samples. DNA barcoding and meta-barcoding have become increasingly popular, and barcode databases continue to grow at an impressive rate. However, the number of canonical genome assemblies (reference or draft) that are publicly available is relatively tiny, hindering the more widespread use of genome-scale DNA sequencing technology for accurate species identification and discovery. Here, we show that rapid raw-read reference datasets, or R4IDs for short, generated in a matter of hours on the Oxford Nanopore MinION, can bridge this gap and accelerate the generation of usable reference sequence data. By exploiting the long read length of this technology, shotgun genomic sequencing of a small portion of an organism’s genome can act as a suitable reference database despite the low sequencing coverage. These R4IDs can then be used for accurate species identification with minimal re-sequencing effort (thousands of reads). We demonstrated the capabilities of this approach with six vascular plant species, for which we created R4IDs in the laboratory and then re-sequenced live at the Kew Science Festival 2016. We further validated our method using simulations to determine the broader applicability of the approach. Our data analysis pipeline has been made available as a Dockerised workflow for simple, scalable deployment for a range of uses.
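
The idea in the abstract above, that a shallow set of long reads can serve as a reference for identifying a re-sequenced sample, can be illustrated with a toy sketch. This is not the authors' Dockerised pipeline: it swaps in a simple shared k-mer vote, and the function names, species labels, and value of k are all assumptions made for illustration. Real nanopore reads are error-prone and may come from either strand, both of which this sketch ignores.

```python
from collections import Counter


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_reference_index(reference_reads, k):
    """Map each (hypothetical) species label to the k-mers seen in its raw reference reads."""
    return {
        species: set().union(*(kmers(read, k) for read in reads))
        for species, reads in reference_reads.items()
    }


def classify_read(read, index, k):
    """Assign one query read to the reference sharing the most k-mers, or None if no k-mer matches."""
    query = kmers(read, k)
    scores = {species: len(query & ref) for species, ref in index.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def classify_sample(reads, index, k):
    """Tally per-read assignments; unassigned reads are dropped from the vote."""
    votes = Counter(classify_read(read, index, k) for read in reads)
    votes.pop(None, None)
    return votes.most_common()


# Toy data only: these sequences and labels are invented for the example.
toy_index = build_reference_index(
    {"species_A": ["ACGTACGTGACCTGA"], "species_B": ["TTGGCCAATTGGCCA"]}, k=5
)
toy_votes = classify_sample(
    ["GTACGTGACC", "CGTGACCTGA", "GGCCAATTGG", "AAAAAAAA"], toy_index, k=5
)
```

Voting across many reads is what lets a small re-sequencing effort give a confident call even when individual reads are ambiguous; a practical workflow would replace the exact k-mer match with an error-tolerant aligner.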

cpn60 barcode sequences accurately identify newly defined genera within the Lactobacillaceae

10.1101/2021.02.24.432354 ◽

2021 ◽

Author(s):

Ishika Shukla ◽

Janet E. Hill

Keyword(s):

Species Identification ◽

Sequence Data ◽

Sequence Diversity ◽

Reference Sequence ◽

Reference Database ◽

New Genera ◽

Accurate Identification ◽

Detection And Identification ◽

Taxonomic Framework ◽

Definition Of

Abstract: The cpn60 barcode sequence is established as an informative target for microbial species identification. Applications of cpn60 barcode sequencing are supported by the availability of “universal” PCR primers for its amplification and a curated reference database of cpn60 sequences, cpnDB. A recent reclassification of lactobacilli involving the definition of 23 new genera provided an opportunity to update cpnDB and to determine if the cpn60 barcode could be used for accurate identification of species consistent with the new framework. Analysis of 275 cpn60 sequences representing 258/269 of the validly named species in Lactobacillus, Paralactobacillus and the 23 newer genera showed that cpn60-based sequence relationships were consistent with the whole-genome-based phylogeny. Aligning or mapping full length barcode sequences or a 150 bp subsequence resulted in accurate and unambiguous species identification in almost all cases. Taken together, our results show that the combination of available reference sequence data, “universal” barcode amplification primers, and the inherent sequence diversity within the cpn60 barcode make it a useful target for the detection and identification of lactobacilli as defined by the latest taxonomic framework.

Significance and Impact of the Study: The genus Lactobacillus recently underwent a major reorganization resulting in the definition of 23 new genera. Lactobacilli are widespread in environmental and host-associated microbiomes and are exploited in food and biotechnology applications, making methods for their accurate identification desirable. Here we show that the combination of a reference sequence database, “universal” barcode amplification primers, and the inherent sequence diversity within the cpn60 barcode make it a useful target for the detection and identification of lactobacilli as defined by the latest taxonomic framework.

Using next generation sequencing of alpine plants to improve fecal metabarcoding diet analysis for Dall’s sheep

BMC Research Notes ◽

10.1186/s13104-021-05590-z ◽

2021 ◽

Vol 14 (1) ◽

Author(s):

Kelly E. Williams ◽

Damian M. Menning ◽

Eric J. Wald ◽

Sandra L. Talbot ◽

Kumi L. Rattenbury ◽

...

Keyword(s):

Sequence Data ◽

Vascular Plant ◽

Alpine Plants ◽

Diet Analysis ◽

Reference Sequence ◽

Reference Library ◽

Ovis Dalli ◽

Plant Animal Interactions ◽

Dall’s Sheep ◽

Northwestern North America

Objectives: Dall’s sheep (Ovis dalli dalli) are important herbivores in the mountainous ecosystems of northwestern North America, and recent declines in some populations have sparked concern. Our aim was to improve capabilities for fecal metabarcoding diet analysis of Dall’s sheep and other herbivores by contributing new sequence data for arctic and alpine plants. This expanded reference library will provide critical reference sequence data that will facilitate metabarcoding diet analysis of Dall’s sheep and thus improve understanding of plant-animal interactions in a region undergoing rapid climate change.

Data description: We provide sequences for the chloroplast rbcL gene of 16 arctic-alpine vascular plant species that are known to comprise the diet of Dall’s sheep. These sequences contribute to a growing reference library that can be used in diet studies of arctic herbivores.

Haplotype-aware variant calling enables high accuracy in nanopore long-reads using deep neural networks

10.1101/2021.03.04.433952 ◽

2021 ◽

Author(s):

Kishwar Shafin ◽

Trevor Pesout ◽

Pi-Chuan Chang ◽

Maria Nattestad ◽

Alexey Kolesnikov ◽

...

Keyword(s):

De Novo ◽

Sequence Data ◽

Variant Calling ◽

High Accuracy ◽

Superior Performance ◽

Read Length ◽

Single Nucleotide Variants ◽

Single Nucleotide ◽

Short Read ◽

Long Read

Long-read sequencing has the potential to transform variant detection by reaching currently difficult-to-map regions and routinely linking together adjacent variations to enable read-based phasing. Third-generation nanopore sequence data has demonstrated a long read length, but current interpretation methods for its novel pore-based signal have unique error profiles, making accurate analysis challenging. Here, we introduce a haplotype-aware variant calling pipeline, PEPPER-Margin-DeepVariant, that produces state-of-the-art variant calling results with nanopore data. We show that our nanopore-based method outperforms the short-read-based single nucleotide variant identification method at the whole-genome scale and produces high-quality single nucleotide variants in segmental duplications and low-mappability regions where short-read-based genotyping fails. We show that our pipeline can provide highly contiguous phase blocks across the genome with nanopore reads, contiguously spanning between 85% and 92% of annotated genes across six samples. We also extend PEPPER-Margin-DeepVariant to PacBio HiFi data, providing an efficient solution with performance superior to the current WhatsHap-DeepVariant standard. Finally, we demonstrate de novo assembly polishing methods that use nanopore and PacBio HiFi reads to produce diploid assemblies with high accuracy (Q35+ nanopore-polished and Q40+ PacBio-HiFi-polished).

Information theoretic alignment free variant calling

PeerJ Computer Science ◽

10.7717/peerj-cs.71 ◽

2016 ◽

Vol 2 ◽

pp. e71

Author(s):

Justin Bedo ◽

Benjamin Goudey ◽

Jeremy Wazny ◽

Zeyu Zhou

Keyword(s):

Sequence Data ◽

Multinomial Distribution ◽

Variant Calling ◽

Whole Genome Sequence ◽

Reference Sequence ◽

Information Theoretic ◽

Learning Tasks ◽

Leibler Divergence ◽

Suitable Reference ◽

Mouse Dataset

While traditional methods for calling variants across whole genome sequence data rely on alignment to an appropriate reference sequence, alternative techniques are needed when a suitable reference does not exist. We present a novel alignment- and assembly-free variant calling method based on information theoretic principles, designed to detect variants that have strong statistical evidence for their ability to segregate samples in a given dataset. Our method uses the context surrounding a particular nucleotide to define variants. Given a set of reads, we model the probability of observing a given nucleotide conditioned on the surrounding prefix and suffix of length k as a multinomial distribution. We then estimate which of these contexts are stable intra-sample and varying inter-sample using a statistic based on the Kullback–Leibler divergence. The utility of the variant calling method was evaluated through analysis of a pair of bacterial datasets and a mouse dataset. We found that our variants are highly informative for supervised learning tasks, with performance similar to standard reference-based calls and another reference-free method (DiscoSNP++). Comparisons against reference-based calls showed our method was able to capture very similar population structure on the bacterial dataset. The algorithm’s focus on discriminatory variants makes it suitable for many common analysis tasks for organisms that are too diverse to be mapped back to a single reference sequence.
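
The context model described above can be sketched in a few lines. This is a simplified illustration, not the published method: it compares only two samples, uses an additive pseudocount (an assumption) so that the divergence stays finite, and does not test intra-sample stability. All function names are hypothetical.

```python
from collections import Counter, defaultdict
from math import log2

BASES = "ACGT"


def context_counts(reads, k):
    """Count the centre base observed for each (prefix, suffix) context of length k."""
    counts = defaultdict(Counter)
    for read in reads:
        for i in range(k, len(read) - k):
            context = (read[i - k:i], read[i + 1:i + 1 + k])
            counts[context][read[i]] += 1
    return counts


def smoothed_distribution(counter, pseudocount=1.0):
    """Multinomial estimate over A, C, G, T with an additive pseudocount (an assumption)."""
    total = sum(counter[b] for b in BASES) + pseudocount * len(BASES)
    return [(counter[b] + pseudocount) / total for b in BASES]


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def divergent_contexts(reads_a, reads_b, k):
    """Rank contexts seen in both samples by how much their centre-base distributions differ."""
    counts_a = context_counts(reads_a, k)
    counts_b = context_counts(reads_b, k)
    scores = {
        ctx: kl_divergence(
            smoothed_distribution(counts_a[ctx]),
            smoothed_distribution(counts_b[ctx]),
        )
        for ctx in counts_a.keys() & counts_b.keys()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy samples: sample_b carries a G -> T change at one position in half of its reads.
sample_a = ["ACGTACGT"] * 10
sample_b = ["ACGTACGT"] * 5 + ["ACTTACGT"] * 5
ranked = divergent_contexts(sample_a, sample_b, k=2)
```

In the toy data the top-ranked context is the one flanking the changed base, which is the behaviour the abstract relies on: contexts whose centre-base distribution differs between samples are candidate variants, with no reference sequence involved.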

EquCab3, an Updated Reference Genome for the Domestic Horse

10.1101/306928 ◽

2018 ◽

Cited By ~ 9

Author(s):

Theodore S. Kalbfleisch ◽

Edward S. Rice ◽

Michael S. DePriest ◽

Brian P. Walenz ◽

Matthew S. Hestand ◽

...

Keyword(s):

Reference Genome ◽

Reference Sequence ◽

Large Animal ◽

Domestic Horse ◽

Sequencing Technology ◽

Proximity Ligation ◽

Genomics Research ◽

Long Read ◽

Solid Foundation ◽

Work Done

Abstract: EquCab2, a high-quality reference genome for the domestic horse, was released in 2007. Since then, it has served as the foundation for nearly all genomic work done in equids. Recent advances in genomic sequencing technology and computational assembly methods have allowed scientists to improve reference assemblies of large animal and plant genomes in terms of contiguity and composition. In 2014, the equine genomics research community began a project to improve the reference sequence for the horse, building upon the solid foundation of EquCab2 and incorporating new short-read data, long-read data, and proximity ligation data. The result, EquCab3, is presented here. The count of non-N bases in the incorporated chromosomes is improved from 2.33 Gb in EquCab2 to 2.41 Gb in EquCab3. Contiguity has also been improved nearly 40-fold, with a contig N50 of 4.5 Mb, and scaffold contiguity has been enhanced to the point where all but one of the 32 chromosomes consists of a single scaffold.

Strategies and difficulties in assembling highly recombinogenic plant organelle genomes: a case study

10.7287/peerj.preprints.1599 ◽

2015 ◽

Author(s):

Concita Cantarella ◽

Rachele Tamburino ◽

Nunzia Scotti ◽

Teodoro Cardi ◽

Nunzio D'Agostino

Keyword(s):

Mitochondrial DNA ◽

Mitochondrial Genome ◽

De Novo ◽

Sequence Data ◽

Repeated Sequences ◽

Read Length ◽

Plant Mitochondrial Genome ◽

Sequencing Platform ◽

Organelle Genomes ◽

Long Read

Mitochondrial genomes in plants are larger and more complex than in other eukaryotes owing to their widely demonstrated recombinogenic nature. The mitochondrial DNA (mtDNA) is usually represented as a single circular map, the so-called master molecule. This molecule includes repeated sequences, some of which are able to recombine, generating sub-genomic molecules in various amounts, depending on the balance between their recombination and replication rates. Recent advances in DNA sequencing technology gave a huge boost to plant mitochondrial genome projects. Conventional approaches to mitochondrial genome sequencing involve extraction and enrichment of mitochondrial DNA, cloning, and sequencing. Large repeats and the dynamic mitochondrial genome organization complicate de novo sequence assembly from short reads. The PacBio RS long-read sequencing platform offers the promise of increased read length and unbiased genome coverage and thus the potential to produce genome sequence data of a finished quality (fewer gaps and longer contigs). However, recently published articles revealed that PacBio sequencing is still not sufficient to address mtDNA assembly-related issues. Here we present a preliminary hybrid assembly of a potato mtDNA based on both PacBio and Illumina reads and discuss the strategies and obstacles in assembling genomes containing repeated sequences that are recombinationally active and serve as a constant source of rearrangements.

Microbe-ID: an open source toolbox for microbial genotyping and species identification

PeerJ ◽

10.7717/peerj.2279 ◽

2016 ◽

Vol 4 ◽

pp. e2279 ◽

Cited By ~ 2

Author(s):

Javier F. Tabima ◽

Sydney E. Everhart ◽

Meredith M. Larsen ◽

Alexandra J. Weisberg ◽

Zhian N. Kamvar ◽

...

Keyword(s):

Open Source ◽

Species Identification ◽

Invasive Plant ◽

Reference Sequence ◽

Bootstrap Support ◽

Reference Database ◽

Bioinformatic Tools ◽

Link Type ◽

Minimum Spanning Network ◽

Analytical Tools

Development of tools to identify species, genotypes, or novel strains of invasive organisms is critical for monitoring emergence and implementing rapid response measures. Molecular markers, although critical to identifying species or genotypes, require bioinformatic tools for analysis. However, user-friendly analytical tools for fast identification are not readily available. To address this need, we created a web-based set of applications called Microbe-ID that allow for customizing a toolbox for rapid species identification and strain genotyping using any genetic markers of choice. Two components of Microbe-ID, named Sequence-ID and Genotype-ID, implement species and genotype identification, respectively. Sequence-ID allows identification of species by using BLAST to query sequences for any locus of interest against a custom reference sequence database. Genotype-ID allows placement of an unknown multilocus marker in either a minimum spanning network or dendrogram with bootstrap support from a user-created reference database. Microbe-ID can be used for identification of any organism based on nucleotide sequences or any molecular marker type and several examples are provided. We created a public website for demonstration purposes called Microbe-ID (microbe-id.org) and provided a working implementation for the genus Phytophthora (phytophthora-id.org). In Phytophthora-ID, the Sequence-ID application allows identification based on ITS or cox spacer sequences. Genotype-ID groups individuals into clonal lineages based on simple sequence repeat (SSR) markers for the two invasive plant pathogen species P. infestans and P. ramorum. All code is open source and available on github and CRAN. Instructions for installation and use are provided at https://github.com/grunwaldlab/Microbe-ID.

The current approaches to the study of algae: DNA barcoding and DNA taxonomy

Issues of modern algology (Вопросы современной альгологии) ◽

10.33624/2311-0147-2021-2(26)-124-130 ◽

2021 ◽

pp. 124-130

Author(s):

Anna D. Temraleeva ◽

Elena S. Krivina ◽

Yury S. Bukin

Keyword(s):

Phylogenetic Analysis ◽

DNA Sequencing ◽

DNA Barcoding ◽

Species Identification ◽

Green Algae ◽

Molecular Genetic ◽

Algal Species ◽

Species Boundaries ◽

Sequencing Technology ◽

DNA Taxonomy

The realization that algal species cannot be distinguished on the basis of morphological features came with the development of DNA sequencing technology, which is now a necessary tool for defining species boundaries and testing traditional species concepts. The paper discusses popular approaches to species identification (DNA barcoding) and to the description of new species and revision of known species (DNA taxonomy) using molecular genetic methods. The requirements and limitations of these approaches are given, as well as examples of phylogenetic analysis of green algae from the Moewusinia and Parachlorella clades, including the genus Micractinium.

Robust long-read native DNA sequencing using the ONT CsgG Nanopore system

Wellcome Open Research ◽

10.12688/wellcomeopenres.11246.1 ◽

2017 ◽

Vol 2 ◽

pp. 23 ◽

Cited By ~ 12

Author(s):

Jean-Michel Carter ◽

Shobbir Hussain

Keyword(s):

DNA Sequencing ◽

Cancer Cell Line ◽

Read Length ◽

Nanopore Sequencing ◽

Computational Tools ◽

Practical Applications ◽

Oxford Nanopore ◽

Sequencing Method ◽

Long Read ◽

Oxford Nanopore Technologies

Background: The ability to obtain long read lengths during DNA sequencing has several potentially important practical applications. Especially long read lengths have been reported using the Nanopore sequencing method, currently commercially available from Oxford Nanopore Technologies (ONT). However, early reports have demonstrated only limited levels of combined throughput and sequence accuracy. Recently, ONT released a new CsgG pore sequencing system as well as a 250 b/s translocation chemistry with potential for improvements. Methods: We made use of such components on ONT’s miniature ‘MinION’ device and sequenced native genomic DNA obtained from the near-haploid cancer cell line HAP1. Analysis of our data was performed utilising recently described computational tools tailored for nanopore/long-read sequencing outputs, and here we present our key findings. Results: From a single sequencing run, we obtained ~240,000 high-quality mapped reads, comprising a total of ~2.3 billion bases. A mean read length of 9.6 kb and an N50 of ~17 kb were achieved, while sequences mapped to the reference with a mean identity of 85%. Notably, we obtained ~68X coverage of the mitochondrial genome and were able to achieve a mean consensus identity of 99.8% for sequenced mtDNA reads. Conclusions: With improved sequencing chemistries already released and higher-throughput instruments in the pipeline, this early study suggests that ONT CsgG-based sequencing may be a useful option for potential practical long-read applications.
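
The read-length summaries quoted in the results can be reproduced from a list of read lengths. The sketch below uses the usual definition of N50 (the length L such that reads of length L or longer contain at least half of all sequenced bases); the function names and the toy lengths are assumptions for illustration and are not the study's data.

```python
def mean_length(lengths):
    """Arithmetic mean of the read lengths."""
    return sum(lengths) / len(lengths)


def n50(lengths):
    """Length L such that reads of length >= L contain at least half of all bases."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    half_total = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


# Toy read lengths, invented for the example (30 bases in total).
toy_lengths = [2, 3, 4, 5, 6, 10]
toy_mean = mean_length(toy_lengths)
toy_n50 = n50(toy_lengths)

# Consistency check on the rounded figures reported in the abstract:
# ~2.3 billion bases over ~240,000 reads is roughly 9.6 kb per read.
approx_mean_from_abstract = 2.3e9 / 240_000
```

Because N50 weights each read by the bases it contributes, it sits above the mean whenever a few long reads carry much of the yield, which is why the abstract reports both figures.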

Improvements in the sequencing and assembly of plant genomes

Gigabyte ◽

10.46471/gigabyte.24 ◽

2021 ◽

Vol 2021 ◽

pp. 1-10 ◽

Cited By ~ 1

Author(s):

Priyanka Sharma ◽

Othman Al-Dossary ◽

Bader Alsubaie ◽

Ibrahim Al-Mssallem ◽

Onkar Nath ◽

...

Keyword(s):

Sequence Data ◽

Linear Increase ◽

Persea Americana ◽

Plant Genome ◽

Sequencing Technology ◽

Genome Coverage ◽

Plant Genomes ◽

Oxford Nanopore ◽

Long Read ◽

Using Data

Advances in DNA sequencing have made it easier to sequence and assemble plant genomes. Here, we extend an earlier study, and compare recent methods for long read sequencing and assembly. Updated Oxford Nanopore Technology software improved assemblies. Using more accurate sequences produced by repeated sequencing of the same molecule (Pacific Biosciences HiFi) resulted in less fragmented assembly of sequencing reads. Using data for increased genome coverage resulted in longer contigs, but reduced total assembly length and improved genome completeness. The original model species, Macadamia jansenii, was also compared with three other Macadamia species, as well as avocado (Persea americana) and jojoba (Simmondsia chinensis). In these angiosperms, increasing sequence data volumes caused a linear increase in contig size, decreased assembly length and further improved already high completeness. Differences in genome size and sequence complexity influenced the success of assembly. Advances in long read sequencing technology continue to improve plant genome sequencing and assembly. However, results were improved by greater genome coverage, with the amount needed to achieve a particular level of assembly being species dependent.
