Robust taxonomic classification of uncharted microbial sequences and bins with CAT and BAT

Genome Biology ◽

10.1186/s13059-019-1817-x ◽

2019 ◽

Vol 20 (1) ◽

Cited By ~ 26

Author(s):

F. A. Bastiaan von Meijenfeldt ◽

Ksenia Arkhipova ◽

Diego D. Cambuy ◽

Felipe H. Coutinho ◽

Bas E. Dutilh

Keyword(s):

DNA Sequences ◽

De Novo ◽

Taxonomic Classification ◽

Classification Method ◽

Reference Database ◽

Annotation Tool ◽

Multiple Signals

Abstract Current-day metagenomics analyses increasingly involve de novo taxonomic classification of long DNA sequences and metagenome-assembled genomes. Here, we show that the conventional best-hit approach often leads to classifications that are too specific, especially when the sequences represent novel deep lineages. We present a classification method that integrates multiple signals to classify sequences (Contig Annotation Tool, CAT) and metagenome-assembled genomes (Bin Annotation Tool, BAT). Classifications are automatically made at low taxonomic ranks if closely related organisms are present in the reference database and at higher ranks otherwise. The result is a high classification precision even for sequences from considerably unknown organisms.
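The contrast between a best-hit call and a rank-adaptive call can be illustrated with a minimal sketch. This is not the CAT/BAT implementation: the function name `classify_by_support`, the `min_support` threshold, and the example scores are assumptions made purely for illustration. The idea shown is that each homology hit votes for its whole lineage, and the reported taxon is the deepest one that still gathers a majority of the total score.

```python
from collections import defaultdict


def classify_by_support(hit_lineages, min_support=0.5):
    """Return the deepest lineage prefix backed by more than `min_support`
    of the summed hit scores.

    hit_lineages: list of (lineage, score) pairs, where lineage is a tuple
    ordered from the root to the most specific taxon (hypothetical format).
    """
    total = sum(score for _, score in hit_lineages)
    if total <= 0:
        return ()

    # Every hit supports each ancestor on its lineage, not only its tip.
    support = defaultdict(float)
    for lineage, score in hit_lineages:
        for depth in range(1, len(lineage) + 1):
            support[lineage[:depth]] += score

    # With a threshold of at least 0.5, qualifying prefixes form a single
    # chain, so the longest qualifying prefix is unambiguous.
    best = ()
    for prefix, summed in support.items():
        if summed / total > min_support and len(prefix) > len(best):
            best = prefix
    return best


# Illustrative, made-up scores: the top hit is Escherichia, but no genus
# gathers a majority, so the call falls back to the phylum.
hits = [
    (("Bacteria", "Proteobacteria", "Escherichia"), 100.0),
    (("Bacteria", "Proteobacteria", "Salmonella"), 90.0),
    (("Bacteria", "Firmicutes", "Bacillus"), 20.0),
]
print(classify_by_support(hits))
```

When one genus dominates the evidence, the same function returns a genus-level call, which mirrors the behaviour the abstract describes: low ranks when close relatives are in the database, higher ranks otherwise.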


Contig annotation tool CAT robustly classifies assembled metagenomic contigs and long sequences

10.1101/072868 ◽

2016 ◽

Cited By ~ 13

Author(s):

Diego D. Cambuy ◽

Felipe H. Coutinho ◽

Bas E. Dutilh

Keyword(s):

Single Molecule ◽

DNA Sequences ◽

Taxonomic Classification ◽

Annotation Tool ◽

Single Molecule Sequencing ◽

Short Read ◽

Long Read ◽

Micro Organisms ◽

Taxonomic Annotation

Abstract In modern-day metagenomics, there is an increasing need for robust taxonomic annotation of long DNA sequences from unknown micro-organisms. Long metagenomic sequences may be derived from assembly of short-read metagenomes, or from long-read single molecule sequencing. Here we introduce CAT, a pipeline for robust taxonomic classification of long DNA sequences. We show that CAT correctly classifies contigs at different taxonomic levels, even in simulated metagenomic datasets that are very distantly related to the sequences in the database. CAT is implemented in Python and the required scripts can be freely downloaded from GitHub.


Taxonomic identification from metagenomic and metabarcoding data using any genetic marker

10.1101/253377 ◽

2018 ◽

Author(s):

Johan Bengtsson-Palme ◽

Rodney T. Richardson ◽

Marco Meola ◽

Christian Wurzbacher ◽

Émilie D. Tremblay ◽

...

Keyword(s):

Genetic Marker ◽

DNA Sequences ◽

Sequence Data ◽

Taxonomic Diversity ◽

Taxonomic Classification ◽

Taxonomic Identification ◽

Link Type

Correct taxonomic identification of DNA sequences is central to studies of biodiversity using both shotgun metagenomic and metabarcoding approaches. However, there is no genetic marker that gives sufficient performance across all the biological kingdoms, hampering studies of taxonomic diversity in many groups of organisms. We here present a major update to Metaxa2 (http://microbiology.se/software/metaxa2/) that enables the use of any genetic marker for taxonomic classification of metagenome and amplicon sequence data.


Assessing alignment-based taxonomic classification of ancient microbial DNA

10.7287/peerj.preprints.27166v1 ◽

2018 ◽

Author(s):

Raphael Eisenhofer ◽

Laura Susan Weyrich

Keyword(s):

Ancient DNA ◽

DNA Sequences ◽

Random Sequence ◽

Taxonomic Classification ◽

Metagenomic Data ◽

Data Sets ◽

Protein Alignments ◽

Microbial DNA ◽

DNA Characteristics

The field of paleomicrobiology—the study of ancient microorganisms—is rapidly growing due to recent methodological and technological advancements. It is now possible to obtain vast quantities of DNA data from ancient specimens in a high-throughput manner and use this information to investigate the dynamics and evolution of past microbial communities. However, we still know very little about how the characteristics of ancient DNA influence our ability to accurately assign microbial taxonomies (i.e. identify species) within ancient metagenomic samples. Here, we use both simulated and published metagenomic data sets to investigate how ancient DNA characteristics affect alignment-based taxonomic classification. We find that nucleotide-to-nucleotide, rather than nucleotide-to-protein, alignments are preferable when assigning taxonomies to DNA fragment lengths routinely identified within ancient specimens (<60 bp). We determine that deamination (a form of ancient DNA damage) and random sequence substitutions corresponding to ~100,000 years of genomic divergence minimally impact alignment-based classification. We also test four different reference databases and find that database choice can significantly bias the results of alignment-based taxonomic classification in ancient metagenomic studies. Finally, we perform a reanalysis of previously published ancient dental calculus data, increasing the number of microbial DNA sequences assigned taxonomically by an average of 64.2-fold and identifying microbial species previously unidentified in the original study. Overall, this study enhances our understanding of how ancient DNA characteristics influence alignment-based taxonomic classification of ancient microorganisms and provides recommendations for future paleomicrobiological studies.


BERTax: taxonomic classification of DNA sequences with Deep Neural Networks

10.1101/2021.07.09.451778 ◽

2021 ◽

Author(s):

Florian Mock ◽

Fleming Kretschmer ◽

Anton Kriese ◽

Sebastian Böcker ◽

Manja Marz

Keyword(s):

Language Processing ◽

DNA Sequences ◽

Information Gain ◽

Taxonomic Classification ◽

Training Data ◽

Misclassification Rate ◽

Genomic Sequences ◽

Similar Species ◽

Common Task

Taxonomic classification, i.e., the identification and assignment to groups of biological organisms with the same origin and characteristics, is a common task in genetics. Nowadays, taxonomic classification is mainly based on genome similarity search to large genome databases. In this process, the classification quality depends heavily on the database since representative relatives have to be known already. Many genomic sequences cannot be classified at all or only with a high misclassification rate. Here we present BERTax, a program that uses a deep neural network to precisely classify the superkingdom, phylum, and genus of DNA sequences taxonomically without the need for a known representative relative from a database. For this, BERTax uses the natural language processing model BERT trained to represent DNA. We show BERTax to be at least on par with the state-of-the-art approaches when taxonomically similar species are part of the training data. In case of an entirely novel organism, however, BERTax clearly outperforms any existing approach. Finally, we show that BERTax can also be combined with database approaches to further increase the prediction quality. Since BERTax is not based on homologous entries in databases, it allows precise taxonomic classification of a broader range of genomic sequences. This leads to a higher number of correctly classified sequences and thus increases the overall information gain.


Construction & assessment of a unified curated reference database for improving the taxonomic classification of bacteria using 16S rRNA sequence data

The Indian Journal of Medical Research ◽

10.4103/ijmr.ijmr_220_18 ◽

2020 ◽

Vol 151 (1) ◽

pp. 93

Author(s):

Rakesh Aggarwal ◽

Shikha Agnihotry ◽

Aditya N. Sarangi

Keyword(s):

16S rRNA ◽

Sequence Data ◽

Taxonomic Classification ◽

Reference Database ◽

rRNA Sequence ◽

16S rRNA Sequence


Assessing alignment-based taxonomic classification of ancient microbial DNA

10.7287/peerj.preprints.27166 ◽

2018 ◽

Author(s):

Raphael Eisenhofer ◽

Laura Susan Weyrich

Keyword(s):

Ancient DNA ◽

DNA Sequences ◽

Random Sequence ◽

Taxonomic Classification ◽

Metagenomic Data ◽

Data Sets ◽

Protein Alignments ◽

Microbial DNA ◽

DNA Characteristics

The field of paleomicrobiology—the study of ancient microorganisms—is rapidly growing due to recent methodological and technological advancements. It is now possible to obtain vast quantities of DNA data from ancient specimens in a high-throughput manner and use this information to investigate the dynamics and evolution of past microbial communities. However, we still know very little about how the characteristics of ancient DNA influence our ability to accurately assign microbial taxonomies (i.e. identify species) within ancient metagenomic samples. Here, we use both simulated and published metagenomic data sets to investigate how ancient DNA characteristics affect alignment-based taxonomic classification. We find that nucleotide-to-nucleotide, rather than nucleotide-to-protein, alignments are preferable when assigning taxonomies to DNA fragment lengths routinely identified within ancient specimens (<60 bp). We determine that deamination (a form of ancient DNA damage) and random sequence substitutions corresponding to ~100,000 years of genomic divergence minimally impact alignment-based classification. We also test four different reference databases and find that database choice can significantly bias the results of alignment-based taxonomic classification in ancient metagenomic studies. Finally, we perform a reanalysis of previously published ancient dental calculus data, increasing the number of microbial DNA sequences assigned taxonomically by an average of 64.2-fold and identifying microbial species previously unidentified in the original study. Overall, this study enhances our understanding of how ancient DNA characteristics influence alignment-based taxonomic classification of ancient microorganisms and provides recommendations for future paleomicrobiological studies.


Assessing alignment-based taxonomic classification of ancient microbial DNA

PeerJ ◽

10.7717/peerj.6594 ◽

2019 ◽

Vol 7 ◽

pp. e6594 ◽

Cited By ~ 5

Author(s):

Raphael Eisenhofer ◽

Laura Susan Weyrich

Keyword(s):

Ancient DNA ◽

DNA Sequences ◽

Random Sequence ◽

Taxonomic Classification ◽

Metagenomic Data ◽

Data Sets ◽

Protein Alignments ◽

Microbial DNA ◽

DNA Characteristics

The field of palaeomicrobiology—the study of ancient microorganisms—is rapidly growing due to recent methodological and technological advancements. It is now possible to obtain vast quantities of DNA data from ancient specimens in a high-throughput manner and use this information to investigate the dynamics and evolution of past microbial communities. However, we still know very little about how the characteristics of ancient DNA influence our ability to accurately assign microbial taxonomies (i.e. identify species) within ancient metagenomic samples. Here, we use both simulated and published metagenomic data sets to investigate how ancient DNA characteristics affect alignment-based taxonomic classification. We find that nucleotide-to-nucleotide, rather than nucleotide-to-protein, alignments are preferable when assigning taxonomies to short DNA fragment lengths routinely identified within ancient specimens (<60 bp). We determine that deamination (a form of ancient DNA damage) and random sequence substitutions corresponding to ∼100,000 years of genomic divergence minimally impact alignment-based classification. We also test four different reference databases and find that database choice can significantly bias the results of alignment-based taxonomic classification in ancient metagenomic studies. Finally, we perform a reanalysis of previously published ancient dental calculus data, increasing the number of microbial DNA sequences assigned taxonomically by an average of 64.2-fold and identifying microbial species previously unidentified in the original study. Overall, this study enhances our understanding of how ancient DNA characteristics influence alignment-based taxonomic classification of ancient microorganisms and provides recommendations for future palaeomicrobiological studies.


SprayNPray: user-friendly taxonomic profiling of genome and metagenome contigs

10.1101/2021.07.17.452725 ◽

2021 ◽

Author(s):

Arkadiy I Garber ◽

Catherine R Armbruster ◽

Stella E Lee ◽

Vaughn S Cooper ◽

Jennifer M Bomberger ◽

...

Keyword(s):

GC Content ◽

Taxonomic Classification ◽

Reference Database ◽

Taxonomic Profiling ◽

Spot Check ◽

Domains Of Life ◽

Multiple Domains ◽

Multiple Metrics ◽

User Friendly

Shotgun sequencing of cultured microbial isolates/individual eukaryotes (whole-genome sequencing) and microbial communities (metagenomics) has become commonplace in biology. Very often, sequenced samples encompass organisms spanning multiple domains of life, necessitating increasingly elaborate software for accurate taxonomic classification of assembled sequences. While many software tools for taxonomic classification exist, SprayNPray offers a quick and user-friendly, semi-automated approach, allowing users to separate contigs by taxonomy (and other metrics) of interest. Easy installation, usage, and intuitive output, which is amenable to visual inspection and/or further computational parsing, will reduce barriers for biologists beginning to analyze genomes and metagenomes. This approach can be used for broad-level overviews, preliminary analyses, or as a supplement to other taxonomic classification or binning software. SprayNPray profiles contigs using multiple metrics, including closest homologs from a user-specified reference database, gene density, read coverage, GC content, tetranucleotide frequency, and codon-usage bias. The output from this software is designed to allow users to spot-check metagenome-assembled genomes, identify and remove contigs from putative contaminants in isolate assemblies, identify bacteria in eukaryotic assemblies (and vice versa), and identify possible horizontal gene transfer events.
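Two of the composition metrics named in the abstract above, GC content and tetranucleotide frequency, are simple to compute per contig. The sketch below is a generic illustration and not SprayNPray's own code; the function names and the choice to skip windows containing non-ACGT characters are assumptions.

```python
from collections import Counter

_BASES = set("ACGT")


def gc_content(seq):
    """Fraction of unambiguous bases (A, C, G, T) that are G or C."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def tetranucleotide_freq(seq):
    """Relative frequency of each overlapping 4-mer in the sequence.

    Windows containing ambiguous characters such as N are skipped
    (an assumption of this sketch).
    """
    seq = seq.upper()
    counts = Counter(
        seq[i:i + 4]
        for i in range(len(seq) - 3)
        if set(seq[i:i + 4]) <= _BASES
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}


contig = "ATGCGCGCATATNNGGCC"  # made-up example sequence
print(round(gc_content(contig), 3))
print(len(tetranucleotide_freq(contig)))
```

Contigs from the same genome tend to share similar values for such metrics, which is why they are useful for spot-checking bins alongside homology hits.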


Concatenation of paired-end reads improves taxonomic classification of amplicons for profiling microbial communities

BMC Bioinformatics ◽

10.1186/s12859-021-04410-2 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Daniel P. Dacey ◽

Frédéric J. J. Chain

Keyword(s):

Read Depth ◽

Taxonomic Composition ◽

Taxonomic Classification ◽

Read Length ◽

Reference Database ◽

Reference Databases ◽

Sequence Quality ◽

First Time ◽

Mock Communities

Abstract

Background: Taxonomic classification of genetic markers for microbiome analysis is affected by the numerous choices made from sample preparation to bioinformatics analysis. Paired-end read merging is routinely used to capture the entire amplicon sequence when the read ends overlap. However, the exclusion of unmerged reads from further analysis can result in underestimating the diversity in the sequenced microbial community and is influenced by bioinformatic processes such as read trimming and the choice of reference database. A potential solution to overcome this is to concatenate (join) reads that do not overlap and keep them for taxonomic classification. The use of concatenated reads can outperform taxonomic recovery from single-end reads, but it remains unclear how their performance compares to merged reads. Using various sequenced mock communities with different amplicons, read length, read depth, taxonomic composition, and sequence quality, we tested how merging and concatenating reads performed for genus recall and precision in bioinformatic pipelines combining different parameters for read trimming and taxonomic classification using different reference databases.

Results: The addition of concatenated reads to merged reads always increased pipeline performance. The top two performing pipelines both included read concatenation, with variable strengths depending on the mock community. The pipeline that combined merged and concatenated reads that were quality-trimmed performed best for mock communities with larger amplicons and higher average quality sequences. The pipeline that used length-trimmed concatenated reads outperformed quality trimming in mock communities with lower quality sequences but lost a significant amount of input sequences for taxonomic classification during processing. Genus-level classification was more accurate using the SILVA reference database compared to Greengenes.

Conclusions: Adding concatenated sequences that could not be merged to the merged sequences increased the performance of taxonomic classification. This was especially beneficial in mock communities with larger amplicons. We have shown for the first time, using an in-depth comparison of pipelines containing merged vs concatenated reads combined with different trimming parameters and reference databases, the potential advantages of concatenating sequences in improving resolution in microbiome investigations.
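Read concatenation, as opposed to overlap-based merging, can be sketched in a few lines. The abstract does not specify how the authors join the reads, so the details here are assumptions: the reverse read is reverse-complemented so both halves share one orientation, and an optional `spacer` (for example a run of N) marks the unsequenced gap. The function names are hypothetical.

```python
# Translation table for complementing DNA, including N and lowercase bases.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def concatenate_pair(forward, reverse, spacer=""):
    """Join a non-overlapping read pair into one sequence.

    The reverse read is reverse-complemented first so that the result
    reads in the forward orientation across both halves of the amplicon.
    """
    return forward + spacer + reverse_complement(reverse)


# Made-up reads for illustration.
print(concatenate_pair("ACGT", "AACG"))
print(concatenate_pair("ACGT", "AACG", spacer="N"))
```

Whether a spacer helps or hurts depends on the downstream classifier, which is one reason pipeline comparisons like the one reported above are needed.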
