Taxonomic identification from metagenomic and metabarcoding data using any genetic marker

Robust taxonomic classification of uncharted microbial sequences and bins with CAT and BAT

Genome Biology ◽

10.1186/s13059-019-1817-x ◽

2019 ◽

Vol 20 (1) ◽

Cited By ~ 26

Author(s):

F. A. Bastiaan von Meijenfeldt ◽

Ksenia Arkhipova ◽

Diego D. Cambuy ◽

Felipe H. Coutinho ◽

Bas E. Dutilh

Keyword(s):

Dna Sequences ◽

De Novo ◽

Taxonomic Classification ◽

Classification Method ◽

Reference Database ◽

Annotation Tool ◽

Multiple Signals

Abstract. Current-day metagenomics analyses increasingly involve de novo taxonomic classification of long DNA sequences and metagenome-assembled genomes. Here, we show that the conventional best-hit approach often leads to classifications that are too specific, especially when the sequences represent novel deep lineages. We present a classification method that integrates multiple signals to classify sequences (Contig Annotation Tool, CAT) and metagenome-assembled genomes (Bin Annotation Tool, BAT). Classifications are automatically made at low taxonomic ranks if closely related organisms are present in the reference database and at higher ranks otherwise. The result is a high classification precision even for sequences from considerably unknown organisms.
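
The contrast with best-hit classification can be illustrated with a generic lowest-common-ancestor (LCA) rule. The sketch below is not the CAT/BAT algorithm, which the abstract does not spell out; it is a minimal illustration that assumes hits arrive as (lineage, bit score) pairs, and the function names and the `window` parameter are hypothetical.

```python
from typing import List, Tuple

# Hypothetical root-first lineages; a real tool would read them from a taxonomy database.
Lineage = Tuple[str, ...]


def lowest_common_ancestor(lineages: List[Lineage]) -> Lineage:
    """Return the longest root-first prefix shared by all lineages."""
    if not lineages:
        return ()
    shared = []
    for ranks in zip(*lineages):
        if all(r == ranks[0] for r in ranks):
            shared.append(ranks[0])
        else:
            break
    return tuple(shared)


def classify(hits: List[Tuple[Lineage, float]], window: float = 0.1) -> Lineage:
    """Classify a sequence from (lineage, bit score) hits.

    Every hit scoring within `window` (a fraction) of the top score is kept,
    and the LCA of the kept lineages is returned. `window` is an assumed
    parameter for this sketch, not a documented CAT/BAT option.
    """
    if not hits:
        return ()
    top = max(score for _, score in hits)
    kept = [lineage for lineage, score in hits if score >= (1 - window) * top]
    return lowest_common_ancestor(kept)


# A close relative in the database dominates, so the call is made at a low rank.
close_hits = [
    (("Bacteria", "Proteobacteria", "Escherichia"), 950.0),
    (("Bacteria", "Proteobacteria", "Escherichia"), 940.0),
    (("Bacteria", "Firmicutes", "Bacillus"), 300.0),
]
# Only weak, conflicting hits exist, so the call falls back to a high rank.
novel_hits = [
    (("Bacteria", "Proteobacteria", "Escherichia"), 210.0),
    (("Bacteria", "Firmicutes", "Bacillus"), 200.0),
]
print(classify(close_hits))
print(classify(novel_hits))
```

A best-hit rule would assign the second example to Escherichia; the LCA rule stops at the rank where the comparable hits still agree.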

Evaluating and optimizing the performance of software commonly used for the taxonomic classification of DNA metabarcoding sequence data

Molecular Ecology Resources ◽

10.1111/1755-0998.12628 ◽

2016 ◽

Vol 17 (4) ◽

pp. 760-769 ◽

Cited By ~ 21

Author(s):

Rodney T. Richardson ◽

Johan Bengtsson-Palme ◽

Reed M. Johnson

Keyword(s):

Sequence Data ◽

Taxonomic Classification ◽

Dna Metabarcoding

Contig annotation tool CAT robustly classifies assembled metagenomic contigs and long sequences

10.1101/072868 ◽

2016 ◽

Cited By ~ 13

Author(s):

Diego D. Cambuy ◽

Felipe H. Coutinho ◽

Bas E. Dutilh

Keyword(s):

Single Molecule ◽

Dna Sequences ◽

Taxonomic Classification ◽

Annotation Tool ◽

Single Molecule Sequencing ◽

Short Read ◽

Long Read ◽

Micro Organisms ◽

Taxonomic Annotation

Abstract. In modern-day metagenomics, there is an increasing need for robust taxonomic annotation of long DNA sequences from unknown micro-organisms. Long metagenomic sequences may be derived from assembly of short-read metagenomes, or from long-read single molecule sequencing. Here we introduce CAT, a pipeline for robust taxonomic classification of long DNA sequences. We show that CAT correctly classifies contigs at different taxonomic levels, even in simulated metagenomic datasets that are very distantly related to the sequences in the database. CAT is implemented in Python and the required scripts can be freely downloaded from GitHub.

FOCUS2: agile and sensitive classification of metagenomics data using a reduced database

10.1101/046425 ◽

2016 ◽

Cited By ~ 2

Author(s):

Genivaldo Gueiros Z. Silva ◽

Bas E. Dutilh ◽

Robert A. Edwards

Keyword(s):

Microbial Community ◽

Dna Sequences ◽

Computational Method ◽

Environmental Research ◽

Supplementary Information ◽

Sequence Classification ◽

Computationally Efficient ◽

Link Type ◽

Metagenomics Data

Abstract. Summary: Metagenomics approaches rely on identifying the presence of organisms in the microbial community from a set of unknown DNA sequences. Sequence classification has valuable applications in multiple important areas of medical and environmental research. Here we introduce FOCUS2, an update of the previously published computational method FOCUS. FOCUS2 was tested with 10 simulated and 543 real metagenomes, demonstrating that the program is more sensitive, faster, and more computationally efficient than existing methods. Availability: The Python implementation is freely available at https://edwards.sdsu.edu/FOCUS2. Supplementary information: available at Bioinformatics online.

Two new asexual genera and six new asexual species in the family Microthyriaceae (Dothideomycetes, Ascomycota) from China

MycoKeys ◽

10.3897/mycokeys.85.70829 ◽

2021 ◽

Vol 85 ◽

pp. 1-30

Author(s):

Min Qiao ◽

Hua Zheng ◽

Ji-Shu Guo ◽

Rafael F. Castañeda-Ruiz ◽

Jian-Ping Xu ◽

...

Keyword(s):

New Taxa ◽

Dna Sequences ◽

Sequence Data ◽

Southern China ◽

Phylogenetic Analyses ◽

Large Subunit ◽

Aquatic Hyphomycetes ◽

Internal Transcribed Spacers ◽

The Family

The family Microthyriaceae is represented by relatively few mycelial cultures and DNA sequences; as a result, the taxonomy and classification of this group of organisms remain poorly understood. During the investigation of the diversity of aquatic hyphomycetes from southern China, several isolates were collected. These isolates were cultured and sequenced, and a BLAST search of their LSU sequences against data in GenBank revealed that the closest related taxa are in the genus Microthyrium. Phylogenetic analyses, based on the combined sequence data from the internal transcribed spacers (ITS) and the large subunit (LSU), revealed that these isolates represent eight new taxa in Microthyriaceae, including two new genera, Antidactylaria gen. nov. and Isthmomyces gen. nov., and six new species, Antidactylaria minifimbriata sp. nov., Isthmomyces oxysporus sp. nov., I. dissimilis sp. nov., I. macrosporus sp. nov., Triscelophorus anisopterioideus sp. nov. and T. sinensis sp. nov. These new taxa are described, illustrated for their morphologies and compared with similar taxa. In addition, two new combinations are proposed in this family.

Optimizing taxonomic classification of marker gene amplicon sequences

10.7287/peerj.preprints.3208v2 ◽

2018 ◽

Cited By ~ 4

Author(s):

Nicholas A Bokulich ◽

Benjamin D Kaehler ◽

Jai Ram Rideout ◽

Matthew Dillon ◽

Evan Bolyen ◽

...

Keyword(s):

Machine Learning ◽

Sequence Data ◽

Marker Gene ◽

Parameter Tuning ◽

Operating Conditions ◽

Evaluation Framework ◽

Taxonomic Classification ◽

Consensus Methods ◽

Learning Classifier

Background: Taxonomic classification of marker-gene sequences is an important step in microbiome analysis. Results: We present q2-feature-classifier ( https://github.com/qiime2/q2-feature-classifier ), a QIIME 2 plugin containing several novel machine-learning and alignment-based taxonomy classifiers that meet or exceed the accuracy of existing methods for marker-gene amplicon sequence classification. We evaluated and optimized several commonly used taxonomic classification methods (RDP, BLAST, UCLUST) and several new methods (a scikit-learn naive Bayes machine-learning classifier, and alignment-based taxonomy consensus methods of VSEARCH, BLAST+, and SortMeRNA) for classification of marker-gene amplicon sequence data. Conclusions: Our results illustrate the importance of parameter tuning for optimizing classifier performance, and we make recommendations regarding parameter choices for a range of standard operating conditions. q2-feature-classifier and our evaluation framework, tax-credit, are both free, open-source, BSD-licensed packages available on GitHub.

GenTB: A user-friendly genome-based predictor for tuberculosis resistance powered by machine learning

10.1101/2021.03.27.437319 ◽

2021 ◽

Author(s):

Matthias I Gröschel ◽

Martin Owens ◽

Luca Freschi ◽

Roger Vargas ◽

Maximilian G Marin ◽

...

Keyword(s):

Public Health ◽

Dna Sequences ◽

Sequence Data ◽

Drug Susceptibility ◽

Control Input ◽

Genotypic Resistance ◽

Prediction Tools ◽

Link Type ◽

Resistance Prediction ◽

User Friendly

Abstract. Introduction: Multidrug-resistant Mycobacterium tuberculosis (Mtb) is a significant global public health threat. Genotypic resistance prediction from Mtb DNA sequences offers an alternative to laboratory-based drug-susceptibility testing. User-friendly and accurate resistance prediction tools are needed to enable public health and clinical practitioners to rapidly diagnose resistance and inform treatment regimens. Methods: We present Translational Genomics platform for Tuberculosis (GenTB), a web-based application to predict antibiotic resistance from next-generation sequence data. The user can choose between two potential predictors, a Random Forest (RF) classifier and a Wide and Deep Neural Network (WDNN) to predict phenotypic resistance to 13 and 10 anti-tuberculosis drugs, respectively. We benchmark GenTB’s predictive performance along with leading TB resistance prediction tools (Mykrobe and TB-Profiler) using a ground truth dataset of 20,408 isolates with laboratory-based drug susceptibility data. Results: All four tools reliably predicted resistance to first-line tuberculosis drugs but had varying performance for second-line drugs. The mean sensitivities for GenTB-RF and GenTB-WDNN across the nine shared drugs were 77.6% (95% CI 76.6 - 78.5%) and 75.4% (95% CI 74.5 - 76.4%), respectively, and marginally higher than the sensitivities of TB-Profiler at 74.4% (95% CI 73.4 - 75.3%) and Mykrobe at 71.9% (95% CI 70.9 - 72.9%). The higher sensitivities came at the expense of ≤1.5% lower specificity: Mykrobe 97.6% (95% CI 97.5 - 97.7%), TB-Profiler 96.9% (95% CI 96.7 to 97.0%), GenTB-WDNN 96.2% (95% CI 96.0 to 96.4%), and GenTB-RF 96.1% (95% CI 96.0 to 96.3%). Genotypic resistance sensitivity was 11% and 9% lower for isoniazid and rifampicin, respectively, on isolates sequenced at low depth (<10x across 95% of the genome), emphasizing the need to quality control input sequence data before prediction. We discuss differences between tools in reporting results to the user, including variants underlying the resistance calls and any novel or indeterminate variants. Conclusion: GenTB is an easy-to-use online tool to rapidly and accurately predict resistance to anti-tuberculosis drugs. GenTB can be accessed online at https://gentb.hms.harvard.edu, and the source code is available at https://github.com/farhat-lab/gentb-site.

Machine learning using intrinsic genomic signatures for rapid classification of novel pathogens: COVID-19 case study

10.1101/2020.02.03.932350 ◽

2020 ◽

Cited By ~ 10

Author(s):

Gurjit S. Randhawa ◽

Maximillian P.M. Soltysiak ◽

Hadi El Roz ◽

Camila P.E. de Souza ◽

Kathleen A. Hill ◽

...

Keyword(s):

Machine Learning ◽

Death Rate ◽

Genomic Sequence ◽

Sequence Data ◽

Rank Correlation ◽

Taxonomic Classification ◽

Supervised Machine Learning ◽

Biological Knowledge ◽

Alignment Free

Abstract. As of February 20, 2020, the 2019 novel coronavirus (renamed to COVID-19) spread to 30 countries with 2130 deaths and more than 75500 confirmed cases. COVID-19 is being compared to the infamous SARS coronavirus, which resulted, between November 2002 and July 2003, in 8098 confirmed cases worldwide with a 9.6% death rate and 774 deaths. Though COVID-19 has a death rate of 2.8% as of 20 February, the 75752 confirmed cases in a few weeks (December 8, 2019 to February 20, 2020) are alarming, with cases likely being under-reported given the comparatively longer incubation period. Such outbreaks demand elucidation of taxonomic classification and origin of the virus genomic sequence, for strategic planning, containment, and treatment. This paper identifies an intrinsic COVID-19 genomic signature and uses it together with a machine learning-based alignment-free approach for an ultra-fast, scalable, and highly accurate classification of whole COVID-19 genomes. The proposed method combines supervised machine learning with digital signal processing for genome analyses, augmented by a decision tree approach to the machine learning component, and a Spearman’s rank correlation coefficient analysis for result validation. These tools are used to analyze a large dataset of over 5000 unique viral genomic sequences, totalling 61.8 million bp. Our results support a hypothesis of a bat origin and classify COVID-19 as Sarbecovirus, within Betacoronavirus. Our method achieves high levels of classification accuracy and discovers the most relevant relationships among over 5,000 viral genomes within a few minutes, ab initio, using raw DNA sequence data alone, and without any specialized biological knowledge, training, gene or genome annotations. This suggests that, for novel viral and pathogen genome sequences, this alignment-free whole-genome machine-learning approach can provide a reliable real-time option for taxonomic classification.
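
To make the idea of an alignment-free comparison concrete, the sketch below counts k-mers in two sequences and compares the count vectors with Spearman's rank correlation. It is only an illustration: the abstract's genomic signature is built with digital signal processing, which is not reproduced here, and k-mer counts are a stand-in chosen for this example. The function names and the choice k = 3 are assumptions.

```python
from itertools import product
from typing import List, Sequence


def kmer_counts(seq: str, k: int = 3) -> List[int]:
    """Count every DNA k-mer in seq, reported in fixed lexicographic order."""
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    counts = [0] * len(index)
    for i in range(len(seq) - k + 1):
        pos = index.get(seq[i:i + k].upper())
        if pos is not None:  # skip k-mers containing N or other symbols
            counts[pos] += 1
    return counts


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            result[order[t]] = mean_rank
        i = j + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho, computed as the Pearson correlation of the ranks.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Toy sequences, invented for this example.
a = kmer_counts("ACGTACGTTGCAACGTAGCTAGCTAGGATCC")
b = kmer_counts("ACGTACGATGCAACGTAGCTAGCTTGGATCC")
print(spearman(a, b))
```

No alignment is computed at any point, which is what lets this kind of comparison scale to thousands of genomes.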

A Phylogenetic Investigation of the New Zealand Pteridaceae Ferns

10.26686/wgtn.16934449.v1 ◽

2021 ◽

Author(s):

Whitney L M Bouma

Keyword(s):

New Zealand ◽

Dna Sequences ◽

Phylogenetic Relationships ◽

Native Species ◽

Sequence Data ◽

Phylogenetic Analyses ◽

Morphological Characters ◽

Taxonomic Confusion ◽

Species Specific

The fern family Pteridaceae is among the largest fern families in New Zealand. It comprises 17 native species among five genera. Traditionally the classification of Pteridaceae was based on morphological characters. The advent of molecular technology now makes it possible to test these morphology-based classifications. The Pteridaceae has previously been subjected to phylogenetic analyses; however, representatives from New Zealand and the South Pacific have never been well represented in these studies. This thesis research aimed to investigate the phylogenetic relationships of the New Zealand Pteridaceae, as well as the phylogenetic relationships of the New Zealand species to their overseas relatives. The DNA sequences of several chloroplast loci (e.g. trnL-trnF locus, rps4 and rps4-trnS IGS, atpB, and rbcL) were determined and the phylogenetic relationships of the New Zealand Pteridaceae and several species-specific questions within the genera Pellaea and Adiantum were investigated. Results presented in this thesis confirm previously published phylogenetics of the Pteridaceae, which show the resolution of five major clades, i.e., the cryptogrammoids, ceratopteridoids, pteridoids, cheilanthoids, and adiantoids. The addition of the New Zealand species revealed possible South West Pacific groups formed by the respective genera, where New Zealand species were generally more closely related to one another than to overseas relatives. Within the New Zealand Pellaea, the analysis of the trnL-trnF locus sequence data showed that the morphologically intermediate plants P. aff. falcata, responsible for taxonomic confusion, were more closely related to P. rotundifolia than to P. falcata. Furthermore, the species collected on the Kermadec Islands, previously thought to be P. falcata, are genetically distinct from the Australian P. falcata and they could constitute a new species. Adiantum hispidulum, which is polymorphic for two different hair types that are used to distinguish them as different species, was also reinvestigated morphologically and molecularly. Morphological inspection of hairs revealed three hair types as opposed to the previously thought two, and furthermore, they correspond to three different trnL-trnF sequence haplotypes.

GenTB: A user-friendly genome-based predictor for tuberculosis resistance powered by machine learning

Genome Medicine ◽

10.1186/s13073-021-00953-4 ◽

2021 ◽

Vol 13 (1) ◽

Cited By ~ 1

Author(s):

Matthias I. Gröschel ◽

Martin Owens ◽

Luca Freschi ◽

Roger Vargas ◽

Maximilian G. Marin ◽

...

Keyword(s):

Public Health ◽

Dna Sequences ◽

Sequence Data ◽

Drug Susceptibility ◽

Control Input ◽

Genotypic Resistance ◽

Prediction Tools ◽

Link Type ◽

Resistance Prediction ◽

User Friendly

Abstract. Background: Multidrug-resistant Mycobacterium tuberculosis (Mtb) is a significant global public health threat. Genotypic resistance prediction from Mtb DNA sequences offers an alternative to laboratory-based drug-susceptibility testing. User-friendly and accurate resistance prediction tools are needed to enable public health and clinical practitioners to rapidly diagnose resistance and inform treatment regimens. Results: We present Translational Genomics platform for Tuberculosis (GenTB), a free and open web-based application to predict antibiotic resistance from next-generation sequence data. The user can choose between two potential predictors, a Random Forest (RF) classifier and a Wide and Deep Neural Network (WDNN) to predict phenotypic resistance to 13 and 10 anti-tuberculosis drugs, respectively. We benchmark GenTB’s predictive performance along with leading TB resistance prediction tools (Mykrobe and TB-Profiler) using a ground truth dataset of 20,408 isolates with laboratory-based drug susceptibility data. All four tools reliably predicted resistance to first-line tuberculosis drugs but had varying performance for second-line drugs. The mean sensitivities for GenTB-RF and GenTB-WDNN across the nine shared drugs were 77.6% (95% CI 76.6–78.5%) and 75.4% (95% CI 74.5–76.4%), respectively, and marginally higher than the sensitivities of TB-Profiler at 74.4% (95% CI 73.4–75.3%) and Mykrobe at 71.9% (95% CI 70.9–72.9%). The higher sensitivities came at the expense of ≤ 1.5% lower specificity: Mykrobe 97.6% (95% CI 97.5–97.7%), TB-Profiler 96.9% (95% CI 96.7 to 97.0%), GenTB-WDNN 96.2% (95% CI 96.0 to 96.4%), and GenTB-RF 96.1% (95% CI 96.0 to 96.3%). Averaged across the four tools, genotypic resistance sensitivity was 11% and 9% lower for isoniazid and rifampicin, respectively, on isolates sequenced at low depth (< 10× across 95% of the genome), emphasizing the need to quality control input sequence data before prediction. We discuss differences between tools in reporting results to the user, including variants underlying the resistance calls and any novel or indeterminate variants. Conclusions: GenTB is an easy-to-use online tool to rapidly and accurately predict resistance to anti-tuberculosis drugs. GenTB can be accessed online at https://gentb.hms.harvard.edu, and the source code is available at https://github.com/farhat-lab/gentb-site.
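
The sensitivity and specificity figures reported above are proportions taken from a confusion matrix against laboratory results. The sketch below shows how such values and a 95% interval could be computed. The abstract does not state which interval method the authors used, so the Wilson score interval here is an assumption, and the counts in the example are invented for illustration.

```python
import math
from typing import Tuple


def sensitivity(tp: int, fn: int) -> float:
    """Fraction of phenotypically resistant isolates predicted resistant."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Fraction of phenotypically susceptible isolates predicted susceptible."""
    return tn / (tn + fp)


def wilson_interval(successes: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%).

    Assumed method for this sketch; the source does not name one.
    """
    p = successes / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


# Invented counts: 80 true positives, 20 false negatives,
# 90 true negatives, 10 false positives.
print(sensitivity(80, 20), wilson_interval(80, 100))
print(specificity(90, 10), wilson_interval(90, 100))
```

Reporting both values side by side, as the abstract does, makes the trade-off visible: a tool that calls resistance more readily gains sensitivity and loses specificity.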
