Accessible, curated metagenomic data through ExperimentHub

10.1101/103085 ◽

2017 ◽

Cited By ~ 2

Author(s):

Edoardo Pasolli ◽

Lucas Schiffer ◽

Paolo Manghi ◽

Audrey Renson ◽

Valerie Obenchain ◽

...

Keyword(s):

Open Source Software ◽

Large Scale ◽

Human Microbiome ◽

Human Microbiome Project ◽

Disease Classification ◽

Metagenomic Data ◽

Sequencing Data ◽

Data Types ◽

Command Line Interface ◽

Functional Profiles

We present curatedMetagenomicData, a Bioconductor and command-line interface to thousands of metagenomic profiles from the Human Microbiome Project and other publicly available datasets, and ExperimentHub, a platform for convenient cloud-based distribution of data to the R desktop. The resource provides standardized per-participant metadata linked to bacterial, fungal, archaeal, and viral taxonomic abundances, as well as quantitative metabolic functional profiles. The datasets can be immediately analyzed in R or other software with a minimum of bioinformatic expertise and no preprocessing of data. We demonstrate identification of taxonomic/functional correlations, an investigation of gut “enterotypes”, and a comparison of the accuracy of disease classification from different data types. These documented analyses can be reproduced efficiently on a laptop, without the barriers of working with large-scale, raw sequencing data. The building and expansion of curatedMetagenomicData is based entirely on open source software and pipelines, to facilitate the addition of new microbiome datasets and methods.

Searching more genomic sequence with less memory for fast and accurate metagenomic profiling

10.1101/036681 ◽

2016 ◽

Author(s):

Shea N Gardner ◽

Sasha K Ames ◽

Maya B Gokhale ◽

Tom R Slezak ◽

Jonathan Allen

Keyword(s):

Large Scale ◽

Genomic Sequence ◽

Sequence Data ◽

Low Cost ◽

False Negative ◽

Human Microbiome ◽

Human Microbiome Project ◽

Metagenomic Data ◽

Reference Database ◽

Metagenomic Sequence

Software for rapid, accurate, and comprehensive microbial profiling of metagenomic sequence data on a desktop will play an important role in large-scale clinical use of metagenomic data. Here we describe LMAT-ML (Livermore Metagenomics Analysis Toolkit-Marker Library), which can be run with 24 GB of DRAM memory, an amount available on many clusters, or with 16 GB DRAM plus a 24 GB low-cost commodity flash drive (NVRAM), a cost-effective alternative for desktop or laptop users. We compared results from LMAT with five other rapid, low-memory tools for metagenome analysis for 131 Human Microbiome Project samples, and assessed discordant calls with BLAST. All the tools except LMAT-ML reported overly specific or incorrect species and strain resolution of reads that were in fact much more widely conserved across species, genera, and even families. Several of the tools misclassified reads from synthetic or vector sequence as microbial, or human reads as viral. We attribute the high numbers of false positive and false negative calls to a limited reference database with inadequate representation of known diversity. Our comparisons with real-world samples show that LMAT-ML is the only tool tested that classifies the majority of reads, and does so with high accuracy.

TaxiBGC: a Taxonomy-guided Approach for the Identification of Experimentally Verified Microbial Biosynthetic Gene Clusters in Shotgun Metagenomic Data

10.1101/2021.07.30.454505 ◽

2021 ◽

Author(s):

Utpal Bakshi ◽

Vinod K Gupta ◽

Aileen R Lee ◽

John M Davis ◽

Sriram Chandrasekaran ◽

...

Keyword(s):

Large Scale ◽

Human Microbiome ◽

Gene Clusters ◽

Human Microbiome Project ◽

Metagenomic Data ◽

Biosynthetic Gene ◽

Case Control Studies ◽

Biosynthetic Gene Clusters ◽

Host Interactions ◽

Microbiome Data

Biosynthetic gene clusters (BGCs) in microbial genomes encode for the production of bioactive secondary metabolites (SMs). Given the well-recognized importance of SMs in microbe-microbe and microbe-host interactions, the large-scale identification of BGCs from microbial metagenomes could offer novel functional insights into complex chemical ecology. Despite recent progress, currently available tools for predicting BGCs from shotgun metagenomes have several limitations, including the need for computationally demanding read-assembly and prediction of a narrow breadth of BGC classes. To overcome these limitations, we developed TaxiBGC (Taxonomy-guided Identification of Biosynthetic Gene Clusters), a computational pipeline for identifying experimentally verified BGCs in shotgun metagenomes by first pinpointing the microbial species likely to produce them. We show that our species-centric approach was able to identify BGCs in simulated metagenomes more accurately than by solely detecting BGC genes. By applying TaxiBGC on 5,423 metagenomes from the Human Microbiome Project and various case-control studies, we identified distinct BGC signatures of major human body sites and candidate stool-borne biomarkers for multiple diseases, including inflammatory bowel disease, colorectal cancer, and psychiatric disorders. In all, TaxiBGC demonstrates a significant advantage over existing techniques for systematically characterizing BGCs and inferring their SMs from microbiome data.

Sequence Comparison of Vaginolysin from Different Gardnerella Species

Pathogens ◽

10.3390/pathogens10020086 ◽

2021 ◽

Vol 10 (2) ◽

pp. 86

Author(s):

Erin M. Garcia ◽

Myrna G. Serrano ◽

Laahirie Edupuganti ◽

David J. Edwards ◽

Gregory A. Buck ◽

...

Keyword(s):

Amino Acid ◽

Human Microbiome ◽

Human Microbiome Project ◽

Distinct Species ◽

Vaginal Swab ◽

Metagenomic Data ◽

Gardnerella Vaginalis ◽

Vaginal Epithelial Cells ◽

Species Specific

Gardnerella vaginalis has recently been split into 13 distinct species. In this study, we tested the hypotheses that species-specific variations in the vaginolysin (VLY) amino acid sequence could influence the interaction between the toxin and vaginal epithelial cells and that VLY variation may be one factor that distinguishes less virulent or commensal strains from more virulent strains. This was assessed by bioinformatic analyses of publicly available Gardnerella spp. sequences and quantification of cytotoxicity and cytokine production from purified, recombinantly produced versions of VLY. After identifying conserved differences that could distinguish distinct VLY types, we analyzed metagenomic data from a cohort of female subjects from the Vaginal Human Microbiome Project to investigate whether these different VLY types exhibited any significant associations with symptoms or Gardnerella spp.-relative abundance in vaginal swab samples. While Type 1 VLY was most prevalent among the subjects and may be associated with increased reports of symptoms, subjects with Type 2 VLY dominant profiles exhibited increased relative Gardnerella spp. abundance. Our findings suggest that amino acid differences alter the interaction of VLY with vaginal keratinocytes, which may potentiate differences in bacterial vaginosis (BV) immunopathology in vivo.

Meta-Apo improves accuracy of 16S-amplicon-based prediction of microbiome function

BMC Genomics ◽

10.1186/s12864-020-07307-1 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Gongchao Jing ◽

Yufeng Zhang ◽

Wenzhi Cui ◽

Lu Liu ◽

Jian Xu ◽

...

Keyword(s):

16S rRNA ◽

Large Scale ◽

Low Cost ◽

Human Microbiome ◽

Amplicon Sequencing ◽

Training Sample ◽

rRNA Gene ◽

16S Amplicon Sequencing ◽

Cross Platform ◽

Functional Profiles

Background: Because they cost much less in experiment and computation than metagenomic whole-genome sequencing (WGS), 16S rRNA gene amplicons have been widely used for predicting the functional profiles of microbiomes, via software tools such as PICRUSt 2. However, due to potential PCR bias and gene profile variation among phylogenetically related genomes, functional profiles predicted from 16S amplicons may deviate from WGS-derived ones, producing misleading results. Results: Here we present Meta-Apo, which greatly reduces or even eliminates such deviation and thus yields much more consistent diversity patterns between the two approaches. Tests of Meta-Apo on > 5000 16S-rRNA amplicon human microbiome samples from 4 body sites showed that the deviation between the two strategies is significantly reduced by using only 15 WGS-amplicon training sample pairs. Moreover, Meta-Apo enables cross-platform functional comparison between WGS and amplicon samples, thus greatly improving 16S-based microbiome diagnosis; e.g., the accuracy of gingivitis diagnosis via 16S-derived functional profiles was elevated from 65 to 95% by WGS-based classification. Therefore, with the low cost of 16S-amplicon sequencing, Meta-Apo can produce a reliable, high-resolution view of microbiome function equivalent to that offered by shotgun WGS. Conclusions: This suggests that large-scale, function-oriented microbiome sequencing projects can probably benefit from the lower cost of the 16S-amplicon strategy, without sacrificing the precision in functional reconstruction that otherwise requires WGS. An optimized C++ implementation of Meta-Apo is available on GitHub (https://github.com/qibebt-bioinfo/meta-apo) under a GNU GPL license. It takes the functional profiles of a few paired WGS:16S-amplicon samples as training, and outputs the calibrated functional profiles for the much larger number of 16S-amplicon samples.
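
The train-then-calibrate workflow described in this abstract (learn from a few paired WGS and 16S-amplicon profiles, then adjust the remaining amplicon profiles) can be illustrated with a small Python sketch. This is not Meta-Apo's algorithm, which the abstract does not specify; it swaps in a simple per-function ratio calibration purely for illustration. The function names `fit_ratios` and `calibrate`, the function ids `K1` and `K2`, and all numbers are hypothetical.

```python
def fit_ratios(pairs):
    """Learn one scaling factor per function from paired profiles.

    `pairs` is a list of (wgs_profile, amplicon_profile) tuples; each
    profile is a dict mapping a function id to a relative abundance.
    This ratio model is an illustrative assumption, not Meta-Apo itself.
    """
    wgs_sum, amp_sum = {}, {}
    for wgs, amp in pairs:
        # Only functions observed in both members of a pair are used.
        for func in wgs.keys() & amp.keys():
            wgs_sum[func] = wgs_sum.get(func, 0.0) + wgs[func]
            amp_sum[func] = amp_sum.get(func, 0.0) + amp[func]
    return {f: wgs_sum[f] / amp_sum[f] for f in wgs_sum if amp_sum[f] > 0}


def calibrate(profile, ratios):
    """Rescale an amplicon-predicted profile, then renormalize to sum to 1."""
    # Functions never seen in training keep their original value (ratio 1.0).
    scaled = {f: v * ratios.get(f, 1.0) for f, v in profile.items()}
    total = sum(scaled.values())
    return {f: v / total for f, v in scaled.items()} if total > 0 else scaled


# Toy training pairs (hypothetical values): amplicon profiles under-report K1.
training_pairs = [
    ({"K1": 0.6, "K2": 0.4}, {"K1": 0.3, "K2": 0.7}),
    ({"K1": 0.5, "K2": 0.5}, {"K1": 0.25, "K2": 0.75}),
]
ratios = fit_ratios(training_pairs)
calibrated = calibrate({"K1": 0.2, "K2": 0.8}, ratios)
```

In this toy setup the learned ratio for `K1` is about 2, so the calibrated profile shifts abundance toward `K1` while still summing to 1.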

Utilizing the VirIdAl Pipeline to Search for Viruses in the Metagenomic Data of Bat Samples

Viruses ◽

10.3390/v13102006 ◽

2021 ◽

Vol 13 (10) ◽

pp. 2006

Author(s):

Anna Y Budkina ◽

Elena V Korneenko ◽

Ivan A Kotov ◽

Daniil A Kiselev ◽

Ilya V Artyushin ◽

...

Keyword(s):

Large Scale ◽

High Throughput Sequencing ◽

Metagenomic Data ◽

Sequencing Data ◽

Viral Pathogens ◽

Genomic Databases ◽

Bioinformatic Pipeline ◽

Viral Genomes ◽

Sequencing Technologies ◽

Viral Screening

According to various estimates, only a small percentage of existing viruses have been discovered, and even fewer are represented in genomic databases. High-throughput sequencing technologies are developing rapidly, enabling large-scale screening of various biological samples for the presence of pathogen-associated nucleotide sequences, but many organisms have yet to be assigned specific loci for identification. This problem particularly impedes viral screening, due to the vast heterogeneity of viral genomes. In this paper, we present a new bioinformatic pipeline, VirIdAl, for detecting and identifying viral pathogens in sequencing data. We also demonstrate the utility of the new software by applying it to viral screening of the feces of bats collected in the Moscow region, which revealed a significant variety of viruses associated with bats, insects, plants, and protozoa. The presence of alpha and beta coronavirus reads, including the MERS-like bat virus, deserves a special mention, as it once again indicates that bats are indeed reservoirs for many viral pathogens. In addition, it was shown that alignment-based methods were unable to identify the taxon for a large proportion of reads, and we applied other approaches as well, showing that they can further reveal the presence of viral agents in sequencing data. However, the incompleteness of viral databases remains a significant problem in studies of viral diversity, and therefore necessitates the use of combined approaches, including those based on machine learning methods.

Towards end-to-end disease prediction from raw metagenomic data

10.1101/2020.10.29.360297 ◽

2020 ◽

Author(s):

Maxence Queyrel ◽

Edi Prifti ◽

Jean-Daniel Zucker

Keyword(s):

DNA Sequences ◽

Real Life ◽

Multiple Instance Learning ◽

Disease Classification ◽

Metagenomic Data ◽

Numerical Representation ◽

Metagenomic Sequencing ◽

Sequencing Data ◽

End To End ◽

Bioinformatics Workflows

Analysis of the human microbiome using metagenomic sequencing data has demonstrated high ability in discriminating various human diseases. Raw metagenomic sequencing data require multiple complex and computationally heavy bioinformatics steps prior to data analysis. Such data contain millions of short sequences read from the fragmented DNA sequences and are stored as fastq files. Conventional processing pipelines consist of multiple steps, including quality control, filtering, and alignment of sequences against genomic catalogs (genes, species, taxonomic levels, functional pathways, etc.). These pipelines are complex to use, time-consuming, and rely on a large number of parameters that often introduce variability and affect the estimation of the microbiome elements. Recent studies have demonstrated that training Deep Neural Networks directly from raw sequencing data is a promising approach to bypass some of the challenges associated with mainstream bioinformatics pipelines. Most of these methods use the concept of word and sentence embeddings that create a meaningful and numerical representation of DNA sequences, while extracting features and reducing the dimensionality of the data. In this paper we present an end-to-end approach that classifies patients into disease groups directly from raw metagenomic reads: metagenome2vec. This approach is composed of four steps: (i) generating a vocabulary of k-mers and learning their numerical embeddings; (ii) learning DNA sequence (read) embeddings; (iii) identifying the genome from which the sequence is most likely to come; and (iv) training a multiple instance learning classifier which predicts the phenotype based on the vector representation of the raw data. An attention mechanism is applied in the network so that the model can be interpreted, assigning to each genome a weight for its influence on the prediction.
Using two public real-life datasets as well as a simulated one, we demonstrated that this original approach reached very high performance, comparable with the state-of-the-art methods applied directly on data processed through mainstream bioinformatics workflows. These results are encouraging for this proof-of-concept work. We believe that with further dedication, the DNN models have the potential to surpass mainstream bioinformatics workflows in disease classification tasks.
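
The first half of step (i) in this abstract, building a vocabulary of k-mers from raw reads, can be illustrated with a minimal Python sketch. It is not the metagenome2vec implementation: the function names `kmers`, `build_vocabulary`, and `encode`, the choice of k = 3, and the toy reads are assumptions made for illustration, and the embedding-learning stage is omitted entirely.

```python
from collections import Counter


def kmers(read, k):
    """Return all overlapping k-mers of a read (sliding window, stride 1)."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def build_vocabulary(reads, k):
    """Map each distinct k-mer to an integer id, most frequent first."""
    counts = Counter(km for read in reads for km in kmers(read, k))
    # Sort by descending count, then alphabetically, so ids are deterministic.
    ordered = sorted(counts, key=lambda km: (-counts[km], km))
    return {km: idx for idx, km in enumerate(ordered)}


def encode(read, vocab, k):
    """Turn a read into the id sequence an embedding model would consume.

    k-mers absent from the vocabulary are skipped.
    """
    return [vocab[km] for km in kmers(read, k) if km in vocab]


reads = ["ACGTAC", "CGTACG"]  # toy reads, not real sequencing data
vocab = build_vocabulary(reads, k=3)
tokens = encode("ACGTAC", vocab, k=3)
```

Each read becomes a sequence of integer ids, which is the form a word-embedding style model typically takes as input; larger values of k give a larger, sparser vocabulary.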

Mouse Gut Microbiome-Encoded β-Glucuronidases Identified Using Metagenome Analysis Guided by Protein Structure

mSystems ◽

10.1128/msystems.00452-19 ◽

2019 ◽

Vol 4 (4) ◽

Cited By ~ 5

Author(s):

Benjamin C. Creekmore ◽

Josh H. Gray ◽

William G. Walton ◽

Kristen A. Biernat ◽

Michael S. Little ◽

...

Keyword(s):

Protein Structure ◽

Active Site ◽

Human Microbiome ◽

Drug Efficacy ◽

Human Microbiome Project ◽

Structural Features ◽

Model Organisms ◽

Mouse Strains ◽

Sequencing Data ◽

Metagenome Analysis

ABSTRACT: Gut microbial β-glucuronidase (GUS) enzymes play important roles in drug efficacy and toxicity, intestinal carcinogenesis, and mammalian-microbial symbiosis. Recently, the first catalog of human gut GUS proteins was provided for the Human Microbiome Project stool sample database and revealed 279 unique GUS enzymes organized into six categories based on active-site structural features. Because mice represent a model biomedical research organism, here we provide an analogous catalog of mouse intestinal microbial GUS proteins: a mouse gut GUSome. Using metagenome analysis guided by protein structure, we examined 2.5 million unique proteins from a comprehensive mouse gut metagenome created from several mouse strains, providers, housing conditions, and diets. We identified 444 unique GUS proteins and organized them into six categories based on active-site features, similarly to the human GUSome analysis. GUS enzymes were encoded by the major gut microbial phyla, including Firmicutes (60%) and Bacteroidetes (21%), and there were nearly 20% for which taxonomy could not be assigned. No differences in gut microbial gus gene composition were observed for mice based on sex. However, mice exhibited gus differences based on active-site features associated with provider, location, strain, and diet. Furthermore, diet yielded the largest differences in gus composition. Biochemical analysis of two low-fat-associated GUS enzymes revealed that they are variable with respect to their efficacy of processing both sulfated and nonsulfated heparan nonasaccharides containing terminal glucuronides. IMPORTANCE: Mice are commonly employed as model organisms of mammalian disease; as such, our understanding of the compositions of their gut microbiomes is critical to appreciating how the mouse and human gastrointestinal tracts mirror one another. GUS enzymes, with importance in normal physiology and disease, are an attractive set of proteins to use for such analyses.
Here we show that while the specific GUS enzymes differ at the sequence level, a core GUSome functionality appears conserved between mouse and human gastrointestinal bacteria. Mouse strain, provider, housing location, and diet exhibit distinct GUSomes and gus gene compositions, but sex seems not to affect the GUSome. These data provide a basis for understanding the gut microbial GUS enzymes present in commonly used laboratory mice. Further, they demonstrate the utility of metagenome analysis guided by protein structure to provide specific sets of functionally related proteins from whole-genome metagenome sequencing data.

Computational Modeling of the Human Microbiome

Microorganisms ◽

10.3390/microorganisms8020197 ◽

2020 ◽

Vol 8 (2) ◽

pp. 197

Author(s):

Shomeek Chowdhury ◽

Stephen S. Fong

Keyword(s):

Computational Modeling ◽

Human Health ◽

Large Scale ◽

Human Microbiome ◽

Human Microbiome Project ◽

Microbial Composition ◽

Site Specific ◽

Microbiome Research ◽

High Level ◽

The Impact

The impact of microorganisms on human health has long been acknowledged and studied, but recent advances in research methodologies have enabled a new systems-level perspective on the collections of microorganisms associated with humans, the human microbiome. Large-scale collaborative efforts such as the NIH Human Microbiome Project have sought to kick-start research on the human microbiome by providing foundational information on microbial composition based upon specific sites across the human body. Here, we focus on the four main anatomical sites of the human microbiome: gut, oral, skin, and vaginal, and provide information on site-specific background, experimental data, and computational modeling. Each of the site-specific microbiomes has unique organisms and phenomena associated with them; there are also high-level commonalities. By providing an overview of different human microbiome sites, we hope to provide a perspective where detailed, site-specific research is needed to understand causal phenomena that impact human health, but there is equally a need for more generalized methodology improvements that would benefit all human microbiome research.

phylogenize: correcting for phylogeny reveals genes associated with microbial distributions

Bioinformatics ◽

10.1093/bioinformatics/btz722 ◽

2019 ◽

Vol 36 (4) ◽

pp. 1289-1290

Author(s):

Patrick H Bradley ◽

Katherine S Pollard

Keyword(s):

Community Composition ◽

Human Microbiome ◽

Human Microbiome Project ◽

Shotgun Sequencing ◽

Supplementary Information ◽

Phylogenetic Comparative Methods ◽

Supplementary Data ◽

Sequencing Data ◽

Phylogenetic Regression ◽

Project Data

Summary: Phylogenetic comparative methods are powerful but presently under-utilized ways to identify microbial genes underlying differences in community composition. These methods help to identify functionally important genes because they test for associations beyond those expected when related microbes occupy similar environments. We present phylogenize, a pipeline with web, QIIME 2 and R interfaces that allows researchers to perform phylogenetic regression on 16S amplicon and shotgun sequencing data and to visualize results. phylogenize applies broadly to both host-associated and environmental microbiomes. Using Human Microbiome Project and Earth Microbiome Project data, we show that phylogenize draws similar conclusions from 16S versus shotgun sequencing and reveals both known and candidate pathways associated with host colonization. Availability and implementation: phylogenize is available at https://phylogenize.org and https://bitbucket.org/pbradz/phylogenize. Supplementary information: Supplementary data are available at Bioinformatics online.

A resistome roadmap: from the human body to pristine environments

10.1101/2021.10.08.463752 ◽

2021 ◽

Author(s):

Lucia Maestre-Carballa ◽

Manuel Martínez-García ◽

Vicente Navarro-López

Keyword(s):

Oral Cavity ◽

Resistance Genes ◽

Human Body ◽

Human Microbiome ◽

Antibiotic Resistance Genes ◽

Human Microbiome Project ◽

Individual Variability ◽

Body Parts ◽

Sequencing Data ◽

The One

A comprehensive characterization of the human body resistome (the set of antibiotic resistance genes (ARGs)) has yet to be carried out and is paramount for addressing the threat of microbial antibiotic resistance. Here, we study the resistome of 771 samples from five major body parts (skin, nares, vagina, gut and oral cavity) of healthy subjects from the Human Microbiome Project and address the potential dispersion of ARGs in pristine environments. A total of 28,731 ARGs belonging to 344 different ARG types were found in the HMP proteome dataset (n = 9.1 × 10^7 proteins analyzed). Our study reveals a distinct resistome profile (ARG type and abundance) between body sites and high inter-individual variability. Nares had the highest ARG load (≈5.4 genes/genome) followed by the oral cavity, while the gut showed one of the highest ARG richness values (shared with nares) but the lowest abundance (≈1.3 genes/genome). Fluoroquinolone resistance genes were the most abundant in the human body, followed by macrolide-lincosamide-streptogramin (MLS) or tetracycline. Most of the ARGs belonged to common bacterial commensals, and the multidrug resistance trait was predominant in the nares and vagina. Our data also provide hope, since the spread of common ARGs from the human body to pristine environments (n = 271 samples; 77 Gb of sequencing data and 2.1 × 10^8 proteins analyzed) thus far remains very unlikely (only one case found in an autochthonous bacterium from a pristine environment). These findings broaden our understanding of ARGs in the context of the human microbiome and the One-Health Initiative of WHO uniting human host-microbes and environments as a whole.
