Sunbeam: an extensible pipeline for analyzing metagenomic sequencing experiments

Mapping Intimacies ◽

10.1101/326363 ◽

2018 ◽

Cited By ~ 3

Author(s):

Erik L. Clarke ◽

Louis J. Taylor ◽

Chunyu Zhao ◽

Andrew Connell ◽

Jung-Jin Lee ◽

...

Keyword(s):

Cluster Computing ◽

Workflow Management ◽

Software Tool ◽

Direct Analysis ◽

Low Complexity ◽

Single Step ◽

Metagenomic Data ◽

Metagenomic Sequencing ◽

Reference Genomes ◽

Adapter Trimming

Abstract Background: Analysis of mixed microbial communities using metagenomic sequencing experiments requires multiple preprocessing and analytical steps to interpret the microbial and genetic composition of samples. Analytical steps include quality control, adapter trimming, host decontamination, metagenomic classification, read assembly, and alignment to reference genomes. Results: We present a modular and user-extensible pipeline called Sunbeam that performs these steps in a consistent and reproducible fashion. It can be installed in a single step, does not require administrative access to the host computer system, and can work with most cluster computing frameworks. We also introduce Komplexity, a software tool to eliminate potentially problematic, low-complexity nucleotide sequences from metagenomic data. Unique components of the Sunbeam pipeline include direct analysis of data from NCBI SRA and an easy-to-use extension framework that enables users to add custom processing or analysis steps directly to the workflow. The pipeline and its extension framework are well documented, in routine use, and regularly updated. Conclusions: Sunbeam provides a foundation to build more in-depth analyses and to enable comparisons in metagenomic sequencing experiments by removing problematic low-complexity reads and standardizing post-processing and analytical steps. Sunbeam is written in Python using the Snakemake workflow management software and is freely available at github.com/sunbeam-labs/sunbeam under the GPLv3.
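
The idea of filtering low-complexity reads can be illustrated with a small sketch. The abstract does not describe how Komplexity scores sequences, so the scoring rule here (distinct k-mers divided by read length), the function names, and the values `k=4` and `threshold=0.55` are all illustrative assumptions, not the tool's actual interface.

```python
def kmer_complexity(seq: str, k: int = 4) -> float:
    """Return the number of distinct k-mers divided by the sequence length.

    Assumed scoring rule for illustration; not taken from the abstract.
    """
    seq = seq.upper()
    if len(seq) < k:
        return 0.0
    distinct = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    return len(distinct) / len(seq)


def filter_low_complexity(reads, k=4, threshold=0.55):
    """Keep only reads whose complexity score meets the (hypothetical) threshold."""
    return [r for r in reads if kmer_complexity(r, k) >= threshold]


reads = [
    "AAAAAAAAAAAAAAAAAAAAAAAA",          # homopolymer run
    "ATATATATATATATATATATATAT",          # dinucleotide repeat
    "ACGTTGCAAGCTTAGGCTAACGGATCCATGCA",  # mixed sequence
]
kept = filter_low_complexity(reads)
print(kept)
```

Repetitive reads contain very few distinct k-mers relative to their length, so they score low and are dropped, while the mixed sequence is kept.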

Evaluation of the effects of library preparation procedure and sample characteristics on the accuracy of metagenomic profiles

10.1101/2021.04.12.439578 ◽

2021 ◽

Author(s):

Christopher Gaulke ◽

Emily R Schmeltzer ◽

Mark Dasenko ◽

Brett M Tyler ◽

Rebecca Vega Thurber ◽

...

Keyword(s):

Microbial Community ◽

Cost Effective ◽

Low Complexity ◽

Careful Consideration ◽

Metagenomic Data ◽

Metagenomic Sequencing ◽

Library Preparation ◽

DNA Amount ◽

Preparation Methods ◽

Shotgun Metagenomic Sequencing

Shotgun metagenomic sequencing has transformed our understanding of microbial community ecology. However, preparing metagenomic libraries for high-throughput DNA sequencing remains a costly, labor-intensive, and time-consuming procedure, which in turn limits the utility of metagenomes. Several library preparation procedures have recently been developed to offset these costs, but it is unclear how these newer procedures compare to current standards in the field. In particular, it is not clear if all such procedures perform equally well across different types of microbial communities, or if features of the biological samples being processed (e.g., DNA amount) impact the accuracy of the approach. To address these questions, we assessed how five different shotgun DNA sequence library preparation methods, including the commonly used Nextera® Flex kit, perform when applied to metagenomic DNA. We measured each method's ability to produce metagenomic data that accurately represents the underlying taxonomic and genetic diversity of the community. We performed these analyses across a range of microbial community types (e.g., soil, coral-associated, mouse-gut-associated) and input DNA amounts. We find that the type of community and amount of input DNA influence each method’s performance, indicating that careful consideration may be needed when selecting between methods, especially for low complexity communities. However, cost-effective preparation methods we assessed are generally comparable to the current gold standard Nextera® DNA Flex kit for high-complexity communities. Overall, the results from this analysis will help expand and even facilitate access to metagenomic approaches in future studies.

Clin-mNGS: Automated Pipeline for Pathogen Detection from Clinical Metagenomic Data

Current Bioinformatics ◽

10.2174/1574893615999200608130029 ◽

2020 ◽

Vol 15 ◽

Author(s):

Akshatha Prasanna ◽

Vidya Niranjan

Keyword(s):

Antimicrobial Resistance ◽

High Performance ◽

Pathogen Detection ◽

Bacterial Species ◽

Workflow Management ◽

Metagenomic Data ◽

Antimicrobial Resistance Genes ◽

Culture Independent ◽

Automated Pipeline ◽

User Friendly

Background: Since bacteria are the earliest known organisms, there has been significant interest in their variety and biology, particularly with regard to human health. Recent advances in metagenomic sequencing (mNGS), a culture-independent sequencing technology, have accelerated developments in clinical microbiology and our understanding of pathogens. Objective: For the implementation of mNGS in routine clinical practice to become feasible, a practical and scalable strategy for analyzing mNGS data is essential. This study presents a robust automated pipeline to analyze clinical metagenomic data for pathogen identification and classification. Method: The proposed Clin-mNGS pipeline is an integrated, open-source, scalable, reproducible, and user-friendly framework scripted using the Snakemake workflow management software. The implementation avoids the hassle of manual installation and configuration of the multiple command-line tools and dependencies. The approach directly screens pathogens from clinical raw reads and generates consolidated reports for each sample. Results: The pipeline is demonstrated using publicly available data and is tested on a desktop Linux system and a high-performance cluster. The study compares variability in results from different tools and versions. The tool versions are user-modifiable. The pipeline produces quality checks, filtered reads, host subtraction, assembled contigs, assembly metrics, relative abundances of bacterial species, antimicrobial resistance genes, plasmid detection, and virulence factor identification. The results obtained from the pipeline are evaluated based on sensitivity and positive predictive value. Conclusion: Clin-mNGS is an automated Snakemake pipeline validated for the analysis of microbial clinical metagenomics reads to perform taxonomic classification and antimicrobial resistance prediction.
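
The two evaluation metrics named in the abstract, sensitivity and positive predictive value, can be sketched as follows. This is a minimal illustration and not code from Clin-mNGS; the function names and the species sets are hypothetical.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of truly present pathogens that were detected: TP / (TP + FN)."""
    total = true_pos + false_neg
    return true_pos / total if total else 0.0


def positive_predictive_value(true_pos: int, false_pos: int) -> float:
    """Fraction of reported pathogens that are truly present: TP / (TP + FP)."""
    total = true_pos + false_pos
    return true_pos / total if total else 0.0


# Hypothetical example: species known to be in a sample vs. species reported.
expected = {"Escherichia coli", "Klebsiella pneumoniae", "Staphylococcus aureus"}
detected = {"Escherichia coli", "Staphylococcus aureus", "Pseudomonas aeruginosa"}

tp = len(expected & detected)   # correctly reported species
fn = len(expected - detected)   # present but missed
fp = len(detected - expected)   # reported but not present

print(sensitivity(tp, fn), positive_predictive_value(tp, fp))
```

Sensitivity penalizes missed pathogens, while positive predictive value penalizes false reports, so the two together describe both kinds of error.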

Evaluation of the CosmosID Bioinformatics Platform for Prosthetic Joint-Associated Sonicate Fluid Shotgun Metagenomic Data Analysis

Journal of Clinical Microbiology ◽

10.1128/jcm.01182-18 ◽

2018 ◽

Vol 57 (2) ◽

Cited By ~ 8

Author(s):

Qun Yan ◽

Yu Mi Wi ◽

Matthew J. Thoendel ◽

Yash S. Raval ◽

Kerryl E. Greenwood-Quaintance ◽

...

Keyword(s):

Antibiotic Resistance ◽

Metagenomic Data ◽

Metagenomic Sequencing ◽

Antibacterial Resistance ◽

Sequencing Data ◽

Bacterial Detection ◽

Shotgun Metagenomic Sequencing ◽

Prosthetic Joint ◽

Validation Set ◽

Fluid Culture

ABSTRACT We previously demonstrated that shotgun metagenomic sequencing can detect bacteria in sonicate fluid, providing a diagnosis of prosthetic joint infection (PJI). A limitation of the approach that we used is that data analysis was time-consuming and specialized bioinformatics expertise was required, both of which are barriers to routine clinical use. Fortunately, automated commercial analytic platforms that can interpret shotgun metagenomic data are emerging. In this study, we evaluated the CosmosID bioinformatics platform using shotgun metagenomic sequencing data derived from 408 sonicate fluid samples from our prior study with the goal of evaluating the platform vis-à-vis bacterial detection and antibiotic resistance gene detection for predicting staphylococcal antibacterial susceptibility. Samples were divided into a derivation set and a validation set, each consisting of 204 samples; results from the derivation set were used to establish cutoffs, which were then tested in the validation set for identifying pathogens and predicting staphylococcal antibacterial resistance. Metagenomic analysis detected bacteria in 94.8% (109/115) of sonicate fluid culture-positive PJIs and 37.8% (37/98) of sonicate fluid culture-negative PJIs. Metagenomic analysis showed sensitivities ranging from 65.7 to 85.0% for predicting staphylococcal antibacterial resistance. In conclusion, the CosmosID platform has the potential to provide fast, reliable bacterial detection and identification from metagenomic shotgun sequencing data derived from sonicate fluid for the diagnosis of PJI. Strategies for metagenomic detection of antibiotic resistance genes for predicting staphylococcal antibacterial resistance need further development.
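
The detection rates quoted in the abstract follow directly from the reported counts, and the two sets of 204 samples account for all 408 samples. A short check, using only numbers stated above (the helper name `percent` is introduced here for illustration):

```python
def percent(numerator: int, denominator: int) -> float:
    """Return numerator/denominator as a percentage rounded to one decimal."""
    return round(100 * numerator / denominator, 1)


culture_positive_rate = percent(109, 115)  # culture-positive PJIs with bacteria detected
culture_negative_rate = percent(37, 98)    # culture-negative PJIs with bacteria detected
total_samples = 204 + 204                  # derivation set + validation set

print(culture_positive_rate, culture_negative_rate, total_samples)
```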

MetaDEGalaxy: Galaxy workflow for differential abundance analysis of 16s metagenomic data

F1000Research ◽

10.12688/f1000research.18866.2 ◽

2019 ◽

Vol 8 ◽

pp. 726

Author(s):

Mike W.C. Thang ◽

Xin-Yi Chua ◽

Gareth Price ◽

Dominique Gorse ◽

Matt A. Field

Keyword(s):

Microbial Communities ◽

Sequence Data ◽

Metagenomic Data ◽

Marker Genes ◽

Metagenomic Sequencing ◽

Differential Analysis ◽

Biomedical Sciences ◽

Metagenomic Sequence ◽

Differential Abundance ◽

Differential Abundance Analysis

Metagenomic sequencing is an increasingly common tool in environmental and biomedical sciences. While software for detailing the composition of microbial communities using 16S rRNA marker genes is relatively mature, researchers are increasingly interested in identifying changes exhibited within microbial communities under differing environmental conditions. In order to gain maximum value from metagenomic sequence data, we must improve the existing analysis environment by providing accessible and scalable computational workflows able to generate reproducible results. Here we describe a complete end-to-end open-source metagenomics workflow running within Galaxy for 16S differential abundance analysis. The workflow accepts 454 or Illumina sequence data (either overlapping or non-overlapping paired-end reads) and outputs lists of the operational taxonomic units (OTUs) exhibiting the greatest change under differing conditions. A range of analysis steps and graphing options are available, giving users a high level of control over their data and analyses. Additionally, users are able to input complex sample-specific metadata, which can be incorporated into differential analysis and used for grouping and colouring within graphs. Detailed tutorials containing sample data and existing workflows are available for three different input types: overlapping and non-overlapping read pairs, as well as pre-generated Biological Observation Matrix (BIOM) files. Using the Galaxy platform we developed MetaDEGalaxy, a complete metagenomics differential abundance analysis workflow. MetaDEGalaxy is designed for bench scientists working with 16S data who are interested in comparative metagenomics. MetaDEGalaxy builds on momentum within the wider Galaxy metagenomics community with the hope that more tools will be added as existing methods mature.

Accurate Reconstruction of Microbial Strains from Metagenomic Sequencing Using Representative Reference Genomes

10.1101/215707 ◽

2017 ◽

Cited By ~ 2

Author(s):

Zhemin Zhou ◽

Nina Luhmann ◽

Nabil-Fareed Alikhan ◽

Christopher Quince ◽

Mark Achtman

Keyword(s):

Evaluation Studies ◽

Species Level ◽

Metagenomic Sequencing ◽

Sequencing Data ◽

Reference Databases ◽

Microbial Strains ◽

Taxonomic Assignments ◽

Taxonomic Groups ◽

Reference Genomes ◽

Recent Evaluation

Abstract Exploring the genetic diversity of microbes within the environment through metagenomic sequencing first requires classifying these reads into taxonomic groups. Current methods compare these sequencing data with existing biased and limited reference databases. Several recent evaluation studies demonstrate that current methods either lack sufficient sensitivity for species-level assignments or suffer from false positives, overestimating the number of species in the metagenome. Both are especially problematic for the identification of low-abundance microbial species, e.g., detecting pathogens in ancient metagenomic samples. We present a new method, SPARSE, which improves taxonomic assignments of metagenomic reads. SPARSE balances existing biased reference databases by grouping reference genomes into similarity-based hierarchical clusters, implemented as an efficient incremental data structure. SPARSE assigns reads to these clusters using a probabilistic model, which specifically penalizes non-specific mappings of reads from unknown sources and hence reduces false-positive assignments. Our evaluation on simulated datasets from two recent evaluation studies demonstrated the improved precision of SPARSE in comparison to other methods for species-level classification. In a third simulation, our method successfully differentiated multiple co-existing Escherichia coli strains from the same sample. In real archaeological datasets, SPARSE identified ancient pathogens with ≤ 0.02% abundance, consistent with published findings that required additional sequencing data. In these datasets, other methods either missed targeted pathogens or reported non-existent ones. SPARSE and all evaluation scripts are available at https://github.com/zheminzhou/SPARSE.

Towards end-to-end disease prediction from raw metagenomic data

10.1101/2020.10.29.360297 ◽

2020 ◽

Author(s):

Maxence Queyrel ◽

Edi Prifti ◽

Jean-Daniel Zucker

Keyword(s):

DNA Sequences ◽

Real Life ◽

Multiple Instance Learning ◽

Disease Classification ◽

Metagenomic Data ◽

Numerical Representation ◽

Metagenomic Sequencing ◽

Sequencing Data ◽

End To End ◽

Bioinformatics Workflows

Abstract Analysis of the human microbiome using metagenomic sequencing data has demonstrated high ability in discriminating various human diseases. Raw metagenomic sequencing data require multiple complex and computationally heavy bioinformatics steps prior to data analysis. Such data contain millions of short sequences read from the fragmented DNA sequences and are stored as fastq files. Conventional processing pipelines consist of multiple steps including quality control, filtering, and alignment of sequences against genomic catalogs (genes, species, taxonomic levels, functional pathways, etc.). These pipelines are complex to use, time-consuming, and rely on a large number of parameters that often introduce variability and affect the estimation of the microbiome elements. Recent studies have demonstrated that training Deep Neural Networks directly from raw sequencing data is a promising approach to bypass some of the challenges associated with mainstream bioinformatics pipelines. Most of these methods use the concept of word and sentence embeddings that create a meaningful and numerical representation of DNA sequences, while extracting features and reducing the dimensionality of the data. In this paper we present an end-to-end approach that classifies patients into disease groups directly from raw metagenomic reads: metagenome2vec. This approach is composed of four steps: (i) generating a vocabulary of k-mers and learning their numerical embeddings; (ii) learning DNA sequence (read) embeddings; (iii) identifying the genome from which the sequence is most likely to come; and (iv) training a multiple instance learning classifier which predicts the phenotype based on the vector representation of the raw data. An attention mechanism is applied in the network so that the model can be interpreted, assigning a weight to the influence of each genome on the prediction. Using two public real-life datasets as well as a simulated one, we demonstrated that this original approach reached very high performance, comparable with state-of-the-art methods applied directly to data processed through mainstream bioinformatics workflows. These results are encouraging for this proof-of-concept work. We believe that with further dedication, the DNN models have the potential to surpass mainstream bioinformatics workflows in disease classification tasks.
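
The first half of step (i), building a k-mer vocabulary and turning reads into token sequences, can be sketched as below. This is an illustrative assumption of how such a vocabulary might be built, not code from metagenome2vec; the function names, `k=3`, and the toy reads are hypothetical, and the embedding-learning part is not shown.

```python
from collections import Counter


def kmers(read: str, k: int) -> list:
    """Split a read into its overlapping k-mers (the 'words' of the read)."""
    read = read.upper()
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def build_vocabulary(reads, k: int, min_count: int = 1) -> dict:
    """Map each k-mer seen at least min_count times to an integer id."""
    counts = Counter(km for read in reads for km in kmers(read, k))
    kept = sorted(km for km, count in counts.items() if count >= min_count)
    return {km: idx for idx, km in enumerate(kept)}


def encode(read: str, vocab: dict, k: int) -> list:
    """Convert a read to k-mer ids, skipping k-mers absent from the vocabulary."""
    return [vocab[km] for km in kmers(read, k) if km in vocab]


reads = ["ACGTAC", "CGTACG"]            # toy reads, not real data
vocab = build_vocabulary(reads, k=3)
print(vocab)
print(encode("GTACG", vocab, k=3))
```

The integer ids produced by `encode` are the kind of input an embedding layer would consume in the later steps.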

Mumame: a software tool for quantifying gene-specific point-mutations in shotgun metagenomic data

Metabarcoding and Metagenomics ◽

10.3897/mbmg.3.36236 ◽

2019 ◽

Vol 3 ◽

Cited By ~ 1

Author(s):

Shruthi Magesh ◽

Viktor Jonsson ◽

Johan Bengtsson-Palme

Keyword(s):

Microbial Communities ◽

Point Mutations ◽

Software Tool ◽

Metagenomic Data ◽

Data Sets ◽

Resistance Mutations ◽

Shotgun Metagenomics ◽

Key Factor ◽

Detection Of Mutations ◽

And Function

Metagenomics has emerged as a central technique for studying the structure and function of microbial communities. Often the functional analysis is restricted to classification into broad functional categories. However, important phenotypic differences, such as resistance to antibiotics, are often the result of just one or a few point mutations in otherwise identical sequences. Bioinformatic methods for metagenomic analysis have generally been poor at accounting for this fact, resulting in a somewhat limited picture of important aspects of microbial communities. Here, we address this problem by providing a software tool called Mumame, which can distinguish between wildtype and mutated sequences in shotgun metagenomic data and quantify their relative abundances. We demonstrate the utility of the tool by quantifying antibiotic resistance mutations in several publicly available metagenomic data sets. We also found that sequencing depth is a key factor in detecting rare mutations. Therefore, much larger numbers of sequences may be required for reliable detection of mutations than for most other applications of shotgun metagenomics. Mumame is freely available online (http://microbiology.se/software/mumame).
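
Why depth matters for rare mutations can be illustrated with a simplified model that is not taken from the paper: assume each read covering the mutated position independently carries the mutation with probability f, and ignore sequencing errors. The chance of seeing at least one mutated read among n covering reads is then 1 - (1 - f)^n. The function names and example values below are hypothetical.

```python
import math


def detection_probability(freq: float, n_reads: int) -> float:
    """P(at least one mutated read) under the independent-reads assumption."""
    return 1.0 - (1.0 - freq) ** n_reads


def reads_needed(freq: float, target_prob: float) -> int:
    """Smallest number of covering reads reaching the target detection probability."""
    return math.ceil(math.log(1.0 - target_prob) / math.log(1.0 - freq))


# A mutation present in 0.1% of the gene copies, 95% chance of seeing it once.
n = reads_needed(0.001, 0.95)
print(n, detection_probability(0.001, n))
```

Because these are reads covering one specific position in one gene, the total number of metagenomic reads required would be far larger, which is consistent with the abstract's point about depth.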

Metagenomic Signatures of Bacterial Adaptation to Life in the Phyllosphere of a Salt-Secreting Desert Tree

Applied and Environmental Microbiology ◽

10.1128/aem.00483-16 ◽

2016 ◽

Vol 82 (9) ◽

pp. 2854-2861 ◽

Cited By ~ 24

Author(s):

Omri M. Finkel ◽

Tom O. Delmont ◽

Anton F. Post ◽

Shimshon Belkin

Keyword(s):

High Salinity ◽

Stress Factors ◽

Metagenomic Data ◽

Metagenomic Sequencing ◽

Contig Assembly ◽

Bacterial Populations ◽

Content Type ◽

Light Sensing ◽

Wide Range ◽

Globally Distributed

ABSTRACT The leaves of Tamarix aphylla, a globally distributed, salt-secreting desert tree, are dotted with alkaline droplets of high salinity. To successfully inhabit these organic carbon-rich droplets, bacteria need to be adapted to multiple stress factors, including high salinity, high alkalinity, high UV radiation, and periodic desiccation. To identify genes that are important for survival in this harsh habitat, microbial community DNA was extracted from the leaf surfaces of 10 Tamarix aphylla trees along a 350-km longitudinal gradient. Shotgun metagenomic sequencing, contig assembly, and binning yielded 17 genome bins, six of which were >80% complete. These genomic bins, representing three phyla (Proteobacteria, Bacteroidetes, and Firmicutes), were closely related to halophilic and alkaliphilic taxa isolated from aquatic and soil environments. Comparison of these genomic bins to the genomes of their closest relatives revealed functional traits characteristic of bacterial populations inhabiting the Tamarix phyllosphere, independent of their taxonomic affiliation. These functions, most notably light-sensing genes, are postulated to represent important adaptations toward colonization of this habitat.
IMPORTANCE Plant leaves are an extensive and diverse microbial habitat, forming the main interface between solar energy and the terrestrial biosphere. There are hundreds of thousands of plant species in the world, exhibiting a wide range of morphologies, leaf surface chemistries, and ecological ranges. In order to understand the core adaptations of microorganisms to this habitat, it is important to diversify the type of leaves that are studied. This study provides an analysis of the genomic content of the most abundant bacterial inhabitants of the globally distributed, salt-secreting desert tree Tamarix aphylla. Draft genomes of these bacteria were assembled, using the culture-independent technique of assembly and binning of metagenomic data. Analysis of the genomes reveals traits that are important for survival in this habitat, most notably, light-sensing and light utilization genes.

Automatized 3D-Scanning Application for the Virtualization of Large Components

Volume 2: Combustion, Fuels, and Emissions; Renewable Energy: Solar and Wind; Inlets and Exhausts; Emerging Technologies: Hybrid Electric Propulsion and Alternate Power Generation; GT Operation and Maintenance; Materials and Manufacturing (Including Coatings, Composites, CMCs, Additive Manufacturing); Analytics and Digital Solutions for Gas Turbines/Rotating Machinery ◽

10.1115/gtindia2019-2388 ◽

2019 ◽

Cited By ~ 1

Author(s):

Stephan Mönchinger ◽

Marvin M. Schmidt ◽

Sebastian Dreßen ◽

Patrick Wissmann ◽

Rainer Stark

Keyword(s):

Gas Turbines ◽

Software Tool ◽

Low Complexity ◽

3D Scanning ◽

Shop Floor ◽

Final State ◽

Scanning Procedure ◽

Cost Efficient ◽

Easy Integration ◽

Automated Scanning

Abstract Many of the large components of modern gas turbines are cast, resulting in rough surface profiles, which have to be machined to achieve the component’s final state. As there are high deviations in casting components, the real geometry does not meet the ideal model dimensions and is known neither to the supplier nor to the customer. While manual 3D-scanning processes, which depend heavily on the operator’s qualification, are getting more attention, conventional means are still the basis for quality assurance of such parts. Although significant time reduction can be reached by automated scanning, there is still a low variety of corresponding applications for large components on the market. Flexible systems are an approach for further development, as most of the manufacturers handling large components already have and use machine tools for the processing of their components. The designed and implemented prototypical system allows the acquisition of a large component’s surface with only a few manual inputs prior to the actual scanning procedure. It can be used with existing machining tools, allowing an easy implementation for different use cases of a pre-manufacturing scan, e.g. for CAM planning. The application is implemented in a small software tool that can be adapted to other machines with low effort. The implementation has been demonstrated in a real manufacturing environment with typical environmental conditions on the shop floor. The prototypical application has been built mainly with existing components. Following the V-Model, each domain has been investigated individually, followed by a complete system investigation. At below 100.000€, the system price is below 10% of that of most automated systems on the market. The presented cost-efficient, low-complexity prototypical system can provide early information about the product for a digital process chain in Industry 4.0, enabling flexible, intuitive and easy integration.

Evaluating Metagenomic Prediction of the Metaproteome in a 4.5-Year Study of a Patient with Crohn's Disease

mSystems ◽

10.1128/msystems.00337-18 ◽

2019 ◽

Vol 4 (1) ◽

Cited By ~ 18

Author(s):

Robert H. Mills ◽

Yoshiki Vázquez-Baeza ◽

Qiyun Zhu ◽

Lingjing Jiang ◽

James Gaffney ◽

...

Keyword(s):

Crohn's Disease ◽

DNA Analysis ◽

Gene Copy Number ◽

Metagenomic Data ◽

Gene Copy ◽

Metagenomic Sequencing ◽

Data Types ◽

Fecal Microbiome ◽

Disease States

ABSTRACT Although genetic approaches are the standard in microbiome analysis, proteome-level information is largely absent. This discrepancy warrants a better understanding of the relationship between gene copy number and protein abundance, as this is crucial information for inferring protein-level changes from metagenomic data. As it remains unknown how metaproteomic systems evolve during dynamic disease states, we leveraged a 4.5-year fecal time series using samples from a single patient with colonic Crohn’s disease. Utilizing multiplexed quantitative proteomics and shotgun metagenomic sequencing of eight time points in technical triplicate, we quantified over 29,000 protein groups and 110,000 genes and compared them to five protein biomarkers of disease activity. Broad-scale observations were consistent between data types, including overall clustering by principal-coordinate analysis and fluctuations in Gene Ontology terms related to Crohn’s disease. Through linear regression, we determined genes and proteins fluctuating in conjunction with inflammatory metrics. We discovered conserved taxonomic differences relevant to Crohn’s disease, including a negative association of Faecalibacterium and a positive association of Escherichia with calprotectin. Despite concordant associations of genera, the specific genes correlated with these metrics were drastically different between metagenomic and metaproteomic data sets. This resulted in the generation of unique functional interpretations dependent on the data type, with metaproteome evidence for previously investigated mechanisms of dysbiosis. An example of one such mechanism was a connection between urease enzymes, amino acid metabolism, and the local inflammation state within the patient. This proof-of-concept approach prompts further investigation of the metaproteome and its relationship with the metagenome in biologically complex systems such as the microbiome. 
IMPORTANCE A majority of current microbiome research relies heavily on DNA analysis. However, as the field moves toward understanding the microbial functions related to healthy and disease states, it is critical to evaluate how changes in DNA relate to changes in proteins, which are functional units of the genome. This study tracked the abundance of genes and proteins as they fluctuated during various inflammatory states in a 4.5-year study of a patient with colonic Crohn’s disease. Our results indicate that despite a low level of correlation, taxonomic associations were consistent in the two data types. While there was overlap of the data types, several associations were uniquely discovered by analyzing the metaproteome component. This case study provides unique and important insights into the fundamental relationship between the genes and proteins of a single individual’s fecal microbiome associated with clinical consequences.
