NanoSPC: a scalable, portable, cloud compatible viral nanopore metagenomic data processing pipeline

Yifei Xu; Fan Yang-Turner; Denis Volk; Derrick Crook

doi:10.1093/nar/gkaa413

Nucleic Acids Research ◽

10.1093/nar/gkaa413 ◽

2020 ◽

Vol 48 (W1) ◽

pp. W366-W371

Author(s):

Yifei Xu ◽

Fan Yang-Turner ◽

Denis Volk ◽

Derrick Crook

Keyword(s):

Point Of Care ◽

Treatment Strategies ◽

Metagenomic Data ◽

Infection Prevention And Control ◽

Metagenomic Sequencing ◽

Sequencing Data ◽

Single Nucleotide Variants ◽

Point Of Care Test ◽

Elastic Computing ◽

Analyse Data

Abstract Metagenomic sequencing combined with Oxford Nanopore Technology has the potential to become a point-of-care test for infectious disease in public health and clinical settings, providing rapid diagnosis of infection, guiding individual patient management and treatment strategies, and informing infection prevention and control practices. However, publicly available, streamlined, and reproducible pipelines for analyzing Nanopore metagenomic sequencing data are still lacking. Here we introduce NanoSPC, a scalable, portable and cloud compatible pipeline for analyzing Nanopore sequencing data. NanoSPC can identify potentially pathogenic viruses and bacteria simultaneously to provide comprehensive characterization of individual samples. The pipeline can also detect single nucleotide variants and assemble high quality complete consensus genome sequences, permitting high-resolution inference of transmission. We implement NanoSPC using the Nextflow manager within Docker images to allow reproducibility and portability of the analysis. Moreover, we deploy NanoSPC to our scalable pathogen pipeline platform, enabling elastic computing for high throughput Nanopore data on HPC clusters as well as multiple cloud platforms, such as Google Cloud, Amazon Elastic Computing Cloud, Microsoft Azure and OpenStack. Users can either access our web interface (https://nanospc.mmmoxford.uk) to run cloud-based analysis, monitor processes, and visualize results, or download Docker images and run the command line to analyse data locally.


Evaluation of the CosmosID Bioinformatics Platform for Prosthetic Joint-Associated Sonicate Fluid Shotgun Metagenomic Data Analysis

Journal of Clinical Microbiology ◽

10.1128/jcm.01182-18 ◽

2018 ◽

Vol 57 (2) ◽

Cited By ~ 8

Author(s):

Qun Yan ◽

Yu Mi Wi ◽

Matthew J. Thoendel ◽

Yash S. Raval ◽

Kerryl E. Greenwood-Quaintance ◽

...

Keyword(s):

Antibiotic Resistance ◽

Metagenomic Data ◽

Metagenomic Sequencing ◽

Antibacterial Resistance ◽

Sequencing Data ◽

Bacterial Detection ◽

Shotgun Metagenomic Sequencing ◽

Prosthetic Joint ◽

Validation Set ◽

Fluid Culture

ABSTRACT We previously demonstrated that shotgun metagenomic sequencing can detect bacteria in sonicate fluid, providing a diagnosis of prosthetic joint infection (PJI). A limitation of the approach that we used is that data analysis was time-consuming and specialized bioinformatics expertise was required, both of which are barriers to routine clinical use. Fortunately, automated commercial analytic platforms that can interpret shotgun metagenomic data are emerging. In this study, we evaluated the CosmosID bioinformatics platform using shotgun metagenomic sequencing data derived from 408 sonicate fluid samples from our prior study with the goal of evaluating the platform vis-à-vis bacterial detection and antibiotic resistance gene detection for predicting staphylococcal antibacterial susceptibility. Samples were divided into a derivation set and a validation set, each consisting of 204 samples; results from the derivation set were used to establish cutoffs, which were then tested in the validation set for identifying pathogens and predicting staphylococcal antibacterial resistance. Metagenomic analysis detected bacteria in 94.8% (109/115) of sonicate fluid culture-positive PJIs and 37.8% (37/98) of sonicate fluid culture-negative PJIs. Metagenomic analysis showed sensitivities ranging from 65.7 to 85.0% for predicting staphylococcal antibacterial resistance. In conclusion, the CosmosID platform has the potential to provide fast, reliable bacterial detection and identification from metagenomic shotgun sequencing data derived from sonicate fluid for the diagnosis of PJI. Strategies for metagenomic detection of antibiotic resistance genes for predicting staphylococcal antibacterial resistance need further development.
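The detection rates quoted in this abstract can be rechecked from the counts it gives. The following is a minimal sketch of that arithmetic only; the function name `detection_rate` is a hypothetical helper introduced for illustration and is not part of the CosmosID platform or the study's own analysis code.

```python
def detection_rate(detected: int, total: int) -> float:
    """Return the percentage of samples in which bacteria were detected,
    rounded to one decimal place as in the abstract."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * detected / total, 1)


# Counts taken from the abstract above.
culture_positive_rate = detection_rate(109, 115)  # culture-positive PJIs
culture_negative_rate = detection_rate(37, 98)    # culture-negative PJIs

# The 408 samples were split evenly into derivation and validation sets.
derivation_size = validation_size = 408 // 2

print(f"culture-positive: {culture_positive_rate}%")
print(f"culture-negative: {culture_negative_rate}%")
print(f"set sizes: {derivation_size} + {validation_size}")
```

Both percentages reproduce the values reported in the abstract (94.8% and 37.8%), and the two sets of 204 samples add up to the stated 408.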


Towards end-to-end disease prediction from raw metagenomic data

10.1101/2020.10.29.360297 ◽

2020 ◽

Author(s):

Maxence Queyrel ◽

Edi Prifti ◽

Jean-Daniel Zucker

Keyword(s):

Dna Sequences ◽

Real Life ◽

Multiple Instance Learning ◽

Disease Classification ◽

Metagenomic Data ◽

Numerical Representation ◽

Metagenomic Sequencing ◽

Sequencing Data ◽

End To End ◽

Bioinformatics Workflows

Abstract Analysis of the human microbiome using metagenomic sequencing data has demonstrated high ability in discriminating various human diseases. Raw metagenomic sequencing data require multiple complex and computationally heavy bioinformatics steps prior to data analysis. Such data contain millions of short sequences read from the fragmented DNA sequences and are stored as fastq files. Conventional processing pipelines consist of multiple steps, including quality control, filtering, and alignment of sequences against genomic catalogs (genes, species, taxonomic levels, functional pathways, etc.). These pipelines are complex to use, time consuming and rely on a large number of parameters that often introduce variability and affect the estimation of the microbiome elements. Recent studies have demonstrated that training Deep Neural Networks directly from raw sequencing data is a promising approach to bypass some of the challenges associated with mainstream bioinformatics pipelines. Most of these methods use the concept of word and sentence embeddings that create a meaningful and numerical representation of DNA sequences, while extracting features and reducing the dimensionality of the data. In this paper we present an end-to-end approach that classifies patients into disease groups directly from raw metagenomic reads: metagenome2vec. This approach is composed of four steps: (i) generating a vocabulary of k-mers and learning their numerical embeddings; (ii) learning DNA sequence (read) embeddings; (iii) identifying the genome from which the sequence is most likely to come and (iv) training a multiple instance learning classifier which predicts the phenotype based on the vector representation of the raw data. An attention mechanism is applied in the network so that the model can be interpreted, assigning a weight to the influence of the prediction for each genome. 
Using two public real-life datasets as well as a simulated one, we demonstrated that this original approach reached very high performance, comparable with state-of-the-art methods applied directly to processed data through mainstream bioinformatics workflows. These results are encouraging for this proof-of-concept work. We believe that with further dedication, the DNN models have the potential to surpass mainstream bioinformatics workflows in disease classification tasks.
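Step (i) of the approach described above, generating a vocabulary of k-mers, can be illustrated with a short sketch. This is an assumption-laden illustration rather than the metagenome2vec implementation: the function names `kmers` and `build_vocabulary`, the overlapping sliding window, and the frequency-based index ordering are all choices made here for the example.

```python
from collections import Counter


def kmers(read: str, k: int) -> list[str]:
    """Split a read into overlapping k-mers (sliding window, stride 1)."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def build_vocabulary(reads: list[str], k: int) -> dict[str, int]:
    """Map each observed k-mer to an integer index.

    Indices are assigned by descending frequency, with ties broken
    alphabetically so the result is deterministic (an illustrative choice).
    """
    counts = Counter()
    for read in reads:
        counts.update(kmers(read.upper(), k))
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return {kmer: index for index, (kmer, _) in enumerate(ordered)}


reads = ["ACGT", "CGTA"]  # toy reads, not real sequencing data
vocabulary = build_vocabulary(reads, k=3)
encoded = [[vocabulary[km] for km in kmers(r, 3)] for r in reads]
print(vocabulary)
print(encoded)
```

Each read becomes a sequence of integer tokens, which is the form an embedding model would consume in the later steps.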


Harnessing the strategy of metagenomics for exploring the intestinal microecology of sable (Martes zibellina), the national first-level protected animal

10.21203/rs.3.rs-28506/v3 ◽

2020 ◽

Author(s):

Jiakuo Yan ◽

Xiaoyang Wu ◽

Jun Chen ◽

Yao Chen ◽

Honghai Zhang

Keyword(s):

Information Processing ◽

Complex Structure ◽

Intestinal Flora ◽

Metagenomic Library ◽

Metagenomic Data ◽

Metagenomic Sequencing ◽

Sequencing Data ◽

Illumina Hiseq ◽

Martes Zibellina ◽

Gene Functions

Abstract Sable (Martes zibellina), a member of family Mustelidae, order Carnivora, is primarily distributed in the cold northern zone of Eurasia. The purpose of this study was to explore the intestinal flora of the sable by metagenomic library-based techniques. Libraries were sequenced on an Illumina HiSeq 4000 instrument. The effective sequencing data of each sample was above 6,000 M, and the ratio of clean reads to raw reads was over 98%. The total ORF length was approximately 603,031, equivalent to 347.36 Mbp. We investigated gene functions with the KEGG database and identified 7,140 KEGG ortholog (KO) groups comprising 129,788 genes across all of the samples. We selected a subset of genes with the highest abundances to construct cluster heat maps. From the results of the KEGG metabolic pathway annotations, we acquired information on gene functions, as represented by the categories of metabolism, environmental information processing, genetic information processing, cellular processes and organismal systems. We then investigated gene function with the CAZy database and identified functional carbohydrate hydrolases corresponding to genes in the intestinal microorganisms of sable. This finding is consistent with the fact that the sable is adapted to cold environments and requires a large amount of energy to maintain its metabolic activity. We also investigated gene functions with the eggNOG database; the main functions of genes included gene duplication, recombination and repair, transport and metabolism of amino acids, and transport and metabolism of carbohydrates. In this study, we attempted to identify the complex structure of the microbial population of sable based on metagenomic sequencing methods, which use whole metagenomic data, and to map the obtained sequences to known genes or pathways in existing databases, such as CAZy, KEGG, and eggNOG. 
We then explored the genetic composition and functional diversity of the microbial community based on the mapped functional categories.


Novel Feline Papillomavirus Isolate P20 Assembled from Metagenomic Data Isolated from Human Skin of a House Cat Owner

10.1101/2021.11.01.466825 ◽

2021 ◽

Author(s):

Ema Helene Graham ◽

Michael S. Adamowicz ◽

Peter Angeletti ◽

Jennifer Clarke ◽

Samodha Fernando ◽

...

Keyword(s):

Human Skin ◽

Genome Organization ◽

Metagenomic Data ◽

Metagenomic Sequencing ◽

Sequencing Data

A novel feline papillomavirus isolate was assembled from metagenomic sequencing data collected from the human skin of a house cat owner. This circular papillomavirus isolate P20 is 8069 bp in length and displays genome organization typical of feline papillomaviruses, but only exhibits approximately 75% synteny to other feline papillomaviruses.


BiomeSeq: A Tool for the Characterization of Animal Microbiomes from Metagenomic Data

10.21203/rs.3.rs-842545/v1 ◽

2021 ◽

Author(s):

Kelly A. Mulholland ◽

Calvin L. Keeler

Keyword(s):

Relative Abundance ◽

Performance Metrics ◽

Complete Characterization ◽

Metagenomic Data ◽

Metagenomic Sequencing ◽

Sequencing Data ◽

Microbial Composition ◽

Additional Species ◽

User Friendly

Abstract Background The complete characterization of a microbiome is critical in elucidating the complex ecology of the microbial composition within healthy and diseased animals. Many microbiome studies characterize only the bacterial component, for which there are several well-developed sequencing methods, bioinformatics tools and databases available. The lack of comprehensive bioinformatics workflows and databases has limited efforts to characterize the other components existing in a microbiome. BiomeSeq is a tool for the analysis of the complete animal microbiome using metagenomic sequencing data. With its comprehensive workflow and customizable parameters and microbial databases, BiomeSeq can rapidly quantify the viral, fungal, bacteriophage and bacterial components of a sample and produce informative tables for analysis. Results Simulated datasets were constructed, which contained known abundances of microbial sequences, and several performance metrics were analyzed, including correlation of predicted abundance with known abundance, root mean square error and rate of speed. BiomeSeq demonstrated high precision (average of 99.52%) and sensitivity (average of 93.01%). BiomeSeq was employed in detecting and quantifying the respiratory microbiome of a commercial poultry broiler flock throughout its grow-out cycle from hatching to processing and successfully processed 780 million reads. For each microbial species detected, BiomeSeq calculated the normalized abundance, percent relative abundance, and coverage as well as the diversity for each sample. Rate of speed for each step in the pipeline, precision and accuracy were calculated to examine BiomeSeq’s performance using in silico sequencing datasets. When compared to bacterial results generated by the commonly used 16S rRNA sequencing method, BiomeSeq detected the same most abundant bacteria, including Gallibacterium, Corynebacterium and Staphylococcus, as well as several additional species. 
Conclusions BiomeSeq provides for the detection and quantification of the microbiome from next-generation metagenomic sequencing data. This tool is implemented into a user-friendly container that requires one command and generates a table containing taxonomical information for each microbe detected. It also determines normalized abundance, percent relative abundance, genome coverage and sample diversity calculations for each sample.
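Two of the quantities named in this abstract, percent relative abundance and the precision and sensitivity metrics, follow standard definitions that can be sketched briefly. The code below is not BiomeSeq's implementation; the function names and the toy read counts are hypothetical, and the abstract does not say how normalized abundance or diversity are computed, so those are left out.

```python
def percent_relative_abundance(read_counts: dict[str, int]) -> dict[str, float]:
    """Express each taxon's read count as a percentage of all assigned reads."""
    total = sum(read_counts.values())
    if total == 0:
        return {taxon: 0.0 for taxon in read_counts}
    return {taxon: 100.0 * n / total for taxon, n in read_counts.items()}


def precision(true_pos: int, false_pos: int) -> float:
    """Fraction of reported detections that are correct."""
    return true_pos / (true_pos + false_pos)


def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of truly present items that were detected."""
    return true_pos / (true_pos + false_neg)


# Toy counts for illustration only; genus names are taken from the abstract.
counts = {"Gallibacterium": 50, "Corynebacterium": 30, "Staphylococcus": 20}
abundance = percent_relative_abundance(counts)
print(abundance)
print(precision(true_pos=99, false_pos=1), sensitivity(true_pos=93, false_neg=7))
```

With a simulated dataset of known composition, the true and false positive counts come from comparing detected taxa against the known list, which is how metrics of this kind are usually evaluated.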


Conserved bacterial genomes from two geographically distinct peritidal stromatolite formations shed light on potential functional guilds

10.1101/818625 ◽

2019 ◽

Author(s):

Samantha C. Waterworth ◽

Eric W. Isemonger ◽

Evan R. Rees ◽

Rosemary A. Dorrington ◽

Jason C. Kwan

Keyword(s):

Microbial Mats ◽

Bacterial Species ◽

Species Conservation ◽

Cumulative Effect ◽

Metagenomic Data ◽

Metagenomic Sequencing ◽

Sequencing Data ◽

Space Forms ◽

Nitrogenous Compounds ◽

Shark Bay

SUMMARY Stromatolites are complex microbial mats that form lithified layers and ancient forms are the oldest evidence of life on earth, dating back over 3.4 billion years. Modern stromatolites are relatively rare but may provide clues about the function and evolution of their ancient counterparts. In this study, we focus on peritidal stromatolites occurring at Cape Recife and Schoenmakerskop on the southeastern South African coastline. Using assembled shotgun metagenomic data we obtained 183 genomic bins, of which the most dominant taxa were from the Cyanobacteriia class (Cyanobacteria phylum), with lower but notable abundances of bacteria classified as Alphaproteobacteria, Gammaproteobacteria and Bacteroidia. We identified functional gene sets in bacterial species conserved across two geographically distinct stromatolite formations, which may promote carbonate precipitation through the reduction of nitrogenous compounds and possible production of calcium ions. We propose that an abundance of extracellular alkaline phosphatases may lead to the formation of phosphatic deposits within these stromatolites. We conclude that the cumulative effect of several conserved bacterial species drives accretion in these two stromatolite formations. ORIGINALITY-SIGNIFICANCE Peritidal stromatolites are unique among stromatolite formations as they grow at the dynamic interface of calcium carbonate-rich groundwater and coastal marine waters. The peritidal space forms a relatively unstable environment and the factors that influence the growth of these peritidal structures are not well understood. To our knowledge, this is the first comparative study that assesses species conservation within the microbial communities of two geographically distinct peritidal stromatolite formations. We assessed the potential functional roles of these communities using genomic bins clustered from metagenomic sequencing data. 
We identified several conserved bacterial species across the two sites and hypothesize that their genetic functional potential may be important in the formation of peritidal stromatolites. We contrasted these findings against a well-studied site in Shark Bay, Australia and show that, unlike these hypersaline formations, archaea do not play a major role in peritidal stromatolite formation. Furthermore, bacterial nitrogen and phosphate metabolisms of conserved species may be driving factors behind lithification in peritidal stromatolites.


Comprehensive discovery of CRISPR-targeted terminally redundant sequences in the human gut metagenome: Viruses, plasmids, and more

PLoS Computational Biology ◽

10.1371/journal.pcbi.1009428 ◽

2021 ◽

Vol 17 (10) ◽

pp. e1009428

Author(s):

Ryota Sugimoto ◽

Luca Nishimura ◽

Phuong Thanh Nguyen ◽

Jumpei Ito ◽

Nicholas F. Parrish ◽

...

Keyword(s):

De Novo ◽

Sequence Similarity ◽

Metagenomic Data ◽

Marker Genes ◽

Biological Entity ◽

Metagenomic Sequencing ◽

Sequencing Data ◽

Human Gut ◽

Protein Coding ◽

Viral Sequences

Viruses are the most numerous biological entity, existing in all environments and infecting all cellular organisms. Compared with cellular life, the evolution and origin of viruses are poorly understood; viruses are enormously diverse, and most lack sequence similarity to cellular genes. To uncover viral sequences without relying on either reference viral sequences from databases or marker genes that characterize specific viral taxa, we developed an analysis pipeline for virus inference based on clustered regularly interspaced short palindromic repeats (CRISPR). CRISPR is a prokaryotic nucleic acid restriction system that stores the memory of previous exposure. Our protocol can infer CRISPR-targeted sequences, including viruses, plasmids, and previously uncharacterized elements, and predict their hosts using unassembled short-read metagenomic sequencing data. By analyzing human gut metagenomic data, we extracted 11,391 terminally redundant CRISPR-targeted sequences, which are likely complete circular genomes. The sequences included 2,154 tailed-phage genomes, together with 257 complete crAssphage genomes, 11 genomes larger than 200 kilobases, 766 genomes of Microviridae species, 56 genomes of Inoviridae species, and 95 previously uncharacterized circular small genomes that have no reliably predicted protein-coding gene. We predicted the host(s) of approximately 70% of the discovered genomes at the taxonomic level of phylum by linking protospacers to taxonomically assigned CRISPR direct repeats. These results demonstrate that our protocol is efficient for de novo inference of CRISPR-targeted sequences and their host prediction.
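The idea that a terminally redundant sequence is likely a complete circular genome can be made concrete with a small sketch: if the start of a linear sequence repeats at its end, the sequence plausibly wraps around. The code below only illustrates that idea under simplifying assumptions (exact matching, a hypothetical minimum overlap of 10 bases, hypothetical function names) and is not the authors' pipeline.

```python
def terminal_redundancy_length(seq: str, min_len: int = 10) -> int:
    """Return the length of the longest prefix that also occurs as a suffix.

    Only overlaps of at least `min_len` bases and at most half the sequence
    are considered; 0 means no terminal redundancy was found.
    """
    seq = seq.upper()
    for n in range(len(seq) // 2, min_len - 1, -1):
        if seq[:n] == seq[-n:]:
            return n
    return 0


def trim_to_unit_genome(seq: str, min_len: int = 10) -> str:
    """Drop the repeated end so each base of the putative circle appears once."""
    n = terminal_redundancy_length(seq, min_len)
    return seq[:-n] if n else seq


# Toy sequence: a 10-base repeat flanking a 12-base core.
repeat = "ACGTACGTAC"
toy = repeat + "TTTTGGGGCCCC" + repeat
print(terminal_redundancy_length(toy))
print(trim_to_unit_genome(toy))
```

Real data would call for tolerance to sequencing errors, which exact string comparison does not provide.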


Harnessing the strategy of metagenomics for exploring the intestinal microecology of sable (Martes zibellina), the national first-level protected animal

AMB Express ◽

10.1186/s13568-020-01103-6 ◽

2020 ◽

Vol 10 (1) ◽

Author(s):

Jiakuo Yan ◽

Xiaoyang Wu ◽

Jun Chen ◽

Yao Chen ◽

Honghai Zhang

Keyword(s):

Information Processing ◽

Complex Structure ◽

Intestinal Flora ◽

Metagenomic Library ◽

Metagenomic Data ◽

Metagenomic Sequencing ◽

Sequencing Data ◽

Illumina Hiseq ◽

Martes Zibellina ◽

Gene Functions

Abstract Sable (Martes zibellina), a member of family Mustelidae, order Carnivora, is primarily distributed in the cold northern zone of Eurasia. The purpose of this study was to explore the intestinal flora of the sable by metagenomic library-based techniques. Libraries were sequenced on an Illumina HiSeq 4000 instrument. The effective sequencing data of each sample was above 6000 M, and the ratio of clean reads to raw reads was over 98%. The total ORF length was approximately 603,031, equivalent to 347.36 Mbp. We investigated gene functions with the KEGG database and identified 7140 KEGG ortholog (KO) groups comprising 129,788 genes across all of the samples. We selected a subset of genes with the highest abundances to construct cluster heat maps. From the results of the KEGG metabolic pathway annotations, we acquired information on gene functions, as represented by the categories of metabolism, environmental information processing, genetic information processing, cellular processes and organismal systems. We then investigated gene function with the CAZy database and identified functional carbohydrate hydrolases corresponding to genes in the intestinal microorganisms of sable. This finding is consistent with the fact that the sable is adapted to cold environments and requires a large amount of energy to maintain its metabolic activity. We also investigated gene functions with the eggNOG database; the main functions of genes included gene duplication, recombination and repair, transport and metabolism of amino acids, and transport and metabolism of carbohydrates. In this study, we attempted to identify the complex structure of the microbial population of sable based on metagenomic sequencing methods, which use whole metagenomic data, and to map the obtained sequences to known genes or pathways in existing databases, such as CAZy, KEGG, and eggNOG. 
We then explored the genetic composition and functional diversity of the microbial community based on the mapped functional categories.


Accurate and sensitive detection of microbial eukaryotes from whole metagenome shotgun sequencing

Microbiome ◽

10.1186/s40168-021-01015-y ◽

2021 ◽

Vol 9 (1) ◽

Author(s):

Abigail L. Lind ◽

Katherine S. Pollard

Keyword(s):

Gene Families ◽

Shotgun Sequencing ◽

Metagenomic Data ◽

Marker Genes ◽

Metagenomic Sequencing ◽

Sequencing Data ◽

Dna And Rna ◽

Paired Samples ◽

Microbial Eukaryotes ◽

Conserved Gene

Abstract Background Microbial eukaryotes are found alongside bacteria and archaea in natural microbial systems, including host-associated microbiomes. While microbial eukaryotes are critical to these communities, they are challenging to study with shotgun sequencing techniques and are therefore often excluded. Results Here, we present EukDetect, a bioinformatics method to identify eukaryotes in shotgun metagenomic sequencing data. Our approach uses a database of 521,824 universal marker genes from 241 conserved gene families, which we curated from 3713 fungal, protist, non-vertebrate metazoan, and non-streptophyte archaeplastida genomes and transcriptomes. EukDetect has a broad taxonomic coverage of microbial eukaryotes, performs well on low-abundance and closely related species, and is resilient against bacterial contamination in eukaryotic genomes. Using EukDetect, we describe the spatial distribution of eukaryotes along the human gastrointestinal tract, showing that fungi and protists are present in the lumen and mucosa throughout the large intestine. We discover that there is a succession of eukaryotes that colonize the human gut during the first years of life, mirroring patterns of developmental succession observed in gut bacteria. By comparing DNA and RNA sequencing of paired samples from human stool, we find that many eukaryotes continue active transcription after passage through the gut, though some do not, suggesting they are dormant or nonviable. We analyze metagenomic data from the Baltic Sea and find that eukaryotes differ across locations and salinity gradients. Finally, we observe eukaryotes in Arabidopsis leaf samples, many of which are not identifiable from public protein databases. Conclusions EukDetect provides an automated and reliable way to characterize eukaryotes in shotgun sequencing datasets from diverse microbiomes. 
We demonstrate that it enables discoveries that would be missed or clouded by false positives with standard shotgun sequence analysis. EukDetect will greatly advance our understanding of how microbial eukaryotes contribute to microbiomes.


Detecting and phasing minor single-nucleotide variants from long-read sequencing data

Nature Communications ◽

10.1038/s41467-021-23289-4 ◽

2021 ◽

Vol 12 (1) ◽

Author(s):

Zhixing Feng ◽

Jose C. Clemente ◽

Brandon Wong ◽

Eric E. Schadt

Keyword(s):

Genetic Heterogeneity ◽

Error Rates ◽

Metagenomic Data ◽

Sequencing Data ◽

Single Nucleotide Variants ◽

Single Nucleotide ◽

Sequencing Technologies ◽

Oxford Nanopore ◽

Long Read ◽

Technological Limitations

Abstract Cellular genetic heterogeneity is common in many biological conditions including cancer, microbiome, and co-infection of multiple pathogens. Detecting and phasing minor variants play an instrumental role in deciphering cellular genetic heterogeneity, but they are still difficult tasks because of technological limitations. Recently, long-read sequencing technologies, including those by Pacific Biosciences and Oxford Nanopore, provide an opportunity to tackle these challenges. However, high error rates make it difficult to take full advantage of these technologies. To fill this gap, we introduce iGDA, an open-source tool that can accurately detect and phase minor single-nucleotide variants (SNVs), whose frequencies are as low as 0.2%, from raw long-read sequencing data. We also demonstrate that iGDA can accurately reconstruct haplotypes in closely related strains of the same species (divergence ≥0.011%) from long-read metagenomic data.
