MALT: Fast alignment and analysis of metagenomic DNA sequence data applied to the Tyrolean Iceman

Mapping Intimacies ◽

10.1101/050559 ◽

2016 ◽

Cited By ~ 38

Author(s):

Alexander Herbig ◽

Frank Maixner ◽

Kirsten I. Bos ◽

Albert Zink ◽

Johannes Krause ◽

...

Keyword(s):

Large Scale ◽

Sequence Data ◽

Bacterial Species ◽

Human Microbiome ◽

Metagenomic Analysis ◽

Sequencing Data ◽

Metagenomic DNA ◽

Alignment Procedure ◽

Taxonomic Profile ◽

Tyrolean Iceman

Abstract. Modern next-generation sequencing technologies produce vast amounts of data in the context of large-scale metagenomic studies, in which complex microbial communities can be reconstructed to an unprecedented level of detail. The most prominent examples are human microbiome studies that correlate the bacterial taxonomic profile with specific physiological conditions or diseases. To perform these analyses, high-throughput computational tools are needed that can process these data within a short time while preserving a high level of sensitivity and specificity. Here we present MALT (MEGAN ALignment Tool), a program for the ultrafast alignment and analysis of metagenomic DNA sequencing data. MALT processes hundreds of millions of sequencing reads within only a few hours. In addition to the alignment procedure, MALT implements a taxonomic binning algorithm that is able to specifically assign reads to bacterial species. Its tight integration with the interactive metagenomic analysis software MEGAN allows for visualization and further analyses of results. We demonstrate MALT by its application to the metagenomic analysis of two ancient microbiomes from oral cavity and lung samples of the 5,300-year-old Tyrolean Iceman. Despite the strong environmental background, MALT is able to pick up the weak signal of the original microbiomes and identifies multiple species that are typical representatives of the respective host environment.
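The abstract does not spell out how reads are binned. One common strategy in this family of tools is lowest common ancestor (LCA) assignment: a read that aligns to several reference taxa is placed on the deepest taxonomy node shared by all of them. The following is a minimal sketch of that idea only, not MALT's implementation; the `TOY_PARENT` taxonomy and both function names are hypothetical.

```python
# Hypothetical toy taxonomy: child -> parent. "Bacteria" acts as the root.
TOY_PARENT = {
    "Streptococcus mutans": "Streptococcus",
    "Streptococcus mitis": "Streptococcus",
    "Streptococcus": "Bacteria",
    "Escherichia coli": "Escherichia",
    "Escherichia": "Bacteria",
}


def lineage(taxon, parent):
    """Return the path from a taxon up to the root (taxon first)."""
    path = [taxon]
    while taxon in parent:
        taxon = parent[taxon]
        path.append(taxon)
    return path


def lowest_common_ancestor(taxa, parent):
    """Return the deepest node shared by the lineages of all given taxa."""
    # Reverse each lineage so that index 0 is the root.
    paths = [lineage(t, parent)[::-1] for t in taxa]
    lca = None
    for level in zip(*paths):
        if len(set(level)) != 1:
            break  # lineages diverge below the previous level
        lca = level[0]
    return lca


if __name__ == "__main__":
    # Taxa hit by the alignments of one read (illustrative values).
    hits = ["Streptococcus mutans", "Streptococcus mitis"]
    print(lowest_common_ancestor(hits, TOY_PARENT))
```

A read whose hits all fall in one species stays at species level, while a read matching distant taxa is pushed up to a higher rank, which is how this scheme trades resolution for specificity.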


EdClust: A heuristic sequence clustering method with higher sensitivity

Journal of Bioinformatics and Computational Biology ◽

10.1142/s0219720021500360 ◽

2021 ◽

Author(s):

Ming Cao ◽

Qinke Peng ◽

Ze-Gang Wei ◽

Fei Liu ◽

Yi-Fan Hou

Keyword(s):

Large Scale ◽

Sequence Data ◽

Clustering Algorithms ◽

Clustering Methods ◽

Sequencing Data ◽

Clustering Method ◽

Cluster Number ◽

Sequence Clustering ◽

Downstream Analysis ◽

Heuristic Clustering

The development of high-throughput technologies has produced increasing amounts of sequence data and an increasing need for efficient clustering algorithms that can process massive volumes of sequencing data for downstream analysis. Heuristic clustering methods are widely applied for sequence clustering because of their low computational complexity. Although numerous heuristic clustering methods have been developed, they suffer from two limitations: overestimation of inferred clusters and low clustering sensitivity. To address these issues, we present a new sequence clustering method (edClust) based on Edlib, a C/C++ library for fast, exact semi-global sequence alignment to group similar sequences. The new method edClust was tested on three large-scale sequence databases, and we compared edClust to several classic heuristic clustering methods, such as UCLUST, CD-HIT, and VSEARCH. Evaluations based on the metrics of cluster number and seed sensitivity (SS) demonstrate that edClust can produce fewer clusters than other methods and that its SS is higher than that of other methods. The source code of edClust is available from https://github.com/zhang134/EdClust.git under the GNU GPL license.
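To make the idea of heuristic (greedy, seed-based) clustering concrete, here is a minimal sketch. It is not edClust: it uses a plain global Levenshtein distance written in pure Python rather than Edlib's semi-global alignment, and the function names and the `min_identity` parameter are assumptions made for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Global Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # match or substitution
            ))
        prev = curr
    return prev[-1]


def greedy_cluster(seqs, min_identity=0.9):
    """Assign each sequence to the first seed it matches, else make it a new seed."""
    clusters = []  # list of (seed, members) pairs
    # Longest sequences are processed first, so they become the seeds.
    for s in sorted(seqs, key=len, reverse=True):
        for seed, members in clusters:
            identity = 1 - edit_distance(s, seed) / max(len(s), len(seed))
            if identity >= min_identity:
                members.append(s)
                break
        else:
            clusters.append((s, [s]))
    return clusters


if __name__ == "__main__":
    reads = ["ACGTACGTAC", "ACGTACGTAA", "TTTTTTTTTT"]
    for seed, members in greedy_cluster(reads, min_identity=0.8):
        print(seed, len(members))
```

Because each sequence is compared only against existing seeds, the result depends on input order and on the identity threshold, which is one source of the cluster-number inflation the abstract describes.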


gplas: a comprehensive tool for plasmid analysis using short-read graphs

Bioinformatics ◽

10.1093/bioinformatics/btaa233 ◽

2020 ◽

Vol 36 (12) ◽

pp. 3874-3876 ◽

Cited By ~ 1

Author(s):

Sergio Arredondo-Alonso ◽

Martin Bootsma ◽

Yaïr Hein ◽

Malbert R C Rogers ◽

Jukka Corander ◽

...

Keyword(s):

Large Scale ◽

Sequence Data ◽

Bacterial Genome ◽

Workflow Management ◽

Supplementary Information ◽

Whole Genome Sequencing Data ◽

Network Partitioning ◽

Sequencing Data ◽

Genetic Traits ◽

Short Read

Abstract. Summary: Plasmids can horizontally transmit genetic traits, enabling rapid bacterial adaptation to new environments and hosts. Short-read whole-genome sequencing data are often applied to large-scale bacterial comparative genomics projects, but the reconstruction of plasmids from these data faces severe limitations, such as the inability to distinguish plasmids from each other in a bacterial genome. We developed gplas, a new approach to reliably separate plasmid contigs into discrete components using sequence composition, coverage, assembly graph information and network partitioning based on a pruned network of plasmid unitigs. Gplas facilitates the analysis of large numbers of bacterial isolates and allows a detailed analysis of plasmid epidemiology based solely on short-read sequence data. Availability and implementation: Gplas is written in R and Bash and uses a Snakemake pipeline as a workflow management system. Gplas is available under the GNU General Public License v3.0 at https://gitlab.com/sirarredondo/gplas.git. Supplementary information: Supplementary data are available at Bioinformatics online.


A high-resolution pipeline for 16S-sequencing identifies bacterial strains in human microbiome

10.1101/565572 ◽

2019 ◽

Cited By ~ 1

Author(s):

Igor Segota ◽

Tao Long

Keyword(s):

Bacterial Species ◽

Human Microbiome ◽

Amplicon Sequencing ◽

R Package ◽

Strain Level ◽

Sequencing Data ◽

Bacterial Strains ◽

16S Sequencing ◽

16S Amplicon Sequencing ◽

Sequencing Data Analysis

We developed a High-resolution Microbial Analysis Pipeline (HiMAP) for 16S amplicon sequencing data analysis, aiming at bacterial species- or strain-level identification from the human microbiome to enable experimental validation of causal effects of the associated bacterial strains on health and disease. HiMAP achieved higher accuracy in identifying species in a human microbiome mock community than other pipelines. HiMAP identified the majority of the species detected by whole-genome shotgun sequencing using MetaPhlAn2, with strain-level resolution wherever possible, and reported comparable relative abundances. HiMAP is an open-source R package available at https://github.com/taolonglab/himap.


Searching more genomic sequence with less memory for fast and accurate metagenomic profiling

10.1101/036681 ◽

2016 ◽

Author(s):

Shea N Gardner ◽

Sasha K Ames ◽

Maya B Gokhale ◽

Tom R Slezak ◽

Jonathan Allen

Keyword(s):

Large Scale ◽

Genomic Sequence ◽

Sequence Data ◽

Low Cost ◽

False Negative ◽

Human Microbiome ◽

Human Microbiome Project ◽

Metagenomic Data ◽

Reference Database ◽

Metagenomic Sequence

Software for rapid, accurate, and comprehensive microbial profiling of metagenomic sequence data on a desktop will play an important role in large scale clinical use of metagenomic data. Here we describe LMAT-ML (Livermore Metagenomics Analysis Toolkit-Marker Library) which can be run with 24 GB of DRAM memory, an amount available on many clusters, or with 16 GB DRAM plus a 24 GB low cost commodity flash drive (NVRAM), a cost effective alternative for desktop or laptop users. We compared results from LMAT with five other rapid, low-memory tools for metagenome analysis for 131 Human Microbiome Project samples, and assessed discordant calls with BLAST. All the tools except LMAT-ML reported overly specific or incorrect species and strain resolution of reads that were in fact much more widely conserved across species, genera, and even families. Several of the tools misclassified reads from synthetic or vector sequence as microbial or human reads as viral. We attribute the high numbers of false positive and false negative calls to a limited reference database with inadequate representation of known diversity. Our comparisons with real world samples show that LMAT-ML is the only tool tested that classifies the majority of reads, and does so with high accuracy.


Multiomic Strategies Reveal Diversity and Important Functional Aspects of Human Gut Microbiome

BioMed Research International ◽

10.1155/2018/6074918 ◽

2018 ◽

Vol 2018 ◽

pp. 1-13 ◽

Cited By ~ 1

Author(s):

Ravi Ranjan ◽

Asha Rani ◽

Patricia W. Finn ◽

David L. Perkins

Keyword(s):

Sequence Data ◽

Bacterial Species ◽

Abundant Species ◽

Shotgun Sequencing ◽

Sequencing Analysis ◽

Sequencing Data ◽

Illumina Platform ◽

Shotgun Metagenomics ◽

Functional Aspects ◽

Detectable Bias

It is well accepted that dysbiosis of microbiota is associated with disease; however, the biological mechanisms that promote susceptibility or resilience to disease remain elusive. One of the major limitations of previous microbiome studies has been the lack of metatranscriptomic (functional) data to complement the interpretation of metagenomic (bacterial abundance) data. The purpose of this study was twofold: first, to evaluate the bacterial diversity and differential gene expression of gut microbiota using complementary shotgun metagenomics (MG) and metatranscriptomics (MT) from the same fecal sample; second, to compare sequence data from different Illumina platforms and with different sequencing parameters as new sequencers are introduced, and to determine whether the data are comparable across platforms. In this study, we perform ultradeep metatranscriptomic shotgun sequencing for a sample that we previously analyzed with metagenomic shotgun sequencing. We performed sequencing analysis using different Illumina platforms, with different sequencing and analysis parameters. Our results suggest that the use of different Illumina platforms did not lead to detectable bias in the sequencing data. The analysis of the sample using the MG and MT approaches shows that the genes of some species are more highly represented in the MT than in the MG, indicating that some species are highly metabolically active. Our analysis also shows that ~52% of the genes in the metagenome are in the metatranscriptome and therefore are robustly expressed. The functions of low-abundance and rare bacterial species remain poorly understood. Our observations indicate that among the low-abundance species analyzed in this study, some were more metabolically active than others and can contribute distinct profiles of biological functions that may modulate host-microbiota and bacteria-bacteria interactions.


gplas: a comprehensive tool for plasmid analysis using short-read graphs

10.1101/835900 ◽

2019 ◽

Cited By ~ 1

Author(s):

Sergio Arredondo-Alonso ◽

Martin Bootsma ◽

Yaïr Hein ◽

Malbert R.C. Rogers ◽

Jukka Corander ◽

...

Keyword(s):

Large Scale ◽

Sequence Data ◽

Bacterial Genome ◽

Workflow Management ◽

Whole Genome Sequencing Data ◽

Sequencing Data ◽

Genetic Traits ◽

Short Read ◽

Sequence Composition ◽

Short Read Sequence

Abstract. Summary: Plasmids can horizontally transmit genetic traits, enabling rapid bacterial adaptation to new environments and hosts. Short-read whole-genome sequencing data are often applied to large-scale bacterial comparative genomics projects, but the reconstruction of plasmids from these data faces severe limitations, such as the inability to distinguish plasmids from each other in a bacterial genome. We developed gplas, a new approach to reliably separate plasmid contigs into discrete components using sequence composition, coverage, assembly graph information and clustering based on a pruned network of plasmid unitigs. Gplas facilitates the analysis of large numbers of bacterial isolates and allows a detailed analysis of plasmid epidemiology based solely on short-read sequence data. Availability and implementation: Gplas is written in R and Bash and uses a Snakemake pipeline as a workflow management system. Gplas is available under the GNU General Public License v3.0 at https://gitlab.com/sirarredondo/[email protected]


CRAFT: Compact genome Representation toward large-scale Alignment-Free daTabase

Bioinformatics ◽

10.1093/bioinformatics/btaa699 ◽

2020 ◽

Author(s):

Yang Young Lu ◽

Jiaxing Bai ◽

Yiwen Wang ◽

Ying Wang ◽

Fengzhu Sun

Keyword(s):

DNA Sequences ◽

Sequence Comparison ◽

Large Scale ◽

High Throughput Sequencing ◽

Sequence Data ◽

Practical Interest ◽

Supplementary Information ◽

Sequencing Data ◽

Computationally Efficient ◽

Alignment Free

Abstract. Motivation: Rapid developments in sequencing technologies have boosted the generation of high volumes of sequence data. To archive and analyze those data, one primary step is sequence comparison. Alignment-free sequence comparison based on k-mer frequencies offers a computationally efficient solution, yet in practice, the k-mer frequency vectors for large k of practical interest lead to excessive memory and storage consumption. Results: We report CRAFT, a general genomic/metagenomic search engine that learns compact representations of sequences and performs fast comparison between DNA sequences. Specifically, given genome or high-throughput sequencing data as input, CRAFT maps the data into a much smaller embedding space and locates the best-matching genome in the archived massive sequence repositories. With a 10^2- to 10^4-fold reduction of storage space, CRAFT performs fast queries for gigabytes of data within seconds or minutes, achieving performance comparable to six state-of-the-art alignment-free measures. Availability and implementation: CRAFT offers a user-friendly graphical user interface with one-click installation on Windows and Linux operating systems, freely available at https://github.com/jiaxingbai/CRAFT. Supplementary information: Supplementary data are available at Bioinformatics online.
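The memory problem mentioned in the motivation follows from the size of a dense k-mer frequency vector, which has 4^k entries for DNA. The sketch below illustrates plain k-mer counting and that growth; it does not reproduce CRAFT's learned embedding, and the function names are hypothetical.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq: str, k: int) -> dict:
    """Return normalized k-mer frequencies, skipping k-mers with non-ACGT characters."""
    seq = seq.upper()
    counts = Counter(
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= set("ACGT")
    )
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}


def dense_vector(freqs: dict, k: int) -> list:
    """Expand sparse frequencies into a dense vector with one entry per possible k-mer."""
    return [freqs.get("".join(p), 0.0) for p in product("ACGT", repeat=k)]


if __name__ == "__main__":
    freqs = kmer_frequencies("ACGTACGT", 2)
    print(len(dense_vector(freqs, 2)))  # one entry per possible 2-mer
    # The dense dimension grows as 4**k, independent of the sequence length.
    for k in (4, 8, 12):
        print(k, 4 ** k)
```

At k = 12 a dense vector already has about 16.8 million entries per sequence, which is the kind of cost that motivates mapping sequences into a smaller representation.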


Streamlining Data-Intensive Biology With Workflow Systems

10.1101/2020.06.30.178673 ◽

2020 ◽

Cited By ~ 1

Author(s):

Taylor Reiter ◽

Phillip T. Brooks ◽

Luiz Irber ◽

Shannon E.K. Joslin ◽

Charles M. Reid ◽

...

Keyword(s):

Data Analysis ◽

Large Scale ◽

High Throughput Sequencing ◽

Sequence Data ◽

Open Science ◽

Biological Data ◽

Data Generation ◽

Biological Sequence ◽

Sequencing Data ◽

Workflow Systems

Abstract. As the scale of biological data generation has increased, the bottleneck of research has shifted from data generation to analysis. Researchers commonly need to build computational workflows that include multiple analytic tools and require incremental development as experimental insights demand tool and parameter modifications. These workflows can produce hundreds to thousands of intermediate files and results that must be integrated for biological insight. Data-centric workflow systems that internally manage computational resources, software, and conditional execution of analysis steps are reshaping the landscape of biological data analysis, and empowering researchers to conduct reproducible analyses at scale. Adoption of these tools can facilitate and expedite robust data analysis, but knowledge of these techniques is still lacking. Here, we provide a series of practices and strategies for leveraging workflow systems with structured project, data, and resource management to streamline large-scale biological analysis. We present these strategies in the context of high-throughput sequencing data analysis, but the principles are broadly applicable to biologists working beyond this field. Author Summary: We present a guide for workflow-enabled biological sequence data analysis, developed through our own teaching, training and analysis projects. We recognize that this is based on our own use cases and experiences, but we hope that our guide will contribute to a larger discussion within the open source and open science communities and lead to more comprehensive resources. Our main goal is to accelerate the research of scientists conducting sequence analyses by introducing them to organized workflow practices that not only benefit their own research but also facilitate open and reproducible science.


Application of genotyping-by-sequencing data on inferring the phylogeny of Curcuma (Zingiberaceae) from China

10.21203/rs.2.15210/v1 ◽

2019 ◽

Author(s):

Heng Liang ◽

Yan Zhang ◽

Jiabing Deng ◽

Gang Gao ◽

Chunbang Ding ◽

...

Keyword(s):

Phylogenetic Relationships ◽

Phylogenetic Trees ◽

Large Scale ◽

Reference Genome ◽

Genomic Sequence ◽

Sequence Data ◽

Morphological Differentiation ◽

Genotyping By Sequencing ◽

Tibet Plateau ◽

Sequencing Data

Abstract. Background: Genotyping-by-sequencing (GBS), a next-generation sequencing approach, has been applied to large-scale genotyping in plants, including groups with poor morphological differentiation and low genetic divergence among species. Curcuma is an important medicinal and edible genus. Resolving its phylogenetic relationships and delimiting its species remain challenging because of poor morphological differentiation and the lack of a reference genome. Results: High-throughput genomic sequence data obtained through GBS protocols were used to investigate the relationships among 8 species of Curcuma, with 60 samples in total. Using the ipyrad software, 437,061 loci and 997,988 filtered SNPs were produced without reliance on a reference genome. After quality control (QC) of the filtered SNPs, 1,295 high-quality SNPs were used to clarify the phylogenetic relationships among Curcuma species. Based on these data, a supermatrix approach was used to infer the phylogenetic trees and relationships. Conclusions: Varying degrees of support can be explained, as well as the diversification events for Chinese Curcuma. The diversification events showed that the third intense uplift of the Qinghai–Tibet Plateau (QTP) and the formation of the Hengduan Mountains may have sped up interspecific divergence of Curcuma in China. The PCA was consistent with the topology of the phylogenetic tree. The genetic structure analysis revealed that extensive hybridization may exist in Chinese Curcuma. Additionally, GBS is a promising approach for future phylogenetic and systematic studies.


HLA-A alleles including HLA-A29 affect the composition of the gut microbiome: a potential clue to the pathogenesis of birdshot retinochoroidopathy

Scientific Reports ◽

10.1038/s41598-020-74751-0 ◽

2020 ◽

Vol 10 (1) ◽

Author(s):

Peter R. Sternes ◽

Tammy M. Martin ◽

Michael Paley ◽

Sarah Diamond ◽

Mark J. Asquith ◽

...

Keyword(s):

Gut Microbiome ◽

Bacterial Species ◽

Human Microbiome ◽

Human Microbiome Project ◽

Eye Disease ◽

Intestinal Biopsy ◽

Healthy Individuals ◽

Sequencing Data ◽

Immune Mediated ◽

Bacterial Profiling

Abstract. Birdshot retinochoroidopathy occurs exclusively in individuals who are HLA-A29 positive. The mechanism to account for this association is unknown. The gut microbiome has been causally implicated in many immune-mediated diseases. We hypothesized that HLA-A29 would affect the composition of the gut microbiome, leading to a dysbiosis and immune-mediated eye disease. Fecal and intestinal biopsy samples were obtained from 107 healthy individuals from the Portland, Oregon, area who were undergoing routine colonoscopy, 10 of whom were HLA-A29 positive. Bacterial profiling was achieved via 16S rRNA metabarcoding. Publicly available whole-metagenome sequencing data from the Human Microbiome Project (HMP), consisting of 298 healthy controls mostly of US origin, were also interrogated. PERMANOVA and sparse partial least squares discriminant analysis (sPLSDA) demonstrated that subjects who were HLA-A29 positive differed in bacterial species composition (beta diversity) compared to HLA-A29 negative subjects in both the Portland (p = 0.019) and HMP cohorts (p = 0.0002). The Portland and HMP cohorts evidenced different subsets of bacterial species associated with HLA-A29 status, likely due to differences in the metagenomic techniques employed. The functional composition of the HMP cohort did not differ overall (p = 0.14) between HLA-A29 positive and negative subjects, although some distinct pathways such as heparan sulfate biosynthesis showed differences. As we and others have shown for various HLA alleles, the HLA allotype impacts the composition of the microbiome. We hypothesize that HLA-A29 may predispose to chorioretinitis via an altered gut microbiome.
