An evaluation of the accuracy and speed of metagenome analysis tools

2016 ◽  
Vol 6 (1) ◽  
Author(s):  
Stinus Lindgreen ◽  
Karen L. Adair ◽  
Paul P. Gardner

Abstract Metagenome studies are becoming increasingly widespread, yielding important insights into microbial communities covering diverse environments from terrestrial and aquatic ecosystems to human skin and gut. With the advent of high-throughput sequencing platforms, the use of large-scale shotgun sequencing approaches is now commonplace. However, a thorough, independent benchmark comparing state-of-the-art metagenome analysis tools is lacking. Here, we present a benchmark in which the most widely used tools are tested on complex, realistic data sets. Our results clearly show that the most widely used tools are not necessarily the most accurate, that the most accurate tool is not necessarily the most time-consuming, and that there is a high degree of variability between available tools. These findings are important, as the conclusions of any metagenomics study are affected by errors in the predicted community composition and functional capacity. Data sets and results are freely available from http://www.ucbioinformatics.org/metabenchmark.html
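The accuracy comparison described above can be sketched with a toy scoring function: a predicted taxonomic profile is compared against the known truth by summing absolute abundance differences. The function name, taxa, and abundances below are hypothetical illustrations, not the study's actual data or metric.

```python
# Minimal sketch: scoring a predicted community profile against a known
# truth, as a metagenome benchmark might. All names and numbers are made up.

def l1_error(truth, predicted):
    """Sum of absolute differences between relative abundances.

    0.0 = perfect agreement; 2.0 = completely disjoint profiles."""
    taxa = set(truth) | set(predicted)
    return sum(abs(truth.get(t, 0.0) - predicted.get(t, 0.0)) for t in taxa)

truth = {"Escherichia": 0.5, "Bacteroides": 0.3, "Lactobacillus": 0.2}
tool_a = {"Escherichia": 0.45, "Bacteroides": 0.35, "Lactobacillus": 0.2}
tool_b = {"Escherichia": 0.9, "Clostridium": 0.1}

print(round(l1_error(truth, tool_a), 2))  # 0.1
print(round(l1_error(truth, tool_b), 2))  # 1.0
```

A benchmark would pair a profile error such as this with wall-clock time per tool, since the abstract's point is that accuracy and speed need not trade off.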


2016 ◽  
Author(s):  
George Dimitriadis ◽  
Joana Neto ◽  
Adam R. Kampff

Abstract Electrophysiology is entering the era of 'Big Data'. Multiple probes, each with hundreds to thousands of individual electrodes, are now capable of simultaneously recording from many brain regions. The major challenge confronting these new technologies is transforming the raw data into physiologically meaningful signals, i.e., single-unit spikes. Sorting the spike events of individual neurons from a spatiotemporally dense sampling of the extracellular electric field is a problem that has attracted much attention [22, 23] but is still far from solved. Current methods still rely on human input and thus become infeasible as data sets grow exponentially. Here we introduce the t-distributed stochastic neighbor embedding (t-SNE) dimensionality reduction method [27] as a visualization tool in the spike-sorting process. t-SNE embeds the n-dimensional extracellular spikes (where n is the number of features into which each spike is decomposed) into a low-dimensional (usually two-dimensional) space. We show that such embeddings, even when starting from different feature spaces, form distinct clusters of spikes that can be easily visualized and manually delineated with a high degree of precision. We propose that these clusters represent single units, and we test this assertion by applying our algorithm to labeled data sets from both hybrid [23] and paired juxtacellular/extracellular recordings [15]. We have released a graphical user interface (GUI) written in Python as a tool for manual clustering of the t-SNE-embedded spikes and for an informed overview and fast manual curation of results from other clustering algorithms. Furthermore, the generated visualizations offer evidence in favor of probes with higher density and smaller electrodes. They also graphically demonstrate the diverse nature of the sorting problem when spikes are recorded with different methods and arise from regions with different background spiking statistics.
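The embedding step described above can be sketched with scikit-learn's t-SNE on a synthetic spike-feature matrix. The feature values and cluster structure below are fabricated for illustration; this is not the paper's pipeline or data, and it assumes NumPy and scikit-learn are installed.

```python
# Sketch: embedding n-dimensional spike features into 2-D with t-SNE,
# in the spirit of the visualization tool described above. Synthetic data.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Fake feature matrix: 60 "spikes", each decomposed into 8 features
# (e.g., principal components of the waveform), drawn from two synthetic units.
unit_a = rng.normal(loc=0.0, scale=0.3, size=(30, 8))
unit_b = rng.normal(loc=2.0, scale=0.3, size=(30, 8))
spikes = np.vstack([unit_a, unit_b])

# Embed the 8-dimensional spikes into 2 dimensions for visual clustering.
embedding = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(spikes)
print(embedding.shape)  # (60, 2)
```

In practice the 2-D embedding would be scattered on screen so a human can delineate clusters, which is where the GUI described in the abstract comes in.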


SOIL ◽  
2016 ◽  
Vol 2 (2) ◽  
pp. 257-270 ◽  
Author(s):  
Mohammed Ahmed ◽  
Melanie Sapp ◽  
Thomas Prior ◽  
Gerrit Karssen ◽  
Matthew Alan Back

Abstract. Nematodes represent a species-rich and morphologically diverse group of metazoans known to inhabit both aquatic and terrestrial environments. Their role as biological indicators and as key players in nutrient cycling has been well documented. Some plant-parasitic species are also known to cause significant losses to crop production. In spite of this, there still exists a huge gap in our knowledge of their diversity due to the considerable time and expertise often involved in characterising species using phenotypic features. Molecular methodology provides a useful means of complementing the limited number of reliable diagnostic characters available for morphology-based identification. We discuss herein some of the limitations of traditional taxonomy and how molecular methodologies, especially high-throughput sequencing, have assisted in carrying out large-scale nematode community studies and characterisation of phytonematodes through rapid identification of multiple taxa. We also provide brief descriptions of some of the current and older high-throughput sequencing platforms and their applications in both plant nematology and soil ecology.


2015 ◽  
Vol 2015 ◽  
pp. 1-10 ◽  
Author(s):  
Yipu Zhang ◽  
Ping Wang

The new high-throughput technique ChIP-seq, which couples chromatin immunoprecipitation experiments with high-throughput sequencing, has extended the identification of a transcription factor's binding locations to genome-wide regions. However, most existing motif discovery algorithms are time-consuming and limited in their ability to identify binding motifs in ChIP-seq data, which is typically large scale. To improve efficiency, we propose a fast cluster motif finding algorithm, named FCmotif, to identify (l, d) motifs in large-scale ChIP-seq data sets. Inspired by the emerging-substrings mining strategy, it finds enriched substrings and then searches neighborhood instances to construct position weight matrices (PWMs) and cluster motifs of different lengths. FCmotif does not follow the OOPS model constraint and can find long motifs. The effectiveness of the proposed algorithm has been demonstrated by experiments on ChIP-seq data sets from mouse ES cells: detection of the real binding motifs and processing of the full-size data of several megabytes finished in a few minutes. The experimental results show that FCmotif handles (l, d) motif finding in ChIP-seq data well and outperforms other widely used algorithms such as MEME, Weeder, ChIPMunk, and DREME.
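The (l, d) motif model mentioned above means that every planted instance of an l-mer motif lies within Hamming distance d of the consensus, so candidate motifs can be enumerated via d-neighborhoods of observed substrings. The sketch below shows only that neighborhood-generation idea; it is not FCmotif's actual emerging-substrings strategy, and the function name is illustrative.

```python
# Sketch of the (l, d) neighborhood underlying substring-based motif finders:
# all strings within Hamming distance d of a given l-mer.
from itertools import combinations, product

ALPHABET = "ACGT"

def neighborhood(lmer, d):
    """All strings within Hamming distance <= d of lmer."""
    out = set()
    for positions in combinations(range(len(lmer)), d):
        for subs in product(ALPHABET, repeat=d):
            variant = list(lmer)
            for pos, base in zip(positions, subs):
                variant[pos] = base
            out.add("".join(variant))
    return out

# For a 4-mer and d = 1: the 4-mer itself plus 3 substitutions x 4 positions.
print(len(neighborhood("ACGT", 1)))  # 13
```

A motif finder would intersect or count such neighborhoods across sequences to find enriched candidates before building a PWM from the matching instances.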


2021 ◽  
Author(s):  
Miguel Mendez Sandin ◽  
Sarah Romac ◽  
Fabrice Not

Ribosomal DNA (rDNA) genes are known to be valuable markers for the barcoding of eukaryotic life and its phylogenetic classification at various taxonomic levels. Large-scale exploration of environmental microbial diversity through metabarcoding approaches has focused mainly on the hypervariable V4 and V9 regions of the 18S rDNA gene. Yet accurate interpretation of such environmental surveys is hampered by technical biases (e.g., PCR and sequencing errors) and biological biases (e.g., intra-genomic variability). Here we explored the intra-genomic diversity of Nassellaria and Spumellaria specimens (Radiolaria) by comparing Sanger sequencing with two different high-throughput sequencing platforms: Illumina and Oxford Nanopore Technologies (MinION). Our analysis determined that intra-genomic variability of Nassellaria and Spumellaria is generally low, yet in some Spumellaria specimens we found two different copies of the V4 region with less than 97% similarity. Of the sequencing methods, Illumina showed the highest number of contaminations (i.e., environmental DNA, cross-contamination, tag-jumping), revealed by its high sequencing depth, and MinION showed the highest sequencing error rate (~14%). Yet the long reads produced by MinION (~2900 bp) allowed accurate phylogenetic reconstruction studies. These results highlight the need for careful interpretation of Illumina-based metabarcoding studies, particularly regarding low-abundance amplicons, and open future perspectives toward full environmental rDNA metabarcoding surveys.
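The 97% similarity threshold mentioned above is a percent-identity comparison between aligned copies of a marker region. The sketch below shows that calculation on toy aligned sequences; the function name and sequences are illustrative, not the study's Spumellaria data.

```python
# Sketch: percent identity between two pre-aligned rDNA copies, the kind of
# check used to flag intra-genomic V4 copies below 97% similarity. Toy data.

def percent_identity(a, b):
    """Identity over an existing alignment of equal length.

    Aligned gap pairs are not counted as matches."""
    assert len(a) == len(b), "sequences must be pre-aligned"
    matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return 100.0 * matches / len(a)

copy1 = "ACGTACGTACGTACGTACGT"
copy2 = "ACGTACGAACGTACTTACGT"  # 2 substitutions out of 20 positions
print(percent_identity(copy1, copy2))  # 90.0
```

Real comparisons would first align the copies (e.g., with a pairwise aligner) before measuring identity, since indels shift positions.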


2020 ◽  
Vol 37 (10) ◽  
pp. 3047-3060
Author(s):  
Xiang Ji ◽  
Zhenyu Zhang ◽  
Andrew Holbrook ◽  
Akihiko Nishimura ◽  
Guy Baele ◽  
...  

Abstract Calculation of the log-likelihood stands as the computational bottleneck for many statistical phylogenetic algorithms. Even worse is its gradient evaluation, often used to target regions of high probability. O(N)-dimensional gradient calculations based on the standard pruning algorithm require O(N²) operations, where N is the number of sampled molecular sequences. With the advent of high-throughput sequencing, recent phylogenetic studies have analyzed hundreds to thousands of sequences, with an apparent trend toward even larger data sets as a result of advancing technology. Such large-scale analyses challenge phylogenetic reconstruction by requiring inference on larger sets of process parameters to model the increasing data heterogeneity. To make these analyses tractable, we present a linear-time algorithm for O(N)-dimensional gradient evaluation and apply it to general continuous-time Markov processes of sequence substitution on a phylogenetic tree, without a need to assume either stationarity or reversibility. We apply this approach to learn the branch-specific evolutionary rates of three pathogenic viruses: West Nile virus, Dengue virus, and Lassa virus. Our proposed algorithm significantly improves inference efficiency with a 126- to 234-fold increase in maximum-likelihood optimization and a 16- to 33-fold computational performance increase in a Bayesian framework.
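The pruning algorithm whose gradient is being accelerated above computes per-site likelihoods bottom-up along the tree. A minimal sketch for a two-leaf tree under the Jukes-Cantor substitution model follows; this illustrates only the likelihood recursion, not the paper's O(N)-gradient algorithm, and the function names are illustrative.

```python
# Sketch: one-site pruning (Felsenstein) likelihood on a two-leaf tree
# under Jukes-Cantor. Toy example of the computation being accelerated above.
import numpy as np

def jc69_transition(t):
    """4x4 Jukes-Cantor transition probability matrix for branch length t."""
    same = 0.25 + 0.75 * np.exp(-4.0 * t / 3.0)
    diff = 0.25 - 0.25 * np.exp(-4.0 * t / 3.0)
    return np.full((4, 4), diff) + np.eye(4) * (same - diff)

def site_likelihood(tip1, tip2, t1, t2):
    """Likelihood of one site on a two-leaf tree rooted between the tips."""
    bases = "ACGT"
    L1 = np.eye(4)[bases.index(tip1)]  # one-hot partial likelihood at tip 1
    L2 = np.eye(4)[bases.index(tip2)]  # one-hot partial likelihood at tip 2
    # Root partials: elementwise product of the messages from each child.
    root = (jc69_transition(t1) @ L1) * (jc69_transition(t2) @ L2)
    return float(0.25 * root.sum())   # uniform root base frequencies

print(site_likelihood("A", "A", 0.1, 0.1))  # higher: identical tips
print(site_likelihood("A", "C", 0.1, 0.1))  # lower: mismatched tips
```

For N tips the same recursion runs once per internal node, which is the O(N) cost per site; the naive gradient repeats it per branch, giving the O(N²) the paper removes.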


2010 ◽  
Vol 30 (3) ◽  
pp. 58-70 ◽  
Author(s):  
Won-Ki Jeong ◽  
Johanna Beyer ◽  
Markus Hadwiger ◽  
Rusty Blue ◽  
Charles Law ◽  
...  

2008 ◽  
Vol 7 (1) ◽  
pp. 18-33 ◽  
Author(s):  
Niklas Elmqvist ◽  
John Stasko ◽  
Philippas Tsigas

Supporting visual analytics of multiple large-scale multidimensional data sets requires a high degree of interactivity and user control beyond the conventional challenges of visualizing such data sets. We present the DataMeadow, a visual canvas providing rich interaction for constructing visual queries using graphical set representations called DataRoses. A DataRose is essentially a star plot of selected columns in a data set, displayed as a multivariate visualization with dynamic query sliders integrated into each axis. The purpose of the DataMeadow is to allow users to create advanced visual queries by iteratively selecting and filtering the multidimensional data. Furthermore, the canvas provides a clear history of the analysis that can be annotated to facilitate dissemination of analytical results to stakeholders. A powerful direct-manipulation interface allows for selection, filtering, and creation of sets, subsets, and data dependencies. We have evaluated our system in a qualitative expert review involving two visualization researchers; results from this review are favorable for the new method.
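The dynamic query sliders described above reduce, at their core, to per-axis range filters: a row survives only if it falls inside every active range. The sketch below shows that filtering logic with hypothetical column names and data; it is not the DataMeadow implementation.

```python
# Sketch of dynamic-query filtering behind star-plot axes: each axis carries
# a [lo, hi] slider, and rows must pass every active range. Made-up data.

def apply_queries(rows, ranges):
    """Keep rows whose value lies within [lo, hi] for every slider-constrained column."""
    return [
        row for row in rows
        if all(lo <= row[col] <= hi for col, (lo, hi) in ranges.items())
    ]

cars = [
    {"mpg": 31, "horsepower": 65, "weight": 1990},
    {"mpg": 14, "horsepower": 220, "weight": 4354},
    {"mpg": 24, "horsepower": 95, "weight": 2372},
]

# Slider state: the user has narrowed two of the three axes.
sliders = {"mpg": (20, 35), "horsepower": (60, 100)}
print(apply_queries(cars, sliders))  # the two economical cars remain
```

In an interactive canvas this filter re-runs on every slider drag, which is why such systems emphasize fast incremental updates.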


2015 ◽  
Author(s):  
Nowlan H. Freese ◽  
David C. Norris ◽  
Ann E. Loraine

Motivation: Genome browsers that support fast navigation and interactive visual analytics can help scientists achieve deeper insight into large-scale genomic data sets more quickly, thus accelerating the discovery process. Toward this end, we developed Integrated Genome Browser (IGB), a highly configurable, interactive and fast open source desktop genome browser. Results: Here we describe multiple updates to IGB, including all-new capability to display and interact with data from high-throughput sequencing experiments. To demonstrate, we describe example visualizations and analyses of data sets from RNA-Seq, ChIP-Seq, and bisulfite sequencing experiments. Understanding results from genome-scale experiments requires viewing the data in the context of reference genome annotations and other related data sets. To facilitate this, we enhanced IGB's ability to consume data from diverse sources, including Galaxy, Distributed Annotation System (DAS), and IGB-specific Quickload servers. To support future visualization needs as new genome-scale assays enter wide use, we transformed the IGB codebase into a modular, extensible platform for developers to create and deploy all-new visualizations of genomic data. Availability: IGB is open source and is freely available from http://bioviz.org/igb.


2015 ◽  
Author(s):  
Paul D Blischak ◽  
Laura S Kubatko ◽  
Andrea D Wolfe

Despite the increasing opportunity to collect large-scale data sets for population genomic analyses, the use of high-throughput sequencing to study populations of polyploids has seen little application. This is due in large part to problems associated with determining allele copy number in the genotypes of polyploid individuals (allelic dosage uncertainty, ADU), which complicates the calculation of important quantities such as allele frequencies. Here we describe a statistical model to estimate biallelic SNP frequencies in a population of autopolyploids using high-throughput sequencing data in the form of read counts. We bridge the gap from data collection (using restriction-enzyme-based techniques [e.g., GBS, RADseq]) to allele frequency estimation in a unified inferential framework, using a hierarchical Bayesian model to sum over genotype uncertainty. Simulated data sets were generated under various conditions for tetraploid, hexaploid, and octoploid populations to evaluate the model's performance and to help guide the collection of empirical data. We also provide an implementation of our model in the R package POLYFREQS and demonstrate its use with two example analyses that investigate (i) levels of expected and observed heterozygosity and (ii) model adequacy. Our simulations show that the number of individuals sampled from a population has a greater impact on estimation error than sequencing coverage. The example analyses also show that our model and software can be used to make inferences beyond the estimation of allele frequencies for autopolyploids by providing assessments of model adequacy and estimates of heterozygosity.
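The allelic dosage uncertainty described above arises because, for an autopolyploid with ploidy k and reference-allele dosage g, each read carries the reference allele with probability g/k, making read counts binomial. The sketch below shows that read-count likelihood only; it is not the POLYFREQS hierarchical model (which additionally sums over genotype uncertainty across a population), and the function name is illustrative.

```python
# Sketch: binomial read-count likelihood for autopolyploid allele dosage.
# Toy illustration of why dosage is ambiguous from counts alone.
from math import comb

def read_likelihood(ref_reads, coverage, dosage, ploidy):
    """P(ref_reads of coverage | genotype with `dosage` reference alleles)."""
    p = dosage / ploidy
    return comb(coverage, ref_reads) * p**ref_reads * (1 - p)**(coverage - ref_reads)

# Tetraploid (k = 4), 10 reads at a site, 5 carrying the reference allele:
# which dosage 0..4 best explains the data?
likelihoods = {g: read_likelihood(5, 10, g, 4) for g in range(5)}
best = max(likelihoods, key=likelihoods.get)
print(best)  # 2
```

Note how close the likelihoods for dosages 1, 2, and 3 can be at low coverage; this overlap is exactly the ADU that motivates summing over genotypes in a Bayesian model rather than fixing a single call.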

