FOCUS2: agile and sensitive classification of metagenomics data using a reduced database

Mapping Intimacies ◽

10.1101/046425 ◽

2016 ◽

Cited By ~ 2

Author(s):

Genivaldo Gueiros Z. Silva ◽

Bas E. Dutilh ◽

Robert A. Edwards

Keyword(s):

Microbial Community ◽

Dna Sequences ◽

Computational Method ◽

Environmental Research ◽

Supplementary Information ◽

Sequence Classification ◽

Computationally Efficient ◽

Link Type ◽

Metagenomics Data

Abstract Summary Metagenomics approaches rely on identifying the presence of organisms in the microbial community from a set of unknown DNA sequences. Sequence classification has valuable applications in multiple important areas of medical and environmental research. Here we introduce FOCUS2, an update of the previously published computational method FOCUS. FOCUS2 was tested with 10 simulated and 543 real metagenomes, demonstrating that the program is more sensitive, faster, and more computationally efficient than existing methods. Availability The Python implementation is freely available at https://edwards.sdsu.edu/FOCUS2. Supplementary information Supplementary information is available at Bioinformatics online.

Higher-order Markov models for metagenomic sequence classification

Bioinformatics ◽

10.1093/bioinformatics/btaa562 ◽

2020 ◽

Vol 36 (14) ◽

pp. 4130-4136

Author(s):

David J Burks ◽

Rajeev K Azad

Keyword(s):

Dna Sequences ◽

Markov Models ◽

Fragment Size ◽

Higher Order ◽

Training Data ◽

Supplementary Information ◽

Local Alignment ◽

Metagenomic Sequence ◽

Higher Order Models

Abstract Motivation Alignment-free, stochastic models derived from k-mer distributions representing reference genome sequences have a rich history in the classification of DNA sequences. In particular, the variants of Markov models have previously been used extensively. Higher-order Markov models have been used with caution, perhaps sparingly, primarily because of the lack of enough training data and computational power. Advances in sequencing technology and computation have enabled exploitation of the predictive power of higher-order models. We, therefore, revisited higher-order Markov models and assessed their performance in classifying metagenomic sequences. Results Comparative assessment of higher-order models (HOMs, 9th order or higher) with interpolated Markov model, interpolated context model and lower-order models (8th order or lower) was performed on metagenomic datasets constructed using sequenced prokaryotic genomes. Our results show that HOMs outperform other models in classifying metagenomic fragments as short as 100 nt at all taxonomic ranks, and at lower ranks when the fragment size was increased to 250 nt. HOMs were also found to be significantly more accurate than local alignment, which is widely relied upon for taxonomic classification of metagenomic sequences. A novel software implementation written in C++ performs classification faster than the existing Markovian metagenomic classifiers and can therefore be used as a standalone classifier or in conjunction with existing taxonomic classifiers for more robust classification of metagenomic sequences. Availability and implementation The software has been made available at https://github.com/djburks/SMM. Supplementary information Supplementary data are available at Bioinformatics online.
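
The idea of classifying a fragment with fixed-order Markov models can be sketched as follows. This is a minimal illustration, not the authors' C++ implementation: the function names, the toy "genomes", the add-one smoothing, and the very low order (2 rather than 9 or higher) are all assumptions chosen to keep the example small. Each reference genome gets a table of context-to-next-base counts, and a fragment is assigned to the genome whose model gives it the highest log-likelihood.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"  # assumption: input contains only these four bases


def train_markov(sequence, order):
    """Count (context -> next base) transitions for a fixed-order Markov model."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(order, len(sequence)):
        context = sequence[i - order:i]
        counts[context][sequence[i]] += 1
    return counts


def log_likelihood(fragment, counts, order, pseudocount=1.0):
    """Log-probability of the fragment's transitions, with additive smoothing."""
    total = 0.0
    for i in range(order, len(fragment)):
        seen = counts.get(fragment[i - order:i], {})
        numerator = seen.get(fragment[i], 0) + pseudocount
        denominator = sum(seen.values()) + pseudocount * len(ALPHABET)
        total += math.log(numerator / denominator)
    return total


def classify(fragment, models, order):
    """Return the name of the model that scores the fragment highest."""
    return max(models, key=lambda name: log_likelihood(fragment, models[name], order))


# Hypothetical toy references; real models would be trained on whole genomes.
ORDER = 2
genomes = {"genome_a": "ACGT" * 50, "genome_b": "AATT" * 50}
models = {name: train_markov(seq, ORDER) for name, seq in genomes.items()}
print(classify("ACGTACGTACGT", models, ORDER))
```

The number of possible contexts grows as 4^order, which is why higher orders need far more training data and memory, the limitation the abstract refers to.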

CRAFT: Compact genome Representation towards large-scale Alignment-Free daTabase

10.1101/2020.07.10.196741 ◽

2020 ◽

Author(s):

Yang Young Lu ◽

Jiaxing Bai ◽

Yiwen Wang ◽

Ying Wang ◽

Fengzhu Sun

Keyword(s):

Dna Sequences ◽

Sequence Comparison ◽

Large Scale ◽

High Throughput Sequencing ◽

Sequence Data ◽

Practical Interest ◽

Supplementary Information ◽

Computationally Efficient ◽

Sequencing Technologies ◽

Alignment Free

Abstract Motivation Rapid developments in sequencing technologies have boosted generating high volumes of sequence data. To archive and analyze those data, one primary step is sequence comparison. Alignment-free sequence comparison based on k-mer frequencies offers a computationally efficient solution, yet in practice, the k-mer frequency vectors for large k of practical interest lead to excessive memory and storage consumption. Results We report CRAFT, a general genomic/metagenomic search engine to learn compact representations of sequences and perform fast comparison between DNA sequences. Specifically, given genome or high throughput sequencing (HTS) data as input, CRAFT maps the data into a much smaller embedding space and locates the best matching genome in the archived massive sequence repositories. With 10²–10⁴-fold reduction of storage space, CRAFT performs fast query for gigabytes of data within seconds or minutes, achieving comparable performance as six state-of-the-art alignment-free measures. Availability CRAFT offers a user-friendly graphical user interface with one-click installation on Windows and Linux operating systems, freely available at https://github.com/jiaxingbai/CRAFT. Supplementary information Supplementary data are available at Bioinformatics online.
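
To make the memory argument concrete, the sketch below shows plain k-mer frequency comparison, the baseline that the abstract describes as expensive for large k. It is not CRAFT's learned embedding; the function names and the choice of cosine similarity are assumptions made for illustration. Frequencies are stored sparsely in a dictionary, and `dense_vector_size` shows how quickly a full vector over all possible k-mers grows.

```python
import math
from collections import Counter


def kmer_frequencies(sequence, k):
    """Relative frequency of every k-mer that occurs in the sequence."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def cosine_similarity(freq_a, freq_b):
    """Cosine similarity of two sparse frequency vectors (0.0 if either is empty)."""
    dot = sum(v * freq_b.get(kmer, 0.0) for kmer, v in freq_a.items())
    norm_a = math.sqrt(sum(v * v for v in freq_a.values()))
    norm_b = math.sqrt(sum(v * v for v in freq_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def dense_vector_size(k, alphabet_size=4):
    """Number of entries in a dense frequency vector over all possible k-mers."""
    return alphabet_size ** k


query = kmer_frequencies("ACGTACGTAC", 3)
reference = kmer_frequencies("ACGTACGTTT", 3)
print(cosine_similarity(query, reference))
print(dense_vector_size(12))
```

A dense vector for k = 12 already has 4^12 entries per genome, which illustrates why a compact representation is attractive when archiving many genomes.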

CellProfiler Analyst: interactive data exploration, analysis, and classification of large biological image sets

10.1101/057976 ◽

2016 ◽

Cited By ~ 1

Author(s):

D. Dao ◽

A. N. Fraser ◽

J. Hung ◽

V. Ljosa ◽

S. Singh ◽

...

Keyword(s):

Supervised Machine Learning ◽

Supplementary Information ◽

Link Type ◽

Learning Capabilities ◽

Visualization Tools ◽

Interactive Data ◽

Supplementary Text ◽

Microsoft Windows ◽

Benchmarking Performance

Abstract Summary CellProfiler Analyst allows the exploration and visualization of image-based data, together with the classification of complex biological phenotypes, via an interactive user interface designed for biologists and data scientists. CellProfiler Analyst 2.0, completely rewritten in Python, builds on these features and adds enhanced supervised machine learning capabilities (in Classifier), as well as visualization tools to overview an experiment (Plate Viewer and Image Gallery). Availability CellProfiler Analyst 2.0 is free and open source, available at http://www.cellprofiler.org/releases and from GitHub (https://github.com/CellProfiler/CellProfiler-Analyst) under the BSD license. It is available as a packaged application for Mac OS X and Microsoft Windows and can be compiled for Linux. We implemented an automatic build process that supports nightly updates and regular release cycles for the [email protected] Supplementary information Supplementary Text 1: Manual to CellProfiler Analyst; updated versions are available at CellProfiler.org/CPA. Supplementary Data 1: Benchmarking performance of classifiers in CPA 2.0 versus CPA 1.0.

DTI-Voodoo: machine learning over interaction networks and ontology-based background knowledge predicts drug–target interactions

10.1101/2021.04.28.441733 ◽

2021 ◽

Author(s):

Tilman Hinnerichs ◽

Robert Hoehndorf

Keyword(s):

Drug Target ◽

Drug Targets ◽

Interaction Network ◽

Drug Repurposing ◽

Computational Method ◽

Interaction Networks ◽

Supplementary Information ◽

Prediction Methods ◽

Link Type ◽

Molecular Features

Abstract Motivation In silico drug–target interaction (DTI) prediction is important for drug discovery and drug repurposing. Approaches to predict DTIs can proceed indirectly, top-down, using phenotypic effects of drugs to identify potential drug targets, or they can be direct, bottom-up and use molecular information to directly predict binding potentials. Both approaches can be combined with information about interaction networks. Results We developed DTI-Voodoo as a computational method that combines molecular features and ontology-encoded phenotypic effects of drugs with protein–protein interaction networks, and uses a graph convolutional neural network to predict DTIs. We demonstrate that drug effect features can exploit information in the interaction network whereas molecular features do not. DTI-Voodoo is designed to predict candidate drugs for a given protein; we use this formulation to show that common DTI datasets contain intrinsic biases with major effects on performance evaluation and comparison of DTI prediction methods. Using a modified evaluation scheme, we demonstrate that DTI-Voodoo improves significantly over state-of-the-art DTI prediction methods. Availability DTI-Voodoo source code and data necessary to reproduce results are freely available at https://github.com/THinnerichs/DTI-VOODOO. Supplementary information Supplementary data are available at https://github.com/THinnerichs/DTI-VOODOO.

Gkmexplain: Fast and Accurate Interpretation of Nonlinear Gapped k-mer SVMs Using Integrated Gradients

10.1101/457606 ◽

2018 ◽

Cited By ~ 1

Author(s):

Avanti Shrikumar ◽

Eva Prakash ◽

Anshul Kundaje

Keyword(s):

Dna Sequences ◽

Motif Discovery ◽

Chromatin Accessibility ◽

Support Vector ◽

Computationally Efficient ◽

Link Type ◽

Novel Approach ◽

Mutation Impact ◽

Regulatory Dna

Abstract Support Vector Machines with gapped k-mer kernels (gkm-SVMs) have been used to learn predictive models of regulatory DNA sequence. However, interpreting predictive sequence patterns learned by gkm-SVMs can be challenging. Existing interpretation methods such as deltaSVM, in-silico mutagenesis (ISM), or SHAP either do not scale well or make limiting assumptions about the model that can produce misleading results when the gkm kernel is combined with nonlinear kernels. Here, we propose gkmexplain: a novel approach inspired by the method of Integrated Gradients for interpreting gkm-SVM models. Using simulated regulatory DNA sequences, we show that gkmexplain identifies predictive patterns with high accuracy while avoiding pitfalls of deltaSVM and ISM and being orders of magnitude more computationally efficient than SHAP. We use a novel motif discovery method called TF-MoDISco to recover consolidated TF motifs from gkm-SVM models of in vivo TF binding by aggregating predictive patterns identified by gkmexplain. Finally, we find that mutation impact scores derived through gkmexplain using gkm-SVM models of chromatin accessibility in lymphoblastoid cell-lines consistently outperform deltaSVM and ISM at identifying regulatory genetic variants (dsQTLs). Code and example notebooks replicating the workflow are available at https://github.com/kundajelab/gkmexplain. Explanatory videos available at http://bit.ly/gkmexplainvids.
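
For readers unfamiliar with Integrated Gradients, the method that the abstract cites as inspiration, a generic sketch is given below. It is not gkmexplain and does not involve gkm-SVMs: the toy function, the all-zero baseline, the finite-difference gradient, and the midpoint Riemann sum are all assumptions made to keep the example dependency-free. Each feature's attribution is its distance from the baseline multiplied by the average gradient along the straight path from baseline to input.

```python
def numerical_gradient(f, point, eps=1e-5):
    """Central finite-difference estimate of the gradient of f at point."""
    grad = []
    for i in range(len(point)):
        up = list(point)
        down = list(point)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


def integrated_gradients(f, x, baseline, steps=50):
    """Approximate Integrated Gradients with a midpoint Riemann sum."""
    avg_grad = [0.0] * len(x)
    for s in range(steps):
        alpha = (s + 0.5) / steps
        # Point on the straight line from the baseline to the input.
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(numerical_gradient(f, point)):
            avg_grad[i] += g / steps
    return [(xi - b) * g for xi, b, g in zip(x, baseline, avg_grad)]


def toy_model(v):
    """Hypothetical nonlinear scoring function standing in for a trained model."""
    return v[0] * v[1] + v[2] ** 2


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(toy_model, x, baseline)
print(attributions)
```

A useful sanity check is the completeness property: the attributions should sum to approximately the difference between the model output at the input and at the baseline.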

Taxonomic identification from metagenomic and metabarcoding data using any genetic marker

10.1101/253377 ◽

2018 ◽

Author(s):

Johan Bengtsson-Palme ◽

Rodney T. Richardson ◽

Marco Meola ◽

Christian Wurzbacher ◽

Émilie D. Tremblay ◽

...

Keyword(s):

Genetic Marker ◽

Dna Sequences ◽

Sequence Data ◽

Taxonomic Diversity ◽

Taxonomic Classification ◽

Taxonomic Identification ◽

Link Type

Correct taxonomic identification of DNA sequences is central to studies of biodiversity using both shotgun metagenomic and metabarcoding approaches. However, there is no genetic marker that gives sufficient performance across all the biological kingdoms, hampering studies of taxonomic diversity in many groups of organisms. We here present a major update to Metaxa2 (http://microbiology.se/software/metaxa2/) that enables the use of any genetic marker for taxonomic classification of metagenome and amplicon sequence data.

Pavian: Interactive analysis of metagenomics data for microbiomics and pathogen identification

10.1101/084715 ◽

2016 ◽

Cited By ~ 25

Author(s):

Florian P. Breitwieser ◽

Steven L. Salzberg

Keyword(s):

Web Application ◽

Disease Diagnosis ◽

Supplementary Information ◽

Special Focus ◽

Web Browser ◽

R Language ◽

Interactive Analysis ◽

Link Type ◽

Metagenomics Data ◽

Flow Diagrams

Abstract Summary Pavian is a web application for exploring metagenomics classification results, with a special focus on infectious disease diagnosis. Pinpointing pathogens in metagenomics classification results is often complicated by host and laboratory contaminants as well as many non-pathogenic microbiota. With Pavian, researchers can analyze, display and transform results from the Kraken and Centrifuge classifiers using interactive tables, heatmaps and flow diagrams. Pavian also provides an alignment viewer for validation of matches to a particular genome. Availability and implementation Pavian is implemented in the R language and based on the Shiny framework. It can be hosted on Windows, Mac OS X and Linux systems, and used with any contemporary web browser. It is freely available under a GPL-3 license from http://github.com/fbreitwieser/pavian. Furthermore, a Docker image is provided at https://hub.docker.com/r/florianbw/[email protected] Supplementary information Supplementary data is available at Bioinformatics online.

CRAFT: Compact genome Representation toward large-scale Alignment-Free daTabase

Bioinformatics ◽

10.1093/bioinformatics/btaa699 ◽

2020 ◽

Author(s):

Yang Young Lu ◽

Jiaxing Bai ◽

Yiwen Wang ◽

Ying Wang ◽

Fengzhu Sun

Keyword(s):

Dna Sequences ◽

Sequence Comparison ◽

Large Scale ◽

High Throughput Sequencing ◽

Sequence Data ◽

Practical Interest ◽

Supplementary Information ◽

Sequencing Data ◽

Computationally Efficient ◽

Alignment Free

Abstract Motivation Rapid developments in sequencing technologies have boosted generating high volumes of sequence data. To archive and analyze those data, one primary step is sequence comparison. Alignment-free sequence comparison based on k-mer frequencies offers a computationally efficient solution, yet in practice, the k-mer frequency vectors for large k of practical interest lead to excessive memory and storage consumption. Results We report CRAFT, a general genomic/metagenomic search engine to learn compact representations of sequences and perform fast comparison between DNA sequences. Specifically, given genome or high throughput sequencing data as input, CRAFT maps the data into a much smaller embedding space and locates the best matching genome in the archived massive sequence repositories. With 10²–10⁴-fold reduction of storage space, CRAFT performs fast query for gigabytes of data within seconds or minutes, achieving comparable performance as six state-of-the-art alignment-free measures. Availability and implementation CRAFT offers a user-friendly graphical user interface with one-click installation on Windows and Linux operating systems, freely available at https://github.com/jiaxingbai/CRAFT. Supplementary information Supplementary data are available at Bioinformatics online.

A deep learning approach to pattern recognition for short DNA sequences

10.1101/353474 ◽

2018 ◽

Cited By ~ 10

Author(s):

Akosua Busia ◽

George E. Dahl ◽

Clara Fannjiang ◽

David H. Alexander ◽

Elizabeth Dorfman ◽

...

Keyword(s):

Deep Learning ◽

Dna Sequences ◽

Distinct Species ◽

Training Data ◽

Supplementary Information ◽

Learning Approach ◽

Biological Sequences ◽

Public Research ◽

Species Classification ◽

Link Type

Abstract Motivation Inferring properties of biological sequences--such as determining the species-of-origin of a DNA sequence or the function of an amino-acid sequence--is a core task in many bioinformatics applications. These tasks are often solved using string-matching to map query sequences to labeled database sequences or via Hidden Markov Model-like pattern matching. In the current work we describe and assess a deep learning approach which trains a deep neural network (DNN) to predict database-derived labels directly from query sequences. Results We demonstrate this DNN performs at state-of-the-art or above levels on a difficult, practically important problem: predicting species-of-origin from short reads of 16S ribosomal DNA. When trained on 16S sequences of over 13,000 distinct species, our DNN achieves read-level species classification accuracy within 2.0% of perfect memorization of training data, and produces more accurate genus-level assignments for reads from held-out species than k-mer, alignment, and taxonomic binning baselines. Moreover, our models exhibit greater robustness than these existing approaches to increasing noise in the query sequences. Finally, we show that these DNNs perform well on an experimental 16S mock community dataset. Overall, our results constitute a first step towards our long-term goal of developing a general-purpose deep learning approach to predicting meaningful labels from short biological sequences. Availability TensorFlow training code is available through GitHub (https://github.com/tensorflow/models/tree/master/research). Data in TensorFlow TFRecord format is available on Google Cloud Storage (gs://brain-genomics-public/research/seq2species/). Supplementary information Supplementary data are available in a separate document.

Detection and assembly of novel sequence insertions using Linked-Read technology

10.1101/551028 ◽

2019 ◽

Cited By ~ 3

Author(s):

Dmitry Meleshko ◽

Patrick Marks ◽

Stephen Williams ◽

Iman Hajirasouliha

Keyword(s):

Dna Sequences ◽

De Novo Assembly ◽

De Novo ◽

Supplementary Information ◽

Computational Techniques ◽

Whole Genome ◽

Structural Variations ◽

Short Read ◽

Link Type ◽

Long Read

Abstract Motivation Emerging Linked-Read (aka read-cloud) technologies such as the 10x Genomics Chromium system have great potential for accurate detection and phasing of large-scale human genome structural variations (SVs). By leveraging the long-range information encoded in Linked-Read sequencing, computational techniques are able to detect and characterize complex structural variations that were previously undetectable by short-read methods. However, there is no available Linked-Read method for detection and assembly of novel sequence insertions, DNA sequences present in a given sequenced sample but missing in the reference genome, without requiring whole genome de novo assembly. In this paper, we propose a novel integrated alignment-based and local-assembly-based algorithm, Novel-X, that effectively uses the barcode information encoded in Linked-Read sequencing datasets to improve detection of such events without the need for whole genome de novo assembly. We evaluated our method on two haploid human genomes, CHM1 and CHM13, sequenced on the 10x Genomics Chromium system. These genomes have also recently been characterized with high-coverage PacBio long reads. We also tested our method on NA12878, the well-known HapMap CEPH diploid genome, and the child genome in a Yoruba trio (NA19240), which was recently studied on multiple sequencing platforms. Detecting insertion events is very challenging using short reads, and the only viable solution available is long-read sequencing (e.g. PacBio or ONT). Our experiments, however, show that Novel-X finds many insertions that cannot be found by state-of-the-art tools using short-read sequencing data but are present in PacBio data. Since Linked-Read sequencing is significantly cheaper than long-read sequencing, our method using Linked-Reads enables routine large-scale screenings of sequenced genomes for novel sequence insertions. Availability Software is freely available at https://github.com/1dayac/[email protected] Supplementary information Supplementary data are available at https://github.com/1dayac/novel_insertions_supplementary
