Gist – an ensemble approach to the taxonomic classification of metatranscriptomic sequence data

Mapping Intimacies ◽

10.1101/081026 ◽

2016 ◽

Author(s):

Samantha Halliday ◽

John Parkinson

Keyword(s):

Species Interactions ◽

Sequence Data ◽

Rna Seq ◽

Sequencing Data ◽

Community Profiling ◽

Search Tool ◽

Functional Identities ◽

Taxonomic Assignments ◽

Generation Sequencing

ABSTRACT The study of whole microbial communities through RNA-seq, or metatranscriptomics, offers a unique view of the relative levels of activity for different genes across a large number of species simultaneously. To make sense of these sequencing data, it is necessary to be able to assign both taxonomic and functional identities to each sequenced read. High-quality identifications are important not only for community profiling, but to also ensure that functional assignments of sequence reads are correctly attributed to their source taxa. Such assignments allow biochemical pathways to be appropriately allocated to discrete species, enabling the capture of cross-species interactions. Typically read annotation is performed by a single alignment-based search tool such as BLAST. However, due to the vast extent of bacterial diversity, these approaches tend to be highly error prone, particularly for taxonomic assignments. Here we introduce a novel program for generating taxonomic assignments, called Gist, which integrates information from a number of machine learning methods and the Burrows-Wheeler Aligner. Uniquely Gist establishes the most appropriate weightings of methods for individual genomes, facilitating high classification accuracy on next-generation sequencing reads. We validate our approach using a synthetic metatranscriptome generator based on Flux Simulator, termed Genepuddle. Further, unlike previous taxonomic classifiers, we demonstrate the capacity of composition-based techniques to accurately inform on taxonomic origin without resorting to longer scanning windows that mimic alignment-based methods. Gist is made freely available under the terms of the GNU General Public License at compsysbio.org/gist.
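The idea of weighting classification methods separately for each genome can be illustrated with a minimal sketch. This is not Gist's implementation: the method names (`composition_model`, `bwa`), the genome labels, the scores, and the weights are all hypothetical, and a plain weighted sum is assumed as the combination rule.

```python
from typing import Dict

# Hypothetical inputs for ONE read:
#   scores[method][genome]  -> how strongly a method supports a genome
#   weights[genome][method] -> how much a method is trusted for that genome
Scores = Dict[str, Dict[str, float]]
Weights = Dict[str, Dict[str, float]]


def classify_read(scores: Scores, weights: Weights) -> str:
    """Return the genome with the highest per-genome weighted sum of scores."""
    combined = {}
    for genome, method_weights in weights.items():
        combined[genome] = sum(
            weight * scores.get(method, {}).get(genome, 0.0)
            for method, weight in method_weights.items()
        )
    return max(combined, key=combined.get)


# Illustrative numbers only (assumptions, not values from the paper).
scores = {
    "composition_model": {"genome_A": 0.6, "genome_B": 0.4},
    "bwa": {"genome_A": 0.2, "genome_B": 0.8},
}
weights = {
    "genome_A": {"composition_model": 0.7, "bwa": 0.3},
    "genome_B": {"composition_model": 0.2, "bwa": 0.8},
}

best = classify_read(scores, weights)
print(best)
```

With these numbers the composition model favours `genome_A`, but because `genome_B` places most of its weight on the aligner, the combined score selects `genome_B`. A single global weighting could not express that per-genome difference in trust.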


Computational classification of microRNAs in next-generation sequencing data

Theoretical Chemistry Accounts ◽

10.1007/s00214-009-0684-z ◽

2009 ◽

Vol 125 (3-6) ◽

pp. 637-642

Author(s):

Joshua Riback ◽

Artemis G. Hatzigeorgiou ◽

Martin Reczko

Keyword(s):

Next Generation Sequencing ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Sequencing Data ◽

Generation Sequencing


An iterative and automated computational pipeline for untargeted strain-level identification using MS/MS spectra from pathogenic samples

10.1101/812313 ◽

2019 ◽

Author(s):

Mathias Kuhring ◽

Joerg Doellinger ◽

Andreas Nitsche ◽

Thilo Muth ◽

Bernhard Y. Renard

Keyword(s):

Statistical Power ◽

Sequence Data ◽

A Priori ◽

Search Space ◽

Strain Level ◽

Reference Sequence ◽

Viral Origin ◽

Identification Of Species ◽

Taxonomic Assignments

Abstract Untargeted accurate strain-level classification of a priori unidentified organisms using tandem mass spectrometry is a challenging task. Reference databases often lack taxonomic depth, limiting peptide assignments to the species level. However, the extension with detailed strain information increases runtime and decreases statistical power. In addition, larger databases contain a higher number of similar proteomes.

We present TaxIt, an iterative workflow to address the increasing search space required for MS/MS-based strain-level classification of samples with unknown taxonomic origin. TaxIt first applies reference sequence data for initial identification of species candidates, followed by automated acquisition of relevant strain sequences for low level classification. Furthermore, proteome similarities resulting in ambiguous taxonomic assignments are addressed with an abundance weighting strategy to improve candidate confidence.

We apply our iterative workflow on several samples of bacterial and viral origin. In comparison to non-iterative approaches using unique peptides or advanced abundance correction, TaxIt identifies microbial strains correctly in all examples presented (with one tie), thereby demonstrating the potential for untargeted and deeper taxonomic classification. TaxIt makes extensive use of public, unrestricted and continuously growing sequence resources such as the NCBI databases and is available under open-source license at https://gitlab.com/rki_bioinformatics.
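The abstract does not spell out how the abundance weighting works, so the following is only one plausible reading, not TaxIt's actual algorithm. The sketch assumes that peptides matching several candidate strains are split among those strains in proportion to each strain's count of unique peptides. The function name `weighted_counts` and the input layout are hypothetical.

```python
from typing import Dict, List, Set


def weighted_counts(peptide_hits: List[Set[str]]) -> Dict[str, float]:
    """Distribute peptide evidence across candidate strains.

    peptide_hits holds, for each identified peptide, the set of candidate
    strains whose proteome contains it (an assumed input layout).
    """
    # Pass 1: peptides matching exactly one strain count fully for it.
    unique: Dict[str, float] = {}
    for hits in peptide_hits:
        if len(hits) == 1:
            strain = next(iter(hits))
            unique[strain] = unique.get(strain, 0.0) + 1.0

    # Pass 2: shared peptides are split in proportion to unique evidence.
    totals = dict(unique)
    for hits in peptide_hits:
        if len(hits) > 1:
            denom = sum(unique.get(s, 0.0) for s in hits)
            for s in hits:
                if denom > 0:
                    share = unique.get(s, 0.0) / denom
                else:
                    # No unique evidence for any candidate: split evenly.
                    share = 1.0 / len(hits)
                totals[s] = totals.get(s, 0.0) + share
    return totals


# Illustrative data: two peptides unique to strain A, one unique to B,
# and two peptides shared by both.
hits = [{"A"}, {"A"}, {"B"}, {"A", "B"}, {"A", "B"}]
totals = weighted_counts(hits)
print(totals)
```

Here strain A receives two thirds of each shared peptide because it holds two of the three unique peptides, so the shared evidence reinforces the better-supported candidate and the total across strains still equals the number of peptides.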


Rapid Mycobacterium tuberculosis spoligotyping from uncorrected long reads using Galru

10.1101/2020.05.31.126490 ◽

2020 ◽

Author(s):

Andrew J. Page ◽

Nabil-Fareed Alikhan ◽

Michael Strinden ◽

Thanh Le Viet ◽

Timofey Skvortsov

Keyword(s):

Mycobacterium Tuberculosis ◽

State Of The Art ◽

Sequence Data ◽

Human Pathogen ◽

Sequencing Data ◽

Short Read ◽

Short Read Sequencing ◽

Long Reads ◽

Long Read

Abstract Spoligotyping of Mycobacterium tuberculosis provides a subspecies classification of this major human pathogen. Spoligotypes can be predicted from short read genome sequencing data; however, no methods exist for long read sequence data such as from Nanopore or PacBio. We present a novel software package Galru, which can rapidly detect the spoligotype of a Mycobacterium tuberculosis sample from as little as a single uncorrected long read. It allows for near real-time spoligotyping from long read data as it is being sequenced, giving rapid sample typing. We compare it to the existing state of the art software and find it performs identically to the results obtained from short read sequencing data. Galru is freely available from https://github.com/quadram-institute-bioscience/galru under the GPLv3 open source licence.


RAPSearch2: a fast and memory-efficient protein similarity search tool for next-generation sequencing data

Bioinformatics ◽

10.1093/bioinformatics/btr595 ◽

2011 ◽

Vol 28 (1) ◽

pp. 125-126 ◽

Cited By ~ 258

Author(s):

Y. Zhao ◽

H. Tang ◽

Y. Ye

Keyword(s):

Next Generation Sequencing ◽

Similarity Search ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Sequencing Data ◽

Search Tool ◽

Generation Sequencing ◽

Memory Efficient


Methods for analyzing next-generation sequencing data III. From setting a Linux environment to manipulating Lactobacillus RNA-seq data

Japanese Journal of Lactic Acid Bacteria ◽

10.4109/jslab.26.32 ◽

2015 ◽

Vol 26 (1) ◽

pp. 32-41

Author(s):

Jianqiang Sun ◽

Aya Miura ◽

Kentaro Shimizu ◽

Koji Kadota

Keyword(s):

Next Generation Sequencing ◽

Next Generation Sequencing Data ◽

Rna Seq ◽

Next Generation ◽

Sequencing Data ◽

Generation Sequencing


A Framework for the RNA-Seq Based Classification and Prediction of Disease

10.20944/preprints201901.0068.v1 ◽

2019 ◽

Author(s):

Naiyar Iqbal ◽

Pradeep Kumar

Keyword(s):

Microarray Gene Expression Data ◽

Biological Data ◽

Disease Classification ◽

Rna Seq ◽

Microarray Gene Expression ◽

Early Detection Of Disease ◽

Medical Practitioners ◽

Next Generation Sequencing Ngs ◽

Generation Sequencing

Disease classification based on biological data is an important area in bioinformatics and biomedical research. It helps the doctors and medical practitioners for the early detection of disease and support them as a computer-aided diagnostic tool for accurate diagnosis, prognosis, and treatment of disease. Earlier Microarray gene expression data have wide application for the classification of disease, but now Next-generation sequencing (NGS) has replaced the Microarray technology. From the last few years, RNA sequence (RNA-Seq) data are widely used for the transcriptomic analysis. Hence, RNA-Seq based classification of disease is in its infancy. In this article, we present a general framework for the classification of disease constructed on RNA-Seq data. This framework will guide the researchers to process RNA-Seq, extract relevant features and apply the appropriate classifier to classify any kind of disease.


Exogene: A performant workflow for detecting viral integrations from paired-end next-generation sequencing data

10.1101/2021.04.19.440427 ◽

2021 ◽

Author(s):

Jean-Pierre Kocher ◽

Zachary Stephens ◽

Daniel O'Brien ◽

Mrunal Dehankar ◽

Lewis Roberts ◽

...

Keyword(s):

Next Generation Sequencing ◽

Sequence Data ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Sequencing Data ◽

Long Read ◽

Breakpoint Detection ◽

Targeted Capture ◽

Genome Heterogeneity ◽

Generation Sequencing

The integration of viruses into the human genome is known to be associated with tumorigenesis in many cancers, but the accurate detection of integration breakpoints from short read sequencing data is made difficult by human-viral homologies, viral genome heterogeneity, coverage limitations, and other factors. To address this, we present Exogene, a sensitive and efficient workflow for detecting viral integrations from paired-end next generation sequencing data. Exogene's read filtering and breakpoint detection strategies yield integration coordinates that are highly concordant with those found in long read validation sets. We demonstrate this concordance across 6 TCGA Hepatocellular carcinoma (HCC) tumor samples, identifying integrations of hepatitis B virus that are validated by long reads. Additionally, we applied Exogene to targeted capture data from 426 previously studied HCC samples, achieving 98.9% concordance with existing methods and identifying 238 high-confidence integrations that were not previously reported. Exogene is applicable to multiple types of paired-end sequence data, including genome, exome, RNA-Seq or targeted capture.


Exome-Wide Analysis of the DiscovEHR Cohort Reveals Novel Candidate Pharmacogenomic Variants for Clinical Pharmacogenomics

Genes ◽

10.3390/genes11050561 ◽

2020 ◽

Vol 11 (5) ◽

pp. 561

Author(s):

Maria-Theodora Pandi ◽

Marc S. Williams ◽

Peter van der Spek ◽

Maria Koromina ◽

George P. Patrinos

Keyword(s):

Genetic Variation ◽

Sequence Data ◽

Sequencing Data ◽

Sequencing Technology ◽

Next Generation Sequencing Technology ◽

Exome Sequencing Data ◽

Whole Exome ◽

Whole Exome Sequencing Data ◽

Generation Sequencing

Recent advances in next-generation sequencing technology have led to the production of an unprecedented volume of genomic data, thus further advancing our understanding of the role of genetic variation in clinical pharmacogenomics. In the present study, we used whole exome sequencing data from 50,726 participants, as derived from the DiscovEHR cohort, to identify pharmacogenomic variants of potential clinical relevance, according to their occurrence within the PharmGKB database. We further assessed the distribution of the identified rare and common pharmacogenomics variants amongst different GnomAD subpopulations. Overall, our findings show that the use of publicly available sequence data, such as the DiscovEHR dataset and GnomAD, provides an opportunity for a deeper understanding of genetic variation in pharmacogenes with direct implications in clinical pharmacogenomics.


Pyfastx: a robust Python package for fast random access to sequences from plain and gzipped FASTA/Q files

Briefings in Bioinformatics ◽

10.1093/bib/bbaa368 ◽

2020 ◽

Author(s):

Lianming Du ◽

Qin Liu ◽

Zhenxin Fan ◽

Jie Tang ◽

Xiuyue Zhang ◽

...

Keyword(s):

Sequence Data ◽

Random Access ◽

Biological Data ◽

Next Generation Sequencing Data ◽

Sequencing Data ◽

Limited Memory ◽

Data Formats ◽

Low Efficiency ◽

Python Package ◽

Generation Sequencing

Abstract FASTA and FASTQ are the most widely used biological data formats that have become the de facto standard to exchange sequence data between bioinformatics tools. With the avalanche of next-generation sequencing data, the amount of sequence data being deposited and accessed in FASTA/Q formats is increasing dramatically. However, the existing tools have very low efficiency at random retrieval of subsequences due to the requirement of loading the entire index into memory. In addition, most existing tools have no capability to build index for large FASTA/Q files because of the limited memory. Furthermore, the tools do not provide support to randomly accessing sequences from FASTA/Q files compressed by gzip, which is extensively adopted by most public databases to compress data for saving storage. In this study, we developed pyfastx as a versatile Python package with commonly used command-line tools to overcome the above limitations. Compared to other tools, pyfastx yielded the highest performance in terms of building index and random access to sequences, particularly when dealing with large FASTA/Q files with hundreds of millions of sequences. A key advantage of pyfastx over other tools is that it offers an efficient way to randomly extract subsequences directly from gzip compressed FASTA/Q files without needing to uncompress beforehand. Pyfastx can easily be installed from PyPI (https://pypi.org/project/pyfastx) and the source code is freely available at https://github.com/lmdu/pyfastx.
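The principle behind indexed random access can be shown with a small standard-library sketch. It does not use pyfastx or reproduce its index format: `build_index` and `fetch` are hypothetical helpers, the file is assumed to be uncompressed and well formed, and every sequence line within a record except the last is assumed to have the same width. Random access into gzip-compressed files, which the abstract highlights, is not covered here.

```python
import os
import tempfile


def build_index(path):
    """Map each record name to
    [offset of first base, sequence length, bases per line, bytes per line]."""
    index = {}
    name = None
    offset = 0
    with open(path, "rb") as fh:
        for line in fh:
            if line.startswith(b">"):
                name = line[1:].split()[0].decode()
                index[name] = [offset + len(line), 0, 0, 0]
            elif name is not None:
                bases = len(line.rstrip(b"\r\n"))
                entry = index[name]
                if entry[2] == 0:
                    # First sequence line defines the line geometry.
                    entry[2], entry[3] = bases, len(line)
                entry[1] += bases
            offset += len(line)
    return index


def fetch(path, index, name, start, end):
    """Return bases [start, end), 0-based, by seeking instead of scanning."""
    seq_offset, length, per_line, line_bytes = index[name]
    end = min(end, length)
    if start >= end:
        return ""
    # Translate base coordinates into byte positions, skipping newlines.
    first = seq_offset + (start // per_line) * line_bytes + start % per_line
    last = seq_offset + ((end - 1) // per_line) * line_bytes + (end - 1) % per_line
    with open(path, "rb") as fh:
        fh.seek(first)
        chunk = fh.read(last - first + 1)
    return chunk.replace(b"\n", b"").replace(b"\r", b"").decode()


def demo():
    """Write a tiny FASTA file, index it, and fetch two subsequences."""
    data = b">seq1 demo\nACGTACGTAC\nGGGTTTAAAC\nCC\n>seq2\nTTTT\n"
    with tempfile.NamedTemporaryFile(suffix=".fa", delete=False) as tmp:
        tmp.write(data)
        path = tmp.name
    try:
        index = build_index(path)
        return fetch(path, index, "seq1", 8, 13), fetch(path, index, "seq2", 0, 4)
    finally:
        os.remove(path)


print(demo())
```

The index stores only four numbers per record, so a lookup costs one seek and one short read regardless of file size. A tool that must also hold large indexes on limited memory would typically keep this table on disk rather than in a Python dictionary.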
