DeepMicrobes: taxonomic classification for metagenomics with deep learning

Mapping Intimacies ◽

10.1101/694851 ◽

2019 ◽

Cited By ~ 1

Author(s):

Qiaoxing Liang ◽

Paul W. Bible ◽

Yu Liu ◽

Bin Zou ◽

Lai Wei

Keyword(s):

Deep Learning ◽

Large Scale ◽

Genomic Sequence ◽

Taxonomic Classification ◽

Sequencing Data ◽

Computational Framework ◽

Genome Wide ◽

Disease Diagnostics ◽

Genomic Sequence Analysis ◽

Microbial Genomic

Abstract: Taxonomic classification is a crucial step for metagenomics applications including disease diagnostics, microbiome analyses, and outbreak tracing. Yet it is unknown what deep learning architecture can capture microbial genome-wide features relevant to this task. We report DeepMicrobes (https://github.com/MicrobeLab/DeepMicrobes), a computational framework that can perform large-scale training on >10,000 RefSeq complete microbial genomes and accurately predict the species-of-origin of whole metagenome shotgun sequencing reads. We show the advantage of DeepMicrobes over state-of-the-art tools in precisely identifying species from microbial community sequencing data. Therefore, DeepMicrobes expands the toolbox of taxonomic classification for metagenomics and enables the development of further deep learning-based bioinformatics algorithms for microbial genomic sequence analysis.


DeepMicrobes: taxonomic classification for metagenomics with deep learning

NAR Genomics and Bioinformatics ◽

10.1093/nargab/lqaa009 ◽

2020 ◽

Vol 2 (1) ◽

Cited By ~ 7

Author(s):

Qiaoxing Liang ◽

Paul W Bible ◽

Yu Liu ◽

Bin Zou ◽

Lai Wei

Keyword(s):

Deep Learning ◽

Large Scale ◽

Taxonomic Classification ◽

Reference Database ◽

Computational Framework ◽

Bowel Diseases ◽

Comparable Accuracy ◽

Inflammatory Bowel ◽

Genome Assemblies ◽

Taxonomic Tree

Abstract: Large-scale metagenomic assemblies have uncovered thousands of new species, greatly expanding the known diversity of microbiomes in specific habitats. To investigate the roles of these uncultured species in human health or the environment, researchers need to incorporate their genome assemblies into a reference database for taxonomic classification. However, this procedure is hindered by the lack of a well-curated taxonomic tree for newly discovered species, which is required by current metagenomics tools. Here we report DeepMicrobes, a deep learning-based computational framework for taxonomic classification that allows researchers to bypass this limitation. We show the advantage of DeepMicrobes over state-of-the-art tools in species and genus identification, and comparable accuracy in abundance estimation. We trained DeepMicrobes on genomes reconstructed from gut microbiomes and discovered potential novel signatures in inflammatory bowel diseases. DeepMicrobes facilitates effective investigations into the uncharacterized roles of metagenomic species.


Predicting Chromosome Flexibility from the Genomic Sequence Based on Deep Learning Neural Networks

Current Bioinformatics ◽

10.2174/1574893616666210827095829 ◽

2021 ◽

Vol 16 ◽

Author(s):

Jinghao Peng ◽

Jiajie Peng ◽

Haiyin Piao ◽

Zhang Luo ◽

Kelin Xia ◽

...

Keyword(s):

Deep Learning ◽

High Performance ◽

Genomic Sequence ◽

Sequence Data ◽

Function Analysis ◽

Double Helix ◽

GM12878 Cell ◽

Genomic Sequence Analysis ◽

And Function ◽

Nuclear Processes

Background: The open and accessible regions of the chromosome are more likely to be bound by transcription factors, which are important for nuclear processes and biological functions. Studying changes in chromosome flexibility can help to discover and analyze disease markers and improve the efficiency of clinical diagnosis. Current methods for predicting chromosome flexibility from Hi-C data include the flexibility-rigidity index (FRI) and the Gaussian network model (GNM). However, these methods require chromosome structure data from 3D biological experiments, which are time-consuming and expensive to obtain. Objective: Generally, the folding and curling of the DNA double helix have a great impact on chromosome flexibility and function. Motivated by the success of genomic sequence analysis in biomolecular function analysis, we aim to predict chromosome flexibility from genomic sequence data alone. Method: We propose a new method (named "DeepCFP") that uses deep learning models to predict chromosome flexibility from genomic sequence features only. The model has been tested in the GM12878 cell line. Results: The maximum accuracy of our model reached 91%. The performance of DeepCFP is close to that of FRI and GNM. Conclusion: DeepCFP can achieve high performance based on genomic sequence alone.


Genome-Wide Identification and Characterization of Long Non-Coding RNAs in Peanut

Genes ◽

10.3390/genes10070536 ◽

2019 ◽

Vol 10 (7) ◽

pp. 536 ◽

Cited By ~ 2

Author(s):

Xiaobo Zhao ◽

Liming Gan ◽

Caixia Yan ◽

Chunjuan Li ◽

Quanxi Sun ◽

...

Keyword(s):

Large Scale ◽

Target Genes ◽

Sequencing Data ◽

Regulatory Processes ◽

Genome Wide ◽

Non Coding RNAs ◽

Identification And Characterization ◽

Lower Expression ◽

Weighted Correlation

Long non-coding RNAs (lncRNAs) are involved in various regulatory processes although they do not encode protein. Presently, there is little information regarding the identification of lncRNAs in peanut (Arachis hypogaea Linn.). In this study, 50,873 lncRNAs of peanut were identified from large-scale published RNA sequencing data that belonged to 124 samples involving 15 different tissues. The average lengths of lncRNA and mRNA were 4335 bp and 954 bp, respectively. Compared to the mRNAs, the lncRNAs were shorter, with fewer exons and lower expression levels. The 4713 co-expression lncRNAs (expressed in all samples) were used to construct co-expression networks by using the weighted correlation network analysis (WGCNA). LncRNAs correlating with the growth and development of different peanut tissues were obtained, and target genes for 386 hub lncRNAs of all lncRNAs co-expressions were predicted. Taken together, these findings can provide a comprehensive identification of lncRNAs in peanut.


SSCC: a novel computational framework for rapid and accurate clustering large single cell RNA-seq data

10.1101/344242 ◽

2018 ◽

Cited By ~ 2

Author(s):

Xianwen Ren ◽

Liangtao Zheng ◽

Zemin Zhang

Keyword(s):

Single Cell ◽

RNA Sequencing ◽

Large Scale ◽

Random Projection ◽

RNA-Seq ◽

Sequencing Data ◽

Computational Framework ◽

Human Blood Cells ◽

Single Cell RNA Sequencing ◽

Data Volume

Abstract: Clustering is a prevalent means of analyzing single-cell RNA sequencing data, but the rapidly expanding data volume can make this process computationally challenging. New methods for both accurate and efficient clustering are urgently needed. Here we propose a new clustering framework based on random projection and feature construction for large-scale single-cell RNA sequencing data, which greatly improves the clustering accuracy, robustness and computational efficiency of various state-of-the-art algorithms benchmarked on multiple real datasets. On a dataset with 68,578 human blood cells, our method achieved a 20% improvement in clustering accuracy and a 50-fold acceleration while consuming only 66% of the memory used by the widely used software package SC3. Compared to k-means, the accuracy improvement can reach 3-fold depending on the dataset. An R implementation of the framework is available from https://github.com/Japrin/sscClust.
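
The random projection idea named in this abstract can be illustrated with a minimal Python sketch. It is a generic Gaussian random projection, not the authors' R implementation; the function name `random_projection`, the `seed` parameter and the scaling choice are assumptions made for illustration only.

```python
import random


def random_projection(rows, out_dim, seed=0):
    """Project each row (a list of floats) down to out_dim dimensions.

    Uses a dense Gaussian random matrix scaled by 1/sqrt(out_dim), a common
    textbook choice; the SSCC paper may use a different scheme.
    """
    if not rows:
        return []
    rng = random.Random(seed)  # fixed seed so the projection is reproducible
    in_dim = len(rows[0])
    scale = 1.0 / out_dim ** 0.5
    # matrix[i][j] maps input feature i to output feature j
    matrix = [[rng.gauss(0.0, 1.0) * scale for _ in range(out_dim)]
              for _ in range(in_dim)]
    return [[sum(x * matrix[i][j] for i, x in enumerate(row))
             for j in range(out_dim)]
            for row in rows]
```

In such a scheme each cell's expression vector would be projected to a much smaller dimension before any clustering algorithm is run on it, which is where the time and memory savings would come from.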


Population-level genome-wide STR typing in Plasmodium species reveals higher resolution population structure and genetic diversity relative to SNP typing

10.1101/2021.05.19.444768 ◽

2021 ◽

Author(s):

Jiru Han ◽

Jacob E Munro ◽

Anthony Kocoski ◽

Alyssa E Barry ◽

Melanie Bahlo

Keyword(s):

Genetic Diversity ◽

Large Scale ◽

Tandem Repeats ◽

Plasmodium Species ◽

Whole Genome Sequence ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data ◽

Genome Wide ◽

Field Samples

Short tandem repeats (STRs) are highly informative genetic markers that have been used extensively in population genetics analysis. They are an important source of genetic diversity and can also have functional impact. Despite the availability of bioinformatic methods that permit large-scale genome-wide genotyping of STRs from whole genome sequencing data, they have not previously been applied to sequencing data from large collections of malaria parasite field samples. Here, we have genotyped STRs using HipSTR in published whole-genome sequence data from more than 3,000 Plasmodium falciparum and 174 Plasmodium vivax samples collected across the globe. High levels of noise and variability in the resultant callset necessitated the development of a novel method for quality control of STR genotype calls. A set of high-quality STR loci (6,768 from P. falciparum and 3,496 from P. vivax) was used to study Plasmodium genetic diversity, population structures and genomic signatures of selection, and these were compared to genome-wide single nucleotide polymorphism (SNP) genotyping data. In addition, the genome-wide information about genetic variation and other characteristics of STRs in P. falciparum and P. vivax has been made available in an interactive web-based R Shiny application, PlasmoSTR (https://github.com/bahlolab/PlasmoSTR).


CaSpER: Identification, visualization and integrative analysis of CNV events in multiscale resolution using single-cell or bulk RNA sequencing data

10.1101/426122 ◽

2018 ◽

Cited By ~ 1

Author(s):

Akdes Serin Harmancı ◽

Arif O. Harmanci ◽

Xiaobo Zhou

Keyword(s):

Single Cell ◽

RNA Sequencing ◽

Copy Number ◽

Large Scale ◽

Complete Characterization ◽

Integrative Analysis ◽

Sequencing Data ◽

Length Scales ◽

Expression Levels ◽

Genome Wide

Abstract: RNA sequencing experiments generate large amounts of information about expression levels of genes. Although they are mainly used for quantifying expression levels, they contain much more biologically important information, such as copy number variants (CNVs). Here, we propose CaSpER, a signal processing approach for identification, visualization, and integrative analysis of focal and large-scale CNV events in multiscale resolution using either bulk or single-cell RNA sequencing data. CaSpER smooths the genome-wide RNA sequencing signal profiles at multiple scales, identifying CNV events at different length scales. CaSpER also employs a novel methodology for generating a genome-wide B-allele frequency (BAF) signal profile from the reads and uses it in a multiscale fashion to correct CNV calls. The shift in allelic signal is used to quantify loss of heterozygosity (LOH), which is valuable for CNV identification. CaSpER uses Hidden Markov Models (HMMs) to assign copy number states to regions. The multiscale nature of CaSpER enables comprehensive analysis of focal and large-scale CNVs and LOH segments. CaSpER performs well in accuracy compared to gold standard SNP genotyping arrays. In particular, analysis of single-cell glioblastoma (GBM) RNA sequencing data with CaSpER reveals novel mutually exclusive and co-occurring CNV sub-clones at different length scales. Moreover, CaSpER discovers gene expression signatures of CNV sub-clones, performs gene ontology (GO) enrichment analysis and identifies potential therapeutic targets for the sub-clones. CaSpER increases the utility of RNA sequencing datasets and complements other tools for complete characterization and visualization of the genomic and transcriptomic landscape of single-cell and bulk RNA sequencing data, especially in cancer research.
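
The HMM step mentioned in this abstract (assigning copy number states to regions) can be sketched in Python with a small Viterbi decoder. This is a toy illustration, not CaSpER's model: the three states, the Gaussian emission means, `SIGMA` and `STAY` are all hypothetical values chosen for the example.

```python
import math

STATES = ("loss", "neutral", "gain")
MEANS = {"loss": -0.5, "neutral": 0.0, "gain": 0.5}  # hypothetical emission means
SIGMA = 0.2  # hypothetical shared standard deviation
STAY = 0.9   # hypothetical probability of remaining in the same state


def log_emission(x, state):
    """Log density of x under a Gaussian centred on the state's mean."""
    z = (x - MEANS[state]) / SIGMA
    return -0.5 * z * z - math.log(SIGMA * math.sqrt(2 * math.pi))


def log_transition(prev, cur):
    """Log probability of moving from prev to cur (uniform over switches)."""
    return math.log(STAY if prev == cur else (1 - STAY) / (len(STATES) - 1))


def viterbi(signal):
    """Return the most likely state path for a list of smoothed signal values."""
    if not signal:
        return []
    score = {s: math.log(1 / len(STATES)) + log_emission(signal[0], s)
             for s in STATES}
    back = []  # back[t][cur] = best predecessor of cur at step t + 1
    for x in signal[1:]:
        new_score, pointers = {}, {}
        for cur in STATES:
            best = max(STATES, key=lambda p: score[p] + log_transition(p, cur))
            pointers[cur] = best
            new_score[cur] = (score[best] + log_transition(best, cur)
                              + log_emission(x, cur))
        back.append(pointers)
        score = new_score
    # Trace back from the best final state
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]
```

A high `STAY` value penalises state switches, so isolated noisy points are absorbed into the surrounding segment while sustained shifts in the signal are called as gains or losses.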


Deep learning based on stacked sparse autoencoder applied to viral genome classification of SARS-CoV-2 virus

10.1101/2021.10.14.464414 ◽

2021 ◽

Author(s):

Gracielly G. F. Coutinho ◽

Gabriel B. M. Câmara ◽

Raquel de M. Barbosa ◽

Marcelo A. C. Fernandes

Keyword(s):

Deep Learning ◽

Viral Genome ◽

Genomic Sequence ◽

Confusion Matrix ◽

Taxonomic Classification ◽

Classification Problems ◽

Virus Identification ◽

Sparse Autoencoder ◽

Stacked Sparse Autoencoder

Since December 2019, the world has been intensely affected by the COVID-19 pandemic, caused by the SARS-CoV-2 virus, first identified in Wuhan, China. When a novel virus is identified, early elucidation of the taxonomic classification and origin of the viral genomic sequence is essential for strategic planning, containment, and treatment. Deep learning techniques have been used successfully in many viral classification problems associated with viral infection diagnosis, metagenomics, and phylogenetic analysis. This work proposes an efficient viral genome classifier for the SARS-CoV-2 virus using a deep neural network (DNN) based on the stacked sparse autoencoder (SSAE) technique. We performed four different experiments to provide different levels of taxonomic classification of the SARS-CoV-2 virus. Confusion matrices are presented for the validation and test sets, and the ROC curve for the validation set. In all experiments, the SSAE technique provided great performance results. In this work, we explored the use of image representations of complete genome sequences as the SSAE input to provide a viral classification of SARS-CoV-2. For that, a dataset based on a k-mer image representation, with k=6, was applied. The results indicated the applicability of this deep learning technique to genome classification problems.
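
The k-mer image representation with k=6 mentioned in this abstract can be sketched in Python. With four bases there are 4^6 = 4096 possible 6-mers, which fit exactly into a 64 x 64 grid. The row-major layout, the base-4 indexing and the function names below are assumptions for illustration; the paper may arrange the counts differently.

```python
K = 6
ALPHABET = "ACGT"


def kmer_index(kmer):
    """Map a k-mer to an integer in [0, 4**k) by reading it as a base-4 number."""
    idx = 0
    for base in kmer:
        idx = idx * 4 + ALPHABET.index(base)
    return idx


def kmer_image(seq, k=K):
    """Count k-mers in seq and lay the 4**k counts out as a square grid.

    The grid has side 2**k because 4**k == (2**k) ** 2, so k=6 gives 64 x 64.
    Cells are filled in row-major order of the k-mer index (an assumption).
    """
    side = 2 ** k
    grid = [[0] * side for _ in range(side)]
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if any(base not in ALPHABET for base in kmer):
            continue  # skip windows containing ambiguous bases such as N
        row, col = divmod(kmer_index(kmer), side)
        grid[row][col] += 1
    return grid
```

A grid like this could then be normalised and flattened or treated as a grayscale image before being passed to a classifier.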


DeepT3 2.0: improving type III secreted effector predictions by an integrative deep learning framework

NAR Genomics and Bioinformatics ◽

10.1093/nargab/lqab086 ◽

2021 ◽

Vol 3 (4) ◽

Author(s):

Runyu Jing ◽

Tingke Wen ◽

Chengxiang Liao ◽

Li Xue ◽

Fengjuan Liu ◽

...

Keyword(s):

Deep Learning ◽

Large Scale ◽

Amino Acid Sequences ◽

Secretion Systems ◽

Gram Negative ◽

Type III ◽

Learning Framework ◽

Genome Wide ◽

Type III Secretion Systems ◽

Animal Pathogens

Abstract: Type III secretion systems (T3SSs) are bacterial membrane-embedded nanomachines that allow a number of human, plant and animal pathogens to inject virulence factors directly into the cytoplasm of eukaryotic cells. Export of effectors through T3SSs is critical for motility and virulence of most Gram-negative pathogens. Current computational methods can predict type III secreted effectors (T3SEs) from amino acid sequences, but due to algorithmic constraints, reliable and large-scale prediction of T3SEs in Gram-negative bacteria remains a challenge. Here, we present DeepT3 2.0 (http://advintbioinforlab.com/deept3/), a novel web server that integrates different deep learning models for genome-wide prediction of T3SEs from a bacterium of interest. DeepT3 2.0 combines various deep learning architectures including convolutional, recurrent, convolutional-recurrent and multilayer neural networks to learn N-terminal representations of proteins specifically for T3SE prediction. Outcomes from the different models are processed and integrated to discriminate T3SEs from non-T3SEs. Because it leverages diverse models and an integrative deep learning framework, DeepT3 2.0 outperforms existing methods on validation datasets. In addition, the features learned by the networks are analyzed and visualized to explain how the models make their predictions. We propose DeepT3 2.0 as an integrated and accurate tool for the discovery of T3SEs.


mRIN for direct assessment of genome-wide and gene-specific mRNA integrity from large-scale RNA-sequencing data

Nature Communications ◽

10.1038/ncomms8816 ◽

2015 ◽

Vol 6 (1) ◽

Cited By ~ 38

Author(s):

Huijuan Feng ◽

Xuegong Zhang ◽

Chaolin Zhang

Keyword(s):

RNA Sequencing ◽

Large Scale ◽

Sequencing Data ◽

Direct Assessment ◽

Genome Wide ◽

Specific mRNA


RiboVIEW: a computational framework for visualization, quality control and statistical analysis of ribosome profiling data

Nucleic Acids Research ◽

10.1093/nar/gkz1074 ◽

2019 ◽

Vol 48 (2) ◽

pp. e7-e7 ◽

Cited By ~ 1

Author(s):

Carine Legrand ◽

Francesca Tuorto

Keyword(s):

Quality Control ◽

Statistical Analysis ◽

Computational Analysis ◽

High Throughput Sequencing ◽

Ribosome Profiling ◽

Confounding Factors ◽

Batch Effects ◽

Sequencing Data ◽

Computational Framework ◽

Genome Wide

Abstract: Recently developed ribosome profiling methods based on high-throughput sequencing of ribosome-protected mRNA footprints allow genome-wide translational changes to be studied in detail. However, computational analysis of the sequencing data still represents a bottleneck for many laboratories. Further, specific pipelines for quality control and statistical analysis of ribosome profiling data, providing high levels of both accuracy and confidence, are currently lacking. In this study, we describe automated bioinformatic and statistical diagnoses to perform robust quality control of ribosome profiling data (RiboQC), to efficiently visualize ribosome positions and to estimate ribosome speed (RiboMine) in an unbiased way. We present an R pipeline to set up and undertake the analyses that offers the user an HTML page to scan their own data regarding the following aspects: periodicity, ligation and digestion of footprints; reproducibility and batch effects of replicates; drug-related artifacts; unbiased codon enrichment including variability between mRNAs, for A, P and E sites; mining of some causal or confounding factors. We expect our pipeline to allow an optimal use of the wealth of information provided by ribosome profiling experiments.
