Plant MicroRNA Prediction by Supervised Machine Learning Using C5.0 Decision Trees

Journal of Nucleic Acids ◽

10.1155/2012/652979 ◽

2012 ◽

Vol 2012 ◽

pp. 1-10 ◽

Cited By ~ 14

Author(s):

Philip H. Williams ◽

Rod Eyles ◽

Georg Weiller

Keyword(s):

Machine Learning ◽

Sequence Data ◽

Mature Mirnas ◽

Read Count ◽

Supervised Machine Learning ◽

Sequencing Data ◽

Tree Model ◽

Rigorous Testing ◽

Plant Mirna ◽

Leave One Out

MicroRNAs (miRNAs) are nonprotein coding RNAs between 20 and 22 nucleotides long that attenuate protein production. Different types of sequence data are being investigated for novel miRNAs, including genomic and transcriptomic sequences. A variety of machine learning methods have successfully predicted miRNA precursors, mature miRNAs, and other nonprotein coding sequences. MirTools, mirDeep2, and miRanalyzer require “read count” to be included with the input sequences, which restricts their use to deep-sequencing data. Our aim was to train a predictor using a cross-section of different species to accurately predict miRNAs outside the training set. We wanted a system that did not require read count for prediction and could therefore be applied to short sequences extracted from genomic, EST, or RNA-seq sources. A miRNA-predictive decision-tree model has been developed by supervised machine learning. It only requires that the corresponding genome or transcriptome is available within a sequence window that includes the precursor candidate so that the required sequence features can be collected. Some of the most critical features for training the predictor are the miRNA:miRNA* duplex energy and the number of mismatches in the duplex. We present a cross-species plant miRNA predictor with 84.08% sensitivity and 98.53% specificity based on rigorous testing by leave-one-out validation.
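The sensitivity and specificity figures above come from leave-one-out validation. A minimal sketch of how such figures can be computed is shown here. It is not the authors' pipeline: the nearest-neighbour classifier is a stand-in for the C5.0 decision tree, and the function names and the toy feature values (duplex energy, mismatch count) are invented for illustration.

```python
from typing import Callable, List, Sequence, Tuple

# (feature vector, label) where label 1 = miRNA and 0 = not a miRNA.
Sample = Tuple[Tuple[float, ...], int]


def nearest_neighbour(train: Sequence[Sample], x: Tuple[float, ...]) -> int:
    """Stand-in classifier (NOT C5.0): return the label of the closest training point."""
    def dist2(a: Tuple[float, ...], b: Tuple[float, ...]) -> float:
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return min(train, key=lambda s: dist2(s[0], x))[1]


def leave_one_out(
    data: List[Sample],
    classify: Callable[[Sequence[Sample], Tuple[float, ...]], int],
) -> Tuple[float, float]:
    """Hold out each sample once, predict it from the rest, return (sensitivity, specificity)."""
    tp = fn = tn = fp = 0
    for i, (x, y) in enumerate(data):
        train = data[:i] + data[i + 1:]  # every sample except the held-out one
        pred = classify(train, x)
        if y == 1:
            tp, fn = tp + (pred == 1), fn + (pred != 1)
        else:
            tn, fp = tn + (pred == 0), fp + (pred != 0)
    # Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP).
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical toy data: (duplex energy in kcal/mol, number of duplex mismatches).
toy: List[Sample] = [
    ((-40.0, 1.0), 1), ((-38.0, 2.0), 1), ((-36.0, 1.0), 1),
    ((-15.0, 6.0), 0), ((-12.0, 7.0), 0), ((-10.0, 5.0), 0),
]
sensitivity, specificity = leave_one_out(toy, nearest_neighbour)
```

The sketch assumes the data contain at least one positive and one negative sample; otherwise one of the two ratios is undefined.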

Prediction of Compound-Protein Interactions with Machine Learning Methods

Chemoinformatics and Advanced Machine Learning Perspectives ◽

10.4018/978-1-61520-911-8.ch016 ◽

2011 ◽

pp. 304-317

Author(s):

Yoshihiro Yamanishi ◽

Hisashi Kashima

Keyword(s):

Machine Learning ◽

Protein Interactions ◽

Chemical Structure ◽

Genomic Sequence ◽

Sequence Data ◽

Binary Classification ◽

Biological Data ◽

Supervised Machine Learning ◽

Learning Methods ◽

Machine Learning Methods

In silico prediction of compound-protein interactions from heterogeneous biological data is critical in the process of drug development. In this chapter the authors review several supervised machine learning methods to predict unknown compound-protein interactions from chemical structure and genomic sequence information simultaneously. The authors review several kernel-based algorithms from two different viewpoints: binary classification and dimension reduction. In the results, they demonstrate the usefulness of the methods on the prediction of drug-target interactions and ligand-protein interactions from chemical structure data and genomic sequence data.

Machine learning using intrinsic genomic signatures for rapid classification of novel pathogens: COVID-19 case study

10.1101/2020.02.03.932350 ◽

2020 ◽

Cited By ~ 10

Author(s):

Gurjit S. Randhawa ◽

Maximillian P.M. Soltysiak ◽

Hadi El Roz ◽

Camila P.E. de Souza ◽

Kathleen A. Hill ◽

...

Keyword(s):

Machine Learning ◽

Death Rate ◽

Genomic Sequence ◽

Sequence Data ◽

Rank Correlation ◽

Taxonomic Classification ◽

Supervised Machine Learning ◽

Biological Knowledge ◽

Alignment Free

As of February 20, 2020, the 2019 novel coronavirus (renamed to COVID-19) spread to 30 countries with 2130 deaths and more than 75500 confirmed cases. COVID-19 is being compared to the infamous SARS coronavirus, which resulted, between November 2002 and July 2003, in 8098 confirmed cases worldwide with a 9.6% death rate and 774 deaths. Though COVID-19 has a death rate of 2.8% as of 20 February, the 75752 confirmed cases in a few weeks (December 8, 2019 to February 20, 2020) are alarming, with cases likely being under-reported given the comparatively longer incubation period. Such outbreaks demand elucidation of taxonomic classification and origin of the virus genomic sequence, for strategic planning, containment, and treatment. This paper identifies an intrinsic COVID-19 genomic signature and uses it together with a machine learning-based alignment-free approach for an ultra-fast, scalable, and highly accurate classification of whole COVID-19 genomes. The proposed method combines supervised machine learning with digital signal processing for genome analyses, augmented by a decision tree approach to the machine learning component, and a Spearman’s rank correlation coefficient analysis for result validation. These tools are used to analyze a large dataset of over 5000 unique viral genomic sequences, totalling 61.8 million bp. Our results support a hypothesis of a bat origin and classify COVID-19 as Sarbecovirus, within Betacoronavirus. Our method achieves high levels of classification accuracy and discovers the most relevant relationships among over 5,000 viral genomes within a few minutes, ab initio, using raw DNA sequence data alone, and without any specialized biological knowledge, training, gene or genome annotations. This suggests that, for novel viral and pathogen genome sequences, this alignment-free whole-genome machine-learning approach can provide a reliable real-time option for taxonomic classification.
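The abstract names Spearman's rank correlation coefficient as its validation statistic. As a generic illustration of that statistic only (the function names are invented here, and nothing below reproduces the paper's analysis), it can be computed as the Pearson correlation of average ranks:

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation computed on the ranks of x and y."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)  # undefined (division by zero) if either input is constant
```

Because only ranks enter the calculation, any monotonic relationship between the two inputs gives a coefficient of magnitude 1, whether or not it is linear.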

Prediction of Compound-protein Interactions with Machine Learning Methods

Machine Learning ◽

10.4018/978-1-60960-818-7.ch315 ◽

2012 ◽

pp. 616-630

Author(s):

Yoshihiro Yamanishi ◽

Hisashi Kashima

Keyword(s):

Machine Learning ◽

Protein Interactions ◽

Chemical Structure ◽

Genomic Sequence ◽

Sequence Data ◽

Binary Classification ◽

Biological Data ◽

Supervised Machine Learning ◽

Learning Methods ◽

Machine Learning Methods

Prediction of MicroRNA-Disease Associations Based on Social Network Analysis Methods

BioMed Research International ◽

10.1155/2015/810514 ◽

2015 ◽

Vol 2015 ◽

pp. 1-9 ◽

Cited By ~ 69

Author(s):

Quan Zou ◽

Jinjin Li ◽

Qingqi Hong ◽

Ziyu Lin ◽

Yun Wu ◽

...

Keyword(s):

Machine Learning ◽

Social Network ◽

Social Network Analysis ◽

Network Analysis ◽

Cross Validation ◽

Supervised Machine Learning ◽

Rna Molecules ◽

Disease Associations ◽

Endogenous Genes ◽

Leave One Out

MicroRNAs constitute an important class of noncoding, single-stranded, ~22 nucleotide long RNA molecules encoded by endogenous genes. They play an important role in regulating gene transcription and normal development. MicroRNAs can be associated with disease; however, only a few microRNA-disease associations have been confirmed by traditional experimental approaches. We introduce two methods to predict microRNA-disease association. The first method, KATZ, focuses on integrating the social network analysis method with machine learning and is based on networks derived from known microRNA-disease associations, disease-disease associations, and microRNA-microRNA associations. The other method, CATAPULT, is a supervised machine learning method. We applied the two methods to 242 known microRNA-disease associations and evaluated their performance using leave-one-out cross-validation and 3-fold cross-validation. Experiments proved that our methods outperformed the state-of-the-art methods.
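The KATZ method takes its name from the Katz measure used in social network analysis, which scores a pair of nodes by summing the walks between them, with longer walks damped by a factor beta. The sketch below shows a truncated Katz score on a plain adjacency matrix. The function names, the damping value, and the toy graph are assumptions for illustration, and the paper's combined microRNA-disease network is not reproduced here.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product of a (n x m) and b (m x p)."""
    n, m, p = len(a), len(b), len(b[0])
    return [[sum(a[i][t] * b[t][j] for t in range(m)) for j in range(p)] for i in range(n)]


def truncated_katz(adj: Matrix, beta: float, max_len: int) -> Matrix:
    """Sum of beta**l * A**l for walk lengths l = 1 .. max_len."""
    n = len(adj)
    scores = [[0.0] * n for _ in range(n)]
    power = [row[:] for row in adj]  # A**1
    weight = beta
    for _ in range(max_len):
        for i in range(n):
            for j in range(n):
                scores[i][j] += weight * power[i][j]
        power = matmul(power, adj)  # next power of A
        weight *= beta              # next power of beta
    return scores


# Hypothetical 3-node path graph 0 - 1 - 2.
path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
katz = truncated_katz(path, beta=0.5, max_len=2)
```

In the path graph, nodes 0 and 2 share no edge but are joined by one walk of length 2, so they still receive a nonzero score. This is how a walk-based measure can rank unobserved pairs.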

Algebraic Shortcuts for Leave-One-Out Cross-Validation in Supervised Network Inference

10.1101/242321 ◽

2018 ◽

Author(s):

Michiel Stock ◽

Tapio Pahikkala ◽

Antti Airola ◽

Willem Waegeman ◽

Bernard De Baets

Keyword(s):

Machine Learning ◽

Biological Networks ◽

Regulatory Networks ◽

Network Inference ◽

Cross Validation ◽

Supervised Machine Learning ◽

Machine Learning Techniques ◽

Ligand Interaction ◽

Learning Techniques ◽

Leave One Out

Motivation: Supervised machine learning techniques have traditionally been very successful at reconstructing biological networks, such as protein-ligand interaction, protein-protein interaction and gene regulatory networks. Recently, much emphasis has been placed on the correct evaluation of such supervised models. It is vital to distinguish between using the model to either predict new interactions in a given network or to predict interactions for a new vertex not present in the original network. Specific cross-validation schemes need to be used to assess the performance in such different prediction settings. Results: We present a series of leave-one-out cross-validation shortcuts to rapidly estimate the performance of state-of-the-art kernel-based network inference techniques. Availability: The machine learning techniques with the algebraic shortcuts are implemented in the RLScore software package.

A semi-supervised machine learning framework for microRNA classification

Human Genomics ◽

10.1186/s40246-019-0221-7 ◽

2019 ◽

Vol 13 (S1) ◽

Cited By ~ 2

Author(s):

Mohsen Sheikh Hassani ◽

James R. Green

Keyword(s):

Machine Learning ◽

Sequence Data ◽

Regulation Of Gene Expression ◽

Classification Performance ◽

Unlabeled Data ◽

Training Data ◽

Supervised Machine Learning ◽

Novel Mirna ◽

Multi Stage ◽

Wet Lab

Background: MicroRNAs (miRNAs) are a family of short, non-coding RNAs that have been linked to critical cellular activities, most notably regulation of gene expression. The identification of miRNA is a cross-disciplinary approach that requires both computational identification methods and wet-lab validation experiments, making it a resource-intensive procedure. While numerous machine learning methods have been developed to increase classification accuracy and thus reduce validation costs, most methods use supervised learning and thus require large labeled training data sets, often not feasible for less-sequenced species. On the other hand, there is now an abundance of unlabeled RNA sequence data due to the emergence of high-throughput wet-lab experimental procedures, such as next-generation sequencing. Results: This paper explores the application of semi-supervised machine learning for miRNA classification in order to maximize the utility of both labeled and unlabeled data. We here present the novel combination of two semi-supervised approaches: active learning and multi-view co-training. Results across six diverse species show that this multi-stage semi-supervised approach is able to improve classification performance using very small numbers of labeled instances, effectively leveraging the available unlabeled data. Conclusions: The proposed semi-supervised miRNA classification pipeline holds the potential to identify novel miRNA with high recall and precision while requiring very small numbers of previously known miRNA. Such a method could be highly beneficial when studying miRNA in newly sequenced genomes of niche species with few known examples of miRNA.

16S rRNA sequence embeddings: Meaningful numeric feature representations of nucleotide sequences that are convenient for downstream analyses

10.1101/314260 ◽

2018 ◽

Cited By ~ 1

Author(s):

Stephen Woloszynek ◽

Zhengqiao Zhao ◽

Jian Chen ◽

Gail L. Rosen

Keyword(s):

Machine Learning ◽

16S Rrna ◽

Nucleotide Sequences ◽

The Body ◽

Supervised Machine Learning ◽

Numerical Representation ◽

Body Site ◽

Sequencing Data ◽

Machine Learning Applications ◽

Microbiome Data

Advances in high-throughput sequencing have increased the availability of microbiome sequencing data that can be exploited to characterize microbiome community structure in situ. We explore using word and sentence embedding approaches for nucleotide sequences since they may be a suitable numerical representation for downstream machine learning applications (especially deep learning). This work involves first encoding (“embedding”) each sequence into a dense, low-dimensional, numeric vector space. Here, we use Skip-Gram word2vec to embed k-mers, obtained from 16S rRNA amplicon surveys, and then leverage an existing sentence embedding technique to embed all sequences belonging to specific body sites or samples. We demonstrate that these representations are biologically meaningful, and hence the embedding space can be exploited as a form of feature extraction for exploratory analysis. We show that sequence embeddings preserve relevant information about the sequencing data such as k-mer context, sequence taxonomy, and sample class. Specifically, the sequence embedding space resolved differences among phyla, as well as differences among genera within the same family. Distances between sequence embeddings had similar qualities to distances between alignment identities, and embedding multiple sequences can be thought of as generating a consensus sequence. Using sample embeddings for body site classification resulted in negligible performance loss compared to using OTU abundance data. Lastly, the k-mer embedding space captured distinct k-mer profiles that mapped to specific regions of the 16S rRNA gene and corresponded with particular body sites. Together, our results show that embedding sequences results in meaningful representations that can be used for exploratory analyses or for downstream machine learning applications that require numeric data. Moreover, because the embeddings are trained in an unsupervised manner, unlabeled data can be embedded and used to bolster supervised machine learning tasks.

Author summary: Improvements in the way genomes are sequenced have led to an abundance of microbiome data. With the right approaches, researchers use this data to thoroughly characterize how microbes interact with each other and their host, but sequencing data is of a form (sequences of letters) not ideal for many data analysis approaches. We therefore present an approach to transform sequencing data into arrays of numbers that can capture interesting qualities of the data at the sub-sequence, full-sequence, and sample levels. This allows us to measure the importance of certain microbial sequences with respect to the type of microbe and the condition of the host. Also, representing sequences in this way improves our ability to use other complicated modeling approaches. Using microbiome data from human samples, we show that our numeric representations captured differences between different types of microbes, as well as differences in the body site location from which the samples were collected.
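Embedding k-mers with a word2vec-style model first requires turning each nucleotide sequence into a "sentence" of overlapping k-mer "words". A minimal sketch of that preprocessing step follows. The function names, the choice of k, and the example reads are illustrative assumptions, not the authors' code, and the embedding training itself is not shown.

```python
from typing import List


def kmers(sequence: str, k: int) -> List[str]:
    """Split a nucleotide sequence into overlapping k-mers (stride 1)."""
    seq = sequence.upper()
    # A sequence shorter than k yields an empty list.
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def to_sentences(sequences: List[str], k: int) -> List[List[str]]:
    """One token list per sequence, i.e. a corpus of k-mer 'sentences'."""
    return [kmers(s, k) for s in sequences]


# Hypothetical reads; real input would be 16S rRNA amplicon sequences.
corpus = to_sentences(["ACGTAC", "ggtacc"], k=3)
```

Each inner list plays the role of a tokenized sentence, the usual input form for skip-gram training.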

Revised annotation and characterization of novel Aedes albopictus miRNAs and their potential functions in dengue virus infection

10.1101/2020.03.01.972398 ◽

2020 ◽

Author(s):

Azali Azlan ◽

Muhammad Amir Yunus ◽

Ghows Azzam

Keyword(s):

Aedes Albopictus ◽

Target Genes ◽

Sequence Data ◽

Target Prediction ◽

Mature Mirnas ◽

Denv Infection ◽

Future Research ◽

Small Rna Sequencing ◽

Sequencing Data ◽

Public Datasets

The Asian tiger mosquito, Aedes albopictus (Ae. albopictus), is a highly invasive species that transmits several arboviruses including dengue (DENV), Zika (ZIKV), and chikungunya (CHIKV). Although several studies have identified microRNAs (miRNAs) in Ae. albopictus, it is crucial to extend and improve current annotations with the newly improved genome assembly and the increased number of small RNA-sequencing datasets. We combined our high-depth sequence data and 26 public datasets to re-annotate Ae. albopictus miRNAs, and found a total of 110 novel mature miRNAs. We discovered that the expression of novel miRNAs was lower than that of known miRNAs. Furthermore, compared to known miRNAs, novel miRNAs are prone to be expressed in a stage-specific manner. Upon DENV infection, a total of 59 novel miRNAs were differentially expressed, and target prediction analysis revealed that miRNA-target genes were involved in lipid metabolism and protein processing in the endoplasmic reticulum. Taken together, the miRNA annotation profile provided here is the most comprehensive to date, and we believe that it will facilitate future research in understanding virus-host interactions, particularly the role of miRNAs.

S/HIC: Robust identification of soft and hard sweeps using machine learning

10.1101/024547 ◽

2015 ◽

Cited By ~ 1

Author(s):

Daniel R. Schrider ◽

Andrew D. Kern

Keyword(s):

Machine Learning ◽

Population Sample ◽

Natural Populations ◽

Supervised Machine Learning ◽

Whole Genome Sequencing Data ◽

Human Populations ◽

Demographic Model ◽

Selective Sweeps ◽

Sequencing Data ◽

Standing Variation

Detecting the targets of adaptive natural selection from whole genome sequencing data is a central problem for population genetics. However, to date most methods have shown sub-optimal performance under realistic demographic scenarios. Moreover, over the past decade there has been a renewed interest in determining the importance of selection from standing variation in adaptation of natural populations, yet very few methods for inferring this model of adaptation at the genome scale have been introduced. Here we introduce a new method, S/HIC, which uses supervised machine learning to precisely infer the location of both hard and soft selective sweeps. We show that S/HIC has unrivaled accuracy for detecting sweeps under demographic histories that are relevant to human populations, and distinguishing sweeps from linked as well as neutrally evolving regions. Moreover, we show that S/HIC is uniquely robust among its competitors to model misspecification. Thus, even if the true demographic model of a population differs catastrophically from that specified by the user, S/HIC still retains impressive discriminatory power. Finally, we apply S/HIC to the case of resequencing data from human chromosome 18 in a European population sample and demonstrate that we can reliably recover selective sweeps that have been identified earlier using less specific and sensitive methods.

Comparative machine learning approach for biomarker identification using multiomics data from patients with endometriosis

10.32469/10355/73840 ◽

2019 ◽

Author(s):

Sadia Akter

Keyword(s):

Machine Learning ◽

Molecular Mechanisms ◽

Sequence Data ◽

Supervised Machine Learning ◽

Support Vector ◽

Learning Tools ◽

Next Generation ◽

University Of Missouri ◽

Gynecological Disorder ◽

Ngs Data

[ACCESS RESTRICTED TO THE UNIVERSITY OF MISSOURI AT REQUEST OF AUTHOR.] Endometriosis is a common, complex, and poorly understood gynecological disorder that affects about 176 million women worldwide, significantly impairs their quality of life, and imposes an economic burden. Neither a definitive clinical symptom nor a minimally invasive diagnostic method is available, leading to an average diagnostic latency of 10 years. Discovery of relevant biological patterns from microarray expression or next generation sequence (NGS) data has been advanced over the last several decades by applying various machine learning tools. The overall objective of this project was to identify diagnostic molecular mechanisms and biomarkers of endometriosis using a multi-omics approach and various machine learning classifiers. This objective was fulfilled by three related but independent aims: (1) mine RNA-seq data to discover molecular mechanisms of endometriosis, (2) discover diagnostic features of endometriosis in the DNA-methylation profile of the endometrium, and (3) develop innovative machine learning-based differential classification models using whole genome high throughput next generation sequence data. We evaluated how well various supervised machine learning methods, such as decision tree, partial least squares-discriminant analysis, support vector machine, random forest, and a newly developed method called GenomeForest, classify endometriosis samples against control samples when trained on both transcriptomics and methylomics data.
