Benchmarking and Testing Machine Learning Approaches with BARRA:CuRDa, a Curated RNA-Seq Database for Cancer Research

CuMiDa: An Extensively Curated Microarray Database for Benchmarking and Testing of Machine Learning Approaches in Cancer Research

Journal of Computational Biology ◽

10.1089/cmb.2018.0238 ◽

2019 ◽

Vol 26 (4) ◽

pp. 376-386

Author(s):

Bruno César Feltes ◽

Eduardo Bassani Chandelier ◽

Bruno Iochins Grisci ◽

Márcio Dorn

Keyword(s):

Machine Learning ◽

Cancer Research ◽

Learning Approaches ◽

Microarray Database

Cross-platform normalization of microarray and RNA-seq data for machine learning applications

10.7287/peerj.preprints.1460 ◽

2015 ◽

Author(s):

Jeffrey A Thompson ◽

Jie Tan ◽

Casey S Greene

Keyword(s):

Gene Expression ◽

Machine Learning ◽

Machine Learning Algorithms ◽

Quantile Normalization ◽

Learning Approaches ◽

Rna Seq ◽

Distribution Matching ◽

Machine Learning Applications ◽

Cross Platform ◽

R Programming

Large, publicly available gene expression datasets are often analyzed with the aid of machine learning algorithms. Although RNA-seq is increasingly the technology of choice, a wealth of expression data already exist in the form of microarray data. If machine learning models built from legacy data can be applied to RNA-seq data, larger, more diverse training datasets can be created and validation can be performed on newly generated data. We developed Training Distribution Matching (TDM), which transforms RNA-seq data for use with models constructed from legacy platforms. We evaluated TDM, as well as quantile normalization and a simple log2 transformation, on both simulated and biological datasets of gene expression. Our evaluation included both supervised and unsupervised machine learning approaches. We found that TDM exhibited consistently strong performance across settings and that quantile normalization also performed well in many circumstances. We also provide a TDM package for the R programming language.

Cross-platform normalization of microarray and RNA-seq data for machine learning applications

PeerJ ◽

10.7717/peerj.1621 ◽

2016 ◽

Vol 4 ◽

pp. e1621 ◽

Cited By ~ 42

Author(s):

Jeffrey A. Thompson ◽

Jie Tan ◽

Casey S. Greene

Keyword(s):

Gene Expression ◽

Machine Learning ◽

Machine Learning Algorithms ◽

Quantile Normalization ◽

Learning Approaches ◽

Rna Seq ◽

Distribution Matching ◽

Machine Learning Applications ◽

Cross Platform ◽

R Programming

Large, publicly available gene expression datasets are often analyzed with the aid of machine learning algorithms. Although RNA-seq is increasingly the technology of choice, a wealth of expression data already exist in the form of microarray data. If machine learning models built from legacy data can be applied to RNA-seq data, larger, more diverse training datasets can be created and validation can be performed on newly generated data. We developed Training Distribution Matching (TDM), which transforms RNA-seq data for use with models constructed from legacy platforms. We evaluated TDM, as well as quantile normalization, nonparanormal transformation, and a simple log2 transformation, on both simulated and biological datasets of gene expression. Our evaluation included both supervised and unsupervised machine learning approaches. We found that TDM exhibited consistently strong performance across settings and that quantile normalization also performed well in many circumstances. We also provide a TDM package for the R programming language.
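
Two of the baseline transformations named in this abstract, the log2 transformation and quantile normalization, can be illustrated with a minimal sketch. This is not the TDM method and not the API of the authors' R package. The function names, the pseudocount of 1, the tie handling, and the toy values are all assumptions made for illustration.

```python
import math


def log2_transform(counts, pseudocount=1.0):
    """Return log2(x + pseudocount) per value.

    The pseudocount is an assumption so that zero counts stay finite.
    """
    return [math.log2(x + pseudocount) for x in counts]


def quantile_normalize_to_reference(target, reference):
    """Give each target value the reference value of the same rank.

    Assumes both lists have the same length. Ties are broken by position;
    fuller implementations usually average tied ranks.
    """
    if len(target) != len(reference):
        raise ValueError("target and reference must have the same length")
    ref_sorted = sorted(reference)
    # Indices of the target values, ordered from smallest to largest value.
    order = sorted(range(len(target)), key=lambda i: target[i])
    normalized = [0.0] * len(target)
    for rank, idx in enumerate(order):
        normalized[idx] = ref_sorted[rank]
    return normalized


# Hypothetical toy values: one microarray sample (already on a log scale)
# and raw RNA-seq counts for the same four genes.
microarray_reference = [5.0, 7.5, 6.0, 9.0]
rnaseq_counts = [0, 1023, 15, 255]

rnaseq_log = log2_transform(rnaseq_counts)
rnaseq_matched = quantile_normalize_to_reference(rnaseq_log, microarray_reference)
```

After this step the RNA-seq sample keeps its own gene ranking but takes on the value distribution of the microarray reference, which is the general idea behind applying a model trained on one platform to data from the other.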

Cross-platform normalization of microarray and RNA-seq data for machine learning applications

10.7287/peerj.preprints.1460v1 ◽

2015 ◽

Author(s):

Jeffrey A Thompson ◽

Jie Tan ◽

Casey S Greene

Keyword(s):

Gene Expression ◽

Machine Learning ◽

Machine Learning Algorithms ◽

Quantile Normalization ◽

Learning Approaches ◽

Rna Seq ◽

Distribution Matching ◽

Machine Learning Applications ◽

Cross Platform ◽

R Programming

Large, publicly available gene expression datasets are often analyzed with the aid of machine learning algorithms. Although RNA-seq is increasingly the technology of choice, a wealth of expression data already exist in the form of microarray data. If machine learning models built from legacy data can be applied to RNA-seq data, larger, more diverse training datasets can be created and validation can be performed on newly generated data. We developed Training Distribution Matching (TDM), which transforms RNA-seq data for use with models constructed from legacy platforms. We evaluated TDM, as well as quantile normalization and a simple log2 transformation, on both simulated and biological datasets of gene expression. Our evaluation included both supervised and unsupervised machine learning approaches. We found that TDM exhibited consistently strong performance across settings and that quantile normalization also performed well in many circumstances. We also provide a TDM package for the R programming language.

PIC-Me: paralogs and isoforms classifier based on machine-learning approaches

BMC Bioinformatics ◽

10.1186/s12859-021-04229-x ◽

2021 ◽

Vol 22 (S11) ◽

Author(s):

Jooseong Oh ◽

Sung-Gwon Lee ◽

Chungoo Park

Keyword(s):

Machine Learning ◽

Large Scale ◽

Gene Annotation ◽

Sequence Similarity ◽

Global Analysis ◽

Model Organism ◽

Model Organisms ◽

Support Vector ◽

Learning Approaches ◽

Rna Seq

Background: Paralogs formed through gene duplication and isoforms formed through alternative splicing have been important processes for increasing protein diversity and maintaining cellular homeostasis. Despite their recognized importance and the advent of large-scale genomic and transcriptomic analyses, paradoxically, accurate annotations of all gene loci to allow the identification of paralogs and isoforms remain surprisingly incomplete. In particular, the global analysis of the transcriptome of a non-model organism for which there is no reference genome is especially challenging.

Results: To reliably discriminate between the paralogs and isoforms in RNA-seq data, we redefined the pre-existing sequence features (sequence similarity, inverse count of consecutive identical or non-identical blocks, and match-mismatch fraction) previously derived from full-length cDNAs and EST sequences and described newly discovered genomic and transcriptomic features (twilight zone of protein sequence alignment and expression level difference). In addition, the effectiveness and relevance of the proposed features were verified with two widely used support vector machine (SVM) and random forest (RF) models. From nine RNA-seq datasets, all AUC (area under the curve) scores of ROC (receiver operating characteristic) curves were over 0.9 in the RF model and significantly higher than those in the SVM model.

Conclusions: In this study, using an RF model with five proposed RNA-seq features, we implemented our method called Paralogs and Isoforms Classifier based on Machine-learning approaches (PIC-Me) and showed that it outperformed an existing method. Finally, we envision that our tool will be a valuable computational resource for the genomics community to help with gene annotation and will aid in comparative transcriptomics and evolutionary genomics studies, especially those on non-model organisms.
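
The AUC of a ROC curve reported in this abstract can be computed directly from classifier scores. The sketch below uses the pairwise formulation: the fraction of (positive, negative) pairs in which the positive example receives the higher score. It is not taken from PIC-Me. The function name, the labels, and the scores are hypothetical.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for the positive class, 0 for the negative class.
    scores: classifier scores, higher meaning "more likely positive".
    A tied pair counts as half a win.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical scores for six transcript pairs (1 = paralog, 0 = isoform).
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
auc = roc_auc(labels, scores)
```

Here 8 of the 9 positive-negative pairs are ordered correctly, so the AUC is 8/9. The quadratic loop is chosen for readability; a rank-based computation would be preferable for large datasets.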

Traffic Flow Breakdown Prediction using Machine Learning Approaches

Transportation Research Record Journal of the Transportation Research Board ◽

10.1177/0361198120934480 ◽

2020 ◽

Vol 2674 (10) ◽

pp. 560-570

Author(s):

Monika Filipovska ◽

Hani S. Mahmassani

Keyword(s):

Machine Learning ◽

Traffic Flow ◽

Testing Machine ◽

True Positive Rate ◽

Probabilistic Methods ◽

Support Vector ◽

Learning Approaches ◽

Traffic Dynamics ◽

True Negative ◽

Prediction Approach

Traffic flow breakdown is the abrupt shift from operation at free-flow conditions to congested conditions and is typically the result of complex interactions in traffic dynamics. Because of its stochastic nature, breakdown is commonly predicted only in a probabilistic manner. This paper focuses on using stationary aggregated traffic data to capture traffic dynamics, developing and testing machine learning (ML) approaches for traffic breakdown prediction and comparing them with the traditionally used probabilistic approaches. The contribution of this study is three-fold: it explores the usefulness of temporally and spatially lagged detector data in predicting traffic flow breakdown occurrence, it develops and tests ML approaches for traffic breakdown prediction using this data, and it compares the predictive power and performance of these approaches with the traditionally used probabilistic methods. Feature selection results indicate that breakdown prediction benefits greatly from the inclusion of temporally and spatially lagged variables. Comparing the performance of the ML methods with the probabilistic approaches, ML methods achieve better prediction performance in relation to the class-balanced accuracy, true positive rate (recall), true negative rate (specificity), and positive predictive value (precision). Depending on the application of the prediction approach, the method selection criteria may differ on a case-by-case basis. Overall, the best performance was achieved by the neural network and support vector machine approaches with class balancing, and with the random forest approach without class balancing. Recommendations on the choice of prediction approaches based on the specific application objectives are also given.
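
The four evaluation measures named in this abstract can be derived from a binary confusion matrix. The sketch below assumes that "class-balanced accuracy" means the mean of recall and specificity, which is a common definition but is not confirmed by the abstract. The function name and the example labels are hypothetical.

```python
def binary_metrics(y_true, y_pred):
    """Recall, specificity, precision and balanced accuracy for 0/1 labels.

    A ratio with a zero denominator is reported as 0.0 (an assumption).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        return num / den if den else 0.0

    recall = ratio(tp, tp + fn)        # true positive rate
    specificity = ratio(tn, tn + fp)   # true negative rate
    precision = ratio(tp, tp + fp)     # positive predictive value
    return {
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "balanced_accuracy": (recall + specificity) / 2,
    }


# Hypothetical intervals: 1 = breakdown occurred, 0 = no breakdown.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
```

Because breakdown events are typically much rarer than free-flow intervals, plain accuracy can look high for a model that never predicts breakdown; averaging recall and specificity avoids that.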

Testing machine learning approaches for wind plants power output

2019 International Youth Conference on Radio Electronics, Electrical and Power Engineering (REEPE) ◽

10.1109/reepe.2019.8708815 ◽

2019 ◽

Author(s):

Alexander Malakhov ◽

Fyodor Goncharov ◽

Elena Gryazina

Keyword(s):

Machine Learning ◽

Power Output ◽

Testing Machine ◽

Learning Approaches

blkbox: Integration of multiple machine learning approaches to identify disease biomarkers

10.1101/123430 ◽

2017 ◽

Author(s):

Boris Guennewig ◽

Zachary Davies ◽

Mark Pinese ◽

Antony A Cooper

Keyword(s):

Machine Learning ◽

Feature Selection ◽

R Package ◽

Learning Approaches ◽

Rna Seq ◽

Disease Biomarkers ◽

Potential Biomarker ◽

Selection Step ◽

High Dimensional Datasets ◽

Biomarker Selection

Motivation: Machine learning (ML) is a powerful tool to create supervised models that can distinguish between classes and facilitate biomarker selection in high-dimensional datasets, including RNA Sequencing (RNA-Seq). However, it is variable as to which is the best performing ML algorithm(s) for a specific dataset, and identifying the optimal match is time consuming. blkbox is a software package including a shiny frontend, that integrates nine ML algorithms to select the best performing classifier for a specific dataset. blkbox accepts a simple abundance matrix as input, includes extensive visualization, and also provides an easy to use feature selection step to enable convenient and rapid potential biomarker selection, all without requiring parameter optimization.

Results: Feature selection makes blkbox computationally inexpensive while multi-functionality, including nested cross-fold validation (NCV), ensures robust results. blkbox identified algorithms that outperformed prior published ML results. Applying NCV identifies features, which are utilized to gain high accuracy.

Availability: The software is available as a CRAN R package and as a developer version with extended functionality on github (https://github.com/gboris/blkbox).
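
The nested cross-fold validation mentioned in this abstract can be sketched as index bookkeeping: an outer loop holds out a test fold, and an inner loop splits only the remaining samples for steps such as feature selection. This is not the blkbox implementation, which is an R package. The function names, the interleaved fold assignment, and the fold counts are assumptions.

```python
def k_fold(indices, k):
    """Split a list of indices into k interleaved folds."""
    return [indices[i::k] for i in range(k)]


def nested_cv_splits(n_samples, outer_k=3, inner_k=2):
    """Yield (outer_train, outer_test, inner_splits) for each outer fold.

    inner_splits is a list of (inner_train, inner_val) pairs drawn only
    from outer_train, so the outer test fold is never used for selection.
    """
    all_idx = list(range(n_samples))
    for outer_test in k_fold(all_idx, outer_k):
        held_out = set(outer_test)
        outer_train = [i for i in all_idx if i not in held_out]
        inner_splits = []
        for inner_val in k_fold(outer_train, inner_k):
            val = set(inner_val)
            inner_train = [i for i in outer_train if i not in val]
            inner_splits.append((inner_train, inner_val))
        yield outer_train, outer_test, inner_splits


# Six hypothetical samples, three outer folds, two inner folds.
splits = list(nested_cv_splits(6, outer_k=3, inner_k=2))
```

Features or hyperparameters would be chosen using the inner splits, and the chosen configuration would then be scored once on the corresponding outer test fold, which keeps the performance estimate separate from the selection step.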

AImmune: a new blood-based machine learning approach to improving immune profiling analysis on COVID-19 patients

10.1101/2021.11.26.21266883 ◽

2021 ◽

Author(s):

Xi Tom Zhang ◽

Runpeng Harris Han

Keyword(s):

Machine Learning ◽

High Performance ◽

Mononuclear Cells ◽

Data Sets ◽

Learning Approaches ◽

Rna Seq ◽

Real World Data ◽

Novel Approach ◽

Massive Number ◽

Immune Profiling

A massive number of transcriptomic profiles of blood samples from COVID-19 patients have been produced since the COVID-19 pandemic began; however, these big data from primary studies have not been well integrated by machine learning approaches. Taking advantage of modern machine learning algorithms, we collected and integrated single-cell RNA-seq (scRNA-seq) data from three independent studies, identified genes potentially useful for interpreting severity, and developed AImmune, a high-performance deep learning-based deconvolution model that can predict the proportions of seven different immune cells from the bulk RNA-seq results of human peripheral blood mononuclear cells (PBMCs). This novel approach can be used for clinical blood testing of COVID-19, on the grounds that previous research shows that mRNA alterations in blood-derived PBMCs may serve as a severity indicator. Assessed on real-world data sets, the AImmune model outperformed CIBERSORTx, the most recognized immune profiling model. The presented study showed that the results obtained by the true scRNA-seq route can be consistently reproduced through the new approach AImmune, indicating its potential to replace the costly scRNA-seq technique for the analysis of circulating blood cells for both clinical and research purposes.

Recent Machine Learning Approaches for Single-Cell RNA-seq Data Analysis

Advanced Computational Intelligence in Healthcare-7 - Studies in Computational Intelligence ◽

10.1007/978-3-662-61114-2_5 ◽

2020 ◽

pp. 65-79

Author(s):

Aristidis G. Vrahatis ◽

Sotiris K. Tasoulis ◽

Ilias Maglogiannis ◽

Vassilis P. Plagianakos

Keyword(s):

Machine Learning ◽

Data Analysis ◽

Single Cell ◽

Learning Approaches ◽

Rna Seq
