Cross-platform normalization of microarray and RNA-seq data for machine learning applications

PeerJ ◽

10.7717/peerj.1621 ◽

2016 ◽

Vol 4 ◽

pp. e1621 ◽

Cited By ~ 42

Author(s):

Jeffrey A. Thompson ◽

Jie Tan ◽

Casey S. Greene

Keyword(s):

Gene Expression ◽

Machine Learning ◽

Machine Learning Algorithms ◽

Quantile Normalization ◽

Learning Approaches ◽

Rna Seq ◽

Distribution Matching ◽

Machine Learning Applications ◽

Cross Platform ◽

R Programming

Large, publicly available gene expression datasets are often analyzed with the aid of machine learning algorithms. Although RNA-seq is increasingly the technology of choice, a wealth of expression data already exist in the form of microarray data. If machine learning models built from legacy data can be applied to RNA-seq data, larger, more diverse training datasets can be created and validation can be performed on newly generated data. We developed Training Distribution Matching (TDM), which transforms RNA-seq data for use with models constructed from legacy platforms. We evaluated TDM, as well as quantile normalization, nonparanormal transformation, and a simple log2 transformation, on both simulated and biological datasets of gene expression. Our evaluation included both supervised and unsupervised machine learning approaches. We found that TDM exhibited consistently strong performance across settings and that quantile normalization also performed well in many circumstances. We also provide a TDM package for the R programming language.
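
Two of the baseline transformations named in this abstract, the log2 transformation and quantile normalization, can be illustrated with a minimal Python sketch. This is not the TDM method and not the authors' R package; the function names, the pseudocount of 1, and the equal-length requirement are assumptions made only for illustration.

```python
import math


def log2_transform(counts, pseudocount=1.0):
    """Return log2(count + pseudocount) for each value.

    The pseudocount (an assumed default of 1) keeps zero counts finite.
    """
    return [math.log2(c + pseudocount) for c in counts]


def quantile_normalize_to_reference(target, reference):
    """Give each target value the reference value that has the same rank.

    Simplifying assumptions: both lists have the same length, and ties in
    the target are broken by original position.
    """
    if len(target) != len(reference):
        raise ValueError("target and reference must have the same length")
    ref_sorted = sorted(reference)
    # Indices of the target values, ordered from smallest to largest value.
    order = sorted(range(len(target)), key=lambda i: target[i])
    result = [0.0] * len(target)
    for rank, idx in enumerate(order):
        result[idx] = ref_sorted[rank]
    return result


# Hypothetical RNA-seq counts for three genes, and hypothetical
# microarray-scale reference values for the same three genes.
rnaseq_counts = [50, 5, 500]
microarray_reference = [2.0, 8.0, 4.0]
normalized = quantile_normalize_to_reference(rnaseq_counts, microarray_reference)
```

After this mapping the RNA-seq sample keeps its own gene ranking but takes on the value distribution of the reference, which is the general idea behind applying a microarray-trained model to the transformed data.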

Cross-platform normalization of microarray and RNA-seq data for machine learning applications

10.7287/peerj.preprints.1460v1 ◽

2015 ◽

Author(s):

Jeffrey A Thompson ◽

Jie Tan ◽

Casey S Greene

Keyword(s):

Gene Expression ◽

Machine Learning ◽

Machine Learning Algorithms ◽

Quantile Normalization ◽

Learning Approaches ◽

Rna Seq ◽

Distribution Matching ◽

Machine Learning Applications ◽

Cross Platform ◽

R Programming

Large, publicly available gene expression datasets are often analyzed with the aid of machine learning algorithms. Although RNA-seq is increasingly the technology of choice, a wealth of expression data already exist in the form of microarray data. If machine learning models built from legacy data can be applied to RNA-seq data, larger, more diverse training datasets can be created and validation can be performed on newly generated data. We developed Training Distribution Matching (TDM), which transforms RNA-seq data for use with models constructed from legacy platforms. We evaluated TDM, as well as quantile normalization and a simple log2 transformation, on both simulated and biological datasets of gene expression. Our evaluation included both supervised and unsupervised machine learning approaches. We found that TDM exhibited consistently strong performance across settings and that quantile normalization also performed well in many circumstances. We also provide a TDM package for the R programming language.

Compendiums of cancer transcriptomes for machine learning applications

Scientific Data ◽

10.1038/s41597-019-0207-2 ◽

2019 ◽

Vol 6 (1) ◽

Cited By ~ 2

Author(s):

Su Bin Lim ◽

Swee Jin Tan ◽

Wan-Teck Lim ◽

Chwee Teck Lim

Keyword(s):

Machine Learning ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Data Reuse ◽

Rna Seq ◽

Genomic Landscape ◽

Source Data ◽

Machine Learning Applications ◽

Cancer Types ◽

Data Source

Massive numbers of transcriptome profiles exist in the form of microarray data. The challenge is that they are processed using diverse platforms and preprocessing tools, requiring considerable time and informatics expertise for cross-dataset analyses. If a single, integrated data source existed, data reuse could be facilitated for discovery, analysis, and validation of biomarker-based clinical strategies. Here, we present merged microarray-acquired datasets (MMDs) across 11 major cancer types, curating 8,386 patient-derived tumor and tumor-free samples from 95 GEO datasets. Using machine learning algorithms, we show that diagnostic models trained from MMDs can be directly applied to RNA-seq-acquired TCGA data with high classification accuracy. Machine-learning-optimized MMDs further help reveal the immune landscape across various carcinomas, which is critically needed in disease management and clinical interventions. This unified data source may serve as an excellent training or test set to apply, develop, and refine machine learning algorithms that can be tapped to better define the genomic landscape of human cancers.

Peer Review #2 of "Cross-platform normalization of microarray and RNA-seq data for machine learning applications (v0.1)"

10.7287/peerj.1621v0.1/reviews/2 ◽

2016 ◽

Author(s):

CT Brown

Keyword(s):

Machine Learning ◽

Peer Review ◽

Rna Seq ◽

Machine Learning Applications ◽

Cross Platform

Peer Review #1 of "Cross-platform normalization of microarray and RNA-seq data for machine learning applications (v0.1)"

10.7287/peerj.1621v0.1/reviews/1 ◽

2016 ◽

Author(s):

K Choi

Keyword(s):

Machine Learning ◽

Peer Review ◽

Rna Seq ◽

Machine Learning Applications ◽

Cross Platform

Cross-Platform Normalization Enables Machine Learning Model Training On Microarray And RNA-Seq Data Simultaneously

10.1101/118349 ◽

2017 ◽

Cited By ~ 5

Author(s):

Jaclyn N Taroni ◽

Casey S Greene

Keyword(s):

Machine Learning ◽

Differential Expression ◽

Differential Expression Analysis ◽

Rna Seq ◽

Normalization Methods ◽

Machine Learning Model ◽

Distribution Matching ◽

Z Scores ◽

Cross Platform ◽

Model Training

Motivation: Large compendia of gene expression data have proven valuable for the discovery of novel biological relationships. The majority of available RNA assays are run on microarray, while RNA-seq is becoming the platform of choice for new experiments. The data structure and distributions between the platforms differ, making it challenging to combine them. We performed supervised and unsupervised machine learning evaluations, as well as differential expression analyses, to assess which normalization methods are best suited for combining microarray and RNA-seq data. Results: We find that quantile and Training Distribution Matching normalization allow for supervised and unsupervised model training on microarray and RNA-seq data simultaneously. Nonparanormal normalization and z-scores are also appropriate for some applications, including differential expression analysis. Availability and Implementation: These analyses were performed in R and are available at https://www.github.com/greenelab/RNAseq_titration_results under a BSD-3 clause license.
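
The z-score normalization mentioned in this abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' R code; standardizing one gene's values within a single platform and using the population standard deviation are assumptions made here.

```python
import statistics


def z_scores(values):
    """Standardize values to mean 0 and standard deviation 1.

    Uses the population standard deviation (an assumption). A constant
    input has no spread, so every value maps to 0.0.
    """
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return [0.0 for _ in values]
    return [(v - mean) / sd for v in values]


# Hypothetical expression values for one gene on each platform.
microarray_gene = [7.1, 8.3, 6.9, 7.7]
rnaseq_gene = [120.0, 340.0, 95.0, 210.0]

# Standardizing each platform separately puts both on a unitless scale.
microarray_z = z_scores(microarray_gene)
rnaseq_z = z_scores(rnaseq_gene)
```

Because each platform is standardized on its own, the two sets of values become comparable in scale even though their raw units differ.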

Classification of Phonocardiography Signals Using Imbalanced Machine Learning Techniques

Journal of Intelligent Systems with Applications ◽

10.54856/jiswa.202012128 ◽

2020 ◽

pp. 103-106

Author(s):

Mustafa Berkant Selek ◽

Sude Pehlivan ◽

Yalcin Isler

Keyword(s):

Machine Learning ◽

Frequency Domain ◽

Cardiovascular Health ◽

Machine Learning Algorithms ◽

Machine Learning Techniques ◽

Learning Approaches ◽

Random Forest Algorithm ◽

Learning Techniques ◽

Machine Learning Applications

Cardiovascular diseases, which involve heart and blood vessel dysfunctions, cause a higher number of deaths than any other disease in the world. Throughout history, many approaches have been developed to analyze cardiovascular health by diagnosing such conditions. One of these methodologies, called phonocardiography, records and analyzes heart sounds to distinguish normal from abnormal functioning of the heart. With the emergence of machine learning applications in healthcare, this process can be automated by extracting various features from phonocardiography signals and classifying them with several machine learning algorithms. Many studies have extracted time and frequency domain features from phonocardiography signals by first segmenting them into individual heart cycles and then classifying them using different machine learning and deep learning approaches. In this study, various time and frequency domain features have been extracted using the complete signal rather than just segments of it. The Random Forest algorithm was found to be the most successful algorithm in terms of accuracy as well as recall and precision.

Machine Learning Applications in Nanomedicine and Nanotoxicology

International Journal of Applied Nanotechnology Research ◽

10.4018/ijanr.2019010101 ◽

2019 ◽

Vol 4 (1) ◽

pp. 1-7

Author(s):

Gerardo M. Casañola-Martin ◽

Hai Pham-The

Keyword(s):

Artificial Intelligence ◽

Machine Learning ◽

Cloud Computing ◽

Machine Learning Algorithms ◽

Learning Approaches ◽

Future Perspectives ◽

Computational Tools ◽

Machine Learning Applications ◽

Nanoparticles Characterization

The development of machine learning algorithms, together with the availability of computational tools, has increased the application of artificial intelligence methodologies in different fields. However, the use of these machine learning approaches in nanomedicine remains underexplored in certain areas, despite the development of hardware and software tools. In this review, recent advances in combining machine learning with nanomedicine are shown. Examples dealing with biomedical properties of nanoparticles, characterization of nanomaterials, text mining, and image analysis are also presented. Finally, some future perspectives on the integration of nanomedicine with cloud computing, deep learning and other techniques are discussed.

Prediction of K562 Cells Functional Inhibitors Based on Machine Learning Approaches

Current Pharmaceutical Design ◽

10.2174/1381612825666191107092214 ◽

2020 ◽

Vol 25 (40) ◽

pp. 4296-4302 ◽

Cited By ~ 2

Author(s):

Yuan Zhang ◽

Zhenyan Han ◽

Qian Gao ◽

Xiaoyi Bai ◽

Chi Zhang ◽

...

Keyword(s):

Machine Learning ◽

Inclusion Bodies ◽

Cross Validation ◽

Independent Set ◽

K562 Cells ◽

Machine Learning Algorithms ◽

Learning Approaches ◽

Validation Test ◽

Excess Number ◽

Fold Cross Validation

Background: β thalassemia is a common monogenic genetic disease that is very harmful to human health. The disease is due to the deletion of or defects in β-globin, which reduces synthesis of the β-globin chain, resulting in a relative excess of α-chains. The formation of inclusion bodies deposited on the cell membrane decreases the ability of red blood cells to deform and leads to a group of hereditary haemolytic diseases caused by massive destruction in the spleen. Methods: In this work, machine learning algorithms were employed to build a prediction model for inhibitors against K562 based on 117 inhibitors and 190 non-inhibitors. Results: The overall accuracy (ACC) of a 10-fold cross-validation test and an independent set test using Adaboost were 83.1% and 78.0%, respectively, surpassing Bayes Net, Random Forest, Random Tree, C4.5, SVM, KNN and Bagging. Conclusion: This study indicated that Adaboost could be applied to build a learning model in the prediction of inhibitors against K562 cells.
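
The 10-fold cross-validation and accuracy measure reported in this abstract can be illustrated with a minimal Python sketch. It shows only how samples might be partitioned and how accuracy is computed; it does not train Adaboost or reproduce the study. The helper names, the seed, and the round-robin fold assignment are assumptions, and only the sample counts (117 inhibitors, 190 non-inhibitors) come from the abstract.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and deal them into k near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Round-robin slicing: fold i takes every k-th index starting at i.
    return [indices[i::k] for i in range(k)]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# 117 inhibitors + 190 non-inhibitors = 307 compounds in total.
n_compounds = 117 + 190
folds = k_fold_indices(n_compounds, k=10)

# Each fold serves once as the held-out test set; the rest form the training set.
for test_fold in folds:
    held_out = set(test_fold)
    train_indices = [i for i in range(n_compounds) if i not in held_out]
    # A classifier would be fit on train_indices and scored on test_fold here.
```

The overall cross-validation accuracy would then be computed from the held-out predictions collected across all ten folds.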
