Random Forest with Random Projection to Impute Missing Gene Expression Data

Inference of Genetic Networks From Time-Series and Static Gene Expression Data: Combining a Random-Forest-Based Inference Method With Feature Selection Methods

Frontiers in Genetics ◽

10.3389/fgene.2020.595912 ◽

2020 ◽

Vol 11 ◽

Author(s):

Shuhei Kimura ◽

Ryo Fukutomi ◽

Masato Tokuhisa ◽

Mariko Okada

Keyword(s):

Gene Expression ◽

Feature Selection ◽

Random Forest ◽

Gene Expression Data ◽

Computational Cost ◽

Expression Data ◽

Selection Methods ◽

Inference Method ◽

Combined Application ◽

Inference Methods

Several researchers have focused on random-forest-based inference methods because of their excellent performance. Some of these inference methods also have a useful ability to analyze both time-series and static gene expression data. However, they are only of use in ranking all of the candidate regulations by assigning them confidence values. None have been capable of detecting the regulations that actually affect a gene of interest. In this study, we propose a method to remove unpromising candidate regulations by combining the random-forest-based inference method with a series of feature selection methods. In addition to detecting unpromising regulations, our proposed method uses outputs from the feature selection methods to adjust the confidence values of all of the candidate regulations that have been computed by the random-forest-based inference method. Numerical experiments showed that the combined application with the feature selection methods improved the performance of the random-forest-based inference method on 99 of the 100 trials performed on the artificial problems. However, the improvement tends to be small, since our combined method succeeded in removing only 19% of the candidate regulations at most. The combined application with the feature selection methods moreover makes the computational cost higher. While a bigger improvement at a lower computational cost would be ideal, we see no impediments to our investigation, given that our aim is to extract as much useful information as possible from a limited amount of gene expression data.


3145 An Evaluation of Machine Learning and Traditional Statistical Methods for Discovery in Large-Scale Translational Data

Journal of Clinical and Translational Science ◽

10.1017/cts.2019.8 ◽

2019 ◽

Vol 3 (s1) ◽

pp. 2-2

Author(s):

Megan C Hollister ◽

Jeffrey D. Blume

Keyword(s):

Gene Expression ◽

Machine Learning ◽

Random Forest ◽

Gene Expression Data ◽

Large Scale ◽

Second Generation ◽

A Priori ◽

Expression Data ◽

P Values ◽

Machine Learning Methods

OBJECTIVES/SPECIFIC AIMS: To examine and compare the claims in Bzdok, Altman, and Krzywinski under a broader set of conditions by using unbiased methods of comparison. To explore how to accurately use various machine learning and traditional statistical methods in large-scale translational research by estimating their accuracy statistics. Then we will identify the methods with the best performance characteristics. METHODS/STUDY POPULATION: We conducted a simulation study with a microarray of gene expression data. We maintained the original structure proposed by Bzdok, Altman, and Krzywinski. The structure for gene expression data includes a total of 40 genes from 20 people, in which 10 people are phenotype positive and 10 are phenotype negative. In order to find a statistical difference, 25% of the genes were set to be dysregulated across phenotype. This dysregulation forced the positive and negative phenotypes to have different mean population expressions. Additional variance was included to simulate genetic variation across the population. We also allowed for within-person correlation across genes, which was not done in the original simulations. The following methods were used to determine the number of dysregulated genes in the simulated data set: unadjusted p-values, Benjamini-Hochberg adjusted p-values, Bonferroni adjusted p-values, random forest importance levels, neural net prediction weights, and second-generation p-values. RESULTS/ANTICIPATED RESULTS: Results vary depending on whether a pre-specified significance level is used or the top 10 ranked values are taken. When all methods are given the same prior information of 10 dysregulated genes, the Benjamini-Hochberg adjusted p-values and the second-generation p-values generally outperform all other methods. We were not able to reproduce or validate the finding that random forest importance levels via a machine learning algorithm outperform classical methods. Almost uniformly, the machine learning methods did not yield improved accuracy statistics, and they depend heavily on the a priori chosen number of dysregulated genes. DISCUSSION/SIGNIFICANCE OF IMPACT: In this context, machine learning methods do not outperform standard methods. Because of this and their additional complexity, machine learning approaches would not be preferable. Of all the approaches, the second-generation p-value appears to offer significant benefit for the cost of a priori defining a region of trivially null effect sizes. The choice of an analysis method for large-scale translational data is critical to the success of any statistical investigation, and our simulations clearly highlight the various tradeoffs among the available methods.
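
The two multiplicity corrections named in this abstract can be illustrated with a minimal sketch. It is not the authors' simulation code: the function names `bonferroni` and `benjamini_hochberg` and the example p-values are assumptions chosen for illustration, and only the standard textbook definitions of the two adjustments are used.

```python
def bonferroni(pvalues):
    """Bonferroni adjustment: multiply each p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjustment (step-up procedure controlling the false discovery rate)."""
    m = len(pvalues)
    # Indices of the p-values, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, so each adjusted value is the
    # minimum of p * m / rank over its own rank and all higher ranks.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for four genes (illustrative only, not data from the study).
example_pvalues = [0.01, 0.04, 0.03, 0.005]
bh_adjusted = benjamini_hochberg(example_pvalues)
bonferroni_adjusted = bonferroni(example_pvalues)
```

On these four illustrative p-values, all four Benjamini-Hochberg adjusted values fall below 0.05, while only two of the Bonferroni adjusted values do, which shows why the two corrections can report different numbers of dysregulated genes at the same significance level.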


A Comparative Performance Evaluation of Random Forest Feature Selection on Classification of Hepatocellular Carcinoma Gene Expression Data

2019 3rd International Conference on Informatics and Computational Sciences (ICICoS) ◽

10.1109/icicos48119.2019.8982435 ◽

2019 ◽

Cited By ~ 1

Author(s):

Moh Abdul Latief ◽

Titin Siswantining ◽

Alhadi Bustamam ◽

Devvi Sarwinda

Keyword(s):

Gene Expression ◽

Hepatocellular Carcinoma ◽

Feature Selection ◽

Performance Evaluation ◽

Random Forest ◽

Gene Expression Data ◽

Expression Data ◽

Comparative Performance


Random-Forest (RF) and Support Vector Machine (SVM) Implementation for Analysis of Gene Expression Data in Chronic Kidney Disease (CKD)

IOP Conference Series Materials Science and Engineering ◽

10.1088/1757-899x/546/5/052066 ◽

2019 ◽

Vol 546 ◽

pp. 052066 ◽

Cited By ~ 1

Author(s):

Zuherman Rustam ◽

Ely Sudarsono ◽

Devvi Sarwinda

Keyword(s):

Gene Expression ◽

Chronic Kidney Disease ◽

Support Vector Machine ◽

Random Forest ◽

Kidney Disease ◽

Gene Expression Data ◽

Support Vector ◽

Expression Data


Assessment of Imputation Methods for Missing Gene Expression Data in Meta-Analysis of Distinct Cohorts of Tuberculosis Patients

Biocomputing 2020 ◽

10.1142/9789811215636_0028 ◽

2019 ◽

Author(s):

Carly A. Bobak ◽

Lauren McDonnell ◽

Matthew D. Nemesure ◽

Justin Lin ◽

Jane E. Hill

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Meta Analysis ◽

Expression Data ◽

Tuberculosis Patients ◽

Imputation Methods ◽

Missing Gene


Prediction and prioritization of autism-associated long non-coding RNAs using gene expression and sequence features

BMC Bioinformatics ◽

10.1186/s12859-020-03843-5 ◽

2020 ◽

Vol 21 (1) ◽

Author(s):

Jun Wang ◽

Liangjiang Wang

Keyword(s):

Gene Expression ◽

Random Forest ◽

Gene Expression Data ◽

Autism Spectrum ◽

Expression Data ◽

Risk Genes ◽

Sequence Features ◽

Brain Gene Expression ◽

Transcript Sequence ◽

Non Coding RNAs

Abstract Background: Autism spectrum disorders (ASD) refer to a range of neurodevelopmental conditions, which are genetically complex and heterogeneous with most of the genetic risk factors also found in the unaffected general population. Although all the currently known ASD risk genes code for proteins, long non-coding RNAs (lncRNAs) as essential regulators of gene expression have been implicated in ASD. Some lncRNAs show altered expression levels in autistic brains, but their roles in ASD pathogenesis are still unclear. Results: In this study, we have developed a new machine learning approach to predict candidate lncRNAs associated with ASD. Particularly, the knowledge learnt from protein-coding ASD risk genes was transferred to the prediction and prioritization of ASD-associated lncRNAs. Both developmental brain gene expression data and transcript sequence were found to contain relevant information for ASD risk gene prediction. During the pre-training phase of model construction, an autoencoder network was implemented for a representation learning of the gene expression data, and a random-forest-based feature selection was applied to the transcript-sequence-derived k-mers. Our models, including logistic regression, support vector machine and random forest, showed robust performance based on tenfold cross-validations as well as candidate prioritization with hypothetical loci. We then utilized the models to predict and prioritize a list of candidate lncRNAs, including some reported to be cis-regulators of known ASD risk genes, for further investigation. Conclusions: Our results suggest that ASD risk genes can be accurately predicted using developmental brain gene expression data and transcript sequence features, and the models may provide useful information for functional characterization of the candidate lncRNAs associated with ASD.


A semi-supervised rough set and random forest approach for pattern classification of gene expression data

International Journal of Reasoning-based Intelligent Systems ◽

10.1504/ijris.2016.082976 ◽

2016 ◽

Vol 8 (3/4) ◽

pp. 155 ◽

Cited By ~ 1

Author(s):

Pradeep Kumar Mallick ◽

Debahuti Mishra ◽

Srikanta Patnaik ◽

Kailash Shaw

Keyword(s):

Gene Expression ◽

Random Forest ◽

Pattern Classification ◽

Gene Expression Data ◽

Rough Set ◽

Expression Data


Classification and Biomarker Genes Selection for Cancer Gene Expression Data Using Random Forest

Iranian Journal of Pathology ◽

10.30699/ijp.2017.27990 ◽

2017 ◽

Vol 12 (4) ◽

pp. 339-347 ◽

Cited By ~ 8

Author(s):

Malihe Ram ◽

Ali Najafi ◽

Mohammad Taghi Shakeri

Keyword(s):

Gene Expression ◽

Random Forest ◽

Gene Expression Data ◽

Cancer Gene ◽

Expression Data ◽

Selection For


Latent-space embedding of expression data identifies gene signatures from sputum samples of asthmatic patients

10.1101/646976 ◽

2019 ◽

Author(s):

Shaoke Lou ◽

Tianxiao Li ◽

Daniel Spakowicz ◽

Geoffrey Lowell Chupp ◽

Mark Gerstein

Keyword(s):

Gene Expression ◽

Random Forest ◽

Gene Expression Data ◽

Asthma Severity ◽

Support Vector ◽

Expression Data ◽

Gene Signatures ◽

Asthmatic Patients ◽

Latent Space ◽

Key Genes

Abstract: The pathogenesis of asthma is a complex process involving multiple genes and pathways. Identifying biomarkers from asthma datasets, especially those that include heterogeneous subpopulations, is challenging. In this work, we developed a framework that incorporates a denoising autoencoder and a supervised learning approach to identify gene signatures related to asthma severity. The autoencoder embeds high-dimensional gene expression data into a lower-dimensional latent space in an unsupervised fashion, enabling us to extract distinguishing features from gene expression data. We found that the weights on hidden units in this latent space correlate well with previously defined and clinically relevant clusters of patients. Moreover, pathway analysis based on each gene's contribution to the hidden units showed significant enrichment in known asthma-related pathways. We then used genes that contribute most to the hidden units to develop a secondary supervised classifier (based on random forest) for directly predicting asthma severity. The random-forest importance metric from this classifier identified a signature based on 50 key genes, which can predict severity with an AUROC of 0.81 and thus have potential as diagnostic biomarkers. Furthermore, the key genes could also be used for successfully estimating, via support-vector-machine regression, the FEV1/FVC ratios across patients, achieving pre- and post-treatment correlations of 0.56 and 0.65, respectively (between predicted and observed values). The 50 biomarker candidate genes can be found in the supplementary material. The source code is freely available upon request.
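
For readers unfamiliar with the AUROC figure quoted in this abstract, the metric can be sketched as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The function name `auroc` and the toy labels and scores below are assumptions for illustration only; they are not taken from the study.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positive and negative scores.

    labels: iterable of 0/1 class labels (1 = positive, e.g. severe)
    scores: iterable of classifier scores, higher meaning "more likely positive"
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and scores for four patients (illustrative only).
toy_auc = auroc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, so an AUROC of 0.81 means the classifier ranks a positive case above a negative one in roughly four out of five such pairs.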


Latent-space embedding of expression data identifies gene signatures from sputum samples of asthmatic patients

10.21203/rs.2.21701/v1 ◽

2020 ◽

Author(s):

Shaoke Lou ◽

Tianxiao Li ◽

Daniel Spakowicz ◽

Xiting Yan ◽

Geoffrey Lowell Chupp ◽

...

Keyword(s):

Gene Expression ◽

Random Forest ◽

Gene Expression Data ◽

Asthma Severity ◽

Support Vector ◽

Expression Data ◽

Denoising Autoencoder ◽

Gene Signatures ◽

Latent Space ◽

Key Genes

Abstract Background: The pathogenesis of asthma is a complex process involving multiple genes and pathways. Identifying biomarkers from asthma datasets, especially those that include heterogeneous subpopulations, is challenging. In this work, we developed a framework that incorporates a denoising autoencoder and a supervised learning approach to identify gene signatures related to asthma severity. The autoencoder embeds high-dimensional gene expression data into a lower-dimensional latent space in an unsupervised fashion, enabling us to extract distinguishing features from gene expression data. Results: Using the trained autoencoder model, we found that the weights on hidden units in the latent space correlate well with previously defined and clinically relevant clusters of patients. Moreover, pathway analysis based on each gene's contribution to the hidden units showed significant enrichment in known asthma-related pathways. We then used genes that contribute most to the hidden units to develop a secondary supervised classifier (based on random forest) for directly predicting asthma severity. The random-forest importance metric from this classifier identified a signature based on 50 key genes, which can predict severity with an AUROC of 0.81 and thus have potential as diagnostic biomarkers. Furthermore, the key genes could also be used for successfully estimating, via support-vector-machine regression, the FEV1/FVC ratios across patients, achieving pre- and post-treatment correlations of 0.56 and 0.65, respectively (between predicted and observed values). Conclusions: The denoising autoencoder framework could extract meaningful functional genes and patient groups from the gene expression profile of asthma patients. These patterns may provide potential sources for biomarkers for asthma severity.
