Random Forest of Perfect Trees: Concept, Performance, Applications, and Perspectives

Bioinformatics ◽

10.1093/bioinformatics/btab074 ◽

2021 ◽

Author(s):

Jean-Michel Nguyen ◽

Pascal Jézéquel ◽

Pierre Gillois ◽

Luisa Silva ◽

Faouda Ben Azzouz ◽

...

Keyword(s):

Random Forest ◽

Information Criterion ◽

R Package ◽

Information Criteria ◽

Three Dimensions ◽

Supplementary Information ◽

Recursive Feature Elimination ◽

Support Vector ◽

Classification Errors ◽

New Type

Abstract Motivation: The principle of Breiman's random forest (RF) is to build and assemble complementary classification trees in a way that maximizes their variability. We propose a new type of random forest that disobeys Breiman’s principles and involves building trees with no classification errors in very large quantities. We used a new type of decision tree that uses a neuron at each node as well as an innovative half Christmas tree structure. With these new RFs, we developed a score, based on a family of ten new statistical information criteria, called Nguyen information criteria (NICs), to evaluate the predictive qualities of features in three dimensions. Results: The first NIC allowed the Akaike information criterion to be minimized more quickly than data obtained with the Gini index when the features were introduced in a logistic regression model. The selected features based on the NICScore showed a slight advantage compared to the support vector machines-recursive feature elimination (SVM-RFE) method. We demonstrate that the inclusion of artificial neurons in tree nodes allows a large number of classifiers in the same node to be taken into account simultaneously and results in perfect trees without classification errors. Availability and implementation: The methods used to build the perfect trees in this article were implemented in the “ROP” R package, archived at https://cran.r-project.org/web/packages/ROP/index.html. Supplementary information: Supplementary data are available at Bioinformatics online.

cytometree: a binary tree algorithm for automatic gating in cytometry analysis

10.1101/335554 ◽

2018 ◽

Cited By ~ 1

Author(s):

Daniel Commenges ◽

Chariff Alkhassim ◽

Raphael Gottardo ◽

Boris Hejblum ◽

Rodolphe Thiébaut

Keyword(s):

Flow Cytometry ◽

Binary Tree ◽

Computation Time ◽

R Package ◽

Information Criteria ◽

Supplementary Information ◽

Flow Cytometry Data ◽

Human Immunology ◽

Supplementary Material ◽

Unsupervised Algorithms

Abstract Motivation: Flow cytometry is a powerful technology that allows the high-throughput quantification of dozens of surface and intracellular proteins at the single-cell level. It has become the most widely used technology for immunophenotyping of cells over the past three decades. With the increasing complexity of cytometry experiments (more cells and more markers), traditional manual flow cytometry data analysis has become untenable due to its subjectivity and time-consuming nature. Results: We present a new unsupervised algorithm called “cytometree” to perform automated population discovery (aka gating) in flow cytometry. cytometree is based on the construction of a binary tree, the nodes of which are subpopulations of cells. At each node, the marker distributions are modeled by mixtures of normal distributions. Node splitting is done according to a normalized difference of Akaike information criteria (AIC) between the two models. Post-processing of the tree structure and derived populations allows us to complete the annotation of the derived populations. The algorithm is shown to perform better than the state-of-the-art unsupervised algorithms previously proposed on panels introduced by the Flow Cytometry: Critical Assessment of Population Identification Methods (FlowCAP I) project. The algorithm is also applied to a T-cell panel proposed by the Human Immunology Project Consortium (HIPC) program; it also outperforms the best available open-source unsupervised algorithm while requiring the shortest computation time. Availability: An R package named “cytometree” is available on CRAN. Supplementary information: Supplementary data are available.
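The AIC-based splitting rule in this abstract can be illustrated with a short sketch. It is not the cytometree implementation: the function names, the choice of comparing a single normal (2 parameters) against a two-component normal mixture (5 parameters), and the scaling by twice the number of cells are assumptions made here, since the abstract does not state the exact normalization.

```python
def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike information criterion: AIC = 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood


def normalized_aic_difference(loglik_one: float, loglik_two: float, n_cells: int) -> float:
    """Compare a one-component and a two-component normal model for one marker.

    loglik_one: maximized log-likelihood of a single normal (mean, variance).
    loglik_two: maximized log-likelihood of a two-component mixture
                (two means, two variances, one mixing weight).
    Assumption: the AIC difference is divided by 2 * n_cells so that it is
    comparable across nodes of different sizes.
    """
    aic_one = aic(loglik_one, n_params=2)
    aic_two = aic(loglik_two, n_params=5)
    return (aic_one - aic_two) / (2 * n_cells)


# Hypothetical log-likelihoods for a node of 100 cells.
score = normalized_aic_difference(loglik_one=-150.0, loglik_two=-120.0, n_cells=100)
```

Under this convention a positive score means the two-component model has the lower AIC, so the marker is a candidate for splitting the node; any cut-off applied to the score would be a tuning choice.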

rSeqTU – a machine-learning based R package for prediction of bacterial transcription units

10.1101/553057 ◽

2019 ◽

Author(s):

Sheng-Yong Niu ◽

Binqiang Liu ◽

Qin Ma ◽

Wen-Chi Chou

Keyword(s):

Machine Learning ◽

Random Forest ◽

Regulatory Networks ◽

Prediction Models ◽

R Package ◽

Transcription Unit ◽

Support Vector ◽

Rna Seq ◽

Accurate Identification ◽

Prediction Approach

Abstract A transcription unit (TU) is composed of one or multiple adjacent genes on the same strand that are co-transcribed, mostly in prokaryotes. Accurate identification of TUs is a crucial first step to delineate the transcriptional regulatory networks and elucidate the dynamic regulatory mechanisms encoded in various prokaryotic genomes. Many genomic features, e.g., gene intergenic distance, and transcriptomic features, including continuous and stable RNA-seq read count signals, have been collected from a large amount of experimental data and integrated into classification techniques to computationally predict genome-wide TUs. Although some tools and web servers are able to predict TUs based on bacterial RNA-seq data and genome sequences, there is a need for an improved machine-learning prediction approach and a more comprehensive pipeline handling QC, TU prediction, and TU visualization. To enable users to efficiently perform TU identification on their local computers or high-performance clusters and to provide a more accurate prediction, we developed an R package named rSeqTU. rSeqTU uses a random forest algorithm to select essential features describing TUs and then uses a support vector machine (SVM) to build TU prediction models. rSeqTU (available at https://s18692001.github.io/rSeqTU/) has six computational functionalities: read quality control, read mapping, training set generation, random-forest-based feature selection, TU prediction, and TU visualization.

A method for handling metabonomics data from liquid chromatography/mass spectrometry: combinational use of support vector machine recursive feature elimination, genetic algorithm and random forest for feature selection

Metabolomics ◽

10.1007/s11306-011-0274-7 ◽

2011 ◽

Vol 7 (4) ◽

pp. 549-558 ◽

Cited By ~ 40

Author(s):

Xiaohui Lin ◽

Quancai Wang ◽

Peiyuan Yin ◽

Liang Tang ◽

Yexiong Tan ◽

...

Keyword(s):

Mass Spectrometry ◽

Genetic Algorithm ◽

Support Vector Machine ◽

Feature Selection ◽

Liquid Chromatography ◽

Random Forest ◽

Recursive Feature Elimination ◽

Support Vector ◽

Liquid Chromatography Mass Spectrometry ◽

Chromatography Mass Spectrometry

Reliable Identification of Oolong Tea Species: Nondestructive Testing Classification Based on Fluorescence Hyperspectral Technology and Machine Learning

Agriculture ◽

10.3390/agriculture11111106 ◽

2021 ◽

Vol 11 (11) ◽

pp. 1106

Author(s):

Yan Hu ◽

Lijia Xu ◽

Peng Huang ◽

Xiong Luo ◽

Peng Wang ◽

...

Keyword(s):

Machine Learning ◽

Random Forest ◽

Principal Component ◽

Classification Model ◽

Recursive Feature Elimination ◽

Support Vector ◽

K Nearest Neighbor ◽

Oolong Tea ◽

The Impact ◽

T Distribution

A rapid and nondestructive tea classification method is of great significance in today’s research. This study uses fluorescence hyperspectral technology and machine learning to distinguish Oolong tea by analyzing the spectral features of tea in the wavelength range from 475 to 1100 nm. The spectral data are preprocessed by multiplicative scatter correction (MSC) and standard normal variate (SNV), which can effectively reduce the impact of baseline drift and tilt. Then principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE) are adopted for feature dimensionality reduction and visual display. Random Forest-Recursive Feature Elimination (RF-RFE) is used for feature selection. Decision Tree (DT), Random Forest Classification (RFC), K-Nearest Neighbor (KNN) and Support Vector Machine (SVM) are used to establish the classification model. The results show that MSC-RF-RFE-SVM is the best model for the classification of Oolong tea, with an accuracy of 100% on the training set and 98.73% on the test set. It can be concluded that fluorescence hyperspectral technology and machine learning are feasible for classifying Oolong tea.
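As an illustration of the SNV preprocessing step named in this abstract, the sketch below centers a single spectrum and scales it by its own standard deviation. It is a generic version of the transform, not the authors' code; the function name `snv` and the use of the sample standard deviation are choices made here.

```python
from statistics import mean, stdev


def snv(spectrum):
    """Center one spectrum and scale it by its own sample standard deviation.

    Each spectrum is transformed independently, which removes additive offsets
    and multiplicative scaling differences between samples.
    """
    m = mean(spectrum)
    s = stdev(spectrum)
    if s == 0:
        raise ValueError("SNV is undefined for a constant spectrum")
    return [(x - m) / s for x in spectrum]


# Hypothetical three-band spectrum, for illustration only.
corrected = snv([1.0, 2.0, 3.0])
```

After the transform every spectrum has mean 0 and unit standard deviation, so spectra that differ only by a baseline shift or an overall intensity factor become identical.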

Diagnosis of Chronic Kidney Disease Using Effective Classification Algorithms and Recursive Feature Elimination Techniques

Journal of Healthcare Engineering ◽

10.1155/2021/1004767 ◽

2021 ◽

Vol 2021 ◽

pp. 1-10

Author(s):

Ebrahime Mohammed Senan ◽

Mosleh Hmoud Al-Adhaileh ◽

Fawaz Waselallah Alsaade ◽

Theyazn H. H. Aldhyani ◽

Ahmed Abdullah Alqarni ◽

...

Keyword(s):

Chronic Kidney Disease ◽

Random Forest ◽

Kidney Disease ◽

Early Diagnosis ◽

Kidney Diseases ◽

Adult Population ◽

Machine Learning Techniques ◽

Recursive Feature Elimination ◽

Support Vector ◽

Classification Algorithms

Chronic kidney disease (CKD) is among the top 20 causes of death worldwide and affects approximately 10% of the world's adult population. CKD is a disorder that disrupts normal kidney function. Due to the increasing number of people with CKD, effective prediction measures for the early diagnosis of CKD are required. The novelty of this study lies in developing a diagnosis system to detect chronic kidney disease. This study assists experts in exploring preventive measures for CKD through early diagnosis using machine learning techniques. This study focused on evaluating a dataset collected from 400 patients containing 24 features. Missing numerical values were replaced with the mean and missing nominal values with the mode. To choose the most important features, Recursive Feature Elimination (RFE) was applied. Four classification algorithms were applied in this study: support vector machine (SVM), k-nearest neighbors (KNN), decision tree, and random forest. All the classification algorithms achieved promising performance. The random forest algorithm outperformed all other applied algorithms, reaching 100% accuracy, precision, recall, and F1-score. CKD is a serious life-threatening disease, with high rates of morbidity and mortality. Therefore, artificial intelligence techniques are of great importance in the early detection of CKD. These techniques support experts and doctors in early diagnosis to avoid the development of kidney failure.
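The mean and mode imputation mentioned in this abstract can be sketched as follows. This is a minimal illustration rather than the study's code: the function names are hypothetical, and missing entries are assumed to be encoded as `None`.

```python
from statistics import mean, mode


def impute_mean(values):
    """Replace missing numerical entries (None) with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def impute_mode(values):
    """Replace missing nominal entries (None) with the most frequent observed value."""
    observed = [v for v in values if v is not None]
    fill = mode(observed)
    return [fill if v is None else v for v in values]


# Hypothetical columns, for illustration only.
numeric_column = impute_mean([1.0, None, 3.0])
nominal_column = impute_mode(["yes", None, "yes", "no"])
```

In a model-evaluation setting the fill values would normally be computed on the training portion only and then reused for the test portion, so that no information leaks from the held-out data.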

Comparison of Machine Learning Algorithms in Assessing a Retail Store Location (Perbandingan Algoritma Machine Learning dalam Menilai Sebuah Lokasi Toko Ritel)

Jurnal Teknik Informatika dan Sistem Informasi ◽

10.28932/jutisi.v7i1.3182 ◽

2021 ◽

Vol 7 (1) ◽

Author(s):

Kristiawan Kristiawan ◽

Andreas Widjaja

Keyword(s):

Neural Network ◽

Machine Learning ◽

Logistic Regression ◽

Random Forest ◽

Pearson Correlation ◽

Recursive Feature Elimination ◽

Support Vector ◽

Learning Technology ◽

K Nearest Neighbor ◽

Store Location

Abstract: The application of machine learning technology in various industrial fields is currently developing rapidly, including in the retail industry. This study aims to find the most accurate algorithmic model so that it can be used to help retailers choose a store location more precisely. Several methods, namely Pearson Correlation, Chi-Square Features, Recursive Feature Elimination and Tree-based selection, are used to select features (predictive variables). These features are then used to train and build models using six classification algorithms (Logistic Regression, K Nearest Neighbor (KNN), Decision Tree, Random Forest, Support Vector Machine (SVM) and Neural Network) to classify whether or not a location is recommended as a new store location. Keywords: Application of Machine Learning, Pearson Correlation, Random Forest, Neural Network, Logistic Regression.

Apply Machine Learning Methods to Predict Failure of Glaucoma Drainage

International Journal of Data Mining & Knowledge Management Process ◽

10.5121/ijdkp.2021.11101 ◽

2021 ◽

Vol 11 (1) ◽

pp. 1-12

Author(s):

Paul Morrison ◽

Maxwell Dixon ◽

Arsham Sheybani ◽

Bahareh Rahmani

Keyword(s):

Machine Learning ◽

Logistic Regression ◽

Intraocular Pressure ◽

Random Forest ◽

Glaucoma Drainage Device ◽

Recursive Feature Elimination ◽

Support Vector ◽

Demographic Information ◽

Drainage Device ◽

Machine Learning Methods

The purpose of this retrospective study is to measure machine learning models' ability to predict glaucoma drainage device failure based on demographic information and preoperative measurements. The medical records of 165 patients were used. Potential predictors included the patients' race, age, sex, preoperative intraocular pressure (IOP), preoperative visual acuity, number of IOP-lowering medications, and number and type of previous ophthalmic surgeries. Failure was defined as final IOP greater than 18 mm Hg, reduction in intraocular pressure less than 20% from baseline, or need for reoperation unrelated to normal implant maintenance. Five classifiers were compared: logistic regression, artificial neural network, random forest, decision tree, and support vector machine. Recursive feature elimination was used to shrink the number of predictors and grid search was used to choose hyperparameters. To prevent leakage, nested cross-validation was used throughout. With a small amount of data, the best classifier was logistic regression, but with more data, the best classifier was the random forest.
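The failure definition given in this abstract translates directly into a labeling rule. The sketch below is only an illustration of that rule: the function name, argument names, and the boolean reoperation flag are hypothetical, and pressures are assumed to be in mm Hg.

```python
def is_failure(baseline_iop: float, final_iop: float, needs_reoperation: bool) -> bool:
    """Label an outcome as failure using the three criteria stated in the abstract.

    needs_reoperation is assumed to be True only for reoperations unrelated to
    normal implant maintenance.
    """
    # Relative IOP reduction from baseline, e.g. 0.25 means a 25% drop.
    reduction = (baseline_iop - final_iop) / baseline_iop
    return final_iop > 18 or reduction < 0.20 or bool(needs_reoperation)


# Hypothetical patients, for illustration only.
labels = [
    is_failure(30.0, 16.0, False),  # controlled pressure, large reduction
    is_failure(30.0, 20.0, False),  # final IOP above 18 mm Hg
    is_failure(20.0, 17.0, False),  # reduction of only 15%
    is_failure(30.0, 16.0, True),   # reoperation needed
]
```

Because the three criteria are joined with "or", a single unmet condition is enough to label the case as a failure, which is why the last three hypothetical patients are all positives.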

High-Resolution Mangrove Forests Classification with Machine Learning Using Worldview and UAV Hyperspectral Data

Remote Sensing ◽

10.3390/rs13081529 ◽

2021 ◽

Vol 13 (8) ◽

pp. 1529

Author(s):

Yufeng Jiang ◽

Li Zhang ◽

Min Yan ◽

Jianguo Qi ◽

Tianmeng Fu ◽

...

Keyword(s):

Random Forest ◽

Hyperspectral Data ◽

Recursive Feature Elimination ◽

Support Vector ◽

Biomass Estimation ◽

Accurate Information ◽

Mangrove Forests ◽

Mangrove Species ◽

Combined Data

Mangrove forests, as important ecological and economic resources, have lost area due to natural and human activities. Monitoring the distribution of mangrove species and obtaining accurate information on them is necessary for ameliorating the damage and protecting and restoring mangrove forests. In this study, we compared the performance of UAV Rikola hyperspectral images, WorldView-2 (WV-2) satellite-based multispectral images, and a fusion of data from both in the classification of mangrove species. We first used recursive feature elimination-random forest (RFE-RF) to select the vegetation’s spectral and texture feature variables, and then implemented random forest (RF) and support vector machine (SVM) algorithms as classifiers. The results showed that the accuracy of the combined data was higher than that of UAV and WV-2 data; the vegetation index features of UAV hyperspectral data and texture index of WV-2 data played dominant roles; the overall accuracy of the RF algorithm was 95.89% with a Kappa coefficient of 0.95, which is more accurate and efficient than SVM. The use of combined data and RF methods for the classification of mangrove species could be useful in biomass estimation and breeding cultivation.

Predicting Inhibitors for Multidrug Resistance Associated Protein-2 Transporter by Machine Learning Approach

Combinatorial Chemistry & High Throughput Screening ◽

10.2174/1386207321666181024104822 ◽

2018 ◽

Vol 21 (8) ◽

pp. 557-566 ◽

Cited By ~ 3

Author(s):

Sahil Kharangarh ◽

Hardeep Sandhu ◽

Sujit Tangadpalliwar ◽

Prabha Garg

Keyword(s):

Machine Learning ◽

Support Vector Machine ◽

Multidrug Resistance ◽

Random Forest ◽

Predictive Models ◽

Computational Study ◽

Recursive Feature Elimination ◽

Efflux Transporters ◽

Support Vector ◽

Efflux Transporter

Background: The efflux transporter multidrug resistance associated protein-2 belongs to the ATP-binding cassette superfamily, which plays an important role in multidrug resistance and drug-drug interactions. Efflux transporters are considered to be important targets for increasing the efficacy of drugs, and the importance of computational study of efflux transporters for predicting substrates, non-substrates, inhibitors and non-inhibitors is well documented. Previous work on predictive models for inhibitors of the multidrug resistance associated protein-2 efflux transporter showed that machine learning methods produced good results. Objective: The aim of the present work was to develop a machine learning predictive model to classify inhibitors and non-inhibitors of the multidrug resistance associated protein-2 transporter using a well refined dataset. Method: In this study, several machine learning algorithms, namely support vector machine, random forest and k-nearest neighbor, were used to develop the predictive models. Methods such as variance threshold, SelectKBest, random forest, and recursive feature elimination were used to select the features generated by PyDPI. A total of 239 molecules consisting of 124 inhibitors and 115 non-inhibitors were used for model development. Results: The best multidrug resistance associated protein-2 inhibitor model showed prediction accuracies of 0.76, 0.72 and 0.79 for training, 5-fold cross-validation and external sets, respectively. Conclusion: The support vector machine model built on features selected using the recursive feature elimination method showed the best performance. The developed model can be used in the early stages of drug discovery for identifying inhibitors of the multidrug resistance associated protein-2 efflux transporter.
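Of the feature-selection methods listed in this abstract, the variance threshold is the simplest to show. The sketch below is a generic, dependency-free version, not the study's pipeline: the function name, the default threshold of 0, and the use of population variance are assumptions made here.

```python
from statistics import pvariance


def variance_threshold(rows, threshold=0.0):
    """Keep only the feature columns whose population variance exceeds threshold.

    rows: list of samples, each a list of numeric feature values.
    Returns the kept column indices and the reduced data.
    """
    columns = list(zip(*rows))
    keep = [j for j, col in enumerate(columns) if pvariance(col) > threshold]
    reduced = [[row[j] for j in keep] for row in rows]
    return keep, reduced


# Hypothetical descriptor matrix: the first column is constant.
data = [
    [1.0, 0.0, 3.0],
    [1.0, 1.0, 5.0],
    [1.0, 0.0, 7.0],
]
kept, reduced = variance_threshold(data)
```

A filter like this only removes descriptors that carry no (or almost no) variation; it does not look at the class labels, which is why it is typically followed by a supervised method such as recursive feature elimination.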

Identifying novel transcript biomarkers for hepatocellular carcinoma (HCC) using RNA-Seq datasets and machine learning

BMC Cancer ◽

10.1186/s12885-021-08704-9 ◽

2021 ◽

Vol 21 (1) ◽

Author(s):

Rajinder Gupta ◽

Jos Kleinjans ◽

Florian Caiment

Keyword(s):

Machine Learning ◽

Support Vector Machine ◽

Random Forest ◽

Machine Learning Algorithms ◽

Recursive Feature Elimination ◽

Support Vector ◽

Rna Seq ◽

Cell Models ◽

Novel Transcript ◽

Transcript Biomarkers

Abstract Background: Hepatocellular carcinoma (HCC) is one of the leading causes of cancer death in the world owing to limitations in its prognosis. The current prognosis approaches include radiological examination and detection of serum biomarkers; however, both have limited efficiency and are ineffective in early prognosis. Due to such limitations, we propose to use RNA-Seq data for evaluating putative higher accuracy biomarkers at the transcript level that could help in early prognosis. Methods: To identify such potential transcript biomarkers, RNA-Seq data for healthy liver and various HCC cell models were subjected to five different machine learning algorithms: random forest, K-nearest neighbor, Naïve Bayes, support vector machine, and neural networks. Various metrics, namely sensitivity, specificity, MCC, informedness, and AUC-ROC (except for support vector machine), were evaluated. The algorithms that produced the highest values for all metrics were chosen to extract the top features, which were then subjected to recursive feature elimination. Through recursive feature elimination, the smallest number of features needed to differentiate between the healthy and HCC cell models was obtained. Results: From the metrics used, it is demonstrated that the efficiency of the known protein biomarkers for HCC is comparatively lower than that of complete transcriptomics data. Among the different machine learning algorithms, random forest and support vector machine demonstrated the best performance. Using recursive feature elimination on the top features of random forest and support vector machine, three transcripts were selected that had an accuracy of 0.97 and kappa of 0.93. Of the three transcripts, two were protein coding (PARP2–202 and SPON2–203) and one was a non-coding transcript (CYREN-211). Lastly, we demonstrated that these three selected transcripts outperformed three randomly chosen transcripts (15,000 combinations), hence were not chance findings, and could be interesting candidates for new HCC biomarker development. Conclusion: Using RNA-Seq data combined with machine learning approaches can aid in finding novel transcript biomarkers. The three biomarkers identified, PARP2–202, SPON2–203, and CYREN-211, presented the highest accuracy among all transcripts in differentiating the healthy and HCC cell models. The machine learning pipeline developed in this study can be used for any RNA-Seq dataset to find novel transcript biomarkers. Code: www.github.com/rajinder4489/ML_biomarkers
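The accuracy and kappa figures reported in this abstract are standard agreement statistics. As a reminder of how they are computed, here is a small sketch; it is not taken from the linked repository, and the function names and example labels are hypothetical.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: agreement corrected for agreement expected by chance.

    Undefined (division by zero) when chance agreement equals 1, i.e. when
    both label vectors contain a single identical class.
    """
    n = len(y_true)
    p_observed = accuracy(y_true, y_pred)
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    labels = set(true_counts) | set(pred_counts)
    # Chance agreement from the marginal label frequencies.
    p_expected = sum(true_counts[c] * pred_counts[c] for c in labels) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical labels: 1 = HCC cell model, 0 = healthy liver.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0]
acc = accuracy(y_true, y_pred)
kappa = cohen_kappa(y_true, y_pred)
```

Kappa is always at most the raw accuracy when chance agreement is positive, which is consistent with the abstract reporting a kappa (0.93) slightly below the accuracy (0.97).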
