Trees Weighting Random Forest Method for Classifying High-Dimensional Noisy Data

Random Forest is a supervised classification method based on Breiman's bagging (bootstrap aggregating) and random selection of features. Because features are assigned to the Random Forest at random, a selected feature is not necessarily informative, so feature selection is needed. The purpose of feature selection is to choose an optimal subset of informative features in the hope of speeding up the Random Forest method, mainly on high-dimensional datasets such as the Parkinson, CNAE-9, and Urban Land Cover datasets. Feature selection is done using the Correlation-Based Feature Selection method with the BestFirst search method. Tests were carried out 30 times using 10-fold cross-validation and a split of the dataset into 70% training and 30% testing. The experiments on the Parkinson dataset ran 0.27 and 0.28 seconds faster than the Random Forest method without feature selection. Likewise, the trials on the Urban Land Cover dataset were 0.04 and 0.03 seconds faster, while for the CNAE-9 dataset the runs were 2.23 and 2.81 seconds faster than the Random Forest method without feature selection. These experiments showed that Random Forest runs faster when feature selection is performed first. Accuracy also increased for the Parkinson and Urban Land Cover datasets; only the CNAE-9 experiment had lower accuracy. The benefit of this research is showing that performing feature selection first with the Correlation-Based Feature Selection method can increase the speed and accuracy of the Random Forest method on high-dimensional data.
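The subset scoring behind Correlation-Based Feature Selection can be sketched in a few lines. The sketch below is an illustration under stated assumptions, not the authors' implementation: it uses the standard CFS merit, k * r_cf / sqrt(k + k(k-1) * r_ff), with absolute Pearson correlation as the correlation measure, and it swaps the BestFirst search named in the abstract for a simpler greedy forward search. The function names and the toy data are hypothetical.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences (0.0 if either is constant)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def cfs_merit(columns, target):
    """CFS merit of a feature subset: rewards correlation with the class,
    penalises correlation among the features themselves."""
    k = len(columns)
    if k == 0:
        return 0.0
    r_cf = mean(abs(pearson(c, target)) for c in columns)
    if k == 1:
        return r_cf
    r_ff = mean(
        abs(pearson(columns[i], columns[j]))
        for i in range(k)
        for j in range(i + 1, k)
    )
    return k * r_cf / sqrt(k + k * (k - 1) * r_ff)


def greedy_forward_cfs(features, target):
    """Greedy forward search (an assumption; the abstract uses BestFirst).
    `features` maps a feature name to its column of values."""
    selected, best = [], 0.0
    remaining = list(features)
    while remaining:
        scored = [
            (cfs_merit([features[n] for n in selected + [name]], target), name)
            for name in remaining
        ]
        merit, name = max(scored)
        if merit <= best + 1e-12:  # stop when no candidate improves the merit
            break
        selected.append(name)
        remaining.remove(name)
        best = merit
    return selected, best


if __name__ == "__main__":
    # Toy data: two identical informative columns and one weakly related column.
    y = [0, 0, 1, 1, 0, 1]
    toy = {"copy_a": list(y), "copy_b": list(y), "noise": [1, 0, 1, 0, 0, 1]}
    print(greedy_forward_cfs(toy, y))
```

On the toy data the search keeps only one of the two identical columns, because the redundant copy does not raise the merit. The reduced feature set would then be passed to the Random Forest in place of the full set.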


The Random Forest Method in Research of Impact of Macroeconomic Indicators of Regional Development on Informal Employment Rate

Voprosy statistiki ◽ 10.34023/2313-6383-2020-27-6-37-55 ◽ 2020 ◽ Vol 27 (6) ◽ pp. 37-55

Author(s): E. V. Zarova ◽ E. I. Dubravskaya

Keyword(s): Random Forest ◽ Regional Development ◽ Russian Federation ◽ Regional Level ◽ Informal Employment ◽ Macroeconomic Factors ◽ Random Forest Method ◽ Macroeconomic Indicators ◽ The Russian Federation ◽ The Impact

Quantitative research on informal employment remains highly relevant both in the Russian Federation and in other countries, because informal employment depends strongly on cyclicality and crisis stages in the economic dynamics of countries at any level of economic development. Developing effective government policy measures to overcome the negative impact of informal employment requires theoretical and applied research to pay special attention to assessing the factors and conditions of informal employment in the Russian Federation, including at the regional level. Effects of informal employment such as a shortfall in taxes, potential losses in production efficiency, and negative social consequences are a concern for federal and regional authorities. Overcoming these negative effects requires quantitative indicators that determine the level of informal employment in the regions while taking into account their specifics within the general spatial and economic system of Russia. The article proposes and tests methods for assessing the impact of hierarchical relationships among macroeconomic factors on the regional level of informal employment in the constituent entities of the Russian Federation. The majority of studies of informal employment rely on basic statistical methods of spatial-dynamic analysis, as well as on the now "traditional" methods of cluster and correlation-regression analysis. Without diminishing the merits of these methods, it should be noted that they are somewhat limited in identifying hidden structural connections and interdependencies in such a complex multidimensional phenomenon as informal employment.
To substantiate the possibility of overcoming these limitations, the article proposes indicators of regional statistics that directly and indirectly characterize informal employment, and presents the possibilities of using the "random forest" method to identify groups of constituent entities of the Russian Federation that have similar macroeconomic factors of informal employment. The novelty of this method with respect to the research objectives is that it allows one to assess the impact of macroeconomic indicators of regional development on the level of informal employment while taking into account implicit hierarchical relationships of factor indicators that are not predetermined by the initial hypotheses. Based on a generalization of the studies presented in the literature, as well as their own statistical calculations using Rosstat data, the authors conclude that macroeconomic parameters of regional development and systemic relationships among macroeconomic indicators are highly important in explaining the differentiation of the level of informal employment across the constituent entities of the Russian Federation.


Nglyc: A Random Forest Method for Prediction of N-Glycosylation Sites in Eukaryotic Protein Sequence

Protein and Peptide Letters ◽ 10.2174/0929866526666191002111404 ◽ 2020 ◽ Vol 27 (3) ◽ pp. 178-186 ◽ Cited By ~ 2

Author(s): Ganesan Pugalenthi ◽ Varadharaju Nithya ◽ Kuo-Chen Chou ◽ Govindaraju Archunan

Keyword(s): Random Forest ◽ Protein Sequence ◽ Glycosylation Site ◽ Computational Method ◽ The Other ◽ Eukaryotic Protein ◽ Random Forest Method ◽ Glycosylation Sites ◽ Human And Mouse ◽ Better Than

Background: N-glycosylation is one of the most important post-translational mechanisms in eukaryotes. N-glycosylation predominantly occurs at the N-X-[S/T] sequon, where X is any amino acid other than proline. However, not all N-X-[S/T] sequons in proteins are glycosylated. Therefore, accurate prediction of N-glycosylation sites is essential for understanding the N-glycosylation mechanism. Objective: Our objective is to develop a computational method to predict N-glycosylation sites in eukaryotic protein sequences. Methods: We report a random forest method, Nglyc, that predicts N-glycosylation sites from protein sequence using 315 sequence features. The method was trained on a dataset of 600 N-glycosylation sites and 600 non-glycosylation sites and tested on a dataset containing 295 N-glycosylation sites and 253 non-glycosylation sites. Nglyc predictions were compared with the NetNGlyc, EnsembleGly and GPP methods. Further, the performance of Nglyc was evaluated using human and mouse N-glycosylation sites. Results: Nglyc achieved an overall training accuracy of 0.8033 with all 315 features. Performance comparison with NetNGlyc, EnsembleGly and GPP shows that Nglyc performs better than the other methods, with high sensitivity and specificity. Conclusion: Our method achieved an overall accuracy of 0.8248 with 0.8305 sensitivity and 0.8182 specificity. A comparison study shows that our method performs better than the other methods. The applicability and success of our method were further evaluated using human and mouse N-glycosylation sites. Nglyc is freely available at https://github.com/bioinformaticsML/ Ngly.
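Two points in the abstract lend themselves to a short illustration: the N-X-[S/T] sequon rule, and how accuracy, sensitivity and specificity relate to one another. The sketch below is not part of Nglyc. The function names and the example sequence are hypothetical, and the confusion-matrix counts are back-calculated on the assumption that the reported rates refer to the test set of 295 sites and 253 non-sites; the abstract does not state those counts.

```python
import re

# N, then any residue except proline, then S or T.
# The lookahead makes the match zero-width, so overlapping sequons are all reported.
SEQUON = re.compile(r"(?=N[^P][ST])")


def find_sequons(sequence):
    """Return 0-based positions of the asparagine in every N-X-[S/T] sequon."""
    return [m.start() for m in SEQUON.finditer(sequence.upper())]


def classification_rates(tp, fn, tn, fp):
    """Sensitivity, specificity and accuracy from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy


if __name__ == "__main__":
    # Hypothetical peptide: NGS and NNT/NTS are sequons, NPT is not (X is proline).
    print(find_sequons("MNGSANPTNNTSQ"))  # [1, 8, 9]

    # Assumed counts: 245 of 295 sites and 207 of 253 non-sites classified correctly.
    sens, spec, acc = classification_rates(tp=245, fn=50, tn=207, fp=46)
    print(round(sens, 4), round(spec, 4), round(acc, 4))  # 0.8305 0.8182 0.8248
```

Under that assumption the three reported figures are mutually consistent: 245/295, 207/253 and 452/548 round to 0.8305, 0.8182 and 0.8248. A sequon scan like this only lists candidate sites; deciding which candidates are actually glycosylated is the task the random forest addresses.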


DSP Based Online Power Quality Events Detection and Classification Using Hilbert Huang Transform and Random Forest Method

2020 International Conference on Computational Intelligence for Smart Power System and Sustainable Energy (CISPSSE) ◽ 10.1109/cispsse49931.2020.9212224 ◽ 2020

Author(s): Mrutyunjaya Sahani ◽ Sasmita Choudhury ◽ Susanta Kumar Rout ◽ Debadatta Amaresh Gadanayak

Keyword(s): Random Forest ◽ Power Quality ◽ Hilbert Huang Transform ◽ Random Forest Method ◽ Events Detection


In silico prediction of the full United Nations Globally Harmonized System eye irritation categories of liquid chemicals by IATA-like bottom-up approach of random forest method

Journal of Toxicology and Environmental Health Part A ◽ 10.1080/15287394.2021.1956661 ◽ 2021 ◽ pp. 1-13

Author(s): Yeonsoo Kang ◽ Boram Jeong ◽ Doo-Hyeon Lim ◽ Donghwan Lee ◽ Kyung-Min Lim

Keyword(s): Random Forest ◽ United Nations ◽ In Silico ◽ In Silico Prediction ◽ Eye Irritation ◽ Bottom Up ◽ Random Forest Method


A New Random Forest Method for Longitudinal Data Classification Using a Lexicographic Bi-Objective Approach

2020 IEEE Symposium Series on Computational Intelligence (SSCI) ◽ 10.1109/ssci47803.2020.9308198 ◽ 2020

Author(s): Caio Ribeiro ◽ Alex Freitas

Keyword(s): Random Forest ◽ Longitudinal Data ◽ Data Classification ◽ Random Forest Method


Research of Medical High-Dimensional Imbalanced Data Classification Ensemble Feature Selection Algorithm with Random Forest

2017 International Conference on Smart Grid and Electrical Automation (ICSGEA) ◽ 10.1109/icsgea.2017.158 ◽ 2017 ◽ Cited By ~ 2

Author(s): Min Zhu ◽ Bo Su ◽ Gangmin Ning

Keyword(s): Feature Selection ◽ Random Forest ◽ Imbalanced Data ◽ Data Classification ◽ High Dimensional ◽ Selection Algorithm ◽ Feature Selection Algorithm ◽ Imbalanced Data Classification


Classifying Very High-Dimensional Data with Random Forests Built from Small Subspaces

International Journal of Data Warehousing and Mining ◽ 10.4018/jdwm.2012040103 ◽ 2012 ◽ Vol 8 (2) ◽ pp. 44-63 ◽ Cited By ~ 30

Author(s): Baoxun Xu ◽ Joshua Zhexue Huang ◽ Graham Williams ◽ Qiang Wang ◽ Yunming Ye

Keyword(s): Random Forest ◽ High Dimensional Data ◽ Real Life ◽ Classification Performance ◽ Feature Weighting ◽ Random Forest Model ◽ High Dimensional ◽ Forest Model ◽ Forest Models ◽ Random Forest Models

The selection of feature subspaces for growing decision trees is a key step in building random forest models. However, the common approach of randomly sampling a few features for the subspace is not suitable for high-dimensional data consisting of thousands of features, because such data often contain many features that are uninformative for classification, and random sampling often fails to include informative features in the selected subspaces. Consequently, the classification performance of the random forest model is significantly affected. In this paper, the authors propose an improved random forest method that uses a novel feature weighting method for subspace selection and therefore enhances classification performance on high-dimensional data. A series of experiments on 9 real-life high-dimensional datasets demonstrated that, using a subspace size of features where M is the total number of features in the dataset, our random forest model significantly outperforms existing random forest models.
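The idea of weighted subspace selection can be illustrated with a small sampling routine. This is a sketch under assumptions, not the authors' algorithm: the abstract does not say how the feature weights are computed or give the subspace size, so both are left as inputs here, and the function name is hypothetical. The routine draws distinct feature indices with probability proportional to weight, so that informative features are more likely to enter each tree's subspace than under uniform sampling.

```python
import random


def sample_weighted_subspace(weights, size, rng):
    """Draw up to `size` distinct feature indices, each draw proportional to weight.

    `weights` holds one non-negative score per feature (how it is computed is
    an assumption left to the caller); `rng` is a random.Random instance.
    """
    pool = list(range(len(weights)))   # feature indices still available
    w = list(weights)                  # their weights, kept in step with `pool`
    chosen = []
    for _ in range(min(size, len(pool))):
        if sum(w) <= 0:                # only zero-weight features are left
            break
        pick = rng.choices(range(len(pool)), weights=w, k=1)[0]
        chosen.append(pool.pop(pick))  # remove so the same feature is not drawn twice
        w.pop(pick)
    return chosen


if __name__ == "__main__":
    rng = random.Random(0)
    # Ten features; only the last two carry any weight.
    weights = [0.0] * 8 + [5.0, 5.0]
    print(sorted(sample_weighted_subspace(weights, 2, rng)))  # [8, 9]
```

In a forest, one such subspace would be drawn for each tree (or each node), and the tree would choose its splits only among the sampled features.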


Adaptation of the random forest method

Proceedings of the 4th International Conference on Smart City Applications - SCA '19 ◽ 10.1145/3368756.3369004 ◽ 2019 ◽ Cited By ~ 1

Author(s): Mourad Azhari ◽ Altaf Alaoui ◽ Zakia Achraoui ◽ Badia Ettaki ◽ Jamal Zerouaoui

Keyword(s): Random Forest ◽ Random Forest Method
