Learning from High-Dimensional and Class-Imbalanced Datasets Using Random Forests

Barbara Pes

doi:10.3390/info12080286

Information ◽ 10.3390/info12080286 ◽ 2021 ◽ Vol 12 (8) ◽ pp. 286

Author(s): Barbara Pes

Keyword(s): Feature Selection ◽ Learning Community ◽ Learning Strategies ◽ Hybrid Approach ◽ Real Life ◽ Class Imbalance ◽ Research Area ◽ High Dimensional ◽ Imbalance Learning ◽ Feature Selection Techniques

Class imbalance and high dimensionality are two major issues in several real-life applications, e.g., in the fields of bioinformatics, text mining and image classification. However, while both issues have been extensively studied in the machine learning community, they have mostly been treated separately, and little research has thus far been conducted on which approaches might be best suited to datasets that are both class-imbalanced and high-dimensional (i.e., with a large number of features). This work contributes to this challenging research area by studying the effectiveness of hybrid learning strategies that integrate feature selection techniques, to reduce the data dimensionality, with methods that cope with the adverse effects of class imbalance (in particular, data balancing and cost-sensitive methods are considered). Extensive experiments have been carried out across datasets from different domains, leveraging a well-known classifier, the Random Forest, which has proven to be effective in high-dimensional spaces and has also been successfully applied to imbalanced tasks. Our results give evidence of the benefits of such a hybrid approach compared to using feature selection or imbalance learning methods alone.
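The hybrid idea described in this abstract (reduce dimensionality first, then counter the class imbalance before training) can be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not name the specific feature selection or balancing methods used, so the standardized mean-difference filter and the random undersampling below are stand-ins, and the helper names (`rank_features`, `undersample`, `hybrid_preprocess`) are hypothetical. A classifier such as a Random Forest would then be trained on the reduced, balanced data.

```python
import random
from statistics import mean, pstdev


def rank_features(X, y, k):
    """Return indices of the k features with the largest standardized
    difference between class means (an assumed univariate filter)."""
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        spread = (pstdev(pos) + pstdev(neg)) or 1.0  # guard against zero spread
        scores.append((abs(mean(pos) - mean(neg)) / spread, j))
    return [j for _, j in sorted(scores, reverse=True)[:k]]


def undersample(X, y, seed=0):
    """Randomly drop majority-class rows until both classes have equal size."""
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    keep = sorted(minority + rng.sample(majority, len(minority)))
    return [X[i] for i in keep], [y[i] for i in keep]


def hybrid_preprocess(X, y, k, seed=0):
    """Feature selection followed by data balancing (assumed ordering)."""
    selected = rank_features(X, y, k)
    X_reduced = [[row[j] for j in selected] for row in X]
    X_bal, y_bal = undersample(X_reduced, y, seed)
    return selected, X_bal, y_bal


# Toy data: 8 majority rows, 2 minority rows; only feature 1 is informative.
X_toy = [[i % 3, 0.0, 1.0] for i in range(8)] + [[i % 3, 5.0, 1.0] for i in range(2)]
y_toy = [0] * 8 + [1] * 2
selected, X_bal, y_bal = hybrid_preprocess(X_toy, y_toy, k=1)
```

In this sketch the feature ranking is computed on the full training data before balancing; whether selection should precede or follow balancing is a design choice that the abstract does not settle.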

Cost-sensitive learning strategies for high-dimensional and imbalanced data: a comparative study

PeerJ Computer Science ◽ 10.7717/peerj-cs.832 ◽ 2021 ◽ Vol 7 ◽ pp. e832

Author(s): Barbara Pes ◽ Giuseppina Lai

Keyword(s): Feature Selection ◽ Comparative Study ◽ Learning Strategies ◽ Class Imbalance ◽ Imbalanced Data ◽ High Dimensionality ◽ Problem Instance ◽ High Dimensional ◽ Cost Sensitive Learning ◽ Interesting Insight

High dimensionality and class imbalance have been largely recognized as important issues in machine learning. A vast amount of literature has investigated suitable approaches to address the multiple challenges that arise when dealing with high-dimensional feature spaces (where each problem instance is described by a large number of features). Likewise, several learning strategies have been devised to cope with the adverse effects of imbalanced class distributions, which may severely impact the generalization ability of the induced models. Nevertheless, although both issues have been studied for several years, they have mostly been addressed separately, and their combined effects are yet to be fully understood. Indeed, little research has so far been conducted to investigate which approaches might be best suited to datasets that are, at the same time, high-dimensional and class-imbalanced. To make a contribution in this direction, our work presents a comparative study among different learning strategies that leverage both feature selection, to cope with high dimensionality, and cost-sensitive learning methods, to cope with class imbalance. Specifically, different ways of incorporating misclassification costs into the learning process have been explored. Different feature selection heuristics, both univariate and multivariate, have also been considered to comparatively evaluate their effectiveness on imbalanced data. The experiments have been conducted on three challenging benchmarks from the genomic domain, gaining interesting insight into the beneficial impact of combining feature selection and cost-sensitive learning, especially in the presence of highly skewed data distributions.
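One common way of incorporating misclassification costs is to shift the decision threshold applied to predicted probabilities. The abstract does not say which cost-sensitive schemes were compared, so the sketch below is an assumption for illustration only, with hypothetical function names. It uses the standard result that, with zero cost for correct decisions, predicting the positive class minimizes expected cost whenever p >= C_FP / (C_FP + C_FN).

```python
def cost_sensitive_threshold(cost_fp, cost_fn):
    """Probability threshold that minimizes expected misclassification cost,
    assuming correct decisions cost nothing."""
    return cost_fp / (cost_fp + cost_fn)


def predict_with_costs(probabilities, cost_fp, cost_fn):
    """Label an instance positive (1) when its positive-class probability
    reaches the cost-derived threshold."""
    threshold = cost_sensitive_threshold(cost_fp, cost_fn)
    return [1 if p >= threshold else 0 for p in probabilities]


def total_cost(y_true, y_pred, cost_fp, cost_fn):
    """Sum the cost of false positives and false negatives."""
    cost = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 0 and pred == 1:
            cost += cost_fp
        elif truth == 1 and pred == 0:
            cost += cost_fn
    return cost


# Missing a minority (positive) instance is assumed 9x as costly as a false alarm.
probs = [0.05, 0.2, 0.6]
labels_equal_costs = predict_with_costs(probs, cost_fp=1, cost_fn=1)
labels_skewed_costs = predict_with_costs(probs, cost_fp=1, cost_fn=9)
```

With a higher false-negative cost the threshold drops, so more borderline instances are assigned to the minority class. Other schemes, such as instance weighting during training, reach a similar effect by different means.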

CLASSIFICATION OF HIGH-DIMENSIONAL MICROARRAY DATA WITH A TWO-STEP PROCEDURE VIA A WILCOXON CRITERION AND MULTILAYER PERCEPTRON

International Journal of Computational Intelligence and Applications ◽ 10.1142/s1469026811002969 ◽ 2011 ◽ Vol 10 (01) ◽ pp. 1-14

Author(s): VLADIMIR NIKULIN ◽ TIAN-HSIANG HUANG ◽ GEOFFREY J. MCLACHLAN

Keyword(s): Data Mining ◽ Feature Selection ◽ High Dimensional ◽ Second Step ◽ Support Vector ◽ Step Procedure ◽ Leave One Out ◽ Natural Combination ◽ Feature Selection Techniques

The method presented in this paper is novel as a natural combination of two mutually dependent steps. Feature selection is a key element (first step) in our classification system, which was employed during the 2010 International RSCTC data mining (bioinformatics) Challenge. The second step may be implemented using any suitable classifier such as linear regression, support vector machine or neural networks. We conducted leave-one-out (LOO) experiments with several feature selection techniques and classifiers. Based on the LOO evaluations, we decided to use feature selection with the separation type Wilcoxon-based criterion for all final submissions. The method presented in this paper was tested successfully during the RSCTC data mining Challenge, where we achieved the top score in the Basic track.
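As a rough illustration of rank-based feature selection, the sketch below scores each feature by how well its ranks separate two classes. The abstract does not define its "separation type Wilcoxon-based criterion", so this uses the standard Wilcoxon rank-sum (Mann-Whitney U) statistic as an assumed stand-in, and all function names are hypothetical.

```python
def rank_with_ties(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def wilcoxon_separation(feature, labels):
    """Score in [0, 1]: 0 means the classes are indistinguishable by rank,
    1 means the feature separates them perfectly."""
    ranks = rank_with_ties(feature)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum_pos = sum(r for r, label in zip(ranks, labels) if label == 1)
    u_stat = rank_sum_pos - n_pos * (n_pos + 1) / 2  # Mann-Whitney U
    auc = u_stat / (n_pos * n_neg)
    return abs(2 * auc - 1)


def select_top_features(X, y, k):
    """Return the indices of the k best-separating features."""
    scores = [wilcoxon_separation([row[j] for row in X], y)
              for j in range(len(X[0]))]
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]


# Toy data: feature 0 separates the classes perfectly, feature 2 is constant.
X_toy = [[1, 1, 5], [2, 3, 5], [3, 2, 5], [4, 4, 5]]
y_toy = [0, 0, 1, 1]
top_two = select_top_features(X_toy, y_toy, k=2)
```

The selected columns would then be passed to the second-step classifier, which the abstract leaves open (linear regression, support vector machine or neural network).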

Feature Selection Techniques to Counter Class Imbalance Problem for Aging Related Bug Prediction

Proceedings of the 11th Innovations in Software Engineering Conference on - ISEC '18 ◽ 10.1145/3172871.3172872 ◽ 2018 ◽ Cited By ~ 1

Author(s): Lov Kumar ◽ Ashish Sureka

Keyword(s): Feature Selection ◽ Class Imbalance ◽ Class Imbalance Problem ◽ Imbalance Problem ◽ Feature Selection Techniques

A New Diversity Technique for Imbalance Learning Ensembles

International Journal of Engineering & Technology ◽ 10.14419/ijet.v7i2.11251 ◽ 2018 ◽ Vol 7 (2.14) ◽ pp. 478 ◽ Cited By ~ 2

Author(s): Hartono . ◽ Opim Salim Sitompul ◽ Erna Budhiarti Nababan ◽ Tulus . ◽ Dahlan Abdullah ◽ ...

Keyword(s): Hybrid Approach ◽ Class Imbalance ◽ Machine Learning Techniques ◽ Classifier Ensembles ◽ Classification Problems ◽ Class Imbalance Problem ◽ Weighting Method ◽ Imbalance Problem ◽ Learning Ensembles ◽ Imbalance Learning

Data mining and machine learning techniques designed to solve classification problems require a balanced class distribution. In practice, however, datasets often contain one class represented by a large number of instances while other classes have far fewer instances. This is known as the class imbalance problem. Classifier ensembles are often used to overcome class imbalance problems, and data diversity is one of the cornerstones of ensembles. An ideal ensemble system should have accurate individual classifiers, and when errors do occur they should occur on different objects or instances. This research presents an overview and an experimental study using the Hybrid Approach Redefinition (HAR) method, which handles class imbalance and is at the same time expected to achieve better data diversity. The study uses 6 datasets with different imbalance ratios and compares HAR with SMOTEBoost, a re-weighting method that is often used for handling class imbalance. The study shows that data diversity is related to performance in imbalance learning ensembles and that the proposed method can obtain better data diversity.
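The requirement that ensemble members err on different instances can be made concrete with simple pairwise measures. The abstract does not state how HAR quantifies diversity, so the sketch below is not the paper's method: it uses two generic measures (pairwise disagreement and double-fault rate) and hypothetical function names, purely to illustrate the idea.

```python
from itertools import combinations


def disagreement(pred_a, pred_b):
    """Fraction of instances on which two classifiers predict differently."""
    return sum(a != b for a, b in zip(pred_a, pred_b)) / len(pred_a)


def double_fault(pred_a, pred_b, y_true):
    """Fraction of instances that both classifiers get wrong."""
    both_wrong = sum(a != t and b != t for a, b, t in zip(pred_a, pred_b, y_true))
    return both_wrong / len(y_true)


def mean_pairwise(measure, predictions, *extra):
    """Average a pairwise measure over all classifier pairs."""
    pairs = list(combinations(predictions, 2))
    return sum(measure(a, b, *extra) for a, b in pairs) / len(pairs)


def majority_vote(predictions):
    """Combine binary predictions by simple majority."""
    return [1 if 2 * sum(votes) > len(votes) else 0 for votes in zip(*predictions)]


# Three classifiers, each wrong on a different instance.
y_true = [1, 0, 1, 0]
members = [[1, 0, 1, 1], [1, 0, 0, 0], [0, 0, 1, 0]]
avg_disagreement = mean_pairwise(disagreement, members)
avg_double_fault = mean_pairwise(double_fault, members, y_true)
ensemble = majority_vote(members)
```

In this toy case every member makes one mistake, but because the mistakes fall on different instances the double-fault rate is zero and the majority vote recovers the true labels.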

A proposed framework on hybrid feature selection techniques for handling high dimensional educational data

10.1063/1.5005463 ◽ 2017 ◽ Cited By ~ 1

Author(s): Amirah Mohamed Shahiri ◽ Wahidah Husain ◽ Nur’Aini Abd Rashid

Keyword(s): Feature Selection ◽ High Dimensional ◽ Feature Selection Techniques

A Dynamic Ensemble Framework for Mining Textual Streams with Class Imbalance

The Scientific World JOURNAL ◽ 10.1155/2014/497354 ◽ 2014 ◽ Vol 2014 ◽ pp. 1-11 ◽ Cited By ~ 2

Author(s): Ge Song ◽ Yunming Ye

Keyword(s): Large Scale ◽ State Of The Art ◽ Concept Drift ◽ Real Life ◽ Class Imbalance ◽ High Dimensional ◽ Adaptive Selection ◽ Stream Classification ◽ Rare Class

Textual stream classification has become a realistic and challenging issue since large-scale, high-dimensional, and non-stationary streams with class imbalance have been widely used in various real-life applications. Given the characteristics of textual streams, classifying them is technically difficult, especially in an imbalanced environment. In this paper, we propose a new ensemble framework, clustering forest, for learning from the textual imbalanced stream with concept drift (CFIM). CFIM is based on ensemble learning and integrates a set of clustering trees (CTs). CFIM includes an adaptive selection method that flexibly chooses the useful CTs according to the properties of the stream. In particular, to deal with the problem of class imbalance, we collect and reuse both rare-class instances and misclassified instances from the historical chunks. Unlike most existing approaches, our approach assumes that both the majority class and the rare class may suffer from concept drift. Thus the distribution of resampled instances is similar to the current concept. The effectiveness of CFIM is examined on five real-world textual streams in an imbalanced, non-stationary environment. Experimental results demonstrate that CFIM achieves better performance than four state-of-the-art ensemble models.

Threshold-based feature selection techniques for high-dimensional bioinformatics data

Network Modeling Analysis in Health Informatics and Bioinformatics ◽ 10.1007/s13721-012-0006-6 ◽ 2012 ◽ Vol 1 (1-2) ◽ pp. 47-61 ◽ Cited By ~ 27

Author(s): Jason Van Hulse ◽ Taghi M. Khoshgoftaar ◽ Amri Napolitano ◽ Randall Wald

Keyword(s): Feature Selection ◽ High Dimensional ◽ Feature Selection Techniques

Feature Selection for Small Sample Sets with High Dimensional Data Using Heuristic Hybrid Approach

International Journal of Engineering ◽ 10.5829/ije.2020.33.02b.05 ◽ 2020 ◽ Vol 33 (2)

Keyword(s): Feature Selection ◽ High Dimensional Data ◽ Hybrid Approach ◽ Small Sample ◽ High Dimensional ◽ Selection For

Feature selection using autoencoders with Bayesian methods to high-dimensional data

Journal of Intelligent & Fuzzy Systems ◽ 10.3233/jifs-211348 ◽ 2021 ◽ pp. 1-10

Author(s): Lei Shu ◽ Kun Huang ◽ Wenhao Jiang ◽ Wenming Wu ◽ Hongling Liu

Keyword(s): Machine Learning ◽ Feature Selection ◽ Bayesian Methods ◽ Large Scale ◽ High Dimensional Data ◽ Hybrid Approach ◽ High Dimensional ◽ Real World Data ◽ Learning Tasks ◽ Low Dimensional

Using real-world data directly in machine learning tasks easily leads to poor generalization, since such data is usually high-dimensional and limited. By learning low-dimensional representations of high-dimensional data, feature selection can retain the features that are useful for machine learning tasks, and these features allow machine learning models to be trained effectively. Feature selection from high-dimensional data is therefore a challenge. To address this issue, this paper proposes a novel hybrid feature selection approach consisting of an autoencoder and Bayesian methods. Firstly, Bayesian methods are embedded in the proposed autoencoder as a special hidden layer, with the aim of increasing the precision when selecting non-redundant features. Then, the other hidden layers of the autoencoder are used for non-redundant feature selection. Finally, the proposed method is shown to outperform the mainstream approaches for feature selection. We find that combining autoencoders with probabilistic correction methods is more meaningful for feature selection than stacking architectures or adding constraints to autoencoders. We also demonstrate that stacked autoencoders are more suitable for large-scale feature selection, whereas sparse autoencoders are beneficial when selecting a smaller number of features. The proposed method also provides a theoretical reference for analyzing the optimality of feature selection.

A hybrid approach using rough set theory and hypergraph for feature selection on high-dimensional medical datasets

Soft Computing ◽ 10.1007/s00500-019-03818-6 ◽ 2019 ◽ Vol 23 (23) ◽ pp. 12655-12672 ◽ Cited By ~ 3

Author(s): M. R. Gauthama Raman ◽ Somu Nivethitha ◽ Krithivasan Kannan ◽ V. S. Shankar Sriram

Keyword(s): Feature Selection ◽ Set Theory ◽ Rough Set ◽ Rough Set Theory ◽ Hybrid Approach ◽ High Dimensional
