Unbiased Feature Selection in Learning Random Forests for High-Dimensional Data

The Scientific World JOURNAL ◽

10.1155/2015/471371 ◽

2015 ◽

Vol 2015 ◽

pp. 1-18 ◽

Cited By ~ 16

Author(s):

Thanh-Tung Nguyen ◽

Joshua Zhexue Huang ◽

Thuy Thi Nguyen

Keyword(s):

Feature Selection ◽

Random Forests ◽

Selection Process ◽

High Dimensional Data ◽

Feature Weighting ◽

High Dimensional ◽

Feature Subset ◽

Value Assessment ◽

Statistical Measures ◽

Real World Datasets

Random forests (RFs) have been widely used as a powerful classification method. However, with the randomization in both bagging samples and feature selection, the trees in the forest tend to select uninformative features for node splitting. This makes RFs have poor accuracy when working with high-dimensional data. Besides that, RFs have bias in the feature selection process where multivalued features are favored. Aiming at debiasing feature selection in RFs, we propose a new RF algorithm, called xRF, to select good features in learning RFs for high-dimensional data. We first remove the uninformative features usingp-value assessment, and the subset of unbiased features is then selected based on some statistical measures. This feature subset is then partitioned into two subsets. A feature weighting sampling technique is used to sample features from these two subsets for building trees. This approach enables one to generate more accurate trees, while allowing one to reduce dimensionality and the amount of data needed for learning RFs. An extensive set of experiments has been conducted on 47 high-dimensional real-world datasets including image datasets. The experimental results have shown that RFs with the proposed approach outperformed the existing random forests in increasing the accuracy and the AUC measures.

Download Full-text

A Clustering-Guided Integer Brain Storm Optimizer for Feature Selection in High-Dimensional Data

Discrete Dynamics in Nature and Society ◽

10.1155/2021/8462493 ◽

2021 ◽

Vol 2021 ◽

pp. 1-10

Author(s):

Jia Yun-Tao ◽

Zhang Wan-Qiu ◽

He Chun-Lin

Keyword(s):

Feature Selection ◽

High Dimensional Data ◽

High Dimensional ◽

Second Phase ◽

Feature Subset ◽

Search Performance ◽

Brain Storm Optimization ◽

Real World Datasets ◽

Optimal Feature Subset ◽

Optimal Feature

For high-dimensional data with a large number of redundant features, existing feature selection algorithms still have the problem of “curse of dimensionality.” In view of this, the paper studies a new two-phase evolutionary feature selection algorithm, called clustering-guided integer brain storm optimization algorithm (IBSO-C). In the first phase, an importance-guided feature clustering method is proposed to group similar features, so that the search space in the second phase can be reduced obviously. The second phase applies oneself to finding optimal feature subset by using an improved integer brain storm optimization. Moreover, a new encoding strategy and a time-varying integer update method for individuals are proposed to improve the search performance of brain storm optimization in the second phase. Since the number of feature clusters is far smaller than the size of original features, IBSO-C can find an optimal feature subset fast. Compared with several existing algorithms on some real-world datasets, experimental results show that IBSO-C can find feature subset with high classification accuracy at less computation cost.

Download Full-text

An Asymmetric Chaotic Competitive Swarm Optimization Algorithm for Feature Selection in High-Dimensional Data

Symmetry ◽

10.3390/sym12111782 ◽

2020 ◽

Vol 12 (11) ◽

pp. 1782

Author(s):

Supailin Pichai ◽

Khamron Sunat ◽

Sirapat Chiewchanwattana

Keyword(s):

Feature Selection ◽

Optimization Algorithm ◽

Chaotic Map ◽

High Dimensional Data ◽

Quality Criteria ◽

High Dimensional ◽

Feature Subset ◽

Swarm Optimization ◽

Candidate Solution ◽

Dimensional Classification

This paper presents a method for feature selection in a high-dimensional classification context. The proposed method finds a candidate solution based on quality criteria using subset searching. In this study, the competitive swarm optimization (CSO) algorithm was implemented to solve feature selection problems in high-dimensional data. A new asymmetric chaotic function was proposed and used to generate the population and search for a CSO solution. Its histogram is right-skewed. The proposed method is named an asymmetric chaotic competitive swarm optimization algorithm (ACCSO). According to the asymmetrical property of the proposed chaotic map, ACCSO prefers zero than one. Therefore, the solution is very compact and can achieve high classification accuracy with a minimal feature subset for high-dimensional datasets. The proposed method was evaluated on 12 datasets, with dimensions ranging from 4 to 10,304. ACCSO was compared to the original CSO algorithm and other metaheuristic algorithms. Experimental results show that the proposed method can increase accuracy and it reduces the number of selected features. Compared to different optimization algorithms with other wrappers, the proposed method exhibits excellent performance.

Download Full-text

Hybridization of feature selection and feature weighting for high dimensional data

Applied Intelligence ◽

10.1007/s10489-018-1348-2 ◽

2018 ◽

Vol 49 (4) ◽

pp. 1580-1596 ◽

Cited By ~ 8

Author(s):

Dalwinder Singh ◽

Birmohan Singh

Keyword(s):

Feature Selection ◽

High Dimensional Data ◽

Feature Weighting ◽

High Dimensional

Download Full-text

Enhanced Filter Feature Selection Methods for Arabic Text Categorization

International Journal of Information Retrieval Research ◽

10.4018/ijirr.2018040101 ◽

2018 ◽

Vol 8 (2) ◽

pp. 1-24 ◽

Cited By ~ 1

Author(s):

Abdullah Saeed Ghareb ◽

Azuraliza Abu Bakara ◽

Qasem A. Al-Radaideh ◽

Abdul Razak Hamdan

Keyword(s):

Feature Selection ◽

Text Categorization ◽

Selection Process ◽

High Dimensional Data ◽

Relevant Information ◽

High Dimensional ◽

Arabic Text ◽

Relevant Feature ◽

Associative Classification ◽

Selection Methods

The filtering of a large amount of data is an important process in data mining tasks, particularly for the categorization of unstructured high dimensional data. Therefore, a feature selection process is desired to reduce the space of high dimensional data into small relevant subset dimensions that represent the best features for text categorization. In this article, three enhanced filter feature selection methods, Category Relevant Feature Measure, Modified Category Discriminated Measure, and Odd Ratio2, are proposed. These methods combine the relevant information about features in both the inter- and intra-category. The effectiveness of the proposed methods with Naïve Bayes and associative classification is evaluated by traditional measures of text categorization, namely, macro-averaging of precision, recall, and F-measure. Experiments are conducted on three Arabic text datasets used for text categorization. The experimental results showed that the proposed methods are able to achieve better and comparable results when compared to 12 well known traditional methods.

Download Full-text

Feature Selection

Encyclopedia of Data Warehousing and Mining, Second Edition ◽

10.4018/978-1-60566-010-3.ch135 ◽

2011 ◽

pp. 878-882

Author(s):

Damien François

Keyword(s):

Feature Selection ◽

Time Series Prediction ◽

High Dimensional Data ◽

Principal Component ◽

Point Of View ◽

High Dimensional ◽

Feature Subset ◽

Selection Methods ◽

Reduction Techniques ◽

Dimensionality Reduction Techniques

In many applications, like function approximation, pattern recognition, time series prediction, and data mining, one has to build a model relating some features describing the data to some response value. Often, the features that are relevant for building the model are not known in advance. Feature selection methods allow removing irrelevant and/or redundant features to only keep the feature subset that are most useful to build a prediction model. The model is simpler and easier to interpret, reducing the risks of overfitting, non-convergence, etc. By contrast with other dimensionality reduction techniques such as principal component analysis or more recent nonlinear projection techniques (Lee & Verleysen 2007), which build a new, smaller set of features, the features that are selected by feature selection methods preserve their initial meaning, potentially bringing extra information about the process being modeled (Guyon 2006). Recently, the advent of high-dimensional data has raised new challenges for feature selection methods, both from the algorithmic point of view and the conceptual point of view (Liu & Motoda 2007). The problem of feature selection is exponential in nature, and many approximate algorithms are cubic with respect to the initial number of features, which may be intractable when the dimensionality of the data is large. Furthermore, high-dimensional data are often highly redundant, and two distinct subsets of features may have very similar predictive power, which can make it difficult to identify the best subset.

Download Full-text

Improved Nonnegative Matrix Factorization Based Feature Selection for High Dimensional Data Analysis

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.347-350.2344 ◽

2013 ◽

Vol 347-350 ◽

pp. 2344-2348

Author(s):

Lin Cheng Jiang ◽

Wen Tang Tan ◽

Zhen Wen Wang ◽

Feng Jing Yin ◽

Bin Ge ◽

...

Keyword(s):

Feature Selection ◽

Matrix Factorization ◽

Nonnegative Matrix Factorization ◽

High Dimensional Data ◽

Nonnegative Matrix ◽

High Dimensional ◽

Feature Subset ◽

Feature Extraction Method ◽

Optimal Feature Subset ◽

Optimal Feature

Feature selection has become the focus of research areas of applications with high dimensional data. Nonnegative matrix factorization (NMF) is a good method for dimensionality reduction but it cant select the optimal feature subset for its a feature extraction method. In this paper, a two-step strategy method based on improved NMF is proposed.The first step is to get the basis of each catagory in the dataset by NMF. Added constrains can guarantee these basises are sparse and mostly distinguish from each other which can contribute to classfication. An auxiliary function is used to prove the algorithm convergent.The classic ReliefF algorithm is used to weight each feature by all the basis vectors and choose the optimal feature subset in the second step.The experimental results revealed that the proposed method can select a representive and relevant feature subset which is effective in improving the performance of the classifier.

Download Full-text

Particle Swarm Optimisation for Feature Selection and Weighting in High-Dimensional Clustering

10.26686/wgtn.13058747 ◽

2020 ◽

Author(s):

D O'Neill ◽

Andrew Lensen ◽

Bing Xue ◽

Mengjie Zhang

Keyword(s):

Feature Selection ◽

Fitness Function ◽

Clustering Algorithms ◽

High Dimensional Data ◽

Particle Swarm ◽

Learning Task ◽

Particle Swarm Optimisation ◽

Feature Weighting ◽

High Dimensional ◽

Validation Measure

© 2018 IEEE. Clustering, an important unsupervised learning task, is very challenging on high-dimensional data, since the generated clusters can be significantly less meaningful as the number of features increases. Feature selection and/or feature weighting can address this issue by selecting and weighting only informative features. These techniques have been extensively studied in supervised learning, e.g. classification, but they are very difficult to use with clustering due to the lack of effective similarity/distance and validation measures. This paper utilises the powerful global search ability of particle swarm optimisation (PSO) on continuous problems, to propose a PSO based method for simultaneous feature selection and feature weighting for clustering on high-dimensional data, where a new validation measure is also proposed as the fitness function of the PSO method. Experiments on datasets with varying dimensionalities and different number of known clusters show that the proposed method can successfully improve clustering performance of different types of clustering algorithms over using the baseline of the original feature set.

Download Full-text

FSEFST:Feature Selection and Extraction using Feature Subset Technique in High Dimensional Data

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.b6907.129219 ◽

2019 ◽

Vol 9 (2) ◽

pp. 814-820

Keyword(s):

Feature Extraction ◽

Feature Selection ◽

Dimensionality Reduction ◽

Efficient Algorithm ◽

High Dimensional Data ◽

Suggested Method ◽

High Dimensional ◽

Feature Subset

Dimensionality reduction is one of the pre-processing phases required when large amount of data is available. Feature selection and Feature Extraction are one of the methods used to reduce the dimensionality. Till now these methods were using separately so the resultant feature contains original or transformed data. An efficient algorithm for Feature Selection and Extraction using Feature Subset Technique in High Dimensional Data (FSEFST) has been proposed in order to select and extract the efficient features by using feature subset method where it will have both original and transformed data. The results prove that the suggested method is better as compared with the existing algorithm

Download Full-text

Efficient Feature Subset Selection Algorithm for High Dimensional Data

International Journal of Electrical and Computer Engineering (IJECE) ◽

10.11591/ijece.v6i4.pp1880-1888 ◽

2016 ◽

Vol 6 (4) ◽

pp. 1880 ◽

Cited By ~ 3

Author(s):

Smita Chormunge ◽

Sudarson Jena

Keyword(s):

Feature Selection ◽

Information Gain ◽

High Dimensional Data ◽

Feature Subset Selection ◽

High Dimensional ◽

Feature Subset ◽

Selection Algorithm ◽

Feature Selection Algorithm ◽

Computational Performance ◽

Optimal Feature Subset

<p>Feature selection approach solves the dimensionality problem by removing irrelevant and redundant features. Existing Feature selection algorithms take more time to obtain feature subset for high dimensional data. This paper proposes a feature selection algorithm based on Information gain measures for high dimensional data termed as IFSA (Information gain based Feature Selection Algorithm) to produce optimal feature subset in efficient time and improve the computational performance of learning algorithms. IFSA algorithm works in two folds: First apply filter on dataset. Second produce the small feature subset by using information gain measure. Extensive experiments are carried out to compare proposed algorithm and other methods with respect to two different classifiers (Naive bayes and IBK) on microarray and text data sets. The results demonstrate that IFSA not only produces the most select feature subset in efficient time but also improves the classifier performance.</p>

Download Full-text

Feature Selection using Genetic Algorithm for Clustering high Dimensional Data

International Journal of Engineering & Technology ◽

10.14419/ijet.v7i2.11.11001 ◽

2018 ◽

Vol 7 (2.11) ◽

pp. 27 ◽

Cited By ~ 1

Author(s):

Kahkashan Kouser ◽

Amrita Priyam

Keyword(s):

Genetic Algorithm ◽

Feature Selection ◽

Clustering Algorithm ◽

High Dimensional Data ◽

Feature Space ◽

High Dimensional ◽

Feature Subset ◽

Data Set ◽

Optimal Feature Subset ◽

Optimal Feature

One of the open problems of modern data mining is clustering high dimensional data. For this in the paper a new technique called GA-HDClustering is proposed, which works in two steps. First a GA-based feature selection algorithm is designed to determine the optimal feature subset; an optimal feature subset is consisting of important features of the entire data set next, a K-means algorithm is applied using the optimal feature subset to find the clusters. On the other hand, traditional K-means algorithm is applied on the full dimensional feature space. Finally, the result of GA-HDClustering is compared with the traditional clustering algorithm. For comparison different validity matrices such as Sum of squared error (SSE), Within Group average distance (WGAD), Between group distance (BGD), Davies-Bouldin index(DBI), are used .The GA-HDClustering uses genetic algorithm for searching an effective feature subspace in a large feature space. This large feature space is made of all dimensions of the data set. The experiment performed on the standard data set revealed that the GA-HDClustering is superior to traditional clustering algorithm.

Download Full-text