Beta Distribution-Based Cross-Entropy for Feature Selection

Weixing Dai; Dianjing Guo

doi:10.3390/e21080769

Entropy ◽

10.3390/e21080769 ◽

2019 ◽

Vol 21 (8) ◽

pp. 769 ◽

Cited By ~ 1

Author(s):

Weixing Dai ◽

Dianjing Guo

Keyword(s):

Feature Selection ◽

Probability Density ◽

Beta Distribution ◽

Predictive Accuracy ◽

High Dimensional Data ◽

Area Under The Curve ◽

Cross Entropy ◽

High Dimensional ◽

Generalization Ability ◽

Conventional Methods

Analysis of high-dimensional data is a challenge in machine learning and data mining. Feature selection plays an important role in dealing with high-dimensional data, both for improving predictive accuracy and for better interpretation of the data. Frequently used evaluation functions for feature selection include resampling methods such as cross-validation, which show an advantage in predictive accuracy. However, these conventional methods are not only computationally expensive, but also tend to be over-optimistic. We propose a novel cross-entropy based on the beta distribution for feature selection. In beta distribution-based cross-entropy (BetaDCE) for feature selection, the probability density is estimated by a beta distribution and the cross-entropy is computed from the expected value of the beta distribution, so that the generalization ability can be estimated more precisely than by conventional methods, where the probability density is learnt from data. Analysis of the generalization ability of BetaDCE revealed a trade-off between bias and variance. The robustness of BetaDCE was demonstrated by experiments on three types of data. In the exclusive-or-like (XOR-like) dataset, the false discovery rate of BetaDCE was significantly smaller than that of other methods. For the leukemia dataset, the area under the curve (AUC) of BetaDCE on the test set was 0.93 with only four selected features, which indicated that BetaDCE not only detected the irrelevant and redundant features precisely, but also predicted the class labels more accurately and with fewer features than the original method, whose AUC was 0.83 with 50 features. In the metabonomic dataset, the overall AUC of prediction with features selected by BetaDCE was significantly larger than that of the originally reported method. Therefore, BetaDCE can be used as a general and efficient framework for feature selection.
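The abstract does not give the BetaDCE formula, so the following is only a minimal sketch of the general idea of taking an expected log-loss under a beta distribution instead of plugging in observed frequencies. It assumes a binary class, a single cell of a discretized feature, and a uniform Beta(1, 1) prior; the function names and this setup are illustrative assumptions, not the authors' definitions. It relies on the standard identity E[ln p] = ψ(a) − ψ(a + b) for p ~ Beta(a, b), which for integer a and b reduces to a finite harmonic sum.

```python
from math import log


def expected_neg_log(a: int, b: int) -> float:
    """E[-ln p] for p ~ Beta(a, b) with positive integer a and b.

    Uses E[ln p] = psi(a) - psi(a + b); for integers the digamma
    difference equals the harmonic sum 1/a + 1/(a+1) + ... + 1/(a+b-1).
    """
    if a < 1 or b < 1:
        raise ValueError("a and b must be positive integers")
    return sum(1.0 / k for k in range(a, a + b))


def expected_cross_entropy(n_pos: int, n_neg: int) -> float:
    """Average expected log-loss over the samples in one cell.

    Assumption (not from the paper): a uniform Beta(1, 1) prior, so the
    posterior of P(y = 1) in the cell is Beta(n_pos + 1, n_neg + 1).
    """
    n = n_pos + n_neg
    if n == 0:
        raise ValueError("cell is empty")
    a, b = n_pos + 1, n_neg + 1
    loss_pos = expected_neg_log(a, b)  # E[-ln p] for positive samples
    loss_neg = expected_neg_log(b, a)  # E[-ln(1 - p)], since 1 - p ~ Beta(b, a)
    return (n_pos * loss_pos + n_neg * loss_neg) / n


def plugin_cross_entropy(n_pos: int, n_neg: int) -> float:
    """Log-loss with the observed frequencies plugged in (empirical entropy)."""
    n = n_pos + n_neg
    return -sum(c / n * log(c / n) for c in (n_pos, n_neg) if c > 0)
```

Under these assumptions, a pure cell with three positive samples has a plug-in loss of 0 but an expected loss of 1/4, while a pure cell with thirty positive samples has an expected loss of 1/31. The expectation therefore penalizes cells that look perfect only because they are small, which is one way to read the abstract's remark that conventional estimates tend to be over-optimistic.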

A Multicriteria Approach to Find Predictive and Sparse Models with Stable Feature Selection for High-Dimensional Data

Computational and Mathematical Methods in Medicine ◽

10.1155/2017/7907163 ◽

2017 ◽

Vol 2017 ◽

pp. 1-18 ◽

Cited By ~ 5

Author(s):

Andrea Bommert ◽

Jörg Rahnenführer ◽

Michel Lang

Keyword(s):

Feature Selection ◽

Predictive Model ◽

Predictive Accuracy ◽

Pearson Correlation ◽

High Dimensional Data ◽

High Dimensional ◽

Sparse Models ◽

Data Set ◽

The Stability ◽

Selection Of

Finding a good predictive model for a high-dimensional data set can be challenging. For genetic data, it is important not only to find a model with high predictive accuracy, but also that this model uses only a few features and that the selection of these features is stable. This is because, in bioinformatics, the models are used not only for prediction but also for drawing biological conclusions, which makes the interpretability and reliability of the model crucial. We suggest using three target criteria when fitting a predictive model to a high-dimensional data set: the classification accuracy, the stability of the feature selection, and the number of chosen features. As it is unclear which measure is best for evaluating the stability, we first compare a variety of stability measures. We conclude that the Pearson correlation has the best theoretical and empirical properties. We also find that, for stability assessment, it is most important that a measure contains a correction for chance or for large numbers of chosen features. Then, we analyse Pareto fronts and conclude that it is possible to find models with a stable selection of few features without losing much predictive accuracy.
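As an illustration of a Pearson-correlation stability score, the sketch below encodes each resampling run as a 0/1 indicator vector over all features and averages the pairwise correlations. The abstract does not spell out the exact definition used in the paper, so the averaging scheme and the function names here are assumptions.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Happens when a run selects no features or all features.
        raise ValueError("correlation is undefined for a constant vector")
    return cov / sqrt(var_x * var_y)


def selection_stability(selections, n_features):
    """Mean pairwise Pearson correlation between selection indicator vectors.

    selections: list of sets of selected feature indices, one set per
    resampling run (hypothetical input format).
    """
    if len(selections) < 2:
        raise ValueError("need at least two runs to assess stability")
    vectors = [[1 if j in chosen else 0 for j in range(n_features)]
               for chosen in selections]
    pairs = list(combinations(vectors, 2))
    return sum(pearson(u, v) for u, v in pairs) / len(pairs)
```

Identical selections across runs give a score of 1, and two runs that pick complementary halves of the features give −1. Because the correlation centres each indicator vector on its own mean, a run that selects many features does not earn a high score simply through the overlap that large subsets produce by chance.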

BagMeLiF: stable boosting-based hybrid-ensemble feature selection algorithm for high-dimensional data

2020 International Conference on Control, Robotics and Intelligent System ◽

10.1145/3437802.3437835 ◽

2020 ◽

Author(s):

Nikita Pilnenskiy ◽

Ivan Smetannikov

Keyword(s):

Feature Selection ◽

High Dimensional Data ◽

High Dimensional ◽

Selection Algorithm ◽

Feature Selection Algorithm

On fuzzy feature selection in designing fuzzy classifiers for high-dimensional data

Evolving Systems ◽

10.1007/s12530-015-9142-4 ◽

2015 ◽

Vol 7 (4) ◽

pp. 255-265 ◽

Cited By ~ 6

Author(s):

Eghbal G. Mansoori ◽

Khadijeh S. Shafiee

Keyword(s):

Feature Selection ◽

High Dimensional Data ◽

High Dimensional ◽

Fuzzy Classifiers ◽

Fuzzy Feature Selection

A Hybrid Feature Selection Method Based on Symmetrical Uncertainty and Support Vector Machine for High-Dimensional Data Classification

Intelligent Information and Database Systems - Lecture Notes in Computer Science ◽

10.1007/978-3-319-54472-4_67 ◽

2017 ◽

pp. 721-727 ◽

Cited By ~ 2

Author(s):

Yongjun Piao ◽

Keun Ho Ryu

Keyword(s):

Support Vector Machine ◽

Feature Selection ◽

High Dimensional Data ◽

Feature Selection Method ◽

Data Classification ◽

Selection Method ◽

High Dimensional ◽

Support Vector ◽

Symmetrical Uncertainty

High dimensional data classification and feature selection using support vector machines

European Journal of Operational Research ◽

10.1016/j.ejor.2017.08.040 ◽

2018 ◽

Vol 265 (3) ◽

pp. 993-1004 ◽

Cited By ~ 63

Author(s):

Bissan Ghaddar ◽

Joe Naoum-Sawaya

Keyword(s):

Feature Selection ◽

Support Vector Machines ◽

High Dimensional Data ◽

Data Classification ◽

High Dimensional ◽

Support Vector ◽

Vector Machines

Opportunities and Challenges of Feature Selection Methods for High Dimensional Data: A Review

Ingénierie des systèmes d'information ◽

10.18280/isi.260107 ◽

2021 ◽

Vol 26 (1) ◽

pp. 67-77

Author(s):

Siva Sankari Subbiah ◽

Jayakumar Chinnappan

Keyword(s):

Feature Selection ◽

Big Data ◽

Large Scale ◽

High Dimensional Data ◽

Research Work ◽

Basic Feature ◽

High Dimensional ◽

Selection Methods ◽

Fast Development ◽

Improved Accuracy

Nowadays, organizations collect huge volumes of data without knowing whether the data are useful. The rapid development of the Internet helps organizations capture data in many different formats through the Internet of Things (IoT), social media, and other disparate sources. The dimensionality of datasets increases day by day at an extraordinary rate, resulting in large-scale, high-dimensional datasets. This paper reviews the opportunities and challenges of feature selection for processing high-dimensional data with reduced complexity and improved accuracy. In the modern big data world, feature selection plays a significant role in reducing dimensionality and overfitting in the learning process. Researchers have proposed many feature selection methods for obtaining more relevant features, especially from big datasets, which helps provide accurate learning results without degrading performance. The paper discusses the importance of feature selection, basic feature selection approaches, centralized and distributed big data processing using Hadoop and Spark, and the challenges of feature selection, and it summarizes related research. It concludes that big data analysis with feature selection improves learning accuracy.

Improved Nonnegative Matrix Factorization Based Feature Selection for High Dimensional Data Analysis

Proceedings of the 2nd International Conference on Computer Science and Electronics Engineering (ICCSEE 2013) ◽

10.2991/iccsee.2013.583 ◽

2013 ◽

Author(s):

Lincheng Jiang ◽

Wentang Tan ◽

Zhenwen Wang ◽

Fengjing Yin ◽

Bin Ge ◽

...

Keyword(s):

Feature Selection ◽

Data Analysis ◽

Matrix Factorization ◽

Nonnegative Matrix Factorization ◽

High Dimensional Data ◽

Nonnegative Matrix ◽

High Dimensional ◽

High Dimensional Data Analysis ◽

Selection For

An Ensemble Classification Method for High-Dimensional Data Using Neighborhood Rough Set

Complexity ◽

10.1155/2021/8358921 ◽

2021 ◽

Vol 2021 ◽

pp. 1-12

Author(s):

Jing Zhang ◽

Guang Lu ◽

Jiaquan Li ◽

Chuanwen Li

Keyword(s):

Feature Selection ◽

Rough Set ◽

Small Sample Size ◽

High Dimensional Data ◽

Classification Performance ◽

Small Sample ◽

Ensemble Classification ◽

High Dimensional ◽

Sample Classification ◽

Neighborhood Rough Set

Mining useful knowledge from high-dimensional data is a hot research topic. Efficient and effective sample classification and feature selection are challenging tasks because of the high dimensionality and small sample size of microarray data. Feature selection is necessary when constructing the model in order to reduce time and space consumption. Therefore, a feature selection model based on prior knowledge and rough sets is proposed. Pathway knowledge is used to select feature subsets, and a rough set based on intersection neighborhoods is then used to select the important features in each subset, since it can select features without redundancy and deals with numerical features directly. To improve the diversity among base classifiers and the efficiency of classification, only a subset of the base classifiers is selected. Classifiers are grouped into several clusters by k-means clustering using the proposed combination distance of Kappa-based diversity and accuracy. The base classifier with the best classification performance in each cluster is selected to generate the final ensemble model. Experimental results on three Arabidopsis thaliana stress response datasets showed that the proposed method achieved better classification performance than existing ensemble models.
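Two steps of this pipeline can be sketched in a few lines: measuring agreement between two base classifiers with Cohen's kappa, and keeping the most accurate classifier from each cluster. The abstract does not define the paper's combination distance, so it is not reproduced here. The cluster labels are assumed to come from a k-means run performed elsewhere, and all function names are hypothetical.

```python
from collections import Counter


def cohen_kappa(pred_a, pred_b):
    """Cohen's kappa between two classifiers' predictions on the same samples.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from the label frequencies.
    Lower kappa means the two classifiers are more diverse.
    """
    n = len(pred_a)
    if n == 0 or n != len(pred_b):
        raise ValueError("prediction lists must be non-empty and equal length")
    p_o = sum(a == b for a, b in zip(pred_a, pred_b)) / n
    freq_a, freq_b = Counter(pred_a), Counter(pred_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        # Both classifiers always predict the same single label.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


def best_per_cluster(accuracies, cluster_labels):
    """Index of the most accurate classifier in each cluster.

    accuracies[i] and cluster_labels[i] describe base classifier i;
    the cluster labels are an assumed input (e.g. from k-means).
    """
    best = {}
    for idx, (acc, label) in enumerate(zip(accuracies, cluster_labels)):
        if label not in best or acc > accuracies[best[label]]:
            best[label] = idx
    return best
```

For example, with accuracies `[0.8, 0.9, 0.7, 0.75]` and cluster labels `[0, 0, 1, 1]`, `best_per_cluster` keeps classifiers 1 and 3, one representative per cluster.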
