Utility metric for unsupervised feature selection

2021 ◽ Vol 7 ◽ pp. e477
Author(s): Amalia Villa, Abhijith Mundanad Narayanan, Sabine Van Huffel, Alexander Bertrand, Carolina Varon

Feature selection techniques are useful approaches for dimensionality reduction in data analysis. They provide interpretable results by reducing the data to a subset of the original set of features. When the data lack annotations, unsupervised feature selectors are required. Several algorithms exist for this purpose, but despite their broad applicability they can be inaccessible or cumbersome to use, mainly because they require tuning non-intuitive parameters and are computationally demanding. In this work, a publicly available, ready-to-use unsupervised feature selector is proposed, with results comparable to the state-of-the-art at a much lower computational cost. The suggested approach belongs to the family of spectral feature selectors. These methods generally consist of two stages: manifold learning and subset selection. In the first stage, the underlying structures in the high-dimensional data are extracted, while in the second stage a subset of the features is selected to replicate these structures. This paper makes a contribution to each of these stages. In the manifold learning stage, the effect of non-linearities in the data is explored using a radial basis function (RBF) kernel, and an alternative estimator of the kernel parameter is presented for high-dimensional data. In the subset selection stage, a backward greedy approach based on the least-squares utility metric is proposed. The combination of these ingredients results in the Utility metric for Unsupervised Feature Selection (U2FS) algorithm. The proposed U2FS algorithm succeeds in selecting the correct features in a simulation environment, and its performance on benchmark datasets is comparable to the state-of-the-art while requiring less computational time. Moreover, unlike the state-of-the-art, U2FS does not require any parameter tuning.
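
The two-stage idea described above can be illustrated with a minimal Python sketch. The sketch below uses an RBF-kernel graph Laplacian for the manifold-learning stage and, for simplicity, a forward greedy least-squares selection instead of the backward utility-based selection described in the paper; the median-distance kernel-width heuristic and all function and parameter names are illustrative assumptions, not the authors' U2FS implementation.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.linalg import eigh, lstsq

def spectral_feature_selection(X, n_select, n_eigvecs=5):
    """Toy two-stage spectral feature selector.

    Stage 1: manifold learning with an RBF-kernel graph Laplacian.
    Stage 2: greedy least-squares selection of features that best
             reconstruct the leading eigenvectors of the Laplacian.
    """
    n, d = X.shape

    # --- Stage 1: manifold learning ---
    D = cdist(X, X, "sqeuclidean")
    sigma2 = np.median(D[D > 0])          # simple heuristic for the kernel width
    W = np.exp(-D / sigma2)               # RBF affinity matrix
    L = np.diag(W.sum(axis=1)) - W        # unnormalized graph Laplacian
    # The smallest non-trivial eigenvectors encode the manifold structure.
    _, U = eigh(L, subset_by_index=[1, n_eigvecs])

    # --- Stage 2: greedy subset selection ---
    selected, remaining = [], list(range(d))
    for _ in range(n_select):
        best_f, best_err = None, np.inf
        for f in remaining:
            cols = X[:, selected + [f]]
            # Least-squares fit of the eigenvectors with the candidate subset.
            residual = U - cols @ lstsq(cols, U)[0]
            err = np.linalg.norm(residual)
            if err < best_err:
                best_f, best_err = f, err
        selected.append(best_f)
        remaining.remove(best_f)
    return selected
```

On toy data, `spectral_feature_selection(X, n_select=3)` would return the indices of the three features whose least-squares fit best reproduces the leading Laplacian eigenvectors.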

2018 ◽ Vol 14 (3) ◽ pp. 38-55
Author(s): Kavan Fatehi, Mohsen Rezvani, Mansoor Fateh, Mohammad-Reza Pajoohan

Because of the curse of dimensionality in high-dimensional data, a significant amount of research has recently been conducted on subspace clustering, which aims to discover clusters embedded in any possible combination of attributes. The main goal of subspace clustering algorithms is to find all clusters in all subspaces. Previous studies have mostly generated redundant subspace clusters, which reduces clustering accuracy and increases the running time of the algorithms. This article suggests a bottom-up density-based approach in which the cluster structure serves as a similarity measure for generating the optimal subspaces, raising the accuracy of the subspace clustering. Based on this idea, the algorithm discovers similar subspaces by comparing their cluster structures, combines them, and re-clusters the data in the new subspaces. Finally, the algorithm determines all the subspaces and finds all clusters within them. Experiments on various synthetic and real datasets show that the proposed approach is significantly better in quality and runtime than the state-of-the-art for clustering high-dimensional data.
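
As a rough illustration of this bottom-up idea (not the article's actual algorithm), the Python sketch below clusters every one-dimensional subspace with DBSCAN, uses the adjusted Rand index between two subspaces' labelings as the cluster-structure similarity, merges sufficiently similar subspaces, and re-clusters the data in the merged subspaces. It stops at two-dimensional subspaces, whereas the article iterates further; the DBSCAN parameters, the similarity threshold, and the function name are assumptions made for illustration.

```python
from itertools import combinations
from sklearn.cluster import DBSCAN
from sklearn.metrics import adjusted_rand_score

def bottom_up_subspace_clustering(X, eps=0.3, min_samples=5, sim_threshold=0.6):
    """Toy bottom-up subspace clustering driven by cluster-structure similarity."""
    d = X.shape[1]

    # Step 1: cluster every one-dimensional subspace.
    labels_1d = {
        (f,): DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X[:, [f]])
        for f in range(d)
    }

    # Step 2: merge pairs of subspaces whose cluster structures agree (high ARI).
    merged = {}
    for (a,), (b,) in combinations(labels_1d, 2):
        if adjusted_rand_score(labels_1d[(a,)], labels_1d[(b,)]) >= sim_threshold:
            # Step 3: re-cluster the data in the combined subspace.
            merged[(a, b)] = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(
                X[:, [a, b]]
            )

    # Return the 1-D clusterings plus those of the merged 2-D subspaces.
    return {**labels_1d, **merged}
```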


PLoS ONE ◽ 2021 ◽ Vol 16 (2) ◽ pp. e0246159
Author(s): Rahi Jain, Wei Xu

Feature selection on high-dimensional data, along with interaction effects, is a critical challenge for classical statistical learning techniques. Existing feature selection algorithms such as random LASSO leverage LASSO's capability to handle high-dimensional data. However, the technique has two main limitations: the inability to consider interaction terms, and the lack of a statistical test for determining the significance of the selected features. This study proposes the High Dimensional Selection with Interactions (HDSI) algorithm, a new feature selection method that can handle high-dimensional data, incorporate interaction terms, provide statistical inference for the selected features, and leverage the capability of existing classical statistical techniques. The method allows any statistical technique, such as LASSO or subset selection, to be applied to multiple bootstrapped samples, each containing a randomly selected subset of features. Each bootstrap sample incorporates interaction terms for the randomly sampled features. The features selected from each model are pooled and their statistical significance is determined. The statistically significant features form the final output of the approach, and their final coefficients are estimated using appropriate statistical techniques. The performance of HDSI is evaluated using both simulated data and real studies. In general, HDSI outperforms commonly used algorithms such as LASSO, subset selection, adaptive LASSO, random LASSO and group LASSO.
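
A minimal Python sketch of an HDSI-style workflow is shown below: repeated bootstrap samples with randomly chosen features and their pairwise interactions, LASSO on each sample, pooling of the selected terms, and an ordinary-least-squares significance test on the pooled candidates. The selection-frequency threshold, the OLS-based test, and all names and parameters are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np
from itertools import combinations
from collections import Counter
import statsmodels.api as sm
from sklearn.linear_model import LassoCV

def _design(X, terms):
    # Single features enter as-is; pairs enter as elementwise products (interactions).
    return np.column_stack(
        [X[:, list(t)].prod(axis=1) if len(t) > 1 else X[:, t[0]] for t in terms]
    )

def hdsi_like_selection(X, y, n_boot=100, n_feats=10, alpha=0.05, seed=0):
    """Toy HDSI-style selection: bootstraps + random feature subsets +
    pairwise interactions + LASSO, then a significance test on the pooled terms."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    votes = Counter()

    for _ in range(n_boot):
        rows = rng.integers(0, n, size=n)                           # bootstrap sample
        feats = rng.choice(d, size=min(n_feats, d), replace=False)  # random feature subset
        terms = [(int(f),) for f in feats] + [
            (int(a), int(b)) for a, b in combinations(feats, 2)
        ]
        coefs = LassoCV(cv=5).fit(_design(X[rows], terms), y[rows]).coef_
        votes.update(t for t, c in zip(terms, coefs) if abs(c) > 1e-8)

    # Keep terms that were selected often, then test them jointly with OLS.
    candidates = [t for t, v in votes.items() if v >= 0.25 * n_boot]
    ols = sm.OLS(y, sm.add_constant(_design(X, candidates))).fit()
    return [t for t, p in zip(candidates, ols.pvalues[1:]) if p < alpha]
```

Returned terms are feature-index tuples, e.g. `(3,)` for a main effect and `(3, 7)` for an interaction between features 3 and 7.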

