A Novel Algorithm for Clustering and Feature Selection of High Dimensional Datasets

Thulasi Bikku; Alapati Priya

doi:10.18280/ama_b.600301

An Efficient hybrid filter-wrapper metaheuristic-based gene selection method for high dimensional datasets

Scientific Reports ◽

10.1038/s41598-019-54987-1 ◽

2019 ◽

Vol 9 (1) ◽

Author(s):

Jamshid Pirgazi ◽

Mohsen Alimoradi ◽

Tahereh Esmaeili Abharian ◽

Mohammad Hossein Olyaee

Keyword(s):

Feature Selection ◽

Large Scale ◽

Gene Selection ◽

Data Classification ◽

Convergence Time ◽

High Dimensional ◽

Compact Set ◽

Feature Selection Problem ◽

High Dimensional Datasets ◽

Selection Of

The feature selection problem is one of the most significant issues in data classification. The purpose of feature selection is to select the smallest number of features that increases accuracy and decreases the cost of data classification. In recent years, with the appearance of high-dimensional datasets that have a low number of samples, classification models have encountered the over-fitting problem. There is therefore a need for feature selection methods that remove extraneous and irrelevant features. Although various methods have recently been proposed for selecting the optimal subset of features with high precision, these methods have encountered problems such as instability, high convergence time, and selection of a semi-optimal solution as the final result. In other words, they have not been able to fully extract the effective features. In this paper, a hybrid method based on the IWSSr method and the Shuffled Frog Leaping Algorithm (SFLA) is proposed to select effective features in a large-scale gene dataset. The proposed algorithm is implemented in two phases: filtering and wrapping. In the filter phase, the Relief method is used for weighting features. Then, in the wrapping phase, the SFLA and IWSSr algorithms search for effective features in a feature-rich area. The proposed method is evaluated on standard gene expression datasets. The experimental results confirm that, compared with similar methods, the proposed approach achieves a more compact set of features along with high accuracy. The source code and testing datasets are available at https://github.com/jimy2020/SFLA_IWSSr-Feature-Selection.
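The filter phase described in this abstract weights features with Relief. The sketch below is a simplified, two-class Relief written for illustration only; it is not the authors' implementation (their code is at the repository linked above), and the function name `relief_weights`, the toy data, and the range-based scaling are assumptions of this example.

```python
import random


def relief_weights(X, y, n_iterations=None, seed=0):
    """Simplified Relief for two-class data with numeric features.

    X is a list of feature vectors, y a list of class labels.
    Returns one weight per feature; larger means more relevant.
    """
    n_samples, n_features = len(X), len(X[0])
    if n_iterations is None:
        n_iterations = n_samples

    # Per-feature value ranges, used to scale differences into [0, 1].
    spans = []
    for j in range(n_features):
        column = [row[j] for row in X]
        spans.append((max(column) - min(column)) or 1.0)

    def diff(j, a, b):
        return abs(X[a][j] - X[b][j]) / spans[j]

    def distance(a, b):
        return sum(diff(j, a, b) for j in range(n_features))

    rng = random.Random(seed)
    weights = [0.0] * n_features
    for _ in range(n_iterations):
        i = rng.randrange(n_samples)
        same = [k for k in range(n_samples) if k != i and y[k] == y[i]]
        other = [k for k in range(n_samples) if y[k] != y[i]]
        if not same or not other:
            continue
        hit = min(same, key=lambda k: distance(i, k))    # nearest same-class sample
        miss = min(other, key=lambda k: distance(i, k))  # nearest other-class sample
        # Reward features that differ on the miss, penalise those that differ on the hit.
        for j in range(n_features):
            weights[j] += (diff(j, i, miss) - diff(j, i, hit)) / n_iterations
    return weights


# Toy data (made up): feature 0 separates the classes, feature 1 is noise.
X = [[0.0, 0.3], [0.1, 0.9], [0.2, 0.1], [0.9, 0.8], [1.0, 0.2], [0.8, 0.5]]
y = [0, 0, 0, 1, 1, 1]
weights = relief_weights(X, y)
ranking = sorted(range(len(weights)), key=lambda j: weights[j], reverse=True)
print(ranking)
```

In a hybrid filter-wrapper scheme, a ranking like this would typically be used to narrow the candidate features before the more expensive wrapper search runs.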


Simultaneous Channel and Feature Selection of Fused EEG Features Based on Sparse Group Lasso

BioMed Research International ◽

10.1155/2015/703768 ◽

2015 ◽

Vol 2015 ◽

pp. 1-13 ◽

Cited By ~ 11

Author(s):

Jin-Jia Wang ◽

Fang Xue ◽

Hui Li

Keyword(s):

Feature Selection ◽

Feature Selection Method ◽

Group Lasso ◽

High Dimensional ◽

Test Accuracy ◽

Gradient Descent Method ◽

Feature Subset ◽

Eeg Signals ◽

Sparse Group Lasso ◽

Selection Of

Feature extraction and classification of EEG signals are core parts of brain computer interfaces (BCIs). Due to the high dimension of the EEG feature vector, an effective feature selection algorithm has become an integral part of research studies. In this paper, we present a new method based on a wrapped Sparse Group Lasso for channel and feature selection of fused EEG signals. The high-dimensional fused features are obtained first; they include the power spectrum, time-domain statistics, AR model, and wavelet coefficient features extracted from the preprocessed EEG signals. The wrapped channel and feature selection method is then applied, which uses the logistic regression model with a Sparse Group Lasso penalty function. The model is fitted on the training data, and parameter estimates are obtained by modified blockwise coordinate descent and coordinate gradient descent methods. The best parameters and feature subset are selected by using 10-fold cross-validation. Finally, the test data is classified using the trained model. Compared with existing channel and feature selection methods, results show that the proposed method is more suitable, more stable, and faster for high-dimensional feature fusion. It can simultaneously achieve channel and feature selection with a lower error rate. The test accuracy on data from the international BCI Competition IV reached 84.72%.


Ranking Based Unsupervised Feature Selection Methods: An Empirical Comparative Study in High Dimensional Datasets

Advances in Soft Computing - Lecture Notes in Computer Science ◽

10.1007/978-3-030-04491-6_16 ◽

2018 ◽

pp. 205-218

Author(s):

Saúl Solorio-Fernández ◽

J. Ariel Carrasco-Ochoa ◽

José Fco. Martínez-Trinidad

Keyword(s):

Feature Selection ◽

Comparative Study ◽

High Dimensional ◽

Selection Methods ◽

Unsupervised Feature Selection ◽

High Dimensional Datasets


Nested and Repeated Cross Validation for Classification Model With High-Dimensional Data

Revista Colombiana de Estadística ◽

10.15446/rce.v43n1.80000 ◽

2020 ◽

Vol 43 (1) ◽

pp. 103-125

Author(s):

Yi Zhong ◽

Jianghua He ◽

Prabhakar Chalise

Keyword(s):

Feature Selection ◽

Cross Validation ◽

Predictive Accuracy ◽

Simulated Data ◽

Classification Model ◽

High Dimensional ◽

Clinical Settings ◽

Feature Subset ◽

Validation Method ◽

High Dimensional Datasets

With the advent of high-throughput technologies, high-dimensional datasets are increasingly available. This has not only opened up new insight into biological systems but also posed analytical challenges. One important problem is the selection of an informative feature subset and the prediction of future outcomes. It is crucial that models are not overfitted and give accurate results with new data. In addition, reliable identification of informative features with high predictive power (feature selection) is of interest in clinical settings. We propose a two-step framework for feature selection and classification model construction, which utilizes a nested and repeated cross-validation method. We evaluated our approach using both simulated data and two publicly available gene expression datasets. The proposed method showed comparatively better predictive accuracy for new cases than the standard cross-validation method.
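The splitting pattern behind nested and repeated cross-validation can be sketched as follows. This only generates index splits and is not the authors' two-step framework; the function names, fold counts, and number of repeats are illustrative assumptions. The point it shows is that feature selection and tuning would see only the inner splits, which are drawn entirely from the outer training portion.

```python
import random


def k_fold_indices(indices, k, rng):
    """Shuffle `indices` and split them into k nearly equal folds."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def nested_cv_splits(n_samples, outer_k=5, inner_k=3, repeats=2, seed=0):
    """Yield (repeat, outer_train, outer_test, inner_splits).

    inner_splits is a list of (inner_train, inner_valid) pairs drawn only
    from outer_train, so the outer test fold never influences selection.
    """
    rng = random.Random(seed)
    for r in range(repeats):
        outer_folds = k_fold_indices(range(n_samples), outer_k, rng)
        for i, outer_test in enumerate(outer_folds):
            outer_train = [s for j, fold in enumerate(outer_folds) if j != i for s in fold]
            inner_folds = k_fold_indices(outer_train, inner_k, rng)
            inner_splits = []
            for m, inner_valid in enumerate(inner_folds):
                inner_train = [s for n, fold in enumerate(inner_folds) if n != m for s in fold]
                inner_splits.append((inner_train, inner_valid))
            yield r, outer_train, outer_test, inner_splits


splits = list(nested_cv_splits(n_samples=20))
print(len(splits))
```

In use, a feature subset and model settings would be chosen with the inner splits, the model refit on `outer_train`, and accuracy reported on `outer_test` only; repeating with fresh shuffles reduces the variance of that estimate.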


A Hybrid Scheme for Feature Selection of High Dimensional Educational Data

2019 International Conference on Communication Technologies (ComTech) ◽

10.1109/comtech.2019.8737829 ◽

2019 ◽

Cited By ~ 1

Author(s):

Usman Ali ◽

Khawaja Sarmad Arif ◽

Dr. Usman Qamar

Keyword(s):

Feature Selection ◽

High Dimensional ◽

Hybrid Scheme ◽

Selection Of


A Novel Granularity Optimal Feature Selection based on Multi-Variant Clustering for High Dimensional Data

Turkish Journal of Computer and Mathematics Education (TURCOMAT) ◽

10.17762/turcomat.v12i3.2031 ◽

2021 ◽

Vol 12 (3) ◽

pp. 5051-5062

Author(s):

Srinivas Kolli et al.

Keyword(s):

Feature Selection ◽

High Dimensional Data ◽

Classification Performance ◽

High Dimensional ◽

Second Phase ◽

Data Sets ◽

Aggressive Approach ◽

Related Data ◽

Optimal Feature ◽

Selection Of

Clustering is especially complex in multi/high-dimensional data because a subset of features must be selected from all the features present in categorical data sources. Feature subset selection is an aggressive approach to reducing feature dimensionality in data mining and pattern identification. The main aim of feature selection is to choose optimal features and decrease redundancy. To deal with redundant and irrelevant features in high-dimensional sample data, this paper describes a feature selection calculation based on data granularity. It proposes a Novel Granular Feature Multi-variant Clustering based Genetic Algorithm (NGFMCGA) model and evaluates its performance. The model consists of two phases: in the first phase, a graph-theoretic grouping procedure divides the features into different clusters; in the second phase, the most strongly representative feature is selected from each cluster with respect to matching of the feature subset. Because the features are selected from different clusters, they are largely independent, and the proposed clustering approach has a high probability of producing independent and useful features. Optimal feature subset selection improves the accuracy of clustering and feature classification. Applied to publicly available data sets and compared with traditional supervised evolutionary approaches, the proposed approach shows better accuracy with respect to optimal subset selection.


Swarm Intelligence Algorithms for Feature Selection: A Review

Applied Sciences ◽

10.3390/app8091521 ◽

2018 ◽

Vol 8 (9) ◽

pp. 1521 ◽

Cited By ~ 47

Author(s):

Lucija Brezočnik ◽

Iztok Fister ◽

Vili Podgorelec

Keyword(s):

Feature Selection ◽

Swarm Intelligence ◽

Optimization Problems ◽

Relevant Information ◽

High Dimensional ◽

Comprehensive Literature Review ◽

Open Questions ◽

Common Application ◽

High Dimensional Datasets ◽

Taxonomic Categories

The increasingly rapid creation, sharing and exchange of information nowadays confronts researchers and data scientists with the challenging task of analyzing data and extracting relevant information from it. To be able to learn from data, the dimensionality of the data should be reduced first. Feature selection (FS) can help to reduce the amount of data, but it is a very complex and computationally demanding task, especially in the case of high-dimensional datasets. Swarm intelligence (SI) has proven to be a technique that can solve NP-hard (Non-deterministic Polynomial time) computational problems. It is gaining popularity in solving different optimization problems and has been used successfully for FS in some applications. With the lack of comprehensive surveys in this field, it was our objective to fill the gap in coverage of SI algorithms for FS. We performed a comprehensive literature review of SI algorithms and provide a detailed overview of 64 different SI algorithms for FS, organized into eight major taxonomic categories. We propose a unified SI framework and use it to explain different approaches to FS. Different methods, techniques, and their settings, which have been used for various FS aspects, are explained. The datasets used most frequently for the evaluation of SI algorithms for FS are presented, as well as the most common application areas. Guidelines on how to develop SI approaches for FS are provided to support researchers and analysts in their data mining tasks and endeavors, and existing issues and open questions are discussed. In this manner, using the proposed framework and the provided explanations, one should be able to design an SI approach to be used for a specific FS problem.


A Master Slave Parallel Genetic Algorithm for Feature Selection in High Dimensional Datasets

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.c4184.098319 ◽

2019 ◽

Vol 8 (3) ◽

pp. 379-384

Keyword(s):

Genetic Algorithm ◽

Genetic Algorithms ◽

Feature Selection ◽

Information Gain ◽

Optimal Number ◽

Good Choice ◽

High Dimensional ◽

Parallel Genetic Algorithm ◽

Efficient Manner ◽

High Dimensional Datasets

Feature selection in high-dimensional datasets is a combinatorial problem, as it selects the optimal subset from N-dimensional data having 2^N possible subsets. Genetic Algorithms are generally a good choice for feature selection in large datasets, though for some high-dimensional problems they may take a widely varying amount of time: a few seconds, a few hours, or even a few days. Therefore, it is important to use Genetic Algorithms that can give quality results within a reasonably acceptable time limit. For this purpose, it is becoming necessary to implement Genetic Algorithms in an efficient manner. In this paper, a Master Slave Parallel Genetic Algorithm is implemented as a feature selection procedure to reduce the time complexity of the sequential genetic algorithm. This paper describes the speed gains in the parallel Master-Slave Genetic Algorithm and also discusses the theoretical analysis of the optimal number of slaves required for an efficient master-slave implementation. The experiments are performed on three high-dimensional gene expression datasets. As the Genetic Algorithm is a wrapper technique and takes more time to find the importance of any feature, the Information Gain technique is used first as a pre-processing task to remove the irrelevant features.
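The pre-processing step mentioned at the end of this abstract can be illustrated with a small information gain filter. This is a sketch for discrete features only, not the paper's code; the function names, the toy columns, and the `keep` cutoff are assumptions of this example. It also shows how dropping features shrinks the 2^N subset space that the genetic algorithm has to search.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of discrete labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy reduction in `labels` after splitting on a discrete feature."""
    total = len(labels)
    by_value = {}
    for value, label in zip(feature_values, labels):
        by_value.setdefault(value, []).append(label)
    conditional = sum(len(part) / total * entropy(part) for part in by_value.values())
    return entropy(labels) - conditional


def prefilter(columns, labels, keep):
    """Return the indices of the `keep` features with the highest gain."""
    gains = [information_gain(col, labels) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda j: gains[j], reverse=True)
    return sorted(ranked[:keep])


# Made-up data: one column per feature.
labels = [0, 0, 1, 1]
columns = [
    [0, 0, 1, 1],  # perfectly predicts the label
    [0, 1, 0, 1],  # independent of the label
    [0, 0, 0, 1],  # partially informative
]
kept = prefilter(columns, labels, keep=2)
search_space_before = 2 ** len(columns)  # every subset of the original features
search_space_after = 2 ** len(kept)      # every subset of the surviving features
print(kept, search_space_before, search_space_after)
```

Each feature removed by the filter halves the number of candidate subsets, which is why a cheap ranking step ahead of a wrapper method can pay off on gene expression data with thousands of features.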


A sequential cosine similarity based feature selection technique for high dimensional datasets

2015 39th National Systems Conference (NSC) ◽

10.1109/natsys.2015.7489113 ◽

2015 ◽

Cited By ~ 3

Author(s):

Vimal Kumar Dubey ◽

Amit Kumar Saxena

Keyword(s):

Feature Selection ◽

Cosine Similarity ◽

High Dimensional ◽

Feature Selection Technique ◽

Selection Technique ◽

High Dimensional Datasets


A hybrid algorithm based on binary chemical reaction optimization and tabu search for feature selection of high-dimensional biomedical data

Tsinghua Science & Technology ◽

10.26599/tst.2018.9010101 ◽

2018 ◽

Vol 23 (6) ◽

pp. 733-743 ◽

Cited By ~ 6

Author(s):

Chaokun Yan ◽

Jingjing Ma ◽

Huimin Luo ◽

Jianxin Wang

Keyword(s):

Feature Selection ◽

Tabu Search ◽

Chemical Reaction ◽

Hybrid Algorithm ◽

High Dimensional ◽

Biomedical Data ◽

Chemical Reaction Optimization ◽

Reaction Optimization ◽

Selection Of
