A Novel Algorithm for Clustering and Feature Selection of High Dimensional Datasets

2017 ◽  
Vol 60 (3) ◽  
pp. 525-538
Author(s):  
Thulasi Bikku ◽  
Alapati Priya
2019 ◽  
Vol 9 (1) ◽  
Author(s):  
Jamshid Pirgazi ◽  
Mohsen Alimoradi ◽  
Tahereh Esmaeili Abharian ◽  
Mohammad Hossein Olyaee

The feature selection problem is one of the most significant issues in data classification. The purpose of feature selection is to select the smallest number of features that increases accuracy and decreases the cost of data classification. In recent years, the appearance of high-dimensional datasets with a small number of samples has caused classification models to suffer from over-fitting. There is therefore a need for feature selection methods that remove redundant and irrelevant features. Although various methods have recently been proposed for selecting an optimal subset of features with high precision, they suffer from problems such as instability, long convergence time, and selection of a semi-optimal solution as the final result. In other words, they have not been able to fully extract the effective features. In this paper, a hybrid method based on the IWSSr method and the Shuffled Frog Leaping Algorithm (SFLA) is proposed to select effective features in a large-scale gene dataset. The proposed algorithm is implemented in two phases: filtering and wrapping. In the filter phase, the Relief method is used to weight features. Then, in the wrapping phase, the SFLA and IWSSr algorithms search for effective features in a feature-rich area. The proposed method is evaluated on standard gene expression datasets. The experimental results confirm that, compared with similar methods, the proposed approach achieves a more compact set of features along with high accuracy. The source code and testing datasets are available at https://github.com/jimy2020/SFLA_IWSSr-Feature-Selection.
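The two-phase idea in this abstract (rank features with a filter, then let a wrapper decide which ranked features to keep) can be illustrated with a minimal sketch. It is not the authors' method: the class-mean gap stands in for Relief weights, a plain incremental wrapper (no replacement step, no SFLA search) stands in for IWSSr, and the nearest-centroid classifier, the toy data, and all function names are assumptions made for illustration.

```python
from statistics import mean

def filter_scores(X, y):
    """Filter phase: score each feature by the absolute gap between class means.
    Assumption: a simple stand-in for Relief weights, not Relief itself."""
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append(abs(mean(pos) - mean(neg)))
    return scores

def loo_accuracy(X, y, subset):
    """Leave-one-out accuracy of a nearest-centroid classifier on `subset`."""
    correct = 0
    for i in range(len(X)):
        centroids = {}
        for label in sorted(set(y)):
            rows = [X[k] for k in range(len(X)) if k != i and y[k] == label]
            centroids[label] = [mean(r[j] for r in rows) for j in subset]
        point = [X[i][j] for j in subset]
        pred = min(
            centroids,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(point, centroids[c])),
        )
        correct += int(pred == y[i])
    return correct / len(X)

def incremental_wrapper(X, y):
    """Wrapper phase: walk the filter ranking and keep a feature only if it
    strictly improves the wrapper's accuracy estimate."""
    scores = filter_scores(X, y)
    ranking = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    selected, best = [], 0.0
    for j in ranking:
        acc = loo_accuracy(X, y, selected + [j])
        if acc > best:
            selected, best = selected + [j], acc
    return selected, best

# Hypothetical toy data: feature 0 separates the classes, features 1 and 2 are noise.
X = [
    [1.0, 5.0, 0.2],
    [1.2, 3.0, 0.9],
    [0.9, 4.0, 0.4],
    [3.0, 4.5, 0.3],
    [3.2, 3.5, 0.8],
    [2.9, 5.5, 0.5],
]
y = [0, 0, 0, 1, 1, 1]

selected, accuracy = incremental_wrapper(X, y)
print(selected, accuracy)
```

On this toy data the wrapper keeps only the first feature, because adding either noise feature cannot raise an accuracy estimate that is already perfect. This is the "compact subset" behaviour the abstract describes, shown at a very small scale.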


2015 ◽  
Vol 2015 ◽  
pp. 1-13 ◽  
Author(s):  
Jin-Jia Wang ◽  
Fang Xue ◽  
Hui Li

Feature extraction and classification of EEG signals are core parts of brain computer interfaces (BCIs). Due to the high dimension of the EEG feature vector, an effective feature selection algorithm has become an integral part of research studies. In this paper, we present a new method based on a wrapped Sparse Group Lasso for channel and feature selection of fused EEG signals. The high-dimensional fused features are obtained first; they include the power spectrum, time-domain statistics, AR model, and wavelet coefficient features extracted from the preprocessed EEG signals. The wrapped channel and feature selection method is then applied, which uses a logistic regression model with a Sparse Group Lasso penalty function. The model is fitted on the training data, and the parameters are estimated by modified blockwise coordinate descent and coordinate gradient descent methods. The best parameters and feature subset are selected using 10-fold cross-validation. Finally, the test data are classified using the trained model. Compared with existing channel and feature selection methods, results show that the proposed method is more suitable, more stable, and faster for high-dimensional feature fusion. It can simultaneously achieve channel and feature selection with a lower error rate. The test accuracy on the data from the international BCI Competition IV reached 84.72%.
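To make the penalty concrete, the sketch below evaluates a Sparse Group Lasso penalty for a coefficient vector whose groups play the role of EEG channels. The abstract does not give the exact weighting, so the commonly used form lam * ((1 - alpha) * sum_g sqrt(p_g) * ||beta_g||_2 + alpha * ||beta||_1) is assumed here. The function name, the grouping, and the numbers are hypothetical.

```python
import math

def sparse_group_lasso_penalty(beta, groups, lam, alpha):
    """Assumed form of the penalty:
    lam * ((1 - alpha) * sum_g sqrt(p_g) * ||beta_g||_2 + alpha * ||beta||_1),
    where each group g lists the coefficient indices of one channel."""
    l1_term = sum(abs(b) for b in beta)
    group_term = 0.0
    for idx in groups:
        group_norm = math.sqrt(sum(beta[i] ** 2 for i in idx))
        group_term += math.sqrt(len(idx)) * group_norm
    return lam * ((1 - alpha) * group_term + alpha * l1_term)

# Hypothetical example: two channels with two features each.
# The second channel is entirely zero, so it adds nothing to either term.
beta = [3.0, 4.0, 0.0, 0.0]
groups = [[0, 1], [2, 3]]

mixed = sparse_group_lasso_penalty(beta, groups, lam=1.0, alpha=0.5)
pure_lasso = sparse_group_lasso_penalty(beta, groups, lam=1.0, alpha=1.0)
print(mixed, pure_lasso)
```

The group term can drive a whole channel to zero, while the L1 term can zero individual features inside a surviving channel. That combination is what allows channel selection and feature selection to happen together.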


2020 ◽  
Vol 43 (1) ◽  
pp. 103-125
Author(s):  
Yi Zhong ◽  
Jianghua He ◽  
Prabhakar Chalise

With the advent of high-throughput technologies, high-dimensional datasets are increasingly available. This has not only opened up new insight into biological systems but also posed analytical challenges. One important problem is the selection of an informative feature subset and the prediction of future outcomes. It is crucial that models are not overfitted and give accurate results with new data. In addition, reliable identification of informative features with high predictive power (feature selection) is of interest in clinical settings. We propose a two-step framework for feature selection and classification model construction, which utilizes a nested and repeated cross-validation method. We evaluated our approach using both simulated data and two publicly available gene expression datasets. The proposed method showed comparatively better predictive accuracy for new cases than the standard cross-validation method.
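The structure of nested and repeated cross-validation can be sketched as an index generator. The abstract does not state fold counts or how the two steps use the splits, so the values below and the helper names are assumptions. The point illustrated is only that feature selection and tuning see the inner splits, while each outer test fold is held back for the final accuracy estimate.

```python
import random

def k_fold_indices(indices, k, rng):
    """Shuffle a copy of `indices` and deal it into k roughly equal folds."""
    shuffled = indices[:]
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]

def nested_repeated_cv(n_samples, outer_k=5, inner_k=3, repeats=2, seed=0):
    """Yield (repeat, outer_fold, train_idx, test_idx, inner_splits).
    inner_splits is a list of (fit_idx, val_idx) pairs drawn only from train_idx."""
    rng = random.Random(seed)
    for rep in range(repeats):
        outer_folds = k_fold_indices(list(range(n_samples)), outer_k, rng)
        for o, test_idx in enumerate(outer_folds):
            train_idx = [i for f in outer_folds if f is not test_idx for i in f]
            inner_folds = k_fold_indices(train_idx, inner_k, rng)
            inner_splits = []
            for val_idx in inner_folds:
                fit_idx = [i for f in inner_folds if f is not val_idx for i in f]
                inner_splits.append((fit_idx, val_idx))
            yield rep, o, train_idx, test_idx, inner_splits

# Hypothetical sizes: 20 samples, 5 outer folds, 3 inner folds, 2 repeats.
splits = list(nested_repeated_cv(20))
print(len(splits))
```

Repeating the whole procedure with fresh shuffles reduces the dependence of the estimate on a single random partition, which matters most when the number of samples is small.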


Author(s):  
Srinivas Kolli et al.

Clustering is especially complex for multi- and high-dimensional data because a subset of features must be selected from all the features present in categorical data sources. Feature subset selection is an aggressive approach to reducing feature dimensionality in data mining and pattern identification. Its main aims are to select an optimal feature subset and to reduce redundancy. To deal with redundant and irrelevant features in high-dimensional sample data, this paper describes an exploration based on feature selection computed over data granules. We propose a Novel Granular Feature Multi-variant Clustering based Genetic Algorithm (NGFMCGA) model and evaluate its performance. The model consists of two phases: in the first phase, a graph-theoretic grouping procedure divides the features into clusters; in the second phase, the most strongly representative feature is selected from each cluster to form the feature subset. Because the selected features come from different clusters, they are relatively independent, so the clustering-based approach has a high probability of producing independent and useful features. Optimal feature subset selection improves the accuracy of clustering and feature classification. The proposed approach is applied to publicly available datasets and compared with traditional supervised evolutionary approaches, and it achieves better accuracy.
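A rough sketch of the two phases (group similar features, then keep one representative per group) is given below. The abstract does not specify the graph construction or the selection criterion, so everything here is an assumption: features are linked when their absolute Pearson correlation reaches a threshold, clusters are the connected components of that graph, and the representative is the feature most correlated with the target. No genetic algorithm is involved in this sketch.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / sqrt(va * vb)

def cluster_features(columns, threshold=0.9):
    """Phase 1 (assumed): connected components of the graph that links
    two features when |correlation| >= threshold."""
    n = len(columns)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if abs(pearson(columns[i], columns[j])) >= threshold:
                parent[find(j)] = find(i)
    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())

def pick_representatives(columns, target, clusters):
    """Phase 2 (assumed): from each cluster keep the feature with the
    largest absolute correlation to the target."""
    return [max(c, key=lambda j: abs(pearson(columns[j], target))) for c in clusters]

# Hypothetical data, one list per feature; feature 1 nearly duplicates feature 0.
columns = [
    [1, 2, 3, 4, 5, 6],
    [2, 4, 6, 8, 10, 12.5],
    [5, 1, 4, 2, 6, 3],
    [3, 6, 1, 5, 2, 4],
]
target = [0, 0, 0, 1, 1, 1]

clusters = cluster_features(columns)
representatives = pick_representatives(columns, target, clusters)
print(clusters, representatives)
```

Because only one feature survives from each cluster, the near-duplicate pair contributes a single feature to the final subset. This mirrors the redundancy reduction the abstract attributes to selecting features from different clusters.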


2018 ◽  
Vol 8 (9) ◽  
pp. 1521 ◽  
Author(s):  
Lucija Brezočnik ◽  
Iztok Fister ◽  
Vili Podgorelec

The increasingly rapid creation, sharing and exchange of information nowadays confronts researchers and data scientists with the challenging task of analyzing data and extracting relevant information from it. To be able to learn from data, the dimensionality of the data should be reduced first. Feature selection (FS) can help to reduce the amount of data, but it is a very complex and computationally demanding task, especially in the case of high-dimensional datasets. Swarm intelligence (SI) has been shown to be a technique that can solve NP-hard (Non-deterministic Polynomial time) computational problems. It is gaining popularity in solving different optimization problems and has been used successfully for FS in some applications. Given the lack of comprehensive surveys in this field, it was our objective to fill the gap in coverage of SI algorithms for FS. We performed a comprehensive literature review of SI algorithms and provide a detailed overview of 64 different SI algorithms for FS, organized into eight major taxonomic categories. We propose a unified SI framework and use it to explain different approaches to FS. The different methods, techniques, and settings that have been used for various aspects of FS are explained. The datasets used most frequently for the evaluation of SI algorithms for FS are presented, as well as the most common application areas. Guidelines on how to develop SI approaches for FS are provided to support researchers and analysts in their data mining tasks and endeavors, and existing issues and open questions are discussed. In this manner, using the proposed framework and the provided explanations, one should be able to design an SI approach to be used for a specific FS problem.


Feature selection in high-dimensional datasets is a combinatorial problem: it seeks the optimal subset of features from N-dimensional data, which has 2^N possible subsets. Genetic Algorithms are generally a good choice for feature selection in large datasets, though for some high-dimensional problems the running time varies widely, from a few seconds to a few hours or even a few days. It is therefore important to use Genetic Algorithms that give quality results within a reasonable time limit, and so it is becoming necessary to implement them efficiently. In this paper, a Master-Slave Parallel Genetic Algorithm is implemented as a feature selection procedure to reduce the running time of the sequential genetic algorithm. The paper describes the speed gains of the parallel Master-Slave Genetic Algorithm and presents a theoretical analysis of the optimal number of slaves required for an efficient master-slave implementation. The experiments are performed on three high-dimensional gene expression datasets. Because the Genetic Algorithm is a wrapper technique and takes more time to assess the importance of a feature, the Information Gain technique is used first as a pre-processing step to remove irrelevant features.
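The master-slave pattern can be sketched as follows: the master keeps the population and performs selection, crossover, and mutation, while fitness evaluation is farmed out to a pool of workers. This is a toy illustration under stated assumptions, not the paper's implementation. The fitness function, the set of "informative" features, and all parameter values are hypothetical, and a thread pool is used only to keep the sketch self-contained (a real CPU-bound wrapper fitness would more likely run in separate processes or on separate machines).

```python
import random
from concurrent.futures import ThreadPoolExecutor

N_FEATURES = 12
N_SUBSETS = 2 ** N_FEATURES   # size of the search space: 2^N candidate subsets
INFORMATIVE = {0, 3, 7}       # hypothetical "useful" features for the toy fitness

def fitness(mask):
    """Toy fitness: reward informative features, penalise subset size.
    A real wrapper would train and validate a classifier here."""
    hits = sum(1 for j in INFORMATIVE if mask[j])
    return hits - 0.1 * sum(mask)

def evolve(generations=30, pop_size=20, workers=4, seed=1):
    rng = random.Random(seed)
    pop = [tuple(rng.randint(0, 1) for _ in range(N_FEATURES)) for _ in range(pop_size)]
    history = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(generations):
            # Slaves evaluate individuals; the master waits for all scores.
            scores = list(pool.map(fitness, pop))
            history.append(max(scores))
            ranked = [m for _, m in sorted(zip(scores, pop), reverse=True)]
            parents = ranked[: pop_size // 2]
            children = [ranked[0]]  # elitism: carry the best individual forward
            while len(children) < pop_size:
                a, b = rng.sample(parents, 2)
                cut = rng.randrange(1, N_FEATURES)   # one-point crossover
                child = list(a[:cut] + b[cut:])
                if rng.random() < 0.2:               # occasional bit-flip mutation
                    flip = rng.randrange(N_FEATURES)
                    child[flip] = 1 - child[flip]
                children.append(tuple(child))
            pop = children
    return max(pop, key=fitness), history

best_mask, history = evolve()
print(N_SUBSETS, best_mask, history[-1])
```

Only the fitness calls run in parallel, so the achievable speed-up depends on how expensive one evaluation is relative to the master's bookkeeping and communication. That trade-off is the reason an optimal number of slaves exists at all.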

