A Scalable Memetic Algorithm for Simultaneous Instance and Feature Selection

2014 ◽  
Vol 22 (1) ◽  
pp. 1-45 ◽  
Author(s):  
Nicolás García-Pedrajas ◽  
Aida de Haro-García ◽  
Javier Pérez-Rodríguez

Instance selection is becoming increasingly relevant due to the huge amount of data that is constantly produced in many fields of research. At the same time, most of the recent pattern recognition problems involve highly complex datasets with a large number of possible explanatory variables. For many reasons, this abundance of variables significantly harms classification or recognition tasks. There are efficiency issues, too, because the speed of many classification algorithms is largely improved when the complexity of the data is reduced. One of the approaches to address problems that have too many features or instances is feature or instance selection, respectively. Although most methods address instance and feature selection separately, both problems are interwoven, and benefits are expected from facing these two tasks jointly. This paper proposes a new memetic algorithm for dealing with many instances and many features simultaneously by performing joint instance and feature selection. The proposed method performs four different local search procedures with the aim of obtaining the most relevant subsets of instances and features to perform an accurate classification. A new fitness function is also proposed that enforces instance selection but avoids putting too much pressure on removing features. We prove experimentally that this fitness function improves the results in terms of testing error. Regarding the scalability of the method, an extension of the stratification approach is developed for simultaneous instance and feature selection. This extension allows the application of the proposed algorithm to large datasets. An extensive comparison using 55 medium to large datasets from the UCI Machine Learning Repository shows the usefulness of our method. Additionally, the method is applied to 30 large problems, with very good results. The accuracy of the method for class-imbalanced problems in a set of 40 datasets is shown. The usefulness of the method is also tested using decision trees and support vector machines as classification methods.

2020 ◽  
Author(s):  
Eleanor C. Williams ◽  
Anisoara Calinescu ◽  
Irina Mohorianu

AbstractmicroRNAs play a key role in RNA interference, the sequence-driven targeting of mRNAs that regulates their translation to proteins, through translation inhibition or the degradation of the mRNA. Around ~ 30% of animal genes may be tuned by microRNAs. The prediction of miRNA/mRNA interactions is hindered by the short length of the interaction (seed) region (~7- 8nt). We collate several large datasets overviewing validated interactions and propose feamiR, a novel pipeline comprising optimised classification approaches (Decision Trees/Random Forests and an efficient feature selection based on embryonic Genetic Algorithms used in conjunction with Support Vector Machines) aimed at identifying discriminative nucleotide features, on the seed, compensatory and flanking regions, that increase the prediction accuracy for interactions. Common and specific combinations of features illustrate differences between reference organisms, validation techniques or tissue/cell localisation. feamiR revealed new key positions that drive the miRNA/mRNA interactions, leading to novel questions on the mode-of-action of miRNAs.


2008 ◽  
Vol 15 (2) ◽  
pp. 203-218
Author(s):  
Luiz E. S. Oliveira ◽  
Paulo R. Cavalin ◽  
Alceu S. Britto Jr ◽  
Alessandro L. Koerich

This paper addresses the issue of detecting defects in Pine wood using features extracted from grayscale images. The feature set proposed here is based on the concept of texture and it is computed from the co-occurrence matrices. The features provide measures of properties such as smoothness, coarseness, and regularity. Comparative experiments using a color image based feature set extracted from percentile histograms are carried to demonstrate the efficiency of the proposed feature set. Two different learning paradigms, neural networks and support vector machines, and a feature selection algorithm based on multi-objective genetic algorithms were considered in our experiments. The experimental results show that after feature selection, the grayscale image based feature set achieves very competitive performance for the problem of wood defect detection relative to the color image based features.


: In this era of Internet, the issue of security of information is at its peak. One of the main threats in this cyber world is phishing attacks which is an email or website fraud method that targets the genuine webpage or an email and hacks it without the consent of the end user. There are various techniques which help to classify whether the website or an email is legitimate or fake. The major contributors in the process of detection of these phishing frauds include the classification algorithms, feature selection techniques or dataset preparation methods and the feature extraction that plays an important role in detection as well as in prevention of these attacks. This Survey Paper studies the effect of all these contributors and the approaches that are applied in the study conducted on the recent papers. Some of the classification algorithms that are implemented includes Decision tree, Random Forest , Support Vector Machines, Logistic Regression , Lazy K Star, Naive Bayes and J48 etc.


Sign in / Sign up

Export Citation Format

Share Document