Approach to the Selection of Significant Features in Solving Biomedical Problems of Binary Classification of Microarray Data

Author(s):  
I.Y. Boyko ◽  
D.S. Anisimov ◽  
L.L. Smolyakova ◽  
M.A. Ryazanov

In modern biomedical research aimed at methods for the early diagnosis of cancer, microarrays containing biological information about patients are widely used. Based on these data, each patient is assigned to one of two classes, corresponding to the presence or absence of a diagnosis. One of the steps with a decisive influence on classification quality is the selection of significant features. This paper proposes a selection criterion based on the ledge-coefficient of correlation, which was previously used to estimate the degree of interrelation between numerical and binary features. For two microarray data sets, comparative examples of binary classification are presented using three feature selection algorithms, three dimensionality reduction methods, and six classification models. The ledge-criterion yielded classification quality comparable to that of common feature selection methods such as the t-test and U-test. For the peptide microarray data set considered in the paper, the effectiveness of the projection to latent structures method had previously been established; combining this method with ledge-criterion feature selection produced a higher classification quality measure.
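The abstract does not define the ledge-coefficient itself, but the t-test and U-test baselines it is compared against are standard. As a hedged sketch of the selection step under those baselines (synthetic data and a hypothetical `select_features` helper, not the authors' code):

```python
import numpy as np
from scipy.stats import ttest_ind, mannwhitneyu

def select_features(X, y, k=10, method="t"):
    """Rank features of a binary-labelled matrix X by test p-value
    and return the column indices of the k most significant ones."""
    a, b = X[y == 0], X[y == 1]
    pvals = []
    for j in range(X.shape[1]):
        if method == "t":
            p = ttest_ind(a[:, j], b[:, j], equal_var=False).pvalue
        else:  # Mann-Whitney U-test
            p = mannwhitneyu(a[:, j], b[:, j]).pvalue
        pvals.append(p)
    return np.argsort(pvals)[:k]

# Synthetic "microarray": 60 patients, 50 features, first 3 informative.
rng = np.random.default_rng(0)
y = np.array([0] * 30 + [1] * 30)
X = rng.normal(size=(60, 50))
X[y == 1, :3] += 2.0
top_t = select_features(X, y, k=3, method="t")
top_u = select_features(X, y, k=3, method="u")
```

Both tests recover the three shifted features here; the ledge-criterion would slot into the same ranking loop in place of the p-value.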

2017 ◽  
Vol 24 (1) ◽  
pp. 3-37 ◽  
Author(s):  
SANDRA KÜBLER ◽  
CAN LIU ◽  
ZEESHAN ALI SAYYED

Abstract
We investigate feature selection methods for machine learning approaches to sentiment analysis. More specifically, we use data from the cooking platform Epicurious and attempt to predict ratings for recipes based on user reviews. In machine learning approaches to such tasks, it is common to use word or part-of-speech n-grams as features. This results in a large feature set, of which only a small subset may be good indicators of sentiment. One question we investigate concerns the extension of feature selection methods from a binary classification setting to a multi-class problem. We show that an inherently multi-class approach, multi-class information gain, outperforms ensembles of binary methods. We also investigate how to mitigate the effects of extreme skewing in our data set by making our features more robust and by using review and recipe sampling. We show that over-sampling is the best method for boosting performance on the minority classes, but it also causes a severe drop in overall accuracy of at least 6 percentage points.


2017 ◽  
Author(s):  
Magdalena E Strauß ◽  
John E Reid ◽  
Lorenz Wernisch

Abstract
Motivation: A number of pseudotime methods have provided point estimates of the ordering of cells for scRNA-seq data. A still limited number of methods also model the uncertainty of the pseudotime estimate. However, there is still a need for a method that samples from complicated and multi-modal distributions of orders, and that estimates changes in the uncertainty of the order over the course of a biological development, as this can support the selection of suitable cells for the clustering of genes or for network inference.
Results: In an application to a microarray data set, our proposed method, GPseudoRank, identifies two modes of the distribution, each corresponding to a point estimate of the order obtained by a different established method. In an application to scRNA-seq data, we demonstrate the potential of GPseudoRank to identify phases of lower and higher pseudotime uncertainty during a biological process. GPseudoRank also correctly identifies cells precocious in their antiviral response.
Availability and implementation: Our method is available on GitHub: https://github.com/magStra/GPseudoRank.
Contact: magdalena.strauss@mrc-bsu.cam.ac.uk
Supplementary information: Supplementary materials are available.
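The idea of phases of lower and higher pseudotime uncertainty can be illustrated on synthetic order samples (a generic sketch, not the GPseudoRank sampler): given posterior samples of cell orderings, the spread of each cell's rank across samples measures its positional uncertainty.

```python
import numpy as np

# Hypothetical posterior samples of cell orderings, e.g. drawn by an
# MCMC sampler: each row is one sampled permutation of cell indices.
rng = np.random.default_rng(1)
n_cells, n_samples = 20, 500
samples = np.empty((n_samples, n_cells), dtype=int)
for s in range(n_samples):
    order = np.arange(n_cells)
    tail = order[10:].copy()
    rng.shuffle(tail)          # the last 10 cells are poorly ordered
    order[10:] = tail
    samples[s] = order

# Position (rank) of each cell in each sampled order.
ranks = np.argsort(samples, axis=1)
# Per-cell uncertainty: spread of the cell's rank across the samples.
uncertainty = ranks.std(axis=0)
```

Here the first ten cells have zero rank spread (a low-uncertainty phase) while the last ten show positive spread, mirroring the phases the method is designed to detect.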


2017 ◽  
Vol 2017 ◽  
pp. 1-18 ◽  
Author(s):  
Andrea Bommert ◽  
Jörg Rahnenführer ◽  
Michel Lang

Finding a good predictive model for a high-dimensional data set can be challenging. For genetic data, it is not only important to find a model with high predictive accuracy; it is also important that the model uses few features and that the selection of these features is stable. This is because, in bioinformatics, models are used not only for prediction but also for drawing biological conclusions, which makes the interpretability and reliability of the model crucial. We suggest using three target criteria when fitting a predictive model to a high-dimensional data set: the classification accuracy, the stability of the feature selection, and the number of chosen features. As it is unclear which measure is best for evaluating stability, we first compare a variety of stability measures. We conclude that the Pearson correlation has the best theoretical and empirical properties. We also find that, for the assessment of stability, it is most important that a measure contains a correction for chance or for large numbers of chosen features. We then analyse Pareto fronts and conclude that it is possible to find models with a stable selection of few features without losing much predictive accuracy.
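The Pearson-correlation stability measure can be sketched by encoding each resampled feature selection as a 0/1 indicator vector and averaging pairwise correlations; subtracting the means builds in the correction for chance the authors highlight (a minimal illustration, not the paper's exact implementation):

```python
import numpy as np
from itertools import combinations

def selection_stability(selections, n_features):
    """Mean pairwise Pearson correlation between the 0/1 indicator
    vectors of feature sets selected on different resamples."""
    vecs = np.zeros((len(selections), n_features))
    for i, sel in enumerate(selections):
        vecs[i, list(sel)] = 1.0
    pairs = combinations(range(len(selections)), 2)
    corrs = [np.corrcoef(vecs[i], vecs[j])[0, 1] for i, j in pairs]
    return float(np.mean(corrs))

# Identical selections on every resample: perfectly stable.
stable = selection_stability([{0, 1, 2}] * 3, n_features=100)
# Disjoint selections: stability near zero (slightly negative).
unstable = selection_stability([{0, 1, 2}, {3, 4, 5}, {6, 7, 8}], 100)
```

The measure is 1 for identical selections and close to 0 for selections that overlap no more than chance would predict.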


2021 ◽  
Vol 13 (6) ◽  
pp. 41-52
Author(s):  
Rahul Deo Verma ◽  
Shefalika Ghosh Samaddar ◽  
A. B. Samaddar

The Border Gateway Protocol (BGP) provides crucial routing information for the Internet infrastructure. Abnormal routing behavior affects the stability and connectivity of the global Internet. The biggest hurdles in detecting BGP attacks are the extremely unbalanced class distribution of the data sets and the dynamic nature of the network, both of which degrade classifier performance. In this paper we propose an efficient approach to managing these problems: it tackles the unbalanced data sets by turning the binary classification problem into a multiclass classification problem. This is achieved by splitting the majority-class samples evenly into multiple segments using Affinity Propagation, where the number of segments is chosen so that the number of samples in any segment closely matches the number of minority-class samples. These segments, together with the minority class, are then treated as distinct classes and used to train an Extreme Learning Machine (ELM). The RIPE and BCNET datasets are used to evaluate the performance of the proposed technique. When no feature selection is used, the proposed technique improves the F1 score by 1.9% over state-of-the-art techniques. With the Fisher feature selection algorithm, the proposed algorithm achieved the highest F1 score of 76.3%, a 1.7% improvement over the compared techniques. Additionally, the MIQ feature selection technique improves the accuracy by 3.5%. For the BCNET dataset, the proposed technique improves the F1 score by 1.8% with the Fisher feature selection technique. The experimental findings confirm that the new technique substantially improves on previous approaches.
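The majority-class segmentation step might be sketched as follows with scikit-learn's AffinityPropagation on synthetic imbalanced data (the ELM training step is omitted, and segment counts here come from AP's default preference rather than the paper's size-matching rule):

```python
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs

# Imbalanced binary data: 150 majority vs 15 minority samples.
X_maj, _ = make_blobs(n_samples=150, centers=5, cluster_std=0.5,
                      random_state=0)
X_min = np.random.default_rng(0).normal(loc=12.0, size=(15, 2))

# Cluster only the majority class; each cluster then becomes its own
# class, bringing segment sizes closer to the minority-class size.
ap = AffinityPropagation(damping=0.9, random_state=0).fit(X_maj)
y_maj_new = ap.labels_                                  # classes 0..k-1
y_min_new = np.full(len(X_min), y_maj_new.max() + 1)    # minority class k

X_bal = np.vstack([X_maj, X_min])
y_bal = np.concatenate([y_maj_new, y_min_new])
```

A multiclass classifier trained on `(X_bal, y_bal)` can then treat any prediction of classes 0..k-1 as the original majority label.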


T-Comm ◽  
2020 ◽  
Vol 14 (10) ◽  
pp. 53-60
Author(s):  
Oleg I. Sheluhin ◽  
Valentina P. Ivannikova

A comparative analysis of statistical and model-based methods for selecting the quantity and composition of informative features was performed using the UNSW-NB15 database to train machine learning models for attack detection. Feature selection is one of the most important steps in data preparation for machine learning tasks. It improves the quality of machine learning models by reducing the size of the fitted models, the training time, and the probability of overfitting. The research was conducted using Python libraries: scikit-learn, which includes various machine learning models along with functions for data preparation and model evaluation, and FeatureSelector, which contains functions for statistical data analysis. Numerical results of experiments applying both statistical feature selection methods and methods based on machine learning models are provided. As a result, a reduced set of features is obtained that improves classification quality by removing noise features with little effect on the final result, reducing the number of informative features in the data set from 41 to 17. The most effective of the analyzed feature selection methods is shown to be the statistical method SelectKBest with the chi2 scoring function, which yields a reduced feature set providing a classification accuracy of 90%, compared with 74% for the full set.
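The SelectKBest/chi2 step maps directly onto scikit-learn; here is a sketch on synthetic non-negative data standing in for UNSW-NB15 (chi2 requires non-negative features):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Synthetic non-negative stand-in for the 41 UNSW-NB15 features;
# the first 5 are made informative for the attack label.
rng = np.random.default_rng(0)
y = np.array([0] * 100 + [1] * 100)
X = rng.poisson(lam=3.0, size=(200, 41)).astype(float)
X[y == 1, :5] += rng.poisson(lam=4.0, size=(100, 5))

selector = SelectKBest(score_func=chi2, k=17)   # 41 -> 17 features
X_red = selector.fit_transform(X, y)
kept = selector.get_support(indices=True)
```

The reduced matrix `X_red` would then be passed to the classifiers in place of the full feature set.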


2019 ◽  
Vol 26 (1) ◽  
pp. 107327481987659 ◽  
Author(s):  
Flavio S. Fogliatto ◽  
Michel J. Anzanello ◽  
Felipe Soares ◽  
Priscila G. Brust-Renck

Several statistics-based approaches have been developed to support medical personnel in early breast cancer detection. This article presents a feature selection method aimed at classifying cases into categories based on patients' breast tissue measures and protein microarray data. The effectiveness of this feature selection strategy was evaluated against the commonly used Wisconsin Breast Cancer Database (WBCD), with many patients and few features, and a new protein microarray data set, with many features and few patients. Features were ranked according to a feature importance index that combines parameters emerging from the unsupervised method of principal component analysis and the supervised method of Bhattacharyya distance. Observations in a training set were iteratively categorized into malignant and benign cases through 3 classification techniques: k-Nearest Neighbor, linear discriminant analysis, and probabilistic neural network. After each classification, the feature with the smallest importance index was removed, and a new categorization was carried out until only one feature was left. The subset yielding maximum accuracy was then used to classify observations in the testing set. Our method yielded an average of 99.17% accurate classifications in the testing set while retaining, on average, 4.61 of the 9 features in the WBCD, which is comparable to the best results reported in the literature for that data set, with the advantage of relying on simple and widely available multivariate techniques. When applied to the microarray data, the method yielded an average accuracy of 98.30% while retaining, on average, 2.17% of the original features. Our results can aid health-care professionals during early diagnosis of breast cancer.
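The importance index combining PCA and Bhattacharyya distance is not specified in full in the abstract; one hedged reading uses a variance-weighted PCA loading magnitude multiplied by a per-feature Gaussian Bhattacharyya distance (the `importance_index` combination below is an assumption, not the authors' formula):

```python
import numpy as np
from sklearn.decomposition import PCA

def bhattacharyya_gauss(a, b):
    """Bhattacharyya distance between two 1-D samples under a
    Gaussian assumption."""
    ma, mb = a.mean(), b.mean()
    va, vb = a.var() + 1e-12, b.var() + 1e-12
    return (0.25 * (ma - mb) ** 2 / (va + vb)
            + 0.5 * np.log((va + vb) / (2.0 * np.sqrt(va * vb))))

def importance_index(X, y):
    """Hypothetical combination: variance-weighted PCA loading
    magnitude (unsupervised) times per-feature Bhattacharyya
    distance between the two classes (supervised)."""
    pca = PCA().fit(X)
    w = np.abs(pca.components_).T @ pca.explained_variance_ratio_
    d = np.array([bhattacharyya_gauss(X[y == 0, j], X[y == 1, j])
                  for j in range(X.shape[1])])
    return w * d

# 9 features as in the WBCD; feature 0 is strongly discriminative.
rng = np.random.default_rng(0)
y = np.array([0] * 40 + [1] * 40)
X = rng.normal(size=(80, 9))
X[y == 1, 0] += 3.0
scores = importance_index(X, y)
```

Backward elimination as described in the abstract would repeatedly drop `scores.argmin()` and re-run the classifiers until one feature remains.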

