Using cross-validation to evaluate predictive accuracy of survival risk classifiers based on high-dimensional data

R. M. Simon; J. Subramanian; M.-C. Li; S. Menezes

doi:10.1093/bib/bbr001

Complexity Selection with Cross-validation for Lasso and Sparse Partial Least Squares Using High-Dimensional Data

Algorithms from and for Nature and Life - Studies in Classification, Data Analysis, and Knowledge Organization ◽

10.1007/978-3-319-00035-0_26 ◽

2013 ◽

pp. 261-268 ◽

Cited By ~ 7

Author(s):

Anne-Laure Boulesteix ◽

Adrian Richter ◽

Christoph Bernau

Keyword(s):

Least Squares ◽

Partial Least Squares ◽

Cross Validation ◽

High Dimensional Data ◽

High Dimensional

Download Full-text

Nested and Repeated Cross Validation for Classification Model With High-Dimensional Data

Revista Colombiana de Estadística ◽

10.15446/rce.v43n1.80000 ◽

2020 ◽

Vol 43 (1) ◽

pp. 103-125

Author(s):

Yi Zhong ◽

Jianghua He ◽

Prabhakar Chalise

Keyword(s):

Feature Selection ◽

Cross Validation ◽

Predictive Accuracy ◽

Simulated Data ◽

Classification Model ◽

High Dimensional ◽

Clinical Settings ◽

Feature Subset ◽

Validation Method ◽

High Dimensional Datasets

With the advent of high throughput technologies, the high-dimensional datasets are increasingly available. This has not only opened up new insight into biological systems but also posed analytical challenges. One important problem is the selection of informative feature-subset and prediction of the future outcome. It is crucial that models are not overfitted and give accurate results with new data. In addition, reliable identification of informative features with high predictive power (feature selection) is of interests in clinical settings. We propose a two-step framework for feature selection and classification model construction, which utilizes a nested and repeated cross-validation method. We evaluated our approach using both simulated data and two publicly available gene expression datasets. The proposed method showed comparatively better predictive accuracy for new cases than the standard cross-validation method.

Download Full-text

Beta Distribution-Based Cross-Entropy for Feature Selection

Entropy ◽

10.3390/e21080769 ◽

2019 ◽

Vol 21 (8) ◽

pp. 769 ◽

Cited By ~ 1

Author(s):

Weixing Dai ◽

Dianjing Guo

Keyword(s):

Feature Selection ◽

Probability Density ◽

Beta Distribution ◽

Predictive Accuracy ◽

High Dimensional Data ◽

Area Under The Curve ◽

Cross Entropy ◽

High Dimensional ◽

Generalization Ability ◽

Conventional Methods

Analysis of high-dimensional data is a challenge in machine learning and data mining. Feature selection plays an important role in dealing with high-dimensional data for improvement of predictive accuracy, as well as better interpretation of the data. Frequently used evaluation functions for feature selection include resampling methods such as cross-validation, which show an advantage in predictive accuracy. However, these conventional methods are not only computationally expensive, but also tend to be over-optimistic. We propose a novel cross-entropy which is based on beta distribution for feature selection. In beta distribution-based cross-entropy (BetaDCE) for feature selection, the probability density is estimated by beta distribution and the cross-entropy is computed by the expected value of beta distribution, so that the generalization ability can be estimated more precisely than conventional methods where the probability density is learnt from data. Analysis of the generalization ability of BetaDCE revealed that it was a trade-off between bias and variance. The robustness of BetaDCE was demonstrated by experiments on three types of data. In the exclusive or-like (XOR-like) dataset, the false discovery rate of BetaDCE was significantly smaller than that of other methods. For the leukemia dataset, the area under the curve (AUC) of BetaDCE on the test set was 0.93 with only four selected features, which indicated that BetaDCE not only detected the irrelevant and redundant features precisely, but also more accurately predicted the class labels with a smaller number of features than the original method, whose AUC was 0.83 with 50 features. In the metabonomic dataset, the overall AUC of prediction with features selected by BetaDCE was significantly larger than that by the original reported method. Therefore, BetaDCE can be used as a general and efficient framework for feature selection.

Download Full-text

A Multicriteria Approach to Find Predictive and Sparse Models with Stable Feature Selection for High-Dimensional Data

Computational and Mathematical Methods in Medicine ◽

10.1155/2017/7907163 ◽

2017 ◽

Vol 2017 ◽

pp. 1-18 ◽

Cited By ~ 5

Author(s):

Andrea Bommert ◽

Jörg Rahnenführer ◽

Michel Lang

Keyword(s):

Feature Selection ◽

Predictive Model ◽

Predictive Accuracy ◽

Pearson Correlation ◽

High Dimensional Data ◽

High Dimensional ◽

Sparse Models ◽

Data Set ◽

The Stability ◽

Selection Of

Finding a good predictive model for a high-dimensional data set can be challenging. For genetic data, it is not only important to find a model with high predictive accuracy, but it is also important that this model uses only few features and that the selection of these features is stable. This is because, in bioinformatics, the models are used not only for prediction but also for drawing biological conclusions which makes the interpretability and reliability of the model crucial. We suggest using three target criteria when fitting a predictive model to a high-dimensional data set: the classification accuracy, the stability of the feature selection, and the number of chosen features. As it is unclear which measure is best for evaluating the stability, we first compare a variety of stability measures. We conclude that the Pearson correlation has the best theoretical and empirical properties. Also, we find that for the stability assessment behaviour it is most important that a measure contains a correction for chance or large numbers of chosen features. Then, we analyse Pareto fronts and conclude that it is possible to find models with a stable selection of few features without losing much predictive accuracy.

Download Full-text

Large Sample Covariance Matrices and High-Dimensional Data Analysis

10.1017/cbo9781107588080 ◽

2015 ◽

Cited By ~ 26

Author(s):

Jianfeng Yao ◽

Shurong Zheng ◽

Zhidong Bai

Keyword(s):

Data Analysis ◽

High Dimensional Data ◽

Covariance Matrices ◽

High Dimensional ◽

Large Sample ◽

Sample Covariance Matrices ◽

Sample Covariance ◽

High Dimensional Data Analysis

Download Full-text

Fractal-Based Methods as a Technique for Estimating the Intrinsic Dimensionality of High-Dimensional Data: A Survey

Informatica ◽

10.15388/informatica.2016.84 ◽

2016 ◽

Vol 27 (2) ◽

pp. 257-281 ◽

Cited By ~ 5

Author(s):

Rasa Karbauskaitė ◽

Gintautas Dzemyda

Keyword(s):

High Dimensional Data ◽

High Dimensional ◽

Intrinsic Dimensionality

Download Full-text

A Fast Clustering Algorithm for Large-scale and High Dimensional Data

ACTA AUTOMATICA SINICA ◽

10.3724/sp.j.1004.2009.00859 ◽

2009 ◽

Vol 35 (7) ◽

pp. 859-866

Author(s):

Ming LIU ◽

Xiao-Long WANG ◽

Yuan-Chao LIU

Keyword(s):

Large Scale ◽

Clustering Algorithm ◽

High Dimensional Data ◽

High Dimensional

Download Full-text

Improved negative selection algorithm for network anomaly detection on high-dimensional data

Journal of Computer Applications ◽

10.3724/sp.j.1087.2009.00805 ◽

2009 ◽

Vol 29 (3) ◽

pp. 805-807 ◽

Cited By ~ 1

Author(s):

Wen-zhong GUO ◽

Guo-long CHEN ◽

Qing-liang CHEN

Keyword(s):

Anomaly Detection ◽

Negative Selection ◽

High Dimensional Data ◽

High Dimensional ◽

Selection Algorithm ◽

Negative Selection Algorithm ◽

Network Anomaly Detection

Download Full-text

An Advanced Mining Services in Predicting and Ranking User Vitality across Dynamic and High Dimensional Data Sets

SSRN Electronic Journal ◽

10.2139/ssrn.3395242 ◽

2019 ◽

Author(s):

Ch. Durga Bhavani ◽

Dr. A. Daveedu Raju ◽

Dr. V. Surya Narayana

Keyword(s):

High Dimensional Data ◽

High Dimensional ◽

Data Sets

Download Full-text

Outlier Detection in High Dimensional Data Based on the Anti-Hub and Regression Technique

International Journal for Research in Applied Science and Engineering Technology ◽

10.22214/ijraset.2017.8219 ◽

2017 ◽

Vol V (VIII) ◽

pp. 1543-1551

Author(s):

Golla Hemalatha

Keyword(s):

Outlier Detection ◽

High Dimensional Data ◽

Regression Technique ◽

High Dimensional

Download Full-text