Prediction and Variable Selection in High-Dimensional Misspecified Binary Classification

Entropy ◽  
2020 ◽  
Vol 22 (5) ◽  
pp. 543 ◽  
Author(s):  
Konrad Furmańczyk ◽  
Wojciech Rejchel

In this paper, we consider prediction and variable selection in misspecified binary classification models under the high-dimensional scenario. We focus on two approaches to classification that are computationally efficient but lead to model misspecification. The first is to apply penalized logistic regression to classification data that possibly do not follow the logistic model. The second method is even more radical: we simply treat the class labels of objects as if they were numbers and apply penalized linear regression. We investigate these two approaches thoroughly and provide conditions that guarantee their success in prediction and variable selection. Our results hold even if the number of predictors is much larger than the sample size. The paper concludes with experimental results.
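The two approaches the abstract contrasts can be sketched on synthetic data. This is a minimal illustration, not the authors' code: it assumes scikit-learn, and the penalty strengths, sample sizes, and probit-style label rule (which makes the logistic model misspecified) are arbitrary choices.

```python
# Sketch of the two computationally cheap approaches compared in the paper:
# (1) L1-penalized logistic regression on possibly misspecified data, and
# (2) L1-penalized linear regression treating the 0/1 labels as numbers.
# All data are synthetic; penalties and dimensions are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression, Lasso

rng = np.random.default_rng(0)
n, p = 100, 500                      # high-dimensional: p >> n
X = rng.standard_normal((n, p))
beta = np.zeros(p); beta[:5] = 2.0   # only 5 truly relevant predictors
# Labels follow a probit-like rule, so the logistic model is misspecified
y = (X @ beta + rng.standard_normal(n) > 0).astype(int)

# Approach 1: penalized logistic regression (L1 penalty)
logit = LogisticRegression(penalty="l1", C=0.1, solver="liblinear").fit(X, y)
sel_logit = np.flatnonzero(logit.coef_[0])

# Approach 2: penalized linear regression on the raw 0/1 labels
lasso = Lasso(alpha=0.1).fit(X, y)
sel_lasso = np.flatnonzero(lasso.coef_)

print("logistic selected:", sel_logit)
print("linear   selected:", sel_lasso)
```

Despite the misspecification, both fits tend to recover (a superset of) the relevant predictors, which is the behaviour the paper's conditions formalize.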

PLoS ONE ◽  
2019 ◽  
Vol 14 (5) ◽  
pp. e0217057 ◽  
Author(s):  
Sam Doerken ◽  
Marta Avalos ◽  
Emmanuel Lagarde ◽  
Martin Schumacher

2021 ◽  
Author(s):  
Reetika Sarkar ◽  
Sithija Manage ◽  
Xiaoli Gao

Abstract Background: High-dimensional genomic data studies often exhibit strong correlations, which result in instability and inconsistency in the estimates obtained with commonly used regularization approaches, including the Lasso, MCP, and related methods. Results: In this paper, we perform a comparative study of regularization approaches for variable selection under different correlation structures and propose a two-stage procedure, named rPGBS, to address the issue of stable variable selection in various strong-correlation settings. The approach repeatedly runs a two-stage hierarchical procedure consisting of random pseudo-group clustering and bi-level variable selection. Conclusion: Both simulation studies and high-dimensional genomic data analysis demonstrate the advantage of the proposed rPGBS method over the most commonly used regularization methods. In particular, rPGBS yields more stable variable selection across a variety of correlation settings than recent work addressing variable selection under strong correlation. Moreover, rPGBS is computationally efficient across various settings.
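The general idea described in the abstract (repeat a random pseudo-group clustering plus a bi-level selection, then aggregate) can be sketched loosely. The authors' actual algorithm is not reproduced here; the group count, the lasso stand-in for bi-level selection, and the stability cutoff are all illustrative assumptions.

```python
# Loose sketch of the rPGBS *idea*: repeatedly (1) cluster predictors into
# random pseudo-groups and (2) run a bi-level selection, here approximated by
# a group-level screening step followed by a within-group lasso, then keep
# variables that are selected in at least half of the repetitions.
# Every tuning detail below is an assumption, not the published procedure.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p, n_groups, n_reps = 80, 200, 20, 30
X = rng.standard_normal((n, p))
X[:, 1] = X[:, 0] + 0.1 * rng.standard_normal(n)   # a strongly correlated pair
y = 3 * X[:, 0] + 3 * X[:, 2] + rng.standard_normal(n)

freq = np.zeros(p)
for _ in range(n_reps):
    perm = rng.permutation(p)
    groups = np.array_split(perm, n_groups)        # random pseudo-groups
    # group-level screening: keep groups strongly associated with y on average
    scores = [np.abs(X[:, g].T @ y).mean() for g in groups]
    kept = [g for g, s in zip(groups, scores) if s >= np.median(scores)]
    keep_idx = np.concatenate(kept)
    # individual-level selection via lasso on the surviving variables
    coef = Lasso(alpha=0.2).fit(X[:, keep_idx], y).coef_
    freq[keep_idx[coef != 0]] += 1

selected = np.flatnonzero(freq / n_reps >= 0.5)    # stability-style cutoff
print("stably selected variables:", selected)
```

Aggregating over many random groupings is what stabilizes the selection when predictors are strongly correlated: no single arbitrary grouping decides which member of a correlated pair survives.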


2020 ◽  
Author(s):  
Insha Ullah ◽  
Kerrie Mengersen ◽  
Anthony Pettitt ◽  
Benoit Liquet

Abstract High-dimensional datasets, where the number of variables ‘p’ is much larger than the number of samples ‘n’, are ubiquitous and often render standard classification and regression techniques unreliable due to overfitting. An important research problem is feature selection: ranking candidate variables based on their relevance to the outcome variable and retaining those that satisfy a chosen criterion. In this article, we propose a computationally efficient variable selection method based on principal component analysis. The method is very simple, accessible, and suitable for the analysis of high-dimensional datasets. It allows correction for population structure in genome-wide association studies (GWAS), which would otherwise induce spurious associations, and is less likely to overfit. We expect our method to accurately identify important features while reducing the False Discovery Rate (FDR) (the expected proportion of erroneously rejected null hypotheses) by accounting for the correlation between variables and by de-noising the data in the training phase, which also makes it robust to outliers in the training data. Being almost as fast as univariate filters, our method allows for valid statistical inference. The ability to make such inferences sets this method apart from most of the current multivariate statistical tools designed for today’s high-dimensional data. We demonstrate the superior performance of our method through extensive simulations.
A semi-real gene-expression dataset, a challenging childhood acute lymphoblastic leukemia (CALL) gene-expression study, and a GWAS that attempts to identify single-nucleotide polymorphisms (SNPs) associated with rice grain length further demonstrate the usefulness of our method in genomic applications. Author summary: An integral part of modern statistical research is feature selection, which has driven various scientific discoveries, especially in emerging genomics applications such as gene-expression and proteomics studies, where the data have thousands or tens of thousands of features but a limited number of samples. However, in practice, due to the unavailability of suitable multivariate methods, researchers often resort to univariate filters when dealing with a large number of variables. These univariate filters do not account for dependencies between variables because they assess variables one by one, independently. This leads to loss of information, loss of statistical power (the probability of correctly rejecting the null hypothesis), and potentially biased estimates. In our paper, we propose a new variable selection method. Being computationally efficient, our method allows for valid inference. The ability to make such inferences sets this method apart from most of the current multivariate statistical tools designed for today’s high-dimensional data.
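The general recipe the abstract describes can be sketched as: use principal components to absorb population structure, test each variable against the structure-adjusted outcome, and control the FDR. The authors' exact procedure is not specified in the abstract, so the PC count, the correlation test, the Benjamini-Hochberg step, and the FDR level are all assumptions of this sketch.

```python
# Sketch: PC-based structure correction followed by FDR-controlled
# univariate tests. Synthetic data with one shared confounder standing in
# for population structure; all tuning choices are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, p, k = 120, 300, 5
confounder = rng.standard_normal((n, 1))            # shared structure
X = 0.8 * confounder + rng.standard_normal((n, p))
y = 2 * X[:, 0] + 2 * X[:, 1] + 3 * confounder[:, 0] + rng.standard_normal(n)

# top-k principal components of the centered design
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
PC = U[:, :k] * S[:k]

# residualize outcome and features on the PCs to remove structure
H = PC @ np.linalg.pinv(PC)                          # projection onto PC space
ry = y - H @ y
RX = Xc - H @ Xc

# univariate tests on the adjusted data, Benjamini-Hochberg at FDR 0.1
pvals = np.array([stats.pearsonr(RX[:, j], ry)[1] for j in range(p)])
order = np.argsort(pvals)
passed = pvals[order] <= 0.1 * np.arange(1, p + 1) / p
selected = (np.sort(order[: passed.nonzero()[0].max() + 1])
            if passed.any() else np.array([], dtype=int))
print("features passing FDR 0.1:", selected)
```

Without the PC adjustment, every feature correlates with y through the confounder and the univariate tests flood the selection with spurious associations; the residualization step is what removes that artefact.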


Diagnostics ◽  
2021 ◽  
Vol 11 (7) ◽  
pp. 1263
Author(s):  
Samy Ammari ◽  
Raoul Sallé de Chou ◽  
Tarek Assi ◽  
Mehdi Touat ◽  
Emilie Chouzenoux ◽  
...  

Anti-angiogenic therapy with bevacizumab is a widely used therapeutic option for recurrent glioblastoma (GBM). Nevertheless, the therapeutic response remains highly heterogeneous among GBM patients, with discordant outcomes. Recent data have shown that radiomics, a recently developed advanced image-analysis method, can help to predict both prognosis and response to therapy in a multitude of solid tumours. The objective of this study was to identify novel biomarkers, extracted from MRI and clinical data, that could predict overall survival (OS) and progression-free survival (PFS) in GBM patients treated with bevacizumab using machine-learning algorithms. In a cohort of 194 recurrent GBM patients (age range 18–80), radiomics features from pre-treatment T2 FLAIR and gadolinium-injected MRI images, along with clinical features, were analysed. Binary classification models for OS at 9, 12, and 15 months were evaluated. Our classification models successfully stratified OS: the AUCs were 0.78, 0.85, and 0.76 on the test sets (0.79, 0.82, and 0.87 on the training sets) for the 9-, 12-, and 15-month endpoints, respectively. Survival regressions yielded a C-index of 0.64 for OS and 0.57 for PFS on the test sets (0.74 and 0.69 on the training sets). These results suggest that radiomics could assist in building a predictive model for treatment selection in recurrent GBM patients.
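The evaluation set-up described above (binarize OS at a clinical cutoff, train a classifier on imaging-derived features, report test-set AUC) can be sketched as follows. The cohort data are not public here, so synthetic features stand in; the random-forest model, feature count, and survival-time generator are assumptions, not the authors' pipeline.

```python
# Sketch of the binary OS-classification evaluation: binarize survival at a
# 12-month cutoff, train a classifier on radiomics-style features, and
# report AUC on a held-out test set. Entirely synthetic stand-in data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
n, p = 194, 40                                 # cohort size from the abstract
X = rng.standard_normal((n, p))                # stand-in radiomics features
months = 8.0 * np.exp(1.0 + 0.5 * X[:, 0] + 0.3 * rng.standard_normal(n))
y = (months >= 12).astype(int)                 # OS >= 12 months?

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print(f"test AUC at the 12-month endpoint: {auc:.2f}")
```

Repeating the same fit with cutoffs at 9 and 15 months would reproduce the three-endpoint comparison reported in the study.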


Mathematics ◽  
2021 ◽  
Vol 9 (3) ◽  
pp. 222
Author(s):  
Juan C. Laria ◽  
M. Carmen Aguilera-Morillo ◽  
Enrique Álvarez ◽  
Rosa E. Lillo ◽  
Sara López-Taruella ◽  
...  

Over the last decade, regularized regression methods have offered alternatives for performing multi-marker analysis and feature selection in a whole-genome context. The process of defining the list of genes that characterizes an expression profile remains unclear: it currently relies on advanced statistics and may take an agnostic point of view or incorporate a priori knowledge, but overfitting remains a problem. This paper introduces a methodology for the variable selection and model estimation problems in the high-dimensional set-up, which can be particularly useful in the whole-genome context. Results are validated using simulated data and a real dataset from a triple-negative breast cancer study.
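The kind of regularized multi-marker analysis the abstract refers to is commonly sketched as a cross-validated lasso whose nonzero coefficients define the candidate gene list. This is a generic illustration on synthetic data, not the paper's own estimator or tuning scheme.

```python
# Generic sketch: a cross-validated lasso fit in a p >> n setting; the
# nonzero coefficients form the selected "expression profile". Synthetic
# data; penalty path and CV folds are scikit-learn defaults, not the paper's.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(4)
n, p = 100, 1000                        # whole-genome-style p >> n
X = rng.standard_normal((n, p))
beta = np.zeros(p); beta[[10, 50, 200]] = 1.5   # three true markers
y = X @ beta + rng.standard_normal(n)

model = LassoCV(cv=5).fit(X, y)         # penalty chosen by cross-validation
gene_list = np.flatnonzero(model.coef_) # candidate gene list
print("selected markers:", gene_list)
```

As the abstract notes, the CV-tuned penalty tends to over-select, which is one face of the overfitting problem the paper's methodology targets.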

