A simple new approach to variable selection in regression, with application to genetic fine-mapping

10.1101/501114 ◽

2018 ◽

Cited By ~ 33

Author(s):

Gao Wang ◽

Abhishek Sarkar ◽

Peter Carbonetto ◽

Matthew Stephens

Keyword(s):

Variable Selection ◽

Fine Mapping ◽

Posterior Distribution ◽

Zero Element ◽

Variational Approximation ◽

New Approach ◽

Stepwise Selection ◽

Fitting Procedure ◽

Highly Correlated ◽

Credible Set

We introduce a simple new approach to variable selection in linear regression, with a particular focus on quantifying uncertainty in which variables should be selected. The approach is based on a new model, the “Sum of Single Effects” (SuSiE) model, which comes from writing the sparse vector of regression coefficients as a sum of “single-effect” vectors, each with one non-zero element. We also introduce a corresponding new fitting procedure, Iterative Bayesian Stepwise Selection (IBSS), which is a Bayesian analogue of stepwise selection methods. IBSS shares the computational simplicity and speed of traditional stepwise methods, but instead of selecting a single variable at each step, IBSS computes a distribution on variables that captures uncertainty in which variable to select. We provide a formal justification of this intuitive algorithm by showing that it optimizes a variational approximation to the posterior distribution under the SuSiE model. Further, this approximate posterior distribution naturally yields convenient novel summaries of uncertainty in variable selection, providing a Credible Set of variables for each selection. Our methods are particularly well-suited to settings where variables are highly correlated and detectable effects are sparse, both of which are characteristics of genetic fine-mapping applications. We demonstrate through numerical experiments that our methods outperform existing methods for this task, and illustrate their application to fine-mapping genetic variants influencing alternative splicing in human cell-lines. We also discuss the potential and challenges for applying these methods to generic variable selection problems.
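The idea in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a uniform prior over variables, keeps the residual variance (`sigma2`) and the prior effect variance (`sigma0_2`) fixed at 1 rather than estimating them, assumes the response and predictors are roughly centered, and runs on simulated data. All function and variable names are hypothetical.

```python
import math
import random


def single_effect_regression(cols, r, sigma2, sigma0_2):
    """Fit one single-effect vector to residuals r.

    Returns (alpha, mu1): alpha[j] is the posterior probability that
    variable j carries the effect, mu1[j] the posterior mean effect if it does.
    """
    log_bf, mu1 = [], []
    for x in cols:
        xtx = sum(v * v for v in x)
        bhat = sum(v * w for v, w in zip(x, r)) / xtx  # least-squares estimate
        s2 = sigma2 / xtx                              # its sampling variance
        shrink = sigma0_2 / (sigma0_2 + s2)
        # log Bayes factor for "variable j has a normal effect" vs "no effect"
        log_bf.append(0.5 * math.log(s2 / (sigma0_2 + s2))
                      + 0.5 * (bhat * bhat / s2) * shrink)
        mu1.append(shrink * bhat)                      # posterior mean of the effect
    top = max(log_bf)                                  # stabilise the normalisation
    weights = [math.exp(v - top) for v in log_bf]
    total = sum(weights)
    return [w / total for w in weights], mu1


def ibss(cols, y, L=3, sigma2=1.0, sigma0_2=1.0, n_iter=20):
    """Iteratively refit each of L single effects to the residuals of the others."""
    n, p = len(y), len(cols)
    alpha = [[1.0 / p] * p for _ in range(L)]
    mu1 = [[0.0] * p for _ in range(L)]
    for _ in range(n_iter):
        for l in range(L):
            # Posterior-mean coefficients of all effects except effect l.
            b = [sum(alpha[m][j] * mu1[m][j] for m in range(L) if m != l)
                 for j in range(p)]
            r = [y[i] - sum(cols[j][i] * b[j] for j in range(p)) for i in range(n)]
            alpha[l], mu1[l] = single_effect_regression(cols, r, sigma2, sigma0_2)
    # Posterior inclusion probability: variable j appears in at least one effect.
    pip = [1.0 - math.prod(1.0 - alpha[l][j] for l in range(L)) for j in range(p)]
    return alpha, mu1, pip


def credible_set(alpha_l, coverage=0.95):
    """Smallest set of variables whose alpha values sum to at least `coverage`."""
    chosen, mass = [], 0.0
    for j in sorted(range(len(alpha_l)), key=lambda j: -alpha_l[j]):
        chosen.append(j)
        mass += alpha_l[j]
        if mass >= coverage:
            break
    return sorted(chosen)


# Simulated data (an assumption for illustration): variables 2 and 7 have effects.
random.seed(0)
n, p = 100, 10
cols = [[random.gauss(0, 1) for _ in range(n)] for _ in range(p)]
y = [2.0 * cols[2][i] - 2.0 * cols[7][i] + random.gauss(0, 1) for i in range(n)]
alpha, mu1, pip = ibss(cols, y)
sets = [credible_set(a) for a in alpha]
```

Each row of `alpha` is a distribution over variables rather than a single pick, which is how the sketch mirrors the "distribution on variables" described in the abstract. `credible_set` turns one such row into a set summary.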

A simple new approach to variable selection in regression, with application to genetic fine mapping

Journal of the Royal Statistical Society Series B (Statistical Methodology) ◽

10.1111/rssb.12388 ◽

2020 ◽

Vol 82 (5) ◽

pp. 1273-1300 ◽

Cited By ~ 11

Author(s):

Gao Wang ◽

Abhishek Sarkar ◽

Peter Carbonetto ◽

Matthew Stephens

Keyword(s):

Variable Selection ◽

Fine Mapping ◽

New Approach

A new approach to deal with variable selection in neural networks: an application to bankruptcy prediction

Annals of Operations Research ◽

10.1007/s10479-021-04236-4 ◽

2021 ◽

Author(s):

Ilyes Abid ◽

Rim Ayadi ◽

Khaled Guesmi ◽

Farid Mkaouar

Keyword(s):

Neural Networks ◽

Variable Selection ◽

Bankruptcy Prediction ◽

New Approach

Application of a random forests (RF) method as a new approach for variable selection and modelling in a QSRR study to predict the relative retention time of some polybrominated diphenylethers (PBDEs)

Analytical Methods ◽

10.1039/c2ay25484k ◽

2012 ◽

Vol 4 (11) ◽

pp. 3733 ◽

Cited By ~ 5

Author(s):

Nasser Goudarzi ◽

Davood Shahsavani

Keyword(s):

Variable Selection ◽

Random Forests ◽

Retention Time ◽

Relative Retention Time ◽

New Approach ◽

Relative Retention ◽

Polybrominated Diphenylethers

Developing an Optimal Spatial Predictive Model for Seabed Sand Content Using Machine Learning, Geostatistics, and Their Hybrid Methods

Geosciences ◽

10.3390/geosciences9040180 ◽

2019 ◽

Vol 9 (4) ◽

pp. 180 ◽

Cited By ~ 3

Author(s):

Li ◽

Siwabessy ◽

Huang ◽

Nichol

Keyword(s):

Variable Selection ◽

Predictive Accuracy ◽

Hybrid Methods ◽

Predictive Modelling ◽

Sand Content ◽

Seabed Sediment ◽

Baseline Information ◽

Point Data ◽

Highly Correlated ◽

Derived Data

Seabed sediment predictions at regional and national scales in Australia are mainly based on bathymetry-related variables due to the lack of backscatter-derived data. In this study, we applied random forests (RFs), hybrid methods of RF and geostatistics, and generalized boosted regression modelling (GBM), to seabed sand content point data and acoustic multibeam data and their derived variables, to develop an accurate model to predict seabed sand content at a local scale. We also addressed relevant issues with variable selection. It was found that: (1) backscatter-related variables are more important than bathymetry-related variables for sand predictive modelling; (2) the inclusion of highly correlated predictors can improve predictive accuracy; (3) the rank orders of averaged variable importance (AVI) and accuracy contribution change with input predictors for RF and are not necessarily matched; (4) a knowledge-informed AVI method (KIAVI2) is recommended for RF; (5) the hybrid methods and their averaging can significantly improve predictive accuracy and are recommended; (6) relationships between sand and predictors are non-linear; and (7) variable selection methods for GBM need further study. Accuracy-improved predictions of sand content are generated at high resolution, which provide important baseline information for environmental management and conservation.

Sufficient Dimension Reduction and Variable Selection for Large-p-Small-n Data With Highly Correlated Predictors

Journal of Computational and Graphical Statistics ◽

10.1080/10618600.2016.1164057 ◽

2017 ◽

Vol 26 (1) ◽

pp. 26-34 ◽

Cited By ~ 5

Author(s):

Haileab Hilafu ◽

Xiangrong Yin

Keyword(s):

Variable Selection ◽

Dimension Reduction ◽

Sufficient Dimension Reduction ◽

Large P Small N ◽

Selection For ◽

Small N ◽

Highly Correlated ◽

Correlated Predictors

A New Approach to Variable Selection Using the TLS Approach

IEEE Transactions on Signal Processing ◽

10.1109/tsp.2006.882105 ◽

2007 ◽

Vol 55 (1) ◽

pp. 10-19 ◽

Cited By ~ 6

Author(s):

Jean-Jacques Fuchs ◽

Sébastien Maria

Keyword(s):

Variable Selection ◽

New Approach

A new approach to variable selection in least squares problems

IMA Journal of Numerical Analysis ◽

10.1093/imanum/20.3.389 ◽

2000 ◽

Vol 20 (3) ◽

pp. 389-403 ◽

Cited By ~ 402

Author(s):

M. Osborne

Keyword(s):

Variable Selection ◽

Least Squares ◽

Least Squares Problems ◽

New Approach

Variable Selection for Confounder Control, Flexible Modeling and Collaborative Targeted Minimum Loss-Based Estimation in Causal Inference

The International Journal of Biostatistics ◽

10.1515/ijb-2015-0017 ◽

2016 ◽

Vol 12 (1) ◽

pp. 97-115 ◽

Cited By ~ 14

Author(s):

Mireille E. Schnitzer ◽

Judith J. Lok ◽

Susan Gruber

Keyword(s):

Propensity Score ◽

Variable Selection ◽

Causal Inference ◽

Simulation Study ◽

Learning Approaches ◽

Minimum Loss ◽

Knowledge Based ◽

Flexible Modeling ◽

Selection For ◽

Highly Correlated

This paper investigates the appropriateness of the integration of flexible propensity score modeling (nonparametric or machine learning approaches) in semiparametric models for the estimation of a causal quantity, such as the mean outcome under treatment. We begin with an overview of some of the issues involved in knowledge-based and statistical variable selection in causal inference and the potential pitfalls of automated selection based on the fit of the propensity score. Using a simple example, we directly show the consequences of adjusting for pure causes of the exposure when using inverse probability of treatment weighting (IPTW). Such variables are likely to be selected when using a naive approach to model selection for the propensity score. We describe how the method of Collaborative Targeted minimum loss-based estimation (C-TMLE; van der Laan and Gruber, 2010 [27]) capitalizes on the collaborative double robustness property of semiparametric efficient estimators to select covariates for the propensity score based on the error in the conditional outcome model. Finally, we compare several approaches to automated variable selection in low- and high-dimensional settings through a simulation study. From this simulation study, we conclude that using IPTW with flexible prediction for the propensity score can result in inferior estimation, while Targeted minimum loss-based estimation and C-TMLE may benefit from flexible prediction and remain robust to the presence of variables that are highly correlated with treatment. However, in our study, standard influence function-based methods for the variance underestimated the standard errors, resulting in poor coverage under certain data-generating scenarios.
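For readers unfamiliar with IPTW, the following is a minimal sketch of how the mean outcome under treatment can be estimated by weighting. It is not taken from the paper: the propensity scores are supplied directly instead of being fitted, the numbers are made up, the normalized variant is a common alternative that the abstract does not mention, and the function names are hypothetical.

```python
def iptw_mean_under_treatment(treated, outcome, propensity):
    """Unnormalized IPTW estimate of E[Y(1)]: mean of A * Y / g(W)."""
    n = len(outcome)
    return sum(a * y / g for a, y, g in zip(treated, outcome, propensity)) / n


def normalized_iptw_mean(treated, outcome, propensity):
    """Same idea, but the weights A / g(W) are rescaled to sum to one."""
    weights = [a / g for a, g in zip(treated, propensity)]
    return sum(w * y for w, y in zip(weights, outcome)) / sum(weights)


# Toy data (hypothetical): two treated and two untreated units.
treated = [1, 1, 0, 0]
outcome = [3.0, 5.0, 1.0, 2.0]
propensity = [0.5, 0.25, 0.5, 0.75]  # assumed P(A = 1 | W) for each unit

estimate = iptw_mean_under_treatment(treated, outcome, propensity)  # (6 + 20) / 4
normalized = normalized_iptw_mean(treated, outcome, propensity)     # 26 / 6
```

The second unit has a propensity of 0.25 and therefore a weight of 4. A covariate that predicts treatment very strongly pushes propensities toward 0 or 1 and inflates such weights, which is one way to read the abstract's concern about variables highly correlated with treatment.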

Economic Predictions With Big Data: The Illusion of Sparsity

Econometrica ◽

10.3982/ecta17842 ◽

2021 ◽

Vol 89 (5) ◽

pp. 2409-2437 ◽

Cited By ~ 1

Author(s):

Domenico Giannone ◽

Michele Lenza ◽

Giorgio E. Primiceri

Keyword(s):

Big Data ◽

Variable Selection ◽

Posterior Distribution ◽

Predictive Models ◽

Sparse Model

We compare sparse and dense representations of predictive models in macroeconomics, microeconomics, and finance. To deal with a large number of possible predictors, we specify a prior that allows for both variable selection and shrinkage. The posterior distribution does not typically concentrate on a single sparse model, but on a wide set of models that often include many predictors.

SNP Variable Selection by Generalized Graph Domination

10.1101/396085 ◽

2018 ◽

Author(s):

Shuzhen Sun ◽

Zhuqi Miao ◽

Blaise Ratcliffe ◽

Polly Campbell ◽

Bret Pasch ◽

...

Keyword(s):

Variable Selection ◽

High Throughput Sequencing ◽

Dominating Set ◽

Similarity Measures ◽

Correlation Coefficients ◽

Biological Research ◽

Pairwise Linkage Disequilibrium ◽

Graph Theoretic ◽

Large Numbers ◽

Highly Correlated

High-throughput sequencing technology has revolutionized both medical and biological research by generating exceedingly large numbers of genetic variants. The resulting datasets share a number of common characteristics that might lead to poor generalization capacity. Concerns include noise accumulated due to the large number of predictors, sparse information regarding the p ≫ n problem, and overfitting and model mis-identification resulting from spurious collinearity. Additionally, complex correlation patterns are present among variables. As a consequence, reliable variable selection techniques play a pivotal role in predictive analysis, generalization capability, and robustness in clustering, as well as interpretability of the derived models. K-dominating set, a parameterized graph-theoretic generalization model, was used to model SNP (single nucleotide polymorphism) data as a similarity network and to search for representative SNP variables. In particular, each SNP was represented as a vertex in the graph; (dis)similarity measures such as correlation coefficients or pairwise linkage disequilibrium were estimated to describe the relationship between each pair of SNPs; and a pair of vertices is adjacent, i.e. joined by an edge, if the pairwise similarity measure exceeds a user-specified threshold. A minimum K-dominating set in the SNP graph was then defined as the smallest subset such that every SNP excluded from the subset has at least k neighbors among the selected SNPs. The strength of k-dominating set selection in identifying independent variables, and in culling representative variables that are highly correlated with others, was demonstrated on a simulated dataset. The advantages of k-dominating set variable selection were also illustrated in two applications: pedigree reconstruction using SNP profiles of 1,372 Douglas-fir trees, and species delineation for 226 grasshopper mouse samples.
C++ source code that implements SNP-SELECT and uses the Gurobi™ optimization solver for the k-dominating set variable selection is available (https://github.com/transgenomicsosu/SNP-SELECT).
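The graph construction described in this abstract can be sketched in a few lines. This is an illustration only and not the SNP-SELECT code: it uses absolute Pearson correlation as the similarity measure, a simple greedy heuristic in place of the exact minimum k-dominating set found with an optimization solver (so the result is valid but not guaranteed to be smallest), and made-up genotype vectors. The SNP labels and function names are hypothetical.

```python
from itertools import combinations


def pearson(a, b):
    """Pearson correlation of two equal-length, non-constant vectors."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5


def similarity_graph(snps, threshold):
    """Join two SNPs by an edge when |correlation| exceeds the threshold."""
    adj = {name: set() for name in snps}
    for u, v in combinations(snps, 2):
        if abs(pearson(snps[u], snps[v])) > threshold:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def greedy_k_dominating_set(adj, k):
    """Greedy heuristic: add vertices until every excluded one has k selected neighbors."""
    selected = set()

    def deficit(v):
        # How many more selected neighbors v still needs (0 once v is selected).
        return 0 if v in selected else max(0, k - len(adj[v] & selected))

    while any(deficit(v) > 0 for v in adj):
        def gain(v):
            # Selecting v clears its own deficit and helps each needy neighbor.
            return deficit(v) + sum(1 for u in adj[v] if deficit(u) > 0)
        best = max((v for v in adj if v not in selected), key=gain)
        selected.add(best)
    return selected


def is_k_dominating(adj, selected, k):
    """Check the defining property of a k-dominating set."""
    return all(len(adj[v] & selected) >= k for v in adj if v not in selected)


# Toy genotypes coded 0/1/2 (hypothetical data for illustration).
snps = {
    "snp_a": [0, 1, 2, 0, 1, 2, 0, 1],
    "snp_b": [0, 1, 2, 0, 1, 2, 0, 2],
    "snp_c": [0, 1, 2, 0, 1, 2, 1, 1],
    "snp_d": [2, 0, 0, 1, 2, 0, 1, 1],
    "snp_e": [2, 0, 0, 1, 2, 0, 1, 0],
}
adj = similarity_graph(snps, threshold=0.8)
selected = greedy_k_dominating_set(adj, k=1)
```

A vertex with fewer than k neighbors can never be dominated from outside, so the heuristic ends up selecting it. This matches the abstract's point that the method retains independent variables while culling those that are highly correlated with a selected representative.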