Variable Selection: Determining the Explanatory Variables

2001 ◽  
Vol 5 (4) ◽  
pp. 215-234 ◽  
Author(s):  
Zvi Drezner ◽  
George A. Marcoulides ◽  
Mark Hoven Stohs

We illustrate how a comparatively new technique, the Tabu search variable selection model [Drezner, Marcoulides and Salhi (1999)], can be applied efficiently in finance when the researcher must select a subset of variables from the whole set of explanatory variables under consideration. Several types of problems in finance, including corporate and personal bankruptcy prediction, mortgage and credit scoring, and the selection of variables for the Arbitrage Pricing Model, require the researcher to select a subset of variables from a larger set. To demonstrate the usefulness of the Tabu search variable selection model, we (1) illustrate its efficiency in comparison with the main alternative search procedures, such as stepwise regression and the Maximum R2 procedure, and (2) show how a version of the Tabu search procedure may be implemented when attempting to predict corporate bankruptcy. We accomplish (2) by showing that the Tabu search procedure improves the prediction of corporate bankruptcy by up to 10 percentage points in comparison with Altman's (1968) Z-Score model.
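The abstract above applies Tabu search to subset selection. As a rough illustration of the general idea, not the Drezner, Marcoulides and Salhi implementation, the following sketch flips one variable in or out per move, keeps recently flipped variables tabu for a few iterations (with an aspiration override), and scores candidate subsets by AIC; the scoring function, tenure, and iteration budget are this sketch's assumptions.

```python
import numpy as np

def fit_rss(X, y, subset):
    """Residual sum of squares of OLS on the included columns (plus intercept)."""
    cols = [j for j, used in enumerate(subset) if used]
    Xs = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    resid = y - Xs @ beta
    return float(resid @ resid)

def aic(X, y, subset):
    n, k = len(y), sum(subset) + 1
    return n * np.log(fit_rss(X, y, subset) / n) + 2 * k

def tabu_search(X, y, n_iter=100, tenure=5, seed=0):
    """Tabu search over binary inclusion vectors, minimizing AIC."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    current = list(rng.integers(0, 2, p))
    best, best_score = current[:], aic(X, y, current)
    tabu = {}  # variable index -> iteration until which flipping it is forbidden
    for it in range(n_iter):
        # evaluate all single-flip neighbours; a tabu move is allowed only
        # if it would beat the best solution so far (aspiration criterion)
        candidates = []
        for j in range(p):
            neigh = current[:]
            neigh[j] = 1 - neigh[j]
            if sum(neigh) == 0:
                continue
            score = aic(X, y, neigh)
            if tabu.get(j, -1) < it or score < best_score:
                candidates.append((score, j, neigh))
        if not candidates:
            continue
        score, j, neigh = min(candidates)
        current = neigh          # accept the best neighbour, even if worse
        tabu[j] = it + tenure
        if score < best_score:
            best, best_score = neigh[:], score
    return best, best_score
```

Because the best non-tabu neighbour is accepted even when it worsens the objective, the search can climb out of local minima that would trap stepwise procedures.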


2017 ◽  
Vol 2017 ◽  
pp. 1-12 ◽  
Author(s):  
Andreas Mayr ◽  
Benjamin Hofner ◽  
Elisabeth Waldmann ◽  
Tobias Hepp ◽  
Sebastian Meyer ◽  
...  

Statistical boosting algorithms have spurred a great deal of research during the last decade. They combine a powerful machine learning approach with classical statistical modelling, offering various practical advantages such as automated variable selection and implicit regularization of effect estimates. They are extremely flexible, as the underlying base-learners (regression functions defining the type of effect for the explanatory variables) can be combined with any kind of loss function (the target function to be optimized, defining the type of regression setting). In this review article, we highlight the most recent methodological developments in statistical boosting regarding variable selection, functional regression, and advanced time-to-event modelling. Additionally, we provide a short overview of relevant applications of statistical boosting in biomedicine.
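The automated variable selection mentioned above arises because component-wise boosting updates only one base-learner per step. A minimal sketch of component-wise L2-boosting with simple linear base-learners follows; the step length and number of steps are illustrative assumptions, not values from the review.

```python
import numpy as np

def l2_boost(X, y, n_steps=200, nu=0.1):
    """Component-wise L2-boosting with simple linear base-learners.
    At each step the single covariate that best fits the current residuals
    is nudged by a small step nu, so covariates that never help are never
    selected (implicit variable selection and shrinkage)."""
    n, p = X.shape
    coef = np.zeros(p)
    intercept = y.mean()
    resid = y - intercept
    for _ in range(n_steps):
        # least-squares slope of each single covariate against the residuals
        slopes = X.T @ resid / (X ** 2).sum(axis=0)
        # pick the covariate whose base-learner reduces the residuals most
        sse = ((resid[:, None] - X * slopes) ** 2).sum(axis=0)
        j = int(np.argmin(sse))
        coef[j] += nu * slopes[j]
        resid -= nu * slopes[j] * X[:, j]
    return intercept, coef
```

Stopping the loop early (e.g. by cross-validation over `n_steps`) is what produces the regularized, sparse fits the review describes.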


2021 ◽  
Vol 36 (4) ◽  
pp. 475-491
Author(s):  
Liu-cang Wu ◽  
Song-qin Yang ◽  
Ye Tao

Although there are many papers on variable selection methods based on the mean model in finite mixtures of regression models, little work has been done on how to select significant explanatory variables when modelling the variance parameter. In this paper, we propose and study a novel class of models: a skew-normal mixture of joint location and scale models for analyzing heteroscedastic skew-normal data from a heterogeneous population. The problem of variable selection for the proposed models is considered. In particular, a modified Expectation-Maximization (EM) algorithm for estimating the model parameters is developed. The consistency and the oracle property of the penalized estimators are established. Simulation studies are conducted to investigate the finite-sample performance of the proposed methodologies, and a real-data example illustrates their use.


2017 ◽  
Vol 21 (5) ◽  
Author(s):  
Ray-Bing Chen ◽  
Yi-Chi Chen ◽  
Chi-Hsiang Chu ◽  
Kuo-Jung Lee

We consider the determinants of the 2008 crisis and address two main forms of model uncertainty: uncertainty in selecting theoretical groups and uncertainty in selecting explanatory variables. We introduce a Bayesian hierarchical formulation that allows the joint treatment of group and variable selection using the Group-wise Gibbs sampler. Our group variable selection shows that pre-crisis financial policies and trade linkages, alongside institutions, play a particularly important role in explaining the severity of the crisis; within the selected groups we identify a broader set of variables correlated with the crisis, which in turn improves prediction performance. In the robustness analysis we also find that our results are not qualitatively changed under alternative measures of crisis intensity, different groupings of variables, or different prior assumptions. We further argue that established results in the literature may well be attributable to the different prior choices used in those analyses.
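The Gibbs-sampling treatment of variable-selection uncertainty can be illustrated at a much smaller scale. The sketch below is a single-level toy, not the paper's group-wise hierarchical sampler: it sweeps over binary inclusion indicators, resampling each from its full conditional under an assumed Zellner g-prior marginal likelihood, and reports posterior inclusion probabilities.

```python
import numpy as np

def log_marginal(X, y, g=100.0):
    """Approximate log marginal likelihood of a linear model under a
    Zellner g-prior (data assumed centred; an illustrative assumption)."""
    n = len(y)
    yty = y @ y
    if X.shape[1] == 0:
        return -0.5 * n * np.log(yty)
    XtX = X.T @ X
    Xty = X.T @ y
    fit = Xty @ np.linalg.solve(XtX, Xty)
    q = X.shape[1]
    return -0.5 * q * np.log(1 + g) - 0.5 * n * np.log(yty - g / (1 + g) * fit)

def gibbs_variable_selection(X, y, n_iter=500, prior_incl=0.5, seed=0):
    """Gibbs sweeps over inclusion indicators; returns posterior
    inclusion probabilities estimated from the sampled indicators."""
    rng = np.random.default_rng(seed)
    X = X - X.mean(axis=0)
    y = y - y.mean()
    p = X.shape[1]
    gamma = rng.integers(0, 2, p).astype(bool)
    counts = np.zeros(p)
    for it in range(n_iter):
        for j in range(p):
            logp = []
            for val in (False, True):
                gamma[j] = val
                lp = log_marginal(X[:, gamma], y)
                lp += np.log(prior_incl if val else 1 - prior_incl)
                logp.append(lp)
            diff = np.clip(logp[0] - logp[1], -50.0, 50.0)
            gamma[j] = rng.random() < 1.0 / (1.0 + np.exp(diff))
        counts += gamma
    return counts / n_iter
```

A grouped sampler in the spirit of the paper would additionally toggle whole blocks of indicators at once; here each variable is treated on its own.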


2020 ◽  
Vol 24 (5) ◽  
pp. 993-1010
Author(s):  
Hejie Lei ◽  
Xingke Chen ◽  
Ling Jian

Least absolute shrinkage and selection operator (LASSO) is one of the most commonly used methods for shrinkage estimation and variable selection. Robust variable selection methods via penalized regression, such as least absolute deviation LASSO (LAD-LASSO), have gained growing attention in the literature. However, those penalized regression procedures are still sensitive to noisy data. Furthermore, "concept drift" makes learning from streaming data fundamentally different from traditional batch learning. Focusing on shrinkage estimation and variable selection on noisy streaming data, this paper presents a noise-resilient online learning regression model, canal-LASSO. Compared with LASSO and LAD-LASSO, canal-LASSO is resistant to noise in both the explanatory and response variables. Extensive simulation studies demonstrate the satisfactory sparseness and noise resilience of canal-LASSO.
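For reference, the baseline LASSO that canal-LASSO builds on can be solved by cyclic coordinate descent with soft-thresholding; the sketch below is a minimal batch implementation (the penalty level and iteration count are illustrative), not the paper's online canal-LASSO.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator: the proximal map of the L1 penalty."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """LASSO via cyclic coordinate descent: minimizes
    (1/2n)||y - X beta||^2 + lam * ||beta||_1 on centred data."""
    X = X - X.mean(axis=0)
    y = y - y.mean()
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y - X @ beta
    for _ in range(n_iter):
        for j in range(p):
            resid += X[:, j] * beta[j]          # partial residual without feature j
            rho = X[:, j] @ resid / n
            beta[j] = soft_threshold(rho, lam) / col_sq[j]
            resid -= X[:, j] * beta[j]
    return beta
```

Coefficients whose partial correlation with the residuals stays below `lam` are set exactly to zero, which is the selection behaviour all the LASSO variants in the abstract share.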


Author(s):  
Verena Zuber ◽  
Korbinian Strimmer

Variable selection is a difficult problem that is particularly challenging in the analysis of high-dimensional genomic data. Here, we introduce the CAR score, a novel and highly effective criterion for variable ranking in linear regression based on Mahalanobis decorrelation of the explanatory variables. The CAR score provides a canonical ordering that encourages grouping of correlated predictors and down-weights antagonistic variables. It decomposes the proportion of variance explained and is intermediate between the marginal correlation and the standardized regression coefficient. As a population quantity, any preferred inference scheme can be applied for its estimation. Using simulations, we demonstrate that variable selection by CAR scores is very effective and yields prediction errors and true and false positive rates that compare favorably with modern regression techniques such as the elastic net and boosting. We illustrate our approach by analyzing data on diabetes progression and on the effect of aging on gene expression in the human brain. The R package "care" implementing CAR score regression is available from CRAN.
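Empirically, CAR scores are the correlations between the response and the Mahalanobis-decorrelated (whitened) predictors. A minimal numpy sketch, using a plain empirical correlation matrix rather than the shrinkage estimators the "care" package offers:

```python
import numpy as np

def car_scores(X, y):
    """Empirical CAR scores: correlations between the response and the
    Mahalanobis-decorrelated predictors, omega = R^(-1/2) r_xy."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    ys = (y - y.mean()) / y.std()
    R = np.corrcoef(Xs, rowvar=False)        # predictor correlation matrix
    r_xy = Xs.T @ ys / len(y)                # marginal correlations with y
    # inverse matrix square root R^(-1/2) via eigendecomposition
    vals, vecs = np.linalg.eigh(R)
    R_inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T
    return R_inv_sqrt @ r_xy
```

The squared CAR scores sum to the in-sample R-squared, which is the variance decomposition the abstract refers to; ranking variables by the absolute scores gives the canonical ordering.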


2013 ◽  
Vol 50 (1) ◽  
pp. 1-14 ◽  
Author(s):  
Anabela Marques ◽  
Ana Sousa Ferreira ◽  
Margarida G.M.S. Cardoso

In Discrete Discriminant Analysis one often has to deal with dimensionality problems. In fact, even a moderate number of explanatory variables leads to an enormous number of possible states (outcomes) relative to the number of objects under study, as occurs particularly in the social sciences, humanities and health-related fields. As a consequence, classification or discriminant models may exhibit poor performance due to the large number of parameters to be estimated. In the present paper, we discuss variable selection techniques which aim to address the issue of dimensionality. We specifically perform classification using a combined model approach. In this setting, variable selection is particularly pertinent, enabling the handling of degrees of freedom and reducing computational cost.


Author(s):  
Nicholas A. Nechval ◽  
Konstantin N. Nechval ◽  
Maris Purgailis ◽  
Uldis Rozevskis

The problem of variable selection is one of the most pervasive model selection problems in statistical applications. Often referred to as the problem of subset selection, it arises when one wants to model the relationship between a variable of interest and a subset of potential explanatory variables or predictors, but there is uncertainty about which subset to use. Several papers have dealt with various aspects of the problem, but it appears that the typical regression user has not benefited appreciably. One reason for the lack of resolution is that the problem has not been well defined. Indeed, it is apparent that there is not a single problem, but rather several problems for which different answers might be appropriate. The intent of this chapter is not to give specific answers but merely to present a new, simple multiplicative variable selection criterion based on the parametrically penalized residual sum of squares to address the subset selection problem in multiple linear regression analysis, where the objective is to select a minimal subset of predictor variables without sacrificing any explanatory power. The variables that optimize this criterion are chosen as the best variables. The authors find that the proposed criterion performs consistently well across a wide variety of variable selection problems, and its practical utility is demonstrated by numerical examples.
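The general shape of a multiplicative penalized-RSS subset search can be sketched as follows. The chapter's specific criterion is not reproduced here; generalized cross-validation (GCV), RSS divided by n(1 - k/n)^2, serves as a well-known stand-in multiplicative penalty.

```python
import itertools
import numpy as np

def rss(X, y, cols):
    """Residual sum of squares of OLS on the chosen columns (plus intercept)."""
    Xs = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    r = y - Xs @ beta
    return float(r @ r)

def best_subset(X, y):
    """All-subsets search minimizing a multiplicative penalized-RSS
    criterion (GCV here, as a stand-in for the chapter's own penalty)."""
    n, p = X.shape
    best_cols, best_score = None, np.inf
    for k in range(1, p + 1):
        for cols in itertools.combinations(range(p), k):
            # RSS inflated by a factor that grows with the parameter count
            score = rss(X, y, cols) / (n * (1 - (k + 1) / n) ** 2)
            if score < best_score:
                best_cols, best_score = cols, score
    return best_cols, best_score
```

Because the penalty multiplies rather than adds to the RSS, the trade-off between fit and parsimony is scale-free: rescaling y leaves the ranking of subsets unchanged.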


2015 ◽  
Vol 2015 ◽  
pp. 1-11 ◽  
Author(s):  
Susana Perez-Alvarez ◽  
Guadalupe Gómez ◽  
Christian Brander

Large datasets including an extensive number of covariates are generated these days in many different situations, for instance, in detailed genetic studies of outbred human populations or in complex analyses of immune responses to different infections. Aiming to inform clinical interventions or vaccine design, methods for variable selection that identify the variables with the optimal prediction performance for a specific outcome are crucial. However, testing all potential subsets of variables is not feasible, and alternatives to existing methods are needed. Here, we describe a new method to handle such complex datasets, referred to as FARMS, that combines forward and all-subsets regression for model selection. We apply FARMS to a host genetic and immunological dataset of over 800 individuals from Lima (Peru) and Durban (South Africa) who were HIV infected and tested for antiviral immune responses. This dataset includes more than 500 explanatory variables: around 400 variables with information on HIV immune reactivity and around 100 individual genetic characteristics. We have implemented FARMS in the R statistical language and show that FARMS is fast and outperforms comparable, commonly used approaches, thus providing a new tool for the thorough analysis of complex datasets without the need for massive computational infrastructure.
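The combination of forward and all-subsets regression can be sketched as a generic two-stage search. The details below (screening size, BIC as the final criterion) are this sketch's assumptions, not the published FARMS algorithm: forward selection screens a short candidate list, making the subsequent exhaustive search tractable.

```python
import itertools
import numpy as np

def rss(X, y, cols):
    """Residual sum of squares of OLS on the chosen columns (plus intercept)."""
    Xs = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    r = y - Xs @ beta
    return float(r @ r)

def forward_then_all_subsets(X, y, screen=10):
    """Two-stage search in the spirit of forward + all-subsets regression:
    forward selection screens `screen` candidates, then every subset of the
    screened list is compared by BIC."""
    n, p = X.shape
    chosen = []
    while len(chosen) < min(screen, p):
        rest = [j for j in range(p) if j not in chosen]
        # greedily add the variable that most reduces the RSS
        j_best = min(rest, key=lambda j: rss(X, y, chosen + [j]))
        chosen.append(j_best)
    best, best_bic = None, np.inf
    for k in range(1, len(chosen) + 1):
        for cols in itertools.combinations(chosen, k):
            bic = n * np.log(rss(X, y, list(cols)) / n) + np.log(n) * (k + 1)
            if bic < best_bic:
                best, best_bic = cols, bic
    return sorted(best)
```

Screening reduces the exhaustive stage from 2^p models to 2^screen, which is what makes an all-subsets pass feasible on datasets with hundreds of covariates.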


2021 ◽  
Vol 11 (11) ◽  
pp. 4938
Author(s):  
Jude Chibuike Nwadiuto ◽  
Hiroyuki Okuda ◽  
Tatsuya Suzuki

This paper proposes a hybrid system model, identified as a PWARX (piecewise affine autoregressive exogenous) model, for modeling human driving behavior. In the proposed model, mode segmentation is carried out automatically and the optimal number of modes is decided by a novel methodology based on consistent variable selection. In addition, model flexibility is added within the ARX (autoregressive exogenous) partitions in the form of statistical variable selection. The proposed method is able to capture both the decision-making and motion-control facets of driving behavior. The resulting model is an optimal basal model that is not affected by the choice of data, in which the explanatory variables are allowed to vary within each ARX region, allowing a higher-level understanding of the motion-control aspect of driving behavior as well as an explanation of the driver's decision-making. The proposed model is applied to the car-following driving task based on real-road driving data, as well as to a ROS-CARLA-based car-following simulation, and is compared to Gipps' driver model. The obtained results, which show better performance both in prediction and in mimicking actual real-road driving, demonstrate and validate the usefulness of the model.

