Variable Selection and Regularization in Quantile Regression via Minimum Covariance Determinant Based Weights

Entropy ◽  
2020 ◽  
Vol 23 (1) ◽  
pp. 33
Author(s):  
Edmore Ranganai ◽  
Innocent Mudhombo

The importance of variable selection and regularization procedures in multiple regression analysis cannot be overemphasized. These procedures are adversely affected by predictor space data aberrations as well as outliers in the response space. To counter the latter, robust statistical procedures such as quantile regression which generalizes the well-known least absolute deviation procedure to all quantile levels have been proposed in the literature. Quantile regression is robust to response variable outliers but very susceptible to outliers in the predictor space (high leverage points) which may alter the eigen-structure of the predictor matrix. High leverage points that alter the eigen-structure of the predictor matrix by creating or hiding collinearity are referred to as collinearity influential points. In this paper, we suggest generalizing the penalized weighted least absolute deviation to all quantile levels, i.e., to penalized weighted quantile regression using the RIDGE, LASSO, and elastic net penalties as a remedy against collinearity influential points and high leverage points in general. To maintain robustness, we make use of very robust weights based on the computationally intensive high breakdown minimum covariance determinant. Simulations and applications to well-known data sets from the literature show an improvement in variable selection and regularization due to the robust weighting formulation.
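As a rough illustration of the kind of objective described above, the following Python sketch evaluates a weighted check (pinball) loss plus an elastic net penalty for a given coefficient vector. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the penalty scaling, and the way the weights enter the loss are all hypothetical, and the robust weights are simply passed in rather than computed from the minimum covariance determinant.

```python
def check_loss(residual, tau):
    """Quantile (pinball) loss: rho_tau(r) = r * (tau - 1[r < 0])."""
    return residual * (tau - (1.0 if residual < 0 else 0.0))


def penalized_weighted_qr_objective(X, y, beta, weights, tau, lam, alpha):
    """Weighted check loss plus an elastic net penalty (no intercept).

    alpha = 1.0 gives a LASSO-type penalty, alpha = 0.0 a RIDGE-type one.
    `weights` stands in for robust (e.g. MCD-based) observation weights,
    which are assumed to be supplied by the caller.
    """
    loss = 0.0
    for x_i, y_i, w_i in zip(X, y, weights):
        fitted = sum(b * x for b, x in zip(beta, x_i))
        loss += w_i * check_loss(y_i - fitted, tau)
    l1 = sum(abs(b) for b in beta)   # LASSO part
    l2 = sum(b * b for b in beta)    # RIDGE part
    return loss + lam * (alpha * l1 + (1.0 - alpha) * l2)


# Hypothetical toy data: the third row is a high leverage point
# that receives a small weight.
X = [[1.0, 2.0], [2.0, 1.0], [50.0, 60.0]]
y = [3.0, 3.0, 0.0]
weights = [1.0, 1.0, 0.1]
objective = penalized_weighted_qr_objective(
    X, y, beta=[1.0, 1.0], weights=weights, tau=0.5, lam=0.1, alpha=1.0
)
```

Setting `tau = 0.5` recovers a penalized weighted least absolute deviation objective (up to a factor of one half), which is the special case the abstract says is being generalized to all quantile levels.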

Author(s):  
Ibrahim Abdullahi ◽  
Abubakar Yahaya

In this article, we consider an alternative to ordinary least squares (OLS) regression based on the analytical solution in the Statgraphics software, namely the quantile regression (QR) model. We also present a goodness-of-fit statistic as well as approximate distributions of the associated test statistics for the parameters. Furthermore, we suggest a goodness-of-fit statistic called the least absolute deviation (LAD) coefficient of determination. The procedure is illustrated and validated by a numerical example based on a publicly available dataset on fuel consumption in miles per gallon in highway driving.
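The abstract does not state how the LAD coefficient of determination is defined. One commonly used analogue of R² for median regression compares the sum of absolute residuals of the fitted model with that of a median-only model; the sketch below assumes that form, which may differ from the authors' statistic. The function name `lad_r1` and the toy numbers are hypothetical.

```python
from statistics import median


def lad_r1(y, fitted):
    """Assumed LAD analogue of R^2: 1 - SAD(model) / SAD(median-only)."""
    med = median(y)
    sad_model = sum(abs(obs - fit) for obs, fit in zip(y, fitted))
    sad_null = sum(abs(obs - med) for obs in y)
    return 1.0 - sad_model / sad_null


# Hypothetical toy data: the last observation is poorly fitted.
y = [1.0, 2.0, 3.0, 4.0, 10.0]
fitted = [1.0, 2.0, 3.0, 4.0, 5.0]
score = lad_r1(y, fitted)
```

Under this assumed definition, a perfect fit scores 1 and a model that predicts the sample median everywhere scores 0.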


2018 ◽  
Vol 21 (2) ◽  
pp. 117-124 ◽  
Author(s):  
Bakhtyar Sepehri ◽  
Nematollah Omidikia ◽  
Mohsen Kompany-Zareh ◽  
Raouf Ghavami

Aims & Scope: In this research, 8 variable selection approaches were used to investigate the effect of variable selection on the predictive power and stability of CoMFA models. Materials & Methods: Three data sets comprising 36 EPAC antagonists, 79 CD38 inhibitors and 57 ATAD2 bromodomain inhibitors were modelled by CoMFA. For each data set, a CoMFA model was first built with all CoMFA descriptors; a new CoMFA model was then developed after applying each variable selection method, so 9 CoMFA models were built per data set. The results show that noisy and uninformative variables affect CoMFA results. Based on the created models, applying 5 variable selection approaches, namely FFD, SRD-FFD, IVE-PLS, SRD-UVE-PLS and SPA-jackknife, increases the predictive power and stability of CoMFA models significantly. Result & Conclusion: Among them, SPA-jackknife removes most of the variables while FFD retains most of them. FFD and IVE-PLS are time-consuming processes, while SRD-FFD and SRD-UVE-PLS need only a few seconds to run. Also, applying FFD, SRD-FFD, IVE-PLS and SRD-UVE-PLS preserves the CoMFA contour map information for both fields.


2012 ◽  
Vol 38 (2) ◽  
pp. 57-69 ◽  
Author(s):  
Abdulghani Hasan ◽  
Petter Pilesjö ◽  
Andreas Persson

Global change and GHG emission modelling are dependent on accurate wetness estimations for predictions of e.g. methane emissions. This study aims to quantify how the slope, drainage area and the TWI vary with the resolution of DEMs for a flat peatland area. Six DEMs with spatial resolutions from 0.5 to 90 m were interpolated with four different search radii. The relationship between accuracy of the DEM and the slope was tested. The LiDAR elevation data was divided into two data sets. The number of data points facilitated an evaluation dataset with data points not more than 10 mm away from the cell centre points in the interpolation dataset. The DEM was evaluated using a quantile-quantile test and the normalized median absolute deviation. It showed independence of the resolution when using the same search radius. The accuracy of the estimated elevation for different slopes was tested using the 0.5 m DEM and it showed a higher deviation from evaluation data for steep areas. The slope estimations between resolutions showed differences with values that exceeded 50%. Drainage areas were tested for three resolutions, with coinciding evaluation points. The model's ability to generate drainage area at each resolution was tested by pairwise comparison of three data subsets and showed differences of more than 50% in 25% of the evaluated points. The results show that consideration of DEM resolution is a necessity for the use of slope, drainage area and TWI data in large scale modelling.
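The abstract names the normalized median absolute deviation (NMAD) without giving its formula. A minimal sketch of the usual definition, NMAD = 1.4826 × median(|Δh − median(Δh)|), is shown below, where Δh are the elevation errors between the DEM and the evaluation points. The function name and the sample errors are hypothetical, and the study's exact computation may differ.

```python
from statistics import median


def nmad(errors):
    """Normalized median absolute deviation of a list of elevation errors.

    The factor 1.4826 makes the value comparable to a standard deviation
    when the errors are normally distributed.
    """
    centre = median(errors)
    abs_dev = [abs(e - centre) for e in errors]
    return 1.4826 * median(abs_dev)


# Hypothetical elevation errors in metres; the last value is a gross outlier.
errors = [-0.02, 0.01, 0.0, 0.03, 2.5]
spread = nmad(errors)
```

Because both steps use medians, the single gross outlier in the toy data has little effect on the result, which is why NMAD is often preferred over the standard deviation for DEM accuracy assessment.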


2014 ◽  
Vol 35 (8) ◽  
pp. 1657-1683 ◽  
Author(s):  
ANDY SHARMA

With the on-going ageing of the United States population, resolving health disparities continues to be a prominent and worthwhile goal, particularly in the areas of promoting minority health and reducing racial/ethnic disparities. This analysis employs the 2004 and 2005 Household Component records from the Medical Expenditures Panel Survey, which correspond to data files H89 and H97, to examine utilisation by race across the entire distribution function; more specifically, applying the behavioural model of health services utilisation and employing a Quantile Regression (QR) framework. This is a noteworthy contribution because the conditional mean may not be the best approximation for a skewed-location distribution. In contrast, QR is robust to outliers and scale effects since the estimation minimises least absolute deviation. The sample consists of 2,525 older adults at least 65 years of age with 303 corresponding to Black and 2,222 corresponding to White. Results suggest older Blacks continue to utilise health services (i.e. office or clinic visits with a physician or medical provider) at lower levels and this is more pronounced at and below the median quantile (i.e. below the 50th cut-off). Usual source of care (USC) continues to play an important role. Beliefs surrounding the need for insurance and medical intervention are also significant and explain some of the racial disparities. Although utilisation disparities persist for older Blacks, collaborative and flexible models of care can reach this group.


2005 ◽  
Vol 15 (01n02) ◽  
pp. 101-110 ◽  
Author(s):  
TIMO SIMILÄ ◽  
SAMPSA LAINE

Practical data analysis often encounters data sets with both relevant and useless variables. Supervised variable selection is the task of selecting the relevant variables based on some predefined criterion. We propose a robust method for this task. The user manually selects a set of target variables and trains a Self-Organizing Map with these data. This sets a criterion for variable selection and is an illustrative description of the user's problem, even for multivariate target data. The user also defines another set of variables that are potentially related to the problem. Our method returns a subset of these variables, which best corresponds to the description provided by the Self-Organizing Map and, thus, agrees with the user's understanding of the problem. The method is conceptually simple and, based on experiments, allows an accessible approach to supervised variable selection.

