The One Standard Error Rule for Model Selection: Does It Work?

Stats ◽  
2021 ◽  
Vol 4 (4) ◽  
pp. 868-892
Author(s):  
Yuchen Chen ◽  
Yuhong Yang

Previous research has provided much discussion of the selection of regularization parameters in the application of regularization methods for high-dimensional regression. The popular "One Standard Error Rule" (1se rule) used with cross validation (CV) selects the most parsimonious model whose prediction error is not much worse than the minimum CV error. This paper examines the validity of the 1se rule from a theoretical angle and also studies its estimation accuracy and performance in regression estimation and variable selection, particularly for Lasso in a regression framework. Our theoretical result shows that when a regression procedure produces an estimator converging relatively fast to the true regression function, the standard error estimation formula in the 1se rule is asymptotically justified. The numerical results show the following: (1) the 1se rule in general does not necessarily provide a good estimate of the intended standard deviation of the cross validation error; the estimation bias can be 50–100% upwards or downwards in various situations; (2) the results tend to support that the 1se rule usually outperforms regular CV in sparse variable selection and alleviates the over-selection tendency of Lasso; (3) in regression estimation or prediction, the 1se rule often performs worse. In addition, comparisons are made over two real data sets: Boston Housing Prices (large sample size n, small/moderate number of variables p) and Bardet–Biedl data (large p, small n). Data-guided simulations are done to provide insight into the relative performance of the 1se rule and regular CV.
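For concreteness, the selection step described above can be sketched in a few lines of Python; the candidate grid and fold-by-fold error matrix below are placeholders, and larger penalties are taken to mean sparser models, as for Lasso:

```python
import numpy as np

def one_se_rule(lambdas, cv_errors):
    """Select lambda by the one-standard-error rule.

    lambdas   : candidate regularization parameters, shape (L,)
    cv_errors : per-fold CV errors, shape (K, L) for K folds
    Returns the most parsimonious (largest-penalty) lambda whose mean
    CV error is within one standard error of the minimum.
    """
    mean_err = cv_errors.mean(axis=0)                 # mean CV error per lambda
    se_err = cv_errors.std(axis=0, ddof=1) / np.sqrt(cv_errors.shape[0])
    i_min = np.argmin(mean_err)                       # lambda minimizing CV error
    threshold = mean_err[i_min] + se_err[i_min]       # one-SE band above the minimum
    eligible = np.where(mean_err <= threshold)[0]     # all lambdas inside the band
    return lambdas[eligible].max()                    # largest penalty = sparsest model

# Example: 5 folds, 4 candidate lambdas (placeholder numbers)
rng = np.random.default_rng(0)
lambdas = np.array([0.01, 0.05, 0.1, 0.5])
cv_errors = 1.0 + 0.1 * rng.standard_normal((5, 4))
print(one_se_rule(lambdas, cv_errors))
```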

2001 ◽  
Vol 15 (4) ◽  
pp. 11-28 ◽  
Author(s):  
John DiNardo ◽  
Justin L Tobias

We provide a nontechnical review of recent nonparametric methods for estimating density and regression functions. The methods we describe make it possible for a researcher to estimate a regression function or density without having to specify in advance a particular--and hence potentially misspecified--functional form. We compare these methods to more popular parametric alternatives (such as OLS), illustrate their use in several applications, and demonstrate their flexibility with actual data and generated-data experiments. We show that these methods are intuitive and easily implemented, and in the appropriate context may provide an attractive alternative to “simpler” parametric methods.
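As an illustration of the kind of estimator reviewed, here is a minimal Nadaraya-Watson kernel regression sketch with a Gaussian kernel; the bandwidth and the generated data are illustrative, not from the paper:

```python
import numpy as np

def nadaraya_watson(x_train, y_train, x_eval, h):
    """Nadaraya-Watson kernel regression with a Gaussian kernel.
    h is the bandwidth: small h tracks the data, large h oversmooths."""
    # Pairwise kernel weights between evaluation and training points
    u = (x_eval[:, None] - x_train[None, :]) / h
    w = np.exp(-0.5 * u**2)
    return (w @ y_train) / w.sum(axis=1)

# Generated-data experiment in the spirit of the review:
# no functional form is specified in advance
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 2 * np.pi, 200))
y = np.sin(x) + 0.3 * rng.standard_normal(200)
grid = np.linspace(0, 2 * np.pi, 100)
fit = nadaraya_watson(x, y, grid, h=0.3)
```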


2017 ◽  
Vol 24 (6) ◽  
pp. 1283-1295 ◽  
Author(s):  
Tomáš Faragó ◽  
Petr Mikulík ◽  
Alexey Ershov ◽  
Matthias Vogelgesang ◽  
Daniel Hänschke ◽  
...  

An open-source framework for conducting a broad range of virtual X-ray imaging experiments, syris, is presented. The simulated wavefield created by a source propagates through an arbitrary number of objects until it reaches a detector. The objects in the light path and the source are time-dependent, which enables simulations of dynamic experiments, e.g. four-dimensional time-resolved tomography and laminography. The high-level interface of syris is written in Python, and its modularity makes the framework very flexible. The computationally demanding parts behind this interface are implemented in OpenCL, which enables fast calculations on modern graphics processing units. The combination of flexibility and speed opens new possibilities for studying novel imaging methods and for systematic searches for optimal combinations of measurement conditions and data processing parameters. This can help to increase the success rates and efficiency of valuable synchrotron beam time. To demonstrate the capabilities of the framework, various experiments have been simulated and compared with real data. To show the use case of measurement and data processing parameter optimization based on simulation, a virtual counterpart of a high-speed radiography experiment was created and the simulated data were used to select a suitable motion estimation algorithm; one of its parameters was optimized in order to achieve the best motion estimation accuracy when applied to the real data. syris was also used to simulate tomographic data sets under various imaging conditions which impact the tomographic reconstruction accuracy, and it is shown how the accuracy may guide the selection of imaging conditions for particular use cases.
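The propagation step described (a source wavefield passing through objects to a detector) can be illustrated with a generic Fresnel transfer-function sketch in numpy; this is not the syris API, and the wavelength, pixel size, and object are placeholder values:

```python
import numpy as np

def fresnel_propagate(wavefield, pixel_size, wavelength, distance):
    """Propagate a complex 2D wavefield over `distance` of free space
    using the Fresnel (paraxial) transfer function in Fourier space."""
    n_y, n_x = wavefield.shape
    fx = np.fft.fftfreq(n_x, d=pixel_size)   # spatial frequencies [1/m]
    fy = np.fft.fftfreq(n_y, d=pixel_size)
    fx2, fy2 = np.meshgrid(fx**2, fy**2, indexing="xy")
    H = np.exp(-1j * np.pi * wavelength * distance * (fx2 + fy2))
    return np.fft.ifft2(np.fft.fft2(wavefield) * H)

# Toy pipeline: plane wave -> absorbing object -> detector at 1 m
n = 512
wave = np.ones((n, n), dtype=complex)               # idealized monochromatic source
obj = np.ones((n, n)); obj[200:300, 200:300] = 0.5  # simple transmission object
detector_plane = fresnel_propagate(wave * obj, 1e-6, 1e-10, 1.0)
intensity = np.abs(detector_plane) ** 2             # what the detector records
```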


Geophysics ◽  
2018 ◽  
Vol 83 (6) ◽  
pp. V345-V357 ◽  
Author(s):  
Nasser Kazemi

Given noise-corrupted seismic recordings, blind deconvolution solves simultaneously for the reflectivity series and the wavelet. Blind deconvolution can be formulated as a fully perturbed linear regression model and solved by the total least-squares (TLS) algorithm. However, this algorithm performs poorly when the data matrix is structured and ill-conditioned, and in blind deconvolution the data matrix has a Toeplitz structure and is ill-conditioned. Accordingly, we develop a fully automatic single-channel blind-deconvolution algorithm to improve the performance of the TLS method. The proposed algorithm, called Toeplitz-structured sparse TLS, makes no assumptions about the phase of the wavelet, but it assumes that the reflectivity series is sparse. In addition, to reduce the model space and the number of unknowns, the algorithm benefits from the structural constraints on the data matrix. Our algorithm is an alternating minimization method and uses a generalized cross validation (GCV) function to set the optimal regularization parameter automatically. Because the GCV function does not require any prior information about the noise level of the data, our approach is suitable for real-world applications. We validate the proposed technique using synthetic examples. On noise-free data, we achieve a near-optimal recovery of the wavelet and the reflectivity series. On noise-corrupted data with a moderate signal-to-noise ratio (S/N), the algorithm successfully accounts for the noise in its model, resulting in satisfactory performance. However, the results deteriorate as the S/N and the sparsity level of the data decrease. We also successfully apply the algorithm to real data from 2D and 3D data sets of the Teapot Dome seismic survey.
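A simplified sketch of the alternating-minimization idea (not the authors' Toeplitz-structured sparse TLS, and with a fixed sparsity weight in place of the GCV-selected one) might look as follows:

```python
import numpy as np
from scipy.linalg import toeplitz

def soft(x, t):
    """Soft-thresholding, the proximal operator of the l1 norm."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def blind_deconv(d, m, lam, n_outer=20, n_ista=50):
    """Simplified alternating minimization for single-channel blind deconvolution.
    d: observed trace (length n + m - 1), m: wavelet length, lam: sparsity weight.
    Alternates a sparse reflectivity update (ISTA) with a least-squares wavelet update."""
    n = len(d) - m + 1
    w = np.zeros(m); w[m // 2] = 1.0             # initialize with a spike wavelet
    r = np.zeros(n)
    for _ in range(n_outer):
        # Reflectivity step: ISTA on 0.5*||conv(w, r) - d||^2 + lam*||r||_1
        step = 1.0 / max(np.sum(np.abs(w)) ** 2, 1e-12)  # safe step size
        for _ in range(n_ista):
            resid = np.convolve(w, r) - d
            r = soft(r - step * np.correlate(resid, w, mode="valid"), step * lam)
        # Wavelet step: least squares with the Toeplitz convolution matrix of r
        col = np.concatenate([r, np.zeros(m - 1)])
        R = toeplitz(col, np.concatenate([[r[0]], np.zeros(m - 1)]))
        w = np.linalg.lstsq(R, d, rcond=None)[0]
    return r, w
```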


2020 ◽  
Author(s):  
Alessandro Fassò ◽  
Michael Sommer ◽  
Christoph von Rohden

Abstract. This paper is motivated by the fact that, although temperature readings made by Vaisala RS41 radiosondes at GRUAN sites (http://www.gruan.org) are given at 1 s resolution, for various reasons missing data are spread along the atmospheric profile. This problem is quite common in radiosonde data and other profile data, and (linear) interpolation is often used to fill the gaps in published data products. From this perspective, the present paper considers interpolation uncertainty. To do this, a statistical approach is introduced that gives some understanding of the consequences of substituting missing data with interpolated values. In particular, a general framework for the computation of interpolation uncertainty based on a Gaussian process (GP) set-up is developed. Using the GP characteristics, a simple formula for computing the linear interpolation standard error is given. Moreover, GP interpolation is proposed as an alternative interpolation method with its own standard error. For the Vaisala RS41, the two approaches are shown to give similar interpolation performance in an extensive cross-validation exercise based on the block-bootstrap technique. Statistical results about interpolation uncertainties at various GRUAN sites and for various gap lengths are provided. Since both approaches underestimate the cross-validation interpolation uncertainty, a bootstrap-based correction formula is proposed. Using the root mean square error, it is found that, for short gaps with an average length of 5 s, the average uncertainty is smaller than 0.10 K. For larger gaps, it increases up to 0.35 K for an average gap length of 30 s, and up to 0.58 K for a gap of 60 s.
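The linear-interpolation standard error under a GP assumption can be written down directly from the covariance function; the sketch below uses an exponential covariance with illustrative parameters, not the values fitted in the paper:

```python
import numpy as np

def exp_cov(s, t, sigma2=1.0, ell=10.0):
    """Exponential (Ornstein-Uhlenbeck) covariance; sigma2 and ell are illustrative."""
    return sigma2 * np.exp(-np.abs(s - t) / ell)

def linear_interp_se(t, t1, t2, cov=exp_cov):
    """Standard error of the linear interpolator of a GP at time t in [t1, t2],
    i.e. of x_hat(t) = (1-a)*x(t1) + a*x(t2)."""
    a = (t - t1) / (t2 - t1)                 # interpolation weight
    var = (cov(t, t)
           + (1 - a) ** 2 * cov(t1, t1)
           + a ** 2 * cov(t2, t2)
           + 2 * a * (1 - a) * cov(t1, t2)
           - 2 * (1 - a) * cov(t, t1)
           - 2 * a * cov(t, t2))
    return np.sqrt(max(var, 0.0))

# Error at the midpoint of a 30 s gap in a 1 s resolution profile
print(linear_interp_se(15.0, 0.0, 30.0))
```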


Author(s):  
Huijun Guo ◽  
Junke Kou

This paper considers wavelet estimation of a regression function based on a negatively associated sample. We provide upper bounds on the [Formula: see text] risk of linear and nonlinear wavelet estimators in Besov spaces. When the random sample reduces to the independent case, our convergence rates coincide with the optimal convergence rates of classical nonparametric regression estimation.
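A standard nonlinear wavelet regression estimator of the thresholding type can be sketched with PyWavelets; the universal threshold and i.i.d. noise below are simplifications (the paper's setting is a negatively associated sample):

```python
import numpy as np
import pywt

def wavelet_regression(y, wavelet="db4", level=4):
    """Nonlinear wavelet regression estimate for equispaced noisy observations:
    decompose, soft-threshold the detail coefficients, reconstruct."""
    coeffs = pywt.wavedec(y, wavelet, level=level)
    # Universal threshold with a robust noise estimate from the finest scale
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745
    thr = sigma * np.sqrt(2 * np.log(len(y)))
    coeffs[1:] = [pywt.threshold(c, thr, mode="soft") for c in coeffs[1:]]
    return pywt.waverec(coeffs, wavelet)[: len(y)]

# Equispaced design with i.i.d. noise as a stand-in sample
rng = np.random.default_rng(2)
x = np.linspace(0, 1, 512)
y = np.sin(4 * np.pi * x) + 0.2 * rng.standard_normal(512)
fhat = wavelet_regression(y)
```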


2014 ◽  
Vol 70 (5) ◽  
Author(s):  
Nor Fazila Rasaruddin ◽  
Mas Ezatul Nadia Mohd Ruah ◽  
Mohamed Noor Hasan ◽  
Mohd Zuli Jaafar

This paper describes the determination of the iodine value (IV) of pure and frying palm oils using Partial Least Squares (PLS) regression with variable selection. A total of 28 samples of pure and frying palm oils were acquired from markets. Seven of them were considered high-priced palm oils, while the remainder were low-priced. PLS regression models were developed for the determination of IV using Fourier Transform Infrared (FTIR) spectra in absorbance mode in the range from 650 cm-1 to 4000 cm-1. A Savitzky-Golay derivative was applied before developing the prediction models. The models were constructed using wavelengths selected in the FTIR region via the selectivity ratio (SR) plot and the correlation coefficient with the IV parameter. Each model was validated through the root mean square error of cross validation (RMSECV) and the cross-validation correlation coefficient, R2cv. The best model using the SR plot was the model with mean centering for the pure samples and the model with a combination of row scaling and standardization for the frying samples. The best model using correlation-coefficient variable selection was the model with a combination of row scaling and standardization for the pure samples and the model with mean centering for the frying samples. It is not necessary to row-scale the variables when developing the model, since the effect of row scaling on model quality is insignificant.
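The RMSECV validation step can be sketched with scikit-learn's PLS implementation; the spectra and iodine values below are placeholders, not the paper's data:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_predict

def rmsecv(X, y, n_components, cv=7):
    """Root mean square error of cross validation for a PLS model."""
    pls = PLSRegression(n_components=n_components)
    y_pred = cross_val_predict(pls, X, y, cv=cv)
    return np.sqrt(np.mean((y - y_pred.ravel()) ** 2))

# X: FTIR absorbances (rows = oil samples, columns = wavenumbers retained
# by the variable selection step); y: reference iodine values. Both are
# placeholder arrays here.
rng = np.random.default_rng(3)
X = rng.standard_normal((28, 120))      # placeholder spectra, 28 samples
y = 50 + 10 * rng.standard_normal(28)   # placeholder iodine values
X = X - X.mean(axis=0)                  # mean centering pre-processing
print(rmsecv(X, y, n_components=3))
```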


Author(s):  
Xuan Cao ◽  
Lili Ding ◽  
Tesfaye B. Mersha

Abstract. In this study, we conduct a comparison of three recent statistical methods for joint variable selection and covariance estimation, with application to detecting expression quantitative trait loci (eQTL) and gene network estimation, and introduce a new hierarchical Bayesian method to include in the comparison. Unlike the traditional univariate regression approach in eQTL, all four methods correlate phenotypes and genotypes through multivariate regression models that incorporate the dependence information among phenotypes, and use Bayesian multiplicity adjustment to avoid the multiple testing burdens raised by traditional multiple testing correction methods. We present the performance of three methods (MSSL: Multivariate Spike and Slab Lasso; SSUR: Sparse Seemingly Unrelated Bayesian Regression; and OBFBF: Objective Bayes Fractional Bayes Factor), along with the proposed JDAG (Joint estimation via a Gaussian Directed Acyclic Graph model) method, through simulation experiments and publicly available HapMap real data, taking asthma as an example. Compared with the existing methods, JDAG identified networks with higher sensitivity and specificity under row-wise sparse settings. JDAG requires less execution time in small-to-moderate dimensions, but is not currently applicable to high-dimensional data. The eQTL analysis of the asthma data recovered a number of known gene regulations, such as STARD3, IKZF3 and PGAP3, all reported in asthma studies. The code of the proposed method is freely available at GitHub (https://github.com/xuan-cao/Joint-estimation-for-eQTL).
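The Bayesian methods compared (MSSL, SSUR, OBFBF, JDAG) need dedicated implementations, but the row-wise sparsity idea, selecting the same genotypes across correlated phenotypes, has a simple frequentist analogue in the multi-task Lasso; the sketch below is that analogue, with placeholder data:

```python
import numpy as np
from sklearn.linear_model import MultiTaskLasso

# MultiTaskLasso selects the same features for all responses (row-wise
# sparsity in the coefficient matrix), echoing the row-wise sparse
# settings in the comparison. All data here are simulated placeholders.
rng = np.random.default_rng(4)
n, p, q = 100, 50, 5                   # samples, genotypes, phenotypes
X = rng.standard_normal((n, p))        # genotype matrix (placeholder)
B = np.zeros((p, q))
B[:3] = rng.standard_normal((3, q))    # 3 genotypes affect all phenotypes
Y = X @ B + 0.5 * rng.standard_normal((n, q))

model = MultiTaskLasso(alpha=0.1).fit(X, Y)
active = np.where(np.any(model.coef_.T != 0, axis=1))[0]  # selected genotypes
print(active)
```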


2013 ◽  
Vol 284-287 ◽  
pp. 3111-3114
Author(s):  
Hsiang Chuan Liu ◽  
Wei Sung Chen ◽  
Ben Chang Shia ◽  
Chia Chen Lee ◽  
Shang Ling Ou ◽  
...  

In this paper, a novel fuzzy measure, the high-order lambda measure, is proposed. Based on the Choquet integral with respect to this new measure, a novel composition forecasting model combining the GM(1,1) forecasting model, the time series model, and the exponential smoothing model is also proposed. To evaluate the efficiency of this improved composition forecasting model, an experiment on real data using the 5-fold cross-validation mean square error was conducted. The performances of the Choquet integral composition forecasting model with the P-measure, the Lambda-measure, the L-measure, and the high-order lambda measure, respectively, a ridge regression composition forecasting model, a multiple linear regression composition forecasting model, and the traditional linear weighted composition forecasting model were compared. The experimental results showed that the Choquet integral composition forecasting model with respect to the high-order lambda measure has the best performance.
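The Choquet integral with respect to a Sugeno lambda-measure (the classical Lambda-measure among those compared) can be sketched as follows; the high-order lambda measure itself is the paper's contribution and is not reproduced here, and the forecasts and densities below are placeholders:

```python
import numpy as np
from scipy.optimize import brentq

def sugeno_lambda(densities):
    """Solve prod(1 + lam*g_i) = 1 + lam for the Sugeno lambda parameter."""
    f = lambda lam: np.prod(1 + lam * densities) - (1 + lam)
    if np.isclose(np.sum(densities), 1.0):
        return 0.0                           # additive case
    # lam in (-1, 0) when densities sum above 1, lam > 0 when below
    lo, hi = (-1 + 1e-9, -1e-9) if np.sum(densities) > 1 else (1e-9, 1e6)
    return brentq(f, lo, hi)

def lambda_measure(subset, densities, lam):
    """Measure of a subset (indices) under the lambda-measure:
    g(A u {i}) = g(A) + g_i + lam * g(A) * g_i."""
    g = 0.0
    for i in subset:
        g = g + densities[i] + lam * g * densities[i]
    return g

def choquet(x, densities):
    """Choquet integral of x (e.g. individual model forecasts) with respect
    to the lambda-measure built from the given densities (model weights)."""
    lam = sugeno_lambda(densities)
    order = np.argsort(x)[::-1]              # forecasts in decreasing order
    total, prev = 0.0, 0.0
    for k in range(len(x)):
        g = lambda_measure(order[: k + 1], densities, lam)
        total += x[order[k]] * (g - prev)    # increment of the measure
        prev = g
    return total

# Combine three placeholder forecasts (GM(1,1), time series, exp. smoothing)
print(choquet(np.array([10.2, 9.8, 10.5]), np.array([0.4, 0.3, 0.5])))
```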


2009 ◽  
Vol 44 (2) ◽  
pp. 142-147 ◽  
Author(s):  
Jean-Claude Pineau ◽  
Jean Robert Filliard ◽  
Michel Bocquet

Abstract Context: For athletes in disciplines with weight categories, it is important to assess body composition and weight fluctuations. Objective: To evaluate the accuracy of body fat percentage measured with a portable ultrasound device, possessing high accuracy and reliability, against fan-beam, dual-energy X-ray absorptiometry (DEXA). Design: Cross-validation study. Setting: Research laboratory. Patients or Other Participants: A total of 93 athletes (24 women, 69 men), aged 23.5 ± 3.7 years, with body mass index = 24.0 ± 4.2 and body fat percentage via DEXA = 9.41 ± 8.1, participated. All participants were elite athletes selected from the Institut National des Sports et de l'Education Physique and practiced a variety of weight-category sports. Main Outcome Measure(s): We measured body fat and body fat percentage using an ultrasound technique combined with anthropometric values and the DEXA reference technique, then performed a cross-validation between the two. Results: Ultrasound estimates of body fat percentage correlated closely with those of DEXA in both women (r = 0.97, standard error of the estimate = 1.79) and men (r = 0.98, standard error of the estimate = 0.96). The ultrasound technique had a low total error in both sexes (0.93). The 95% limit of agreement was -0.06 ± 1.2 for all athletes and did not show an overprediction or underprediction bias. We developed a new model to produce body fat estimates from ultrasound and anthropometric dimensions. Conclusions: The limits of agreement of the ultrasound technique compared with DEXA measurements were very good. Consequently, a portable ultrasound device produces accurate body fat and body fat percentage estimates relative to the fan-beam DEXA technique.
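The limits-of-agreement analysis reported above can be sketched in a few lines; the paired measurements below are simulated placeholders, consistent only in spirit with the reported summary statistics:

```python
import numpy as np

def limits_of_agreement(ultrasound, dexa):
    """Bland-Altman 95% limits of agreement between two body-fat measures."""
    diff = ultrasound - dexa
    bias = diff.mean()                 # mean difference (over/underprediction)
    loa = 1.96 * diff.std(ddof=1)      # half-width of the 95% limits
    return bias, bias - loa, bias + loa

# Simulated paired measurements for 93 athletes (placeholders)
rng = np.random.default_rng(5)
dexa = np.clip(9.4 + 8.1 * rng.standard_normal(93), 3, 40)
ultrasound = dexa + rng.normal(-0.06, 0.6, 93)   # illustrative agreement
print(limits_of_agreement(ultrasound, dexa))
```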

