Interpolation and extrapolation problems of multivariate regression in analytical chemistry: benchmarking the robustness on near-infrared (NIR) spectroscopy data

The Analyst ◽  
2012 ◽  
Vol 137 (7) ◽  
pp. 1604 ◽  
Author(s):  
Roman M. Balabin ◽  
Sergey V. Smirnov
Fuel ◽  
2008 ◽  
Vol 87 (12) ◽  
pp. 2745-2752 ◽  
Author(s):  
Roman M. Balabin ◽  
Ravilya Z. Safieva

2020 ◽  
Vol 23 (8) ◽  
pp. 740-756
Author(s):  
Naifei Zhao ◽  
Qingsong Xu ◽  
Man-lai Tang ◽  
Hong Wang

Aim and Objective: Near-infrared (NIR) spectroscopy data are characterized by a few dozen to many thousands of samples and by highly correlated variables. Quantitative analysis of such data usually requires combining analytical methods with variable selection or screening methods. Commonly used variable screening methods fail to recover the true model when (i) some of the variables are highly correlated, and (ii) the sample size is smaller than the number of relevant variables. In these cases, approaches based on Partial Least Squares (PLS) regression can be useful alternatives. Materials and Methods: In this research, a fast variable screening strategy, namely the preconditioned screening for ridge partial least squares regression (PSRPLS), is proposed for modelling NIR spectroscopy data with high-dimensional and highly correlated covariates. Under rather mild assumptions, we prove that, by using the Puffer transformation, the proposed approach successfully transforms the problem of variable screening with highly correlated predictor variables into one with weakly correlated covariates, at little extra computational cost. Results: We show that our proposed method leads to theoretically consistent model selection results. Four simulation studies and two real examples are then analyzed to illustrate the effectiveness of the proposed approach. Conclusion: By introducing the Puffer transformation, the high-correlation problem can be mitigated using the PSRPLS procedure we construct. Employing RPLS regression makes the approach simpler and more computationally efficient when the model size is larger than the sample size, while maintaining high prediction precision.


2017 ◽  
Vol 71 (10) ◽  
pp. 2253-2262 ◽  
Author(s):  
Mithilesh Prakash ◽  
Jaakko K. Sarin ◽  
Lassi Rieppo ◽  
Isaac O. Afara ◽  
Juha Töyräs

Near-infrared (NIR) spectroscopy has been successful in nondestructive assessment of biological tissue properties, such as stiffness of articular cartilage, and has been proposed for use in clinical arthroscopies. Near-infrared spectroscopic data include absorbance values from a broad wavelength region, resulting in a large number of contributing factors. This broad spectrum includes information from potentially noisy variables, which may contribute to errors during regression analysis. We hypothesized that partial least squares regression (PLSR) is an optimal multivariate regression technique and requires application of variable selection methods to further improve the performance of NIR spectroscopy-based prediction of cartilage tissue properties, including instantaneous, equilibrium, and dynamic moduli and cartilage thickness. To test this hypothesis, we conducted for the first time a comparative analysis of multivariate regression techniques, which included principal component regression (PCR), PLSR, ridge regression, least absolute shrinkage and selection operator (Lasso), and least squares version of support vector machines (LS-SVM), on NIR spectral data of equine articular cartilage. Additionally, we evaluated the effect of variable selection methods, including Monte Carlo uninformative variable elimination (MC-UVE), competitive adaptive reweighted sampling (CARS), variable combination population analysis (VCPA), backward interval PLS (BiPLS), genetic algorithm (GA), and jackknife, on the performance of the optimal regression technique. The PLSR technique was found to be the optimal regression tool (R2 for tissue thickness = 75.6%, R2 for dynamic modulus = 64.9%) for cartilage NIR data; variable selection methods simplified the prediction models, enabling the use of fewer regression components. However, the improvements in model performance with variable selection methods were not statistically significant. Thus, the PLSR technique is recommended as the regression tool for multivariate analysis for prediction of articular cartilage properties from its NIR spectra.


2020 ◽  
pp. 1-12
Author(s):  
Lingyun Peng ◽  
Hao Cheng ◽  
Liang-Jie Wang ◽  
Dianzhen Zhu

Soil organic matter and soil particle composition play extremely important roles in soil fertility, environmental protection, and sustainable agricultural development. Visible–near-infrared reflectance (Vis–NIR) spectroscopy is a rapid, effective, and low-cost analytical method to predict soil properties. In this study, laboratory Vis–NIR spectroscopy data were used to compare partial least squares regression (PLSR), artificial neural network (ANN), and multivariate adaptive regression splines (MARSplines) models built on fuzzy c-means spectral clustering and expert knowledge classification methods for soil prediction. The results showed that (1) the sand content (R2 = 0.69–0.77) was predicted best, followed by the silt (R2 = 0.56–0.71) and organic matter (R2 = 0.54–0.69) contents, whereas the clay content (R2 = 0.29–0.65) was predicted worst, (2) the performance of the models followed the order PLSR > ANN > MARSplines, and (3) the accuracies for the organic matter and sand contents were higher when applying expert knowledge classification, whereas the prediction of the clay and silt contents was better when applying spectral clustering. However, the overall accuracy of the spectral clustering method was slightly better than that of expert classification. Our findings showed that the spectral cluster-based models produced effective and interpretable prediction results for estimating soil properties. Therefore, this approach should be considered when dealing with large and heterogeneous soil samples.


2019 ◽  
Vol 2019 ◽  
pp. 1-12 ◽  
Author(s):  
Guang-Hui Fu ◽  
Min-Jie Zong ◽  
Feng-Hua Wang ◽  
Lun-Zhao Yi

Elastic net (Enet) and sparse partial least squares (SPLS) are frequently employed for wavelength selection and model calibration in the analysis of near-infrared spectroscopy data. Enet and SPLS can perform variable selection and model calibration simultaneously. They also tend to select wavelength intervals rather than individual wavelengths when the predictors are multicollinear. In this paper, we focus on comparing Enet and SPLS in interval wavelength selection and model calibration for near-infrared spectroscopy data. The results from both simulated and real spectroscopy data show that Enet tends to select fewer predictors as key variables than SPLS; thus it yields a more parsimonious model, which is an advantage for model interpretation. SPLS can obtain a much lower mean squared error of prediction (MSE) than Enet, so SPLS is more suitable when the aim is better model fitting accuracy. These conclusions also hold for strongly correlated NIR spectroscopy data whose predictors present group structures: Enet is sparser than SPLS, and the selected predictors (wavelengths) form contiguous segments.
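The sparsity attributed to Enet above comes from the L1 part of its penalty, which can set coefficients exactly to zero. The following is a minimal sketch of elastic net fitted by cyclic coordinate descent, not the implementation used in the paper. The function names, the parameters `lam` and `alpha`, and the objective scaling (1/(2n) times the squared error, plus `lam * (alpha * L1 + (1 - alpha) / 2 * L2)`) are assumptions made for illustration. It also assumes centered, non-constant predictor columns and no intercept.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0.0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def elastic_net(X, y, lam, alpha, n_iter=200):
    """Cyclic coordinate descent for an elastic net without intercept.

    X is a list of rows (samples), y a list of responses.
    Assumes centered columns with nonzero variance (illustrative setup).
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # y - X @ beta, with beta starting at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of column j with the partial residual
            rho = sum(
                X[i][j] * (residual[i] + X[i][j] * beta[j]) for i in range(n)
            ) / n
            new = soft_threshold(rho, lam * alpha) / (col_sq[j] + lam * (1 - alpha))
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                beta[j] = new
    return beta


# Toy data: the response depends only on the first column.
X_demo = [[1.0, 0.1], [-1.0, 0.2], [2.0, -0.3], [-2.0, 0.0]]
y_demo = [2.0, -2.0, 4.0, -4.0]
beta_demo = elastic_net(X_demo, y_demo, lam=0.1, alpha=1.0)
```

In this toy setup the second coefficient is driven exactly to zero, while the first is shrunk slightly below its least-squares value. That is the kind of parsimony the abstract credits to Enet.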


Fuel ◽  
2008 ◽  
Vol 87 (7) ◽  
pp. 1096-1101 ◽  
Author(s):  
Roman M. Balabin ◽  
Ravilya Z. Safieva

2014 ◽  
Vol 2014 ◽  
pp. 1-6 ◽  
Author(s):  
Xin-fang Xu ◽  
Li-xing Nie ◽  
Li-li Pan ◽  
Bian Hao ◽  
Shao-xiong Yuan ◽  
...  

Near-infrared spectroscopy (NIRS), a rapid and efficient tool, was used to determine the total amount of nine ginsenosides in Panax ginseng. In the study, the regression models were established using multivariate regression methods, with the results from conventional chemical analytical methods as reference values. Two multivariate regression methods, partial least squares regression (PLSR) and principal component regression (PCR), were compared, and PLSR proved more suitable. Multiplicative scatter correction (MSC), second derivative, and Savitzky-Golay smoothing were used together for spectral preprocessing. The final model was evaluated using the correlation coefficient (R2) and the root mean square error of prediction (RMSEP). For the final optimal PLSR model, the RMSEP and R2 in the calibration set were 0.159 and 0.963, respectively. The results demonstrated that NIRS, as a new method, can be applied to the quality control of Ginseng Radix et Rhizoma.
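The preprocessing and evaluation steps named above can be sketched as follows. This is a minimal illustration, not the authors' code: it implements multiplicative scatter correction against the mean spectrum and computes RMSEP and R2 from reference and predicted values. The function names are assumptions, and R2 is computed here as the coefficient of determination, which may differ from the definition used in the paper. The second-derivative and Savitzky-Golay steps are omitted.

```python
from math import sqrt


def msc(spectra):
    """Multiplicative scatter correction using the mean spectrum as reference.

    Each spectrum is regressed on the reference (spectrum = a + b * reference)
    and corrected as (spectrum - a) / b. Assumes every fitted slope b is nonzero.
    """
    n, p = len(spectra), len(spectra[0])
    ref = [sum(s[j] for s in spectra) / n for j in range(p)]
    ref_mean = sum(ref) / p
    ref_ss = sum((r - ref_mean) ** 2 for r in ref)
    corrected = []
    for s in spectra:
        s_mean = sum(s) / p
        slope = sum((r - ref_mean) * (v - s_mean) for r, v in zip(ref, s)) / ref_ss
        intercept = s_mean - slope * ref_mean
        corrected.append([(v - intercept) / slope for v in s])
    return corrected


def rmsep(y_true, y_pred):
    """Root mean square error of prediction."""
    return sqrt(sum((t - q) ** 2 for t, q in zip(y_true, y_pred)) / len(y_true))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - q) ** 2 for t, q in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy spectra: the same underlying shape with different offsets and scalings.
shape = [1.0, 2.0, 4.0, 3.0]
raw = [
    [0.5 + 1.0 * v for v in shape],
    [-0.2 + 2.0 * v for v in shape],
    [0.1 + 0.5 * v for v in shape],
]
corrected = msc(raw)
```

Because the toy spectra differ only by an additive offset and a multiplicative factor, MSC maps them all onto the same corrected spectrum. Real NIR spectra would retain the chemical differences that the regression model then uses.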

