A Comparison of Techniques for Modelling Data with Non-Linear Structure

2003, Vol 11 (1), pp. 55-70
Author(s): Laila Stordrange, Olav M. Kvalheim, Per A. Hassel, Dick Malthe-Sørenssen, Fred Olav Libnau

Partial least squares (PLS) is a powerful tool for multivariate linear regression. But what if the data show a non-linear structure? Near infrared spectra from a pharmaceutical process were used as a case study. An ANOVA test revealed that the data are well described by a second-order polynomial. This work investigates regression techniques that account for slightly non-linear data: linearising the data by applying transformations, local PLS (i.e. splitting the data into subsets) and quadratic PLS. These models were compared with ordinary PLS and principal component regression (PCR). The predictive ability of the models was tested on an independent data set acquired a year later. Using knowledge of the non-linear pattern and of the important spectral regions, simpler models with better predictive ability can be obtained.
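
The linearisation idea above can be sketched in a few lines: augmenting a linear least-squares model with a squared term lets a linear fitting framework absorb mild second-order curvature. The data below are a synthetic stand-in for a single spectral feature, not the paper's NIR spectra, and plain least squares stands in for PLS.

```python
import numpy as np

# Illustrative sketch only: a single synthetic variable x stands in
# for a spectral feature with a mildly non-linear response.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
y = 2.0 + 1.5 * x + 4.0 * x**2 + rng.normal(0, 0.05, x.size)

# Ordinary linear fit: design matrix [1, x]
X_lin = np.column_stack([np.ones_like(x), x])
b_lin, *_ = np.linalg.lstsq(X_lin, y, rcond=None)

# "Quadratic" fit: augment the design matrix with x**2 and fit linearly,
# mirroring the idea of handling slight non-linearity inside a linear framework
X_quad = np.column_stack([np.ones_like(x), x, x**2])
b_quad, *_ = np.linalg.lstsq(X_quad, y, rcond=None)

rmse = lambda X, b: np.sqrt(np.mean((y - X @ b) ** 2))
print(rmse(X_lin, b_lin), rmse(X_quad, b_quad))  # quadratic fit is markedly better
```

The same augmentation trick carries over to multivariate data by adding squared (and cross-product) columns before the latent-variable regression.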

2005, Vol 13 (5), pp. 241-254
Author(s): Ralf Marbach

A new method for multivariate calibration is described that combines the best features of “classical” (also called “physical” or “K-matrix”) calibration and “inverse” (or “statistical” or “P-matrix”) calibration. By estimating the spectral signal in the physical way and the spectral noise in the statistical way, so to speak, the prediction accuracy of the inverse model can be combined with the low cost and ease of interpretability of the classical model, including “built-in” proof of specificity of response. The cost of calibration is significantly reduced compared to today's standard practice of statistical calibration using partial least squares or principal component regression, because the need for lab-reference values is virtually eliminated. The method is demonstrated on a data set of near-infrared spectra from pharmaceutical tablets, which is available on the web (so-called Chambersburg Shoot-out 2002 data set). Another benefit is that the correct definitions of the “limits of multivariate detection” become obvious. The sensitivity of multivariate measurements is shown to be limited by the so-called “spectral noise,” and the specificity is shown to be limited by potentially existing “unspecific correlations.” Both limits are testable from first principles, i.e. from measurable pieces of data and without the need to perform any calibration.
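
The two calibration families contrasted above can be sketched on toy data: the classical (K-matrix) model estimates the spectral signal from known concentrations (A = CK), while the inverse (P-matrix) model estimates a regression vector directly from spectra (c = Ab). All data, dimensions and names below are invented for illustration; this is not Marbach's combined estimator.

```python
import numpy as np

# Toy comparison of classical vs. inverse calibration on simulated spectra.
rng = np.random.default_rng(1)
n_samples, n_channels = 40, 120
K_true = rng.normal(size=(1, n_channels))          # pure-component "spectrum"
C = rng.uniform(0.5, 2.0, size=(n_samples, 1))     # reference concentrations
A = C @ K_true + rng.normal(0, 0.01, size=(n_samples, n_channels))

# Classical (K-matrix): estimate the spectral signal physically, A = C K
K_hat, *_ = np.linalg.lstsq(C, A, rcond=None)

# Inverse (P-matrix): estimate a regression vector statistically, c = A b
b_hat, *_ = np.linalg.lstsq(A, C, rcond=None)

# Predict an unseen spectrum with both models
c_new = 1.3
a_new = c_new * K_true[0] + rng.normal(0, 0.01, n_channels)
c_classical = np.linalg.lstsq(K_hat.T, a_new, rcond=None)[0].item()
c_inverse = (a_new @ b_hat).item()
print(c_classical, c_inverse)  # both close to the true value 1.3
```

Note the asymmetry: the classical route yields an interpretable estimated spectrum K_hat, while the inverse route yields only a regression vector.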


1993, Vol 1 (2), pp. 85-97
Author(s): Tomas Isaksson, Ziyi Wang, Bruce Kowalski

A recently presented calibration method, called optimised scaling (OS-2), was tested and compared to multiplicative scatter correction (MSC) and principal component regression (PCR). The predictive ability of these regression methods was tested on eight data sets consisting of diffuse near infrared (NIR) reflectance and transmittance continuous spectra of meat, sausages, soya beans and designed sample sets. Calibration was performed for constituents such as fat, protein, water, carbohydrate, temperature, lactate and glucose. A total of 21 calibration models were validated and compared. OS-2 gave good or promising prediction results for major constituents with large variation, such as fat in two of the studied meat sample sets. For minor constituents, OS-2 gave poorer prediction results than MSC or first derivatives of the data combined with PCR.
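
MSC, one of the reference methods above, is simple enough to sketch: each spectrum is regressed on the mean spectrum and then corrected for the fitted offset and slope. The spectra below are synthetic; OS-2 itself is not reproduced here.

```python
import numpy as np

# Minimal multiplicative scatter correction (MSC) sketch: regress each
# spectrum on the mean spectrum (x_i ~ a_i + b_i * m), then correct it
# as (x_i - a_i) / b_i to remove additive and multiplicative scatter.
def msc(spectra):
    m = spectra.mean(axis=0)            # reference (mean) spectrum
    corrected = np.empty_like(spectra)
    for i, x in enumerate(spectra):
        b, a = np.polyfit(m, x, 1)      # slope and offset vs. reference
        corrected[i] = (x - a) / b
    return corrected

# Synthetic demo: one underlying spectrum distorted by offsets and scalings
base = np.sin(np.linspace(0, 3, 100)) + 2.0
raw = np.array([a + b * base for a, b in [(0.1, 1.2), (-0.2, 0.8), (0.05, 1.0)]])
clean = msc(raw)
print(np.std(clean, axis=0).max())  # near zero: scatter effects removed
```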


2020, Vol 16 (8), pp. 1088-1105
Author(s): Nafiseh Vahedi, Majid Mohammadhosseini, Mehdi Nekoei

Background: The poly(ADP-ribose) polymerases (PARPs) are a nuclear enzyme superfamily present in eukaryotes.
Methods: In the present report, several efficient linear and non-linear methods, including multiple linear regression (MLR), support vector machines (SVM) and artificial neural networks (ANN), were successfully used to develop and establish quantitative structure-activity relationship (QSAR) models capable of predicting pEC50 values of tetrahydropyridopyridazinone derivatives as effective PARP inhibitors. Principal component analysis (PCA) was used for a rational division of the whole data set into training and test sets. A genetic algorithm (GA) variable selection method was employed to select, from the large pool of calculated descriptors, the optimal subset of descriptors with the most significant contributions to the overall inhibitory activity.
Results: The accuracy and predictability of the proposed models were further confirmed using cross-validation, validation through an external test set and Y-randomization (chance correlation) approaches. Moreover, an exhaustive statistical comparison was performed on the outputs of the proposed models. The results revealed that the non-linear modelling approaches, SVM and ANN, provided much better predictive capability.
Conclusion: Among the constructed models, in terms of root mean square error of prediction (RMSEP), cross-validation coefficients (Q²LOO and Q²LGO), and the R² and F-statistic values for the training set, the predictive power of the GA-SVM approach was better. However, compared with MLR and SVM, the statistical parameters for the test set were better with the GA-ANN model.
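
The Y-randomization check mentioned in the results can be sketched compactly: refitting the model on randomly shuffled activities should destroy the fit, confirming the original model is not a chance correlation. The descriptors and response below are synthetic, and a plain OLS fit stands in for the GA-selected QSAR models.

```python
import numpy as np

# Y-randomization sketch: a genuine structure-activity relationship should
# collapse to R-squared near chance level once the activities are shuffled.
rng = np.random.default_rng(3)
n, p = 60, 5
X = rng.normal(size=(n, p))                 # synthetic molecular descriptors
w = rng.normal(size=p)
y = X @ w + rng.normal(0, 0.1, n)           # pEC50-like response

def r2_ols(X, y):
    Xc = np.column_stack([np.ones(len(y)), X])   # add intercept column
    b, *_ = np.linalg.lstsq(Xc, y, rcond=None)
    resid = y - Xc @ b
    return 1 - resid.var() / y.var()

r2_true = r2_ols(X, y)
r2_rand = np.mean([r2_ols(X, rng.permutation(y)) for _ in range(50)])
print(r2_true, r2_rand)  # high for the real response, near chance when shuffled
```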


Author(s): Qiang Zhao, Jianguo Sun

Statistical analysis of microarray gene expression data has recently attracted a great deal of attention. One problem of interest is to relate genes to survival outcomes of patients, with the purpose of building regression models for predicting future patients' survival from their gene expression data. For this, several authors have discussed the use of the proportional hazards or Cox model after reducing the dimension of the gene expression data. This paper presents a new approach to the Cox survival analysis of microarray gene expression data, with a focus on the models' predictive ability. The method modifies correlation principal component regression (Sun, 1995) to handle the censoring problem of survival data. The results, based on simulated data and a set of publicly available data on diffuse large B-cell lymphoma, show that the proposed method works well in terms of robustness and predictive ability in comparison with some existing partial least squares approaches. The new approach is also simpler and easier to implement.
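
The dimension-reduction idea behind this line of work can be sketched on synthetic data: project the gene-expression matrix onto its leading principal component and fit a Cox model on the resulting score. The toy version below uses a grid-search fit of a one-covariate partial likelihood with no censoring; it is a stand-in for, not an implementation of, the modified correlation PCR method.

```python
import numpy as np

# Toy PCA + Cox sketch: one latent risk score drives both expression and survival.
rng = np.random.default_rng(4)
n, genes = 80, 200
risk_dir = rng.normal(size=genes)
scores_true = rng.normal(size=n)
X = np.outer(scores_true, risk_dir) + rng.normal(0, 0.5, (n, genes))

# PCA via SVD: first principal-component score per patient
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
z = Xc @ Vt[0]

# Simulated survival times depend on the latent score; no censoring for brevity
t = rng.exponential(1.0 / np.exp(0.8 * scores_true))
order = np.argsort(t)
z_ord = z[order]

def neg_log_partial_lik(beta):
    # risk set for the i-th ordered event is everyone still alive: i, i+1, ...
    eta = beta * z_ord
    log_risk = np.log(np.cumsum(np.exp(eta)[::-1])[::-1])
    return -np.sum(eta - log_risk)

betas = np.linspace(-3, 3, 301)
beta_hat = betas[np.argmin([neg_log_partial_lik(b) for b in betas])]
print(beta_hat)  # fitted coefficient on the first-PC score
```

The real method additionally weights components by their correlation with the outcome and handles censored observations, which this sketch omits.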


1992, Vol 46 (11), pp. 1685-1694
Author(s): Tomas Isaksson, Charles E. Miller, Tormod Næs

In this work, the abilities of near-infrared diffuse reflectance (NIR) and transmittance (NIT) spectroscopy to noninvasively determine the protein, fat and water contents of plastic-wrapped homogenized meat are evaluated. One hundred homogenized beef samples, ranging from 1 to 23% fat and wrapped in polyamide/polyethylene laminates, were used. Results of multivariate calibration and prediction for protein, fat and water contents are presented. The optimal test-set prediction errors (root mean square error of prediction, RMSEP), obtained using the principal component regression method with NIR data, were 0.45, 0.29 and 0.50 wt % for protein, fat and water, respectively, for plastic-wrapped meat (compared to 0.40, 0.28 and 0.45 wt % for unwrapped meat). The optimal prediction errors for the NIT method were 0.31, 0.52 and 0.42 wt % for protein, fat and water, respectively, for plastic-wrapped meat samples (compared to 0.27, 0.38 and 0.37 wt % for unwrapped meat). We conclude that the addition of the laminate only slightly reduced the abilities of the NIR and NIT methods to predict protein, fat and water contents in homogenized meat.
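
The RMSEP criterion quoted above is simply the square root of the mean squared difference between predicted and reference values on the test set; a minimal sketch with made-up fat-content values:

```python
import math

# Root mean square error of prediction (RMSEP) on a test set.
def rmsep(predicted, reference):
    n = len(reference)
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(predicted, reference)) / n)

pred = [20.1, 19.7, 21.0, 18.9]   # predicted fat content, wt % (illustrative)
ref  = [20.0, 20.0, 20.5, 19.0]   # reference (lab) values, wt %
print(round(rmsep(pred, ref), 3))  # prints 0.3
```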


Author(s): Abhishek Taneja

The enormous production of databases in almost every area of human endeavour, particularly through the web, has created a great demand for new, powerful tools for turning data into useful, task-oriented knowledge. The aim of this study is to examine the predictive ability of factor analysis, a web mining technique, so as to avoid voting, averaging, stacked generalization and meta-learning, and thus save much of the time spent choosing the right technique for the underlying dataset. This chapter compares three factor-based techniques, viz. principal component regression (PCR), generalized least squares (GLS) regression and maximum likelihood regression (MLR), and explores their predictive ability on a theoretical as well as an experimental basis. All three factor-based techniques have been compared using the necessary conditions for forecasting, such as R-square, adjusted R-square, the F-test and the JB (Jarque-Bera) test of normality. This study can be further explored and enhanced using sufficient conditions for forecasting, such as Theil's inequality coefficient (TIC) and the Janus quotient (JQ).
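
The forecasting checks listed above (R-square, adjusted R-square, F-test, Jarque-Bera) can be sketched for a simple one-predictor OLS fit; the data below are illustrative, not from the chapter:

```python
import math

# Forecasting diagnostics for a one-variable OLS fit on toy data.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 8.0, 9.8, 12.1, 14.2, 15.9]
n, k = len(x), 1                              # observations, predictors

xbar, ybar = sum(x) / n, sum(y) / n
b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
      / sum((xi - xbar) ** 2 for xi in x))
b0 = ybar - b1 * xbar
resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

ss_res = sum(e ** 2 for e in resid)
ss_tot = sum((yi - ybar) ** 2 for yi in y)
r2 = 1 - ss_res / ss_tot
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
f_stat = (r2 / k) / ((1 - r2) / (n - k - 1))

# Jarque-Bera normality test on the residuals: JB = n/6 * (S^2 + (K-3)^2 / 4)
# (residual mean is zero by construction, since an intercept is included)
m2 = sum(e ** 2 for e in resid) / n
skew = (sum(e ** 3 for e in resid) / n) / m2 ** 1.5
kurt = (sum(e ** 4 for e in resid) / n) / m2 ** 2
jb = n / 6 * (skew ** 2 + (kurt - 3) ** 2 / 4)
print(r2, adj_r2, f_stat, jb)
```

A small JB value (roughly below the chi-squared critical value with 2 degrees of freedom) is consistent with normally distributed residuals, one of the stated necessary conditions.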

