Use of advanced statistical learning methods and principal component analysis in quantitative structure–genotoxicity relationship study of amines
This paper highlights the use of advanced nonlinear modeling and subset selection techniques to build a predictive model for the genotoxicity of amines. The essential requirements for a reliable model were each considered carefully. The chemicals were represented by a large number of CODESSA descriptors. The whole sample was divided into a training set and a test set by principal component analysis (PCA). Six descriptors selected by the best multi-linear regression (BMLR) method in the CODESSA program were used as inputs to nonlinear models built with advanced statistical learning methods such as support vector machine (SVM) and projection pursuit regression (PPR). The models were validated in three ways: internal cross-validation (CV), a test set, and an independent validation set. The analysis shows that the nonlinear models produced better results than the linear models and that the PPR model outperformed the rest, in the following order: PPR > SVM > linear SVM ≥ BMLR. In addition, the relationships between the descriptors and the mutagenic behavior of the compounds are discussed.
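The PCA-based division into training and test sets can be illustrated with a minimal sketch. The abstract does not state the exact selection rule, so the rule used here (autoscale the descriptors, order compounds by their score on the first principal component, and assign every k-th compound to the test set) is an assumption chosen only to show the idea of a test set that spans the same descriptor space as the training set. The function names `pca_scores` and `pca_split`, the parameter `test_every`, and the synthetic matrix `X_demo` are all hypothetical and are not taken from the paper or from CODESSA.

```python
import numpy as np


def pca_scores(X, n_components=2):
    """Project autoscaled descriptors onto the leading principal components."""
    X = np.asarray(X, dtype=float)
    std = X.std(axis=0, ddof=1)
    std[std == 0.0] = 1.0  # guard against constant descriptors
    Z = (X - X.mean(axis=0)) / std
    # Rows of Vt are the principal axes of the autoscaled matrix.
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return Z @ Vt[:n_components].T


def pca_split(X, test_every=4):
    """Assumed rule (not from the paper): order compounds by PC1 score
    and send every `test_every`-th compound to the test set."""
    order = np.argsort(pca_scores(X, n_components=1)[:, 0], kind="stable")
    test_mask = np.zeros(len(order), dtype=bool)
    test_mask[order[test_every // 2::test_every]] = True
    return np.flatnonzero(~test_mask), np.flatnonzero(test_mask)


# Synthetic stand-in for a compounds x descriptors matrix (not data from the paper).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 6))
train_idx, test_idx = pca_split(X_demo, test_every=4)
```

Picking test compounds at regular intervals along the first component keeps both sets spread over the same range of that component. A split based on several components, or on clusters in the score plot, would be an equally plausible reading of the abstract.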