Shape-constrained Symbolic Regression – Improving Extrapolation with Prior Knowledge

Evolutionary Computation ◽

10.1162/evco_a_00294 ◽

2021 ◽

pp. 1-24

Author(s):

G. Kronberger ◽

F. O. de Franca ◽

B. Burlacu ◽

C. Haider ◽

M. Kommenda

Keyword(s):

Evolutionary Algorithms ◽

Prior Knowledge ◽

Polynomial Regression ◽

Predictive Accuracy ◽

Symbolic Regression ◽

Training Set ◽

Test Set ◽

Regression Algorithms ◽

Selection Step ◽

Regression Problems

Abstract We investigate the addition of constraints on the function image and its derivatives for the incorporation of prior knowledge in symbolic regression. The approach is called shape-constrained symbolic regression and allows us to enforce, e.g., monotonicity of the function over selected inputs. The aim is to find models which conform to expected behaviour and which have improved extrapolation capabilities. We demonstrate the feasibility of the idea and propose and compare two evolutionary algorithms for shape-constrained symbolic regression: i) an extension of tree-based genetic programming which discards infeasible solutions in the selection step, and ii) a two-population evolutionary algorithm that separates the feasible from the infeasible solutions. In both algorithms we use interval arithmetic to approximate bounds for models and their partial derivatives. The algorithms are tested on a set of 19 synthetic and four real-world regression problems. Both algorithms are able to identify models which conform to shape constraints, which is not the case for the unmodified symbolic regression algorithms. However, the predictive accuracy of models with constraints is worse on the training set and the test set. Shape-constrained polynomial regression produces the best results for the test set but also significantly larger models.
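The interval-arithmetic check mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Interval` class, the example model f(x) = x^3 + 2x, and the function names are all assumptions made for illustration. The sketch bounds the derivative f'(x) = 3x^2 + 2 over an input domain and treats the model as feasible for a monotonicity constraint when the lower bound is non-negative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Closed interval [lo, hi]; hypothetical helper, not from the paper."""
    lo: float
    hi: float

    def __add__(self, other):
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __mul__(self, other):
        # The product range is spanned by the four endpoint products.
        products = [self.lo * other.lo, self.lo * other.hi,
                    self.hi * other.lo, self.hi * other.hi]
        return Interval(min(products), max(products))


def const(c):
    """Wrap a constant as a degenerate interval."""
    return Interval(c, c)


def model_derivative_bounds(x):
    """Bounds of f'(x) = 3*x*x + 2 for the assumed model f(x) = x**3 + 2*x."""
    return const(3.0) * x * x + const(2.0)


def is_monotonic_increasing(x_domain):
    """Feasible if the lower bound of f'(x) over the domain is >= 0."""
    return model_derivative_bounds(x_domain).lo >= 0.0


print(model_derivative_bounds(Interval(0.0, 2.0)))   # bounds on [0, 2]
print(is_monotonic_increasing(Interval(-1.0, 2.0)))  # pessimistic on [-1, 2]
```

On the domain [-1, 2] the check fails even though 3x^2 + 2 is always positive, because naive interval multiplication treats the two occurrences of x as independent. The bounds are therefore conservative approximations: a constraint check of this kind may reject some feasible models but does not accept a model whose true derivative violates the bound.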


The performance of polyploid evolutionary algorithms is improved both by having many chromosomes and by having many copies of each chromosome on symbolic regression problems

2005 IEEE Congress on Evolutionary Computation ◽

10.1109/cec.2005.1554783 ◽

2005 ◽

Cited By ~ 1

Author(s):

R. Cavill ◽

S. Smith ◽

A. Tyrrell

Keyword(s):

Evolutionary Algorithms ◽

Symbolic Regression ◽

Regression Problems


Data Prognostics Using Symbolic Regression

10.31224/osf.io/fq8ze ◽

2019 ◽

Author(s):

Van Hunter Adams

Keyword(s):

Symbolic Regression ◽

Genetic Program ◽

Data Repository ◽

Training Set ◽

General Technique ◽

Test Set ◽

Turbofan Engine ◽

The Moment

This paper describes a general technique for data prognostics using symbolic regression. This analysis treats the characterization of turbofan engine degradation as a particular application for the general technique. The proposed genetic program (GP) characterizes engine degradation, and then uses that characterization to both detect engine faults and predict the remaining lifetimes of engines after a fault. The genetic program exploits the fact that engine degradation manifests itself as changing correlations between sensor outputs. The NASA Prognostics Data Repository provides a training set in which 100 simulated engines are run to failure, and a test set in which a separate set of 100 simulated engines are shut off before they fail. The GP uses the training fleet of engines to identify the sensor relationships that indicate engine fault and predict remaining lifetime, and then observes the learned sensor relationships in the test fleet. The genetic program successfully detects the moment that the fault occurs for every engine in the test fleet and accurately predicts the remaining lifetime of the engines after the fault.


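Predicting the Water Quality of the Ciliwung River Using the Decision Tree Algorithm (original title: PREDIKSI KUALITAS AIR SUNGAI CILIWUNG DENGAN MENGGUNAKAN ALGORITMA POHON KEPUTUSAN)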

Jurnal Air Indonesia ◽

10.29122/jai.v12i2.4364 ◽

2021 ◽

Vol 12 (2) ◽

Author(s):

Mohammad Haekal ◽

Henki Bayu Seta ◽

Mayanda Mega Santoni

Keyword(s):

Data Mining ◽

Decision Tree ◽

Cross Validation ◽

Online Monitoring ◽

Training Set ◽

Microsoft Excel ◽

Test Set

To predict the water quality of the Ciliwung River, data obtained from Online Monitoring were processed using a Data Mining method. In this method, the monitoring data were first tabulated in Microsoft Excel and then processed into a decision tree with the Decision Tree algorithm in the WEKA application. The Decision Tree method was chosen because it is simpler, easy to understand, and has a very high level of accuracy. A total of 5,476 Ciliwung River water quality monitoring records were processed. In the Decision Tree classification of these 5,476 records, 1,059 records (19.3242%) indicated that the Ciliwung River was Not Polluted and 4,417 records (80.6758%) indicated that it was Polluted. The monitoring data were then evaluated using four test options: Use Training Set, Supplied Test Set, Cross-Validation with 10 folds, and Percentage Split 66%. All four test options showed a very high level of accuracy, above 99%. From these results it can be predicted that the Ciliwung River is indicated as a polluted river with reference to Government Regulation of the Republic of Indonesia No. 82 of 2001. The results also show that using the WEKA application with the Decision Tree algorithm to process monitoring data on three parameters (pH, DO, and nitrate) is highly accurate and appropriate. Keywords: river water quality, Data Mining, Decision Tree algorithm, WEKA application.


QSPR modelling of the octanol/water partition coefficient of organometallic substances by optimal SMILES-based descriptors

Open Chemistry ◽

10.2478/s11532-009-0095-y ◽

2009 ◽

Vol 7 (4) ◽

pp. 846-856 ◽

Cited By ~ 6

Author(s):

Andrey Toropov ◽

Alla Toropova ◽

Emilio Benfenati

Keyword(s):

Partition Coefficient ◽

Organometallic Compounds ◽

Applicability Domain ◽

Training Set ◽

Input Line ◽

Test Set ◽

Water Partition Coefficient ◽

Definition Of

Abstract Usually, QSPR is not used to model organometallic compounds. We have modeled the octanol/water partition coefficient for organometallic compounds of Na, K, Ca, Cu, Fe, Zn, Ni, As, and Hg by optimal descriptors calculated with simplified molecular input line entry system (SMILES) notations. The best model is characterized by the following statistics: n = 54, r² = 0.9807, s = 0.677, F = 2636 (training set); n = 26, r² = 0.9693, s = 0.969, F = 759 (test set). Empirical criteria for the definition of the applicability domain for these models are discussed.


Feature-Weighted Sampling for Proper Evaluation of Classification Models

Applied Sciences ◽

10.3390/app11052039 ◽

2021 ◽

Vol 11 (5) ◽

pp. 2039

Author(s):

Hyunseok Shin ◽

Sejong Oh

Keyword(s):

Random Sampling ◽

Sampling Method ◽

Classification Model ◽

Training Set ◽

Test Set ◽

Feature Importance ◽

Proper Training ◽

Machine Learning Applications ◽

Test Sets ◽

The Given

In machine learning applications, classification schemes have been widely used for prediction tasks. Typically, to develop a prediction model, the given dataset is divided into training and test sets; the training set is used to build the model and the test set is used to evaluate the model. Furthermore, random sampling is traditionally used to divide datasets. The problem, however, is that the performance of the model is evaluated differently depending on how we divide the training and test sets. Therefore, in this study, we proposed an improved sampling method for the accurate evaluation of a classification model. We first generated numerous candidate cases of train/test sets using the R-value-based sampling method. We evaluated the similarity of distributions of the candidate cases with the whole dataset, and the case with the smallest distribution–difference was selected as the final train/test set. Histograms and feature importance were used to evaluate the similarity of distributions. The proposed method produces more proper training and test sets than previous sampling methods, including random and non-random sampling.
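The selection step described in this abstract (generate many candidate train/test splits, then keep the one whose distributions are closest to the whole dataset) can be sketched as follows. This is a simplified illustration under stated assumptions: candidates are produced by plain random shuffling rather than the R-value-based sampling the authors use, only a single numeric feature is compared, and feature importance is not considered. All function names are hypothetical.

```python
import random


def histogram(values, bins, lo, hi):
    """Normalised histogram of values over [lo, hi] with equal-width bins."""
    counts = [0] * bins
    width = (hi - lo) / bins or 1.0  # guard against a constant feature
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return [c / len(values) for c in counts]


def distribution_difference(subset, whole, bins=5):
    """Sum of absolute differences between the two normalised histograms."""
    lo, hi = min(whole), max(whole)
    h_sub = histogram(subset, bins, lo, hi)
    h_all = histogram(whole, bins, lo, hi)
    return sum(abs(a - b) for a, b in zip(h_sub, h_all))


def pick_best_split(values, test_fraction=0.25, candidates=50, seed=0):
    """Return (score, train, test) for the candidate closest to the whole set."""
    rng = random.Random(seed)
    n_test = max(1, int(len(values) * test_fraction))
    best = None
    for _ in range(candidates):
        shuffled = values[:]
        rng.shuffle(shuffled)
        test, train = shuffled[:n_test], shuffled[n_test:]
        score = (distribution_difference(train, values)
                 + distribution_difference(test, values))
        if best is None or score < best[0]:
            best = (score, train, test)
    return best


data = [float(i) for i in range(40)]
score, train, test = pick_best_split(data)
print(len(train), len(test), round(score, 3))
```

A multi-feature version would compute the difference per feature and combine the results, which is where a weighting such as feature importance would enter.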


A NEW LINEAR GENETIC PROGRAMMING APPROACH BASED ON STRAIGHT LINE PROGRAMS: SOME THEORETICAL AND EXPERIMENTAL ASPECTS

International Journal of Artificial Intelligence Tools ◽

10.1142/s0218213009000391 ◽

2009 ◽

Vol 18 (05) ◽

pp. 757-781 ◽

Cited By ~ 7

Author(s):

CÉSAR L. ALONSO ◽

JOSÉ LUIS MONTAÑA ◽

JORGE PUENTE ◽

CRUZ ENRIQUE BORGES

Keyword(s):

Data Structure ◽

Genetic Programming ◽

Computer Programs ◽

Symbolic Regression ◽

Programming Approach ◽

Linear Genetic Programming ◽

Straight Line ◽

Structured Representations ◽

Regression Problems ◽

Straight Line Programs

Tree encodings of programs are well known for their representative power and are used very often in Genetic Programming. In this paper we experiment with a new data structure, named straight line program (slp), to represent computer programs. The main features of this structure are described, new recombination operators for GP related to slp's are introduced and a study of the Vapnik-Chervonenkis dimension of families of slp's is done. Experiments have been performed on symbolic regression problems. Results are encouraging and suggest that the GP approach based on slp's consistently outperforms conventional GP based on tree structured representations.
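As a rough illustration of the data structure named in this abstract, a straight line program can be viewed as a finite sequence of instructions in which each instruction applies an operator to inputs, constants, or the results of earlier instructions, with the last result taken as the output. The tuple encoding and the names below are assumptions made for this sketch; the paper's own encoding and its recombination operators are not reproduced here.

```python
import operator

# Operator table for the sketch; the paper's function set may differ.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def evaluate_slp(program, inputs):
    """Evaluate a straight line program given as (op, left, right) triples.

    Each operand is a pair: ("x", i) for input i, ("u", j) for the result of
    an earlier instruction j, or ("c", value) for a constant.
    """
    results = []

    def value(ref):
        kind, key = ref
        if kind == "x":
            return inputs[key]
        if kind == "u":
            return results[key]
        return key  # constant operand

    for op, left, right in program:
        results.append(OPS[op](value(left), value(right)))
    return results[-1]


# x^4 + x^2 written as three instructions; u0 is computed once and reused.
program = [
    ("*", ("x", 0), ("x", 0)),  # u0 = x * x
    ("*", ("u", 0), ("u", 0)),  # u1 = u0 * u0
    ("+", ("u", 1), ("u", 0)),  # u2 = u1 + u0
]

print(evaluate_slp(program, [3.0]))  # 81 + 9 = 90.0
```

The reuse of `u0` in two later instructions is the point of contrast with a tree encoding, where the shared subexpression would have to be duplicated.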


Weakly supervised deep learning for determining the prognostic value of 18F-FDG PET/CT in extranodal natural killer/T cell lymphoma, nasal type

European Journal of Nuclear Medicine and Molecular Imaging ◽

10.1007/s00259-021-05232-3 ◽

2021 ◽

Author(s):

Rui Guo ◽

Xiaobin Hu ◽

Haoming Song ◽

Pengpeng Xu ◽

Haoping Xu ◽

...

Keyword(s):

Deep Learning ◽

Fdg Pet ◽

Cell Lymphoma ◽

Training Set ◽

Test Set ◽

Natural Killer T Cell ◽

Pet Ct ◽

Weakly Supervised ◽

Fdg Pet Ct ◽

Killer T Cell

Abstract Purpose: To develop a weakly supervised deep learning (WSDL) method that could utilize incomplete/missing survival data to predict the prognosis of extranodal natural killer/T cell lymphoma, nasal type (ENKTL) based on pretreatment 18F-FDG PET/CT results. Methods: One hundred and sixty-seven patients with ENKTL who underwent pretreatment 18F-FDG PET/CT were retrospectively collected. Eighty-four patients were followed up for at least 2 years (training set = 64, test set = 20). A WSDL method was developed to enable the integration of the remaining 83 patients with incomplete/missing follow-up information in the training set. To test generalization, these data were derived from three types of scanners. Prediction similarity index (PSI) was derived from deep learning features of images. Its discriminative ability was calculated and compared with that of a conventional deep learning (CDL) method. Univariate and multivariate analyses helped explore the significance of PSI and clinical features. Results: PSI achieved area under the curve scores of 0.9858 and 0.9946 (training set) and 0.8750 and 0.7344 (test set) in the prediction of progression-free survival (PFS) with the WSDL and CDL methods, respectively. PSI threshold of 1.0 could significantly differentiate the prognosis. In the test set, WSDL and CDL achieved prediction sensitivity, specificity, and accuracy of 87.50% and 62.50%, 83.33% and 83.33%, and 85.00% and 75.00%, respectively. Multivariate analysis confirmed PSI to be an independent significant predictor of PFS in both the methods. Conclusion: The WSDL-based framework was more effective for extracting 18F-FDG PET/CT features and predicting the prognosis of ENKTL than the CDL method.


Prediction of the Toxicity of Binary Mixtures by QSAR Approach Using the Hypothetical Descriptors

International Journal of Molecular Sciences ◽

10.3390/ijms19113423 ◽

2018 ◽

Vol 19 (11) ◽

pp. 3423 ◽

Cited By ~ 12

Author(s):

Ting Wang ◽

Lili Tang ◽

Feng Luan ◽

M. Natália D. S. Cordeiro

Keyword(s):

Correlation Coefficient ◽

Binary Mixtures ◽

Quantitative Structure Activity Relationship ◽

Training Set ◽

Statistical Parameters ◽

Test Set ◽

Qsar Models ◽

Forward Stepwise ◽

Leave One Out ◽

External Test

Organic compounds are often exposed to the environment, and have an adverse effect on the environment and human health in the form of mixtures, rather than as single chemicals. In this paper, we try to establish reliable classical quantitative structure–activity relationship (QSAR) models to evaluate the toxicity of 99 binary mixtures. The derived QSAR models were built by forward stepwise multiple linear regression (MLR) and nonlinear radial basis function neural networks (RBFNNs), respectively, using the hypothetical descriptors. The statistical parameters of the MLR model were N (number of compounds in training set) = 79, R² (the correlation coefficient between the predicted and observed activities) = 0.869, LOO q² (leave-one-out correlation coefficient) = 0.864, F (Fisher’s test) = 165.494, and RMS (root mean square) = 0.599 for the training set, and N_ext (number of compounds in external test set) = 20, R² = 0.853, q²_ext (leave-one-out correlation coefficient for test set) = 0.825, F = 30.861, and RMS = 0.691 for the external test set. The RBFNN model gave N = 79, R² = 0.925, LOO q² = 0.924, F = 950.686, and RMS = 0.447 for the training set, and N_ext = 20, R² = 0.896, q²_ext = 0.890, F = 155.424, and RMS = 0.547 for the external test set. Both the MLR and RBFNN models were evaluated by some statistical parameters and methods. The results confirm that the built models are acceptable, and can be used to predict the toxicity of the binary mixtures.


Identification of Multi-omics Biomarkers and Construction of the Novel Prognostic Model for Hepatocellular Carcinoma

10.21203/rs.3.rs-452644/v1 ◽

2021 ◽

Author(s):

Xiaokai Yan ◽

Chiying Xiao ◽

Kunyan Yue ◽

Min Chen ◽

Hang Zhou

Keyword(s):

Hepatocellular Carcinoma ◽

Survival Analysis ◽

Prognostic Model ◽

Prognostic Models ◽

Prognostic Indicators ◽

Omics Data ◽

Training Set ◽

Test Set ◽

Model Based ◽

Cox Analysis

Abstract Background: Changes in the genome play a crucial role in carcinogenesis, and many biomarkers can be used as effective prognostic indicators in diverse tumors. Although many studies have constructed predictive models for hepatocellular carcinoma (HCC) based on molecular signatures, their performance is unsatisfactory. To address this shortcoming, we aim to construct a novel and accurate prognostic model with multi-omics data to guide prognostic assessments of HCC. Methods: The TCGA training set was used to identify crucial biomarkers and construct single-omic prognostic models through difference analysis, univariate Cox, and LASSO/stepwise Cox analysis. Then the performance of the single-omic models was evaluated and validated through survival analysis, Harrell’s concordance index (C-index), and receiver operating characteristic (ROC) curves in the TCGA test set and external cohorts. In addition, a comprehensive model based on multi-omics data was constructed via multiple Cox analysis, and its performance was evaluated in the TCGA training set and TCGA test set. Results: We identified 16 key mRNAs, 20 key lncRNAs, 5 key miRNAs, 5 key CNV genes, and 7 key SNPs which were significantly associated with the prognosis of HCC, and constructed 5 single-omic models which showed relatively good performance in prognostic prediction, with C-index ranging from 0.63 to 0.75 in the TCGA training set and test set. In addition, we validated the mRNA model and the SNP model in two independent external datasets, respectively, and good discriminating abilities were observed through survival analysis (P < 0.05). Moreover, the multi-omics model based on mRNA, lncRNA, miRNA, CNV, and SNP information presented quite strong predictive ability, with C-index over 0.80 and all AUC values at 1, 3, and 5 years above 0.84. Conclusion: In this study, we identified many biomarkers that may help study underlying carcinogenesis mechanisms in HCC, and constructed five single-omic models and an integrated multi-omics model that may provide effective and reliable guides for prognosis assessment and treatment decision-making.


Radiomics-based model for predicting early recurrence of intrahepatic mass-forming cholangiocarcinoma after curative tumor resection

Scientific Reports ◽

10.1038/s41598-021-97796-1 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Yong Zhu ◽

Yingfan Mao ◽

Jun Chen ◽

Yudong Qiu ◽

Yue Guan ◽

...

Keyword(s):

Regression Analysis ◽

Multivariate Logistic Regression Analysis ◽

Early Recurrence ◽

Prediction Performance ◽

Multivariate Logistic Regression ◽

Training Set ◽

Combined Model ◽

Test Set ◽

Pathological Model ◽

Radiomics Signature

Abstract To investigate the ability of a CT-based radiomics signature to predict, pre- and postoperatively, the early recurrence of intrahepatic mass-forming cholangiocarcinoma (IMCC) and to develop radiomics-based prediction models. The institutional review board approved this study. Clinicopathological characteristics, contrast-enhanced CT images, and radiomics features of 125 IMCC patients (35 with early recurrence and 90 with non-early recurrence) were retrospectively reviewed. In the training set of 92 patients, a preoperative model, a pathological model, and a combined model were developed by multivariate logistic regression analysis to predict the early recurrence (≤ 6 months) of IMCC, and the prediction performance of the different models was compared using the DeLong test. The developed models were validated by assessing their prediction performance in a test set of 33 patients. Multivariate logistic regression analysis identified solitary, differentiation, energy-arterial phase (AP), inertia-AP, and percentile50th-portal venous phase (PV) to construct the combined model for predicting early recurrence of IMCC [area under the curve (AUC) = 0.917; 95% CI 0.840–0.965]. The AUCs of the pathological model and the preoperative model were 0.741 (95% CI 0.637–0.828) and 0.844 (95% CI 0.751–0.912), respectively. The AUC of the combined model was significantly higher than that of the preoperative model (p = 0.049) or the pathological model (p = 0.002) in the training set. In the test set, the combined model also showed higher prediction performance. The CT-based radiomics signature is a powerful predictor of early recurrence of IMCC. The preoperative model (constructed with homogeneity-AP and standard deviation-AP) and the combined model (constructed with solitary, differentiation, energy-AP, inertia-AP, and percentile50th-PV) can improve the accuracy of pre- and postoperative prediction of the early recurrence of IMCC.
