XGBLC: an improved survival prediction model based on XGBoost

Author(s):  
Baoshan Ma ◽  
Ge Yan ◽  
Bingjie Chai ◽  
Xiaoyu Hou

Abstract
Motivation: Survival analysis using gene expression profiles plays a crucial role in the interpretation of clinical research and the assessment of disease therapy programs. Several prediction models have been developed to explore the relationship between patients' covariates and survival. However, high-dimensional genomic features limit the prediction performance of survival models. Thus, an accurate and reliable prediction model is necessary for survival analysis using high-dimensional genomic data.
Results: In this study, we propose an improved survival prediction model based on the XGBoost framework, called XGBLC, which uses Lasso-Cox to enhance the ability to analyze high-dimensional genomic data. Novel first- and second-order gradient statistics of the Lasso-Cox loss are defined to construct the loss function of XGBLC. We extensively tested the XGBLC algorithm on both simulated and real-world datasets and estimated model performance with 5-fold cross-validation. On 20 cancer datasets from The Cancer Genome Atlas (TCGA), XGBLC outperforms five state-of-the-art survival methods in terms of C-index, Brier score and AUC. The results also show that XGBLC maintains good accuracy and robustness across simulated datasets of different scales. The developed prediction model can help physicians understand the effects of a patient's genomic characteristics on survival and make personalized treatment decisions.
Availability and implementation: The R implementation of the XGBLC algorithm is available at: https://github.com/lab319/XGBLC
Supplementary information: Supplementary data are available at Bioinformatics online.
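Gradient-boosting frameworks such as XGBoost optimise an arbitrary loss through its per-sample first- and second-order gradient statistics, which is what the abstract refers to for the Lasso-Cox loss. As a hedged illustration (not the authors' XGBLC code, and without the Lasso penalty), the sketch below computes those statistics for the plain negative Cox partial log-likelihood with Breslow-style risk sets:

```python
import math

def cox_grad_hess(time, event, eta):
    """Per-sample gradient and diagonal Hessian of the negative Cox
    partial log-likelihood with respect to the linear predictor eta.
    These are the 'first- and second-order gradient statistics' a
    boosting framework needs for a custom survival objective."""
    n = len(time)
    grad = [0.0] * n
    hess = [0.0] * n
    exp_eta = [math.exp(e) for e in eta]
    for i in range(n):
        if not event[i]:
            continue  # censored subjects contribute only through risk sets
        # risk set R(i): subjects still under observation at time[i]
        risk = [j for j in range(n) if time[j] >= time[i]]
        s = sum(exp_eta[j] for j in risk)
        for j in risk:
            p = exp_eta[j] / s       # softmax weight within the risk set
            grad[j] += p
            hess[j] += p * (1 - p)
        grad[i] -= 1.0               # the event's own term
    return grad, hess
```

In an actual XGBoost custom objective these two vectors would be returned from the objective callback; here the formulation is only a minimal stand-alone sketch.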

2019 ◽  
Vol 35 (14) ◽  
pp. i484-i491
Author(s):  
Jakob Richter ◽  
Katrin Madjar ◽  
Jörg Rahnenführer

Abstract
Motivation: Obtaining a reliable prediction model for a specific cancer subgroup or cohort is often difficult due to limited sample size and, in survival analysis, potentially high censoring rates. Sometimes similar data from other patient subgroups are available, e.g. from other clinical centers. Simply pooling all subgroups can decrease the variance of the predicted parameters of the prediction models, but it can also increase the bias due to heterogeneity between the cohorts. A promising compromise is to identify those subgroups with a similar relationship between covariates and target variable and include only these for model building.
Results: We propose a subgroup-based weighted likelihood approach for survival prediction with high-dimensional genetic covariates. When predicting survival for a specific subgroup, an individual weight for every other subgroup determines the strength with which its observations enter into model building. MBO (model-based optimization) can be used to quickly find a good prediction model in the presence of a large number of hyperparameters. We use MBO to identify the best model for survival prediction of a specific subgroup by optimizing the weights of the additional subgroups for a Cox model. The approach is evaluated on a set of lung cancer cohorts with gene expression measurements. The resulting models have competitive prediction quality, and they reflect the similarity of the corresponding cancer subgroups, with weights ranging from close to 0 through medium values to close to 1.
Availability and implementation: mlrMBO is implemented as an R package and is freely available at http://github.com/mlr-org/mlrMBO.
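To make the weighted-likelihood idea concrete: a common convention for case-weighted Cox regression is to scale each subject's contribution, in both the event term and the risk-set sums, by its weight. The sketch below follows that convention; the paper's exact formulation and its MBO-driven weight search are not reproduced here.

```python
import math

def weighted_cox_nll(time, event, eta, weight):
    """Negative case-weighted Cox partial log-likelihood. A subgroup
    weight of 0 removes its subjects entirely; a weight of 1 includes
    them as if they belonged to the target subgroup."""
    total = 0.0
    n = len(time)
    for i in range(n):
        if not event[i]:
            continue
        # weighted risk-set sum over subjects still at risk at time[i]
        s = sum(weight[j] * math.exp(eta[j])
                for j in range(n) if time[j] >= time[i])
        total -= weight[i] * (eta[i] - math.log(s))
    return total
```

In the paper's setting, each subgroup's weight would be a hyperparameter tuned by MBO so that helpful cohorts enter strongly and dissimilar ones are down-weighted.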


Mathematics ◽  
2021 ◽  
Vol 9 (11) ◽  
pp. 1244
Author(s):  
Lin Hao ◽  
Juncheol Kim ◽  
Sookhee Kwon ◽  
Il Do Ha

With the development of high-throughput technologies, more and more high-dimensional or ultra-high-dimensional genomic data are being generated. Therefore, effectively analyzing such data has become a significant challenge. Machine learning (ML) algorithms have been widely applied for modeling nonlinear and complicated interactions in a variety of practical fields such as high-dimensional survival data. Recently, multilayer deep neural network (DNN) models have made remarkable achievements. Thus, a Cox-based DNN survival prediction model (the DNNSurv model), which was built with Keras and TensorFlow, was developed. However, its results had only been evaluated on survival datasets with high-dimensional features or large sample sizes. In this paper, we evaluated the prediction performance of the DNNSurv model using ultra-high-dimensional and high-dimensional survival datasets and compared it with three popular ML survival prediction models (i.e., random survival forest and the Cox-based LASSO and Ridge models). For this purpose, we also present the optimal settings of several hyperparameters, including the selection of a tuning parameter. The data analysis demonstrates that the DNNSurv model performed well overall compared with the ML models, in terms of the three main evaluation measures (i.e., concordance index, time-dependent Brier score, and time-dependent AUC) for survival prediction performance.
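One of the comparison baselines mentioned here, the Cox-based Ridge model, can be sketched in a few lines: gradient descent on the negative Cox partial log-likelihood plus an L2 penalty. This is a minimal stand-in (plain gradient descent, tiny data), not the glmnet-style solver such comparisons normally use.

```python
import math

def cox_ridge_fit(X, time, event, lam=0.1, lr=0.05, steps=200):
    """Gradient descent on  -log partial likelihood + lam * ||beta||^2,
    i.e. a ridge-penalised Cox model fitted the naive way."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(steps):
        eta = [sum(X[i][k] * beta[k] for k in range(p)) for i in range(n)]
        grad = [2 * lam * b for b in beta]          # ridge term
        for i in range(n):
            if not event[i]:
                continue
            risk = [j for j in range(n) if time[j] >= time[i]]
            s = sum(math.exp(eta[j]) for j in risk)
            for j in risk:
                w = math.exp(eta[j]) / s
                for k in range(p):
                    grad[k] += w * X[j][k]          # risk-set average
            for k in range(p):
                grad[k] -= X[i][k]                  # the event's own term
        beta = [beta[k] - lr * grad[k] for k in range(p)]
    return beta
```

The LASSO variant would replace the ridge gradient with soft-thresholded coordinate updates; DNNSurv replaces the linear predictor with a deep network trained on the same loss.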


2020 ◽  
Vol 36 (Supplement_1) ◽  
pp. i389-i398
Author(s):  
Sunkyu Kim ◽  
Keonwoo Kim ◽  
Junseok Choe ◽  
Inggeol Lee ◽  
Jaewoo Kang

Abstract
Motivation: Recent advances in deep learning have offered solutions to many biomedical tasks. However, applying deep learning to survival analysis of human cancer transcriptome data remains a challenge. Because the number of genes, the input variables of a survival model, is larger than the number of available cancer patient samples, deep-learning models are prone to overfitting. To address this issue, we introduce a new deep-learning architecture called VAECox, which uses transfer learning and fine-tuning.
Results: We pre-trained a variational autoencoder on all RNA-seq data in 20 TCGA datasets and transferred the trained weights to our survival prediction model. We then fine-tuned the transferred weights while training the survival model on each dataset. Results show that our model outperformed previous models such as Cox proportional hazards with LASSO or ridge penalty and Cox-nnet on 7 of 10 TCGA datasets in terms of C-index. This signifies that the information transferred from the entire cancer transcriptome data helped our survival prediction model reduce overfitting and show robust performance on unseen cancer patient samples.
Availability and implementation: Our implementation of VAECox is available at https://github.com/dmis-lab/VAECox.
Supplementary information: Supplementary data are available at Bioinformatics online.
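The transfer step described here, copying pre-trained encoder weights into the survival model and then fine-tuning everything, can be sketched abstractly. The parameter stores below are plain dicts and the names (`encoder.`, `cox_head.`) are illustrative assumptions, not the authors' code.

```python
def transfer_encoder_weights(pretrained, survival_net, prefix="encoder."):
    """Copy every pre-trained parameter whose name starts with `prefix`
    into the survival network, leaving its other parameters (e.g. the
    Cox output head) at their initial values. Nothing is frozen: in a
    VAECox-style setup the copied weights are fine-tuned afterwards."""
    for name, value in pretrained.items():
        if name.startswith(prefix) and name in survival_net:
            survival_net[name] = value
    return survival_net
```

In a real deep-learning framework this corresponds to loading a state dict for the shared encoder layers before training on the Cox loss.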


Author(s):  
Il Do Ha ◽  
Lin Hao ◽  
Juncheol Kim ◽  
Sookhee Kwon

With the development of high-throughput technologies, more and more high-dimensional or ultra-high-dimensional genomic data are generated. Therefore, effectively analysing such data has become a challenge. Machine learning (ML) algorithms have been widely applied for modelling nonlinear and complicated interactions in a variety of practical fields such as high-dimensional survival data. Recently, multilayer deep neural network (DNN) models have made remarkable achievements. Thus, a Cox-based DNN survival prediction model (the DNNSurv model), which was built with Keras and TensorFlow, was developed. However, its results had only been evaluated on survival datasets with high-dimensional features or large sample sizes. In this paper, we evaluate the prediction performance of the DNNSurv model using ultra-high-dimensional and high-dimensional survival datasets, and compare it with three popular ML survival prediction models (i.e., random survival forest and the Cox-based LASSO and Ridge models). For this purpose we also present the optimal setting of several hyperparameters, including the selection of a tuning parameter. The proposed method demonstrates via data analysis that the DNNSurv model performs well overall compared with the ML models, in terms of three main evaluation measures (i.e., concordance index, time-dependent Brier score and time-dependent AUC) for survival prediction performance.


2020 ◽  
Author(s):  
Georgios Kantidakis ◽  
Hein Putter ◽  
Carlo Lancia ◽  
Jacob de Boer ◽  
Andries E Braat ◽  
...  

Abstract
Background: Predicting the survival of recipients after liver transplantation is regarded as one of the most important challenges in contemporary medicine. Hence, improving on current prediction models is of great interest. Nowadays, there is a strong discussion in the medical field about machine learning (ML) and whether it has greater potential than traditional regression models when dealing with complex data. Criticism of ML relates to unsuitable performance measures and a lack of interpretability, which is important for clinicians.
Methods: In this paper, ML techniques such as random forests and neural networks are applied to a large dataset of 62,294 patients from the United States, with 97 predictors selected on clinical/statistical grounds from more than 600, to predict survival after transplantation. The identification of potential risk factors is also of particular interest. A comparison is performed between 3 different Cox models (with all variables, backward selection and LASSO) and 3 machine learning techniques: a random survival forest and 2 partial logistic artificial neural networks (PLANNs). For the PLANNs, novel extensions to their original specification are tested. Emphasis is placed on the advantages and pitfalls of each method and on the interpretability of the ML techniques.
Results: Well-established predictive measures from the survival field (C-index, Brier score and Integrated Brier Score) are employed, and the strongest prognostic factors are identified for each model. The clinical endpoint is overall graft survival, defined as the time between transplantation and the date of graft failure or death. The random survival forest shows slightly better predictive performance than the Cox models based on the C-index. Neural networks show better performance than both the Cox models and the random survival forest based on the Integrated Brier Score at 10 years.
Conclusion: In this work, it is shown that machine learning techniques can be a useful tool for both prediction and interpretation in the survival context. Of the ML techniques examined here, PLANN with 1 hidden layer predicts survival probabilities the most accurately, being as well calibrated as the Cox model with all variables.
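The C-index used throughout these comparisons is easy to state precisely. The sketch below implements Harrell's concordance index in its basic form (no tie handling for event times, no censoring-weight correction), which is enough to show what "slightly better C-index" measures:

```python
def c_index(time, event, risk):
    """Harrell's concordance index. A pair (i, j) is comparable when
    subject i had an event and a shorter observed time than j; it is
    concordant when i also has the higher predicted risk. Ties in
    predicted risk count as 1/2."""
    conc, comp = 0.0, 0
    n = len(time)
    for i in range(n):
        for j in range(n):
            if event[i] and time[i] < time[j]:
                comp += 1
                if risk[i] > risk[j]:
                    conc += 1.0
                elif risk[i] == risk[j]:
                    conc += 0.5
    return conc / comp
```

A value of 1.0 means perfect ranking of risks, 0.5 is random, and production implementations add inverse-probability-of-censoring weights for unbiased estimates under heavy censoring.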


2015 ◽  
Vol 2015 ◽  
pp. 1-7 ◽  
Author(s):  
Shuhei Kaneko ◽  
Akihiro Hirakawa ◽  
Chikuma Hamada

In the past decade, researchers in oncology have sought to develop survival prediction models using gene expression data. The least absolute shrinkage and selection operator (lasso) has been widely used to select genes that are truly correlated with a patient's survival. The lasso selects genes for prediction by shrinking a large number of coefficients of the candidate genes towards zero, based on a tuning parameter that is often determined by cross-validation (CV). However, this method can fail to identify true positive genes in certain instances (i.e., it produces false negatives), because the lasso tends to favor a simple prediction model. Here, we attempt to monitor the occurrence of false negatives by developing a method for estimating the number of true positive (TP) genes over a series of values of the tuning parameter, assuming a mixture distribution for the lasso estimates. Using the developed method, we performed a simulation study to examine its precision in estimating the number of TP genes. Additionally, we applied our method to a real gene expression dataset and found that it was able to identify genes correlated with survival that the CV method was unable to detect.
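The shrinkage behaviour behind the false negatives described above comes from the lasso's soft-thresholding operator: as the tuning parameter grows, coefficients within its reach are set exactly to zero, so a CV-chosen value can silence genuinely associated genes. A one-function sketch:

```python
def soft_threshold(z, lam):
    """Lasso's coordinate-wise shrinkage: values within `lam` of zero
    become exactly zero (the gene is dropped); the rest are shrunk
    toward zero by `lam`."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0
```

A gene whose unpenalised coefficient estimate is, say, 0.9 is a true positive at lam = 0.5 but a false negative at lam = 1.0, which is exactly the trade-off the paper's TP-count estimate tracks across the tuning-parameter path.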


2014 ◽  
Vol 986-987 ◽  
pp. 1356-1359
Author(s):  
You Xian Peng ◽  
Bo Tang ◽  
Hong Ying Cao ◽  
Bin Chen ◽  
Yu Li

Audible noise prediction has become an active research area in power transmission engineering in recent years, particularly for AC transmission lines. The conventional prediction models suffer from problems such as large errors. In this paper, a prediction model is established based on a BP (backpropagation) neural network, in which the input variables are the four factors in the internationally used expression for power-line audible noise and the output is the noise value. Taking multiple measured power lines as an example, the BP network is trained and the prediction model is captured in the network's hidden layer. Using the trained model, audible noise values are predicted. The final results show that the average absolute error of the values predicted by the BP-neural-network-based audible noise model is 1.6414 less than that of the GE formula.
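The model described is a standard one-hidden-layer BP network: four inputs, sigmoid hidden units, a linear output, trained by backpropagation on squared error. The sketch below is a generic minimal version (toy data, illustrative hyperparameters), not the paper's fitted model:

```python
import math, random

def train_bp(samples, targets, hidden=5, lr=0.1, epochs=500, seed=0):
    """One-hidden-layer BP network: sigmoid hidden units, linear output,
    stochastic gradient descent on mean squared error."""
    rng = random.Random(seed)
    n_in = len(samples[0])
    W1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    W2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]
    b2 = 0.0
    sig = lambda z: 1.0 / (1.0 + math.exp(-z))
    for _ in range(epochs):
        for x, t in zip(samples, targets):
            h = [sig(sum(W1[j][k] * x[k] for k in range(n_in)) + b1[j])
                 for j in range(hidden)]
            y = sum(W2[j] * h[j] for j in range(hidden)) + b2
            d_y = y - t                                   # output-layer error
            for j in range(hidden):
                d_h = d_y * W2[j] * h[j] * (1 - h[j])     # backpropagated error
                W2[j] -= lr * d_y * h[j]
                for k in range(n_in):
                    W1[j][k] -= lr * d_h * x[k]
                b1[j] -= lr * d_h
            b2 -= lr * d_y
    def predict(x):
        h = [sig(sum(W1[j][k] * x[k] for k in range(n_in)) + b1[j])
             for j in range(hidden)]
        return sum(W2[j] * h[j] for j in range(hidden)) + b2
    return predict
```

In the paper's setting the four inputs would be the factors of the international audible-noise expression and the target the measured noise level.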


Cancers ◽  
2020 ◽  
Vol 12 (4) ◽  
pp. 834
Author(s):  
J.J. van Kleef ◽  
H.G. van den Boorn ◽  
R.H.A. Verhoeven ◽  
K. Vanschoenbeek ◽  
A. Abu-Hanna ◽  
...  

The SOURCE prediction model predicts individualised survival conditional on various treatments for patients with metastatic oesophageal or gastric cancer. The aim of this study was to validate SOURCE in an external cohort from the Belgian Cancer Registry. Data of Belgian patients diagnosed with metastatic disease between 2004 and 2014 were extracted (n = 4097). Model calibration and discrimination (c-indices) were determined. A total of 2514 patients with oesophageal cancer and 1583 patients with gastric cancer, with a median survival of 7.7 and 5.4 months, respectively, were included. The oesophageal cancer model showed poor calibration (intercept: 0.30, slope: 0.42) with an absolute mean prediction error of 14.6%. The mean difference between predicted and observed survival was −2.6%. The concordance index (c-index) of the oesophageal model was 0.64. The gastric cancer model showed good calibration (intercept: 0.02, slope: 0.91) with an absolute mean prediction error of 2.5%. The mean difference between predicted and observed survival was 2.0%. The c-index of the gastric cancer model was 0.66. The SOURCE gastric cancer model was well calibrated and performed similarly in the Belgian cohort compared with the Dutch internal validation; the oesophageal cancer model, however, was not. Our findings underscore the importance of evaluating the performance of prediction models in other populations.
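The calibration intercept and slope quoted above come from regressing observed outcomes on predicted ones; a perfectly calibrated model gives intercept 0 and slope 1. The sketch below uses plain least squares on outcome proportions as a simplified illustration (formal calibration analyses usually work on the log-hazard or logit scale):

```python
def calibration_line(predicted, observed):
    """Ordinary least squares of observed on predicted values.
    Intercept near 0 and slope near 1 indicate good calibration;
    a slope well below 1 (like 0.42) means predictions are too
    extreme relative to what is observed."""
    n = len(predicted)
    mp = sum(predicted) / n
    mo = sum(observed) / n
    cov = sum((p - mp) * (o - mo) for p, o in zip(predicted, observed))
    var = sum((p - mp) ** 2 for p in predicted)
    slope = cov / var
    intercept = mo - slope * mp
    return intercept, slope
```

Read against the abstract: the gastric model's (0.02, 0.91) sits close to the ideal (0, 1) line, while the oesophageal model's (0.30, 0.42) does not.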


2019 ◽  
Author(s):  
Wongeun Song ◽  
Se Young Jung ◽  
Hyunyoung Baek ◽  
Chang Won Choi ◽  
Young Hwa Jung ◽  
...  

BACKGROUND: Neonatal sepsis accounts for most mortality and morbidity in the neonatal intensive care unit (NICU). Many studies have developed prediction models for the early diagnosis of bloodstream infections in newborns, but because these models are based on high-resolution waveform data, data collection and management are difficult.
OBJECTIVE: The aim of this study was to examine the feasibility of a prediction model using noninvasive vital sign data and machine learning technology.
METHODS: We used electronic medical record data from intensive care units published in the Medical Information Mart for Intensive Care III clinical database. The late-onset neonatal sepsis (LONS) prediction algorithm, using our proposed forward feature selection technique, was based on NICU inpatient data and was designed to detect clinical sepsis 48 hours before occurrence. The performance of this prediction model was evaluated using various feature selection algorithms and machine learning models.
RESULTS: The performance of the LONS prediction model was found to be comparable to that of prediction models that use invasive data such as high-resolution vital sign data, blood gas estimations, blood cell counts, and pH levels. The area under the receiver operating characteristic curve of the 48-hour prediction model was 0.861 and that of the onset detection model was 0.868. The main features that could be vital candidate markers for clinical neonatal sepsis were blood pressure, oxygen saturation, and body temperature. Feature generation using the kurtosis and skewness of the features showed the highest performance.
CONCLUSIONS: The findings of our study confirm that a machine-learning-based LONS prediction model can be developed using vital sign data that are regularly measured in clinical settings. Future studies should conduct external validation using different types of data sets and actual clinical verification of the developed model.
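Forward feature selection, the technique named in the methods, is a greedy loop: starting from an empty set, repeatedly add whichever candidate feature most improves a scoring function, and stop when nothing improves. A generic sketch (the scoring callback would wrap the model and cross-validated AUC in the paper's setting; feature names below are illustrative):

```python
def forward_select(features, score, max_features=None):
    """Greedy forward selection: at each round, add the feature whose
    inclusion gives the best `score(subset)`; stop when no candidate
    improves on the current best score."""
    selected = []
    remaining = list(features)
    best = score(selected)
    while remaining and (max_features is None or len(selected) < max_features):
        gains = [(score(selected + [f]), f) for f in remaining]
        top_score, top_f = max(gains)
        if top_score <= best:
            break                      # no candidate improves the score
        best = top_score
        selected.append(top_f)
        remaining.remove(top_f)
    return selected
```

This is O(rounds × candidates) model fits, which is why it pairs well with the small noninvasive feature sets the study advocates.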


Complexity ◽  
2021 ◽  
Vol 2021 ◽  
pp. 1-13
Author(s):  
Tongfei Lao ◽  
Xiaoting Chen ◽  
Jianian Zhu

As a tool for analyzing time series, grey prediction models have been widely used in various fields of society due to their high prediction accuracy and their advantages in small-sample modeling. The basic GM(1, N) model is the most popular and important multivariate grey model, in which the "1" stands for "first order" and the "N" for "multivariate". The construction of the background values is not only an important step in grey modeling but also a key factor affecting the prediction accuracy of grey prediction models. In order to further improve the prediction accuracy of multivariate grey prediction models, this paper establishes a novel multivariate grey prediction model based on dynamic background values (abbreviated as the DBGM(1, N) model) and uses the whale optimization algorithm to solve for the optimal parameters of the model. The DBGM(1, N) model can adapt to different time series by changing its parameters, thereby improving prediction accuracy; it is a grey prediction model with extremely strong adaptability. Finally, four cases are used to verify the feasibility and effectiveness of the model. The results show that the proposed model significantly outperforms the other two multivariate grey prediction models.
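The "background values" at the heart of this abstract are built from the accumulated (1-AGO) series: the classic choice averages adjacent accumulated values with a fixed 0.5 weight, and a dynamic-background model generalises that weight to a tunable parameter. A minimal sketch of both pieces (the optimisation of the weight, e.g. by the whale optimization algorithm, is not shown):

```python
def ago(x):
    """First-order accumulated generating operation (1-AGO):
    the running sum of the raw series."""
    out, s = [], 0.0
    for v in x:
        s += v
        out.append(s)
    return out

def background_values(x1, lam=0.5):
    """Background values of an accumulated series x1. The classic
    GM(1, N) fixes lam = 0.5; a dynamic-background variant treats
    lam as a parameter to be optimised per time series."""
    return [lam * x1[k] + (1 - lam) * x1[k - 1] for k in range(1, len(x1))]
```

Because `lam` shifts each background value between consecutive accumulated points, tuning it lets the model track series whose growth is faster or slower than the fixed midpoint assumes.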

