Deep Learning-Based Survival Analysis for High-Dimensional Survival Data

Mathematics ◽  
2021 ◽  
Vol 9 (11) ◽  
pp. 1244
Author(s):  
Lin Hao ◽  
Juncheol Kim ◽  
Sookhee Kwon ◽  
Il Do Ha

With the development of high-throughput technologies, more and more high-dimensional or ultra-high-dimensional genomic data are being generated, and effectively analyzing such data has become a significant challenge. Machine learning (ML) algorithms have been widely applied for modeling nonlinear and complicated interactions in a variety of practical fields, including the analysis of high-dimensional survival data. Recently, multilayer deep neural network (DNN) models have made remarkable achievements. Thus, a Cox-based DNN survival prediction model (DNNSurv model), built with Keras and TensorFlow, was developed. However, its results were previously evaluated only on survival datasets with high dimensionality or large sample sizes. In this paper, we evaluated the prediction performance of the DNNSurv model on ultra-high-dimensional and high-dimensional survival datasets and compared it with three popular ML survival prediction models (i.e., random survival forest and the Cox-based LASSO and Ridge models). For this purpose, we also present the optimal settings of several hyperparameters, including the selection of a tuning parameter. Data analysis demonstrated that the DNNSurv model performed well overall compared with the ML models in terms of the three main evaluation measures for survival prediction performance (i.e., concordance index, time-dependent Brier score, and time-dependent AUC).
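The concordance index used above rewards a model whose predicted risks order patients consistently with their observed survival times. As a hedged illustration (this is not the authors' implementation, and the toy data and names are mine), Harrell's C-index can be computed as follows:

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C-index: the fraction of comparable pairs in which the
    higher-risk subject fails earlier. Ties in risk count as 0.5.
    events[i] is 1 for an observed death and 0 for censoring."""
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # pair (i, j) is comparable only if i has an observed event
            # strictly before j's (event or censoring) time
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    return concordant / comparable

# toy example: higher score should mean shorter survival
times  = [2, 4, 6, 8]
events = [1, 1, 0, 1]
scores = [0.9, 0.7, 0.5, 0.1]
print(concordance_index(times, events, scores))  # perfectly concordant -> 1.0
```

A value of 0.5 corresponds to random ordering; censored subjects contribute only through pairs whose ordering is still unambiguous.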



Author(s):  
Baoshan Ma ◽  
Ge Yan ◽  
Bingjie Chai ◽  
Xiaoyu Hou

Abstract Motivation Survival analysis using gene expression profiles plays a crucial role in the interpretation of clinical research and the assessment of disease therapy programs. Several prediction models have been developed to explore the relationship between patients' covariates and survival. However, high-dimensional genomic features limit the prediction performance of survival models. Thus, an accurate and reliable prediction model is necessary for survival analysis using high-dimensional genomic data. Results In this study, we proposed an improved survival prediction model based on the XGBoost framework, called XGBLC, which uses Lasso-Cox to enhance the ability to analyze high-dimensional genomic data. Novel first- and second-order gradient statistics of Lasso-Cox were defined to construct the loss function of XGBLC. We extensively tested the XGBLC algorithm on both simulated and real-world datasets and estimated model performance with 5-fold cross-validation. On 20 cancer datasets from The Cancer Genome Atlas (TCGA), XGBLC outperforms five state-of-the-art survival methods in terms of C-index, Brier score, and AUC. Comparisons on simulated datasets of different scales show that XGBLC also maintains good accuracy and robustness. The developed prediction model would help physicians understand the effects of a patient's genomic characteristics on survival and make personalized treatment decisions. Availability and implementation An R implementation of the XGBLC algorithm is available at: https://github.com/lab319/XGBLC Supplementary information Supplementary data are available at Bioinformatics online.
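XGBoost-style boosting needs, for each observation, the first and second derivatives of the loss with respect to the current prediction. For a Cox-type objective these come from the partial likelihood; the sketch below uses the plain Breslow negative partial log-likelihood in my own notation, not the authors' exact XGBLC loss (which additionally carries the Lasso penalty):

```python
from math import exp

def cox_grad_hess(times, events, eta):
    """First/second-order statistics of the Breslow negative Cox partial
    log-likelihood w.r.t. the predicted log-risks eta. These play the role
    of XGBoost's per-observation g_i and h_i in a Cox-type objective."""
    n = len(times)
    exp_eta = [exp(e) for e in eta]
    grad = [-float(ev) for ev in events]   # the -delta_k term
    hess = [0.0] * n
    for i in range(n):
        if events[i] != 1:
            continue
        # risk set of subject i: everyone still under observation at t_i
        risk = [j for j in range(n) if times[j] >= times[i]]
        denom = sum(exp_eta[j] for j in risk)
        for k in risk:
            p = exp_eta[k] / denom         # share of risk mass at t_i
            grad[k] += p
            hess[k] += p * (1.0 - p)
    return grad, hess
```

A quick sanity property: the gradient entries sum to zero (each event contributes -1 plus risk-set shares that sum to 1), and the diagonal Hessian entries are non-negative, as a boosting objective requires.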


2020 ◽  
Author(s):  
Georgios Kantidakis ◽  
Hein Putter ◽  
Carlo Lancia ◽  
Jacob de Boer ◽  
Andries E Braat ◽  
...  

Abstract Background: Predicting survival of recipients after liver transplantation is regarded as one of the most important challenges in contemporary medicine; hence, improving on current prediction models is of great interest. Nowadays, there is a strong discussion in the medical field about machine learning (ML) and whether it has greater potential than traditional regression models when dealing with complex data. Criticism of ML relates to unsuitable performance measures and a lack of interpretability, which is important for clinicians. Methods: In this paper, ML techniques such as random forests and neural networks are applied to a large dataset of 62,294 patients from the United States, with 97 predictors selected on clinical/statistical grounds from more than 600, to predict survival from transplantation. Of particular interest is also the identification of potential risk factors. A comparison is performed between 3 different Cox models (with all variables, backward selection, and LASSO) and 3 machine learning techniques: a random survival forest and 2 partial logistic artificial neural networks (PLANNs). For PLANNs, novel extensions to their original specification are tested. Emphasis is placed on the advantages and pitfalls of each method and on the interpretability of the ML techniques. Results: Well-established predictive measures from the survival field are employed (C-index, Brier score, and Integrated Brier Score), and the strongest prognostic factors are identified for each model. The clinical endpoint is overall graft survival, defined as the time between transplantation and the date of graft failure or death. The random survival forest shows slightly better predictive performance than the Cox models based on the C-index. Neural networks show better performance than both the Cox models and the random survival forest based on the Integrated Brier Score at 10 years. Conclusion: In this work, it is shown that machine learning techniques can be a useful tool for both prediction and interpretation in the survival context. Of the ML techniques examined here, PLANN with 1 hidden layer predicts survival probabilities the most accurately, being as well calibrated as the Cox model with all variables.
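The Brier score used above measures the squared error between a model's predicted survival probability at a fixed horizon and the observed status at that horizon. The following is a simplified sketch for fully observed times (with censored data, a real implementation additionally applies inverse-probability-of-censoring weights); the numbers are made up:

```python
def brier_score(times, survival_probs, horizon):
    """Brier score at a fixed horizon for fully observed survival times.
    survival_probs[i] is the model's predicted P(T_i > horizon)."""
    n = len(times)
    return sum((float(t > horizon) - p) ** 2
               for t, p in zip(times, survival_probs)) / n

times = [1.0, 3.0, 7.0, 9.0]
probs = [0.1, 0.2, 0.8, 0.9]   # predicted P(survive past t = 5)
print(brier_score(times, probs, horizon=5.0))  # ~ 0.025
```

Lower is better, and the Integrated Brier Score mentioned in the abstract averages this quantity over a grid of horizons.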


2015 ◽  
Vol 2015 ◽  
pp. 1-7 ◽  
Author(s):  
Shuhei Kaneko ◽  
Akihiro Hirakawa ◽  
Chikuma Hamada

In the past decade, researchers in oncology have sought to develop survival prediction models using gene expression data. The least absolute shrinkage and selection operator (lasso) has been widely used to select genes that truly correlate with a patient's survival. The lasso selects genes for prediction by shrinking a large number of coefficients of the candidate genes towards zero, based on a tuning parameter that is often determined by cross-validation (CV). However, this method can pass over (or fail to identify) true positive genes (i.e., it yields false negatives) in certain instances, because the lasso tends to favor a simple prediction model. Here, we attempt to monitor the occurrence of false negatives by developing a method for estimating the number of true positive (TP) genes over a series of tuning-parameter values, assuming a mixture distribution for the lasso estimates. Using the developed method, we performed a simulation study to examine its precision in estimating the number of TP genes. Additionally, we applied our method to a real gene expression dataset and found that it identified genes correlated with survival that the CV method was unable to detect.
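The shrinkage behaviour described above is easiest to see in the special orthonormal-design case, where each lasso coefficient is simply a soft-thresholded least-squares estimate. The toy numbers below are invented purely to show how a larger tuning parameter drops more genes, including weak but genuine effects (the false negatives the authors monitor):

```python
def soft_threshold(z, lam):
    """Lasso solution for one coefficient under an orthonormal design:
    shrink the least-squares estimate z toward zero by lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

# hypothetical least-squares estimates for 8 candidate genes
ols = [2.5, -1.8, 0.9, 0.4, -0.3, 0.2, -0.1, 0.05]
for lam in (0.0, 0.5, 1.0, 2.0):
    coefs = [soft_threshold(z, lam) for z in ols]
    selected = sum(c != 0.0 for c in coefs)
    print(f"lambda={lam}: {selected} genes selected")
```

As lambda grows from 0.0 to 2.0, the number of selected genes here falls from 8 to 1; genes with small true effects are the first to be zeroed out.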


2012 ◽  
Vol 32 (13) ◽  
pp. 2173-2184 ◽  
Author(s):  
Thomas A. Gerds ◽  
Michael W. Kattan ◽  
Martin Schumacher ◽  
Changhong Yu

2021 ◽  
Author(s):  
Sara Morsy ◽  
Truong Hong Hieu ◽  
Abdelrahman M Makram ◽  
Osama Gamal Hassan ◽  
Nguyen Tran Minh Duc ◽  
...  

Purpose Applying machine learning in medical statistics offers more accurate prediction models. In this paper, we aimed to compare the performance of the Cox proportional hazards model (CPH), classification and regression trees (CART), and random survival forest (RSF) for short- and long-term prediction in glioblastoma patients. Methods We extracted glioblastoma data from the Surveillance, Epidemiology, and End Results (SEER) database. We used CPH, CART, and RSF for the prediction of 1- to 10-year survival probabilities. The Brier score for each duration was calculated, and the model with the lowest score was considered the most accurate. Results The cohort included 26,473 glioblastoma patients divided into two groups: a training set (n = 18,538) and a validation set (n = 7,935). The average survival duration was seven months. For both short- and long-term predictions, RSF was the best algorithm, followed by CPH and CART. Conclusion For big data, RSF was found to have the highest accuracy and best performance. Using an accurate statistical model for survival prediction and determination of prognostic factors will help the care of cancer patients. However, further development of the R packages is needed to better illustrate the effect of each covariate on the survival probability.




2021 ◽  
Vol 16 (1) ◽  
Author(s):  
Yuto Sugai ◽  
Noriyuki Kadoya ◽  
Shohei Tanaka ◽  
Shunpei Tanabe ◽  
Mariko Umeda ◽  
...  

Abstract Background Radiomics is a new technology to noninvasively predict survival prognosis using quantitative features extracted from medical images. Most radiomics-based prognostic studies of non-small-cell lung cancer (NSCLC) patients have used mixed datasets of different subgroups. Therefore, we investigated radiomics-based survival prediction for NSCLC patients by focusing on subgroups with identical characteristics. Methods A total of 304 NSCLC patients (Stages I–IV) treated with radiotherapy in our hospital were included. We extracted 107 radiomic features (i.e., 14 shape features, 18 first-order statistical features, and 75 texture features) from the gross tumor volume drawn on the free-breathing planning computed tomography image. Three feature selection methods [i.e., test–retest and multiple segmentation (FS1), Pearson's correlation analysis (FS2), and a method that combined FS1 and FS2 (FS3)] were used to clarify how they affect survival prediction performance. Subgroup analyses for each histological subtype and each T stage applied the feature selection method that performed best in the analysis of all data. We used a least absolute shrinkage and selection operator (LASSO) Cox regression model for all analyses and evaluated prognostic performance using the concordance index (C-index) and the Kaplan–Meier method. For the subgroup analyses, fivefold cross-validation was applied to ensure model reliability. Results In the analysis of all data, the C-index for the test dataset was 0.62 (FS1), 0.63 (FS2), and 0.62 (FS3). The subgroup analysis indicated that prediction models based on specific histological subtypes and T stages had a higher C-index for the test dataset than the model based on all data (all data, 0.64 vs. SCC_all, 0.60; ADC_all, 0.69; T1, 0.68; T2, 0.65; T3, 0.66; T4, 0.70). In addition, the prediction models unified for each T stage within a histological subtype showed a different trend in the test-dataset C-index between ADC-related and SCC-related models (ADC_T1–ADC_T4, 0.72–0.83; SCC_T1–SCC_T4, 0.58–0.71). Conclusions Our results showed that the feature selection methods moderately affected survival prediction performance. In addition, prediction models based on specific subgroups may improve prediction performance. These results may prove useful for determining the optimal radiomics-based prediction model.
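The Kaplan–Meier method used for evaluation above estimates a survival curve by multiplying, at each distinct event time, the fraction of at-risk patients who survive it. A minimal sketch with invented data (ties between an event and a censoring at the same time follow the usual convention that censoring happens after the event):

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate. Returns (time, S(time)) pairs at
    each distinct event time; events[i] is 1 for death, 0 for censoring."""
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv, curve = 1.0, []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = at_t = 0
        while i < len(data) and data[i][0] == t:   # group ties at time t
            at_t += 1
            deaths += data[i][1]
            i += 1
        if deaths:
            surv *= 1.0 - deaths / n_at_risk       # survive this event time
            curve.append((t, surv))
        n_at_risk -= at_t                          # drop events and censorings
    return curve

print(kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 0, 1]))
```

For this toy input the curve steps down to 0.8 at t=1, 0.6 at t=2, and 0.0 at t=4; censored subjects shrink only the future risk sets.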


2021 ◽  
Author(s):  
Zhenghao Chen ◽  
Anil Raj ◽  
G.V. Prateek ◽  
Andrea Di Francesco ◽  
Justin Liu ◽  
...  

Behavior and physiology are essential readouts in many studies but have not benefited from the high-dimensional data revolution that has transformed molecular and cellular phenotyping. To address this, we developed an approach that combines commercially available automated phenotyping hardware with a systems biology analysis pipeline to generate a high-dimensional readout of mouse behavior/physiology, as well as intuitive and health-relevant summary statistics (resilience and biological age). We used this platform to longitudinally evaluate aging in hundreds of outbred mice across an age range from 6 months to 3.4 years. In contrast to the assumption that aging can only be measured at the limits of animal ability via challenge-based tasks, we observed widespread physiological and behavioral aging starting in early life. Using network connectivity analysis, we found that organism-level resilience exhibited an accelerating decline with age that was distinct from the trajectory of individual phenotypes. We developed a method, Combined Aging and Survival Prediction of Aging Rate (CASPAR), for jointly predicting chronological age and survival time and showed that the resulting model is able to predict both variables simultaneously, a behavior that is not captured by separate age and mortality prediction models. This study provides a uniquely high-resolution view of physiological aging in mice and demonstrates that systems-level analysis of physiology provides insights not captured by individual phenotypes. The approach described here allows aging, and other processes that affect behavior and physiology, to be studied with sophistication and rigor.


Cancers ◽  
2020 ◽  
Vol 12 (10) ◽  
pp. 2802
Author(s):  
Mi Du ◽  
Dandara G. Haag ◽  
John W. Lynch ◽  
Murthy N. Mittinty

This study aims to demonstrate the use of the tree-based machine learning algorithms to predict the 3- and 5-year disease-specific survival of oral and pharyngeal cancers (OPCs) and compare their performance with the traditional Cox regression. A total of 21,154 individuals diagnosed with OPCs between 2004 and 2009 were obtained from the Surveillance, Epidemiology, and End Results (SEER) database. Three tree-based machine learning algorithms (survival tree (ST), random forest (RF) and conditional inference forest (CF)), together with a reference technique (Cox proportional hazard models (Cox)), were used to develop the survival prediction models. To handle the missing values in predictors, we applied the substantive model compatible version of the fully conditional specification imputation approach to the Cox model, whereas we used RF to impute missing data for the ST, RF and CF models. For internal validation, we used 10-fold cross-validation with 50 iterations in the model development datasets. Following this, model performance was evaluated using the C-index, integrated Brier score (IBS) and calibration curves in the test datasets. For predicting the 3-year survival of OPCs with the complete cases, the C-index in the development sets were 0.77 (0.77, 0.77), 0.70 (0.70, 0.70), 0.83 (0.83, 0.84) and 0.83 (0.83, 0.86) for Cox, ST, RF and CF, respectively. Similar results were observed in the 5-year survival prediction models, with C-index for Cox, ST, RF and CF being 0.76 (0.76, 0.76), 0.69 (0.69, 0.70), 0.83 (0.83, 0.83) and 0.85 (0.84, 0.86), respectively, in development datasets. The prediction error curves based on IBS showed a similar pattern for these models. The predictive performance remained unchanged in the analyses with imputed data. Additionally, a free web-based calculator was developed for potential clinical use. 
In conclusion, compared to Cox regression, ST had lower, while RF and CF had higher, predictive accuracy in predicting 3- and 5-year OPC survival using SEER data. The RF and CF algorithms provide non-parametric alternatives to Cox regression that could be of clinical use for estimating the survival probability of OPC patients.
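Survival trees such as ST typically grow by choosing, at each node, the covariate split that maximises a two-sample log-rank statistic between the candidate child nodes. The sketch below is a hedged illustration of that criterion (tie handling varies between implementations, and the data are made up):

```python
def logrank_statistic(times, events, groups):
    """Two-sample log-rank statistic, the split criterion commonly used
    when growing survival trees: (O - E)^2 / V for group 1's deaths."""
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    O = E = V = 0.0
    for t in event_times:
        n  = sum(1 for ti in times if ti >= t)                 # at risk overall
        n1 = sum(1 for ti, g in zip(times, groups) if ti >= t and g == 1)
        d  = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        d1 = sum(1 for ti, e, g in zip(times, events, groups)
                 if ti == t and e == 1 and g == 1)
        O += d1                                                # observed deaths
        E += d * n1 / n                                        # expected deaths
        if n > 1:
            V += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return (O - E) ** 2 / V if V > 0 else 0.0

t = [1, 2, 3, 10, 11, 12]
e = [1] * 6
good_split = [1, 1, 1, 0, 0, 0]   # separates early from late deaths
poor_split = [1, 0, 1, 0, 1, 0]   # mixes them
print(logrank_statistic(t, e, good_split) > logrank_statistic(t, e, poor_split))  # True
```

A split that cleanly separates short from long survivors yields a much larger statistic, which is why such splits are preferred when the tree is grown.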

