Mortality Prediction Using SaO2/FiO2 Ratio Based on eICU Database Analysis

Critical Care Research and Practice ◽

10.1155/2021/6672603 ◽

2021 ◽

Vol 2021 ◽

pp. 1-9

Author(s):

Sharad Patel ◽

Gurkeerat Singh ◽

Samson Zarbiv ◽

Kia Ghiassi ◽

Jean-Sebastien Rachoin

Keyword(s):

Prediction Models ◽

Model Performance ◽

Predictive Ability ◽

Mortality Prediction ◽

Gradient Boosting ◽

Admission Diagnosis ◽

Research Database ◽

ICU Mortality ◽

Feature Importance ◽

Partial Dependence

Purpose. The PaO2 to FiO2 ratio (P/F) is used to assess the degree of hypoxemia adjusted for oxygen requirements. The Berlin definition of Acute Respiratory Distress Syndrome (ARDS) includes P/F as a diagnostic criterion. P/F is invasive and cost-prohibitive for resource-limited settings. The SaO2/FiO2 (S/F) ratio is easy to calculate, noninvasive, continuous, cost-effective, and reliable; it also lowers infection exposure potential for staff and avoids iatrogenic anemia. Previous work suggests that S/F correlates with P/F and can be used as a surrogate in ARDS. The quantitative correlation between S/F and P/F has been verified, but their relative predictive ability for ICU mortality remains in question. We hypothesize that S/F is noninferior to P/F as a predictive feature for ICU mortality. Using a machine-learning approach, we aim to demonstrate the relative mortality predictive capacities of S/F and P/F. Methods. We extracted data from the eICU Collaborative Research Database. The features age, gender, SaO2, PaO2, FiO2, admission diagnosis, APACHE IV, mechanical ventilation (MV), and ICU mortality were extracted. Mortality was the dependent variable for our prediction models. Exploratory data analysis was performed in Python. Missing data were imputed with the Sklearn Iterative Imputer. All encounters were randomly assigned, 80% to training (n = 26690) and 20% to testing (n = 6741), stratified by positive and negative classes to ensure a balanced distribution. We scaled the data using the Sklearn Standard Scaler. Categorical values were encoded using Target Encoding. We used a gradient boosting decision tree algorithm variant called XGBoost as our model. Model hyperparameters were tuned using the Sklearn RandomizedSearchCV with tenfold cross-validation. We used AUC as our metric for model performance.
Feature importance was assessed using SHAP, ELI5 (permutation importance), and a built-in XGBoost feature importance method. We constructed partial dependence plots to illustrate the relationship between mortality probability and S/F values. Results. The hyperparameter-optimized XGBoost model had an AUC of 0.85 on the test set. The hyperparameters selected to train the final models were as follows: colsample_bytree of 0.8, gamma of 1, max_depth of 3, subsample of 1, min_child_weight of 10, and scale_pos_weight of 3. The SHAP, ELI5, and XGBoost feature importance analyses demonstrate that the S/F ratio ranks as the strongest predictor of mortality among the physiologic variables. The partial dependence plots illustrate that mortality rises significantly above S/F values of 200. Conclusion. S/F was a stronger predictor of mortality than P/F based upon feature importance evaluation of our data. Our study is hypothesis-generating and a prospective evaluation is warranted. Take-Home Points. The S/F ratio is a noninvasive, continuous method of measuring hypoxemia as compared to the P/F ratio. Our study shows that the S/F ratio is a better predictor of mortality than the P/F ratio, which is more widely used to monitor and manage hypoxemia.
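
The two oxygenation indices compared in this abstract are simple quotients. The following is a minimal sketch, assuming FiO2 is recorded as a fraction between 0.21 and 1.0; the function name and the example values are illustrative and are not taken from the study.

```python
def oxygenation_ratio(numerator: float, fio2: float) -> float:
    """Return numerator / FiO2, with FiO2 given as a fraction (0.21 to 1.0).

    Pass SaO2 in percent to obtain S/F, or PaO2 in mmHg to obtain P/F.
    """
    if not 0.21 <= fio2 <= 1.0:
        raise ValueError("FiO2 must be a fraction between 0.21 and 1.0")
    return numerator / fio2


# Hypothetical patient breathing 50% oxygen.
sf_ratio = oxygenation_ratio(94.0, 0.5)  # SaO2 of 94%
pf_ratio = oxygenation_ratio(80.0, 0.5)  # PaO2 of 80 mmHg
print(f"S/F = {sf_ratio:.0f}, P/F = {pf_ratio:.0f}")
```

Rejecting out-of-range FiO2 values guards against the common data-entry mix-up between percent (50) and fraction (0.5), which would otherwise shrink the ratio by a factor of 100.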


Machine learning augmented predictive and generative model for rupture life in ferritic and austenitic steels

npj Materials Degradation ◽

10.1038/s41529-021-00166-5 ◽

2021 ◽

Vol 5 (1) ◽

Author(s):

Osman Mamun ◽

Madison Wenzlick ◽

Arun Sathanur ◽

Jeffrey Hawk ◽

Ram Devanathan

Keyword(s):

Pearson Correlation ◽

Rupture Life ◽

Model Performance ◽

Austenitic Stainless Steels ◽

Generative Model ◽

Austenitic Steels ◽

Gradient Boosting ◽

Variational Autoencoder ◽

Feature Importance ◽

Boosting Algorithm

Abstract The Larson–Miller parameter (LMP) offers an efficient and fast scheme to estimate the creep rupture life of alloy materials for high-temperature applications; however, poor generalizability and dependence on the constant C often result in sub-optimal performance. In this work, we show that direct rupture life parameterization without intermediate LMP parameterization, using a gradient boosting algorithm, can be used to train ML models for very accurate prediction of rupture life in a variety of alloys (Pearson correlation coefficient >0.9 for 9–12% Cr and >0.8 for austenitic stainless steels). In addition, the Shapley value was used to quantify feature importance, making the model interpretable by identifying the effect of various features on the model performance. Finally, a variational autoencoder-based generative model was built by conditioning on the experimental dataset to sample, from the learnt joint distribution, hypothetical synthetic candidate alloys that exist in neither the 9–12% Cr ferritic–martensitic alloy dataset nor the austenitic stainless steel dataset.
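
For context on the dependence on C mentioned above, the Larson–Miller parameter is conventionally written as LMP = T (C + log10 t_r), with T in kelvin and t_r in hours. The sketch below assumes that textbook form and the commonly quoted default C = 20; the function names and numbers are illustrative and do not come from the paper.

```python
import math


def larson_miller(temp_kelvin: float, rupture_hours: float, c: float = 20.0) -> float:
    """Larson-Miller parameter: LMP = T * (C + log10(t_r))."""
    return temp_kelvin * (c + math.log10(rupture_hours))


def rupture_life_hours(lmp: float, temp_kelvin: float, c: float = 20.0) -> float:
    """Invert the LMP relation to recover rupture life in hours."""
    return 10.0 ** (lmp / temp_kelvin - c)


# Hypothetical test: rupture after 10,000 h at 600 degrees C (873.15 K).
lmp = larson_miller(873.15, 10_000.0)

# The same LMP value read back with a different constant C gives a very
# different life, which illustrates the sensitivity to C.
life_c20 = rupture_life_hours(lmp, 873.15, c=20.0)
life_c25 = rupture_life_hours(lmp, 873.15, c=25.0)
print(lmp, life_c20, life_c25)
```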


High mortality rate of obstetric critically ill women in Rwanda and its predictability

BMC Pregnancy and Childbirth ◽

10.1186/s12884-021-03882-7 ◽

2021 ◽

Vol 21 (1) ◽

Author(s):

Alcade Rudakemwa ◽

Amyl Lucille Cassidy ◽

Théogène Twagirumugabe

Keyword(s):

Prediction Models ◽

Average Length ◽

Mortality Prediction ◽

Routine Practice ◽

ICU Mortality ◽

Resource Limited ◽

Obstetric Haemorrhage ◽

Tertiary Hospitals ◽

Mortality Prediction Models ◽

qSOFA Score

Abstract. Background: Reasons for admission to intensive care units (ICUs) for obstetric patients vary from one setting to another. Outcomes from ICU and prediction models are not well explored in Rwanda owing to lack of appropriate scores. This study aimed to assess reasons for admission and accuracy of prediction models for mortality of obstetric patients admitted to ICUs of two public tertiary hospitals in Rwanda. Methods: We prospectively collected data from all obstetric patients admitted to the ICUs of the two public tertiary hospitals in Rwanda from March 2017 to February 2018 to identify reasons for admission, demographic and clinical characteristics, and outcome, including death and its predictability by both the Modified Early Obstetric Warning Score (MEOWS) and the quick Sequential Organ Failure Assessment (qSOFA). We analysed the accuracy of mortality prediction by MEOWS and qSOFA using logistic regression adjusted for factors associated with mortality. The area under the receiver operating characteristic curve (AUROC) was used to show the predictive capacity of each tool. Results: Obstetric patients (n = 94) represented 12.8 % of all 747 ICU admissions and 1.8 % of all 4,999 women admitted for pregnancy or labor. Sepsis (n = 30; 31.9 %) and obstetric haemorrhage (n = 24; 25.5 %) were the two commonest reasons for ICU admission. Overall ICU mortality for obstetric patients was 54.3 % (n = 51) with average length of stay of 6.6 ± 7.525 days. MEOWS score was an independent predictor of mortality (adjusted (a)OR 1.25; 95 % CI 1.07–1.46) and so was qSOFA score (aOR 2.81; 95 % CI 1.25–6.30), with adjusted AUROCs of 0.773 (95 % CI 0.67–0.88) and 0.764 (95 % CI 0.65–0.87), respectively, indicating fair accuracy of both MEOWS and qSOFA scores for ICU mortality prediction in these settings. Conclusions: Sepsis and obstetric haemorrhage were the commonest reasons for obstetric admissions to ICU in Rwanda.
MEOWS and qSOFA scores could accurately predict ICU mortality of obstetric patients in resource-limited settings, but larger studies are needed before a recommendation for their use in routine practice in similar settings.
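
The AUROC reported for each score can be read as the probability that a randomly chosen non-survivor has a higher score than a randomly chosen survivor. The sketch below computes that quantity directly by pairwise comparison; it is an unadjusted illustration on made-up data, not the adjusted analysis used in the study, and the variable names are hypothetical.

```python
def auroc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical qSOFA-like scores (0 to 3) and outcomes (1 = died in ICU).
died = [1, 1, 1, 0, 0, 0, 1, 0]
score = [3, 2, 2, 1, 0, 2, 1, 1]
print(auroc(died, score))
```

The pairwise loop is quadratic in the number of patients, which is acceptable for a cohort of this size; a rank-based formulation would be preferable for large datasets.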


High mortality rate of obstetric critically ill patients in Rwanda and its predictability

10.21203/rs.3.rs-29207/v2 ◽

2020 ◽

Author(s):

Alcade Rudakemwa ◽

Amy Lucille Cassidy ◽

Theogene Twagirumugabe

Keyword(s):

Prediction Models ◽

Average Length ◽

ROC Curves ◽

Mortality Prediction ◽

ICU Mortality ◽

Average Length Of Stay ◽

Referral Hospitals ◽

Tertiary Hospitals ◽

Mortality Prediction Models ◽

qSOFA Score

Abstract. Background: Reasons for admission to intensive care units (ICUs) for obstetric patients vary from one setting to another. ICU outcomes and their prediction models are not well explored in Rwanda because of a lack of appropriate scores. This study intended to assess the profile of obstetric patients admitted to ICU in the two public tertiary hospitals in Rwanda and the accuracy of predictive models for these patients. Methods: We prospectively collected data from all obstetric patients admitted to the ICUs of public referral hospitals in Rwanda from March 2017 to February 2018 to identify reasons for admission and prognostic factors. We analysed the accuracy of mortality prediction models, including the quick Sequential Organ Failure Assessment (qSOFA) and the Modified Early Obstetric Warning Score (MEOWS), using logistic regression and adjusted receiver operating characteristic (ROC) curves. Results: Obstetric patients represented 12.8% of all ICU admissions and 1.8% of all deliveries. Sepsis (31.9%) and haemorrhage (25.5%) were the two commonest reasons for ICU admission among our study participants. The overall ICU mortality for our obstetric patients was 54.3%, while the average length of stay was 6.6 days. MEOWS score was an independent predictor of mortality (adjusted OR=1.25[1.07-1.46]; p=0.005) and so was the qSOFA score (adjusted OR=2.81[1.25-6.30]; p=0.012). The adjusted area under the ROC curve (AUROC) for MEOWS was 0.773[0.666-0.880] and that of qSOFA was 0.764[0.654-0.873], signifying fair accuracy of both models for ICU mortality prediction in these settings. Conclusion: Sepsis is the commonest reason for admission to ICU for obstetric patients in Rwanda. Simple models comprising MEOWS and qSOFA could accurately predict mortality for those patients, but larger studies are needed before generalization.
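
An adjusted odds ratio and its interval are usually obtained by exponentiating a logistic regression coefficient and its confidence limits. The sketch below assumes a Wald-type 95% interval (symmetric on the log scale), which the abstract does not state explicitly; the helper names are hypothetical. It recovers the implied coefficient from the MEOWS interval reported above and reproduces the point estimate.

```python
import math

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval


def odds_ratio_ci(beta: float, se: float) -> tuple[float, float, float]:
    """Return (OR, lower, upper) from a logistic coefficient and its standard error."""
    return (
        math.exp(beta),
        math.exp(beta - Z_95 * se),
        math.exp(beta + Z_95 * se),
    )


def coefficient_from_ci(lower: float, upper: float) -> tuple[float, float]:
    """Recover (beta, se) from a Wald-type 95% interval given on the OR scale."""
    beta = (math.log(lower) + math.log(upper)) / 2.0
    se = (math.log(upper) - math.log(lower)) / (2.0 * Z_95)
    return beta, se


# Interval reported for MEOWS in the abstract: 1.07 to 1.46.
beta, se = coefficient_from_ci(1.07, 1.46)
odds_ratio, lower, upper = odds_ratio_ci(beta, se)
print(f"OR = {odds_ratio:.2f} ({lower:.2f}-{upper:.2f})")
```

That the midpoint of the log-limits returns 1.25 is consistent with the interval having been computed on the log-odds scale.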


Abstract W MP37: Novel Prognostic Scores for Early Prediction of Outcome Following Aneurysmal Subarachnoid Hemorrhage

Stroke ◽

10.1161/str.46.suppl_1.wmp37 ◽

2015 ◽

Vol 46 (suppl_1) ◽

Author(s):

Blessing Jaja ◽

Hester Lingsma ◽

Ewout Steyerberg ◽

R. Loch Macdonald ◽

Keyword(s):

Subarachnoid Hemorrhage ◽

Cross Validation ◽

Prediction Models ◽

Aneurysmal Subarachnoid Hemorrhage ◽

Model Performance ◽

Predictor Variable ◽

Predictive Ability ◽

Prognostic Scores ◽

Operating Characteristics ◽

Fisher Grade

Background: Aneurysmal subarachnoid hemorrhage (SAH) is a cerebrovascular emergency. Currently, clinicians have limited tools to estimate outcomes early after hospitalization. We aimed to develop novel prognostic scores using large cohorts of patients reflecting experience from different settings. Methods: Logistic regression analysis was used to develop prediction models for mortality and unfavorable outcomes according to the 3-month Glasgow outcome score after SAH, based on readily obtained parameters at hospital admission. The development cohort was derived from 10 prospective studies involving 10936 patients in the Subarachnoid Hemorrhage International Trialists (SAHIT) repository. Model performance was assessed by bootstrap internal validation and by cross validation with omission of each of the 10 studies in turn, using the R² statistic, the area under the receiver operating characteristic curve (AUC), and calibration plots. Prognostic scores were developed from the regression coefficients. Results: The predictor variable with the strongest prognostic strength was neurologic status (partial R² = 12.03%), followed by age (1.91%), treatment modality (1.25%), Fisher grade of CT clot burden (0.65%), history of hypertension (0.37%), aneurysm size (0.12%) and aneurysm location (0.06%). These predictors were combined to develop 3 sets of hierarchical scores based on the coefficients of the regression models. The AUC at bootstrap validation was 0.79-0.80, and at cross validation was 0.64-0.85. Calibration plots demonstrated satisfactory agreement between predicted and observed probabilities of the outcomes. Conclusions: The novel prognostic scores have good predictive ability and potential for broad application as they have been developed from prospective cohorts reflecting experience from different centers globally.
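
Cross validation by omission of each study can be sketched as a leave-one-group-out split: every study is held out once while the model is fitted on the remaining studies. The generator below only produces the index sets; the function name and the toy study labels are assumptions made for illustration.

```python
def leave_one_study_out(study_ids):
    """Yield (held_out_study, train_indices, test_indices) for each distinct study."""
    for held_out in sorted(set(study_ids)):
        train = [i for i, s in enumerate(study_ids) if s != held_out]
        test = [i for i, s in enumerate(study_ids) if s == held_out]
        yield held_out, train, test


# Hypothetical study label for each of six patients.
studies = ["A", "A", "B", "C", "C", "C"]
for held_out, train, test in leave_one_study_out(studies):
    print(held_out, train, test)
```

Because whole studies are withheld, the spread of per-fold results reflects between-study heterogeneity, which is one plausible reason a cross-validated range can be wider than a bootstrap range.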


Clinical and Laboratory Predictors of In-hospital Mortality in Patients With Coronavirus Disease-2019: A Cohort Study in Wuhan, China

Clinical Infectious Diseases ◽

10.1093/cid/ciaa538 ◽

2020 ◽

Vol 71 (16) ◽

pp. 2079-2088 ◽

Cited By ~ 52

Author(s):

Kun Wang ◽

Peiyuan Zuo ◽

Yuwei Liu ◽

Meng Zhang ◽

Xiaofang Zhao ◽

...

Keyword(s):

Hospital Mortality ◽

Prediction Models ◽

Area Under The Curve ◽

Mortality Prediction ◽

Gradient Boosting ◽

Laboratory Model ◽

Training Cohort ◽

Clinical Model ◽

Extreme Gradient Boosting ◽

Mortality Prediction Models

Abstract. Background: This study aimed to develop mortality-prediction models for patients with coronavirus disease-2019 (COVID-19). Methods: The training cohort included consecutive COVID-19 patients at the First People’s Hospital of Jiangxia District in Wuhan, China, from 7 January 2020 to 11 February 2020. We selected baseline data through the stepwise Akaike information criterion and an ensemble XGBoost (extreme gradient boosting) model to build mortality-prediction models. We then validated these models in randomly selected COVID-19 patients at Union Hospital, Wuhan, from 1 January 2020 to 20 February 2020. Results: A total of 296 COVID-19 patients were enrolled in the training cohort; 19 died during hospitalization and 277 were discharged from the hospital. The clinical model developed using age, history of hypertension, and coronary heart disease showed area under the curve (AUC), 0.88 (95% confidence interval [CI], .80–.95); threshold, −2.6551; sensitivity, 92.31%; specificity, 77.44%; and negative predictive value (NPV), 99.34%. The laboratory model developed using age, high-sensitivity C-reactive protein, peripheral capillary oxygen saturation, neutrophil and lymphocyte count, d-dimer, aspartate aminotransferase, and glomerular filtration rate had a significantly stronger discriminatory power than the clinical model (P = .0157), with AUC, 0.98 (95% CI, .92–.99); threshold, −2.998; sensitivity, 100.00%; specificity, 92.82%; and NPV, 100.00%. In the subsequent validation cohort (N = 44), the AUC (95% CI) was 0.83 (.68–.93) and 0.88 (.75–.96) for the clinical model and laboratory model, respectively. Conclusions: We developed 2 predictive models for the in-hospital mortality of patients with COVID-19 in Wuhan that were validated in patients from another center.
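
Sensitivity, specificity, and NPV at a chosen threshold all derive from the four cells of a confusion matrix. The sketch below shows the arithmetic on hypothetical counts; the counts are not those of the study, and the function assumes every denominator is nonzero.

```python
def diagnostic_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute threshold-dependent metrics from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # deaths correctly flagged
        "specificity": tn / (tn + fp),  # survivors correctly cleared
        "npv": tn / (tn + fn),          # cleared patients who survived
    }


# Hypothetical cohort with few deaths: 20 deaths, 200 survivors.
metrics = diagnostic_metrics(tp=18, fp=40, tn=160, fn=2)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

When deaths are rare, as in a cohort with 19 deaths among 296 patients, NPV tends to be high almost regardless of the model, so it is best read alongside sensitivity and specificity.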


Acoustic-Based Prediction of End-Product-Based Fibre Determinates within Standing Jack Pine Trees

Forests ◽

10.3390/f10070605 ◽

2019 ◽

Vol 10 (7) ◽

pp. 605

Author(s):

Peter F. Newton

Keyword(s):

Wood Density ◽

Pinus Banksiana ◽

Goodness Of Fit ◽

Prediction Models ◽

Microfibril Angle ◽

Model Performance ◽

Predictive Ability ◽

Analytical Framework ◽

Jack Pine ◽

Cross Sectional

The objective of this study was to specify, parameterize, and evaluate an acoustic-based inferential framework for estimating commercially-relevant wood attributes within standing jack pine (Pinus banksiana Lamb) trees. The analytical framework consisted of a suite of models for predicting the dynamic modulus of elasticity (me), microfibril angle (ma), oven-dried wood density (wd), tracheid wall thickness (wt), radial and tangential tracheid diameters (dr and dt, respectively), fibre coarseness (co), and specific surface area (sa), from dilatational stress wave velocity (vd). Data acquisition consisted of (1) in-forest collection of acoustic velocity measurements on 61 sample trees situated within 10 variable-sized plots that were established in four mature jack pine stands situated in boreal Canada followed by the removal of breast-height cross-sectional disk samples, and (2) given (1), in-laboratory extraction of radial-based transverse xylem samples from the 61 disks and subsequent attribute determination via Silviscan-3. Statistically, attribute-specific acoustic prediction models were specified, parameterized, and, subsequently, evaluated on their goodness-of-fit, lack-of-fit, and predictive ability. The results indicated that significant (p ≤ 0.05) and unbiased relationships could be established for all attributes but dt. The models explained 71%, 66%, 61%, 42%, 30%, 19%, and 13% of the variation in me, wt, sa, co, wd, ma, and dr, respectively. Simulated model performance when deploying an acoustic-based wood density estimate indicated that the expected magnitude of the error arising from predicting dt, co, sa, wt, me, and ma would be in the order of ±8%, ±12%, ±12%, ±13%, ±20%, and ±39% of their true values, respectively. Assessment of the utility of predicting the prerequisite wd estimate using micro-drill resistance measures revealed that the amplitude-based wd estimate was inconsequentially more precise than that obtained from vd (≈ <2%).
A discourse regarding the potential utility and limitations of the acoustic-based computational suite for forecasting jack pine end-product potential was also articulated.
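
The abstract does not state how me relates to vd and wd. A widely used one-dimensional approximation in acoustic wood testing is me = wd × vd², and the sketch below assumes that relation purely for illustration; it should not be read as the parameterization fitted in the study, and the input values are hypothetical.

```python
def dynamic_moe_gpa(density_kg_m3: float, velocity_m_s: float) -> float:
    """Dynamic modulus of elasticity in GPa, assuming MOE = density * velocity**2."""
    moe_pa = density_kg_m3 * velocity_m_s ** 2  # kg/m^3 * (m/s)^2 = Pa
    return moe_pa / 1e9


# Hypothetical tree: wood density 450 kg/m^3, stress wave velocity 4000 m/s.
print(dynamic_moe_gpa(450.0, 4000.0))
```

Under this assumed relation a relative error in density carries over one-for-one to the modulus, while a relative error in velocity is roughly doubled.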


Prediction of Maize Phenotypic Traits With Genomic and Environmental Predictors Using Gradient Boosting Frameworks

Frontiers in Plant Science ◽

10.3389/fpls.2021.699589 ◽

2021 ◽

Vol 12 ◽

Author(s):

Cathy C. Westhues ◽

Gregory S. Mahone ◽

Sofia da Silva ◽

Patrick Thorwarth ◽

Malthe Schmidt ◽

...

Keyword(s):

Machine Learning ◽

Grain Yield ◽

Prediction Models ◽

Predictive Ability ◽

The United States ◽

Environmental Data ◽

Gradient Boosting ◽

Phenotypic Traits ◽

Environmental Predictors ◽

Prediction Problems

The development of crop varieties with stable performance in future environmental conditions represents a critical challenge in the context of climate change. Environmental data collected at the field level, such as soil and climatic information, can be relevant to improve predictive ability in genomic prediction models by describing more precisely genotype-by-environment interactions, which represent a key component of the phenotypic response for complex crop agronomic traits. Modern predictive modeling approaches can efficiently handle various data types and are able to capture complex nonlinear relationships in large datasets. In particular, machine learning techniques have gained substantial interest in recent years. Here we examined the predictive ability of machine learning-based models for two phenotypic traits in maize using data collected by the Maize Genomes to Fields (G2F) Initiative. The data we analyzed consisted of multi-environment trials (METs) dispersed across the United States and Canada from 2014 to 2017. An assortment of soil- and weather-related variables was derived and used in prediction models alongside genotypic data. Linear random effects models were compared to a linear regularized regression method (elastic net) and to two nonlinear gradient boosting methods based on decision tree algorithms (XGBoost, LightGBM). These models were evaluated under four prediction problems: (1) tested and new genotypes in a new year; (2) only unobserved genotypes in a new year; (3) tested and new genotypes in a new site; (4) only unobserved genotypes in a new site. Accuracy in forecasting grain yield performance of new genotypes in a new year was improved by up to 20% over the baseline model by including environmental predictors with gradient boosting methods. For plant height, an enhancement of predictive ability could neither be observed by using machine learning-based methods nor by using detailed environmental information. 
An investigation of key environmental factors using gradient boosting frameworks also revealed that temperature at flowering stage, frequency and amount of water received during the vegetative and grain filling stage, and soil organic matter content appeared as important predictors for grain yield in our panel of environments.
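
The first two prediction problems differ only in which test-year genotypes are evaluated. A minimal sketch of that distinction follows, assuming each record is a (genotype, year) pair and that the model is trained on all years other than the held-out one; both the data layout and the function name are hypothetical.

```python
def new_year_split(records, test_year, unobserved_only=False):
    """Split (genotype, year) records into train and test index lists.

    With unobserved_only=False the test set holds every genotype grown in
    test_year (problem 1); with True it keeps only genotypes that never
    appear in the training years (problem 2).
    """
    train = [i for i, (_, year) in enumerate(records) if year != test_year]
    seen = {records[i][0] for i in train}
    test = [
        i
        for i, (genotype, year) in enumerate(records)
        if year == test_year and (not unobserved_only or genotype not in seen)
    ]
    return train, test


trials = [
    ("G1", 2014), ("G2", 2014), ("G1", 2015),
    ("G3", 2015), ("G2", 2016), ("G4", 2016),
]
print(new_year_split(trials, 2016))
print(new_year_split(trials, 2016, unobserved_only=True))
```

The same pattern would apply to problems 3 and 4 by grouping on site instead of year.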


Electronic patient-reported outcomes (ePROs) and machine learning (ML) in predicting the presence and onset of immune-related adverse events (irAEs) of immune checkpoint inhibitor (ICI) therapies.

Journal of Clinical Oncology ◽

10.1200/jco.2020.38.15_suppl.e14058 ◽

2020 ◽

Vol 38 (15_suppl) ◽

pp. e14058-e14058

Author(s):

Sanna Iivanainen ◽

Jussi Ekström ◽

Vesa V Kataja ◽

Henri Virtanen ◽

Jussi Koivunen

Keyword(s):

Early Detection ◽

Cancer Patients ◽

Performance Metrics ◽

Prediction Models ◽

Model Performance ◽

Gradient Boosting ◽

Random Allocation ◽

Organ Systems ◽

Extreme Gradient Boosting ◽

Patient Reported

Background: ICIs have introduced novel irAEs, arising from various organ systems without a strong temporal dependency on initiation and discontinuation of the therapy. Early detection of irAEs could result in an improved safety profile of the treatment and better quality of life for patients. Symptom data collected by ePROs could be used as an input for ML-based prediction models for early detection of irAEs. Methods: The dataset used consisted of two data sources. The first consisted of 16 540 reported symptoms from 33 ICI-treated cancer patients, including 18 monitored symptoms collected using the Kaiku Health digital platform. The second included prospectively collected irAE data, including initiation and end dates, CTCAE class, and severity of 26 irAEs (the longest irAE lasted 799 days and the shortest two days, while the median duration was 61 days). Two ML models were built using extreme gradient boosting, a well-known classification algorithm. Using the ePRO data, the first model was trained to detect the presence and the second model to detect the onset (0-21 days prior to diagnosis) of irAEs. The dataset was split into training (70 % of the data) and test sets (30 % of the data) by random allocation. The test set was left out of model training and tuning, and was used only to evaluate model performance. Results: The model trained to predict the presence of irAEs had excellent performance on the test dataset. The prediction of irAE onset was more difficult, but model performance was still at a very good level. The performance metrics for the ML models are presented in the Table. Conclusions: The current study suggests that ML-based prediction models using ePRO data as input can predict the presence and onset of irAEs with high accuracy. Thus, it indicates that digital symptom monitoring combined with ML could enable the detection of irAEs in ICI-treated cancer patients.
The results should be validated with a larger dataset from prospective clinical trials. Clinical trial information: NCT03928938. [Table: see text]
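
The onset target described in the methods (0-21 days prior to diagnosis) can be expressed as a date-window label on each symptom report. The sketch below assumes both ends of the window are inclusive, which the abstract does not specify; the function name and dates are hypothetical.

```python
from datetime import date


def in_onset_window(report_date: date, diagnosis_date: date, window_days: int = 21) -> bool:
    """True if the report falls 0 to window_days days before the diagnosis."""
    days_before = (diagnosis_date - report_date).days
    return 0 <= days_before <= window_days


diagnosis = date(2020, 1, 22)
print(in_onset_window(date(2020, 1, 1), diagnosis))    # 21 days before
print(in_onset_window(date(2019, 12, 31), diagnosis))  # 22 days before
print(in_onset_window(date(2020, 1, 23), diagnosis))   # after diagnosis
```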


Noninvasive Real-time Mortality Prediction in Intensive Care Units Based on Gradient Boosting Method (Preprint)

10.2196/preprints.23888 ◽

2020 ◽

Author(s):

Huizhen Jiang ◽

Longxiang Su ◽

Hao Wang ◽

Dongkai Li ◽

Congpu Zhao ◽

...

Keyword(s):

Intensive Care ◽

Real Time ◽

Intensive Care Units ◽

Prediction Models ◽

Mortality Prediction ◽

Gradient Boosting ◽

Noninvasive Method ◽

ICU Patients ◽

The Real ◽

Boosting Method

BACKGROUND Critically ill patients in intensive care units (ICUs) require attention in real time. Scoring systems are commonly used to predict mortality risk, but because they simply weight clinical data they are usually neither precise nor real-time, and they are time-consuming for clinical staff. OBJECTIVE We aimed to fuse all available medical data and predict the real-time mortality of ICU patients with a machine learning method. In addition, we explored predicting mortality from noninvasive data to lessen patients' pain. METHODS We established 5 models to predict mortality in real time based on different features. Feature engineering was built on monitoring data, examination data, and scoring data. The 5 real-time mortality prediction models were RMM (monitoring features), RMA (APACHE and monitoring features), RMS (SOFA and monitoring features), RMME (monitoring and examination features), and RM (all features from monitoring, examination, and scoring data). We then compared the performance of all models, with a focus on the noninvasive method RMM. RESULTS After extensive experiments, the performance of RMME was superior to that of the other 4 models. Including the scoring features made the model perform worse. RMM, based only on monitoring features, performed better than RMA and RMS. Therefore, it is meaningful and practicable to predict mortality in a noninvasive way, which could spare patients additional physical harm such as blood draws. Moreover, we explored the top 9 features relevant to real-time mortality prediction.
The top 9 features were "ABP (mmHg) invasive mean pressure", "Heart rate", "ABP (mmHg) invasive systolic pressure", "Oxygen concentration", "SPO2", "Balance of inflow and outflow", "Total input", "ABP (mmHg) invasive diastolic pressure" and "NBP-average pressure", which could receive more attention during general clinical work. CONCLUSIONS This research could be helpful for real-time mortality prediction in ICU patients, especially by the noninvasive method, which is patient-friendly and of strong practical significance.


Assessing the effect of data integration on predictive ability of cancer survival models

Health Informatics Journal ◽

10.1177/1460458218824692 ◽

2019 ◽

Vol 26 (1) ◽

pp. 8-20 ◽

Cited By ~ 3

Author(s):

Yi Guo ◽

Jiang Bian ◽

Francois Modave ◽

Qian Li ◽

Thomas J George ◽

...

Keyword(s):

Data Integration ◽

Cancer Survival ◽

Prediction Models ◽

Survival Rates ◽

Model Performance ◽

Predictive Ability ◽

The United States ◽

Cancer Prognosis ◽

Survival Models ◽

Survival Prediction

Cancer is the second leading cause of death in the United States. To improve cancer prognosis and survival rates, a better understanding of multi-level contributory factors associated with cancer survival is needed. However, prior research on cancer survival has primarily focused on factors from the individual level due to limited availability of integrated datasets. In this study, we sought to examine how data integration impacts the performance of cancer survival prediction models. We linked data from four different sources and evaluated the performance of Cox proportional hazard models for breast, lung, and colorectal cancers under three common data integration scenarios. We showed that adding additional contextual-level predictors to survival models through linking multiple datasets improved model fit and performance. We also showed that different representations of the same variable or concept have differential impacts on model performance. When building statistical models for cancer outcomes, it is important to consider cross-level predictor interactions.
