Estimating real-world performance of a predictive model: a case-study in predicting mortality

JAMIA Open ◽  
2020 ◽  
Vol 3 (2) ◽  
pp. 243-251
Author(s):  
Vincent J Major ◽  
Neil Jethani ◽  
Yindalon Aphinyanaphongs

Abstract Objective One primary consideration when developing predictive models is downstream effects on future model performance. We conduct experiments to quantify the effects of experimental design choices, namely cohort selection and internal validation methods, on (estimated) real-world model performance. Materials and Methods Four years of hospitalizations are used to develop a 1-year mortality prediction model (composite of death or initiation of hospice care). Two common methods to select appropriate patient visits from their encounter history (backwards-from-outcome and forwards-from-admission) are combined with 2 testing cohorts (random and temporal validation). Two models are trained under otherwise identical conditions, and their performances compared. Operating thresholds are selected in each test set and applied to a “real-world” cohort of labeled admissions from another, unused year. Results Backwards-from-outcome cohort selection retains 25% of candidate admissions (n = 23 579), whereas forwards-from-admission selection includes many more (n = 92 148). Both selection methods produce similar performances when applied to a random test set. However, when applied to the temporally defined “real-world” set, forwards-from-admission yields higher areas under the ROC and precision recall curves (88.3% and 56.5% vs. 83.2% and 41.6%). Discussion A backwards-from-outcome experiment manipulates raw training data, simplifying the experiment. This manipulated data no longer resembles real-world data, resulting in optimistic estimates of test set performance, especially at high precision. In contrast, a forwards-from-admission experiment with a temporally separated test set consistently and conservatively estimates real-world performance. Conclusion Experimental design choices impose bias upon selected cohorts. A forwards-from-admission experiment, validated temporally, can conservatively estimate real-world performance. 
LAY SUMMARY The routine care of patients stands to benefit greatly from assistive technologies, including data-driven risk assessment. Already, many different machine learning and artificial intelligence applications are being developed from complex electronic health record data. To overcome challenges that arise from such data, researchers often start with simple experimental approaches to test their work. One key component is how patients (and their healthcare visits) are selected for the study from the pool of all patients seen. Another is how the group of patients used to create the risk estimator differs from the group used to evaluate how well it works. These choices complicate how the experimental setting compares to the real-world application to patients. For example, different selection approaches that depend on each patient’s future outcome can simplify the experiment but are impractical upon implementation as these data are unavailable. We show that this kind of “backwards” experiment optimistically estimates how well the model performs. Instead, our results advocate for experiments that select patients in a “forwards” manner and “temporal” validation that approximates training on past data and implementing on future data. More robust results help gauge the clinical utility of recent works and aid decision-making before implementation into practice.
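The contrast between the two cohort-selection strategies and the temporal validation described above can be sketched in code. This is a minimal illustration under assumed, hypothetical field names (`Admission`, `is_anchor_visit`, and so on), not the authors' actual pipeline:

```python
from dataclasses import dataclass

@dataclass
class Admission:
    patient_id: int
    year: int                 # calendar year of the admission
    outcome_within_1y: bool   # death-or-hospice composite, known only in hindsight
    is_anchor_visit: bool     # the single visit chosen relative to the known outcome

def backwards_from_outcome(admissions):
    """Keep only the visit anchored to each patient's (future) outcome.

    This simplifies the experiment but peeks at information unavailable
    at prediction time, shrinking and biasing the cohort."""
    return [a for a in admissions if a.is_anchor_visit]

def forwards_from_admission(admissions):
    """Keep every qualifying visit as it occurs; no outcome information used."""
    return list(admissions)

def temporal_split(admissions, cutoff_year):
    """Train on the past, test on the future, approximating deployment."""
    train = [a for a in admissions if a.year < cutoff_year]
    test = [a for a in admissions if a.year >= cutoff_year]
    return train, test
```

The forwards selector keeps every admission a deployed model would actually score, which is why its test-set estimates track real-world performance more closely.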

2019 ◽  
Author(s):  
Vincent J Major ◽  
Neil Jethani ◽  
Yindalon Aphinyanaphongs

Abstract Objective The main criterion for choosing how models are built is the subsequent effect on future (estimated) model performance. In this work, we evaluate the effects of experimental design choices on both estimated and actual model performance. Materials and Methods Four years of hospital admissions are used to develop a 1-year end-of-life prediction model. Two common methods to select appropriate prediction timepoints (backwards-from-outcome and forwards-from-admission) are introduced and combined with two ways of separating cohorts for training and testing (internal and temporal). Two models are trained in identical conditions, and their performances are compared. Finally, operating thresholds are selected in each test set and applied in a final, 'real-world' cohort consisting of one year of admissions. Results Backwards-from-outcome cohort selection discards 75% of candidate admissions (retaining n = 23,579), whereas forwards-from-admission selection includes many more (n = 92,148). Both selection methods produce similar global performances when applied to an internal test set. However, when applied to the temporally defined 'real-world' set, forwards-from-admission yields higher areas under the ROC and precision recall curves (88.3% and 56.5% vs. 83.2% and 41.6%). Discussion A backwards-from-outcome experiment effectively transforms the training data such that it no longer resembles real-world data. This results in optimistic estimates of test set performance, especially at high precision. In contrast, a forwards-from-admission experiment with a temporally separated test set consistently and conservatively estimates real-world performance. Conclusion Experimental design choices impose bias upon selected cohorts. A forwards-from-admission experiment, validated temporally, can conservatively estimate real-world performance.


2021 ◽  
Vol 14 (1) ◽  
Author(s):  
Ashwath Radhachandran ◽  
Anurag Garikipati ◽  
Nicole S. Zelin ◽  
Emily Pellegrini ◽  
Sina Ghandian ◽  
...  

Abstract Background Acute heart failure (AHF) is associated with significant morbidity and mortality. Effective patient risk stratification is essential to guiding hospitalization decisions and the clinical management of AHF. Clinical decision support systems can be used to improve predictions of mortality made in emergency care settings for the purpose of AHF risk stratification. In this study, several models for the prediction of seven-day mortality among AHF patients were developed by applying machine learning techniques to retrospective patient data from 236,275 total emergency department (ED) encounters, 1881 of which were considered positive for AHF and were used for model training and testing. The models used varying subsets of age, sex, vital signs, and laboratory values. Model performance was compared to the Emergency Heart Failure Mortality Risk Grade (EHMRG) model, a commonly used system for prediction of seven-day mortality in the ED with similar (or, in some cases, more extensive) inputs. Model performance was assessed in terms of area under the receiver operating characteristic curve (AUROC), sensitivity, and specificity. Results When trained and tested on a large academic dataset, the best-performing model and EHMRG demonstrated test set AUROCs of 0.84 and 0.78, respectively, for prediction of seven-day mortality. Given only measurements of respiratory rate, temperature, mean arterial pressure, and FiO2, one model produced a test set AUROC of 0.83. Neither a logistic regression comparator nor a simple decision tree outperformed EHMRG. Conclusions A model using only the measurements of four clinical variables outperforms EHMRG in the prediction of seven-day mortality in AHF. With these inputs, the model could not be replaced by logistic regression or reduced to a simple decision tree without significant performance loss. 
In ED settings, this minimal-input risk stratification tool may assist clinicians in making critical decisions about patient disposition by providing early and accurate insights into individual patients' risk profiles.
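This study, like several others in this listing, compares models by AUROC. For readers unfamiliar with the metric, a minimal rank-based implementation (equivalent to the probability that a randomly chosen positive case scores above a randomly chosen negative one) can be written as follows; this is a didactic sketch, not the authors' evaluation code:

```python
def auroc(scores, labels):
    """Area under the ROC curve via the rank (Mann-Whitney) formulation:
    the probability that a randomly chosen positive outranks a randomly
    chosen negative, with ties counted as half."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A perfect ranker scores 1.0, a random one about 0.5; the 0.84 vs. 0.78 gap reported above is measured on this scale.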


2020 ◽  
Author(s):  
Jenna M Reps ◽  
Peter Rijnbeek ◽  
Alana Cuthbert ◽  
Patrick B Ryan ◽  
Nicole Pratt ◽  
...  

Abstract Background: Researchers developing prediction models are faced with numerous design choices that may impact model performance. One key decision is how to include patients who are lost to follow-up. In this paper we perform a large-scale empirical evaluation investigating the impact of this decision. In addition, we aim to provide guidelines for how to deal with loss to follow-up. Methods: We generate a partially synthetic dataset with complete follow-up and simulate loss to follow-up based either on random selection or on selection based on comorbidity. In addition to our synthetic data study, we investigate 21 real-world prediction problems. We compare four simple strategies for developing models when using a cohort design that encounters loss to follow-up. Three strategies employ a binary classifier with data that: (i) include all patients (including those lost to follow-up); (ii) exclude all patients lost to follow-up; or (iii) exclude only patients lost to follow-up who do not have the outcome before being lost to follow-up. The fourth strategy uses a survival model with data that include all patients. We empirically evaluate discrimination and calibration performance. Results: The partially synthetic data study shows that excluding patients who are lost to follow-up can introduce bias when loss to follow-up is common and does not occur at random. However, when loss to follow-up was completely at random, the choice of how to address it had negligible impact on model performance. Our empirical real-world data results showed that the four design choices resulted in comparable performance when the time-at-risk was 1 year, but demonstrated differential bias at a 3-year time-at-risk. Removing patients who are lost to follow-up before experiencing the outcome while keeping patients who are lost to follow-up after the outcome can bias a model and should be avoided. Conclusion: Based on this study, we recommend (i) developing models using data that include patients who are lost to follow-up and (ii) evaluating the discrimination and calibration of models twice: on a test set including patients lost to follow-up and on a test set excluding them.
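The three binary-classifier strategies compared above amount to simple cohort filters. A minimal sketch over hypothetical patient records (dicts with assumed `ltfu` and `outcome_before_ltfu` flags), not the authors' implementation:

```python
def include_all(patients):
    # Strategy (i): keep everyone, including patients lost to follow-up.
    return list(patients)

def exclude_all_ltfu(patients):
    # Strategy (ii): drop every patient lost to follow-up (LTFU).
    return [p for p in patients if not p["ltfu"]]

def exclude_ltfu_without_outcome(patients):
    # Strategy (iii): drop only LTFU patients who never had the outcome
    # before being lost; keeping post-outcome LTFU patients while dropping
    # the rest is the combination the study warns can bias a model.
    return [p for p in patients if not p["ltfu"] or p["outcome_before_ltfu"]]
```

The study's recommendation corresponds to training with `include_all` and then evaluating on both the full and the LTFU-excluded test cohorts.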


2021 ◽  
Vol 55 (1) ◽  
pp. 1-3
Author(s):  
Jaehun Kim

Machine learning (ML) has become a core technology for many real-world applications. Modern ML models are applied to unprecedentedly complex and difficult challenges, including very large and subjective problems. For instance, applications towards multimedia understanding have advanced substantially: it is already common for cultural and artistic objects such as music and videos to be analyzed and served to users according to their preferences, enabled through ML techniques. One of the most recent breakthroughs in ML is Deep Learning (DL), which has been widely adopted to tackle such complex problems. DL allows for higher learning capacity, making end-to-end learning possible, which reduces the need for substantial engineering effort while achieving high effectiveness. At the same time, this also makes DL models more complex than conventional ML models. Reports in several domains indicate that such complex ML models may have potentially critical hidden problems: biases embedded in the training data can emerge in predictions, and extremely sensitive models can make unaccountable mistakes. Furthermore, the black-box nature of DL models hinders the interpretation of the mechanisms behind them. Such unexpected drawbacks significantly affect the trustworthiness of the systems in which ML models serve as the core apparatus. In this thesis, a series of studies investigates two aspects of trustworthiness for complex ML applications: reliability and explainability. Specifically, we focus on music as the primary domain of interest, considering its complexity and subjectivity. Because of this nature, ML models for music are necessarily complex to achieve meaningful effectiveness, and their reliability and explainability are therefore crucial in the field. The first main chapter of the thesis investigates the transferability of neural networks in the Music Information Retrieval (MIR) context.
Transfer learning, where pre-trained ML models are used as off-the-shelf modules for the task at hand, has become one of the major ML practices. It is helpful because a substantial amount of information is already encoded in the pre-trained model, allowing high effectiveness even when data for the current task are scarce. However, this may not hold if the "source" task on which the model was pre-trained shares little commonality with the "target" task at hand. An experiment including multiple "source" tasks and "target" tasks was conducted to examine the conditions that positively affect transferability. The results suggest that the number of source tasks is a major factor in transferability. At the same time, there is little evidence of a single source task that is universally effective across multiple target tasks. Overall, we conclude that considering multiple pre-trained models, or pre-training a model on heterogeneous source tasks, can increase the chance of successful transfer learning. The second major work investigates the robustness of DL models in the transfer learning context. The hypothesis is that DL models can be susceptible to imperceptible noise on the input, which may drastically shift the analysis of similarity among inputs, an undesirable property for tasks such as information retrieval. Several DL models pre-trained on MIR tasks are examined against a set of plausible perturbations in a real-world setup. Based on a proposed sensitivity measure, the experimental results indicate that all the DL models were substantially more vulnerable to perturbations than a traditional feature encoder. They also suggest that the experimental framework can be used to measure the robustness of pre-trained DL models. The final main chapter discusses the explainability of black-box ML models.
In particular, the chapter focuses on evaluating explanations derived from model-agnostic explanation methods. With black-box ML models now common practice, model-agnostic explanation methods have been developed to explain individual predictions, but the evaluation of such explanations remains an open problem. The work introduces an evaluation framework that measures the quality of explanations in terms of fidelity and complexity: fidelity refers to how coherently the explanation reflects the black-box model's mechanism, while complexity is the length of the explanation. Throughout the thesis, we paid special attention to experimental design so that robust conclusions could be reached, and we focused on delivering reusable machine learning and evaluation frameworks. This is crucial, as we intend the experimental designs and results to be reusable in general ML practice. Likewise, we aim for our findings to be applicable beyond music, to domains such as computer vision or natural language processing. Trustworthiness in ML is not a domain-specific problem, so it is vital for researchers and practitioners from diverse problem spaces to increase awareness of the trustworthiness of complex ML systems. We believe the research reported in this thesis provides meaningful stepping stones towards trustworthy ML.
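The robustness analysis above rests on a sensitivity measure: how far a model's representation moves under small input perturbations. A minimal sketch with a stand-in encoder (the thesis probes real pre-trained MIR models; the encoder, weights, and noise model here are illustrative assumptions):

```python
import math
import random

def encode(x):
    # Stand-in for a pre-trained feature encoder; any deterministic
    # map from inputs to a fixed-length embedding works for this sketch.
    weights = (0.5, -1.0, 2.0)
    return [math.tanh(w * sum(x)) for w in weights]

def sensitivity(x, sigma=0.01, trials=200, seed=0):
    """Mean embedding-space displacement under Gaussian input noise.

    Larger values mean the encoder's output, and hence any similarity
    analysis built on it, shifts more under imperceptible perturbations."""
    rng = random.Random(seed)
    base = encode(x)
    total = 0.0
    for _ in range(trials):
        noisy = [v + rng.gauss(0.0, sigma) for v in x]
        emb = encode(noisy)
        total += math.sqrt(sum((a - b) ** 2 for a, b in zip(base, emb)))
    return total / trials
```

Comparing this quantity across encoders, at matched noise levels, is the kind of robustness contrast the thesis draws between DL models and a traditional feature encoder.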


2021 ◽  
Vol 186 (Supplement_1) ◽  
pp. 445-451
Author(s):  
Yifei Sun ◽  
Navid Rashedi ◽  
Vikrant Vaze ◽  
Parikshit Shah ◽  
Ryan Halter ◽  
...  

ABSTRACT Introduction Early prediction of the acute hypotensive episode (AHE) in critically ill patients has the potential to improve outcomes. In this study, we apply different machine learning algorithms to the MIMIC III Physionet dataset, containing more than 60,000 real-world intensive care unit records, to test commonly used machine learning technologies and compare their performances. Materials and Methods Five classification methods, including K-nearest neighbor, logistic regression, support vector machine, random forest, and a deep learning method called long short-term memory, are applied to predict an AHE 30 minutes in advance. An analysis comparing model performance when including versus excluding invasive features was conducted. To further study the pattern of the underlying mean arterial pressure (MAP), we apply linear regression to predict continuous MAP values over the next 60 minutes. Results Support vector machine yields the best performance in terms of recall (84%). Including the invasive features in the classification improves performance significantly, with both recall and precision increasing by more than 20 percentage points. We were able to predict the MAP 60 minutes into the future with a root mean square error (a frequently used measure of the differences between predicted and observed values) of 10 mmHg. After converting continuous MAP predictions into binary AHE predictions, we achieve 91% recall and 68% precision. In addition to predicting AHE, the MAP predictions provide clinically useful information regarding the timing and severity of the AHE occurrence. Conclusion We were able to predict AHE 30 minutes in advance with precision and recall above 80% on this large real-world dataset. The predictions of the regression model provide a more fine-grained, interpretable signal to practitioners.
Model performance is improved by the inclusion of invasive features in predicting AHE, when compared to predicting the AHE based on only the available, restricted set of noninvasive technologies. This demonstrates the importance of exploring more noninvasive technologies for AHE prediction.
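Converting a continuous MAP forecast into binary AHE predictions, as the study describes, amounts to thresholding the forecast. A minimal sketch; the 60 mmHg threshold and run-length parameter are illustrative assumptions, and the actual AHE definition used with MIMIC data is more involved:

```python
def ahe_alerts(map_forecast_mmhg, threshold=60.0, min_consecutive=3):
    """Flag an acute hypotensive episode when the forecast mean arterial
    pressure stays below the threshold for min_consecutive samples in a row."""
    alerts = []
    run = 0
    for value in map_forecast_mmhg:
        run = run + 1 if value < threshold else 0
        alerts.append(run >= min_consecutive)
    return alerts
```

Because the alert is derived from the MAP trajectory itself, the forecast also conveys when the episode begins and how far pressure is expected to fall, which is the interpretability advantage the authors note.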


Sensors ◽  
2021 ◽  
Vol 21 (5) ◽  
pp. 1688
Author(s):  
Luqman Ali ◽  
Fady Alnajjar ◽  
Hamad Al Jassmi ◽  
Munkhjargal Gochoo ◽  
Wasif Khan ◽  
...  

This paper proposes a customized convolutional neural network for crack detection in concrete structures. The proposed method is compared to four existing deep learning methods based on training data size, data heterogeneity, network complexity, and the number of epochs. The performance of the proposed convolutional neural network (CNN) model is evaluated and compared to pretrained networks, i.e., the VGG-16, VGG-19, ResNet-50, and Inception V3 models, on eight datasets of different sizes, created from two public datasets. For each model, the evaluation considered computational time, crack localization results, and classification measures, e.g., accuracy, precision, recall, and F1-score. Experimental results demonstrated that training data size and heterogeneity among data samples significantly affect model performance. All models demonstrated promising performance when trained on a small but diverse set of training samples; however, increasing the training data size while reducing its diversity degraded generalization performance and led to overfitting. The proposed customized CNN and VGG-16 models outperformed the other methods in terms of classification, localization, and computational time on a small amount of data, and the results indicate that these two models demonstrate superior crack detection and localization for concrete structures.


2021 ◽  
pp. 096452842098757
Author(s):  
Javier Mata ◽  
Pilar Sanchís ◽  
Pedro Valentí ◽  
Beatriz Hernández ◽  
Jose Luis Aguilar

Objective: Existing systematic reviews and meta-analyses indicate that acupuncture has similar clinical effectiveness in the prevention of headache disorders (HDs) as drug therapy, but with fewer side effects. As such, examining acupuncture's use in a pragmatic, real-world setting would be valuable. The purpose of this study was to compare the effects of acupuncture and prophylactic drug treatment (PDT) on headache frequency in patients with HDs, under real-world clinical conditions. Methods: Retrospective cohort study of patients with HDs referred to a pain clinic, using electronic health record data. Patients continued with tertiary care (treatment of acute headache attacks and lifestyle, meditation, exercise and dietary instructions) with PDT, or received 12 sessions of acupuncture over 3 months, instead of PDT, under conditions of tertiary care. The primary outcome was the number of days with headache per month, and groups were compared at baseline and at the end of the third month of treatment. Results: Data were analysed for 482 patients with HDs. The number of headache days per month decreased by 3.7 (standard deviation (SD) = 2.9) days in the acupuncture group versus 2.9 (SD = 2.3) in the PDT group (p = 0.007). The proportion of responders was 39.5% versus 16.3% (p < 0.001). The number needed to treat was 4 (95% confidence interval = 3-7). Conclusion: Our study has shown that patients with HDs in tertiary care who opted for treatment with acupuncture appeared to receive similar clinical benefits to those who chose PDT, suggesting these treatments may be similarly effective for the prevention of headache in a real-world clinical setting.


2021 ◽  
pp. 193229682110497
Author(s):  
Daniel J. DeSalvo ◽  
Nudrat Noor ◽  
Cicilyn Xie ◽  
Sarah D. Corathers ◽  
Shideh Majidi ◽  
...  

Background: The benefits of Continuous Glucose Monitoring (CGM) on glycemic management have been demonstrated in numerous studies; however, widespread uptake remains limited. The aim of this study was to provide real-world evidence of patient attributes and clinical outcomes associated with CGM use across clinics in the U.S.-based T1D Exchange Quality Improvement (T1DX-QI) Collaborative. Method: We examined electronic health record data from eight endocrinology clinics participating in the T1DX-QI Collaborative during the years 2017-2019. Results: Among 11,469 type 1 diabetes patients, 48% were CGM users. CGM use varied by race/ethnicity, with Non-Hispanic Whites having higher rates of CGM use (50%) compared to Non-Hispanic Blacks (18%) or Hispanics (38%). Patients with private insurance were more likely to use CGM (57.2%) than those with public insurance (33.3%), including Medicaid or Medicare. CGM users had lower median HbA1c (7.7%) compared to nonusers (8.4%). Rates of diabetic ketoacidosis (DKA) and severe hypoglycemia were significantly higher in nonusers than in CGM users. Conclusion: In this real-world study of patients in the T1DX-QI Collaborative, CGM users had better glycemic control and lower rates of DKA and severe hypoglycemia events compared to nonusers; however, there were significant sociodemographic disparities in CGM use. Quality improvement and advocacy measures to promote widespread and equitable CGM uptake have the potential to improve clinical outcomes.


2021 ◽  
Vol 39 (28_suppl) ◽  
pp. 57-57
Author(s):  
Robert M. Rifkin ◽  
Lisa Herms ◽  
Chuck Wentworth ◽  
Anupama Vasudevan ◽  
Kimberley Campbell ◽  
...  

57 Background: Biosimilars have potential to reduce healthcare costs and increase access in the United States, but lack of uptake has contributed to lost savings. Filgrastim-sndz was the first FDA-approved biosimilar, and much can be learned by evaluating its uptake. In February 2016, the US Oncology Network converted to filgrastim-sndz as its short-acting granulocyte colony-stimulating factor (GCSF) of choice for prevention of febrile neutropenia (FN) following myelosuppressive chemotherapy (MCT). To understand utilization and cost patterns, this study analyzes real-world data of GCSFs within a community oncology network during the initial period of conversion to the first biosimilar available in the US. Methods: This descriptive retrospective observational study used electronic health record data for female breast cancer (BC) patients receiving GCSF and MCT at high risk of FN. Patient cohorts were defined by first receipt of either filgrastim or filgrastim-sndz during the 410 days before and after biosimilar conversion. Healthcare resource utilization (HCRU) and costs for GCSF and complete blood counts (CBC) were collected at GCSF initiation through the earliest of 30 days following end of MCT, loss to follow up, death, or data cutoff. Results: 146 patients were identified: 81 (55.5%) filgrastim and 65 (44.5%) filgrastim-sndz. No directional differences existed in baseline characteristics between the cohorts. Higher proportions of filgrastim-sndz patients received dose-dense MCT (33.8% vs 22.2%). Time trends show an initial spike in HCRU and cost for filgrastim-sndz patients after formulary conversion, which subsequently decreased and converged to that of the filgrastim cohort after 12 months. When aggregated, the overall median total administration counts, per patient per month (PPPM) and dosage, were marginally higher for filgrastim-sndz (5 vs 3; 2.9 vs 1.4; 1920 vs 1440 mcg, respectively). Median PPPM costs were higher for filgrastim-sndz ($803 vs $545). 
Median CBC utilization and costs were higher for filgrastim-sndz (2.8 vs 2.5; $28 vs $23, respectively). Conclusions: This study provides insight into real-world HCRU and cost patterns after formulary conversion to a biosimilar for BC patients receiving MCT and GCSF. As a descriptive study, causal inferences cannot be made and an underlying effect from index chemotherapy cannot be excluded. Convergence of HCRU and costs after 12 months suggests that overall results may be driven by behavior at initial formulary switch. Since filgrastim-sndz was the first US biosimilar approved, the uptake may be indicative of an experience with biosimilar acceptance in general. Future real-world studies of biosimilars must consider inconsistent utilization and practice trends during the time frame directly following formulary conversion.

