Development and Use of Natural Language Processing for Identification of Distant Cancer Recurrence and Sites of Distant Recurrence Using Unstructured Electronic Health Record Data

2021 ◽  
pp. 469-478
Author(s):  
Yasmin H. Karimi ◽  
Douglas W. Blayney ◽  
Allison W. Kurian ◽  
Jeanne Shen ◽  
Rikiya Yamashita ◽  
...  

PURPOSE Large-scale analysis of real-world evidence is often limited to structured data fields that do not contain reliable information on recurrence status and disease sites. In this report, we describe a natural language processing (NLP) framework that uses data from free-text, unstructured reports to classify recurrence status and sites of recurrence for patients with breast and hepatocellular carcinomas (HCC). METHODS Using two cohorts of breast cancer and HCC cases, we validated the ability of a previously developed NLP model to distinguish between no recurrence, local recurrence, and distant recurrence, based on clinician notes, radiology reports, and pathology reports compared with manual curation. A second NLP model was trained and validated to identify sites of recurrence. We compared the ability of each NLP model to identify the presence, timing, and site of recurrence, when compared against manual chart review and International Classification of Diseases coding. RESULTS A total of 1,273 patients were included in the development and validation of the two models. The NLP model for recurrence detects distant recurrence with an area under the curve of 0.98 (95% CI, 0.96 to 0.99) and 0.95 (95% CI, 0.88 to 0.98) in breast and HCC cohorts, respectively. The mean accuracy of the NLP model for detecting any site of distant recurrence was 0.9 for breast cancer and 0.83 for HCC. The NLP model for recurrence identified a larger proportion of patients with distant recurrence in a breast cancer database (11.1%) compared with International Classification of Diseases coding (2.31%). CONCLUSION We developed two NLP models to identify distant cancer recurrence, timing of recurrence, and sites of recurrence based on unstructured electronic health record data. These models can be used to perform large-scale retrospective studies in oncology.
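The area under the curve reported above can be read as the probability that a randomly chosen patient with distant recurrence receives a higher model score than a randomly chosen patient without it. The following is a minimal sketch of that calculation, not the authors' evaluation code; the function name `auc` and the labels and scores are hypothetical and chosen only for illustration.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Counts, over all (positive, negative) pairs, how often the positive
    case scores higher; ties count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical data: 1 = distant recurrence on manual curation, 0 = none.
labels = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.92, 0.81, 0.40, 0.10, 0.35, 0.55, 0.05, 0.77]
print(auc(labels, scores))
```

The pairwise form is quadratic in the number of cases, which is acceptable for a small validation set; a rank-based implementation would be preferable for large cohorts.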

2019 ◽  
pp. 1-9 ◽  
Author(s):  
Nikki M. Carroll ◽  
Debra P. Ritzwoller ◽  
Matthew P. Banegas ◽  
Maureen O’Keeffe-Rosetti ◽  
Angel M. Cronin ◽  
...  

PURPOSE We previously developed and validated informatic algorithms that used International Classification of Diseases 9th revision (ICD9)–based diagnostic and procedure codes to detect the presence and timing of cancer recurrence (the RECUR Algorithms). In 2015, ICD10 replaced ICD9 as the worldwide coding standard. To understand the impact of this transition, we evaluated the performance of the RECUR Algorithms after incorporating ICD10 codes. METHODS Using publicly available translation tables along with clinician and other expertise, we updated the algorithms to include ICD10 codes as additional input variables. We evaluated the performance of the algorithms using gold standard recurrence measures associated with a contemporary cohort of patients with stage I to III breast, colorectal, and lung (excluding IIIB) cancer and derived performance measures, including the area under the receiver operating curve, average absolute prediction error, and correct classification rate. These values were compared with the performance measures derived from the validation of the original algorithms. RESULTS A total of 659 colorectal, 280 lung, and 2,053 breast cancer cases were identified. Area under the receiver operating curve derived from the updated algorithms was 89.0% (95% CI, 82.3% to 95.7%), 88.9% (95% CI, 79.3% to 98.2%), and 80.5% (95% CI, 72.8% to 88.2%) for the colorectal, lung, and breast cancer algorithms, respectively. Average absolute prediction errors for recurrence timing were 2.7 (SE, 11.3%), 2.4 (SE, 10.4%), and 5.6 months (SE, 21.8%), respectively, and timing estimates were within 6 months of actual recurrence for more than 80% of colorectal, more than 90% of lung, and more than 50% of breast cancer cases using the updated algorithm. CONCLUSION Performance measures derived from the updated and original algorithms had overlapping confidence intervals, suggesting that the ICD9 to ICD10 transition did not affect the RECUR Algorithm performance.
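The two timing measures in the results above, average absolute prediction error and the share of cases whose estimate falls within 6 months of the actual recurrence, can be sketched as follows. This is an illustration only: the function name `timing_summary`, the inclusive 6-month boundary, and the month values are assumptions, not details taken from the RECUR Algorithms.

```python
def timing_summary(predicted_months, actual_months, window=6.0):
    """Return (mean absolute error, share of cases within `window` months)."""
    if len(predicted_months) != len(actual_months) or not actual_months:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [abs(p - a) for p, a in zip(predicted_months, actual_months)]
    mean_abs_error = sum(errors) / len(errors)
    # Assumption: "within 6 months" is treated as inclusive of exactly 6.
    share_within = sum(1 for e in errors if e <= window) / len(errors)
    return mean_abs_error, share_within


# Hypothetical months from diagnosis to recurrence.
predicted = [12.0, 20.0, 31.0, 8.0]
actual = [10.0, 27.0, 30.0, 8.0]
print(timing_summary(predicted, actual))
```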


2020 ◽  
Vol 38 (15_suppl) ◽  
pp. 2043-2043
Author(s):  
Yasmin Karimi ◽  
Douglas W. Blayney ◽  
Allison W. Kurian ◽  
Daniel Rubin ◽  
Imon Banerjee

2043 Background: Electronic health records (EHR) are used for retrospective cancer outcomes analysis. Sites and timing of recurrence are not captured in structured EHR data. Novel computerized methods are necessary to use unstructured longitudinal EHR data for large scale studies. Methods: We previously developed a neural network-based NLP algorithm to identify no recurrence vs. metastatic recurrence cases by analyzing physician notes, pathology and radiology reports in Stanford’s breast cancer database, Oncoshare (Cohort A). To validate this algorithm for local vs. distant recurrence, we identified a distinct Oncoshare cohort (Cohort B). Cases were manually curated for longitudinal development of local or distant recurrence and metastatic sites. A two-sided t-test was used to compare mean probabilities between local and distant recurrence cases. Next, we combined cases in Cohorts A and B to train and validate a novel NLP classifier that identifies metastatic site. The combined cohort was randomly divided into training and validation sets. Sensitivity and specificity were calculated for the NLP algorithm’s ability to detect metastatic sites compared to manual curation. Results: In Cohort B: 350 metastatic cases were identified. Mean probability for local and distant recurrence was 0.43 and 0.79, respectively and differed significantly for patients with local vs. distant recurrence (p<0.01). In Cohorts A and B: 632 metastatic cases were used for determination of sites. Sensitivity and specificity were highest for detection of peritoneal metastasis followed by liver, lung, skin, bone and central nervous system (table). Conclusions: This NLP algorithm is a scalable tool that uses unstructured EHR data to capture breast cancer recurrence, distinguishing local from distant recurrence and identifying metastatic site. This method may facilitate analysis of large datasets and correlation of outcomes with metastatic site. [Table: see text]
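Per-site sensitivity and specificity against manual curation, as described in the methods above, can be sketched by treating each metastatic site as its own yes/no question for every case. The function name `site_metrics` and the example cases are hypothetical; the abstract does not describe how the authors tabulated their counts.

```python
def site_metrics(curated, predicted, site):
    """Sensitivity and specificity for one metastatic site.

    `curated` and `predicted` are parallel lists of sets of site names,
    one set per case (manual curation vs. classifier output).
    """
    tp = fp = tn = fn = 0
    for truth, pred in zip(curated, predicted):
        in_truth = site in truth
        in_pred = site in pred
        if in_truth and in_pred:
            tp += 1
        elif in_truth:
            fn += 1
        elif in_pred:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Hypothetical cases, each with one or more metastatic sites.
curated = [{"bone"}, {"liver", "bone"}, {"lung"}, {"liver"}, {"skin"}]
predicted = [{"bone"}, {"liver"}, {"lung", "bone"}, {"liver"}, {"skin"}]
for site in ("bone", "liver"):
    print(site, site_metrics(curated, predicted, site))
```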


Author(s):  
Hua Wang ◽  
Ke Chai ◽  
Minghui Du ◽  
Shengfeng Wang ◽  
Jian-Ping Cai ◽  
...  

Background: Large-scale and population-based studies of heart failure (HF) incidence and prevalence are scarce in China. The study sought to estimate the prevalence, incidence, and cost of HF in China. Methods: We conducted a population-based study using records of 50.0 million individuals ≥25 years old from the national urban employee basic medical insurance from 6 provinces in China in 2017. Incident cases were individuals with a diagnosis of HF (International Classification of Diseases code, and text of diagnosis) in 2017 with a 4-year disease-free period (2013–2016). We calculated standardized rates by applying age standardization to the 2010 Chinese census population. Results: The age-standardized prevalence and incidence were 1.10% (1.10% among men and women) and 275 per 100 000 person-years (287 among men and 261 among women), respectively, accounting for 12.1 million patients with HF and 3.0 million patients with incident HF ≥25 years old. Both prevalence and incidence increased with increasing age (0.57%, 3.86%, and 7.55% for prevalence and 158, 892, and 1655 per 100 000 person-years for incidence among persons who were 25–64, 65–79, and ≥80 years of age, respectively). The inpatient mean cost per-capita was $4406.8 and the proportion with ≥3 hospitalizations among those hospitalized was 40.5%. The outpatient mean cost per-capita was $892.3. Conclusions: HF has placed a considerable burden on health systems in China, and strategies aimed at the prevention and treatment of HF are needed. Registration: URL: https://www.clinicaltrials.gov ; Unique identifier: ChiCTR2000029094.
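Direct age standardization, as mentioned in the methods above, weights each age-specific rate by that age group's share of a standard population. The sketch below uses the three age-specific prevalence figures quoted in the abstract, but the standard-population counts are invented for illustration (they are not the 2010 Chinese census figures), so the result is not expected to reproduce the reported 1.10%. The function name `age_standardized_rate` is also hypothetical.

```python
def age_standardized_rate(stratum_rates, standard_population):
    """Directly standardized rate: sum of rate_i * (standard_pop_i / total)."""
    total = sum(standard_population.values())
    return sum(
        rate * standard_population[group] / total
        for group, rate in stratum_rates.items()
    )


# Age-specific prevalence (%) as quoted in the abstract.
prevalence_pct = {"25-64": 0.57, "65-79": 3.86, ">=80": 7.55}
# Hypothetical standard-population counts, for illustration only.
standard_population = {"25-64": 850, "65-79": 120, ">=80": 30}
print(age_standardized_rate(prevalence_pct, standard_population))
```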


Circulation ◽  
2020 ◽  
Vol 142 (Suppl_3) ◽  
Author(s):  
Shashank Shekhar ◽  
Anas M Saad ◽  
Toshiaki Isogai ◽  
Mohamed M Gad ◽  
Keerat Ahuja ◽  
...  

Introduction: Even though atrial fibrillation (AF) is present in >30% of patients with aortic stenosis (AS), it is not typically included in the decision-making algorithm for the timing or need for aortic valve replacement (AVR), either by transcatheter (TAVR) or surgical (SAVR) approaches. Large-scale data on how AF affects outcomes of AS patients remain scarce. Methods: From the Nationwide Readmissions Database (NRD), we retrospectively identified AS patients aged ≥18 years, with and without AF, admitted between January and June in 2016 and 2017 (to allow for a six-month follow-up), using the International Classification of Diseases-10th revision codes. Multivariable logistic regression was performed to examine the predictors of in-hospital mortality during index hospitalization. In-hospital complications and 6-month in-hospital mortality during any readmission after being discharged alive were compared in patients with and without AF, for patients undergoing TAVR, SAVR or no-AVR. Results: We identified 403,089 AS patients, of whom 41% had AF. Patients with AF were older (median age in years: 83 vs. 79) and were more frequently female (52% vs. 48%; p<0.001). The table summarizes outcomes of AS patients with and without AF. TAVR in patients with AF was associated with higher in-hospital mortality and follow-up mortality as compared to patients without AF. Although AF did not influence in-hospital mortality in the SAVR population, follow-up mortality was also significantly higher after SAVR in patients with AF compared to patients without AF. For patients not undergoing AVR, in-hospital and follow-up mortality were higher in the AF population than in the no-AF population and higher than in patients undergoing AVR (Table). Conclusions: AF is associated with worse outcomes in patients with AS irrespective of treatment (TAVR, SAVR or no-AVR). More studies are needed to understand the implications of AF in the AS population and whether earlier treatment of AS in patients with AF can improve outcomes.


Rheumatology ◽  
2020 ◽  
Vol 59 (12) ◽  
pp. 3759-3766 ◽  
Author(s):  
Sicong Huang ◽  
Jie Huang ◽  
Tianrun Cai ◽  
Kumar P Dahal ◽  
Andrew Cagan ◽  
...  

Abstract Objective The objective of this study was to compare the performance of an RA algorithm developed and trained in 2010 utilizing natural language processing and machine learning, using updated data containing ICD10, new RA treatments, and a new electronic medical records (EMR) system. Methods We extracted data from subjects with ≥1 RA International Classification of Diseases (ICD) code from the EMR of two large academic centres to create a data mart. Gold standard RA cases were identified by reviewing a random sample of 200 subjects from the data mart and a random sample of 100 subjects who had only RA ICD10 codes. We compared the performance of the following algorithms using the original 2010 data with updated data: (i) a published 2010 RA algorithm; (ii) updated algorithm, incorporating ICD10 RA codes and new DMARDs; and (iii) published algorithm using ICD codes only, ICD RA code ≥3. Results The gold standard RA cases had mean age 65.5 years, 78.7% female, 74.1% RF or antibodies to cyclic citrullinated peptide (anti-CCP) positive. The positive predictive value (PPV) for ≥3 RA ICD was 54%, compared with 56% in 2010. At a specificity of 95%, the PPV of the 2010 algorithm and the updated version were both 91%, compared with 94% (95% CI: 91, 96%) in 2010. In subjects with ICD10 data only, the PPV for the updated 2010 RA algorithm was 93%. Conclusion When validated with the updated data, the 2010 RA algorithm showed performance characteristics similar to those obtained with the 2010 data. While the 2010 algorithm continued to perform better than the rule-based approach, the PPV of the latter also remained stable over time.
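Reporting PPV "at a specificity of 95%" implies first choosing a score threshold that meets the specificity target on the gold standard set, then computing PPV at that threshold. The following is a minimal sketch of that evaluation step under stated assumptions: the function name `ppv_at_specificity`, the rule of taking the lowest qualifying observed score as the threshold, and all data are hypothetical, and the sketch does not represent the RA algorithm itself.

```python
def ppv_at_specificity(labels, scores, target_specificity=0.95):
    """Return (threshold, PPV) at the lowest observed score threshold
    whose specificity meets the target. A case is called positive when
    its score is >= threshold.
    """
    n_neg = sum(1 for y in labels if y == 0)
    if n_neg == 0:
        raise ValueError("need at least one negative case")
    for t in sorted(set(scores)):
        tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < t)
        if tn / n_neg >= target_specificity:
            tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= t)
            fp = n_neg - tn
            return t, tp / (tp + fp)
    raise ValueError("no observed threshold reaches the target specificity")


# Hypothetical gold standard: 30 non-cases (one with a high score) and 4 cases.
neg_scores = [i / 100 for i in range(29)] + [0.90]
pos_scores = [0.95, 0.85, 0.80, 0.50]
labels = [0] * len(neg_scores) + [1] * len(pos_scores)
scores = neg_scores + pos_scores
print(ppv_at_specificity(labels, scores))
```

Fixing specificity before comparing PPV makes results comparable across algorithm versions, because PPV otherwise shifts with whatever threshold each version happens to use.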


10.2196/23973 ◽  
2020 ◽  
Vol 9 (12) ◽  
pp. e23973
Author(s):  
Thomas Gaisl ◽  
Naser Musli ◽  
Patrick Baumgartner ◽  
Marc Meier ◽  
Silvana K Rampini ◽  
...  

Background The health aspects, disease frequencies, and specific health interests of prisoners and refugees are poorly understood. Importantly, access to the health care system is limited for this vulnerable population. There has been no systematic investigation to understand the health issues of inmates in Switzerland. Furthermore, little is known on how recent migration flows in Europe may have affected the health conditions of inmates. Objective The Swiss Prison Study (SWIPS) is a large-scale observational study with the aim of establishing a public health registry in northern-central Switzerland. The primary objective is to establish a central database to assess disease prevalence (ie, International Classification of Diseases-10 codes [German modification]) among prisoners. The secondary objectives include the following: (1) to compare the 2015 versus 2020 disease prevalence among inmates against a representative sample from the local resident population, (2) to assess longitudinal changes in disease prevalence from 2015 to 2020 by using cross-sectional medical records from all inmates at the Police Prison Zurich, Switzerland, and (3) to identify unrecognized health problems to prepare successful public health strategies. Methods Demographic and health-related data such as age, sex, country of origin, duration of imprisonment, medication (including the drug name, brand, dosage, and release), and medical history (including the International Classification of Diseases-10 codes [German modification] for all diagnoses and external results that are part of the medical history in the prison) have been deposited in a central register over a span of 5 years (January 2015 to August 2020). The final cohort is expected to comprise approximately 50,000 to 60,000 prisoners from the Police Prison Zurich, Switzerland. Results This study was approved on August 5, 2019 by the ethical committee of the Canton of Zurich with the registration code KEK-ZH No. 2019-01055 and funded in August 2020 by the “Walter and Gertrud Siegenthaler” foundation and the “Theodor and Ida Herzog-Egli” foundation. This study is registered with the International Standard Randomized Controlled Trial Number registry. Data collection started in August 2019 and results are expected to be published in 2021. Findings will be disseminated through scientific papers as well as presentations and public events. Conclusions This study will construct a valuable database of information regarding the health of inmates and refugees in Swiss prisons and will act as groundwork for future interventions in this vulnerable population. Trial Registration ISRCTN registry ISRCTN11714665; http://www.isrctn.com/ISRCTN11714665 International Registered Report Identifier (IRRID) DERR1-10.2196/23973

