High-throughput Multimodal Automated Phenotyping (MAP) with Application to PheWAS

2019 ◽  
Author(s):  
Katherine P. Liao ◽  
Jiehuan Sun ◽  
Tianrun A. Cai ◽  
Nicholas Link ◽  
Chuan Hong ◽  
...  

Abstract Objective Electronic health records (EHR) linked with biorepositories are a powerful platform for translational studies. A major bottleneck exists in the ability to phenotype patients accurately and efficiently. The objective of this study was to develop an automated high-throughput phenotyping method integrating International Classification of Diseases (ICD) codes and narrative data extracted using natural language processing (NLP). Method We developed a mapping method for automatically identifying relevant ICD and NLP concepts for a specific phenotype leveraging the Unified Medical Language System (UMLS). Aggregated ICD and NLP counts along with healthcare utilization were jointly analyzed by fitting an ensemble of latent mixture models. The MAP algorithm yields a predicted probability of phenotype for each patient and a threshold for classifying subjects with phenotype yes/no. The algorithm was validated using labeled data for 16 phenotypes from a biorepository and further tested in an independent cohort PheWAS for two SNPs with known associations. Results The MAP algorithm achieved higher or similar AUC and F-scores compared to the ICD code across all 16 phenotypes. The features assembled via the automated approach had comparable accuracy to those assembled via manual curation (AUC_MAP 0.943, AUC_manual 0.941). The PheWAS results suggest that the MAP approach detected previously validated associations with higher power when compared to the standard PheWAS method based on ICD codes. Conclusion The MAP approach increased the accuracy of phenotype definition while maintaining scalability, facilitating use in studies requiring large-scale phenotyping, such as PheWAS.

2019 ◽  
Vol 26 (11) ◽  
pp. 1255-1262 ◽  
Author(s):  
Katherine P Liao ◽  
Jiehuan Sun ◽  
Tianrun A Cai ◽  
Nicholas Link ◽  
Chuan Hong ◽  
...  

Abstract Objective Electronic health records linked with biorepositories are a powerful platform for translational studies. A major bottleneck exists in the ability to phenotype patients accurately and efficiently. The objective of this study was to develop an automated high-throughput phenotyping method integrating International Classification of Diseases (ICD) codes and narrative data extracted using natural language processing (NLP). Materials and Methods We developed a mapping method for automatically identifying relevant ICD and NLP concepts for a specific phenotype leveraging the Unified Medical Language System. Along with health care utilization, aggregated ICD and NLP counts were jointly analyzed by fitting an ensemble of latent mixture models. The multimodal automated phenotyping (MAP) algorithm yields a predicted probability of phenotype for each patient and a threshold for classifying participants with phenotype yes/no. The algorithm was validated using labeled data for 16 phenotypes from a biorepository and further tested in an independent cohort phenome-wide association studies (PheWAS) for 2 single nucleotide polymorphisms with known associations. Results The MAP algorithm achieved higher or similar AUC and F-scores compared to the ICD code across all 16 phenotypes. The features assembled via the automated approach had comparable accuracy to those assembled via manual curation (AUC_MAP 0.943, AUC_manual 0.941). The PheWAS results suggest that the MAP approach detected previously validated associations with higher power when compared to the standard PheWAS method based on ICD codes. Conclusion The MAP approach increased the accuracy of phenotype definition while maintaining scalability, thereby facilitating use in studies requiring large-scale phenotyping, such as PheWAS.
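The latent mixture idea in this abstract can be illustrated with a deliberately simplified sketch. It fits a single two-component Poisson mixture to per-patient code counts with EM and returns, for each patient, the posterior probability of belonging to the high-count component. This is not the authors' ensemble: it ignores NLP counts and healthcare utilization, and the function name, starting values, and example counts are all hypothetical.

```python
import math


def poisson_logpmf(k, lam):
    """Log of the Poisson probability mass function."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def fit_poisson_mixture(counts, n_iter=200):
    """EM for a two-component Poisson mixture (illustrative only).

    Returns (weight_high, lam_low, lam_high, posteriors), where each
    posterior is P(high-count component | count).
    """
    n = len(counts)
    mean = sum(counts) / n
    # Hypothetical starting values: one rate below and one above the mean.
    lam_low, lam_high, weight = max(0.5 * mean, 1e-3), 2.0 * mean + 1.0, 0.5
    posteriors = []
    for _ in range(n_iter):
        # E-step: posterior membership in the high-count component.
        posteriors = []
        for k in counts:
            a = math.log(weight) + poisson_logpmf(k, lam_high)
            b = math.log(1.0 - weight) + poisson_logpmf(k, lam_low)
            m = max(a, b)  # subtract the max for numerical stability
            ea, eb = math.exp(a - m), math.exp(b - m)
            posteriors.append(ea / (ea + eb))
        # M-step: update the mixing weight and both Poisson rates.
        s = sum(posteriors)
        weight = min(max(s / n, 1e-6), 1.0 - 1e-6)
        lam_high = max(
            sum(p * k for p, k in zip(posteriors, counts)) / max(s, 1e-12), 1e-3
        )
        lam_low = max(
            sum((1.0 - p) * k for p, k in zip(posteriors, counts)) / max(n - s, 1e-12),
            1e-3,
        )
    return weight, lam_low, lam_high, posteriors


# Hypothetical ICD code counts for 15 patients.
counts = [0, 0, 1, 0, 2, 1, 0, 1, 12, 15, 9, 11, 0, 1, 14]
weight, lam_low, lam_high, probs = fit_poisson_mixture(counts)
```

In this toy setting a patient would be labeled phenotype-positive when the posterior exceeds a chosen cutoff; how MAP itself derives its threshold is not described in the abstract.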


Rheumatology ◽  
2020 ◽  
Vol 59 (12) ◽  
pp. 3759-3766 ◽  
Author(s):  
Sicong Huang ◽  
Jie Huang ◽  
Tianrun Cai ◽  
Kumar P Dahal ◽  
Andrew Cagan ◽  
...  

Abstract Objective The objective of this study was to compare the performance of an RA algorithm developed and trained in 2010 utilizing natural language processing and machine learning, using updated data containing ICD10, new RA treatments, and a new electronic medical records (EMR) system. Methods We extracted data from subjects with ≥1 RA International Classification of Diseases (ICD) code from the EMR of two large academic centres to create a data mart. Gold standard RA cases were identified by reviewing a random sample of 200 subjects from the data mart, and a random sample of 100 subjects who had only RA ICD10 codes. We compared the performance of the following algorithms using the original 2010 data with updated data: (i) a published 2010 RA algorithm; (ii) updated algorithm, incorporating ICD10 RA codes and new DMARDs; and (iii) published algorithm using ICD codes only, ICD RA code ≥3. Results The gold standard RA cases had mean age 65.5 years, 78.7% female, 74.1% RF or antibodies to cyclic citrullinated peptide (anti-CCP) positive. The positive predictive value (PPV) for ≥3 RA ICD was 54%, compared with 56% in 2010. At a specificity of 95%, the PPV of the 2010 algorithm and the updated version were both 91%, compared with 94% (95% CI: 91, 96%) in 2010. In subjects with ICD10 data only, the PPV for the updated 2010 RA algorithm was 93%. Conclusion The 2010 RA algorithm, validated with the updated data, had performance characteristics similar to those observed with the 2010 data. While the 2010 algorithm continued to perform better than the rule-based approach, the PPV of the latter also remained stable over time.


2020 ◽  
Author(s):  
Andrew L Blumenfeld ◽  
Claudia Gonzaga-Jauregui ◽  
Deepika Sharma ◽  
Ashish Yadav ◽  
Shareef Khalid ◽  
...  

Abstract Objective Large-scale next-generation sequencing of population cohorts paired with patients’ electronic health records (EHR) provides an excellent resource for the study of gene-disease associations. To validate those associations, researchers often consult databases that identify relationships between genes of interest and relevant disease phenotypes, which we refer to as simply “phenotypes”. However, most of these databases contain phenotypes that are not suited for automated analysis of EHR data, which often capture these phenotypes in the form of International Classification of Diseases (ICD) codes. There is a need for a resource that comprehensively provides gene-phenotype mappings in a format that can be used to evaluate phenotypes from EHR. Methods We built a directed graph database of genes, medical concepts and ICD codes based on a subset of the National Library of Medicine’s Unified Medical Language System (UMLS) and other resources. To obtain associations between genes and ICD codes, we traversed the defined relationships from gene, variant and disease concepts to ICD codes, resulting in a set of mappings that link specific genes and variants to these ICD codes. Results Our method created 249,764 mappings between genes and ICD codes, including 27,226 “disease” phenotypes and 222,538 “symptom” phenotypes, and provided mappings for 4,456 unique genes. Paths were validated by manual review of a diverse sample of paths. In a cohort of 92,455 samples, we used these mappings to validate gene-phenotype associations in 32,786 samples where a person had a potentially disease-causing genetic mutation and at least one corresponding diagnosis in their EHR. Conclusion The concepts and relationships in the UMLS can be used to generate gene-ICD phenotype mappings that are not explicit in the source vocabularies. We were able to use these mappings to validate gene-disease associations in a large cohort of sequenced exomes paired with EHR.
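The traversal described in this abstract, from gene through variant and disease concepts to ICD codes, can be sketched as a breadth-first search over a directed graph. The node identifiers, the "ICD:" prefix convention, and the tiny adjacency list below are placeholders invented for illustration; they are not drawn from the UMLS or from the authors' database.

```python
from collections import deque


def icd_codes_for(start, graph):
    """Return the sorted ICD-code nodes reachable from `start`.

    `graph` maps each node to the list of nodes it points to. Nodes whose
    identifier begins with "ICD:" are treated as ICD codes (an assumption
    made for this sketch).
    """
    seen = {start}
    queue = deque([start])
    found = set()
    while queue:
        node = queue.popleft()
        if node.startswith("ICD:"):
            found.add(node)
            continue  # ICD codes are treated as terminal nodes
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return sorted(found)


# Hypothetical graph: one gene, one variant, two disease concepts, two codes.
example_graph = {
    "GENE:G1": ["VARIANT:V1", "DISEASE:D1"],
    "VARIANT:V1": ["DISEASE:D2"],
    "DISEASE:D1": ["ICD:CODE_A"],
    "DISEASE:D2": ["ICD:CODE_A", "ICD:CODE_B"],
}
gene_codes = icd_codes_for("GENE:G1", example_graph)
```

The `seen` set keeps the search from revisiting nodes, which matters once the graph contains cycles or several paths to the same concept.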


2021 ◽  
pp. 469-478
Author(s):  
Yasmin H. Karimi ◽  
Douglas W. Blayney ◽  
Allison W. Kurian ◽  
Jeanne Shen ◽  
Rikiya Yamashita ◽  
...  

PURPOSE Large-scale analysis of real-world evidence is often limited to structured data fields that do not contain reliable information on recurrence status and disease sites. In this report, we describe a natural language processing (NLP) framework that uses data from free-text, unstructured reports to classify recurrence status and sites of recurrence for patients with breast and hepatocellular carcinomas (HCC). METHODS Using two cohorts of breast cancer and HCC cases, we validated the ability of a previously developed NLP model to distinguish between no recurrence, local recurrence, and distant recurrence, based on clinician notes, radiology reports, and pathology reports compared with manual curation. A second NLP model was trained and validated to identify sites of recurrence. We compared the ability of each NLP model to identify the presence, timing, and site of recurrence, when compared against manual chart review and International Classification of Diseases coding. RESULTS A total of 1,273 patients were included in the development and validation of the two models. The NLP model for recurrence detects distant recurrence with an area under the curve of 0.98 (95% CI, 0.96 to 0.99) and 0.95 (95% CI, 0.88 to 0.98) in breast and HCC cohorts, respectively. The mean accuracy of the NLP model for detecting any site of distant recurrence was 0.9 for breast cancer and 0.83 for HCC. The NLP model for recurrence identified a larger proportion of patients with distant recurrence in a breast cancer database (11.1%) compared with International Classification of Diseases coding (2.31%). CONCLUSION We developed two NLP models to identify distant cancer recurrence, timing of recurrence, and sites of recurrence based on unstructured electronic health record data. These models can be used to perform large-scale retrospective studies in oncology.


2020 ◽  
Author(s):  
Thomas Gaisl ◽  
Naser Musli ◽  
Patrick Baumgartner ◽  
Marc Meier ◽  
Silvana K Rampini ◽  
...  

BACKGROUND The health aspects, disease frequencies, and specific health interests of prisoners and refugees are poorly understood. Importantly, access to the health care system is limited for this vulnerable population. There has been no systematic investigation to understand the health issues of inmates in Switzerland. Furthermore, little is known on how recent migration flows in Europe may have affected the health conditions of inmates. OBJECTIVE The Swiss Prison Study (SWIPS) is a large-scale observational study with the aim of establishing a public health registry in northern-central Switzerland. The primary objective is to establish a central database to assess disease prevalence (ie, International Classification of Diseases-10 codes [German modification]) among prisoners. The secondary objectives include the following: (1) to compare the 2015 versus 2020 disease prevalence among inmates against a representative sample from the local resident population, (2) to assess longitudinal changes in disease prevalence from 2015 to 2020 by using cross-sectional medical records from all inmates at the Police Prison Zurich, Switzerland, and (3) to identify unrecognized health problems to prepare successful public health strategies. METHODS Demographic and health-related data such as age, sex, country of origin, duration of imprisonment, medication (including the drug name, brand, dosage, and release), and medical history (including the International Classification of Diseases-10 codes [German modification] for all diagnoses and external results that are part of the medical history in the prison) have been deposited in a central register over a span of 5 years (January 2015 to August 2020). The final cohort is expected to comprise approximately 50,000 to 60,000 prisoners from the Police Prison Zurich, Switzerland. RESULTS This study was approved on August 5, 2019 by the ethical committee of the Canton of Zurich with the registration code KEK-ZH No. 2019-01055 and funded in August 2020 by the “Walter and Gertrud Siegenthaler” foundation and the “Theodor and Ida Herzog-Egli” foundation. This study is registered with the International Standard Randomized Controlled Trial Number registry. Data collection started in August 2019 and results are expected to be published in 2021. Findings will be disseminated through scientific papers as well as presentations and public events. CONCLUSIONS This study will construct a valuable database of information regarding the health of inmates and refugees in Swiss prisons and will act as groundwork for future interventions in this vulnerable population. CLINICALTRIAL ISRCTN registry ISRCTN11714665; http://www.isrctn.com/ISRCTN11714665 INTERNATIONAL REGISTERED REPORT DERR1-10.2196/23973


Author(s):  
Hua Wang ◽  
Ke Chai ◽  
Minghui Du ◽  
Shengfeng Wang ◽  
Jian-Ping Cai ◽  
...  

Background: Large-scale and population-based studies of heart failure (HF) incidence and prevalence are scarce in China. The study sought to estimate the prevalence, incidence, and cost of HF in China. Methods: We conducted a population-based study using records of 50.0 million individuals ≥25 years old from the national urban employee basic medical insurance from 6 provinces in China in 2017. Incident cases were individuals with a diagnosis of HF (International Classification of Diseases code, and text of diagnosis) in 2017 with a 4-year disease-free period (2013–2016). We calculated standardized rates by applying age standardization to the 2010 Chinese census population. Results: The age-standardized prevalence and incidence were 1.10% (1.10% among men and women) and 275 per 100 000 person-years (287 among men and 261 among women), respectively, accounting for 12.1 million patients with HF and 3.0 million patients with incident HF ≥25 years old. Both prevalence and incidence increased with increasing age (0.57%, 3.86%, and 7.55% for prevalence and 158, 892, and 1655 per 100 000 person-years for incidence among persons who were 25–64, 65–79, and ≥80 years of age, respectively). The inpatient mean cost per-capita was $4406.8 and the proportion with ≥3 hospitalizations among those hospitalized was 40.5%. The outpatient mean cost per-capita was $892.3. Conclusions: HF has placed a considerable burden on health systems in China, and strategies aimed at the prevention and treatment of HF are needed. Registration: URL: https://www.clinicaltrials.gov ; Unique identifier: ChiCTR2000029094.
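The age standardization mentioned in this abstract can be illustrated with a minimal sketch of direct standardization: stratum-specific rates are weighted by the age distribution of a standard population. The function name and every number below are hypothetical and are not taken from the study or from the 2010 Chinese census; only the three age bands mirror those reported in the abstract.

```python
def directly_standardized_rate(cases, population, standard_population):
    """Direct age standardization.

    Each argument is a list with one entry per age stratum. The result is the
    sum of stratum-specific rates weighted by the standard population shares.
    """
    if not (len(cases) == len(population) == len(standard_population)):
        raise ValueError("all inputs must have one entry per age stratum")
    standard_total = sum(standard_population)
    rate = 0.0
    for c, n, s in zip(cases, population, standard_population):
        rate += (c / n) * (s / standard_total)
    return rate


# Hypothetical strata: 25-64, 65-79, and >=80 years.
cases = [50, 120, 60]
population = [10000, 3000, 800]
standard_population = [8000, 1500, 500]

crude_rate = sum(cases) / sum(population)
standardized_rate = directly_standardized_rate(cases, population, standard_population)
```

Comparing `crude_rate` with `standardized_rate` shows how much of a crude figure reflects the age structure of the study population instead of the stratum-specific rates themselves.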


Circulation ◽  
2020 ◽  
Vol 142 (Suppl_3) ◽  
Author(s):  
Shashank Shekhar ◽  
Anas M Saad ◽  
Toshiaki Isogai ◽  
Mohamed M Gad ◽  
Keerat Ahuja ◽  
...  

Introduction: Even though atrial fibrillation (AF) is present in >30% of patients with aortic stenosis (AS), it is not typically included in the decision-making algorithm for the timing or need for aortic valve replacement (AVR), either by transcatheter (TAVR) or surgical (SAVR) approaches. Large-scale data on how AF affects outcomes of AS patients remain scarce. Methods: From the Nationwide Readmissions Database (NRD), we retrospectively identified AS patients aged ≥18 years, with and without AF, admitted between January and June in 2016 and 2017 (to allow for a six-month follow-up), using the International Classification of Diseases-10th revision codes. Multivariable logistic regression was performed to examine the predictors of in-hospital mortality during index hospitalization. In-hospital complications and 6-month in-hospital mortality during any readmission after being discharged alive were compared in patients with and without AF, for patients undergoing TAVR, SAVR or no-AVR. Results: We identified 403,089 AS patients, of which 41% had AF. Patients with AF were older (median age in years: 83 vs. 79) and were more frequently females (52% vs. 48%; p<0.001). Table summarizes outcomes of AS patients with and without AF. TAVR in patients with AF was associated with higher in-hospital mortality and follow-up mortality as compared to patients without AF. Although AF did not influence in-hospital mortality in the SAVR population, follow-up mortality was also significantly higher after SAVR in patients with AF compared to patients without AF. For patients not undergoing AVR, in-hospital and follow-up mortality were higher in the AF population compared to no AF and were higher than in patients undergoing AVR (Table). Conclusions: AF is associated with worse outcomes in patients with AS irrespective of treatment (TAVR, SAVR or no-AVR). More studies are needed to understand the implications of AF in the AS population and whether earlier treatment of AS in patients with AF can improve outcomes.


2020 ◽  
Vol 7 (1) ◽  
pp. e000485
Author(s):  
Kelly L Hayward ◽  
Amy L Johnson ◽  
Benjamin J Mckillen ◽  
Niall T Burke ◽  
Vikas Bansal ◽  
...  

Objective The utility of International Classification of Diseases (ICD) codes relies on the accuracy of clinical reporting and administrative coding, which may be influenced by country-specific codes and coding rules. This study explores the accuracy and limitations of the Australian Modification of the 10th revision of ICD (ICD-10-AM) to detect the presence of cirrhosis and a subset of key complications for the purpose of future large-scale epidemiological research and healthcare studies. Design/method ICD-10-AM codes in a random sample of 540 admitted patient encounters at a major Australian tertiary hospital were compared with data abstracted from patients’ medical records by four blinded clinicians. Accuracy of individual codes and grouped combinations was determined by calculating sensitivity, positive predictive value (PPV), negative predictive value and Cohen’s kappa coefficient (κ). Results The PPVs for ‘grouped cirrhosis’ codes (0.96), hepatocellular carcinoma (0.97), ascites (0.97) and ‘grouped varices’ (0.95) were good (κ all >0.60). However, codes under-detected the prevalence of cirrhosis, ascites and varices (sensitivity 81.4%, 61.9% and 61.3%, respectively). Overall accuracy was lower for spontaneous bacterial peritonitis (‘grouped’ PPV 0.75; κ 0.73) and the poorest for encephalopathy (‘grouped’ PPV 0.55; κ 0.21). To optimise detection of cirrhosis-related encounters, an ICD-10-AM code algorithm was constructed and validated in an independent cohort of 116 patients with known cirrhosis. Conclusion Multiple ICD-10-AM codes should be considered when using administrative databases to study the burden of cirrhosis and its complications in Australia, to avoid underestimation of the prevalence, morbidity, mortality and related resource utilisation from this burgeoning chronic disease.
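The accuracy measures named in this abstract (sensitivity, PPV, negative predictive value and Cohen's kappa) all follow from a 2×2 table of code presence against chart review. The sketch below shows the standard formulas; the function name and the example counts are hypothetical and do not reproduce the study's data.

```python
def accuracy_metrics(tp, fp, fn, tn):
    """Sensitivity, PPV, NPV and Cohen's kappa from a 2x2 table.

    tp: code present, condition present on chart review
    fp: code present, condition absent
    fn: code absent, condition present
    tn: code absent, condition absent
    """
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    # Chance agreement expected from the marginal totals.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    return {
        "sensitivity": tp / (tp + fn),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "kappa": (observed - expected) / (1.0 - expected),
    }


# Hypothetical counts for a single code compared against chart review.
metrics = accuracy_metrics(tp=40, fp=10, fn=20, tn=30)
```

A code can have a high PPV and still miss many true cases, which is the pattern the abstract reports for ascites and varices.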


2019 ◽  
pp. 102490791987142
Author(s):  
Erdem Kurt ◽  
Rohat AK ◽  
Şebnem Zeynep Eke Kurt ◽  
Suphi Bahadırlı ◽  
Tuba Cimilli Öztürk

Background: This study aims to determine the relationship between troponin levels and 30- and 90-day mortality rates in patients who presented to the emergency department with paroxysmal supraventricular tachycardia. Materials and methods: The data of our study were obtained from the retrospective screening of the files of 321 patients who presented to the emergency department between 1 January 2015 and 31 December 2016 with the International Classification of Diseases diagnosis code I47.1 (supraventricular tachycardia). Unstable patients, patients under 18 years, and patients with comorbidities that could increase troponin levels were excluded from the study. A total of 159 patients diagnosed with paroxysmal supraventricular tachycardia were included in the study. These patients’ files were examined, and their examination and anamnesis information at the time of admission to hospital, demographic characteristics, and applied treatments were analyzed. The 30- and 90-day mortality rates of the patients were examined. Results: The study was carried out with 159 patients. Troponin was positive in 25 (15.7%) cases, while it was negative in 134 (84.3%) cases. There was no significant difference between the two groups in terms of 30- and 90-day mortality rates. Coronary artery disease was more frequent in patients with positive troponin than in patients with negative troponin. Conclusion: No significant difference was found between patients with positive troponin values and patients with negative troponin values in terms of 30- and 90-day mortality rates. We believe that prospective observational studies or large-scale retrospective studies will better elucidate this issue.

