Visual Analytics for Dimension Reduction and Cluster Analysis of High Dimensional Electronic Health Records

Sheikh S. Abdullah; Neda Rostamzadeh; Kamran Sedig; Amit X. Garg; Eric McArthur

doi:10.3390/informatics7020017

Visual Analytics for Dimension Reduction and Cluster Analysis of High Dimensional Electronic Health Records

Informatics ◽

10.3390/informatics7020017 ◽

2020 ◽

Vol 7 (2) ◽

pp. 17 ◽

Cited By ~ 1

Author(s):

Sheikh S. Abdullah ◽

Neda Rostamzadeh ◽

Kamran Sedig ◽

Amit X. Garg ◽

Eric McArthur

Keyword(s):

Cluster Analysis ◽

Electronic Health Records ◽

Dimension Reduction ◽

Visual Analytics ◽

Machine Learning Techniques ◽

High Dimensional ◽

Health Records ◽

Wide Range ◽

Electronic Health ◽

And Cluster Analysis

Recent advances in electronic health record (EHR) systems have produced data at an unprecedented rate. The complex, growing, and high-dimensional data available in EHRs create great opportunities for machine learning techniques such as clustering. Cluster analysis often requires dimension reduction to achieve efficient processing time and mitigate the curse of dimensionality. Given the wide range of techniques for dimension reduction and cluster analysis, it is not straightforward to identify which combination of techniques from both families leads to the desired result. The ability to derive useful and precise insights from EHRs requires a deeper understanding of the data, intermediary results, configuration parameters, and analysis processes. Although these tasks are often tackled separately in existing studies, we present a visual analytics (VA) system, called Visual Analytics for Cluster Analysis and Dimension Reduction of High Dimensional Electronic Health Records (VALENCIA), to address the challenges of high-dimensional EHRs in a single system. VALENCIA brings together a wide range of cluster analysis and dimension reduction techniques, integrates them seamlessly, and makes them accessible to users through interactive visualizations. It offers a balanced distribution of processing load between users and the system to facilitate the performance of high-level cognitive tasks that would be difficult without the aid of a VA system. Through a real case study, we demonstrate how VALENCIA can be used to analyze the healthcare administrative dataset stored at ICES. This research also highlights what needs to be considered in the future when developing VA systems that are designed to derive deep and novel insights into EHRs.

Implementing high‐dimensional propensity score principles to improve confounder adjustment in UK electronic health records

Pharmacoepidemiology and Drug Safety ◽

10.1002/pds.5121 ◽

2020 ◽

Vol 29 (11) ◽

pp. 1373-1381

Author(s):

John Tazare ◽

Liam Smeeth ◽

Stephen J. W. Evans ◽

Elizabeth Williamson ◽

Ian J. Douglas

Keyword(s):

Propensity Score ◽

Electronic Health Records ◽

High Dimensional ◽

Health Records ◽

Electronic Health ◽

Confounder Adjustment

Soft clustering using real-world data for the identification of multimorbidity patterns in an elderly population: cross-sectional study in a Mediterranean population

BMJ Open ◽

10.1136/bmjopen-2019-029594 ◽

2019 ◽

Vol 9 (8) ◽

pp. e029594 ◽

Cited By ~ 5

Author(s):

Concepción Violán ◽

Quintí Foguet-Boreu ◽

Sergio Fernández-Bertolín ◽

Marina Guisado-Clavero ◽

Margarita Cabrera-Bean ◽

...

Keyword(s):

Cluster Analysis ◽

Electronic Health Records ◽

Cross Sectional Study ◽

Secondary Outcome ◽

Sectional Study ◽

Cross Sectional ◽

Health Records ◽

Soft Clustering ◽

Fuzzy C Means ◽

Electronic Health

Objectives: The aim of this study was to identify, with soft clustering methods, multimorbidity patterns in the electronic health records of a population ≥65 years, and to analyse such patterns in accordance with the different prevalence cut-off points applied. Fuzzy cluster analysis allows individuals to be linked simultaneously to multiple clusters and is more consistent with clinical experience than other approaches frequently found in the literature. Design: A cross-sectional study was conducted based on data from electronic health records. Setting: 284 primary healthcare centres in Catalonia, Spain (2012). Participants: 916 619 eligible individuals were included (women: 57.7%). Primary and secondary outcome measures: We extracted data on demographics, International Classification of Diseases version 10 chronic diagnoses, prescribed drugs and socioeconomic status for patients aged ≥65. Following principal component analysis of categorical and continuous variables for dimensionality reduction, machine learning techniques were applied for the identification of disease clusters in a fuzzy c-means analysis. Sensitivity analyses, with different prevalence cut-off points for chronic diseases, were also conducted. Solutions were evaluated from clinical consistency and significance criteria. Results: Multimorbidity was present in 93.1%. Eight clusters were identified with a varying number of disease values: nervous and digestive; respiratory, circulatory and nervous; circulatory and digestive; mental, nervous and digestive, female dominant; mental, digestive and blood, female oldest-old dominant; nervous, musculoskeletal and circulatory, female dominant; genitourinary, mental and musculoskeletal, male dominant; and non-specified, youngest-old dominant. Nuclear diseases were identified for each cluster independently of the prevalence cut-off point considered. Conclusions: Multimorbidity patterns were obtained using fuzzy c-means cluster analysis. They are clinically meaningful clusters which support the development of tailored approaches to multimorbidity management and further research.
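
For readers unfamiliar with fuzzy c-means, the soft assignment described in this abstract can be illustrated with a minimal, standard-library Python sketch. This is a generic textbook formulation, not the authors' implementation: the function name `fuzzy_c_means`, the fuzzifier `m = 2.0`, the fixed iteration count, and the toy points in the usage note are all assumptions made for illustration, and the principal component step that the study applied first is omitted.

```python
import random


def fuzzy_c_means(points, n_clusters, m=2.0, n_iter=100, seed=0):
    """Plain fuzzy c-means on a list of equal-length numeric tuples.

    Returns (centers, memberships); memberships[i][k] is the degree to
    which point i belongs to cluster k, and each row sums to 1.
    """
    rng = random.Random(seed)
    dim = len(points[0])
    # Start from random memberships, normalised so each row sums to 1.
    u = []
    for _ in points:
        row = [rng.random() + 1e-9 for _ in range(n_clusters)]
        total = sum(row)
        u.append([v / total for v in row])

    centers = []
    for _ in range(n_iter):
        # Centers: mean of the points weighted by membership ** m.
        centers = []
        for k in range(n_clusters):
            weights = [u[i][k] ** m for i in range(len(points))]
            total = sum(weights)
            centers.append(tuple(
                sum(w * p[d] for w, p in zip(weights, points)) / total
                for d in range(dim)
            ))
        # Memberships: proportional to distance ** (-2 / (m - 1)).
        for i, p in enumerate(points):
            dists = [
                max(sum((a - b) ** 2 for a, b in zip(p, c)) ** 0.5, 1e-12)
                for c in centers
            ]
            inv = [d ** (-2.0 / (m - 1.0)) for d in dists]
            total = sum(inv)
            u[i] = [v / total for v in inv]
    return centers, u
```

Calling `fuzzy_c_means(points, 2)` on a handful of two-dimensional tuples returns one membership row per point. Unlike a hard clustering, every point keeps a non-zero degree of membership in each cluster, which is the property the abstract describes as linking individuals to multiple clusters simultaneously.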

Estimate of disease heritability using 7.4 million familial relationships inferred from electronic health records

10.1101/066068 ◽

2016 ◽

Cited By ~ 3

Author(s):

Fernanda Polubriaginof ◽

Rami Vanguri ◽

Kayla Quinnies ◽

Gillian M. Belbin ◽

Alexandre Yahi ◽

...

Keyword(s):

Electronic Health Records ◽

Patient Privacy ◽

Next Of Kin ◽

Health Records ◽

Familial Relationships ◽

Wide Range ◽

Causes Of Disease ◽

Contact Data ◽

Electronic Health ◽

Patient Emergency

Abstract: Heritability is essential for understanding the biological causes of disease, but requires laborious patient recruitment and phenotype ascertainment. Electronic health records (EHR) passively capture a wide range of clinically relevant data and provide a novel resource for studying the heritability of traits that are not typically accessible. EHRs contain next-of-kin information collected via patient emergency contact forms, but until now, these data have gone unused in research. We mined emergency contact data at three academic medical centers and identified millions of familial relationships while maintaining patient privacy. Identified relationships were consistent with genetically-derived relatedness. We used EHR data to compute heritability estimates for 500 disease phenotypes. Overall, estimates were consistent with literature and between sites. Inconsistencies were indicative of limitations and opportunities unique to EHR research. These analyses provide a novel validation of the use of EHRs for genetics and disease research. One Sentence Summary: We demonstrate that next-of-kin information can be used to identify familial relationships in the EHR, providing unique opportunities for precision medicine studies.

Predicting Acute Kidney Injury: A Machine Learning Approach Using Electronic Health Records

Information ◽

10.3390/info11080386 ◽

2020 ◽

Vol 11 (8) ◽

pp. 386

Author(s):

Sheikh S. Abdullah ◽

Neda Rostamzadeh ◽

Kamran Sedig ◽

Amit X. Garg ◽

Eric McArthur

Keyword(s):

Machine Learning ◽

Emergency Department ◽

Acute Kidney Injury ◽

Electronic Health Records ◽

Prediction Models ◽

Kidney Injury ◽

Machine Learning Techniques ◽

Health Records ◽

Mortality And Morbidity ◽

Electronic Health

Acute kidney injury (AKI) is a common complication in hospitalized patients and can result in increased hospital stay, health-related costs, mortality and morbidity. A number of recent studies have shown that AKI is predictable and avoidable if early risk factors can be identified by analyzing Electronic Health Records (EHRs). In this study, we employ machine learning techniques to identify older patients who are at risk of readmission with AKI to the hospital or emergency department within 90 days after discharge. The study includes the records of one million patients who visited a hospital or emergency department in Ontario between 2014 and 2016. The predictor variables include patient demographics, comorbid conditions, medications and diagnosis codes. We developed 31 prediction models based on different combinations of two sampling techniques, three ensemble methods, and eight classifiers. These models were evaluated through 10-fold cross-validation and compared based on the AUROC metric. The performances of these models were consistent, and the AUROC across the 31 prediction models ranged between 0.61 and 0.88 for predicting AKI. In general, the performances of ensemble-based methods were higher than that of the cost-sensitive logistic regression. We also validated the features that are most relevant in predicting AKI with a healthcare expert to improve the performance and reliability of the models. This study predicts the risk of AKI for a patient after discharge, which gives healthcare providers enough time to intervene before the onset of AKI.
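
The evaluation protocol named in this abstract (10-fold cross-validation scored by AUROC) can be sketched in a few lines of standard-library Python. This is only an illustration of the two ideas, not the study's pipeline: the helper names `auroc` and `k_fold_indices` are hypothetical, the fold assignment is a simple unshuffled round-robin, and the pairwise AUROC computation is quadratic in the number of samples, so it suits small examples rather than a million records.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs in which the
    positive sample has the higher score; ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    # Round-robin assignment keeps fold sizes within one sample of each other.
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx
```

In use, a model would be fitted on each `train_idx`, scored on the matching `test_idx`, and the per-fold `auroc` values averaged. In practice the indices are usually shuffled, and often stratified by outcome, before folding, which this sketch leaves out.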

Comparing high-dimensional confounder control methods for rapid cohort studies from electronic health records

Journal of Comparative Effectiveness Research ◽

10.2217/cer.15.53 ◽

2016 ◽

Vol 5 (2) ◽

pp. 179-192 ◽

Cited By ~ 8

Author(s):

Yen Sia Low ◽

Blanca Gallego ◽

Nigam Haresh Shah

Keyword(s):

Electronic Health Records ◽

Cohort Studies ◽

High Dimensional ◽

Control Methods ◽

Health Records ◽

Electronic Health

Learning and Visualizing Chronic Latent Representations Using Electronic Health Records

10.21203/rs.3.rs-968569/v1 ◽

2021 ◽

Author(s):

David Chushig-Muzo ◽

Cristina Soguero-Ruiz ◽

Pablo de Miguel Bohoyo ◽

Inmaculada Mora-Jiménez

Keyword(s):

Health Status ◽

Electronic Health Records ◽

Chronic Conditions ◽

Dimensional Space ◽

High Dimensional ◽

Diabetic Patients ◽

Two Dimensional ◽

Health Records ◽

Electronic Health ◽

Latent Representations

Abstract Background: The number of patients with chronic diseases such as diabetes and hypertension has reached alarming levels worldwide. These diseases increase the risk of developing acute complications and involve a substantial economic burden and demand for health resources. The widespread adoption of Electronic Health Records (EHRs) is opening great opportunities for supporting decision-making. Nevertheless, data extracted from EHRs are complex (heterogeneous, high-dimensional and usually noisy), which hampers knowledge extraction with conventional approaches. Methods: We propose the use of the Denoising Autoencoder (DAE), a Machine Learning (ML) technique that transforms high-dimensional data into latent representations (LRs), thus addressing the main challenges with clinical data. We explore in this work how the combination of LRs with a visualization method can be used to map the patient data in a two-dimensional space, gaining knowledge about the distribution of patients with different chronic conditions. Furthermore, this representation can also be used to characterize the evolution of the patient's health status, which is of paramount importance in the clinical setting. Results: To obtain clinical LRs, we considered real-world data extracted from EHRs linked to the University Hospital of Fuenlabrada in Spain. Experimental results showed the great potential of DAEs to identify patients with clinical patterns linked to hypertension, diabetes and multimorbidity. The procedure allowed us to find patients with the same main chronic disease but different clinical characteristics. Thus, we identified two kinds of diabetic patients with differences in their drug therapy (insulin and non-insulin dependent), and also a group of women affected by hypertension and gestational diabetes. We also present a proof of concept for mapping the health status evolution of synthetic patients when considering the most significant diagnoses and drugs associated with chronic patients. Conclusions: Our results highlighted the value of ML techniques to extract clinical knowledge, supporting the identification of patients with certain chronic conditions. Furthermore, the patient's health status progression in the two-dimensional space might be used as a tool for clinicians aiming to characterize health conditions and identify their most relevant clinical codes.

A Stochastic Multivariate Irregularly Sampled Time Series Imputation Method for Electronic Health Records

BioMedInformatics ◽

10.3390/biomedinformatics1030011 ◽

2021 ◽

Vol 1 (3) ◽

pp. 166-181

Author(s):

Muhammad Adib Uz Zaman ◽

Dongping Du

Keyword(s):

Neural Networks ◽

Time Series ◽

Electronic Health Records ◽

Missing Values ◽

Time Series Data ◽

Temporal Information ◽

Series Data ◽

High Dimensional ◽

Health Records ◽

Electronic Health

Electronic health records (EHRs) can be very difficult to analyze since they usually contain many missing values. To build an efficient predictive model, a complete dataset is necessary. An EHR usually contains high-dimensional longitudinal time series data. Most commonly used imputation methods do not consider the importance of temporal information embedded in EHR data. Moreover, most time-dependent neural networks, such as recurrent neural networks (RNNs), inherently treat the time steps as equal, which, in many cases, is not appropriate. This study presents a method using the gated recurrent unit (GRU), neural ordinary differential equations (ODEs), and Bayesian estimation to incorporate the temporal information and impute sporadically observed time series measurements in high-dimensional EHR data.

Challenges in defining Long COVID: Striking differences across literature, Electronic Health Records, and patient-reported information

10.1101/2021.03.20.21253896 ◽

2021 ◽

Author(s):

Halie M. Rando ◽

Tellen D. Bennett ◽

James Brian Byrd ◽

Carolyn Bramante ◽

Tiffany J. Callahan ◽

...

Keyword(s):

Electronic Health Records ◽

Multiple Organ ◽

Health Records ◽

Health Crisis ◽

Organ Systems ◽

Wide Range ◽

Patient Reported ◽

Electronic Health ◽

Novel Coronavirus

Since late 2019, the novel coronavirus SARS-CoV-2 has introduced a wide array of health challenges globally. In addition to a complex acute presentation that can affect multiple organ systems, increasing evidence points to long-term sequelae being common and impactful. As the worldwide scientific community forges ahead with efforts to characterize a wide range of outcomes associated with SARS-CoV-2 infection, the proliferation of available data has made it clear that formal definitions are needed in order to design robust and consistent studies of Long COVID that consistently capture variation in long-term outcomes. In the present study, we investigate the definitions used in the literature published to date and compare them against data available from electronic health records and patient-reported information collected via surveys. Long COVID holds the potential to produce a second public health crisis on the heels of the pandemic. Proactive efforts to identify the characteristics of this heterogeneous condition are imperative for a rigorous scientific effort to investigate and mitigate this threat.

EHRtemporalVariability: delineating temporal dataset shifts in electronic health records

10.1101/2020.04.07.20056564 ◽

2020 ◽

Author(s):

Carlos Sáez ◽

Alba Gutiérrez-Sacristán ◽

Isaac Kohane ◽

Juan M García-Gómez ◽

Paul Avillach

Keyword(s):

Electronic Health Records ◽

R Package ◽

Reliable Data ◽

Data Reuse ◽

Statistical Distributions ◽

Health Records ◽

Link Type ◽

Wide Range ◽

Electronic Health ◽

Over Time

Abstract Background: Temporal variability in healthcare processes or protocols is intrinsic to medicine. Such variability can potentially introduce dataset shifts, a data quality issue when reusing electronic health records (EHRs) for secondary purposes. Temporal dataset shifts can present as trends, abrupt or seasonal changes in the statistical distributions of data over time, being particularly complex to address in multi-modal and highly coded data. These changes, if not delineated, can harm population and data-driven research, such as machine learning. Given that biomedical research repositories are increasingly being populated with large historical data from EHRs, there is a need for specific software methods to help delineate temporal dataset shifts to ensure reliable data reuse. Findings: EHRtemporalVariability is an Open Source R-package and Shiny-app designed to explore and identify temporal dataset shifts. EHRtemporalVariability estimates the statistical distributions of coded and numerical data over time, projects their temporal-evolution through non-parametric Information Geometric Temporal plots, and enables the exploration of changes in variables through Data Temporal Heatmaps. We demonstrate the capability of EHRtemporalVariability to delineate dataset shifts in three impact case studies, one of them available for reproducibility. Conclusions: EHRtemporalVariability enables the exploration and identification of dataset shifts, contributing to broadly examine and repurpose large, longitudinal datasets. Our goal is to help ensure reliable data reuse to a wide range of biomedical data users. EHRtemporalVariability is suited to technical users programmatically using the R-package and to those users not familiar with programming using the Shiny user interface. Availability: https://github.com/hms-dbmi/EHRtemporalVariability/ Reproducible vignette: https://cran.r-project.org/web/packages/EHRtemporalVariability/vignettes/EHRtemporalVariability.html On-line demo: http://ehrtemporalvariability.upv.es/

Diagnosing hospital bacteraemia in the framework of predictive, preventive and personalised medicine using electronic health records and machine learning classifiers

The EPMA Journal ◽

10.1007/s13167-021-00252-3 ◽

2021 ◽

Author(s):

Oscar Garnica ◽

Diego Gómez ◽

Víctor Ramos ◽

J. Ignacio Hidalgo ◽

José M. Ruiz-Giardín

Keyword(s):

Machine Learning ◽

Blood Culture ◽

Electronic Health Records ◽

Antibiotic Treatment ◽

Personalised Medicine ◽

Machine Learning Techniques ◽

Support Vector ◽

Health Records ◽

Learning Techniques ◽

Electronic Health

Abstract Background: Predicting bacteraemia is relevant because sepsis is one of the most important causes of morbidity and mortality. Bacteraemia prognosis primarily depends on a rapid diagnosis. Predicting bacteraemia would shorten the diagnosis by up to 6 days and, in conjunction with individual patient variables, should be considered when deciding on the early administration of personalised antibiotic treatment and medical services, the selection of specific diagnostic techniques and the determination of additional treatments, such as surgery, that would prevent subsequent complications. Machine learning techniques could help physicians make these informed decisions by predicting bacteraemia using the data already available in electronic hospital records. Objective: This study presents the application of machine learning techniques to these records to predict the blood culture's outcome, which would reduce the lag in starting a personalised antibiotic treatment and the medical costs associated with erroneous treatments due to conservative assumptions about blood culture outcomes. Methods: Six supervised classifiers were created using three machine learning techniques, Support Vector Machine, Random Forest and K-Nearest Neighbours, on the electronic health records of hospital patients. The best approach to handle missing data was chosen and, for each machine learning technique, two classification models were created: the first uses the features known at the time of blood extraction, whereas the second uses four extra features revealed during the blood culture. Results: The six classifiers were trained and tested using a dataset of 4357 patients with 117 features per patient. In the best case, the models reach a state-of-the-art accuracy of 85.9%, a sensitivity of 87.4% and an AUC of 0.93. Conclusions: Our results provide cutting-edge metrics of interest in predictive medical models with values that exceed the medical practice threshold and previous results in the literature using classical modelling techniques in specific types of bacteraemia. Additionally, the consistency of results is reasserted because the three classifiers' importance ranking shows similar features that coincide with those that physicians use in their manual heuristics. Therefore, the efficacy of these machine learning techniques confirms their viability to assist in the aims of predictive and personalised medicine once the disease presents bacteraemia-compatible symptoms and to assist in improving the healthcare economy.
