Synthesizing electronic health records using improved generative adversarial networks

Mrinal Kanti Baowaly; Chia-Ching Lin; Chao-Lin Liu; Kuan-Ta Chen

doi:10.1093/jamia/ocy142

Synthesizing electronic health records using improved generative adversarial networks

Journal of the American Medical Informatics Association ◽

10.1093/jamia/ocy142 ◽

2018 ◽

Vol 26 (3) ◽

pp. 228-241 ◽

Cited By ~ 15

Author(s):

Mrinal Kanti Baowaly ◽

Chia-Ching Lin ◽

Chao-Lin Liu ◽

Kuan-Ta Chen

Keyword(s):

Electronic Health Records ◽

Binary Data ◽

Synthetic Data ◽

Generative Adversarial Networks ◽

Research Database ◽

Data Generation ◽

Generative Adversarial Network ◽

Health Records ◽

Adversarial Network ◽

Electronic Health

Abstract Objective The aim of this study was to generate synthetic electronic health records (EHRs). The generated EHR data will be more realistic than those generated using the existing medical Generative Adversarial Network (medGAN) method. Materials and Methods We modified medGAN to obtain two synthetic data generation models, designated as medical Wasserstein GAN with gradient penalty (medWGAN) and medical boundary-seeking GAN (medBGAN), and compared the results obtained using the three models. We used 2 databases: MIMIC-III and the National Health Insurance Research Database (NHIRD), Taiwan. First, we trained the models and generated synthetic EHRs by using these 3 models. We then analyzed and compared the models’ performance by using a few statistical methods (Kolmogorov–Smirnov test, dimension-wise probability for binary data, and dimension-wise average count for count data) and 2 machine learning tasks (association rule mining and prediction). Results We conducted a comprehensive analysis and found our models were adequately efficient for generating synthetic EHR data. The proposed models outperformed medGAN in all cases, and among the 3 models, boundary-seeking GAN (medBGAN) performed the best. Discussion To generate realistic synthetic EHR data, the proposed models will be effective in the medical industry and related research from the viewpoint of providing better services. Moreover, they will eliminate barriers including limited access to EHR data and thus accelerate research on medical informatics. Conclusion The proposed models can adequately learn the data distribution of real EHRs and efficiently generate realistic synthetic EHRs. The results show the superiority of our models over the existing model.
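
The statistical checks named in this abstract (dimension-wise probability for binary data and the Kolmogorov–Smirnov test) can be illustrated with a minimal sketch. This is not the authors' code: the toy data, the function names, and the choice to apply the KS statistic to the two sets of per-dimension probabilities are all assumptions made for illustration.

```python
import random


def dimension_wise_probability(records):
    """Fraction of records in which each binary feature equals 1."""
    n = len(records)
    return [sum(column) / n for column in zip(*records)]


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap
    between the empirical CDFs of samples a and b."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both pointers past every value <= x.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def sample_binary_records(probs, n, rng):
    """Hypothetical stand-in for an EHR matrix: n rows of Bernoulli draws."""
    return [[1 if rng.random() < p else 0 for p in probs] for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(0)
    # Assumed toy feature probabilities, not taken from any real dataset.
    real = sample_binary_records([0.1, 0.5, 0.9, 0.3], 200, rng)
    synthetic = sample_binary_records([0.15, 0.45, 0.85, 0.35], 200, rng)

    p_real = dimension_wise_probability(real)
    p_syn = dimension_wise_probability(synthetic)
    mean_abs_gap = sum(abs(r - s) for r, s in zip(p_real, p_syn)) / len(p_real)

    print("real      :", p_real)
    print("synthetic :", p_syn)
    print("mean absolute gap:", mean_abs_gap)
    print("KS statistic over per-dimension probabilities:",
          ks_statistic(p_real, p_syn))
```

A small gap between the two probability vectors suggests the synthetic data reproduces the per-feature frequencies of the real data; it says nothing about correlations between features, which is why the study also uses downstream tasks.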

Generation of Synthetic Data with Conditional Generative Adversarial Networks

Logic Journal of IGPL ◽

10.1093/jigpal/jzaa059 ◽

2020 ◽

Author(s):

Belén Vega-Márquez ◽

Cristina Rubio-Escudero ◽

Isabel Nepomuceno-Chamorro

Keyword(s):

Research Work ◽

Synthetic Data ◽

Original Data ◽

Classification Problem ◽

Generative Adversarial Networks ◽

Data Generation ◽

Generative Adversarial Network ◽

Adversarial Network ◽

Adversarial Networks ◽

Original Dataset

Abstract The generation of synthetic data is becoming a fundamental task in the daily life of any organization due to the new data protection laws that are emerging. Because of the rise in the use of Artificial Intelligence, one of the most recent proposals to address this problem is the use of Generative Adversarial Networks (GANs). These types of networks have demonstrated a great capacity to create synthetic data with very good performance. The goal of synthetic data generation is to create data that will perform similarly to the original dataset for many analysis tasks, such as classification. The problem with GANs is that, in a classification problem, they do not take class labels into account when generating new data; the label is treated as any other attribute. This research work has focused on the creation of new synthetic data from datasets with different characteristics with a Conditional Generative Adversarial Network (CGAN). CGANs are an extension of GANs where the class label is taken into account when the new data is generated. The performance of our results has been measured in two different ways: firstly, by comparing the results obtained with classification algorithms, both in the original datasets and in the data generated; secondly, by checking that the correlation between the original data and those generated is minimal.
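
The way a CGAN takes the class label into account can be sketched as follows. A common formulation concatenates a one-hot encoding of the label to the generator's noise vector and to each sample shown to the discriminator. The sketch below covers only that input construction, not the networks or the training loop, and every name in it is hypothetical rather than taken from the paper.

```python
import random


def one_hot(label, num_classes):
    """One-hot encoding of an integer class label."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


def generator_input(noise_dim, label, num_classes, rng):
    """Noise vector with the class label appended, so the generator
    can produce a sample for the requested class."""
    noise = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
    return noise + one_hot(label, num_classes)


def discriminator_input(sample, label, num_classes):
    """Sample with its class label appended, so the discriminator
    judges realism given the class."""
    return list(sample) + one_hot(label, num_classes)


if __name__ == "__main__":
    rng = random.Random(0)
    # Assumed sizes for illustration: 4 noise dimensions, 3 classes.
    print(generator_input(4, 1, 3, rng))
    print(discriminator_input([0.5, 0.2], 2, 3))
```

In an unconditional GAN the label would simply be one more column of the sample; appending it as a separate conditioning input is what lets the user request data for a specific class.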

Assessing function of electronic health records for real-world data generation

BMJ evidence-based medicine ◽

10.1136/bmjebm-2018-111111 ◽

2018 ◽

Vol 24 (3) ◽

pp. 95-98 ◽

Cited By ~ 2

Author(s):

Daphne Guinn ◽

Erin E Wilhelm ◽

Grazyna Lieberman ◽

Sean Khozin

Keyword(s):

Electronic Health Records ◽

Real World ◽

Data Generation ◽

Real World Data ◽

Health Records ◽

World Data ◽

Electronic Health

Boosting Deep Learning Risk Prediction with Generative Adversarial Networks for Electronic Health Records

2017 IEEE International Conference on Data Mining (ICDM) ◽

10.1109/icdm.2017.93 ◽

2017 ◽

Cited By ~ 18

Author(s):

Zhengping Che ◽

Yu Cheng ◽

Shuangfei Zhai ◽

Zhaonan Sun ◽

Yan Liu

Keyword(s):

Deep Learning ◽

Electronic Health Records ◽

Risk Prediction ◽

Generative Adversarial Networks ◽

Health Records ◽

Adversarial Networks ◽

Electronic Health

Supplementing electronic health records through sample collection and patient diaries: A study set within a primary care research database

Pharmacoepidemiology and Drug Safety ◽

10.1002/pds.4323 ◽

2017 ◽

Vol 27 (2) ◽

pp. 239-242 ◽

Cited By ~ 4

Author(s):

Rebecca M. Joseph ◽

Jamie Soames ◽

Mark Wright ◽

Kirin Sultana ◽

Tjeerd P. van Staa ◽

...

Keyword(s):

Primary Care ◽

Electronic Health Records ◽

Research Database ◽

Sample Collection ◽

Health Records ◽

Primary Care Research ◽

Care Research ◽

Electronic Health

Automatic Inference of Demographic Parameters Using Generative Adversarial Networks

10.1101/2020.08.05.237834 ◽

2020 ◽

Cited By ~ 2

Author(s):

Zhanpeng Wang ◽

Jiaping Wang ◽

Michael Kourakos ◽

Nhung Hoang ◽

Hyong Hark Lee ◽

...

Keyword(s):

Synthetic Data ◽

Simulated Data ◽

Real Data ◽

Simulation Software ◽

Generative Adversarial Networks ◽

Generative Adversarial Network ◽

Isolation With Migration ◽

Adversarial Network ◽

Novel Approach ◽

Input Parameters

Abstract Population genetics relies heavily on simulated data for validation, inference, and intuition. In particular, since real data is always limited, simulated data is crucial for training machine learning methods. Simulation software can accurately model evolutionary processes, but requires many hand-selected input parameters. As a result, simulated data often fails to mirror the properties of real genetic data, which limits the scope of methods that rely on it. In this work, we develop a novel approach to estimating parameters in population genetic models that automatically adapts to data from any population. Our method is based on a generative adversarial network that gradually learns to generate realistic synthetic data. We demonstrate that our method is able to recover input parameters in a simulated isolation-with-migration model. We then apply our method to human data from the 1000 Genomes Project, and show that we can accurately recapitulate the features of real data.

Human Face Generation using Deep Convolution Generative Adversarial Network

International Journal for Modern Trends in Science and Technology - RTT2020 ◽

10.46501/ijmtst070127 ◽

2021 ◽

Vol 7 (01) ◽

pp. 114-120

Author(s):

Chaudhary Sarimurrab ◽

Ankita Kesari Naman ◽

Sudha Narang

Keyword(s):

Random Noise ◽

Generative Models ◽

Generative Adversarial Networks ◽

Data Generation ◽

Generative Adversarial Network ◽

Practical Applications ◽

Adversarial Network ◽

Adversarial Networks ◽

Face Generation ◽

Human Faces

Generative models have gained considerable attention in the field of unsupervised learning via a new and practical framework called Generative Adversarial Networks (GANs), owing to its outstanding data generation capability. Many GAN models have been proposed, and several practical applications have emerged in various domains of computer vision and machine learning. Despite the success of GANs, there are still obstacles to stable training. In this work, we aim to generate human faces from unlabelled data with the help of Deep Convolutional Generative Adversarial Networks. The applications for generating faces are vast in the fields of image processing, entertainment, and other such industries. Our resulting model is able to generate human faces from the given unlabelled data and random noise.

Empirical Evaluation on Synthetic Data Generation with Generative Adversarial Network

Proceedings of the 9th International Conference on Web Intelligence, Mining and Semantics - WIMS2019 ◽

10.1145/3326467.3326474 ◽

2019 ◽

Author(s):

Pei-Hsuan Lu ◽

Pang-Chieh Wang ◽

Chia-Mu Yu

Keyword(s):

Empirical Evaluation ◽

Synthetic Data ◽

Data Generation ◽

Generative Adversarial Network ◽

Adversarial Network ◽

Synthetic Data Generation

Differentially private synthetic mixed-type data generation for unsupervised learning

Intelligent Decision Technologies ◽

10.3233/idt-210195 ◽

2021 ◽

pp. 1-29

Author(s):

Uthaipon Tao Tantipongpipat ◽

Chris Waites ◽

Digvijay Boob ◽

Amaresh Ankit Siva ◽

Rachel Cummings

Keyword(s):

Binary Data ◽

Mixed Type ◽

Differential Privacy ◽

Synthetic Data ◽

Original Data ◽

Generative Adversarial Networks ◽

Data Generation ◽

Sensitive Data ◽

Type Data ◽

Low Dimensional

We introduce the DP-auto-GAN framework for synthetic data generation, which combines the low dimensional representation of autoencoders with the flexibility of Generative Adversarial Networks (GANs). This framework can be used to take in raw sensitive data and privately train a model for generating synthetic data that satisfies statistical properties similar to those of the original data. This learned model can generate an arbitrary amount of synthetic data, which can then be freely shared due to the post-processing guarantee of differential privacy. Our framework is applicable to unlabeled mixed-type data, which may include binary, categorical, and real-valued data. We implement this framework on both binary data (MIMIC-III) and mixed-type data (ADULT), and compare its performance with existing private algorithms on metrics in unsupervised settings. We also introduce a new quantitative metric able to detect diversity, or lack thereof, of synthetic data.

Generating sequential electronic health records using dual adversarial autoencoder

Journal of the American Medical Informatics Association ◽

10.1093/jamia/ocaa119 ◽

2020 ◽

Vol 27 (9) ◽

pp. 1411-1419 ◽

Cited By ~ 1

Author(s):

Dongha Lee ◽

Hwanjo Yu ◽

Xiaoqian Jiang ◽

Deevakar Rogith ◽

Meghana Gudala ◽

...

Keyword(s):

Electronic Health Records ◽

Predictive Modeling ◽

Medical Information ◽

Real Data ◽

Generative Models ◽

Generative Adversarial Networks ◽

Health Records ◽

Clinical Databases ◽

Electronic Health ◽

Clinical Records

Abstract Objective Recent studies on electronic health records (EHRs) have started to learn deep generative models and synthesize a huge amount of realistic records, in order to address significant privacy issues surrounding the EHR. However, most of them only focus on structured records about patients’ independent visits, rather than on chronological clinical records. In this article, we aim to learn and synthesize realistic sequences of EHRs based on the generative autoencoder. Materials and Methods We propose a dual adversarial autoencoder (DAAE), which learns set-valued sequences of medical entities, by combining a recurrent autoencoder with 2 generative adversarial networks (GANs). DAAE improves the mode coverage and quality of generated sequences by adversarially learning both the continuous latent distribution and the discrete data distribution. Using the MIMIC-III (Medical Information Mart for Intensive Care-III) and UT Physicians clinical databases, we evaluated the performance of DAAE in terms of predictive modeling, plausibility, and privacy preservation. Results Our generated sequences of EHRs showed performance comparable to real data for a predictive modeling task, and achieved the best score among all baseline models in a plausibility evaluation conducted by medical experts. In addition, differentially private optimization of our model enables the generation of synthetic sequences without increasing the privacy leakage of patients’ data. Conclusions DAAE can effectively synthesize sequential EHRs by addressing its main challenges: the synthetic records should be realistic enough not to be distinguished from the real records, and they should cover all the training patients to reproduce the performance of specific downstream tasks.
