Analyzing Medical Research Results Based on Synthetic Data and Their Relation to Real Data Results: Systematic Comparison From Five Observational Studies

10.2196/16492 ◽  
2020 ◽  
Vol 8 (2) ◽  
pp. e16492 ◽  
Author(s):  
Anat Reiner Benaim ◽  
Ronit Almog ◽  
Yuri Gorelik ◽  
Irit Hochberg ◽  
Laila Nassar ◽  
...  

Background Privacy restrictions limit access to protected patient-derived health information for research purposes. Consequently, data anonymization is required to allow researchers data access for initial analysis before granting institutional review board approval. A system installed and activated at our institution enables synthetic data generation that mimics data from real electronic medical records, wherein only fictitious patients are listed. Objective This paper aimed to validate the results obtained when analyzing synthetic structured data for medical research. A comprehensive validation process concerning meaningful clinical questions and various types of data was conducted to assess the accuracy and precision of statistical estimates derived from synthetic patient data. Methods A cross-hospital project was conducted to validate results obtained from synthetic data produced for five contemporary studies on various topics. For each study, results derived from synthetic data were compared with those based on real data. In addition, repeatedly generated synthetic datasets were used to estimate the bias and stability of results obtained from synthetic data. Results This study demonstrated that results derived from synthetic data were predictive of results from real data. When the number of patients was large relative to the number of variables used, highly accurate and strongly consistent results were observed between synthetic and real data. For studies based on smaller populations that accounted for confounders and modifiers by multivariate models, predictions were of moderate accuracy, yet clear trends were correctly observed. Conclusions The use of synthetic structured data provides a close estimate to real data results and is thus a powerful tool in shaping research hypotheses and accessing estimated analyses, without risking patient privacy. 
Synthetic data enable broad access to data (eg, for out-of-organization researchers), and rapid, safe, and repeatable analysis of data in hospitals or other health organizations where patient privacy is a primary value.
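The bias-and-stability check from repeatedly generated synthetic datasets, as described above, can be sketched minimally; the function name and the numeric estimates below are illustrative and not taken from the study:

```python
# Sketch: regenerate the synthetic dataset several times, re-estimate the
# statistic of interest on each copy, and summarize bias (versus the
# real-data estimate) and stability (spread across copies).
import statistics

def bias_and_stability(real_estimate, synthetic_estimates):
    """Bias of the mean synthetic estimate vs. the real-data estimate,
    and the sample standard deviation across synthetic copies."""
    mean_synth = statistics.mean(synthetic_estimates)
    bias = mean_synth - real_estimate
    stability = statistics.stdev(synthetic_estimates)
    return bias, stability

# e.g. an effect estimate computed once on real data and on 5 synthetic copies
bias, spread = bias_and_stability(1.80, [1.75, 1.84, 1.78, 1.90, 1.73])
```

A small bias with a small spread would indicate that synthetic-data results are predictive of real-data results, in the sense the abstract describes.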


Entropy ◽  
2021 ◽  
Vol 23 (9) ◽  
pp. 1165
Author(s):  
Karan Bhanot ◽  
Miao Qi ◽  
John S. Erickson ◽  
Isabelle Guyon ◽  
Kristin P. Bennett

Access to healthcare data such as electronic health records (EHR) is often restricted by laws established to protect patient privacy. These restrictions hinder the reproducibility of existing results based on private healthcare data and also limit new research. Synthetically generated healthcare data solve this problem by preserving privacy and enabling researchers and policymakers to drive decisions and methods based on realistic data. Healthcare data can include information about multiple in- and outpatient visits, making it a time-series dataset which is often influenced by protected attributes such as age, gender, and race. The COVID-19 pandemic has exacerbated health inequities, with certain subgroups experiencing poorer outcomes and less access to healthcare. To combat these inequities, synthetic data must “fairly” represent diverse minority subgroups such that the conclusions drawn on synthetic data are correct and the results can be generalized to real data. In this article, we develop two fairness metrics for synthetic data and apply them to all subgroups defined by protected attributes to quantify the bias in three published synthetic research datasets. These covariate-level disparity metrics revealed that synthetic data may not be representative at the univariate and multivariate subgroup levels; thus, fairness should be addressed when developing data generation methods. We discuss the need for measuring fairness in synthetic healthcare data to enable the development of robust machine learning models and more equitable synthetic healthcare datasets.
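A covariate-level disparity check of the kind described can be sketched as follows; this is a hypothetical illustration of the general idea (comparing subgroup proportions between real and synthetic records), not the paper's actual metrics:

```python
# Sketch: per-subgroup absolute difference in representation between a real
# and a synthetic dataset, for one protected attribute. All data invented.
from collections import Counter

def subgroup_disparity(real_attrs, synth_attrs):
    """Absolute difference in subgroup proportions, keyed by subgroup value."""
    real_counts = Counter(real_attrs)
    synth_counts = Counter(synth_attrs)
    n_real, n_synth = len(real_attrs), len(synth_attrs)
    groups = set(real_counts) | set(synth_counts)
    return {g: abs(real_counts[g] / n_real - synth_counts[g] / n_synth)
            for g in groups}

real = ["F", "F", "M", "M", "M", "F"]     # 50% F in the real data
synth = ["F", "M", "M", "M", "M", "M"]    # ~17% F in the synthetic data
disparity = subgroup_disparity(real, synth)
```

A disparity near zero for every subgroup would suggest the synthetic data represents that attribute fairly; large values flag under- or over-represented groups.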


Symmetry ◽  
2021 ◽  
Vol 13 (7) ◽  
pp. 1176
Author(s):  
Aleksei Boikov ◽  
Vladimir Payor ◽  
Roman Savelev ◽  
Alexandr Kolesnikov

The paper presents a methodology for training neural networks for vision tasks on synthesized data, using steel defect recognition in automated production control systems as an example. The article describes the procedural generation of a steel slab defect dataset with a symmetrical distribution. The results of training two neural networks, Unet and Xception, on the generated data and testing them on real data are presented. The performance of these neural networks was assessed using real data from the Severstal: Steel Defect Detection set. In both cases, the neural networks showed good results in the classification and segmentation of surface defects of steel workpieces in the image. On synthetic data, the Dice score reaches 0.62 and the accuracy reaches 0.81.
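The Dice score reported above is a standard overlap measure for segmentation masks; a minimal sketch for binary masks (the example masks are invented):

```python
# Dice similarity coefficient: 2·|A ∩ B| / (|A| + |B|) for binary masks,
# here represented as flat 0/1 lists.
def dice_score(pred, target):
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Convention: two empty masks are a perfect match.
    return 2.0 * intersection / total if total else 1.0

pred   = [1, 1, 0, 0, 1]
target = [1, 0, 0, 1, 1]
# intersection = 2, |pred| = 3, |target| = 3, so Dice = 4/6 ≈ 0.667
```

For multi-class segmentation the score is typically computed per class and averaged.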


2021 ◽  
Author(s):  
A. Ali Heydari ◽  
Oscar A. Davalos ◽  
Lihong Zhao ◽  
Katrina K. Hoyer ◽  
Suzanne S. Sindi

Motivation: Single-cell RNA sequencing (scRNAseq) technologies allow for measurements of gene expression at a single-cell resolution. This provides researchers with a tremendous advantage for detecting heterogeneity, delineating cellular maps or identifying rare subpopulations. However, a critical complication remains: the low number of single-cell observations due to limitations by cost or rarity of subpopulation. This absence of sufficient data may cause inaccuracy or irreproducibility of downstream analysis. In this work, we present ACTIVA (Automated Cell-Type-informed Introspective Variational Autoencoder): a novel framework for generating realistic synthetic data using a single-stream adversarial variational autoencoder conditioned with cell-type information. Data generation and augmentation with ACTIVA can enhance scRNAseq pipelines and analysis, such as benchmarking new algorithms, studying the accuracy of classifiers and detecting marker genes. ACTIVA will facilitate analysis of smaller datasets, potentially reducing the number of patients and animals necessary in initial studies. Results: We train and evaluate models on multiple public scRNAseq datasets. Under the same conditions, ACTIVA trains up to 17 times faster than the GAN-based state-of-the-art model, scGAN (2.2 hours compared to 39.5 hours on Brain Small), while performing better or comparably in our quantitative and qualitative evaluations. We show that augmenting rare populations with ACTIVA can significantly increase the classification accuracy of the rare population (more than 45% improvement in our rarest test case). Availability of data and code: Links to raw, pre- and post-processed data, source code and tutorials are available at https://github.com/SindiLab. Supplementary information: Supplementary material can be found as a separate file with the same pre-print submission.


2020 ◽  
Vol 20 (1) ◽  
Author(s):  
Liane Colonna

This paper explores a specific risk-mitigation strategy to reduce privacy concerns in the Internet of Health Things (IoHT): data anonymization. It contributes to the current academic debate surrounding the role of anonymization in the IoHT by evaluating how data controllers can balance privacy risks against the quality of output data and select the appropriate privacy model that achieves the aims underlying the concept of Privacy by Design. It sets forth several approaches for identifying the risk of re-identification in the IoHT and explores the potential for synthetic data generation as an alternative to anonymization for data sharing.


Author(s):  
Jonas Hein ◽  
Matthias Seibold ◽  
Federica Bogo ◽  
Mazda Farshad ◽  
Marc Pollefeys ◽  
...  

Abstract Purpose: Tracking of tools and surgical activity is becoming increasingly important in the context of computer-assisted surgery. In this work, we present a data generation framework, dataset and baseline methods to facilitate further research in the direction of markerless hand and instrument pose estimation in realistic surgical scenarios. Methods: We developed a rendering pipeline to create inexpensive and realistic synthetic data for model pretraining. Subsequently, we propose a pipeline to capture and label real data with hand and object pose ground truth in an experimental setup to gather high-quality real data. We furthermore present three state-of-the-art RGB-based pose estimation baselines. Results: We evaluate three baseline models on the proposed datasets. The best performing baseline achieves an average tool 3D vertex error of 16.7 mm on synthetic data as well as 13.8 mm on real data, which is comparable to the state of the art in RGB-based hand/object pose estimation. Conclusion: To the best of our knowledge, we propose the first synthetic and real data generation pipelines to generate hand and object pose labels for open surgery. We present three baseline models for RGB-based object and hand/object pose estimation. Our realistic synthetic data generation pipeline may help overcome the data bottleneck in the surgical domain and can easily be transferred to other medical applications.
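The "average tool 3D vertex error" above is, presumably, the mean Euclidean distance between predicted and ground-truth mesh vertices; a minimal sketch with invented example vertices:

```python
# Mean Euclidean distance between corresponding predicted and ground-truth
# 3D vertices. The result is in the same units as the inputs (e.g. mm).
import math

def mean_vertex_error(pred_vertices, gt_vertices):
    return sum(math.dist(p, g)
               for p, g in zip(pred_vertices, gt_vertices)) / len(pred_vertices)

pred = [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
gt   = [(3.0, 4.0, 0.0), (1.0, 1.0, 0.0)]
# per-vertex distances are 5.0 and 0.0, so the mean error is 2.5
```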


2020 ◽  
Author(s):  
David Meyer

The use of real data for training machine learning (ML) models is often a cause of major limitations. For example, real data may be (a) only representative of a subset of situations and domains, (b) expensive to produce, or (c) limited to specific individuals due to licensing restrictions. Although the use of synthetic data is becoming increasingly popular in computer vision, ML models used in weather and climate models still rely on large real datasets. Here we present some recent work towards the generation of synthetic data for weather and climate applications and outline some of the major challenges and limitations encountered.


Author(s):  
V. V. Danilov ◽  
O. M. Gerget ◽  
D. Y. Kolpashchikov ◽  
N. V. Laptev ◽  
R. A. Manakov ◽  
...  

Abstract. In the era of data-driven machine learning algorithms, data represents the new oil. Machine learning algorithms need large, heterogeneous datasets that, crucially, are correctly labeled. However, data collection and labeling are time-consuming and labor-intensive processes. A particular task we solve using machine learning is the segmentation of medical devices in echocardiographic images during minimally invasive surgery. The lack of data motivated us to develop an algorithm that generates synthetic samples based on real datasets. The concept of this algorithm is to place a medical device (catheter) in an empty cavity of an anatomical structure, for example, in a heart chamber, and then transform it. To create random transformations of the catheter, the algorithm uses a coordinate system that uniquely identifies each point regardless of the bend and the shape of the object. A cylindrical coordinate system is taken as a basis and modified by replacing the Z-axis with a spline along which the h-coordinate is measured. Using the proposed algorithm, we generated new images with the catheter inserted into different heart cavities while varying its location and shape. Afterward, we compared the results of deep neural networks trained on datasets comprising real and synthetic data. The network trained on both real and synthetic data performed more accurate segmentation than the model trained only on real data. For instance, a modified U-net trained on the combined datasets performed segmentation with a Dice similarity coefficient of 92.6±2.2%, while the same model trained only on real samples achieved 86.5±3.6%. Using a synthetic dataset decreased the accuracy spread and improved the generalization of the model. It is also worth noting that the proposed algorithm reduces subjectivity, minimizes the labeling routine, increases the number of samples, and improves heterogeneity.
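The spline-based cylindrical coordinate idea can be sketched as follows, with a polyline standing in for the spline; the function names, the frame construction, and the reduction of (h, r, φ) to Cartesian coordinates are illustrative assumptions, not the authors' implementation:

```python
# Sketch: a point is addressed by arc length h along a centerline curve,
# radius r, and angle phi in the plane normal to the local tangent.
import math

def _frame_at(pts, h):
    """Position and unit tangent at arc length h along a 3D polyline."""
    for a, b in zip(pts, pts[1:]):
        seg = math.dist(a, b)
        if h <= seg:
            t = h / seg
            pos = tuple(a[i] + t * (b[i] - a[i]) for i in range(3))
            tan = tuple((b[i] - a[i]) / seg for i in range(3))
            return pos, tan
        h -= seg
    raise ValueError("h exceeds the polyline length")

def spline_cylindrical_to_xyz(pts, h, r, phi):
    """Map modified cylindrical coordinates (h, r, phi) to Cartesian xyz."""
    pos, d = _frame_at(pts, h)
    # Build an orthonormal basis (n1, n2) in the plane normal to tangent d;
    # the reference-vector choice here is an arbitrary illustrative convention.
    ref = (1.0, 0.0, 0.0) if abs(d[0]) < 0.9 else (0.0, 1.0, 0.0)
    dot = sum(ref[i] * d[i] for i in range(3))
    n1 = tuple(ref[i] - dot * d[i] for i in range(3))
    norm = math.sqrt(sum(c * c for c in n1))
    n1 = tuple(c / norm for c in n1)
    n2 = (d[1] * n1[2] - d[2] * n1[1],
          d[2] * n1[0] - d[0] * n1[2],
          d[0] * n1[1] - d[1] * n1[0])
    return tuple(pos[i] + r * (math.cos(phi) * n1[i] + math.sin(phi) * n2[i])
                 for i in range(3))

# For a straight "spline" along z this reduces to ordinary cylindrical coords:
centerline = [(0.0, 0.0, 0.0), (0.0, 0.0, 4.0)]
p = spline_cylindrical_to_xyz(centerline, h=2.0, r=0.5, phi=0.0)
```

Bending the centerline while keeping (h, r, φ) fixed is what lets the catheter be deformed without re-deriving per-point labels.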


2021 ◽  
Author(s):  
David Meyer ◽  
Thomas Nagler ◽  
Robin J. Hogan

Abstract. Can we improve machine learning (ML) emulators with synthetic data? The use of real data for training ML models is often the cause of major limitations. For example, real data may be (a) only representative of a subset of situations and domains, (b) expensive to source, or (c) limited to specific individuals due to licensing restrictions. Although the use of synthetic data is becoming increasingly popular in computer vision, the training of ML emulators in weather and climate still relies on real datasets. Here we investigate whether the use of copula-based synthetically augmented datasets improves the prediction of ML emulators for estimating the downwelling longwave radiation. Results show that bulk errors are cut by up to 75 % for the mean bias error (from 0.08 to −0.02 W m−2) and by up to 62 % (from 1.17 to 0.44 W m−2) for the mean absolute error, thus showing potential for improving the generalization of future ML emulators.


JAMIA Open ◽  
2021 ◽  
Vol 4 (1) ◽  
Author(s):  
Khaled El Emam ◽  
Lucy Mosquera ◽  
Elizabeth Jonker ◽  
Harpreet Sood

Abstract Background Concerns about patient privacy have limited access to COVID-19 datasets. Data synthesis is one approach for making such data broadly available to the research community in a privacy protective manner. Objectives To evaluate the utility of synthetic data by comparing analysis results between real and synthetic data. Methods A gradient boosted classification tree was built to predict death using Ontario’s 90 514 COVID-19 case records linked with community comorbidity, demographic, and socioeconomic characteristics. Model accuracy and relationships were evaluated, as well as privacy risks. The same model was developed on a synthesized dataset and compared to one from the original data. Results The AUROC and AUPRC for the real data model were 0.945 [95% confidence interval (CI), 0.941–0.948] and 0.34 (95% CI, 0.313–0.368), respectively. The synthetic data model had AUROC and AUPRC of 0.94 (95% CI, 0.936–0.944) and 0.313 (95% CI, 0.286–0.342), with confidence interval overlap of 45.05% and 52.02% when compared with the real data. The most important predictors of death for the real and synthetic models were, in descending order: age, days since January 1, 2020, type of exposure, and gender. The functional relationships were similar between the two datasets. The attribute disclosure risk was 0.0585, and the membership disclosure risk was low. Conclusions This synthetic dataset could be used as a proxy for the real dataset.
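The AUROC compared above can be computed, for small samples, by direct pairwise comparison of scores for positive and negative cases; the labels and scores below are invented for illustration, not the study's data:

```python
# AUROC as the probability that a randomly chosen positive case receives a
# higher score than a randomly chosen negative one (ties count as half).
# This O(n^2) form is fine for small samples; libraries use a rank-based form.
def auroc(labels, scores):
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
# 3 of the 4 positive/negative pairs are ranked correctly, so AUROC = 0.75
```

Comparing this statistic (and its confidence interval) between models trained on real and synthetic data is exactly the kind of utility check the abstract reports.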

