Fake It Till You Make It: Guidelines for Effective Synthetic Data Generation

Fida K. Dankar; Mahmoud Ibrahim

doi:10.3390/app11052158

Fake It Till You Make It: Guidelines for Effective Synthetic Data Generation

Applied Sciences ◽

10.3390/app11052158 ◽

2021 ◽

Vol 11 (5) ◽

pp. 2158

Author(s):

Fida K. Dankar ◽

Mahmoud Ibrahim

Keyword(s):

Machine Learning ◽

Propensity Score ◽

Real Life ◽

Synthetic Data ◽

Supervised Machine Learning ◽

Data Generation ◽

Learning Models ◽

Healthcare Data ◽

Synthetic Data Generation ◽

Machine Learning Models

Synthetic data provides a privacy protecting mechanism for the broad usage and sharing of healthcare data for secondary purposes. It is considered a safe approach for the sharing of sensitive data as it generates an artificial dataset that contains no identifiable information. Synthetic data is increasing in popularity with multiple synthetic data generators developed in the past decade, yet its utility is still a subject of research. This paper is concerned with evaluating the effect of various synthetic data generation and usage settings on the utility of the generated synthetic data and its derived models. Specifically, we investigate (i) the effect of data pre-processing on the utility of the synthetic data generated, (ii) whether tuning should be applied to the synthetic datasets when generating supervised machine learning models, and (iii) whether sharing preliminary machine learning results can improve the synthetic data models. Lastly, (iv) we investigate whether one utility measure (Propensity score) can predict the accuracy of the machine learning models generated from the synthetic data when employed in real life. We use two popular measures of synthetic data utility, propensity score and classification accuracy, to compare the different settings. We adopt a recent mechanism for the calculation of propensity, which looks carefully into the choice of model for the propensity score calculation. Accordingly, this paper takes a new direction with investigating the effect of various data generation and usage settings on the quality of the generated data and its ensuing models. The goal is to inform on the best strategies to follow when generating and using synthetic data.
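As a rough illustration of the propensity-score utility measure mentioned in the abstract above, the sketch below computes the propensity mean squared error (pMSE) in its commonly used form. A classifier is trained to distinguish real from synthetic records, and pMSE is the mean squared deviation of its predicted probabilities from the share of synthetic records in the pooled data. The function name `pmse` and the example scores are hypothetical, and this is not necessarily the exact calculation the paper adopts.

```python
def pmse(propensity_scores, synthetic_share=0.5):
    """Mean squared deviation of propensity scores from the synthetic share.

    propensity_scores: predicted probabilities that each record (real and
        synthetic pooled together) is synthetic, from any classifier.
    synthetic_share: fraction of synthetic records in the pooled data.
    """
    if not propensity_scores:
        raise ValueError("propensity_scores must be non-empty")
    n = len(propensity_scores)
    return sum((p - synthetic_share) ** 2 for p in propensity_scores) / n


# Hypothetical scores: a classifier that cannot tell real from synthetic
# outputs values at the synthetic share, so pMSE is 0 (high utility).
indistinguishable = [0.5, 0.5, 0.5, 0.5]
# A classifier that separates the two perfectly reaches the maximum pMSE
# for a pooled dataset that is half synthetic.
separable = [0.0, 0.0, 1.0, 1.0]

print(pmse(indistinguishable))
print(pmse(separable))
```

Lower values suggest that the synthetic records are harder to tell apart from the real ones. The choice of classifier that produces the scores is left open here, since the abstract notes that this choice matters.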


Towards synthetic data generation for machine learning models in weather and climate

10.5194/egusphere-egu2020-20132 ◽

2020 ◽

Author(s):

David Meyer

Keyword(s):

Machine Learning ◽

Computer Vision ◽

Climate Models ◽

Synthetic Data ◽

Real Data ◽

Data Generation ◽

Learning Models ◽

Synthetic Data Generation ◽

Weather And Climate ◽

Machine Learning Models

The use of real data for training machine learning (ML) models is often a cause of major limitations. For example, real data may be (a) representative of only a subset of situations and domains, (b) expensive to produce, or (c) limited to specific individuals due to licensing restrictions. Although the use of synthetic data is becoming increasingly popular in computer vision, ML models used in weather and climate models still rely on large real datasets. Here we present some recent work towards the generation of synthetic data for weather and climate applications and outline some of the major challenges and limitations encountered.


Improving quality prediction in radial-axial ring rolling using a semi-supervised approach and generative adversarial networks for synthetic data generation

Production Engineering ◽

10.1007/s11740-021-01075-x ◽

2021 ◽

Author(s):

Simon Fahle ◽

Thomas Glaser ◽

Andreas Kneißler ◽

Bernd Kuhlenkötter

Keyword(s):

Machine Learning ◽

Synthetic Data ◽

Ring Rolling ◽

Supervised Machine Learning ◽

Generative Adversarial Networks ◽

Quality Prediction ◽

Data Generation ◽

Adversarial Networks ◽

Synthetic Data Generation ◽

Axial Ring

As artificial intelligence and especially machine learning have gained a lot of attention during the last few years, methods and models have been improving and are becoming easily applicable. This possibility was used to develop a quality prediction system using supervised machine learning methods in the form of time series classification models to predict ovality in radial-axial ring rolling. Different preprocessing steps and model implementations have been used to improve quality prediction. A semi-supervised approach is used to improve the prediction and to analyze to what extent it can improve current research in machine learning for quality prediction. Moreover, first research steps are taken towards synthetic data generation within the radial-axial ring rolling domain using generative adversarial networks.


Application of Machine Learning Techniques to Predict Binding Affinity for Drug Targets: A Study of Cyclin-Dependent Kinase 2

Current Medicinal Chemistry ◽

10.2174/2213275912666191102162959 ◽

2020 ◽

Vol 28 (2) ◽

pp. 253-265 ◽

Cited By ~ 3

Author(s):

Gabriela Bitencourt-Ferreira ◽

Amauri Duarte da Silva ◽

Walter Filgueira de Azevedo

Keyword(s):

Machine Learning ◽

Binding Affinity ◽

Predictive Performance ◽

Supervised Machine Learning ◽

Machine Learning Techniques ◽

Scoring Functions ◽

Cyclin Dependent Kinase ◽

Learning Models ◽

Learning Techniques ◽

Machine Learning Models

Background: The elucidation of the structure of cyclin-dependent kinase 2 (CDK2) made it possible to develop targeted scoring functions for virtual screening aimed at identifying new inhibitors for this enzyme. CDK2 is a protein target for the development of drugs intended to modulate cell-cycle progression and control. Such drugs have potential anticancer activities. Objective: Our goal here is to review recent applications of machine learning methods to predict ligand-binding affinity for protein targets. To assess the predictive performance of classical scoring functions and targeted scoring functions, we focused our analysis on CDK2 structures. Methods: We have experimental structural data for hundreds of binary complexes of CDK2 with different ligands, many of them with inhibition constant information. We investigate here computational methods to calculate the binding affinity of CDK2 through classical scoring functions and machine-learning models. Results: Analysis of the predictive performance of classical scoring functions available in docking programs such as Molegro Virtual Docker, AutoDock4, and AutoDock Vina indicated that these methods failed to predict binding affinity with significant correlation with experimental data. Targeted scoring functions developed through supervised machine learning techniques showed a significant correlation with experimental data. Conclusion: Here, we described the application of supervised machine learning techniques to generate a scoring function to predict binding affinity. Machine learning models showed superior predictive performance when compared with classical scoring functions. Analysis of the computational models obtained through machine learning could capture essential structural features responsible for binding affinity against CDK2.


Machine learning based Synthetic Data Generation using Iterative Regression Analysis

2020 4th International Conference on Electronics, Communication and Aerospace Technology (ICECA) ◽

10.1109/iceca49313.2020.9297491 ◽

2020 ◽

Author(s):

Sanskar Shah ◽

Darshan Gandhi ◽

Jil Kothari

Keyword(s):

Machine Learning ◽

Regression Analysis ◽

Synthetic Data ◽

Data Generation ◽

Synthetic Data Generation


A study of micromanufacturing process fingerprints in micro-injection moulding for machine learning and Industry 4.0 applications

The International Journal of Advanced Manufacturing Technology ◽

10.1007/s00170-021-07252-7 ◽

2021 ◽

Author(s):

Mert Gülçür ◽

Ben Whiteside

Keyword(s):

Machine Learning ◽

Linear Regression ◽

Multiple Linear Regression ◽

Injection Moulding ◽

Industry 4.0 ◽

Supervised Machine Learning ◽

Process Conditions ◽

Learning Models ◽

Micro Injection Moulding ◽

Machine Learning Models

This paper discusses micromanufacturing process quality proxies called “process fingerprints” in micro-injection moulding for establishing in-line quality assurance and machine learning models for Industry 4.0 applications. Process fingerprints that we present in this study are purely physical proxies of the product quality and need tangible rationale regarding their selection criteria such as sensitivity, cost-effectiveness, and robustness. Proposed methods and selection reasons for process fingerprints are also justified by analysing the temporally collected data with respect to the microreplication efficiency. Extracted process fingerprints were also used in a multiple linear regression scenario where they bring actionable insights for creating traceable and cost-effective supervised machine learning models in challenging micro-injection moulding environments. The multiple linear regression model demonstrated 84% accuracy in predicting the quality of the process, which is significant as far as the extreme process conditions and product features are concerned.


Prediction of Graduate Admission using Multiple Supervised Machine Learning Models

2020 SoutheastCon ◽

10.1109/southeastcon44009.2020.9249747 ◽

2020 ◽

Author(s):

Zain Bitar ◽

Amjed Al-Mousa

Keyword(s):

Machine Learning ◽

Supervised Machine Learning ◽

Learning Models ◽

Machine Learning Models


Comparison of the Performance of Machine Learning Algorithms in Predicting Heart Disease

Frontiers in Health Informatics ◽

10.30699/fhi.v10i1.349 ◽

2021 ◽

Vol 10 (1) ◽

pp. 99

Author(s):

Sajad Yousefi

Keyword(s):

Machine Learning ◽

Logistic Regression ◽

Heart Disease ◽

Decision Tree ◽

Roc Curve ◽

Machine Learning Algorithms ◽

Supervised Machine Learning ◽

Learning Models ◽

Algorithm Performance ◽

Machine Learning Models

Introduction: Heart disease is often associated with conditions such as clogged arteries due to sediment accumulation, which causes chest pain and heart attack. Many people die of heart disease annually. Most countries have a shortage of cardiovascular specialists, and thus a significant percentage of misdiagnosis occurs. Hence, predicting this disease is a serious issue. Using machine learning models applied to a multidimensional dataset, this article aims to find the most efficient and accurate machine learning models for disease prediction. Material and Methods: Several algorithms were utilized to predict heart disease, among which the supervised Decision Tree, Random Forest and KNN algorithms are the most frequently mentioned. The algorithms are applied to a dataset of 294 samples taken from the UCI repository. The dataset includes heart disease features. To enhance algorithm performance, these features are analyzed, and feature importance scores and cross validation are considered. Results: The performance of the algorithms is compared, with each model evaluated on its ROC curve and on criteria such as accuracy, precision, sensitivity and F1 score. As a result of the evaluation, accuracy and AUC ROC are 83% and 99% respectively for the Decision Tree algorithm. The Logistic Regression algorithm, with accuracy and AUC ROC of 88% and 91% respectively, performs better than the other algorithms. Therefore, these techniques can be useful for physicians to predict heart disease in patients and prescribe correctly. Conclusion: Machine learning techniques can be used in medicine to analyze data collections related to a disease and to predict it. The area under the ROC curve and other evaluation criteria for several machine learning classification algorithms were compared to determine the most appropriate classifier for heart disease prediction. As a result of the evaluation, better performance was observed in both the Decision Tree and Logistic Regression models.
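The evaluation criteria named in this abstract (accuracy, precision, sensitivity and F1 score) all follow from the four cells of a binary confusion matrix. The sketch below shows those standard definitions; the function name `classification_metrics` and the example counts are hypothetical and are not taken from the study.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute standard binary classification metrics from confusion counts.

    tp/fp/fn/tn: true positives, false positives, false negatives,
    true negatives. Ratios with an empty denominator are reported as 0.0.
    """
    total = tp + fp + fn + tn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Sensitivity is also called recall or the true positive rate.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + sensitivity
    # F1 is the harmonic mean of precision and sensitivity.
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "sensitivity": sensitivity,
        "f1": f1,
    }


# Hypothetical counts for a heart disease classifier on 100 held-out samples.
for name, value in classification_metrics(tp=40, fp=10, fn=10, tn=40).items():
    print(f"{name}: {value:.2f}")
```

AUC ROC is not covered here because it requires the model's ranked scores rather than a single confusion matrix.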


A machine learning approach to inform developmental milestone achievement for children with autism (Preprint)

10.2196/preprints.29242 ◽

2021 ◽

Author(s):

Munirul M. Haque ◽

Masud Rabbani ◽

Dipranjan Das Dipal ◽

Md Ishrak Islam Zarif ◽

Anik Iqbal ◽

...

Keyword(s):

Machine Learning ◽

Autism Spectrum Disorder ◽

Children With Autism ◽

Autism Spectrum ◽

The Other ◽

Supervised Machine Learning ◽

Learning Models ◽

Children With Asd ◽

Socio Demographic Factors ◽

Machine Learning Models

BACKGROUND Care for children with autism spectrum disorder (ASD) can be challenging for families and medical care systems. This is especially true in low- and middle-income countries (LMICs) like Bangladesh. To improve family-practitioner communication and developmental monitoring of children with ASD, [spell out] (mCARE) was developed. Within this study, mCARE was used to track child milestone achievement and family socio-demographic assets to inform mCARE feasibility/scalability and family-asset informed practitioner recommendations. OBJECTIVE The objectives of this paper are three-fold. First, document how mCARE can be used to monitor child milestone achievement. Second, demonstrate how advanced machine learning models can inform our understanding of milestone achievement in children with ASD. Third, describe family/child socio-demographic factors that are associated with earlier milestone achievement in children with ASD (across five machine learning models). METHODS Using mCARE-collected data, this study assessed milestone achievement in 300 children with ASD from Bangladesh. In this study, we used four supervised machine learning (ML) algorithms (Decision Tree, Logistic Regression, k-Nearest Neighbors, Artificial Neural Network) and one unsupervised machine learning algorithm (K-means Clustering) to build models of milestone achievement based on family/child socio-demographic details. For analyses, the sample was randomly divided in half to train the ML models, and their accuracy was then estimated on the other half of the sample. Each model was specified for the following milestones: Brushes teeth, Asks to use the toilet, Urinates in the toilet or potty, and Buttons large buttons. RESULTS This study aimed to find a suitable machine learning algorithm for milestone prediction/achievement for children with ASD using family/child socio-demographic characteristics. 
For Brushes teeth, the three supervised machine learning models met or exceeded an accuracy of 95%, with Logistic Regression, KNN, and ANN as the most robust socio-demographic predictors. For Asks to use toilet, 84.00% accuracy was achieved with the KNN and ANN models. For these models, the family socio-demographic predictors of “family expenditure” and “parents’ age” accounted for most of the model variability. The last two parameters, Urinates in toilet or potty and Buttons large buttons, had an accuracy of 91.00% and 76.00%, respectively, in ANN. Overall, the ANN had higher accuracy (above ~80% on average) than the other algorithms for all the parameters. Across the models and milestones, “family expenditure”, “family size/ type”, “living places” and “parent’s age and occupation” were the most influential family/child socio-demographic factors. CONCLUSIONS mCARE was successfully deployed in an LMIC (i.e., Bangladesh), allowing parents and care-practitioners a mechanism to share detailed information on child milestone achievement. Using advanced modeling techniques, this study demonstrates how family/child socio-demographic elements can inform child milestone achievement. Specifically, families with fewer socio-demographic resources reported later milestone attainment. Developmental science theories highlight how family/systems can directly influence child development, and this study provides a clear link between family resources and child developmental progress. Clinical implications for this work could include supporting the larger family system to improve child milestone achievement. CLINICALTRIAL We obtained IRB approval from the Marquette University Institutional Review Board on July 9, 2020, with the protocol number HR-1803022959, titled “MOBILE-BASED CARE FOR CHILDREN WITH AUTISM SPECTRUM DISORDER USING REMOTE EXPERIENCE SAMPLING METHOD (MCARE)”, for recruiting a total of 316 subjects, of which we recruited 300. 
(A detailed description of participants is given in the Methods section.)
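The evaluation protocol described in the Methods of this abstract (randomly divide the sample in half, train on one half, estimate accuracy on the other) can be sketched as follows. This is a minimal illustration only: the helper names `split_in_half` and `majority_baseline_accuracy`, the placeholder majority-class model, and the made-up records are all assumptions and stand in for the study's actual models and data.

```python
import random
from collections import Counter


def split_in_half(records, seed=0):
    """Shuffle a copy of the records and return (train_half, test_half)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]


def majority_baseline_accuracy(train, test):
    """Placeholder model: predict the most common training label.

    train/test: lists of (features, label) pairs. A real analysis would
    fit a classifier on `train` and score its predictions on `test`.
    """
    majority = Counter(label for _, label in train).most_common(1)[0][0]
    return sum(1 for _, label in test if label == majority) / len(test)


# Hypothetical data: 300 records, each a (features, milestone_achieved) pair.
records = [((i,), i % 3 == 0) for i in range(300)]
train, test = split_in_half(records, seed=42)

print(len(train), len(test))
print(majority_baseline_accuracy(train, test))
```

Fixing the seed makes the split reproducible. A majority-class baseline like this one is a useful floor against which reported accuracies can be read.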


Classification and Success Investigation of Biomedical Data Sets Using Supervised Machine Learning Models

2019 3rd International Symposium on Multidisciplinary Studies and Innovative Technologies (ISMSIT) ◽

10.1109/ismsit.2019.8932734 ◽

2019 ◽

Author(s):

Sarmad N. Mohammed ◽

Mehmet Serdar Guzel ◽

Erkan Bostanci

Keyword(s):

Machine Learning ◽

Supervised Machine Learning ◽

Data Sets ◽

Biomedical Data ◽

Learning Models ◽

Machine Learning Models


Mechanistic models versus machine learning, a fight worth fighting for the biological community?

Biology Letters ◽

10.1098/rsbl.2017.0660 ◽

2018 ◽

Vol 14 (5) ◽

pp. 20170660 ◽

Cited By ~ 59

Author(s):

Ruth E. Baker ◽

Jose-Maria Peña ◽

Jayaratnam Jayamohan ◽

Antoine Jérusalem

Keyword(s):

Machine Learning ◽

Disease Progression ◽

Royal Society ◽

Mechanistic Models ◽

Learning Tools ◽

Data Generation ◽

Learning Models ◽

Input Output ◽

Correlation Studies ◽

Machine Learning Models

Ninety per cent of the world's data have been generated in the last 5 years (Machine learning: the power and promise of computers that learn by example. Report no. DES4702. Issued April 2017. Royal Society). A small fraction of these data is collected with the aim of validating specific hypotheses. These studies are led by the development of mechanistic models focused on the causality of input–output relationships. However, the vast majority is aimed at supporting statistical or correlation studies that bypass the need for causality and focus exclusively on prediction. Along these lines, there has been a vast increase in the use of machine learning models, in particular in the biomedical and clinical sciences, to try and keep pace with the rate of data generation. Recent successes now beg the question of whether mechanistic models are still relevant in this area. Said otherwise, why should we try to understand the mechanisms of disease progression when we can use machine learning tools to directly predict disease outcome?
