Standardised Metrics and Methods for Synthetic Tabular Data Evaluation

2021 ◽  
Author(s):  
Mikel Hernandez ◽  
Gorka Epelde ◽  
Ane Alberdi ◽  
Rodrigo Cilla ◽  
Debbie Rankin

Synthetic Tabular Data Generation (STDG) is a potentially valuable technology with great promise to augment real data and preserve privacy. However, prior to adoption, synthetic tabular data (STD) must be empirically assessed across three dimensions, resemblance, utility, and privacy, seeking a trade-off among them. The literature lacks standardised, objective metrics and methods for this assessment, and no organised pipeline or process for coordinating the evaluation has been identified. In this work, we therefore propose a collection of metrics and methods to evaluate STD along these dimensions, present a meaningful orchestration of them, and unify them in a single pipeline. Additionally, we present a methodology to categorise the performance of STDG approaches in each dimension. Finally, we conducted an extensive analysis and evaluation across six healthcare-related datasets and four STDG approaches to verify the usability of the proposed pipeline. The results show that the pipeline can effectively evaluate and benchmark the STD generated by one or more STDG approaches, helping the scientific community select the most suitable approaches for their data and application of interest.
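
For readers unfamiliar with these dimensions, the sketch below illustrates two widely used checks of this kind; it is not the paper's specific pipeline. It measures per-column resemblance with the Kolmogorov-Smirnov statistic and utility with a train-on-synthetic, test-on-real (TSTR) classifier. The column selection and the binary "label" target are hypothetical.

```python
# Illustrative STD checks (not the paper's pipeline): marginal resemblance
# per numeric column, and TSTR utility for a hypothetical binary label.
import pandas as pd
from scipy.stats import ks_2samp
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

def resemblance(real: pd.DataFrame, synth: pd.DataFrame) -> pd.Series:
    """KS statistic per numeric column: 0 means identical marginals."""
    return pd.Series({col: ks_2samp(real[col], synth[col]).statistic
                      for col in real.select_dtypes("number").columns})

def tstr_utility(real: pd.DataFrame, synth: pd.DataFrame,
                 target: str = "label") -> float:
    """Train a classifier on synthetic rows, score it on real rows (AUC)."""
    clf = RandomForestClassifier(random_state=0)
    clf.fit(synth.drop(columns=target), synth[target])
    scores = clf.predict_proba(real.drop(columns=target))[:, 1]
    return roc_auc_score(real[target], scores)
```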


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
João Lobo ◽  
Rui Henriques ◽  
Sara C. Madeira

Abstract Background Three-way data have gained popularity due to their increasing capacity to describe inherently multivariate and temporal events, such as biological responses, social interactions over time, urban dynamics, or complex geophysical phenomena. Triclustering, the subspace clustering of three-way data, enables the discovery of patterns corresponding to data subspaces (triclusters) with values correlated across the three dimensions (observations × features × contexts). With an increasing number of algorithms being proposed, effectively comparing them with the state of the art is paramount. These comparisons are usually performed on real data without a known ground truth, thus limiting the assessments. In this context, we propose a synthetic data generator, G-Tric, allowing the creation of synthetic datasets with configurable properties and the possibility to plant triclusters. The generator can create datasets resembling real three-way data from biomedical and social domains, with the additional advantage of providing the ground truth (the triclustering solution) as output. Results G-Tric can replicate real-world datasets and create new ones matching researchers' needs across several properties, including data type (numeric or symbolic), dimensions, and background distribution. Users can tune the patterns and structure that characterize the planted triclusters (subspaces) and how they interact (overlapping). Data quality can also be controlled by defining the amount of missing values, noise, or errors. Furthermore, a benchmark of datasets resembling real data is made available, together with the corresponding triclustering solutions (planted triclusters) and generating parameters. Conclusions Triclustering evaluation using G-Tric makes it possible to combine intrinsic and extrinsic metrics when comparing solutions, producing more reliable analyses. A set of predefined datasets, mimicking widely used three-way data and exploring crucial properties, was generated and made available, highlighting G-Tric's potential to advance the triclustering state of the art by easing the evaluation of new triclustering approaches.
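
As an illustration of the general idea, not G-Tric's actual API, the following minimal sketch plants a single constant tricluster in a numeric three-way array and keeps the planted indices as ground truth. All sizes, values, and distributions are arbitrary assumptions.

```python
# Minimal sketch of planting one tricluster in a numeric 3-way dataset
# (observations x features x contexts). G-Tric itself offers far richer
# pattern types, overlap control, and noise models.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 50, 8))            # background distribution

obs = rng.choice(100, size=10, replace=False)   # planted subspace indices
feat = rng.choice(50, size=5, replace=False)
ctx = rng.choice(8, size=3, replace=False)

# Plant a constant-valued subspace, then perturb it with light noise.
data[np.ix_(obs, feat, ctx)] = 3.0 + rng.normal(scale=0.1, size=(10, 5, 3))

# The triclustering solution (ground truth) a generator would report.
ground_truth = {"observations": obs, "features": feat, "contexts": ctx}
```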


1994 ◽  
Vol 21 (6) ◽  
pp. 1074-1080 ◽  
Author(s):  
J. Llamas ◽  
C. Diaz Delgado ◽  
M.-L. Lavertu

In this paper, an improved probabilistic method for flood analysis using the probable maximum flood, the beta function, and orthogonal Jacobi polynomials is proposed. The shape of the beta function depends on the sample's characteristics and the bounds of the phenomenon. A series of Jacobi polynomials is then used to improve the beta function, increasing its degree of convergence toward the real flood probability density function. This mathematical model has been tested using a sample of 1000 generated beta-distributed random values. Finally, some practical applications with real data series from important rivers in Quebec have been performed; the model solutions for these rivers showed the accuracy of this new method in flood frequency estimation. Key words: probable maximum flood, beta function, orthogonal polynomials, distribution function, flood frequency estimation, data generation, convergence.
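
To make the ingredients concrete, here is a hedged sketch (not the paper's exact procedure) of the two building blocks: fitting a beta density between fixed physical bounds and evaluating Jacobi polynomials for a series correction. The bound PMF_ASSUMED, the sample values, and the choice of polynomial weights are illustrative assumptions; the paper's correction coefficients are not reproduced.

```python
# Building blocks only: a bounded beta fit plus Jacobi polynomial terms.
import numpy as np
from scipy import stats
from scipy.special import eval_jacobi

PMF_ASSUMED = 5000.0                       # hypothetical upper bound (m^3/s)
floods = np.array([1200., 950., 2100., 1750., 800., 1400., 2600.])

# Fix loc/scale so the support is exactly [0, PMF]; fit only the shapes.
a, b, loc, scale = stats.beta.fit(floods, floc=0, fscale=PMF_ASSUMED)

x = np.linspace(0, PMF_ASSUMED, 200)
base_pdf = stats.beta.pdf(x, a, b, loc=loc, scale=scale)

# Jacobi polynomials P_n^(alpha,beta) are orthogonal on [-1, 1] under a
# beta-like weight, so map x onto that interval before evaluating them;
# pairing the weights with the fitted shapes (a-1, b-1) is an assumption.
t = 2 * x / PMF_ASSUMED - 1
P2 = eval_jacobi(2, a - 1, b - 1, t)       # degree-2 term of the series
```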


2021 ◽  
Author(s):  
Jennifer E. Bauer ◽  
Imran A. Rahman

Imaging and visualizing fossils in three dimensions with tomography is a powerful approach in paleontology. Here, the authors introduce select destructive and non-destructive tomographic techniques that are routinely applied to fossils and review how this work has improved our understanding of the anatomy, function, taphonomy, and phylogeny of fossil echinoderms. Building on this, the Element discusses how new imaging and computational methods hold great promise for addressing long-standing paleobiological questions. Future efforts to improve the accessibility of the underlying data will be key to realizing the potential of this virtual world of paleontology.


2015 ◽  
Vol 31 ◽  
pp. 23 ◽  
Author(s):  
Evelyn Sample ◽  
Marije Michel

Studying task repetition for adult and young foreign language learners of English (EFL) has received growing interest in the recent task-based literature (Bygate, 2009; Hawkes, 2012; Mackey, Kanganas, & Oliver, 2007; Pinter, 2007b). Earlier work suggests that second language (L2) learners benefit from repeating the same or a slightly different task. Task repetition has been shown to enhance fluency and may also add to the complexity or accuracy of production. However, few investigations have taken a closer look at the underlying relationships between the three dimensions of task performance: complexity, accuracy, and fluency (CAF). Using Skehan's (2009) trade-off hypothesis as an explanatory framework, our study aims to fill this gap by investigating interactions among CAF measures. We report on the repeated performances of an oral spot-the-difference task by six 9-year-old EFL learners. Mirroring earlier work, our data reveal significant increases in fluency through task repetition. Correlational analyses show that in initial performances, gains in one dimension come at the expense of another; by the third performance, however, these trade-off effects disappear. Further qualitative analyses support our interpretation that, with growing task familiarity, students are able to focus their attention on all three CAF dimensions simultaneously.


2021 ◽  
Author(s):  
Cemanur Aydinalp ◽  
Sulayman Joof ◽  
Mehmet Nuri Akinci ◽  
Ibrahim Akduman ◽  
Tuba Yilmaz

In this manuscript, we propose a new technique for determining Debye parameters, which represent the dielectric properties of materials, from the reflection coefficient response of open-ended coaxial probes. The method retrieves the Debye parameters using a deep learning model trained on numerically generated data. Unlike real data, synthetically generated input and output data can represent a wide variety of materials and can be produced rapidly. Furthermore, the proposed method provides design flexibility and can be applied to any desired probe with the intended dimensions and material. We experimentally verified the designed deep learning model using reflection coefficients measured with the probe terminated by five different standard liquids, four mixtures, and a gel-like material, and compared the results with the literature. The obtained mean percent relative error ranged from 1.21±0.06 to 10.89±0.08. Our work also presents a large-scale statistical verification of the proposed dielectric property retrieval technique.
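
As a rough illustration of the training idea, not the authors' network or probe model, the sketch below generates synthetic spectra from the single-pole Debye relation ε(ω) = ε∞ + (εs − ε∞)/(1 + jωτ) and fits a small neural regressor to recover the parameters. The paper instead maps measured probe reflection coefficients, which requires a probe model omitted here; all parameter ranges and network sizes are assumptions.

```python
# Train a regressor on synthetically generated Debye spectra so that it
# recovers (eps_inf, eps_s, tau) from a sampled frequency response.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
freqs = 2 * np.pi * np.linspace(0.5e9, 6e9, 64)   # angular frequencies

def debye(eps_inf, eps_s, tau):
    spec = eps_inf + (eps_s - eps_inf) / (1 + 1j * freqs * tau)
    return np.concatenate([spec.real, spec.imag])  # flatten to features

# Synthetic training set: random but physically plausible Debye triples.
params = np.column_stack([rng.uniform(2, 6, 5000),          # eps_inf
                          rng.uniform(10, 80, 5000),        # eps_s
                          rng.uniform(5e-12, 5e-11, 5000)]) # tau (s)
X = np.array([debye(*p) for p in params])
y = params * [1, 1, 1e12]   # rescale tau to picoseconds so targets are O(1)

model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=300,
                     random_state=0)
model.fit(X, y)
```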


Computation ◽  
2019 ◽  
Vol 7 (3) ◽  
pp. 36 ◽  
Author(s):  
Keskin ◽  
Alsoy Altinkaya

Computational modeling of membrane materials is a rapidly growing field that investigates the properties of membrane materials beyond the limits of experimental techniques and complements experimental membrane studies by providing insights at the atomic level. In this study, we first reviewed the fundamental approaches employed to describe the gas permeability/selectivity trade-off of polymer membranes and then addressed the great promise of mixed matrix membranes (MMMs) to overcome this trade-off. We then reviewed the current approaches for predicting gas permeation through MMMs, focusing specifically on MMMs composed of metal organic frameworks (MOFs). We reviewed computational tools, such as atomically detailed molecular simulations, that can predict the gas separation performance of MOF-based MMMs prior to experimental investigation, and discussed new computational methods that can provide information about the compatibility between the MOF and the polymer of the MMM. We finally addressed the opportunities and challenges of using computational studies to analyze the barriers that must be overcome to advance the application of MOF-based membranes.
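
One classic permeation model for MMMs is the Maxwell model; the abstract does not name the specific models reviewed, so the sketch below is a hedged example of this family. It estimates the effective permeability of a polymer (continuous phase, P_c) loaded with filler particles (dispersed phase, P_d) at volume fraction phi; the example values are hypothetical.

```python
# Maxwell-model estimate of MMM permeability, valid for dilute loadings.
def maxwell_permeability(P_c: float, P_d: float, phi: float) -> float:
    """Effective permeability of a filler/polymer mixed matrix membrane."""
    num = P_d + 2 * P_c - 2 * phi * (P_c - P_d)
    den = P_d + 2 * P_c + phi * (P_c - P_d)
    return P_c * num / den

# Example: a 10-Barrer polymer with a fast, 100-Barrer MOF filler at 20%
# loading (hypothetical values).
print(maxwell_permeability(10.0, 100.0, 0.20))   # ~15.3 Barrer
```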


2013 ◽  
Vol 10 (04) ◽  
pp. 1350015 ◽  
Author(s):  
MARK J. AHN ◽  
ANNE S. YORK ◽  
SO YOUNG SOHN ◽  
PAYAM BENYAMINI

Disruptive technology platforms from emerging companies hold great promise for exploiting innovation, but often face legitimacy hurdles due to their liability of newness. Nascent firms must learn new roles with limited precedent, and establish ties with an environment that may not fully understand or value their existence. Using a legitimacy-based lens in the context of the biotechnology industry, we posit a sequential construct — cognitive, regulative, and normative legitimacy — to evaluate emergent technology platforms. Our model of biotechnology platform emergence may provide insights for understanding how breakthroughs achieve legitimacy in the scientific community, mobilize resources and talent, and attain commercial success.


Author(s):  
Esmaeil Keshavarz ◽  
Abbas Shoul

Trade-off problems concentrate on balancing the main parameters of a project: completion time, total cost, and the quality of activities. In this study, the project time-cost-quality trade-off problem is formulated and solved from a new standpoint. For this purpose, the completion time and crash cost of the project are expressed as fuzzy goals, and the dependency between each activity's implementation time and its execution quality is described by a fuzzy number. The overall quality of project execution is defined as the minimum execution quality among the project activities, which should be maximized. Based on realistic assumptions, a three-objective programming problem associated with the time-cost-quality trade-off problem is formulated; then, with the aim of identifying a fair and appropriate trade-off, the problem is reformulated as a single-objective linear program using a fuzzy decision-making methodology. Generating a final preferred solution, rather than a set of Pareto-optimal solutions, and having a reasonable interpretation are the two most important advantages of the proposed approach. To demonstrate the practical performance of the proposed models and approach, a time-cost-quality trade-off problem for a project with real data is solved and analyzed.
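
A common way to obtain such a single-objective reformulation is the max-min approach with linear membership functions (in the style of Zimmermann's fuzzy programming); the toy sketch below shows the pattern. All time/cost bounds and the crashing constraint are hypothetical, not taken from the paper.

```python
# Max-min sketch: maximise the overall satisfaction level lam subject to
# linear membership functions for two fuzzy goals (time and cost).
from scipy.optimize import linprog

T_best, T_worst = 20.0, 30.0    # fully satisfying vs. barely tolerable time
C_best, C_worst = 100.0, 150.0  # fully satisfying vs. barely tolerable cost

# Variables x = [lam, time, cost]; linprog minimises, so minimise -lam.
c = [-1.0, 0.0, 0.0]

# mu_time = (T_worst - time)/(T_worst - T_best) >= lam is linearised as
# lam*(T_worst - T_best) + time <= T_worst (likewise for cost); the last
# row is a toy crashing constraint, time >= 35 - 0.1*cost.
A_ub = [[T_worst - T_best, 1.0, 0.0],
        [C_worst - C_best, 0.0, 1.0],
        [0.0, -1.0, -0.1]]
b_ub = [T_worst, C_worst, -35.0]

res = linprog(c, A_ub=A_ub, b_ub=b_ub,
              bounds=[(0.0, 1.0), (T_best, T_worst), (C_best, C_worst)])
lam, time, cost = res.x   # ~0.67, ~23.3, ~116.7 under these toy numbers
```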


2015 ◽  
Vol 282 (1798) ◽  
pp. 20141069 ◽  
Author(s):  
Carmen Lía Murall ◽  
Chris T. Bauch ◽  
Troy Day

The human papillomavirus (HPV) vaccines hold great promise for preventing several cancers caused by HPV infections. Yet little attention has been given to whether HPV could respond evolutionarily to the new selection pressures imposed by the novel immune response the vaccine creates. Here, we present and theoretically validate a mechanism by which the vaccine alters the transmission–recovery trade-off that constrains HPV's virulence, such that higher oncogene expression is favoured. With a high oncogene expression strategy, the virus is able to increase its viral load and infected cell population before clearance by the vaccine, thus improving its chances of transmission. This new rapid cell-proliferation strategy is able to circulate between hosts with medium to high turnover rates of sexual partners. We also discuss the importance of better quantifying the duration of challenge infections and the degree to which a vaccinated host can shed virus. The generality of the models presented here suggests a wider applicability of this mechanism, and thus highlights the need to investigate viral oncogenicity from an evolutionary perspective.
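
The following is a purely illustrative toy of the trade-off argument, not the authors' model: if the transmission benefit rises with oncogene expression while infectious duration falls more steeply in unvaccinated hosts than in vaccinated ones (where clearance is fast regardless), the expression level that maximizes transmission potential shifts upward under vaccination. All functional forms and constants are invented for illustration.

```python
# Toy transmission-recovery trade-off: compare the oncogene expression
# level e that maximizes (transmission rate) x (infectious duration).
import numpy as np

e = np.linspace(0.1, 5, 500)              # oncogene expression level
beta = e / (1 + e)                        # saturating transmission gain
D_naive = 1 / (0.2 + 0.5 * e)             # naive hosts: high e is costly
D_vax = 1 / (2.0 + 0.1 * e)               # vaccinated: fast clearance anyway

print("optimal e, naive hosts:     ", e[np.argmax(beta * D_naive)])  # ~0.63
print("optimal e, vaccinated hosts:", e[np.argmax(beta * D_vax)])    # ~4.47
```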

