Standardised Metrics and Methods for Synthetic Tabular Data Evaluation

2021 ◽  
Author(s):  
Mikel Hernandez ◽  
Gorka Epelde ◽  
Ane Alberdi ◽  
Rodrigo Cilla ◽  
Debbie Rankin

Synthetic Tabular Data Generation (STDG) is a potentially valuable technology with great promise to augment real data and preserve privacy. However, prior to adoption, synthetic tabular data (STD) must be empirically assessed across three dimensions, resemblance, utility, and privacy, seeking a trade-off among them. The literature lacks standardised, objective metrics and methods for this assessment, and no organised pipeline or process for coordinating the evaluation has been identified. In this work, we therefore propose a collection of metrics and methods to evaluate STD along these dimensions, present a meaningful orchestration of them, and unify them in a single pipeline. Additionally, we present a methodology to categorise the performance of STDG approaches in each dimension. Finally, we conducted an extensive analysis and evaluation across six healthcare-related datasets and four STDG approaches to verify the usability of the proposed pipeline. The results show that the pipeline can effectively evaluate and benchmark the STD generated by one or more STDG approaches, helping the scientific community select the most suitable approaches for their data and application of interest.
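
For readers unfamiliar with these dimensions, the sketch below illustrates two widely used checks of this kind; it is not the paper's specific pipeline. It measures per-column resemblance with the Kolmogorov-Smirnov statistic and utility with a train-on-synthetic, test-on-real (TSTR) classifier. The column selection and the binary "label" target are hypothetical.

```python
# Illustrative STD checks (not the paper's pipeline): marginal resemblance
# per numeric column, and TSTR utility for a hypothetical binary label.
import pandas as pd
from scipy.stats import ks_2samp
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

def resemblance(real: pd.DataFrame, synth: pd.DataFrame) -> pd.Series:
    """KS statistic per numeric column: 0 means identical marginals."""
    return pd.Series({col: ks_2samp(real[col], synth[col]).statistic
                      for col in real.select_dtypes("number").columns})

def tstr_utility(real: pd.DataFrame, synth: pd.DataFrame,
                 target: str = "label") -> float:
    """Train a classifier on synthetic rows, score it on real rows (AUC)."""
    clf = RandomForestClassifier(random_state=0)
    clf.fit(synth.drop(columns=target), synth[target])
    scores = clf.predict_proba(real.drop(columns=target))[:, 1]
    return roc_auc_score(real[target], scores)
```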


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
João Lobo ◽  
Rui Henriques ◽  
Sara C. Madeira

Abstract Background Three-way data have gained popularity due to their increasing capacity to describe inherently multivariate and temporal events, such as biological responses, social interactions over time, urban dynamics, or complex geophysical phenomena. Triclustering, the subspace clustering of three-way data, enables the discovery of patterns corresponding to data subspaces (triclusters) with values correlated across the three dimensions (observations × features × contexts). With an increasing number of algorithms being proposed, effectively comparing them with the state of the art is paramount. These comparisons are usually performed on real data without a known ground truth, thus limiting the assessments. In this context, we propose a synthetic data generator, G-Tric, allowing the creation of synthetic datasets with configurable properties and the possibility to plant triclusters. The generator can create datasets resembling real three-way data from biomedical and social domains, with the additional advantage of providing the ground truth (the triclustering solution) as output. Results G-Tric can replicate real-world datasets and create new ones matching researchers' needs across several properties, including data type (numeric or symbolic), dimensions, and background distribution. Users can tune the patterns and structure that characterize the planted triclusters (subspaces) and how they interact (overlapping). Data quality can also be controlled by defining the amount of missing values, noise, or errors. Furthermore, a benchmark of datasets resembling real data is made available, together with the corresponding triclustering solutions (planted triclusters) and generating parameters. Conclusions Triclustering evaluation using G-Tric makes it possible to combine intrinsic and extrinsic metrics when comparing solutions, producing more reliable analyses. A set of predefined datasets, mimicking widely used three-way data and exploring crucial properties, was generated and made available, highlighting G-Tric's potential to advance the triclustering state of the art by easing the evaluation of new triclustering approaches.
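
As an illustration of the general idea, not G-Tric's actual API, the following minimal sketch plants a single constant tricluster in a numeric three-way array and keeps the planted indices as ground truth. All sizes, values, and distributions are arbitrary assumptions.

```python
# Minimal sketch of planting one tricluster in a numeric 3-way dataset
# (observations x features x contexts). G-Tric itself offers far richer
# pattern types, overlap control, and noise models.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 50, 8))            # background distribution

obs = rng.choice(100, size=10, replace=False)   # planted subspace indices
feat = rng.choice(50, size=5, replace=False)
ctx = rng.choice(8, size=3, replace=False)

# Plant a constant-valued subspace, then perturb it with light noise.
data[np.ix_(obs, feat, ctx)] = 3.0 + rng.normal(scale=0.1, size=(10, 5, 3))

# The triclustering solution (ground truth) a generator would report.
ground_truth = {"observations": obs, "features": feat, "contexts": ctx}
```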


1994 ◽  
Vol 21 (6) ◽  
pp. 1074-1080 ◽  
Author(s):  
J. Llamas ◽  
C. Diaz Delgado ◽  
M.-L. Lavertu

In this paper, an improved probabilistic method for flood analysis using the probable maximum flood, the beta function, and orthogonal Jacobi polynomials is proposed. The shape of the beta function depends on the sample's characteristics and the bounds of the phenomenon. A series of Jacobi polynomials is then used to improve the beta function, increasing its degree of convergence toward the real flood probability density function. This mathematical model has been tested using a sample of 1000 generated beta-distributed random values. Finally, some practical applications with real data series from important rivers in Quebec have been performed; the model solutions for these rivers showed the accuracy of this new method in flood frequency estimation. Key words: probable maximum flood, beta function, orthogonal polynomials, distribution function, flood frequency estimation, data generation, convergence.
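
To make the ingredients concrete, here is a hedged sketch (not the paper's exact procedure) of the two building blocks: fitting a beta density between fixed physical bounds and evaluating Jacobi polynomials for a series correction. The bound PMF_ASSUMED, the sample values, and the choice of polynomial weights are illustrative assumptions; the paper's correction coefficients are not reproduced.

```python
# Building blocks only: a bounded beta fit plus Jacobi polynomial terms.
import numpy as np
from scipy import stats
from scipy.special import eval_jacobi

PMF_ASSUMED = 5000.0                       # hypothetical upper bound (m^3/s)
floods = np.array([1200., 950., 2100., 1750., 800., 1400., 2600.])

# Fix loc/scale so the support is exactly [0, PMF]; fit only the shapes.
a, b, loc, scale = stats.beta.fit(floods, floc=0, fscale=PMF_ASSUMED)

x = np.linspace(0, PMF_ASSUMED, 200)
base_pdf = stats.beta.pdf(x, a, b, loc=loc, scale=scale)

# Jacobi polynomials P_n^(alpha,beta) are orthogonal on [-1, 1] under a
# beta-like weight, so map x onto that interval before evaluating them;
# pairing the weights with the fitted shapes (a-1, b-1) is an assumption.
t = 2 * x / PMF_ASSUMED - 1
P2 = eval_jacobi(2, a - 1, b - 1, t)       # degree-2 term of the series
```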


2021 ◽  
Author(s):  
Jennifer E. Bauer ◽  
Imran A. Rahman

Imaging and visualizing fossils in three dimensions with tomography is a powerful approach in paleontology. Here, the authors introduce select destructive and non-destructive tomographic techniques that are routinely applied to fossils and review how this work has improved our understanding of the anatomy, function, taphonomy, and phylogeny of fossil echinoderms. Building on this, the Element discusses how new imaging and computational methods hold great promise for addressing long-standing paleobiological questions. Future efforts to improve the accessibility of the underlying data will be key to realizing the potential of this virtual world of paleontology.


2015 ◽  
Vol 31 ◽  
pp. 23 ◽  
Author(s):  
Evelyn Sample ◽  
Marije Michel

Studying task repetition for adult and young foreign language learners of English (EFL) has received growing interest in the recent task-based literature (Bygate, 2009; Hawkes, 2012; Mackey, Kanganas, & Oliver, 2007; Pinter, 2007b). Earlier work suggests that second language (L2) learners benefit from repeating the same or a slightly different task. Task repetition has been shown to enhance fluency and may also add to the complexity or accuracy of production. However, few investigations have taken a closer look at the underlying relationships between the three dimensions of task performance: complexity, accuracy, and fluency (CAF). Using Skehan's (2009) trade-off hypothesis as an explanatory framework, our study aims to fill this gap by investigating interactions among CAF measures. We report on the repeated performances of an oral spot-the-difference task by six 9-year-old EFL learners. Mirroring earlier work, our data reveal significant increases in fluency through task repetition. Correlational analyses show that in initial performances, gains in one dimension come at the expense of another; by the third performance, however, these trade-off effects disappear. Further qualitative analyses support our interpretation that, with growing task familiarity, students are able to focus their attention on all three CAF dimensions simultaneously.


2021 ◽  
Author(s):  
Cemanur Aydinalp ◽  
Sulayman Joof ◽  
Mehmet Nuri Akinci ◽  
Ibrahim Akduman ◽  
Tuba Yilmaz

In this manuscript, we propose a new technique for determining Debye parameters, which represent the dielectric properties of materials, from the reflection coefficient response of open-ended coaxial probes. The method retrieves the Debye parameters using a deep learning model trained on numerically generated data. Unlike real data, synthetically generated input and output data can represent a wide variety of materials and can be produced rapidly. Furthermore, the proposed method provides design flexibility and can be applied to any desired probe with the intended dimensions and material. We experimentally verified the designed deep learning model using reflection coefficients measured with the probe terminated by five different standard liquids, four mixtures, and a gel-like material, and compared the results with the literature. The obtained mean percent relative error ranged from 1.21±0.06 to 10.89±0.08. Our work also presents a large-scale statistical verification of the proposed dielectric property retrieval technique.
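
As a rough illustration of the training idea, not the authors' network or probe model, the sketch below generates synthetic spectra from the single-pole Debye relation ε(ω) = ε∞ + (εs − ε∞)/(1 + jωτ) and fits a small neural regressor to recover the parameters. The paper instead maps measured probe reflection coefficients, which requires a probe model omitted here; all parameter ranges and network sizes are assumptions.

```python
# Train a regressor on synthetically generated Debye spectra so that it
# recovers (eps_inf, eps_s, tau) from a sampled frequency response.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
freqs = 2 * np.pi * np.linspace(0.5e9, 6e9, 64)   # angular frequencies

def debye(eps_inf, eps_s, tau):
    spec = eps_inf + (eps_s - eps_inf) / (1 + 1j * freqs * tau)
    return np.concatenate([spec.real, spec.imag])  # flatten to features

# Synthetic training set: random but physically plausible Debye triples.
params = np.column_stack([rng.uniform(2, 6, 5000),          # eps_inf
                          rng.uniform(10, 80, 5000),        # eps_s
                          rng.uniform(5e-12, 5e-11, 5000)]) # tau (s)
X = np.array([debye(*p) for p in params])
y = params * [1, 1, 1e12]   # rescale tau to picoseconds so targets are O(1)

model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=300,
                     random_state=0)
model.fit(X, y)
```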


Computation ◽  
2019 ◽  
Vol 7 (3) ◽  
pp. 36 ◽  
Author(s):  
Keskin ◽  
Alsoy Altinkaya

Computational modeling of membrane materials is a rapidly growing field that investigates the properties of membrane materials beyond the limits of experimental techniques and complements experimental membrane studies by providing insights at the atomic level. In this study, we first reviewed the fundamental approaches employed to describe the gas permeability/selectivity trade-off of polymer membranes and then addressed the great promise of mixed matrix membranes (MMMs) to overcome this trade-off. We then reviewed the current approaches for predicting gas permeation through MMMs, focusing specifically on MMMs composed of metal organic frameworks (MOFs). We reviewed computational tools, such as atomically detailed molecular simulations, that can predict the gas separation performance of MOF-based MMMs prior to experimental investigation, and discussed new computational methods that can provide information about the compatibility between the MOF and the polymer of the MMM. We finally addressed the opportunities and challenges of using computational studies to analyze the barriers that must be overcome to advance the application of MOF-based membranes.
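
One classic permeation model for MMMs is the Maxwell model; the abstract does not name the specific models reviewed, so the sketch below is a hedged example of this family. It estimates the effective permeability of a polymer (continuous phase, P_c) loaded with filler particles (dispersed phase, P_d) at volume fraction phi; the example values are hypothetical.

```python
# Maxwell-model estimate of MMM permeability, valid for dilute loadings.
def maxwell_permeability(P_c: float, P_d: float, phi: float) -> float:
    """Effective permeability of a filler/polymer mixed matrix membrane."""
    num = P_d + 2 * P_c - 2 * phi * (P_c - P_d)
    den = P_d + 2 * P_c + phi * (P_c - P_d)
    return P_c * num / den

# Example: a 10-Barrer polymer with a fast, 100-Barrer MOF filler at 20%
# loading (hypothetical values).
print(maxwell_permeability(10.0, 100.0, 0.20))   # ~15.3 Barrer
```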


2013 ◽  
Vol 10 (04) ◽  
pp. 1350015 ◽  
Author(s):  
MARK J. AHN ◽  
ANNE S. YORK ◽  
SO YOUNG SOHN ◽  
PAYAM BENYAMINI

Disruptive technology platforms from emerging companies hold great promise for exploiting innovation, but often face legitimacy hurdles due to their liability of newness. Nascent firms must learn new roles with limited precedent, and establish ties with an environment that may not fully understand or value their existence. Using a legitimacy-based lens in the context of the biotechnology industry, we posit a sequential construct — cognitive, regulative, and normative legitimacy — to evaluate emergent technology platforms. Our model of biotechnology platform emergence may provide insights for understanding how breakthroughs achieve legitimacy in the scientific community, mobilize resources and talent, and attain commercial success.


Author(s):  
Esmaeil Keshavarz ◽  
Abbas Shoul

Trade-off problems concentrate on balancing the main parameters of a project: completion time, total cost, and the quality of activities. In this study, the project time-cost-quality trade-off problem is formulated and solved from a new standpoint. For this purpose, the completion time and crash cost of the project are expressed as fuzzy goals, and the dependency between each activity's implementation time and its execution quality is described by a fuzzy number. The overall quality of project execution is defined as the minimum execution quality among the project activities, which should be maximized. Based on realistic assumptions, a three-objective programming problem associated with the time-cost-quality trade-off problem is formulated; then, with the aim of identifying a fair and appropriate trade-off, the problem is reformulated as a single-objective linear program using a fuzzy decision-making methodology. Generating a final preferred solution, rather than a set of Pareto-optimal solutions, and having a reasonable interpretation are the two most important advantages of the proposed approach. To demonstrate the practical performance of the proposed models and approach, a time-cost-quality trade-off problem for a project with real data is solved and analyzed.
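
A common way to obtain such a single-objective reformulation is the max-min approach with linear membership functions (in the style of Zimmermann's fuzzy programming); the toy sketch below shows the pattern. All time/cost bounds and the crashing constraint are hypothetical, not taken from the paper.

```python
# Max-min sketch: maximise the overall satisfaction level lam subject to
# linear membership functions for two fuzzy goals (time and cost).
from scipy.optimize import linprog

T_best, T_worst = 20.0, 30.0    # fully satisfying vs. barely tolerable time
C_best, C_worst = 100.0, 150.0  # fully satisfying vs. barely tolerable cost

# Variables x = [lam, time, cost]; linprog minimises, so minimise -lam.
c = [-1.0, 0.0, 0.0]

# mu_time = (T_worst - time)/(T_worst - T_best) >= lam is linearised as
# lam*(T_worst - T_best) + time <= T_worst (likewise for cost); the last
# row is a toy crashing constraint, time >= 35 - 0.1*cost.
A_ub = [[T_worst - T_best, 1.0, 0.0],
        [C_worst - C_best, 0.0, 1.0],
        [0.0, -1.0, -0.1]]
b_ub = [T_worst, C_worst, -35.0]

res = linprog(c, A_ub=A_ub, b_ub=b_ub,
              bounds=[(0.0, 1.0), (T_best, T_worst), (C_best, C_worst)])
lam, time, cost = res.x   # ~0.67, ~23.3, ~116.7 under these toy numbers
```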


2015 ◽  
Vol 282 (1798) ◽  
pp. 20141069 ◽  
Author(s):  
Carmen Lía Murall ◽  
Chris T. Bauch ◽  
Troy Day

The human papillomavirus (HPV) vaccines hold great promise for preventing several cancers caused by HPV infections. Yet little attention has been given to whether HPV could respond evolutionarily to the new selection pressures imposed by the novel immune response the vaccine creates. Here, we present and theoretically validate a mechanism by which the vaccine alters the transmission–recovery trade-off that constrains HPV's virulence, such that higher oncogene expression is favoured. With a high oncogene expression strategy, the virus is able to increase its viral load and infected cell population before clearance by the vaccine, thus improving its chances of transmission. This new rapid cell-proliferation strategy is able to circulate between hosts with medium to high turnover rates of sexual partners. We also discuss the importance of better quantifying the duration of challenge infections and the degree to which a vaccinated host can shed virus. The generality of the models presented here suggests a wider applicability of this mechanism, and thus highlights the need to investigate viral oncogenicity from an evolutionary perspective.
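
The following is a purely illustrative toy of the trade-off argument, not the authors' model: if the transmission benefit rises with oncogene expression while infectious duration falls more steeply in unvaccinated hosts than in vaccinated ones (where clearance is fast regardless), the expression level that maximizes transmission potential shifts upward under vaccination. All functional forms and constants are invented for illustration.

```python
# Toy transmission-recovery trade-off: compare the oncogene expression
# level e that maximizes (transmission rate) x (infectious duration).
import numpy as np

e = np.linspace(0.1, 5, 500)              # oncogene expression level
beta = e / (1 + e)                        # saturating transmission gain
D_naive = 1 / (0.2 + 0.5 * e)             # naive hosts: high e is costly
D_vax = 1 / (2.0 + 0.1 * e)               # vaccinated: fast clearance anyway

print("optimal e, naive hosts:     ", e[np.argmax(beta * D_naive)])  # ~0.63
print("optimal e, vaccinated hosts:", e[np.argmax(beta * D_vax)])    # ~4.47
```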

