A Bayesian Model for Bivariate Causal Inference

Entropy ◽

10.3390/e22010046 ◽

2019 ◽

Vol 22 (1) ◽

pp. 46

Author(s):

Maximilian Kurthen ◽

Torsten Enßlin

Keyword(s):

Causal Inference ◽

Bayesian Model ◽

State Of The Art ◽

Causal Relation ◽

Synthetic Data ◽

Bayesian Hierarchical ◽

Adopted Model ◽

Probability Densities ◽

Ill Posed ◽

Discrete Nature

We address the problem of two-variable causal inference without intervention. The task is to infer an existing causal relation between two random variables, i.e., X → Y or Y → X, from purely observational data. As the option to modify a potential cause is not available in many situations, only structural properties of the data can be used to solve this ill-posed problem. We briefly review a number of state-of-the-art methods for this task, including very recent ones. We introduce a novel inference method, Bayesian Causal Inference (BCI), which assumes a generative Bayesian hierarchical model in order to pursue the strategy of Bayesian model selection. In the adopted model, the distribution of the cause variable is given by a Poisson lognormal distribution, which allows us to account explicitly for the discrete nature of datasets, correlations in the parameter spaces, and the variance of probability densities on logarithmic scales. We assume Fourier-diagonal field covariance operators. The model is restricted to use cases where a direct causal relation X → Y has to be decided against a relation Y → X; we therefore compare it with other methods for this exact problem setting. The assumed generative model provides synthetic causal data for benchmarking our model against existing state-of-the-art models, namely LiNGAM, ANM-HSIC, ANM-MML, IGCI, and CGNN. We explore how well these methods perform in high-noise settings, on strongly discretized data, and on very sparse data. BCI performs generally reliably on synthetic data as well as on the real-world TCEP benchmark set, with an accuracy comparable to state-of-the-art algorithms. We discuss directions for the future development of BCI.
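
The model-selection strategy named in this abstract can be written in generic form as follows. This is a sketch of the standard Bayesian odds ratio, not the paper's full derivation; the symbols d (the observed data) and O are chosen here for illustration, and equal prior weight on both causal directions is an assumption.

```latex
O_{X \to Y} = \frac{P(d \mid X \to Y)}{P(d \mid Y \to X)},
\qquad
\text{infer } X \to Y \text{ if } O_{X \to Y} > 1,
\quad
\text{infer } Y \to X \text{ if } O_{X \to Y} < 1 .
```

Each evidence term marginalizes over the parameters of the corresponding generative model, which is where the hierarchical model assumptions enter.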

Gradient Profile Estimation Using Exponential Cubic Spline Smoothing in a Bayesian Framework

Entropy ◽

10.3390/e23060674 ◽

2021 ◽

Vol 23 (6) ◽

pp. 674

Author(s):

Kushani De Silva ◽

Carlo Cafaro ◽

Adom Giffin

Keyword(s):

State Of The Art ◽

Estimation Method ◽

Synthetic Data ◽

Noisy Data ◽

Bayesian Framework ◽

Physical Systems ◽

Gradient Profile ◽

Ill Posed ◽

Profile Estimation ◽

Different Levels

Attaining reliable gradient profiles is of utmost relevance for many physical systems. In many situations, the estimation of the gradient is inaccurate due to noise. It is common practice to first estimate the underlying system and then compute the gradient profile by taking the subsequent analytic derivative of the estimated system. The underlying system is often estimated by fitting or smoothing the data using other techniques. Taking the subsequent analytic derivative of an estimated function can be ill-posed. This becomes worse as the noise in the system increases. As a result, the uncertainty generated in the gradient estimate increases. In this paper, a theoretical framework for a method to estimate the gradient profile of discrete noisy data is presented. The method was developed within a Bayesian framework. Comprehensive numerical experiments were conducted on synthetic data at different levels of noise. The accuracy of the proposed method was quantified. Our findings suggest that the proposed gradient profile estimation method outperforms the state-of-the-art methods.
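
The ill-posedness described in this abstract can be illustrated with a small experiment. The sketch below does not implement the paper's spline-based Bayesian method; it only shows, under assumptions chosen here (a sine test signal, Gaussian noise, a central finite difference, and the hypothetical helper names used below), how differentiating noisy samples amplifies the noise.

```python
import math
import random


def central_difference(values, step):
    """Central finite-difference derivative at the interior grid points."""
    return [(values[i + 1] - values[i - 1]) / (2.0 * step)
            for i in range(1, len(values) - 1)]


def rms_error(estimate, truth):
    """Root-mean-square difference between two equally long sequences."""
    return math.sqrt(sum((e - t) ** 2 for e, t in zip(estimate, truth)) / len(truth))


def gradient_error(noise_sigma, n_points=201, seed=0):
    """RMS error of the finite-difference gradient of noisy samples of sin(x)."""
    rng = random.Random(seed)  # fixed seed so runs are reproducible
    step = 2.0 * math.pi / (n_points - 1)
    xs = [i * step for i in range(n_points)]
    noisy = [math.sin(x) + rng.gauss(0.0, noise_sigma) for x in xs]
    true_gradient = [math.cos(x) for x in xs[1:-1]]  # exact derivative of sin
    return rms_error(central_difference(noisy, step), true_gradient)


if __name__ == "__main__":
    for sigma in (0.0, 0.01, 0.05):
        print(f"noise sigma = {sigma:<5} gradient RMS error = {gradient_error(sigma):.4f}")
```

With this grid spacing, the error of the gradient estimate is far larger than the noise level of the data itself, which is why smoothing or a probabilistic treatment is applied before differentiating.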

G-Tric: generating three-way synthetic datasets with triclustering solutions

BMC Bioinformatics ◽

10.1186/s12859-020-03925-4 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

João Lobo ◽

Rui Henriques ◽

Sara C. Madeira

Keyword(s):

State Of The Art ◽

Synthetic Data ◽

Ground Truth ◽

Real Data ◽

Three Dimensions ◽

Additional Advantage ◽

Urban Dynamics ◽

Data Generator ◽

Real World Datasets ◽

Synthetic Datasets

Background: Three-way data started to gain popularity due to their increasing capacity to describe inherently multivariate and temporal events, such as biological responses, social interactions over time, urban dynamics, or complex geophysical phenomena. Triclustering, the subspace clustering of three-way data, enables the discovery of patterns corresponding to data subspaces (triclusters) with values correlated across the three dimensions (observations × features × contexts). With an increasing number of algorithms being proposed, effectively comparing them with state-of-the-art algorithms is paramount. These comparisons are usually performed using real data without a known ground truth, which limits the assessments. In this context, we propose a synthetic data generator, G-Tric, allowing the creation of synthetic datasets with configurable properties and the possibility to plant triclusters. The generator is prepared to create datasets resembling real three-way data from biomedical and social data domains, with the additional advantage of providing the ground truth (triclustering solution) as output. Results: G-Tric can replicate real-world datasets and create new ones that match researchers' needs across several properties, including data type (numeric or symbolic), dimensions, and background distribution. Users can tune the patterns and structure that characterize the planted triclusters (subspaces) and how they interact (overlapping). Data quality can also be controlled by defining the amount of missing values, noise, or errors. Furthermore, a benchmark of datasets resembling real data is made available, together with the corresponding triclustering solutions (planted triclusters) and generating parameters. Conclusions: Triclustering evaluation using G-Tric provides the possibility to combine both intrinsic and extrinsic metrics to compare solutions, producing more reliable analyses. A set of predefined datasets, mimicking widely used three-way data and exploring crucial properties, was generated and made available, highlighting G-Tric’s potential to advance the triclustering state of the art by easing the process of evaluating the quality of new triclustering approaches.

Extraction of causal relations based on SBEL and BERT model

Database ◽

10.1093/database/baab005 ◽

2021 ◽

Vol 2021 ◽

Author(s):

Yifan Shao ◽

Haoru Li ◽

Jinghang Gu ◽

Longhua Qian ◽

Guodong Zhou

Keyword(s):

State Of The Art ◽

Causal Relation ◽

Relation Extraction ◽

The Other ◽

Biomedical Text ◽

Intermediate Form ◽

Biomedical Text Mining ◽

Causal Relations ◽

The One ◽

Stage 1

Extraction of causal relations between biomedical entities in the form of Biological Expression Language (BEL) poses a new challenge to the biomedical text mining community due to the complexity of BEL statements. We propose a simplified form of BEL statements, Simplified Biological Expression Language (SBEL), to facilitate BEL extraction, and employ BERT (Bidirectional Encoder Representations from Transformers) to improve the performance of causal relation extraction (RE). On the one hand, BEL statement extraction is transformed into the extraction of an intermediate form, the SBEL statement, which is then further decomposed into two subtasks: entity RE and entity function detection. On the other hand, we use a powerful pretrained BERT model both to extract entity relations and to detect entity functions, aiming to improve the performance of the two subtasks. Entity relations and functions are then combined into SBEL statements and finally merged into BEL statements. Experimental results on the BioCreative-V Track 4 corpus demonstrate that our method achieves state-of-the-art performance in BEL statement extraction, with F1 scores of 54.8% in Stage 2 evaluation and 30.1% in Stage 1 evaluation. Database URL: https://github.com/grapeff/SBEL_datasets

Estimating Efforts and Success of Symmetry-Seeing Machines by Use of Synthetic Data

Symmetry ◽

10.3390/sym11020227 ◽

2019 ◽

Vol 11 (2) ◽

pp. 227

Author(s):

Eckart Michaelsen ◽

Stéphane Vujasinovic

Keyword(s):

Extraction Method ◽

Input Data ◽

State Of The Art ◽

Recognition Performance ◽

Human Subjects ◽

Synthetic Data ◽

Ground Truth ◽

Comparative Test ◽

Real Imagery ◽

Dominant Part

Representative input data are a necessary requirement for the assessment of machine-vision systems. For symmetry-seeing machines in particular, such imagery should provide symmetries as well as asymmetric clutter. Moreover, reliable ground truth must accompany the data. It should be possible to estimate the recognition performance and the computational effort by providing different grades of difficulty and complexity. Recent competitions used real imagery labeled by human subjects with appropriate ground truth. The paper at hand proposes to use synthetic data instead. Such data contain symmetry, clutter, and nothing else. This is preferable because interference with other perceptive capabilities, such as object recognition, or with prior knowledge can be avoided. The data are given sparsely, i.e., as sets of primitive objects. However, images can be generated from them, so that the same data can also be fed into machines requiring dense input, such as multilayered perceptrons. Sparse representations are preferred because the author’s own system requires such data, and in this way any influence of the primitive extraction method is excluded. The presented format allows hierarchies of symmetries. This is important because hierarchy constitutes a natural and dominant part of symmetry-seeing. The paper reports some experiments using the author’s Gestalt algebra system as a symmetry-seeing machine. Also included is a comparative test run with the state-of-the-art symmetry-seeing deep learning convolutional perceptron of the PSU. The computational effort and recognition performance are assessed.

Improving Skin Lesion Analysis with Generative Adversarial Networks

10.5753/sibgrapi.est.2020.12986 ◽

2020 ◽

Author(s):

Alceu Bissoto ◽

Sandra Avila

Keyword(s):

Skin Lesion ◽

State Of The Art ◽

Synthetic Data ◽

Clinical Information ◽

Analysis Data ◽

Training Dataset ◽

Generative Adversarial Networks ◽

Classification Models ◽

Adversarial Networks ◽

Lesion Analysis

Melanoma is the most lethal type of skin cancer. Early diagnosis is crucial to increase the survival rate of patients because of the possibility of metastasis. Automated skin lesion analysis can play an essential role by reaching people who do not have access to a specialist. However, since deep learning became the state of the art for skin lesion analysis, data has become a decisive factor in pushing the solutions further. The core objective of this M.Sc. dissertation is to tackle the problems that arise from having limited datasets. In the first part, we use generative adversarial networks to generate synthetic data to augment our classification model’s training datasets and boost performance. Our method generates high-resolution, clinically meaningful skin lesion images that, when added to our classification model’s training dataset, consistently improved performance in different scenarios and for distinct datasets. We also investigate how our classification models perceive the synthetic samples and how these samples can aid the model’s generalization. Finally, we investigate a problem that usually arises from having few, relatively small datasets that are thoroughly reused in the literature: bias. For this, we designed experiments to study how our models use data, verifying how they exploit correct (based on medical algorithms) and spurious (based on artifacts introduced during image acquisition) correlations. Disturbingly, even in the absence of any clinical information regarding the lesion being diagnosed, our classification models performed much better than chance (even competing with specialists’ benchmarks), strongly suggesting inflated performance.

Present State-of-The-Art of Association Rule Mining Algorithms

International Journal of Engineering and Advanced Technology - Regular Issue ◽

10.35940/ijeat.a2202.109119 ◽

2019 ◽

Vol 9 (1) ◽

pp. 6398-6405

Keyword(s):

Data Mining ◽

Association Rule ◽

Association Rule Mining ◽

State Of The Art ◽

Synthetic Data ◽

Data Sets ◽

Evolutionary Analysis ◽

Rule Mining ◽

Transaction Database ◽

Mining Algorithms

Data mining is the process of extracting useful information from various repositories such as relational databases, transaction databases, spatial databases, temporal and time-series databases, data warehouses, and the World Wide Web. The functionalities of data mining include characterization and discrimination, classification and prediction, association rule mining, cluster analysis, and evolutionary analysis. Association rule mining is one of the most important data mining techniques; it aims at extracting interesting relationships within the data. In this paper we study various association rule mining algorithms, compare them using synthetic data sets, and present the results of the experimental analysis.
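
Association rules are commonly scored by support and confidence. The sketch below computes both for a rule on a toy transaction database; the function names and the example transactions are invented for this illustration and are not taken from the paper, which does not specify its data sets here.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent.

    Assumes the antecedent occurs in at least one transaction.
    """
    joint = support(transactions, set(antecedent) | set(consequent))
    return joint / support(transactions, antecedent)


# Hypothetical toy transaction database, invented for this illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

if __name__ == "__main__":
    print("support({bread, milk}) =", support(transactions, {"bread", "milk"}))
    print("confidence(bread -> milk) =", confidence(transactions, {"bread"}, {"milk"}))
    print("confidence(butter -> bread) =", confidence(transactions, {"butter"}, {"bread"}))
```

Mining algorithms such as those the paper surveys differ mainly in how they search the space of itemsets efficiently; the scoring of a candidate rule follows these two definitions.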

Optimization Learning: Perspective, Method, and Applications

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2020/728 ◽

2020 ◽

Author(s):

Risheng Liu

Keyword(s):

Inverse Problems ◽

Theoretical Analysis ◽

Iterative Methods ◽

State Of The Art ◽

Learning Paradigm ◽

Rigorous Analysis ◽

The Core ◽

Globally Convergent ◽

Ill Posed ◽

Art Performance

Numerous tasks at the core of statistics, learning, and vision areas are specific cases of ill-posed inverse problems. Recently, learning-based (e.g., deep) iterative methods have been empirically shown to be useful for these problems. Nevertheless, integrating learnable structures into iterations is still a laborious process, which can only be guided by intuitions or empirical insights. Moreover, there is a lack of rigorous analysis of the convergence behaviors of these reimplemented iterations, and thus the significance of such methods is a little bit vague. We move beyond these limits and propose a theoretically guaranteed optimization learning paradigm, a generic and provable paradigm for nonconvex inverse problems, and develop a series of convergent deep models. Our theoretical analysis reveals that the proposed optimization learning paradigm allows us to generate globally convergent trajectories for learning-based iterative methods. Thanks to the superiority of our framework, we achieve state-of-the-art performance on different real applications.

AI-driven deep CNN approach for multi-label pathology classification using chest X-Rays

PeerJ Computer Science ◽

10.7717/peerj-cs.495 ◽

2021 ◽

Vol 7 ◽

pp. e495

Author(s):

Saleh Albahli ◽

Hafiz Tayyab Rauf ◽

Abdulelah Algosaibi ◽

Valentina Emilia Balas

Keyword(s):

Neural Networks ◽

Data Augmentation ◽

State Of The Art ◽

Synthetic Data ◽

X Rays ◽

Deep Convolutional Neural Networks ◽

Current State ◽

Pathology Classification ◽

Wide Range ◽

Multi Class Classification

Artificial intelligence (AI) has played a significant role in image analysis and feature extraction, applied to detect and diagnose a wide range of chest-related diseases. Although several researchers have used current state-of-the-art approaches and produced impressive chest-related clinical outcomes, such techniques offer limited benefit if they detect one type of disease without identifying the rest. Attempts to identify multiple chest-related diseases have been ineffective because the available data were insufficient and imbalanced. This research contributes to the healthcare industry and the research community by proposing synthetic data augmentation for three deep convolutional neural network (CNN) architectures for the detection of 14 chest-related diseases. The employed models are DenseNet121, InceptionResNetV2, and ResNet152V2; after training and validation, an average ROC-AUC score of 0.80 was obtained, which is competitive with previous models trained for multi-class classification to detect anomalies in X-ray images. This research illustrates how the proposed model applies state-of-the-art deep neural networks to classify 14 chest-related diseases with better accuracy.

A Bayesian model integration for mutation calling through data partitioning

Bioinformatics ◽

10.1093/bioinformatics/btz233 ◽

2019 ◽

Vol 35 (21) ◽

pp. 4247-4254 ◽

Cited By ~ 2

Author(s):

Takuya Moriyama ◽

Seiya Imoto ◽

Shuto Hayashi ◽

Yuichi Shiraishi ◽

Satoru Miyano ◽

...

Keyword(s):

Hierarchical Model ◽

Bayesian Model ◽

Information Sources ◽

Tumor Model ◽

Bayesian Hierarchical Model ◽

Generative Models ◽

Error Model ◽

Model Integration ◽

Bayesian Hierarchical ◽

Model Based

Motivation: Detection of somatic mutations from tumor and matched normal sequencing data has become one of the most important analysis methods in cancer research. Some existing mutation callers have focused on additional information, e.g. heterozygous single-nucleotide polymorphisms (SNPs) near mutation candidates or overlapping paired-end read information. However, existing methods cannot take multiple information sources into account simultaneously. Existing Bayesian hierarchical model-based methods construct two generative models, the tumor model and the error model, and only limited information sources have been modeled. Results: We propose a Bayesian model integration framework named partitioning-based model integration. In this framework, by introducing partitions for paired-end reads based on given information sources, we integrate existing generative models and utilize multiple information sources. On this basis, we constructed a novel Bayesian hierarchical model-based method named OHVarfinDer. In both the tumor model and the error model, we introduced partitions for the set of paired-end reads that cover a mutation candidate position, and applied a different generative model to each category of paired-end reads. We demonstrated that our method can utilize both heterozygous SNP information and overlapping paired-end read information effectively in simulation datasets and real datasets. Availability and implementation: https://github.com/takumorizo/OHVarfinDer. Supplementary information: Supplementary data are available at Bioinformatics online.

Hierarchical Bayesian model to infer PL(Z) relations using Gaia parallaxes

Astronomy and Astrophysics ◽

10.1051/0004-6361/201832945 ◽

2019 ◽

Vol 623 ◽

pp. A156 ◽

Cited By ~ 3

Author(s):

H. E. Delgado ◽

L. M. Sarro ◽

G. Clementini ◽

T. Muraveva ◽

A. Garofalo

Keyword(s):

Bayesian Model ◽

Probability Distributions ◽

Synthetic Data ◽

Full Description ◽

Hierarchical Bayesian ◽

Hierarchical Bayesian Model ◽

Rr Lyrae Stars ◽

Rr Lyrae ◽

Data Release ◽

Lyrae Stars

In a recent study we analysed period–luminosity–metallicity (PLZ) relations for RR Lyrae stars using the Gaia Data Release 2 (DR2) parallaxes. It built on a previous work that was based on the first Gaia Data Release (DR1), and also included period–luminosity (PL) relations for Cepheids and RR Lyrae stars. The method used to infer the relations from Gaia DR2 data and one of the methods used for Gaia DR1 data was based on a Bayesian model, the full description of which was deferred to a subsequent publication. This paper presents the Bayesian method for the inference of the parameters of PL(Z) relations used in those studies, the main feature of which is to manage the uncertainties on observables in a rigorous and well-founded way. The method encodes the probability relationships between the variables of the problem in a hierarchical Bayesian model and infers the posterior probability distributions of the PL(Z) relationship coefficients using Markov chain Monte Carlo simulation techniques. We evaluate the method with several semi-synthetic data sets and apply it to a sample of 200 fundamental and first-overtone RR Lyrae stars for which Gaia DR1 parallaxes and literature Ks-band mean magnitudes are available. We define and test several hyperprior probabilities to verify their adequacy and check the sensitivity of the solution with respect to the prior choice. The main conclusion of this work, based on the test with semi-synthetic Gaia DR1 parallaxes, is the absolute necessity of incorporating the existing correlations between the period, metallicity, and parallax measurements in the form of model priors in order to avoid systematically biased results, especially in the case of non-negligible uncertainties in the parallaxes. The relation coefficients obtained here have been superseded by those presented in our recent paper that incorporates the findings of this work and the more recent Gaia DR2 measurements.
