Computational Complexity and ILP Models for Pattern Problems in the Logical Analysis of Data

Giuseppe Lancia; Paolo Serafini

doi:10.3390/a14080235

Computational Complexity and ILP Models for Pattern Problems in the Logical Analysis of Data

Algorithms ◽

10.3390/a14080235 ◽

2021 ◽

Vol 14 (8) ◽

pp. 235

Author(s):

Giuseppe Lancia ◽

Paolo Serafini

Keyword(s):

Computational Complexity ◽

Logical Analysis ◽

Programming Models ◽

Minimum Size ◽

Data Sets ◽

Np Hard ◽

Logical Analysis Of Data ◽

Data Set ◽

Boolean Formulas ◽

Ilp Models

Logical Analysis of Data is a procedure aimed at identifying relevant features in data sets with both positive and negative samples. The goal is to build Boolean formulas, represented by strings over {0,1,-} called patterns, which can be used to classify new samples as positive or negative. Since a data set can be explained in alternative ways, many computational problems arise related to the choice of a particular set of patterns. In this paper we study the computational complexity of several of these pattern problems (showing that they are, in general, computationally hard) and we propose some integer programming models that appear to be effective. We describe an ILP model for finding the minimum-size set of patterns explaining a given set of samples and another one for the problem of determining whether two sets of patterns are equivalent, i.e., they explain exactly the same samples. We base our first model on a polynomial procedure that computes all patterns compatible with a given set of samples. Computational experiments substantiate the effectiveness of our models on fairly large instances. Finally, we conjecture that the existence of an effective ILP model for finding a minimum-size set of patterns equivalent to a given set of patterns is unlikely, due to the problem being NP-hard and co-NP-hard at the same time.

Download Full-text

School Size and the Influence of Socioeconomic Status on Student Achievement:Confronting the Threat of Size Bias in National Data Sets

Education Policy Analysis Archives ◽

10.14507/epaa.v12n52.2004 ◽

2004 ◽

Vol 12 ◽

pp. 52 ◽

Cited By ~ 10

Author(s):

Craig B. Howley ◽

Aimee A. Howley

Keyword(s):

Socioeconomic Status ◽

School Size ◽

Large Data ◽

Minimum Size ◽

Data Sets ◽

Data Set ◽

Student Socioeconomic Status ◽

Nationally Representative ◽

Substantial Bias ◽

Relationship Of

Most of the recent literature on the achievement effects of school size has examined school and district performance. These studies have demonstrated substantial benefits of smaller school and district size in impoverished settings. To date, however, no work has adequately examined the relationship of size and socioeconomic status (SES) with students as the unit of analysis. One study, however, came close (Lee & Smith, 1997), but failed to adjust its analyses or conclusions to the substantial bias toward larger schools evident in the data set used. The present study, based on the same large data set, but with size issues in the rural circumstance clearly in focus, reaches rather different conclusions, extending previous work for the first time to a more adequate examination of size effects on individual students. Findings challenge assertions about ideal and minimum size. Analyses include comparison of means and multi-level modeling. Methodologically, the study illustrates the challenge of using nationally representative data sets of students to investigate second-level contextual phenomena, such as school size. When aggregated to schools attended by nationally representative students, the result cannot be a nationally representative set of schools. Adjustment with weights to simulate such a distribution, moreover, is inadequate to overcome this threat if one is interested in investigating size relationships among the smaller half of US schools, as one must be in seeking to generalize results to the nation as a whole. The present study finds that the smallest national decile of size maximizes the achievement of the poorest quartile of students. Moreover, appropriate size is shown to vary by student socioeconomic status.

Download Full-text

Controllability of Control Argumentation Frameworks

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2020/257 ◽

2020 ◽

Author(s):

Andreas Niskanen ◽

Daniel Neugebauer ◽

Matti Järvisalo

Keyword(s):

Computational Complexity ◽

Boolean Satisfiability ◽

Np Hard ◽

Abstraction Refinement ◽

Computational Problem ◽

Quantified Boolean Formulas ◽

Modeling Uncertainties ◽

Boolean Formulas

Control argumentation frameworks (CAFs) allow for modeling uncertainties inherent in various argumentative settings. We establish a complete computational complexity map of the central computational problem of controllability in CAFs for five key semantics. We also develop Boolean satisfiability based counterexample-guided abstraction refinement algorithms and direct encodings of controllability as quantified Boolean formulas, and empirically evaluate their scalability on a range of NP-hard variants of controllability.

Download Full-text

Logical Analysis of Data-A Survey Paper

International Journal of Advanced Research in Computer Science and Software Engineering ◽

10.23956/ijarcsse/v7i3/0177 ◽

2017 ◽

Vol 7 (3) ◽

pp. 227-229

Author(s):

Himani Chauhan ◽

◽

Garima Saxena ◽

Arpit Tripathi ◽

◽

...

Keyword(s):

Logical Analysis ◽

Logical Analysis Of Data ◽

Survey Paper

Download Full-text

The social wasp Vespula germanica (Fabricius) (Hymenoptera: Vespidae) population dynamics in England over 39 years.

The Entomologist s monthly magazine ◽

10.31184/m00138908.1542.3906 ◽

2018 ◽

Vol 154 (2) ◽

pp. 149-155

Author(s):

Michael Archer

Keyword(s):

Population Dynamics ◽

Population Dynamic ◽

Ecological Factors ◽

Social Wasp ◽

Data Sets ◽

Data Set ◽

Vespula Germanica ◽

The Social ◽

Minimum Number ◽

Suction Traps

1. Yearly records of worker Vespula germanica (Fabricius) taken in suction traps at Silwood Park (28 years) and at Rothamsted Research (39 years) are examined. 2. Using the autocorrelation function (ACF), a significant negative 1-year lag followed by a lesser non-significant positive 2-year lag was found in all, or parts of, each data set, indicating an underlying population dynamic of a 2-year cycle with a damped waveform. 3. The minimum number of years before the 2-year cycle with damped waveform was shown varied between 17 and 26, or was not found in some data sets. 4. Ecological factors delaying or preventing the occurrence of the 2-year cycle are considered.

Download Full-text

COMPARISON OF TWO OPTIMIZATION MODELS FOR CONSTRUCTING PATTERNS IN THE METHOD OF LOGICAL ANALYSIS OF DATA

СИСТЕМЫ УПРАВЛЕНИЯ И ИНФОРМАЦИОННЫЕ ТЕХНОЛОГИИ ◽

10.36622/vstu.2020.95.50.002 ◽

2020 ◽

pp. 10-13

Author(s):

Р.И. Кузьмич ◽

А.А. Ступина ◽

С.Н. Ежеманская ◽

А.П. Шугалей

Keyword(s):

Objective Function ◽

Optimization Model ◽

Logical Analysis ◽

Optimization Models ◽

Logical Analysis Of Data

Предлагаются две оптимизационные модели для построения информативных закономерностей. Приводится эмпирическое подтверждение целесообразности использования критерия бустинга в качестве целевой функции оптимизационной модели для получения информативных закономерностей. Информативность, закономерность, критерий бустинга, оптимизационная модель Comparison of two optimization models for constructing patterns in the method of logical analysis of data Two optimization models for constructing informative patterns are proposed. An empirical confirmation of the expediency of using the boosting criterion as an objective function of the optimization model for obtaining informative patterns is given.

Download Full-text

Predictive and Descriptive CoMFA Models: The Effect of Variable Selection

Combinatorial Chemistry & High Throughput Screening ◽

10.2174/1386207321666180212162028 ◽

2018 ◽

Vol 21 (2) ◽

pp. 117-124 ◽

Cited By ~ 4

Author(s):

Bakhtyar Sepehri ◽

Nematollah Omidikia ◽

Mohsen Kompany-Zareh ◽

Raouf Ghavami

Keyword(s):

Variable Selection ◽

Predictive Power ◽

Selection Method ◽

Data Sets ◽

Data Set ◽

Comfa Model ◽

Variable Selection Method

Aims & Scope: In this research, 8 variable selection approaches were used to investigate the effect of variable selection on the predictive power and stability of CoMFA models. Materials & Methods: Three data sets including 36 EPAC antagonists, 79 CD38 inhibitors and 57 ATAD2 bromodomain inhibitors were modelled by CoMFA. First of all, for all three data sets, CoMFA models with all CoMFA descriptors were created then by applying each variable selection method a new CoMFA model was developed so for each data set, 9 CoMFA models were built. Obtained results show noisy and uninformative variables affect CoMFA results. Based on created models, applying 5 variable selection approaches including FFD, SRD-FFD, IVE-PLS, SRD-UVEPLS and SPA-jackknife increases the predictive power and stability of CoMFA models significantly. Result & Conclusion: Among them, SPA-jackknife removes most of the variables while FFD retains most of them. FFD and IVE-PLS are time consuming process while SRD-FFD and SRD-UVE-PLS run need to few seconds. Also applying FFD, SRD-FFD, IVE-PLS, SRD-UVE-PLS protect CoMFA countor maps information for both fields.

Download Full-text

Human Activity Recognition using Fourier Transform Inspired Deep Learning Combination Model

International Journal of Sensors Wireless Communications and Control ◽

10.2174/2210327908666180727123657 ◽

2019 ◽

Vol 9 (1) ◽

pp. 16-31

Author(s):

Kyungkoo Jun

Keyword(s):

Fourier Transform ◽

Deep Learning ◽

Short Term Memory ◽

Window Size ◽

Sensor Data ◽

Data Sets ◽

Data Set ◽

Proposed Model ◽

Testing Data ◽

Labeling Scheme

Background & Objective: This paper proposes a Fourier transform inspired method to classify human activities from time series sensor data. Methods: Our method begins by decomposing 1D input signal into 2D patterns, which is motivated by the Fourier conversion. The decomposition is helped by Long Short-Term Memory (LSTM) which captures the temporal dependency from the signal and then produces encoded sequences. The sequences, once arranged into the 2D array, can represent the fingerprints of the signals. The benefit of such transformation is that we can exploit the recent advances of the deep learning models for the image classification such as Convolutional Neural Network (CNN). Results: The proposed model, as a result, is the combination of LSTM and CNN. We evaluate the model over two data sets. For the first data set, which is more standardized than the other, our model outperforms previous works or at least equal. In the case of the second data set, we devise the schemes to generate training and testing data by changing the parameters of the window size, the sliding size, and the labeling scheme. Conclusion: The evaluation results show that the accuracy is over 95% for some cases. We also analyze the effect of the parameters on the performance.

Download Full-text

An Algorithm for the Removal of Cosmic Ray Artifacts in Spectral Data Sets

Applied Spectroscopy ◽

10.1177/0003702819839098 ◽

2019 ◽

Vol 73 (8) ◽

pp. 893-901

Author(s):

Sinead J. Barton ◽

Bryan M. Hennelly

Keyword(s):

Cosmic Ray ◽

Data Sets ◽

Biological Cells ◽

Statistical Classification ◽

Signal To Noise ◽

Multivariate Statistical ◽

Data Set ◽

Artefact Removal ◽

Single Capture ◽

Acquisition Method

Cosmic ray artifacts may be present in all photo-electric readout systems. In spectroscopy, they present as random unidirectional sharp spikes that distort spectra and may have an affect on post-processing, possibly affecting the results of multivariate statistical classification. A number of methods have previously been proposed to remove cosmic ray artifacts from spectra but the goal of removing the artifacts while making no other change to the underlying spectrum is challenging. One of the most successful and commonly applied methods for the removal of comic ray artifacts involves the capture of two sequential spectra that are compared in order to identify spikes. The disadvantage of this approach is that at least two recordings are necessary, which may be problematic for dynamically changing spectra, and which can reduce the signal-to-noise (S/N) ratio when compared with a single recording of equivalent duration due to the inclusion of two instances of read noise. In this paper, a cosmic ray artefact removal algorithm is proposed that works in a similar way to the double acquisition method but requires only a single capture, so long as a data set of similar spectra is available. The method employs normalized covariance in order to identify a similar spectrum in the data set, from which a direct comparison reveals the presence of cosmic ray artifacts, which are then replaced with the corresponding values from the matching spectrum. The advantage of the proposed method over the double acquisition method is investigated in the context of the S/N ratio and is applied to various data sets of Raman spectra recorded from biological cells.

Download Full-text

Imbalanced Data Detection Kernel Method in Closed Systems

Advanced Materials Research ◽

10.4028/www.scientific.net/amr.756-759.3652 ◽

2013 ◽

Vol 756-759 ◽

pp. 3652-3658

Author(s):

You Li Lu ◽

Jun Luo

Keyword(s):

Kernel Methods ◽

Kernel Method ◽

Imbalanced Data ◽

Data Detection ◽

Data Sets ◽

System Call ◽

Data Set ◽

Imbalanced Data Sets ◽

Lower Complexity ◽

Closed Systems

Under the study of Kernel Methods, this paper put forward two improved algorithm which called R-SVM & I-SVDD in order to cope with the imbalanced data sets in closed systems. R-SVM used K-means algorithm clustering space samples while I-SVDD improved the performance of original SVDD by imbalanced sample training. Experiment of two sets of system call data set shows that these two algorithms are more effectively and R-SVM has a lower complexity.

Download Full-text

Investigating the impact of pre-processing techniques and pre-trained word embeddings in detecting Arabic health information on social media

Journal Of Big Data ◽

10.1186/s40537-021-00488-w ◽

2021 ◽

Vol 8 (1) ◽

Author(s):

Yahya Albalawi ◽

Jim Buckley ◽

Nikola S. Nikolov

Keyword(s):

Social Media ◽

Deep Learning ◽

Comprehensive Evaluation ◽

Classification Problem ◽

Data Sets ◽

Word Embeddings ◽

Data Set ◽

Lower Accuracy ◽

Health Related ◽

The Impact

AbstractThis paper presents a comprehensive evaluation of data pre-processing and word embedding techniques in the context of Arabic document classification in the domain of health-related communication on social media. We evaluate 26 text pre-processings applied to Arabic tweets within the process of training a classifier to identify health-related tweets. For this task we use the (traditional) machine learning classifiers KNN, SVM, Multinomial NB and Logistic Regression. Furthermore, we report experimental results with the deep learning architectures BLSTM and CNN for the same text classification problem. Since word embeddings are more typically used as the input layer in deep networks, in the deep learning experiments we evaluate several state-of-the-art pre-trained word embeddings with the same text pre-processing applied. To achieve these goals, we use two data sets: one for both training and testing, and another for testing the generality of our models only. Our results point to the conclusion that only four out of the 26 pre-processings improve the classification accuracy significantly. For the first data set of Arabic tweets, we found that Mazajak CBOW pre-trained word embeddings as the input to a BLSTM deep network led to the most accurate classifier with F1 score of 89.7%. For the second data set, Mazajak Skip-Gram pre-trained word embeddings as the input to BLSTM led to the most accurate model with F1 score of 75.2% and accuracy of 90.7% compared to F1 score of 90.8% achieved by Mazajak CBOW for the same architecture but with lower accuracy of 70.89%. Our results also show that the performance of the best of the traditional classifier we trained is comparable to the deep learning methods on the first dataset, but significantly worse on the second dataset.

Download Full-text