A Scalable Algorithm for One-to-One, Onto, and Partial Schema Matching with Uninterpreted Column Names and Column Values

Boris Rabinovich; Mark Last

doi:10.4018/jdm.2014100101

Journal of Database Management ◽

10.4018/jdm.2014100101 ◽

2014 ◽

Vol 25 (4) ◽

pp. 1-16

Author(s):

Boris Rabinovich ◽

Mark Last

Keyword(s):

Mutual Information ◽

Data Integration ◽

Bayesian Networks ◽

Data Privacy ◽

Privacy Preservation ◽

Data Warehousing ◽

Schema Matching ◽

Database Applications ◽

Scalable Algorithm ◽

Matching Techniques

In this paper, the authors propose a five-step approach to the problem of identifying semantic correspondences between attributes of two database schemas. This problem is one of the key challenges in many database applications, such as data integration and data warehousing. The authors' research focuses on uninterpreted schema matching, where the column names and column values are uninterpreted or unreliable. The approach uses Bayesian networks, Pearson's correlation, and mutual information to identify inter-attribute dependencies. Additionally, the authors propose an extension to their algorithm that allows the user to manually enter known mappings to improve the automated matching results. The five-step approach also allows data privacy preservation. The authors' evaluation experiments show that the proposed approach enhances the current set of schema matching techniques.
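The abstract above names mutual information as one of the measures used to find inter-attribute dependencies. The following is a minimal sketch of how such a dependency score might be estimated from opaque column values. It is not the authors' five-step algorithm; the function names `mutual_information` and `dependency_matrix` and the dict-of-lists table layout are assumptions made for illustration.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two equal-length columns."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("columns must be non-empty and of equal length")
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * log2(c * n / (count_x[x] * count_y[y]))
    return mi


def dependency_matrix(table):
    """Pairwise mutual information for a table given as {column_name: values}."""
    names = list(table)
    return {
        (a, b): mutual_information(table[a], table[b])
        for a in names
        for b in names
        if a != b
    }
```

Because the score depends only on how values co-occur, relabeling the values of a column leaves it unchanged, which is why a measure of this kind can be applied when column names and values carry no usable meaning.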

Schema Matching Quality: Thesaurus as the Matcher

Jurnal Teknologi ◽

10.11113/jt.v70.3514 ◽

2014 ◽

Vol 70 (5) ◽

Author(s):

Thabit Sabbah ◽

Ali Selamat

Keyword(s):

Information Retrieval ◽

Data Integration ◽

Query Processing ◽

Data Warehousing ◽

Schema Matching ◽

Semantic Query ◽

Integration Data ◽

F Measure

Thesauri are used in many Information Retrieval (IR) applications such as data integration, data warehousing, semantic query processing, and classifiers. They have also been used to solve the schema matching problem. Because many thesauri exist for a given area of knowledge, the quality of schema matching results obtained with different thesauri in the same field is not predictable. In this paper, we propose a methodology to study the performance of a thesaurus in solving schema matching. The paper also presents results of experiments using different thesauri. Precision, recall, F-measure, and average similarity were calculated to show that the quality of matching changed according to the thesaurus used.
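The abstract above evaluates matchers with precision, recall, and F-measure. As a hedged illustration of how those three numbers are usually computed for a set of proposed attribute correspondences, here is a small sketch; the function name `matching_quality` and the representation of a match as a pair of attribute names are assumptions, and the paper's similarity average is not covered.

```python
def matching_quality(found, reference):
    """Return (precision, recall, f_measure) for proposed vs. reference matches.

    Both arguments are iterables of (source_attribute, target_attribute) pairs.
    """
    found, reference = set(found), set(reference)
    true_positives = len(found & reference)
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F-measure is the harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```

Running the same function on the output obtained with each thesaurus against one reference mapping gives directly comparable scores.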

Big Data Privacy Preservation Using Two Phase Top-Down Specialization Algorithm with Multidimensional Map Reduce Framework on Hadoop

International Journal of Distributed and Cloud Computing ◽

10.21863/ijdcc/2015.3.2.009 ◽

2015 ◽

Vol 3 (2) ◽

Author(s):

Shalin Eliabeth S. ◽

Sarju S.

Keyword(s):

Big Data ◽

Data Privacy ◽

Privacy Preservation ◽

Experimental Result ◽

Map Reduce ◽

Distributed Environment ◽

Top Down ◽

Two Phase ◽

Data Anonymization ◽

Big Data Privacy

Big data privacy preservation is one of the most troubling issues in industry today. Data privacy problems sometimes go unidentified when input data is published in a cloud environment. Data privacy preservation in Hadoop deals with hiding and publishing the input dataset in the distributed environment. This paper investigates the problem of big data anonymization for privacy preservation from the perspectives of scalability, execution time, and related factors. At present, many cloud applications that use big data anonymization face the same kinds of problems. To address them, the paper introduces a data anonymization algorithm called Two-Phase Top-Down Specialization (TPTDS), implemented in Hadoop. For the data anonymization, 45,222 records of adult information with 15 attribute values were taken as the input big data. The proposed Two-Phase Top-Down Specialization algorithm was implemented in Hadoop using multidimensional anonymization in the MapReduce framework, which increases the efficiency of the big data processing system. Experiments were conducted with the algorithm in both one-dimensional and multidimensional MapReduce frameworks on Hadoop, and multidimensional anonymization gave the better result on the input adult dataset. Data sets are generalized in a top-down manner, and the better result in the multidimensional MapReduce framework is shown by the better IGPL values generated by the algorithm. The anonymization was performed with specialization operations on a taxonomy tree. The experiments show that the solution improves the IGPL values and the anonymity parameter and decreases the execution time of big data privacy preservation compared to the existing algorithm. This experimental result will lead to valuable applications in distributed environments.
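To make the idea of specialization on a taxonomy tree concrete, the sketch below walks a single attribute from its most general label downward, and accepts a specialization only when every resulting non-empty group still holds at least k records. This is a deliberately simplified, single-machine illustration: it does not score candidates by IGPL, does not use MapReduce, and handles one attribute only. The taxonomy, the attribute values, and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical taxonomy tree for an "education" attribute: parent -> children.
TAXONOMY = {
    "Any": ["Secondary", "University"],
    "Secondary": ["9th", "10th", "11th"],
    "University": ["Bachelors", "Masters"],
}


def leaves_under(node, taxonomy):
    """Return the set of raw (leaf) values covered by a taxonomy node."""
    children = taxonomy.get(node)
    if not children:
        return {node}
    covered = set()
    for child in children:
        covered |= leaves_under(child, taxonomy)
    return covered


def top_down_specialize(values, taxonomy, root, k):
    """Map each raw value to the most specific label that keeps groups >= k."""
    counts = Counter(values)
    cut = [root]  # start fully generalized
    changed = True
    while changed:
        changed = False
        for node in list(cut):
            children = taxonomy.get(node)
            if not children:
                continue  # already a leaf, nothing to specialize
            sizes = [
                sum(counts[leaf] for leaf in leaves_under(child, taxonomy))
                for child in children
            ]
            # Accept the specialization only if no non-empty group drops below k.
            if all(size == 0 or size >= k for size in sizes):
                i = cut.index(node)
                cut[i:i + 1] = children
                changed = True
    return {leaf: node for node in cut for leaf in leaves_under(node, taxonomy)}
```

With a larger k the loop stops higher in the tree, so the published values are more general; with a smaller k more detail survives.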

Application of Bayesian networks to generate synthetic health data

Journal of the American Medical Informatics Association ◽

10.1093/jamia/ocaa303 ◽

2020 ◽

Author(s):

Dhamanpreet Kaur ◽

Matthew Sobiesk ◽

Shubham Patil ◽

Jin Liu ◽

Puran Bhagat ◽

...

Keyword(s):

Machine Learning ◽

Bayesian Networks ◽

Data Privacy ◽

Statistical Tests ◽

Synthetic Data ◽

Original Data ◽

Health Data ◽

Minimal Risk ◽

Data Types ◽

Automated Method

Objective: This study seeks to develop a fully automated method of generating synthetic data from a real dataset that could be employed by medical organizations to distribute health data to researchers, reducing the need for access to real data. We hypothesize the application of Bayesian networks will improve upon the predominant existing method, medBGAN, in handling the complexity and dimensionality of healthcare data.

Materials and Methods: We employed Bayesian networks to learn probabilistic graphical structures and simulated synthetic patient records from the learned structure. We used the University of California Irvine (UCI) heart disease and diabetes datasets as well as the MIMIC-III diagnoses database. We evaluated our method through statistical tests, machine learning tasks, preservation of rare events, disclosure risk, and the ability of a machine learning classifier to discriminate between the real and synthetic data.

Results: Our Bayesian network model outperformed or equaled medBGAN in all key metrics. Notable improvement was achieved in capturing rare variables and preserving association rules.

Discussion: Bayesian networks generated data sufficiently similar to the original data with minimal risk of disclosure, while offering additional transparency, computational efficiency, and capacity to handle more data types in comparison to existing methods. We hope this method will allow healthcare organizations to efficiently disseminate synthetic health data to researchers, enabling them to generate hypotheses and develop analytical tools.

Conclusion: We conclude the application of Bayesian networks is a promising option for generating realistic synthetic health data that preserves the features of the original data without compromising data privacy.
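Simulating synthetic records from a Bayesian network is typically done by ancestral sampling: each variable is drawn in topological order, conditioned on the values already drawn for its parents. The sketch below illustrates only that sampling step. The three-variable network and its probabilities are invented for illustration and are hand-specified, whereas the study learns the structure from real data; the names `NETWORK`, `sample_record`, and `sample_dataset` are assumptions.

```python
import random

# Hypothetical network, listed in topological order. Each entry is
# (variable, parents, table mapping parent values -> P(variable is True)).
NETWORK = [
    ("smoker", (), {(): 0.3}),
    ("diabetes", (), {(): 0.1}),
    ("heart_disease", ("smoker", "diabetes"), {
        (False, False): 0.05,
        (False, True): 0.15,
        (True, False): 0.20,
        (True, True): 0.40,
    }),
]


def sample_record(network, rng):
    """Draw one synthetic record by sampling each variable given its parents."""
    record = {}
    for variable, parents, table in network:
        p_true = table[tuple(record[p] for p in parents)]
        record[variable] = rng.random() < p_true
    return record


def sample_dataset(network, n, seed=0):
    """Draw n independent synthetic records with a reproducible seed."""
    rng = random.Random(seed)
    return [sample_record(network, rng) for _ in range(n)]
```

No sampled record corresponds to a real patient; only the conditional probabilities carry information from the source data, which is the basis for the low disclosure risk the abstract reports.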

LRDM: Local Record-Driving Mechanism for Big Data Privacy Preservation in Social Networks

2016 IEEE First International Conference on Data Science in Cyberspace (DSC) ◽

10.1109/dsc.2016.94 ◽

2016 ◽

Cited By ~ 2

Author(s):

Weihao Li ◽

Hui Li

Keyword(s):

Social Networks ◽

Big Data ◽

Data Privacy ◽

Privacy Preservation ◽

Driving Mechanism ◽

Big Data Privacy

Mediating between heterogeneous ontologies using schema matching techniques

IRI-2005 IEEE International Conference on Information Reuse and Integration, 2005 ◽

10.1109/iri-05.2005.1506481 ◽

2005 ◽

Cited By ~ 1

Author(s):

O. Lyttleton ◽

D. Sinclair ◽

D. Tracey

Keyword(s):

Schema Matching ◽

Matching Techniques

CDC: Classification Driven Compression for Bandwidth Efficient Edge-Cloud Collaborative Deep Learning

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2020/467 ◽

2020 ◽

Author(s):

Yuanrui Dong ◽

Peng Zhao ◽

Hanqiao Yu ◽

Cong Zhao ◽

Shusen Yang

Keyword(s):

Deep Learning ◽

Classification Accuracy ◽

Data Privacy ◽

Privacy Preservation ◽

Fine Tuning ◽

Quantization Scheme ◽

Raw Data ◽

Resource Limited ◽

Lower Accuracy ◽

Bandwidth Consumption

The emerging edge-cloud collaborative Deep Learning (DL) paradigm aims at improving the performance of practical DL implementations in terms of cloud bandwidth consumption, response latency, and data privacy preservation. Focusing on bandwidth efficient edge-cloud collaborative training of DNN-based classifiers, we present CDC, a Classification Driven Compression framework that reduces bandwidth consumption while preserving classification accuracy of edge-cloud collaborative DL. Specifically, to reduce bandwidth consumption, for resource-limited edge servers, we develop a lightweight autoencoder with a classification guidance for compression with classification driven feature preservation, which allows edges to only upload the latent code of raw data for accurate global training on the Cloud. Additionally, we design an adjustable quantization scheme adaptively pursuing the tradeoff between bandwidth consumption and classification accuracy under different network conditions, where only fine-tuning is required for rapid compression ratio adjustment. Results of extensive experiments demonstrate that, compared with DNN training with raw data, CDC consumes 14.9× less bandwidth with an accuracy loss no more than 1.06%, and compared with DNN training with data compressed by AE without guidance, CDC introduces at least 100% lower accuracy loss.
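The abstract above mentions an adjustable quantization scheme that trades bandwidth for accuracy. The exact scheme is not described here, so the following sketch shows plain uniform quantization of a latent vector with a configurable bit width, purely to illustrate that trade-off. It is a generic technique swapped in for illustration, not the CDC scheme, and the function names are assumptions.

```python
def quantize(latent, bits):
    """Uniformly quantize a list of floats to integer codes of the given width.

    Returns (codes, offset, scale) so the receiver can reconstruct the values.
    """
    if bits < 1 or not latent:
        raise ValueError("need at least one bit and a non-empty latent vector")
    lo, hi = min(latent), max(latent)
    levels = (1 << bits) - 1  # highest integer code
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in latent]
    return codes, lo, scale


def dequantize(codes, offset, scale):
    """Reconstruct approximate float values from integer codes."""
    return [offset + c * scale for c in codes]
```

Fewer bits per value mean less data to upload but a larger reconstruction error, bounded by half the step size `scale`.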

Learning Bayesian Networks Based on a Mutual Information Scoring Function and EMI Method

Advances in Neural Networks – ISNN 2007 - Lecture Notes in Computer Science ◽

10.1007/978-3-540-72393-6_50 ◽

2007 ◽

pp. 414-423 ◽

Cited By ~ 1

Author(s):

Fengzhan Tian ◽

Haisheng Li ◽

Zhihai Wang ◽

Jian Yu

Keyword(s):

Mutual Information ◽

Bayesian Networks ◽

Scoring Function

Privacy Preserving OLAP and OLAP Security

Encyclopedia of Data Warehousing and Mining, Second Edition ◽

10.4018/978-1-60566-010-3.ch241 ◽

2011 ◽

pp. 1575-1581 ◽

Cited By ~ 26

Author(s):

Alfredo Cuzzocrea ◽

Vincenzo Russo

Keyword(s):

Access Control ◽

Privacy Preservation ◽

Data Warehousing ◽

Privacy Preserving ◽

Data Cube ◽

The Other ◽

Complex Data ◽

Privacy And Security ◽

Domain Experts ◽

Data Cubes

The problem of ensuring the privacy and security of OLAP data cubes (Gray et al., 1997) arises in several fields, ranging from advanced Data Warehousing (DW) and Business Intelligence (BI) systems to sophisticated Data Mining (DM) tools. In DW and BI systems, decision-making analysts aim to prevent malicious users from accessing perceptive ranges of multidimensional data in order to infer sensitive knowledge, or from attacking corporate data cubes by violating user rules, grants, and revokes. In DM tools, domain experts aim to prevent malicious users from inferring critical-for-the-task knowledge from authoritative DM results such as frequent item sets, patterns and regularities, clusters, and discovered association rules. In more detail, the former application scenario (i.e., DW and BI systems) deals with both the privacy preservation and the security of data cubes, whereas the latter (i.e., DM tools) deals solely with privacy preserving OLAP issues. With respect to security issues, although security aspects of information systems include a plethora of topics ranging from cryptography to access control and secure digital signatures, in our work we focus particularly on access control techniques for data cubes, and refer the reader to the active literature for the other orthogonal matters. Specifically, privacy preservation of data cubes refers to the problem of ensuring the privacy of data cube cells (and, in turn, that of queries defined over collections of data cube cells), i.e., hiding sensitive information and knowledge during data management activities, according to the general guidelines drawn by Sweeney in her seminal paper (Sweeney, 2002), whereas access control issues refer to the problem of ensuring the security of data cube cells, i.e., restricting the access of unauthorized users to specific sub-domains of the target data cube, according to well-known concepts studied and assessed in the context of DBMS security.
Nonetheless, it is quite straightforward to foresee that these two aspects, though distinct, should be meaningfully integrated in order to ensure both the privacy and security of complex data cubes, i.e., data cubes built on top of complex data/knowledge bases. In recent years, these topics have become of great interest to the Data Warehousing and Databases research communities, due to their exciting theoretical challenges as well as their relevance and practical impact in modern real-life OLAP systems and applications. On a more conceptual plane, theoretical work is mainly devoted to studying how probability and statistics schemes as well as rule-based models can be applied to solve the above problems efficiently. On a more practical plane, researchers and practitioners aim at integrating convenient privacy preserving and security solutions within the core layers of commercial OLAP server platforms. To tackle the resulting privacy preservation challenges in OLAP, researchers have proposed models and algorithms that can be roughly classified into two main classes: restriction-based techniques and data perturbation techniques. The former limit the number of query kinds that can be posed against the target OLAP server. The latter perturb data cells by means of random noise at various levels, ranging from schemas to queries. On the other hand, access control solutions in OLAP are mainly inspired by the wide literature on controlling access to DBMSs, and try to adapt such schemes in order to control access to OLAP systems.

Opposition Intensity-Based Cuckoo Search Algorithm for Data Privacy Preservation

Journal of Intelligent Systems ◽

10.1515/jisys-2018-0420 ◽

2019 ◽

Vol 29 (1) ◽

pp. 1441-1452 ◽

Cited By ~ 2

Author(s):

G.K. Shailaja ◽

C.V. Guru Rao

Keyword(s):

Data Mining ◽

Data Privacy ◽

Privacy Preservation ◽

Search Algorithm ◽

Cuckoo Search ◽

Original Data ◽

Cuckoo Search Algorithm ◽

Research Issues ◽

Novel Approach ◽

Two Phases

Privacy-preserving data mining (PPDM) is a novel approach that has emerged to address privacy issues. The intention of PPDM is to build data-mining techniques without raising the risk of mishandling the data used to generate those schemes. Conventional works include numerous techniques, most of which apply some form of transformation to the original data to guarantee privacy preservation. However, these schemes are quite multifaceted and memory intensive, which restricts the use of these methods. Hence, this paper intends to develop a novel PPDM technique that involves two phases, namely data sanitization and data restoration. Initially, the association rules are extracted from the database before proceeding with the two phases. In both the sanitization and restoration processes, key extraction plays a major role; the key is selected optimally using the Opposition Intensity-based Cuckoo Search Algorithm, which is a modified form of the Cuckoo Search Algorithm. Here, four research issues, namely hiding failure rate, information preservation rate, false rule generation, and degree of modification, are minimized using the adopted sanitization and restoration processes.

An SOA-Based Architecture to Share Medical Data with Privacy Preservation

International Journal of Organizational and Collective Intelligence ◽

10.4018/ijoci.2011070102 ◽

2011 ◽

Vol 2 (3) ◽

pp. 11-26

Author(s):

Mahmoud Barhamgi ◽

Djamal Benslimane ◽

Chirine Ghedira ◽

Brahim Medjahed

Keyword(s):

Web Services ◽

Web Service ◽

Data Sharing ◽

Data Privacy ◽

Privacy Preservation ◽

Medical Data ◽

Query Rewriting ◽

Healthcare Organizations ◽

Privacy Constraints ◽

Viable Approach

Recent years have witnessed a growing interest in using Web services as a reliable means for medical data sharing inside and across healthcare organizations. In such service-based data sharing environments, Web service composition emerged as a viable approach to query data scattered across independent locations. Patient data privacy preservation is an important aspect that must be considered when composing medical Web services. In this paper, the authors show how data privacy can be preserved when composing and executing Web services. Privacy constraints are expressed in the form of RDF queries over a mediated ontology. Query rewriting algorithms are defined to process those queries while preserving users’ privacy.
