A Scalable Algorithm for One-to-One, Onto, and Partial Schema Matching with Uninterpreted Column Names and Column Values

2014 ◽  
Vol 25 (4) ◽  
pp. 1-16
Author(s):  
Boris Rabinovich ◽  
Mark Last

In this paper, the authors propose a five-step approach to the problem of identifying semantic correspondences between attributes of two database schemas, one of the key challenges in many database applications such as data integration and data warehousing. The authors' research focuses on uninterpreted schema matching, where the column names and column values are uninterpreted or unreliable. The approach uses Bayesian networks, Pearson's correlation, and mutual information to identify inter-attribute dependencies. Additionally, the authors propose an extension to their algorithm that allows the user to manually enter known mappings to improve the automated matching results. The five-step approach also allows data privacy preservation. The authors' evaluation experiments show that the proposed approach enhances the current set of schema matching techniques.
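
The two dependency measures named in the abstract above can be illustrated with a minimal sketch. This is not the authors' five-step algorithm; it only shows how Pearson's correlation and mutual information can be computed between two columns without interpreting their names or values. The function names and the convention of returning 0.0 for a constant column are assumptions made for this example.

```python
import math
from collections import Counter


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric columns."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: treat a constant column as uncorrelated
    return cov / math.sqrt(var_x * var_y)


def mutual_information(xs, ys):
    """Mutual information, in bits, of two equal-length discrete columns."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x, y) * log2(p(x, y) / (p(x) * p(y))), written with raw counts
        mi += (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
    return mi
```

Because both measures depend only on how values co-occur within a table, the dependency pattern among one schema's columns can be compared with the pattern among another schema's columns even when the names and values themselves carry no usable meaning.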

2014 ◽  
Vol 70 (5) ◽  
Author(s):  
Thabit Sabbah ◽  
Ali Selamat

Thesauri are used in many Information Retrieval (IR) applications such as data integration, data warehousing, semantic query processing, and classifiers. They have also been used to solve the problem of schema matching. Because many thesauri exist for any given area of knowledge, the quality of schema matching results obtained with different thesauri in the same field is not predictable. In this paper, we propose a methodology for studying the performance of a thesaurus in solving schema matching. The paper also presents results of experiments using different thesauri. Precision, recall, F-measure, and average similarity were calculated to show that the quality of matching changes according to the thesaurus used.
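
As a rough illustration of the evaluation measures mentioned above, the following sketch scores a set of proposed attribute correspondences against a reference mapping. The function name and the attribute pairs are hypothetical and do not come from the paper's experiments.

```python
def match_quality(predicted, reference):
    """Return (precision, recall, F-measure) for a set of attribute pairs."""
    predicted, reference = set(predicted), set(reference)
    true_pos = len(predicted & reference)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(reference) if reference else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical correspondences between a source and a target schema.
reference = {
    ("cust_name", "customer"),
    ("addr", "address"),
    ("zip", "postal_code"),
    ("tel", "phone"),
}
predicted = {
    ("cust_name", "customer"),
    ("addr", "address"),
    ("zip", "phone"),  # an incorrect match
}

precision, recall, f_measure = match_quality(predicted, reference)
```

Running the same scoring over the matches produced with each thesaurus would give one comparable triple of scores per thesaurus.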


Author(s):  
Shalin Eliabeth S. ◽  
Sarju S.

Big data privacy preservation is one of the most pressing issues in industry today. Data privacy problems sometimes go unidentified when input data is published in a cloud environment. Data privacy preservation in Hadoop deals with hiding and publishing the input dataset in the distributed environment. This paper investigates the problem of big data anonymization for privacy preservation from the perspectives of scalability and execution time. At present, many cloud applications that rely on big data anonymization face the same kinds of problems. To address them, the paper introduces a data anonymization algorithm called Two-Phase Top-Down Specialization (TPTDS), implemented in Hadoop. The input for anonymization consisted of 45,222 records of adult information with 15 attribute values. Using multidimensional anonymization in the MapReduce framework, the proposed TPTDS algorithm increases the efficiency of the big data processing system. Experiments with TPTDS on Hadoop in both one-dimensional and multidimensional MapReduce frameworks showed better results for multidimensional anonymization on the input adult dataset. Data sets are generalized in a top-down manner, and the better result in the multidimensional MapReduce framework is reflected in the better IGPL values generated by the algorithm. Anonymization was performed with specialization operations on a taxonomy tree. The experiments show that the solution improves the IGPL values and the anonymity parameter and decreases the execution time of big data privacy preservation compared to the existing algorithm. These experimental results are promising for applications in distributed environments.


Author(s):  
Dhamanpreet Kaur ◽  
Matthew Sobiesk ◽  
Shubham Patil ◽  
Jin Liu ◽  
Puran Bhagat ◽  
...  

Objective: This study seeks to develop a fully automated method of generating synthetic data from a real dataset that could be employed by medical organizations to distribute health data to researchers, reducing the need for access to real data. We hypothesize the application of Bayesian networks will improve upon the predominant existing method, medBGAN, in handling the complexity and dimensionality of healthcare data. Materials and Methods: We employed Bayesian networks to learn probabilistic graphical structures and simulated synthetic patient records from the learned structure. We used the University of California Irvine (UCI) heart disease and diabetes datasets as well as the MIMIC-III diagnoses database. We evaluated our method through statistical tests, machine learning tasks, preservation of rare events, disclosure risk, and the ability of a machine learning classifier to discriminate between the real and synthetic data. Results: Our Bayesian network model outperformed or equaled medBGAN in all key metrics. Notable improvement was achieved in capturing rare variables and preserving association rules. Discussion: Bayesian networks generated data sufficiently similar to the original data with minimal risk of disclosure, while offering additional transparency, computational efficiency, and capacity to handle more data types in comparison to existing methods. We hope this method will allow healthcare organizations to efficiently disseminate synthetic health data to researchers, enabling them to generate hypotheses and develop analytical tools. Conclusion: We conclude the application of Bayesian networks is a promising option for generating realistic synthetic health data that preserves the features of the original data without compromising data privacy.


Author(s):  
Yuanrui Dong ◽  
Peng Zhao ◽  
Hanqiao Yu ◽  
Cong Zhao ◽  
Shusen Yang

The emerging edge-cloud collaborative Deep Learning (DL) paradigm aims at improving the performance of practical DL implementations in terms of cloud bandwidth consumption, response latency, and data privacy preservation. Focusing on bandwidth efficient edge-cloud collaborative training of DNN-based classifiers, we present CDC, a Classification Driven Compression framework that reduces bandwidth consumption while preserving classification accuracy of edge-cloud collaborative DL. Specifically, to reduce bandwidth consumption, for resource-limited edge servers, we develop a lightweight autoencoder with a classification guidance for compression with classification driven feature preservation, which allows edges to only upload the latent code of raw data for accurate global training on the Cloud. Additionally, we design an adjustable quantization scheme adaptively pursuing the tradeoff between bandwidth consumption and classification accuracy under different network conditions, where only fine-tuning is required for rapid compression ratio adjustment. Results of extensive experiments demonstrate that, compared with DNN training with raw data, CDC consumes 14.9× less bandwidth with an accuracy loss no more than 1.06%, and compared with DNN training with data compressed by AE without guidance, CDC introduces at least 100% lower accuracy loss.


Author(s):  
Alfredo Cuzzocrea ◽  
Vincenzo Russo

The problem of ensuring the privacy and security of OLAP data cubes (Gray et al., 1997) arises in several fields ranging from advanced Data Warehousing (DW) and Business Intelligence (BI) systems to sophisticated Data Mining (DM) tools. In DW and BI systems, decision-making analysts aim to prevent malicious users from accessing sensitive ranges of multidimensional data in order to infer sensitive knowledge, or from attacking corporate data cubes by violating user rules, grants and revokes. In DM tools, domain experts aim to prevent malicious users from inferring critical-for-the-task knowledge from authoritative DM results such as frequent item sets, patterns and regularities, clusters, and discovered association rules. In more detail, the former application scenario (i.e., DW and BI systems) deals with both the privacy preservation and the security of data cubes, whereas the latter one (i.e., DM tools) deals solely with privacy-preserving OLAP issues. With respect to security issues, although security aspects of information systems include a plethora of topics ranging from cryptography to access control and secure digital signature, in our work we particularly focus on access control techniques for data cubes, and refer the reader to the active literature for the other orthogonal matters. Specifically, privacy preservation of data cubes refers to the problem of ensuring the privacy of data cube cells (and, in turn, that of queries defined over collections of data cube cells), i.e. hiding sensitive information and knowledge during data management activities, according to the general guidelines drawn by Sweeney in her seminal paper (Sweeney, 2002), whereas access control issues refer to the problem of ensuring the security of data cube cells, i.e. restricting the access of unauthorized users to specific sub-domains of the target data cube, according to well-known concepts studied and assessed in the context of DBMS security.
Nonetheless, it is easy to foresee that these two aspects, although distinct, should be meaningfully integrated in order to ensure both the privacy and security of complex data cubes, i.e. data cubes built on top of complex data/knowledge bases. In recent years, these topics have become of great interest for the Data Warehousing and Databases research communities, due to their exciting theoretical challenges as well as their relevance and practical impact in modern real-life OLAP systems and applications. On a more conceptual plane, theoretical work is mainly devoted to studying how probability and statistics schemes as well as rule-based models can be applied in order to efficiently solve the above-introduced problems. On a more practical plane, researchers and practitioners aim at integrating convenient privacy-preserving and security solutions within the core layers of commercial OLAP server platforms. To tackle the resulting privacy preservation challenges in OLAP, researchers have proposed models and algorithms that can be roughly classified within two main classes: restriction-based techniques and data perturbation techniques. The former propose limiting the number of query kinds that can be posed against the target OLAP server. The latter propose perturbing data cells by means of random noise at various levels, ranging from schemas to queries. On the other hand, access control solutions in OLAP are mainly inspired by the wide literature developed in the context of controlling accesses to DBMS, and try to adapt such schemes in order to control accesses to OLAP systems.
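
The data perturbation class described above can be sketched in its simplest form: adding random noise to each cell of a cube before it is exposed to queries. This is only an assumed illustration, not a technique taken from the chapter. The choice of zero-mean Gaussian noise, the `perturb_cells` helper, and the toy cube are all hypothetical, and a practical scheme would calibrate the noise to a stated privacy requirement.

```python
import random


def perturb_cells(cells, sigma, seed=None):
    """Return a copy of `cells` with zero-mean Gaussian noise added to each value."""
    rng = random.Random(seed)  # a local generator keeps runs reproducible
    return {coord: value + rng.gauss(0.0, sigma) for coord, value in cells.items()}


# Hypothetical two-dimensional cube: (region, quarter) -> sales total.
cube = {
    ("north", "Q1"): 120,
    ("north", "Q2"): 135,
    ("south", "Q1"): 98,
    ("south", "Q2"): 110,
}

noisy = perturb_cells(cube, sigma=5.0, seed=42)
```

Because the noise has zero mean, aggregates over many cells tend to stay close to their true values, while individual cells no longer reveal exact figures.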


2019 ◽  
Vol 29 (1) ◽  
pp. 1441-1452 ◽  
Author(s):  
G.K. Shailaja ◽  
C.V. Guru Rao

Privacy-preserving data mining (PPDM) is a novel approach that has emerged in the market to take care of privacy issues. The intention of PPDM is to build up data-mining techniques without raising the risk of mishandling the data exploited to generate those schemes. Conventional works include numerous techniques, most of which employ some form of transformation on the original data to guarantee privacy preservation. However, these schemes are quite multifaceted and memory intensive, which restricts their use. Hence, this paper intends to develop a novel PPDM technique that involves two phases, namely, data sanitization and data restoration. Initially, the association rules are extracted from the database before proceeding with the two phases. In both the sanitization and restoration processes, key extraction plays a major role; the key is selected optimally using the Opposition Intensity-based Cuckoo Search Algorithm, a modified form of the Cuckoo Search Algorithm. Here, four research issues, namely hiding failure rate, information preservation rate, false rule generation, and degree of modification, are minimized using the adopted sanitization and restoration processes.


Author(s):  
Mahmoud Barhamgi ◽  
Djamal Benslimane ◽  
Chirine Ghedira ◽  
Brahim Medjahed

Recent years have witnessed a growing interest in using Web services as a reliable means for medical data sharing inside and across healthcare organizations. In such service-based data sharing environments, Web service composition emerged as a viable approach to query data scattered across independent locations. Patient data privacy preservation is an important aspect that must be considered when composing medical Web services. In this paper, the authors show how data privacy can be preserved when composing and executing Web services. Privacy constraints are expressed in the form of RDF queries over a mediated ontology. Query rewriting algorithms are defined to process those queries while preserving users’ privacy.

