Privacy Preserving Approaches for High Dimensional Data

2017 ◽  
Vol. 1 (Issue 5) ◽  
pp. 1120-1125
Author(s):  
Tata Gayathri ◽  
N Durga ◽  
2020 ◽  
Vol 93 ◽  
pp. 101785
Author(s):  
Rong Wang ◽  
Yan Zhu ◽  
Chin-Chen Chang ◽  
Qiang Peng

2021 ◽  
Vol 2022 (1) ◽  
pp. 481-500
Author(s):  
Xue Jiang ◽  
Xuebing Zhou ◽  
Jens Grossklags

Abstract Business intelligence and AI services often involve the collection of copious amounts of multidimensional personal data. Since these data usually contain sensitive information about individuals, direct collection can lead to privacy violations. Local differential privacy (LDP) is currently considered a state-of-the-art solution for privacy-preserving data collection. However, existing LDP algorithms are not applicable to high-dimensional data, not only because of the increase in computation and communication cost but also because of poor data utility. In this paper, we aim at addressing the curse-of-dimensionality problem in LDP-based high-dimensional data collection. Based on the idea of machine learning and data synthesis, we propose DP-Fed-Wae, an efficient privacy-preserving framework for collecting high-dimensional categorical data. By combining a generative autoencoder, federated learning, and differential privacy, our framework is capable of privately learning the statistical distributions of local data and generating high-utility synthetic data on the server side without revealing users' private information. We have evaluated the framework in terms of data utility and privacy protection on a number of real-world datasets containing 68–124 classification attributes. We show that our framework outperforms the LDP-based baseline algorithms in capturing joint distributions and correlations of attributes and in generating high-utility synthetic data. With a local privacy guarantee ε = 8, the machine learning models trained with the synthetic data generated by the baseline algorithm incur an accuracy loss of 10% to 30%, whereas with our framework the accuracy loss is significantly reduced to less than 3% and at best even less than 1%. Extensive experimental results demonstrate the capability and efficiency of our framework in synthesizing high-dimensional data while striking a satisfactory utility-privacy balance.
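The abstract's combination of federated learning and differential privacy can be illustrated with a generic sketch: each client clips its local model update and adds Gaussian noise before upload, so the server only aggregates perturbed updates. This is a minimal illustration of the general pattern, not the paper's exact DP-Fed-Wae mechanism; `clip_norm` and `noise_std` are hypothetical placeholder parameters.

```python
import numpy as np

def dp_perturb_update(update, clip_norm=1.0, noise_std=0.5, rng=None):
    """Clip a local model update to clip_norm and add Gaussian noise.

    Generic DP-style perturbation sketch; the noise scale that yields a
    given epsilon depends on the accounting method, which is omitted here.
    """
    rng = np.random.default_rng(rng)
    norm = np.linalg.norm(update)
    clipped = update * min(1.0, clip_norm / max(norm, 1e-12))
    return clipped + rng.normal(0.0, noise_std, size=update.shape)

def federated_average(updates):
    """Server-side aggregation: the server only sees noisy, clipped updates."""
    return np.mean(updates, axis=0)

# Three hypothetical clients perturb their autoencoder gradients locally,
# then the server averages the uploads.
clients = [np.ones(4) * k for k in range(1, 4)]
noisy = [dp_perturb_update(u, rng=seed) for seed, u in enumerate(clients)]
avg = federated_average(noisy)
```

The key privacy property of this pattern is that the raw update never leaves the client; clipping bounds each client's sensitivity so that the added noise yields a quantifiable guarantee.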


Author(s):  
Stephen E. Fienberg ◽  
Jiashun Jin

We focus on the problem of multi-party data sharing in high-dimensional data settings where the number of measured features (or the dimension) p is frequently much larger than the number of subjects (or the sample size) n, the so-called p >> n scenario that has been the focus of much recent statistical research. Here, we consider data sharing for two interconnected problems in high-dimensional data analysis, namely feature selection and classification. We characterize the notions of "cautious", "regular", and "generous" data sharing in terms of their privacy-preserving implications for the parties and their share of data, with a focus on "feature privacy" rather than "sample privacy", though violation of the former may lead to violation of the latter. We evaluate the data sharing methods using the phase diagram from the statistical literature on multiplicity and Higher Criticism thresholding. In the two-dimensional phase space calibrated by signal sparsity and signal strength, a phase diagram is a partition of the phase space containing three distinguished regions, in which we have no (feature-)privacy violations, relatively rare privacy violations, and an overwhelming number of privacy violations, respectively.
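The Higher Criticism thresholding mentioned above can be sketched concretely: given per-feature p-values, the HC statistic (Donoho and Jin) compares the ordered p-values against their expected uniform positions, and the maximizing index yields a data-driven selection threshold. The function below is a minimal sketch of that standard statistic, not the paper's specific data-sharing procedure; `alpha0` is the conventional restriction of the scan to the smallest fraction of p-values.

```python
import numpy as np

def higher_criticism(pvalues, alpha0=0.5):
    """Higher Criticism statistic over sorted p-values.

    Compares the empirical fraction i/n with the i-th smallest p-value,
    normalized by its binomial standard deviation. Returns the maximum
    HC value and the p-value at which it is attained, which can serve
    as a feature-selection threshold.
    """
    p = np.sort(np.asarray(pvalues, dtype=float))
    n = len(p)
    i = np.arange(1, n + 1)
    hc = np.sqrt(n) * (i / n - p) / np.sqrt(p * (1 - p) + 1e-12)
    limit = max(1, int(alpha0 * n))  # scan only the smallest alpha0 fraction
    k = int(np.argmax(hc[:limit]))
    return hc[k], p[k]
```

Features whose p-values fall below the returned threshold would be selected; in the sparse-signal regime this threshold adapts to the unknown sparsity and strength, which is what makes the phase-diagram analysis possible.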

