Statistical Disclosure Limitation: New Directions and Challenges

2018 ◽  
Vol 8 (1) ◽  
Author(s):  
Natalie Shlomo

An overview of traditional types of data dissemination at statistical agencies is provided, including definitions of disclosure risks, the quantification of disclosure risk and data utility, and common statistical disclosure limitation (SDL) methods. However, with technological advancements and the increasing push by governments for open and accessible data, new forms of data dissemination are currently being explored. We focus on web-based applications such as flexible table builders and remote analysis servers, synthetic data and remote access. Many of these applications introduce new challenges for statistical agencies as they gradually relinquish some of their control over what data are released. There is now more recognition of the need for perturbative methods to protect the confidentiality of data subjects. These new forms of data dissemination are changing the landscape of how disclosure risks are conceptualized and the types of SDL methods that need to be applied to protect the data. In particular, inferential disclosure is the main disclosure risk of concern and encompasses the traditional types of disclosure risks based on identity and attribute disclosures. These challenges have led statisticians to explore the computer science definition of differential privacy and privacy-by-design applications. We explore how differential privacy can be a useful addition to the current SDL framework within statistical agencies.
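To illustrate the differential-privacy mechanism the abstract refers to, the following is a minimal sketch (not taken from the paper) of the standard Laplace mechanism for releasing a single count. The function name and parameters are illustrative; the key idea is that adding or removing one respondent changes a count by at most 1, so Laplace noise with scale 1/epsilon gives epsilon-differential privacy.

```python
import math
import random

def laplace_mechanism(true_count, epsilon):
    """Release a count with Laplace noise calibrated to sensitivity 1.

    Adding or removing one respondent changes a count by at most 1, so
    noise drawn from Laplace(0, 1/epsilon) satisfies
    epsilon-differential privacy for count queries.
    """
    scale = 1.0 / epsilon
    # Inverse-CDF sampling of a Laplace(0, scale) variate.
    u = random.random() - 0.5
    noise = -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_count + noise

# Smaller epsilon means stronger privacy but noisier output.
noisy = laplace_mechanism(true_count=120, epsilon=1.0)
```

In an agency setting this mechanism would sit behind a table builder or remote analysis server, so that every released cell carries calibrated noise rather than the exact count.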

Author(s):  
Natalie Shlomo ◽  
Chris J. Skinner

Statistical agencies release microdata from social surveys as public-use files after applying statistical disclosure limitation (SDL) techniques. Disclosure risk is typically assessed in terms of identification risk, where it is supposed that small counts on cross-classified identifying key variables, i.e. a key, could be used to make an identification, and confidential information may be learnt. In this paper we explore the application of definitions of privacy from the computer science literature to the same problem, with a focus on sampling and a form of perturbation which can be represented as misclassification. We consider two privacy definitions: differential privacy and probabilistic differential privacy. Chaudhuri and Mishra (2006) have shown that sampling does not guarantee differential privacy, but that, under certain conditions, it may ensure probabilistic differential privacy. We discuss these definitions and conditions in the context of survey microdata. We then extend this discussion to the case of perturbation. We show that differential privacy can be ensured if and only if the perturbation employs a misclassification matrix with no zero entries. We also show that probabilistic differential privacy is a viable alternative to differential privacy when there are zeros in the misclassification matrix. We discuss some common SDL methods for which zeros may be prevalent in the misclassification matrix.
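The "no zero entries" result can be made concrete with a short sketch (my own illustration, not code from the paper). If P[j][a] is the probability of releasing category j when the true category is a, the tightest epsilon achievable is the largest log-ratio of a released category's probability across two true categories; any zero entry makes some ratio infinite, so no finite epsilon exists.

```python
import math

def perturbation_epsilon(P):
    """Differential-privacy level implied by a misclassification matrix.

    P[j][a] = probability of releasing category j given true category a.
    Returns max over j, a, b of ln(P[j][a] / P[j][b]); this is infinite
    exactly when the matrix contains a zero entry.
    """
    if any(p == 0 for row in P for p in row):
        return math.inf
    eps = 0.0
    for row in P:                      # one released category per row
        for a in range(len(row)):
            for b in range(len(row)):
                eps = max(eps, math.log(row[a] / row[b]))
    return eps

# Randomised response that keeps the true binary value with probability 0.75:
P = [[0.75, 0.25],
     [0.25, 0.75]]
# perturbation_epsilon(P) = ln(0.75 / 0.25) = ln 3
```

A deterministic release (the identity matrix) has zeros off the diagonal and hence no finite epsilon, matching the paper's observation that common SDL methods with structural zeros cannot satisfy differential privacy.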


Author(s):  
John M Abowd

The dual problems of respecting citizen privacy and protecting the confidentiality of their data have become hopelessly conflated in the “Big Data” era. There are orders of magnitude more data outside an agency’s firewall than inside it—compromising the integrity of traditional statistical disclosure limitation methods. And increasingly the information processed by the agency was “asked” in a context wholly outside the agency’s operations—blurring the distinction between what was asked and what is published. Already, private businesses like Microsoft, Google and Apple recognize that cybersecurity (safeguarding the integrity and access controls for internal data) and privacy protection (ensuring that what is published does not reveal too much about any person or business) are two sides of the same coin. This is a paradigm-shifting moment for statistical agencies.


2015 ◽  
Vol 31 (1) ◽  
pp. 121-138 ◽  
Author(s):  
Hang J. Kim ◽  
Alan F. Karr ◽  
Jerome P. Reiter

Abstract We compare two general strategies for performing statistical disclosure limitation (SDL) for continuous microdata subject to edit rules. In the first, existing SDL methods are applied, and any constraint-violating values they produce are replaced using a constraint-preserving imputation procedure. In the second, the SDL methods are modified to prevent them from generating violations. We present a simulation study, based on data from the Colombian Annual Manufacturing Survey, that evaluates the performance of the two strategies as applied to several SDL methods. The results suggest that differences in risk-utility profiles across SDL methods dwarf differences between the two general strategies. Among the SDL strategies, variants of microaggregation and partially synthetic data offer the most attractive risk-utility profiles.
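Since microaggregation emerges as one of the most attractive methods in the study, a minimal single-variable sketch may help (an illustration of the general technique, not the authors' implementation, and without the edit-rule handling the paper studies). Records are sorted, grouped into runs of at least k, and each value is replaced by its group mean, so no released value reflects fewer than k respondents.

```python
def microaggregate(values, k=3):
    """Fixed-size microaggregation of one numeric variable.

    Sorts records, partitions them into groups of k (folding a short
    tail into the previous group), and replaces each value with its
    group mean.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    groups = [order[i:i + k] for i in range(0, len(order), k)]
    if len(groups) > 1 and len(groups[-1]) < k:
        groups[-2].extend(groups.pop())    # fold short tail into previous group
    out = [0.0] * len(values)
    for group in groups:
        mean = sum(values[i] for i in group) / len(group)
        for i in group:
            out[i] = mean
    return out
```

Under the paper's first strategy, any group means violating an edit rule would afterwards be repaired by constraint-preserving imputation; under the second, the grouping itself would be constrained to avoid violations.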


2015 ◽  
Vol 31 (4) ◽  
pp. 737-761 ◽  
Author(s):  
Matthias Templ

Abstract Scientific- or public-use files are typically produced by applying anonymisation methods to the original data. Anonymised data should have both low disclosure risk and high data utility. Data utility is often measured by comparing well-known estimates from original data and anonymised data, such as comparing their means, covariances or eigenvalues. However, not every estimate can be preserved. The aim is therefore to preserve the most important estimates; that is, instead of calculating generally defined utility measures, evaluation on context- or data-dependent indicators is proposed. In this article we define such indicators and utility measures for the Structure of Earnings Survey (SES) microdata, and we give guidelines for selecting indicators and models and for evaluating the resulting estimates. For this purpose, hundreds of publications in journals and from national statistical agencies were reviewed to gain insight into how the SES data are used for research and which indicators are relevant for policy making. Besides the mathematical description of the indicators and a brief description of the most common models applied to SES, four different anonymisation procedures are applied and the resulting indicators and models are compared to those obtained from the unmodified data. The disclosure risk is reported and the data utility is evaluated for each of the anonymised data sets based on the most important indicators and a model which is often used in practice.
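The shape of such a context-dependent utility measure can be sketched in a few lines (my own illustration; the function name and the choice of a relative-difference metric are assumptions, not the paper's definitions). The indicator is any estimate analysts actually publish from the data, evaluated on both files.

```python
def relative_utility_loss(indicator, original, anonymised):
    """Context-dependent utility measure: the relative change in a chosen
    indicator (e.g. a mean, a quantile, a wage-gap estimate) between the
    original and the anonymised data. Smaller is better.
    """
    theta = indicator(original)
    theta_anon = indicator(anonymised)
    return abs(theta_anon - theta) / abs(theta)

mean = lambda xs: sum(xs) / len(xs)
# Small perturbations of individual values may leave the mean nearly intact:
loss = relative_utility_loss(mean, [10, 20, 30], [11, 19, 31])
```

The article's point is that the set of indicators plugged in here should come from how the data are actually used in research and policy, not from generic distance measures.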


2019 ◽  
Vol 35 (2) ◽  
pp. 319-336 ◽
Author(s):  
James Chipperfield ◽  
John Newman ◽  
Gwenda Thompson ◽  
Yue Ma ◽  
Yan-Xia Lin

Abstract Many statistical agencies face the challenge of maintaining the confidentiality of respondents while providing as much analytical value as possible from their data. Datasets relating to businesses present particular difficulties because they are likely to contain information about large enterprises that dominate industries and may be more easily identified. Agencies therefore tend to take a cautious approach to releasing business data (e.g., trusted access, remote access and synthetic data). The Australian Bureau of Statistics has developed a remote server, called TableBuilder, which has the capability to allow users to specify and request tables created from business microdata. The tables are confidentialised automatically by perturbing cell values, and the results are returned quickly to the users. The perturbation method is designed to protect against attacks, which are attempts to undo the confidentialisation, such as the well-known differencing attack. This paper considers the risk and utility trade-off when releasing three Australian Bureau of Statistics business collections via its TableBuilder product.
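The defence against the differencing attack rests on making the perturbation a deterministic function of the cell's contributing records, so repeated or overlapping queries cannot be averaged or subtracted to cancel the noise. The following is a rough sketch of that idea (my own illustration in the style of such methods, not the ABS production algorithm; the hash-based noise and the magnitude parameter are assumptions).

```python
import hashlib

def perturb_cell(count, record_keys, magnitude=2):
    """Deterministic cell perturbation keyed on contributing records.

    The noise depends only on the set of record keys contributing to the
    cell, so the same underlying cell always receives the same perturbed
    value, and two tables that differ by one record cannot be differenced
    to recover that record's contribution.
    """
    digest = hashlib.sha256(",".join(sorted(record_keys)).encode()).digest()
    noise = digest[0] % (2 * magnitude + 1) - magnitude  # integer in [-magnitude, magnitude]
    return max(0, count + noise)
```

Because the noise is tied to the cell's membership rather than drawn fresh per query, resubmitting the same table yields identical output, closing off averaging attacks as well.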

