Ecological Regression with Partial Identification

2019 ◽  
Vol 28 (1) ◽  
pp. 65-86
Author(s):  
Wenxin Jiang ◽  
Gary King ◽  
Allen Schmaltz ◽  
Martin A. Tanner

Ecological inference (EI) is the process of learning about individual behavior from aggregate data. We relax assumptions by allowing for “linear contextual effects,” which previous works have regarded as plausible but avoided due to nonidentification, a problem we sidestep by deriving bounds instead of point estimates. In this way, we offer a conceptual framework to improve on the Duncan–Davis bound, derived more than 65 years ago. To study the effectiveness of our approach, we collect and analyze 8,430 $2\times 2$ EI datasets with known ground truth from several sources—thus bringing considerably more data to bear on the problem than the existing dozen or so datasets available in the literature for evaluating EI estimators. For the 88% of real datasets in our collection that fit a proposed rule, our approach reduces the width of the Duncan–Davis bound, on average, by about 44%, while still capturing the true district-level parameter about 99% of the time. The remaining 12% revert to the Duncan–Davis bound.
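The Duncan–Davis bound referenced above follows directly from the accounting identity of the 2×2 table. A minimal sketch (the variable names and toy precinct data are illustrative assumptions, not the authors' code or data):

```python
# Duncan-Davis (method-of-bounds) computation for 2x2 ecological inference.
import numpy as np

def duncan_davis_bounds(X, T):
    """Precinct-level bounds on beta_b, the unknown fraction of group members
    with the outcome, from the accounting identity T = X*beta_b + (1-X)*beta_w.
    X: group share in each precinct; T: outcome share in each precinct."""
    X = np.asarray(X, dtype=float)
    T = np.asarray(T, dtype=float)
    with np.errstate(divide="ignore", invalid="ignore"):
        lo = np.clip((T - (1.0 - X)) / X, 0.0, 1.0)
        hi = np.clip(T / X, 0.0, 1.0)
    # A precinct with X == 0 carries no information about beta_b.
    lo = np.where(X == 0, 0.0, lo)
    hi = np.where(X == 0, 1.0, hi)
    return lo, hi

# District-level bound: precinct bounds weighted by group population N*X.
X = np.array([0.2, 0.5, 0.8])
T = np.array([0.4, 0.5, 0.7])
N = np.array([1000, 1500, 800])
lo, hi = duncan_davis_bounds(X, T)
w = N * X / (N * X).sum()
print("district bound: [%.3f, %.3f]" % (w @ lo, w @ hi))
```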

2021 ◽  
pp. 089443932110408
Author(s):  
Jose M. Pavía

Ecological inference models aim to infer individual-level relationships from aggregate data. They are routinely used to estimate voter transitions between elections, disclose split-ticket voting behaviors, or infer racial voting patterns in U.S. elections. A large number of procedures have been proposed in the literature to solve these problems; an assessment and comparison of them is therefore overdue. The secret ballot, however, makes this a difficult endeavor, since real individual data are usually not accessible. The most recent work on ecological inference has assessed methods using a very small number of datasets with ground truth, combined with artificial, simulated data. This article dramatically increases the number of real instances by presenting a unique database (available in the R package ei.Datasets) composed of data from more than 550 elections where the true inner-cell values of the global cross-classification tables are known. The article describes how the datasets are organized, details the data curation and data wrangling processes performed, and analyses the main features characterizing the different datasets.
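A hedged illustration of the consistency property such ground-truth data must satisfy: the known inner cells of each unit's cross-classification table must reproduce the observed margins. The sketch is in Python for uniformity with the other examples here, although ei.Datasets itself is an R package, and the toy tables are invented:

```python
import numpy as np

def check_margins(cells, row_margins, col_margins, tol=1e-9):
    """cells: (units, R, C) array of true inner-cell counts;
    row/col_margins: observed aggregate totals per unit."""
    ok_rows = np.allclose(cells.sum(axis=2), row_margins, atol=tol)
    ok_cols = np.allclose(cells.sum(axis=1), col_margins, atol=tol)
    return ok_rows and ok_cols

# Toy example: 2 polling units, one 2x2 vote-transfer table each.
cells = np.array([[[30, 10], [5, 55]],
                  [[20, 20], [10, 50]]])
rows = np.array([[40, 60], [40, 60]])   # observed row totals per unit
cols = np.array([[35, 65], [30, 70]])   # observed column totals per unit
print(check_margins(cells, rows, cols))  # True
```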


1969 ◽  
Vol 63 (4) ◽  
pp. 1183-1196 ◽  
Author(s):  
W. Phillips Shively

Because they are inexpensive and easy to obtain, because they may be available under circumstances in which survey data are unavailable, and because they eliminate many of the measurement problems of survey research, data on geographic units such as counties or census tracts are often used by political scientists to measure individual behavior. This has involved us in the long-standing problem of inferring individual-level relationships from aggregate data, which was first raised by W. S. Robinson in the early nineteen fifties. In this paper, I shall first discuss the problem raised by Robinson. I shall then review three partial solutions to the problem—the Duncan-Davis method of setting limits, Blalock's version of ecological regression, and Goodman's version of ecological regression. Finally, I shall propose some ways in which Goodman's method may be used so as to reduce the problem of bias in its estimates, and make it a more reasonable tool for research. Our difficulty, as Robinson showed, is that we cannot necessarily infer the correlation between variables, taking people as the unit of analysis, on the basis of correlations between the same variables based on groups of people as units. For example, the “ecological” correlation between per cent black and per cent illiterate is +0.946, whereas the correlation between color and illiteracy among individuals is only +0.203.
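Goodman's version of ecological regression, reviewed above, rests on the assumption that the individual-level rates are constant across units; under that assumption the intercept and slope of an OLS fit recover them. A minimal simulated sketch (all names and numbers are illustrative):

```python
# Goodman's ecological regression under the constancy assumption:
# T_i = beta_w + (beta_b - beta_w) * X_i, fit by OLS across counties.
import numpy as np

rng = np.random.default_rng(0)
beta_b, beta_w = 0.7, 0.3          # true individual-level rates (simulated)
X = rng.uniform(0.1, 0.9, 200)     # group share per county
T = beta_w + (beta_b - beta_w) * X + rng.normal(0, 0.02, 200)

# Intercept estimates beta_w; intercept + slope estimates beta_b.
slope, intercept = np.polyfit(X, T, 1)
print("beta_w ~ %.3f, beta_b ~ %.3f" % (intercept, intercept + slope))
```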


Author(s):  
Wendy Tam Cho ◽  
Charles F. Manski

This article reports the main methodological approaches to the statistical problem of ecological inference. It describes the fundamental indeterminacy of the problem and provides a framework that coherently binds the variety of approaches that have been proposed to address it. An overview of these various approaches and their respective contributions is then given. The ecological inference problem is situated within the literature on partial identification, and recent work generalizing the use of logical bounds on possible solutions as an identification region for the general r × c problem is explained. The article closes with some admonitions about this fascinating problem, which has enthralled decades of scholars from varied disciplines. The analysis by Duncan and Davis made clear that aggregate data only partially reveal the structure of individual behaviour. However, their contribution has largely been viewed as limited, and an appreciation for the idea of bounds or an identification region has yet to fully emerge.
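The "logical bounds as an identification region" idea extends to the general r × c table: with row and column margins fixed, the feasible values of any single inner cell form an interval that can be found by linear programming. A hedged sketch for one unit (the 3 × 2 margins are invented):

```python
import numpy as np
from scipy.optimize import linprog

row = np.array([40., 35., 25.])   # e.g., group sizes
col = np.array([60., 40.])        # e.g., vote totals
R, C = len(row), len(col)

# Equality constraints: each row and each column of the table sums to its margin.
A_eq = np.zeros((R + C, R * C))
for r in range(R):
    A_eq[r, r * C:(r + 1) * C] = 1.0
for c in range(C):
    A_eq[R + c, c::C] = 1.0
b_eq = np.concatenate([row, col])

obj = np.zeros(R * C)
obj[0] = 1.0  # bound cell (0, 0)
lo = linprog(obj, A_eq=A_eq, b_eq=b_eq, bounds=(0, None)).fun
hi = -linprog(-obj, A_eq=A_eq, b_eq=b_eq, bounds=(0, None)).fun
print("cell (0,0) identification interval: [%.1f, %.1f]" % (lo, hi))
```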


2003 ◽  
Vol 11 (1) ◽  
pp. 44-64 ◽  
Author(s):  
Michael C. Herron ◽  
Kenneth W. Shotts

The practice of using point estimates produced by the King ecological inference technique as dependent variables in second-stage linear regressions leads to second-stage results that, in general, are inconsistent. This conclusion holds even when all assumptions behind King's ecological technique are satisfied. Second-stage inconsistency is a consequence of the fact that King-based point estimates of disaggregated quantities contain errors correlated with the true quantities the estimates measure. Our findings on second-stage inconsistency, as well as a fix that we propose, follow from econometric theory in conjunction with an analysis of simulated and real ecological data sets.
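The mechanism behind the inconsistency can be mimicked in a stylized simulation (this is not King's estimator; the shrinkage factor and all numbers are assumptions): when first-stage estimates shrink the true quantities toward a common mean, the measurement error is correlated with the truth and the second-stage slope is attenuated.

```python
import numpy as np

rng = np.random.default_rng(1)
n, gamma = 5000, 0.8
Z = rng.normal(size=n)
b_true = 0.5 + gamma * Z + rng.normal(0, 0.3, n)      # true disaggregated quantity
b_hat = b_true.mean() + 0.6 * (b_true - b_true.mean()) \
        + rng.normal(0, 0.1, n)                        # shrunken "EI estimate"

slope_true = np.polyfit(Z, b_true, 1)[0]
slope_hat = np.polyfit(Z, b_hat, 1)[0]
print("second stage with truth: %.3f; with estimates: %.3f" % (slope_true, slope_hat))
# The estimated slope converges to 0.6 * gamma = 0.48, not gamma = 0.8.
```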


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
João Lobo ◽  
Rui Henriques ◽  
Sara C. Madeira

Background: Three-way data have gained popularity due to their capacity to describe inherently multivariate and temporal events, such as biological responses, social interactions over time, urban dynamics, or complex geophysical phenomena. Triclustering, the subspace clustering of three-way data, enables the discovery of patterns corresponding to data subspaces (triclusters) with values correlated across the three dimensions (observations × features × contexts). With an increasing number of algorithms being proposed, effectively comparing them with the state of the art is paramount. These comparisons are usually performed using real data without a known ground truth, thus limiting the assessments. In this context, we propose a synthetic data generator, G-Tric, allowing the creation of synthetic datasets with configurable properties and the possibility to plant triclusters. The generator is prepared to create datasets resembling real three-way data from biomedical and social domains, with the additional advantage of providing the ground truth (the triclustering solution) as output.

Results: G-Tric can replicate real-world datasets and create new ones that match researchers' needs across several properties, including data type (numeric or symbolic), dimensions, and background distribution. Users can tune the patterns and structure that characterize the planted triclusters (subspaces) and how they interact (overlapping). Data quality can also be controlled by defining the amount of missing values, noise, or errors. Furthermore, a benchmark of datasets resembling real data is made available, together with the corresponding triclustering solutions (planted triclusters) and generating parameters.

Conclusions: Triclustering evaluation using G-Tric makes it possible to combine intrinsic and extrinsic metrics, yielding more reliable comparisons of solutions. A set of predefined datasets, mimicking widely used three-way data and exploring crucial properties, was generated and made available, highlighting G-Tric's potential to advance the triclustering state of the art by easing the evaluation of new triclustering approaches.
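The core idea of planting a ground-truth tricluster can be sketched in a few lines (a toy illustration, not G-Tric itself; sizes, distributions, and the constant pattern are assumptions):

```python
import numpy as np

rng = np.random.default_rng(7)
data = rng.normal(0, 1, size=(50, 30, 10))   # observations x features x contexts

# Plant one near-constant tricluster on a random index subspace.
obs = rng.choice(50, 8, replace=False)
feat = rng.choice(30, 5, replace=False)
ctx = rng.choice(10, 3, replace=False)
data[np.ix_(obs, feat, ctx)] = 3.0 + rng.normal(0, 0.05, (8, 5, 3))

# Optionally corrupt with missing values, as a data-quality control.
mask = rng.random(data.shape) < 0.01
data[mask] = np.nan

ground_truth = {"observations": obs, "features": feat, "contexts": ctx}
print({k: v.size for k, v in ground_truth.items()})
```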


2021 ◽  
Author(s):  
Jakob Raymaekers ◽  
Peter J. Rousseeuw

Many real data sets contain numerical features (variables) whose distribution is far from normal (Gaussian). Instead, their distribution is often skewed. In order to handle such data it is customary to preprocess the variables to make them more normal. The Box–Cox and Yeo–Johnson transformations are well-known tools for this. However, the standard maximum likelihood estimator of their transformation parameter is highly sensitive to outliers, and will often try to move outliers inward at the expense of the normality of the central part of the data. We propose a modification of these transformations as well as an estimator of the transformation parameter that is robust to outliers, so the transformed data can be approximately normal in the center and a few outliers may deviate from it. It compares favorably to existing techniques in an extensive simulation study and on real data.
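The sensitivity the authors address is easy to reproduce with a standard maximum likelihood fit, e.g. scipy's; the robust estimator proposed in the paper is not reproduced here, and the data and outliers below are invented:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
clean = rng.lognormal(mean=0.0, sigma=0.6, size=500)        # skewed data

_, lam_clean = stats.yeojohnson(clean)                      # ML fit of lambda
contaminated = np.concatenate([clean, [50.0, 60.0, 80.0]])  # a few outliers
_, lam_contam = stats.yeojohnson(contaminated)

print("ML lambda, clean: %.2f; with 3 outliers: %.2f" % (lam_clean, lam_contam))
# The ML estimate chases the outliers, degrading normality in the center.
```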


Entropy ◽  
2020 ◽  
Vol 23 (1) ◽  
pp. 62
Author(s):  
Zhengwei Liu ◽  
Fukang Zhu

Thinning operators play an important role in the analysis of integer-valued autoregressive models, and the most widely used is binomial thinning. Inspired by the theory of extended Pascal triangles, a new thinning operator named extended binomial is introduced, which generalizes binomial thinning. Compared to the binomial thinning operator, the extended binomial thinning operator has two parameters and is more flexible in modeling. Based on the proposed operator, a new integer-valued autoregressive model is introduced, which can accurately and flexibly capture the dispersion features of count time series. Two-step conditional least squares (CLS) estimation is investigated for the innovation-free case, and conditional maximum likelihood estimation is also discussed. The asymptotic properties of the two-step CLS estimator are obtained. Finally, three overdispersed or underdispersed real datasets are considered to illustrate the superior performance of the proposed model.
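For reference, the classical INAR(1) recursion with binomial thinning, which the extended binomial operator generalizes, can be simulated as follows (parameters are illustrative; the extended operator itself is not reproduced here):

```python
# INAR(1): X_t = alpha o X_{t-1} + eps_t, where "o" thins each of the
# X_{t-1} counts independently with survival probability alpha.
import numpy as np

rng = np.random.default_rng(11)

def inar1(n, alpha=0.5, lam=2.0, x0=5):
    """Simulate an INAR(1) path with Poisson(lam) innovations."""
    x = np.empty(n, dtype=int)
    x[0] = x0
    for t in range(1, n):
        survivors = rng.binomial(x[t - 1], alpha)   # binomial thinning
        x[t] = survivors + rng.poisson(lam)         # plus new arrivals
    return x

path = inar1(500)
print("mean %.2f, variance %.2f" % (path.mean(), path.var()))
# Poisson INAR(1) is equidispersed; flexible thinning operators target
# data where mean and variance differ.
```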


Econometrics ◽  
2021 ◽  
Vol 9 (1) ◽  
pp. 10
Author(s):  
Šárka Hudecová ◽  
Marie Hušková ◽  
Simos G. Meintanis

This article considers goodness-of-fit tests for bivariate INAR and bivariate Poisson autoregression models. The test statistics are based on an L2-type distance between two estimators of the probability generating function of the observations: one entirely nonparametric, and the other semiparametric, computed under the corresponding null hypothesis. The asymptotic distribution of the proposed test statistics is derived both under the null hypotheses and under alternatives, and consistency is proved. The case of testing bivariate generalized Poisson autoregression and extensions of the methods to dimensions higher than two are also discussed. The finite-sample performance of a parametric bootstrap version of the tests is illustrated via a series of Monte Carlo experiments. The article concludes with applications to real datasets and discussion.
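A hedged, univariate simplification of the test construction (the article's setting is bivariate INAR/Poisson autoregression; the plain i.i.d. Poisson null below is purely for illustration):

```python
# L2 distance between the empirical probability generating function and the
# PGF implied by a fitted null model, integrated over u in [0, 1].
import numpy as np

rng = np.random.default_rng(5)
x = rng.poisson(3.0, size=300)

u = np.linspace(0.0, 1.0, 101)                       # integration grid on [0, 1]
g_emp = (u[:, None] ** x[None, :]).mean(axis=1)      # empirical PGF
g_null = np.exp(x.mean() * (u - 1.0))                # Poisson PGF at the MLE

du = u[1] - u[0]
stat = len(x) * np.sum((g_emp - g_null) ** 2) * du   # discretized L2-type distance
print("test statistic: %.4f" % stat)
# Critical values would come from a parametric bootstrap, as in the article.
```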


Information ◽  
2021 ◽  
Vol 12 (5) ◽  
pp. 202
Author(s):  
Louai Alarabi ◽  
Saleh Basalamah ◽  
Abdeltawab Hendawi ◽  
Mohammed Abdalla

The rapid spread of infectious diseases is a major public health problem. Recent developments in fighting these diseases have heightened the need for a contact tracing process. Contact tracing can be considered an ideal method for controlling the transmission of infectious diseases. Contact tracing leads to diagnostic testing, self-isolation or treatment of suspected cases, and treatment of infected persons, which ultimately limits the spread of disease. This paper proposes a technique named TraceAll that traces all contacts exposed to an infected patient and produces a list of these contacts to be considered potentially infected patients. Initially, it treats the infected patient as the querying user and fetches the contacts exposed to him. Secondly, it obtains all the trajectories of objects that moved near the querying user. Next, it investigates these trajectories, considering social distance and exposure period, to identify whether these objects may have become infected. An experimental evaluation of the proposed technique on real data sets illustrates the effectiveness of this solution. Comparative experiments confirm that TraceAll outperforms baseline methods by 40% in the efficiency of answering contact tracing queries.
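A simplified sketch of the contact query described above (a brute-force scan, not the TraceAll index; the distance and exposure thresholds are assumptions):

```python
import numpy as np

def exposed(query_traj, other_traj, max_dist=2.0, min_exposure=3):
    """Trajectories are (T, 2) arrays of positions at common timestamps.
    Returns True if the other object stays within max_dist of the querying
    user for at least min_exposure consecutive samples."""
    close = np.linalg.norm(query_traj - other_traj, axis=1) <= max_dist
    run = 0
    for c in close:
        run = run + 1 if c else 0
        if run >= min_exposure:
            return True
    return False

t = np.arange(10)[:, None]
patient = np.hstack([t, np.zeros_like(t)]).astype(float)
nearby = patient + np.array([0.0, 1.0])      # walks alongside: exposed
passerby = patient + np.array([0.0, 50.0])   # stays far away: not exposed
print(exposed(patient, nearby), exposed(patient, passerby))  # True False
```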


Symmetry ◽  
2021 ◽  
Vol 13 (3) ◽  
pp. 474
Author(s):  
Abdulhakim A. Al-Babtain ◽  
Ibrahim Elbatal ◽  
Hazem Al-Mofleh ◽  
Ahmed M. Gemeay ◽  
Ahmed Z. Afify ◽  
...  

In this paper, we introduce a new flexible generator of continuous distributions called the transmuted Burr X-G (TBX-G) family, which extends and increases the flexibility of the Burr X generator. The general statistical properties of the TBX-G family are derived. One special sub-model, the TBX-exponential distribution, is studied in detail. We discuss eight approaches to estimating the TBX-exponential parameters, and numerical simulations are conducted to compare the suggested approaches based on partial and overall ranks. Based on our study, the Anderson–Darling estimators are recommended for estimating the TBX-exponential parameters. Using two skewed real data sets from the engineering sciences, we illustrate the importance and flexibility of the TBX-exponential model compared with other existing competing distributions.
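Anderson–Darling estimation, the approach recommended above, is a minimum-distance method: choose the parameters minimizing the AD statistic between the fitted CDF and the sample. A sketch with a plain exponential CDF standing in for the TBX-exponential, whose CDF is not reproduced here:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(9)
x = np.sort(rng.exponential(scale=2.0, size=200))
n = len(x)

def ad_stat(rate):
    """Anderson-Darling distance between the sample and Exp(rate)."""
    F = 1.0 - np.exp(-rate * x)          # fitted CDF at the order statistics
    F = np.clip(F, 1e-12, 1 - 1e-12)
    i = np.arange(1, n + 1)
    return -n - np.mean((2 * i - 1) * (np.log(F) + np.log1p(-F[::-1])))

res = minimize_scalar(ad_stat, bounds=(1e-3, 10.0), method="bounded")
print("AD estimate of rate: %.3f (true 0.5)" % res.x)
```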

