Practical and Effective Approaches to Dealing With Clustered Data

2018 ◽  
Vol 7 (3) ◽  
pp. 541-559 ◽  
Author(s):  
Justin Esarey ◽  
Andrew Menger

Cluster-robust standard errors (as implemented by the eponymous cluster option in Stata) can produce misleading inferences when the number of clusters G is small, even if the model is consistent and there are many observations in each cluster. Nevertheless, political scientists commonly employ this method in data sets with few clusters. The contributions of this paper are: (a) developing new and easy-to-use Stata and R packages that implement alternative uncertainty measures robust to small G, and (b) explaining and providing evidence for the advantages of these alternatives, especially cluster-adjusted t-statistics based on Ibragimov and Müller. To illustrate these advantages, we reanalyze recent work where results are based on cluster-robust standard errors.
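To make the quantity under discussion concrete, the following minimal sketch computes a cluster-robust (sandwich) standard error for the slope of a one-regressor OLS model. It is not the authors' Stata or R package. The function name, the toy data in the usage note, and the omission of any small-sample correction are assumptions made purely for illustration.

```python
import math
from collections import defaultdict
from statistics import mean


def cluster_robust_slope_se(x, y, cluster):
    """Return (slope, cluster-robust SE) for OLS of y on x with an intercept.

    Uses the uncorrected sandwich formula: sum the score x_dev * residual
    within each cluster, square the cluster totals, and scale by the
    squared sum of squares of x_dev. No finite-sample adjustment is applied.
    """
    x_bar, y_bar = mean(x), mean(y)
    x_dev = [xi - x_bar for xi in x]
    sxx = sum(d * d for d in x_dev)
    slope = sum(d * (yi - y_bar) for d, yi in zip(x_dev, y)) / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

    # Accumulate the score within each cluster before squaring, which is
    # what lets errors be correlated inside a cluster.
    scores = defaultdict(float)
    for d, e, g in zip(x_dev, resid, cluster):
        scores[g] += d * e

    variance = sum(s * s for s in scores.values()) / sxx ** 2
    return slope, math.sqrt(variance)
```

For example, `cluster_robust_slope_se([0, 1, 2, 3], [0, 1, 1, 3], ["a", "a", "b", "b"])` treats the four observations as two clusters of two. With so few clusters the resulting standard error is exactly the kind of estimate the abstract warns can mislead.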

2017 ◽  
Vol 18 (3) ◽  
pp. 268-283
Author(s):  
Felix Canitz ◽  
Panagiotis Ballis-Papanastasiou ◽  
Christian Fieberg ◽  
Kerstin Lopatta ◽  
Armin Varmaz ◽  
...  

Purpose The purpose of this paper is to review and evaluate the methods commonly used in accounting literature to correct for cointegrated data and data that are neither stationary nor cointegrated. Design/methodology/approach The authors conducted Monte Carlo simulations according to Baltagi et al. (2011), Petersen (2009) and Gow et al. (2010), to analyze how regression results are affected by the possible nonstationarity of the variables of interest. Findings The results of this study suggest that biases in regression estimates can be reduced and valid inferences can be obtained by using robust standard errors clustered by firm, clustered by firm and time or Fama–MacBeth t-statistics based on the mean and standard errors of the cross section of coefficients from time-series regressions. Originality/value The findings of this study are suited to guide future researchers regarding which estimation methods are the most reliable given the possible nonstationarity of the variables of interest.
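The t-statistic described in the findings, built from the mean and standard error of a cross section of time-series coefficients, can be sketched as follows. This is an illustrative reading of the abstract's description, not the authors' simulation code. The function names and the `panel` layout (a dict mapping a firm identifier to its x and y series) are hypothetical.

```python
import math
from statistics import mean, stdev


def ols_slope(x, y):
    """Slope from a simple OLS regression of y on x with an intercept."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy / sxx


def cross_section_t(panel):
    """Return (mean slope, standard error, t-statistic).

    panel maps each firm id to a pair (x series, y series). One time-series
    regression is run per firm; the t-statistic is the mean of the firm
    slopes divided by their cross-sectional standard error.
    """
    slopes = [ols_slope(x, y) for x, y in panel.values()]
    avg = mean(slopes)
    se = stdev(slopes) / math.sqrt(len(slopes))
    return avg, se, avg / se
```

At least two firms with differing slopes are needed, since the standard error is computed from the sample standard deviation of the firm-level coefficients.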


2019 ◽  
Vol 28 (3) ◽  
pp. 318-339
Author(s):  
John E. Jackson

The use of cluster robust standard errors (CRSE) is common as data are often collected from units, such as cities, states or countries, with multiple observations per unit. There is considerable discussion of how best to estimate standard errors and confidence intervals when using CRSE (Harden 2011; Imbens and Kolesár 2016; MacKinnon and Webb 2017; Esarey and Menger 2019). Extensive simulations in this literature and here show that CRSE seriously underestimate coefficient standard errors and their associated confidence intervals, particularly with a small number of clusters and when there is little within cluster variation in the explanatory variables. These same simulations show that a method developed here provides more reliable estimates of coefficient standard errors. They underestimate confidence intervals for tests of individual and sets of coefficients in extreme conditions, but by far less than do CRSE. Simulations also show that this method produces more accurate standard error and confidence interval estimates than bootstrapping, which is often recommended as an alternative to CRSE.


2020 ◽  
pp. 1-20
Author(s):  
Chad Hazlett ◽  
Leonard Wainstein

When working with grouped data, investigators may choose between “fixed effects” models (FE) with specialized (e.g., cluster-robust) standard errors, or “multilevel models” (MLMs) employing “random effects.” We review the claims given in published works regarding this choice, then clarify how these approaches work and compare by showing that: (i) random effects employed in MLMs are simply “regularized” fixed effects; (ii) unmodified MLMs are consequently susceptible to bias, but there is a longstanding remedy; and (iii) the “default” MLM standard errors rely on narrow assumptions that can lead to undercoverage in many settings. Our review of over 100 papers using MLM in political science, education, and sociology shows that these “known” concerns have been widely ignored in practice. We describe how to debias MLM’s coefficient estimates, and provide an option to more flexibly estimate their standard errors. Most illuminating, once MLMs are adjusted in these two ways the point estimate and standard error for the target coefficient are exactly equal to those of the analogous FE model with cluster-robust standard errors. For investigators working with observational data and who are interested only in inference on the target coefficient, either approach is equally appropriate and preferable to uncorrected MLM.


2021 ◽  
Vol 12 (2) ◽  
pp. 317-334
Author(s):  
Omar Alaqeeli ◽  
Li Xing ◽  
Xuekui Zhang

The classification tree is a widely used machine learning method. It has multiple implementations as R packages: rpart, ctree, evtree, tree and C5.0. The details of these implementations are not the same, and hence their performances differ from one application to another. We are interested in their performance in the classification of cells using single-cell RNA-sequencing data. In this paper, we conducted a benchmark study using 22 single-cell RNA-sequencing data sets. Using cross-validation, we compare the packages’ prediction performances based on their Precision, Recall, F1-score, and Area Under the Curve (AUC). We also compared the Complexity and Run-time of these R packages. Our study shows that rpart and evtree have the best Precision; evtree is the best in Recall, F1-score and AUC; C5.0 prefers more complex trees; tree is consistently much faster than the others, although its complexity is often higher than that of the others.
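For readers unfamiliar with the metrics named above, the sketch below shows how Precision, Recall and F1-score are computed from true and predicted labels. It covers only the binary (one class versus the rest) case and is written in Python rather than R. The function name and the `positive` parameter are illustrative assumptions, not part of the benchmark study.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Return (precision, recall, F1) for the class labelled `positive`.

    Ratios with a zero denominator are reported as 0.0, which is a
    convention chosen here for simplicity.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1
```

In a multi-class setting such as cell-type classification, one common option is to compute these values per class and then average them. Whether the study did so is not stated in the abstract.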


Complexity ◽  
2018 ◽  
Vol 2018 ◽  
pp. 1-16 ◽  
Author(s):  
Yiwen Zhang ◽  
Yuanyuan Zhou ◽  
Xing Guo ◽  
Jintao Wu ◽  
Qiang He ◽  
...  

The K-means algorithm is one of the ten classic algorithms in the area of data mining and has been studied by researchers in numerous fields for a long time. However, the value of the clustering number k in the K-means algorithm is not always easy to determine, and the selection of the initial centers is vulnerable to outliers. This paper proposes an improved K-means clustering algorithm called the covering K-means algorithm (C-K-means). The C-K-means algorithm can not only acquire efficient and accurate clustering results but also self-adaptively provide a reasonable number of clusters based on the data features. It includes two phases: the initialization of the covering algorithm (CA) and the Lloyd iteration of the K-means. The first phase executes the CA. CA self-organizes and recognizes the number of clusters k based on the similarities in the data, and it requires neither the number of clusters to be prespecified nor the initial centers to be manually selected. Therefore, it has a “blind” feature, that is, k is not preselected. The second phase performs the Lloyd iteration based on the results of the first phase. The C-K-means algorithm combines the advantages of CA and K-means. Experiments are carried out on the Spark platform, and the results verify the good scalability of the C-K-means algorithm. This algorithm can effectively solve the problem of large-scale data clustering. Extensive experiments on real data sets show that the accuracy and efficiency of the C-K-means algorithm outperform those of the existing algorithms under both sequential and parallel conditions.
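The second phase mentioned above, the Lloyd iteration, can be sketched as follows. The covering algorithm of the first phase is not described in enough detail in the abstract to reproduce, so this sketch simply takes the initial centers as an input. The function name and parameters are assumptions for illustration, and the code is sequential rather than Spark-based.

```python
import math


def lloyd(points, centers, max_iter=100):
    """Run Lloyd iterations from the given initial centers.

    points and centers are sequences of equal-length numeric tuples.
    Returns (centers, clusters), where clusters[j] lists the points
    assigned to centers[j] in the last assignment step performed.
    """
    centers = [tuple(c) for c in centers]
    clusters = [[] for _ in centers]
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            j = min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            clusters[j].append(p)

        # Update step: move each center to the mean of its members.
        # An empty cluster keeps its previous center.
        new_centers = [
            tuple(sum(coord) / len(members) for coord in zip(*members))
            if members else centers[j]
            for j, members in enumerate(clusters)
        ]
        if new_centers == centers:
            break  # converged: centers no longer move
        centers = new_centers
    return centers, clusters
```

Because the number of centers passed in fixes k, any method that chooses k and the starting centers automatically, as the abstract says CA does, can be plugged in ahead of this step.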


2018 ◽  
Vol 8 (1) ◽  
Author(s):  
Otávio Bartalotti

In regression discontinuity designs (RD), for a given bandwidth, researchers can estimate standard errors based on different variance formulas obtained under different asymptotic frameworks. In the traditional approach the bandwidth shrinks to zero as sample size increases; alternatively, the bandwidth could be treated as fixed. The main theoretical results for RD rely on the former, while most applications in the literature treat the estimates as parametric, implementing the usual heteroskedasticity-robust standard errors. This paper develops the “fixed-bandwidth” alternative asymptotic theory for RD designs, which sheds light on the connection between both approaches. I provide alternative formulas (approximations) for the bias and variance of common RD estimators, and conditions under which both approximations are equivalent. Simulations document the improvements in test coverage that fixed-bandwidth approximations achieve relative to traditional approximations, especially when there is local heteroskedasticity. Feasible estimators of fixed-bandwidth standard errors are easy to implement and are akin to treating RD estimators as locally parametric, validating the common empirical practice of using heteroskedasticity-robust standard errors in RD settings. Bias mitigation approaches are discussed and a novel bootstrap higher-order bias correction procedure based on the fixed bandwidth asymptotics is suggested.


2021 ◽  
Author(s):  
Amanda Justine Lai ◽  
Ramya Ambikapathi ◽  
Oliver Cumming ◽  
Krisna Seng ◽  
Irene Velez ◽  
...  

Background Inadequate nutrition in early life and exposure to sanitation-related enteric pathogens have been linked to poor growth outcomes in children. Despite rapid development in Cambodia, a high prevalence of growth faltering and stunting persists among children. This study aimed to assess nutrition and water, sanitation, and hygiene (WASH) variables and their association with nutritional status of children under 24 months in rural Cambodia. Methods We conducted surveys in 491 villages across 55 rural communes in Cambodia in September 2016 to measure associations between child, household, and community-level risk factors for stunting and length-for-age z-score (LAZ). A primary survey measured child-level variables, including anthropometric measures and risk factors for growth faltering and stunting, for 4,036 children under 24 months of age from 3,877 households (approximately 8 households per village). A secondary survey of 5,341 households, including the same households from the primary survey and an additional 1,464 households (approximately 3 additional households per village) from the same villages, assessed village-level WASH variables to understand community WASH conditions that may influence child growth outcomes. For LAZ, we calculated bivariate and adjusted associations (as mean differences) with 95% confidence intervals using generalized estimating equations (GEEs) to fit linear regression models with robust standard errors. For stunting, we calculated unadjusted and adjusted prevalence ratios (PRs) with 95% confidence intervals using GEEs to fit Poisson regression models with robust standard errors. For all models assessing effects of household-level variables, we used GEEs to account for clustering at the village level.
Findings After adjustment for potential confounders, presence of water and soap at a household's handwashing station was found to be significantly associated (p<0.05) with increased LAZ (adjusted mean difference in LAZ +0.10, 95% CI 0.03 to 0.16), and household use of an improved drinking water source was associated with less stunting in children compared to households that did not use an improved source of drinking water (aPR 0.81, 95% CI 0.66 to 0.98); breastfeeding and community-level access to an improved drinking water source were associated with a lower LAZ score (-0.16, 95% CI -0.27 to -0.05; -0.13, 95% CI -0.26 to 0.00). No other nutrition (i.e., dietary diversity, meal frequency) or sanitation variables (i.e., household's safe disposal of child stools, household-level sanitation, community-level sanitation) were found to be associated with LAZ scores or stunting in children under 24 months of age.

