Doubly robust tests of exposure effects under high‐dimensional confounding

Oliver Dukes; Vahe Avagyan; Stijn Vansteelandt

doi:10.1111/biom.13231

Robust tests of the equality of two high-dimensional covariance matrices

Communications in Statistics - Theory and Methods, 2020, pp. 1-22

doi:10.1080/03610926.2020.1788085

Author(s): Xuemin Zi; Hui Chen

Keyword(s): Covariance Matrices; High Dimensional; Robust Tests

Doubly robust inference when combining probability and non-probability samples with high dimensional data

Journal of the Royal Statistical Society Series B (Statistical Methodology), 2020, Vol 82 (2), pp. 445-465

doi:10.1111/rssb.12354

Cited By ~ 2

Author(s): Shu Yang; Jae Kwang Kim; Rui Song

Keyword(s): High Dimensional Data; Robust Inference; High Dimensional; Doubly Robust

A robust mean and variance test with application to high-dimensional phenotypes

2020

doi:10.1101/2020.02.06.926584

Author(s): James R Staley; Frank Windmeijer; Matthew Suderman; Matthew S Lyon; George Davey Smith; ...

Keyword(s): Association Studies; Score Test; High Dimensional; Biological Mechanisms; Variance Test; Cpg Sites; Joint Location; Robust Tests; Mean And Variance; Alternative Approaches

Abstract: Most studies of high-dimensional phenotypes focus on assessing differences in mean levels (location) of the phenotype by exposure, e.g. epigenome-wide association studies of DNA methylation at CpG sites. However, identifying effects on the variability (scale) of these outcomes, and combining tests of mean and variability (location-and-scale), could provide additional insights into biological mechanisms. Here, we review variability tests, specifically an extended (for continuous exposures) version of the Brown-Forsythe test, and develop a novel joint location-and-scale score test for both categorical and continuous exposures (JLSsc). The Brown-Forsythe test and JLSsc performed well in comparison to alternative approaches in simulations. These approaches identified >7500 CpG sites at which either the mean or the variability of cord blood methylation was associated with gender or gestational age in ARIES (Accessible Resource for Integrated Epigenomic Studies). The Brown-Forsythe test and JLSsc are robust tests that can be used to detect associations not solely driven by a mean effect.
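
As a rough illustration of the variability test named in this abstract, the sketch below computes the classical Brown-Forsythe statistic for a categorical exposure: a one-way ANOVA F ratio applied to absolute deviations from group medians. It is not the authors' extended test for continuous exposures, nor JLSsc. The function name `brown_forsythe_statistic` and the simulated "phenotype" values are assumptions made for this example only.

```python
import random
import statistics


def brown_forsythe_statistic(groups):
    """One-way ANOVA F ratio on absolute deviations from group medians."""
    # Step 1: replace each value by its absolute deviation from its group median.
    deviations = []
    for g in groups:
        med = statistics.median(g)
        deviations.append([abs(y - med) for y in g])

    k = len(deviations)
    n_total = sum(len(z) for z in deviations)
    group_means = [statistics.fmean(z) for z in deviations]
    grand_mean = sum(sum(z) for z in deviations) / n_total

    # Step 2: ordinary one-way ANOVA F ratio computed on those deviations.
    between = sum(
        len(z) * (m - grand_mean) ** 2 for z, m in zip(deviations, group_means)
    )
    within = sum(
        (v - m) ** 2 for z, m in zip(deviations, group_means) for v in z
    )
    return (between / (k - 1)) / (within / (n_total - k))


rng = random.Random(0)
# Hypothetical phenotype: same mean in both exposure groups, different spread.
unexposed = [rng.gauss(0.0, 1.0) for _ in range(500)]
exposed = [rng.gauss(0.0, 2.0) for _ in range(500)]

stat_scale = brown_forsythe_statistic([unexposed, exposed])

# Shifting a group changes its mean but not its spread,
# so the statistic should be (numerically) unchanged.
shifted = [y + 5.0 for y in exposed]
stat_shifted = brown_forsythe_statistic([unexposed, shifted])
print(stat_scale, stat_shifted)
```

Under the null hypothesis of equal spread, the statistic is conventionally referred to an F distribution with k - 1 and N - k degrees of freedom. The shift check shows why such a test targets scale rather than location.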

521 Performance of doubly-robust, machine learning effect estimators in realistic epidemiologic data settings and practical recommendations

International Journal of Epidemiology, 2021, Vol 50 (Supplement_1)

doi:10.1093/ije/dyab168.293

Author(s): Jonathan Huang; Xiang Meng

Keyword(s): Machine Learning; Real World; Nuisance Parameter; Parametric Models; High Dimensional; Real World Data; Epidemiologic Data; Doubly Robust; Parametric Algorithms; Non Parametric

Abstract

Background: Flexible, data-adaptive algorithms (machine learning; ML) for nuisance parameter estimation in epidemiologic causal inference have promising asymptotic properties for complex, high-dimensional data. However, recently proposed applications (e.g. targeted maximum likelihood estimation; TMLE) may produce biased parameter and standard error estimates in common real-world cohort settings. The relative performance of these novel estimators over simpler approaches in such settings is unclear.

Methods: We apply double-crossfit TMLE, augmented inverse probability weighting (AIPW), and standard IPW to simple simulations (5 covariates) and “real-world” data using covariate-structure-preserving (“plasmode”) simulations of 1,178 subjects and 331 covariates from a longitudinal birth cohort. We evaluate various data generating and estimation scenarios including: under- and over-identification (e.g. excess orthogonal covariates), poor data support, near-instruments, and mis-specified biological interactions. We also track representative computation times.

Results: We replicate optimal performance of cross-fit, doubly robust estimators in simple data generating processes. However, in nearly every real world-based scenario, estimators fit with parametric learners outperform those that include non-parametric learners in terms of mean bias and confidence interval coverage. Even when correctly specified, estimators fit with non-parametric algorithms (xgboost, random forest) performed poorly (e.g. 24% bias, 57% coverage vs. 10% bias, 79% coverage for parametric fit), at times underperforming simple IPW.

Conclusions: In typical epidemiologic data sets, double-crossfit estimators fit with simple smooth, parametric learners may be the optimal solution, taking 2-5 times less computation time than flexible non-parametric models, while having equal or better performance. No approaches are optimal, and estimators should be compared on simulations close to the source data.

Key messages: In epidemiologic studies, use of flexible non-parametric algorithms for effect estimation should be strongly justified (i.e. high-dimensional covariates) and performed with care. Parametric learners may be a safer option with few drawbacks.
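
To make the AIPW estimator mentioned in this abstract concrete, here is a minimal sketch on simulated data with a single binary confounder. It omits cross-fitting and any machine learning, so it is not the authors' double-crossfit procedure. All names (`aipw_ate`, `stratum_mean`, the data-generating values, the true effect of 2) are assumptions made for illustration. The nuisance models are either stratum means (correct here) or marginal means that ignore the confounder (deliberately misspecified).

```python
import random


def mean(values):
    return sum(values) / len(values)


def stratum_mean(values, keep):
    """Mean of the values whose matching flag in `keep` is true."""
    return mean([v for v, k in zip(values, keep) if k])


def aipw_ate(y, a, m1, m0, e):
    """AIPW estimate of E[Y(1)] - E[Y(0)] from per-subject nuisance predictions."""
    terms = [
        (m1_i - m0_i)
        + a_i * (y_i - m1_i) / e_i
        - (1 - a_i) * (y_i - m0_i) / (1 - e_i)
        for y_i, a_i, m1_i, m0_i, e_i in zip(y, a, m1, m0, e)
    ]
    return mean(terms)


# Hypothetical data: binary confounder x, binary exposure a, true effect = 2.
rng = random.Random(1)
n = 20000
x = [int(rng.random() < 0.5) for _ in range(n)]
a = [int(rng.random() < (0.8 if x_i else 0.2)) for x_i in x]
y = [2.0 * a_i + 3.0 * x_i + rng.gauss(0.0, 1.0) for a_i, x_i in zip(a, x)]

# Correctly specified nuisance models: means within strata of x.
e_x = {v: stratum_mean(a, [x_i == v for x_i in x]) for v in (0, 1)}
m_ax = {
    (t, v): stratum_mean(y, [a_i == t and x_i == v for a_i, x_i in zip(a, x)])
    for t in (0, 1)
    for v in (0, 1)
}
e_ok = [e_x[x_i] for x_i in x]
m1_ok = [m_ax[(1, x_i)] for x_i in x]
m0_ok = [m_ax[(0, x_i)] for x_i in x]

# Misspecified nuisance models: ignore the confounder entirely.
e_bad = [mean(a)] * n
m1_bad = [stratum_mean(y, [a_i == 1 for a_i in a])] * n
m0_bad = [stratum_mean(y, [a_i == 0 for a_i in a])] * n

both_ok = aipw_ate(y, a, m1_ok, m0_ok, e_ok)
outcome_bad = aipw_ate(y, a, m1_bad, m0_bad, e_ok)
propensity_bad = aipw_ate(y, a, m1_ok, m0_ok, e_bad)
both_bad = aipw_ate(y, a, m1_bad, m0_bad, e_bad)
print(both_ok, outcome_bad, propensity_bad, both_bad)
```

In this toy setting the first three estimates should sit near the assumed true effect of 2, while the estimate with both models misspecified stays confounded. That is the double robustness property in its simplest form; it says nothing about the finite-sample behaviour with flexible learners that the abstract investigates.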

Doubly robust and efficient estimators for heteroscedastic partially linear single-index models allowing high dimensional covariates

Journal of the Royal Statistical Society Series B (Statistical Methodology), 2012, Vol 75 (2), pp. 305-322

doi:10.1111/j.1467-9868.2012.01040.x

Cited By ~ 21

Author(s): Yanyuan Ma; Liping Zhu

Keyword(s): High Dimensional; Single Index; Partially Linear; Single Index Models; Doubly Robust

Doubly robust matching estimators for high dimensional confounding adjustment

Biometrics, 2018, Vol 74 (4), pp. 1171-1179

doi:10.1111/biom.12887

Cited By ~ 5

Author(s): Joseph Antonelli; Matthew Cefalu; Nathan Palmer; Denis Agniel

Keyword(s): High Dimensional; Matching Estimators; Doubly Robust; Confounding Adjustment

314 Data-adaptive methods for high-dimensional mediation analysis: Application to a randomised trial of tuberculosis vaccination

International Journal of Epidemiology, 2021, Vol 50 (Supplement_1)

doi:10.1093/ije/dyab168.456

Author(s): Margarita Moreno-Betancur; Nicole L Messina; Kaya Gardiner; Nigel Curtis; Stijn Vansteelandt

Keyword(s): Machine Learning; Statistical Inference; Mediation Analysis; Learning Algorithms; Adaptive Methods; Machine Learning Algorithms; High Dimensional; Data Adaptive; Randomised Controlled; Doubly Robust

Abstract

Focus of Presentation: Statistical methods for causal mediation analysis are useful for understanding the pathways by which a certain treatment or exposure impacts health outcomes. Existing methods necessitate modelling of the distribution of the mediators, which quickly becomes infeasible when mediators are high-dimensional (e.g., biomarkers). We propose novel data-adaptive methods for estimating the indirect effect of a randomised treatment that acts via a pathway represented by a high-dimensional set of measurements. This work was motivated by the Melbourne Infant Study: BCG for Allergy and Infection Reduction (MIS BAIR), a randomised controlled trial investigating the effect of neonatal tuberculosis vaccination on clinical allergy and infection outcomes, and its mechanisms of action.

Findings: The proposed methods are doubly robust, which allows us to achieve (uniformly) valid statistical inference, even when machine learning algorithms are used for the two required models. We illustrate these in the context of the MIS BAIR study, investigating the mediating role of immune pathways represented by a high-dimensional vector of cytokine responses under various stimulants. We confirm adequate performance of the proposed methods in an extensive simulation study.

Conclusions/Implications: The proposed methods provide a feasible and flexible analytic strategy for examining high-dimensional mediators in randomised controlled trials.

Key messages: Data-adaptive methods for mediation analysis are desirable in the context of high-dimensional mediators, such as biomarkers. We propose novel doubly robust methods, which enable valid statistical inference when using machine learning algorithms for estimation.

Effect Estimation in Point-Exposure Studies with Binary Outcomes and High-Dimensional Covariate Data – A Comparison of Targeted Maximum Likelihood Estimation and Inverse Probability of Treatment Weighting

The International Journal of Biostatistics, 2016, Vol 12 (2)

doi:10.1515/ijb-2015-0034

Cited By ~ 6

Author(s): Menglan Pang; Tibor Schuster; Kristian B. Filion; Mireille E. Schnitzer; Maria Eberg; ...

Keyword(s): Maximum Likelihood; Maximum Likelihood Estimation; Likelihood Estimation; High Dimensional; Binary Outcomes; Treatment Model; Inverse Probability; Targeted Maximum Likelihood Estimation; Targeted Maximum Likelihood; Doubly Robust

Abstract: Inverse probability of treatment weighting (IPW) and targeted maximum likelihood estimation (TMLE) are relatively new methods proposed for estimating marginal causal effects. TMLE is doubly robust, yielding consistent estimators even under misspecification of either the treatment or the outcome model. While IPW methods are known to be sensitive to near violations of the practical positivity assumption (e.g., in the case of data sparsity), the consequences of this violation in the TMLE framework for binary outcomes have been less widely investigated. As near practical positivity violations are particularly likely in high-dimensional covariate settings, a better understanding of the performance of TMLE is of particular interest for pharmacoepidemiological studies using large databases. Using plasmode and Monte-Carlo simulation studies, we evaluated the performance of TMLE compared to that of IPW estimators based on a point-exposure cohort study of the marginal causal effect of post-myocardial infarction statin use on the 1-year risk of all-cause mortality from the Clinical Practice Research Datalink. A variety of treatment model specifications were considered, inducing different degrees of near practical non-positivity. Our simulation study showed that the performance of the TMLE and IPW estimators was comparable when the dimension of the fitted treatment model was small to moderate; however, they differed when a large number of covariates was considered. When a rich outcome model was included in the TMLE, estimators were unbiased. In some cases, we found irregular bias and large standard errors with both methods even with a correctly specified high-dimensional treatment model. The IPW estimator showed a slightly better root MSE with high-dimensional treatment model specifications in our simulation setting.
In conclusion, for estimation of the marginal expectation of the outcome under a fixed treatment, TMLE and IPW estimators employing the same treatment model specification may perform differently due to differential sensitivity to practical positivity violations; however, TMLE, being doubly robust, shows improved performance with richer specifications of the outcome model. Although TMLE is appealing for its double robustness property, such violations in a high-dimensional covariate setting are problematic for both methods.
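
The sensitivity of IPW to near violations of positivity can be seen in a few lines. The sketch below, which is an illustration and not taken from the paper, computes inverse probability of treatment weights and the Kish effective sample size for two hypothetical sets of propensity scores. The function names and the four-subject data are assumptions chosen to keep the arithmetic easy to follow.

```python
def ipw_weights(treatment, propensity):
    """Weights 1/e for treated subjects and 1/(1 - e) for untreated subjects."""
    return [
        1.0 / e if a == 1 else 1.0 / (1.0 - e)
        for a, e in zip(treatment, propensity)
    ]


def effective_sample_size(weights):
    """Kish effective sample size: (sum of w)^2 / (sum of w^2)."""
    return sum(weights) ** 2 / sum(w * w for w in weights)


# Hypothetical four-subject example.
treatment = [1, 0, 1, 0]

# Well-supported setting: every subject has propensity 0.5.
balanced = ipw_weights(treatment, [0.5, 0.5, 0.5, 0.5])

# Near positivity violation: one treated subject had only a 1% chance of treatment.
sparse = ipw_weights(treatment, [0.01, 0.5, 0.5, 0.5])

ess_balanced = effective_sample_size(balanced)
ess_sparse = effective_sample_size(sparse)
print(ess_balanced, ess_sparse)
```

With equal weights the effective sample size equals the number of subjects. A single near-zero propensity score lets one observation dominate the weighted analysis, which is one way to see why large standard errors can arise in sparse, high-dimensional settings.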
