Doubly robust tests of exposure effects under high‐dimensional confounding

Biometrics ◽  
2020 ◽  
Vol 76 (4) ◽  
pp. 1190-1200 ◽  
Author(s):  
Oliver Dukes ◽  
Vahe Avagyan ◽  
Stijn Vansteelandt
2020 ◽  
Author(s):  
James R Staley ◽  
Frank Windmeijer ◽  
Matthew Suderman ◽  
Matthew S Lyon ◽  
George Davey Smith ◽  
...  

Abstract Most studies of high-dimensional phenotypes focus on assessing differences in mean levels (location) of the phenotype by exposure, e.g. epigenome-wide association studies of DNA methylation at CpG sites. However, identifying effects on the variability (scale) of these outcomes, and combining tests of mean and variability (location-and-scale), could provide additional insights into biological mechanisms. Here, we review variability tests, specifically an extended (for continuous exposures) version of the Brown-Forsythe test, and develop a novel joint location-and-scale score test for both categorical and continuous exposures (JLSsc). The Brown-Forsythe test and JLSsc performed well in comparison to alternative approaches in simulations. These approaches identified >7500 CpG sites whose mean or variability was associated with gender or gestational age in cord blood methylation in ARIES (Accessible Resource for Integrated Studies). The Brown-Forsythe test and JLSsc are robust tests that can be used to detect associations not solely driven by a mean effect.
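As a rough illustration of the variability test named in this abstract, the sketch below runs the classical two-group Brown-Forsythe test, which is Levene's test computed on absolute deviations from the group medians. It covers only the categorical-exposure case; the extension to continuous exposures and the JLSsc score test described by the authors are not reproduced here. The simulated data, group sizes, and variable names are assumptions made for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical outcome in two exposure groups: same mean, different spread.
group_a = rng.normal(loc=0.0, scale=1.0, size=200)
group_b = rng.normal(loc=0.0, scale=2.0, size=200)

# A location test (Welch t-test) only looks for a shift in means.
t_stat, t_p = stats.ttest_ind(group_a, group_b, equal_var=False)

# Brown-Forsythe test: Levene's test centred on the group medians.
bf_stat, bf_p = stats.levene(group_a, group_b, center="median")

# The same statistic by hand: one-way ANOVA on |y - group median|.
dev_a = np.abs(group_a - np.median(group_a))
dev_b = np.abs(group_b - np.median(group_b))
f_stat, f_p = stats.f_oneway(dev_a, dev_b)

print(f"Welch t-test p-value:        {t_p:.3g}")
print(f"Brown-Forsythe p-value:      {bf_p:.3g}")
print(f"ANOVA on |deviations| p:     {f_p:.3g}")
```

The hand-rolled ANOVA is included only to show what the statistic measures: a difference in the average absolute deviation from the median, which reflects scale rather than location.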


2021 ◽  
Vol 50 (Supplement_1) ◽  
Author(s):  
Jonathan Huang ◽  
Xiang Meng

Abstract Background Flexible, data-adaptive algorithms (machine learning; ML) for nuisance parameter estimation in epidemiologic causal inference have promising asymptotic properties for complex, high-dimensional data. However, recently proposed applications (e.g. targeted maximum likelihood estimation; TMLE) may produce biased parameter and standard error estimates in common real-world cohort settings. The relative performance of these novel estimators over simpler approaches in such settings is unclear. Methods We apply double-crossfit TMLE, augmented inverse probability weighting (AIPW), and standard IPW to simple simulations (5 covariates) and “real-world” data using covariate-structure-preserving (“plasmode”) simulations of 1,178 subjects and 331 covariates from a longitudinal birth cohort. We evaluate various data generating and estimation scenarios including: under- and over- (e.g. excess orthogonal covariates) identification, poor data support, near-instruments, and mis-specified biological interactions. We also track representative computation times. Results We replicate optimal performance of cross-fit, doubly robust estimators in simple data generating processes. However, in nearly every real-world-based scenario, estimators fit with parametric learners outperform those that include non-parametric learners in terms of mean bias and confidence interval coverage. Even when correctly specified, estimators fit with non-parametric algorithms (xgboost, random forest) performed poorly (e.g. 24% bias, 57% coverage vs. 10% bias, 79% coverage for parametric fit), at times underperforming simple IPW. Conclusions In typical epidemiologic data sets, double-crossfit estimators fit with simple smooth, parametric learners may be the optimal solution, taking 2-5 times less computation time than flexible non-parametric models, while having equal or better performance. No approach is optimal in every setting, and estimators should be compared on simulations close to the source data.
Key messages In epidemiologic studies, use of flexible non-parametric algorithms for effect estimation should be strongly justified (i.e. high-dimensional covariates) and performed with care. Parametric learners may be a safer option with few drawbacks.
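To make the estimators compared in this abstract more concrete, here is a minimal sketch of standard IPW and AIPW for an average treatment effect, using plain parametric nuisance models (logistic regression for the propensity score, linear regression for the outcome). It is not the authors' pipeline: there is no cross-fitting, no TMLE, and no plasmode simulation. The data generating process, the true effect of 1.0, the sample size, and every function name below are assumptions chosen for the example.

```python
import numpy as np

def fit_logistic(X, a, n_iter=25):
    """Logistic regression by Newton-Raphson; X must contain an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (a - p)
        hess = X.T @ (X * (p * (1.0 - p))[:, None])
        beta += np.linalg.solve(hess, grad)
    return beta

def fit_ols(X, y):
    """Ordinary least squares coefficients."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Hypothetical simulation with 5 covariates and a known treatment effect.
rng = np.random.default_rng(1)
n = 5000
W = rng.normal(size=(n, 5))
X = np.column_stack([np.ones(n), W])
true_ps = 1.0 / (1.0 + np.exp(-(0.4 * W[:, 0] - 0.3 * W[:, 1])))
A = rng.binomial(1, true_ps)
true_ate = 1.0
Y = true_ate * A + W[:, 0] + 0.5 * W[:, 2] + rng.normal(size=n)

# Nuisance estimates: propensity score and arm-specific outcome regressions.
ps = 1.0 / (1.0 + np.exp(-X @ fit_logistic(X, A)))
m1 = X @ fit_ols(X[A == 1], Y[A == 1])  # predicted outcome if treated
m0 = X @ fit_ols(X[A == 0], Y[A == 0])  # predicted outcome if untreated

# Standard IPW estimate of the average treatment effect.
ipw = np.mean(A * Y / ps - (1 - A) * Y / (1 - ps))

# AIPW: outcome-model contrast plus inverse-probability-weighted residuals.
phi = m1 - m0 + A * (Y - m1) / ps - (1 - A) * (Y - m0) / (1 - ps)
aipw = phi.mean()
se = phi.std(ddof=1) / np.sqrt(n)

print(f"IPW estimate:  {ipw:.3f}")
print(f"AIPW estimate: {aipw:.3f} (SE {se:.3f})")
```

Swapping the two parametric fits for flexible learners would normally call for sample splitting, which is the role of the cross-fit variants the abstract evaluates.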


Biometrics ◽  
2018 ◽  
Vol 74 (4) ◽  
pp. 1171-1179 ◽  
Author(s):  
Joseph Antonelli ◽  
Matthew Cefalu ◽  
Nathan Palmer ◽  
Denis Agniel

2021 ◽  
Vol 50 (Supplement_1) ◽  
Author(s):  
Margarita Moreno-Betancur ◽  
Nicole L Messina ◽  
Kaya Gardiner ◽  
Nigel Curtis ◽  
Stijn Vansteelandt

Abstract Focus of Presentation Statistical methods for causal mediation analysis are useful for understanding the pathways by which a certain treatment or exposure impacts health outcomes. Existing methods necessitate modelling of the distribution of the mediators, which quickly becomes infeasible when mediators are high-dimensional (e.g., biomarkers). We propose novel data-adaptive methods for estimating the indirect effect of a randomised treatment that acts via a pathway represented by a high-dimensional set of measurements. This work was motivated by the Melbourne Infant Study: BCG for Allergy and Infection Reduction (MIS BAIR), a randomised controlled trial investigating the effect of neonatal tuberculosis vaccination on clinical allergy and infection outcomes, and its mechanisms of action. Findings The proposed methods are doubly robust, which allows us to achieve (uniformly) valid statistical inference, even when machine learning algorithms are used for the two required models. We illustrate these in the context of the MIS BAIR study, investigating the mediating role of immune pathways represented by a high-dimensional vector of cytokine responses under various stimulants. We confirm adequate performance of the proposed methods in an extensive simulation study. Conclusions/Implications The proposed methods provide a feasible and flexible analytic strategy for examining high-dimensional mediators in randomised controlled trials. Key messages Data-adaptive methods for mediation analysis are desirable in the context of high-dimensional mediators, such as biomarkers. We propose novel doubly robust methods, which enable valid statistical inference when using machine learning algorithms for estimation.


Author(s):  
Menglan Pang ◽  
Tibor Schuster ◽  
Kristian B. Filion ◽  
Mireille E. Schnitzer ◽  
Maria Eberg ◽  
...  

Abstract Inverse probability of treatment weighting (IPW) and targeted maximum likelihood estimation (TMLE) are relatively new methods proposed for estimating marginal causal effects. TMLE is doubly robust, yielding consistent estimators even under misspecification of either the treatment or the outcome model. While IPW methods are known to be sensitive to near violations of the practical positivity assumption (e.g., in the case of data sparsity), the consequences of this violation in the TMLE framework for binary outcomes have been less widely investigated. As near practical positivity violations are particularly likely in high-dimensional covariate settings, a better understanding of the performance of TMLE is of particular interest for pharmacoepidemiological studies using large databases. Using plasmode and Monte-Carlo simulation studies, we evaluated the performance of TMLE compared to that of IPW estimators based on a point-exposure cohort study of the marginal causal effect of post-myocardial infarction statin use on the 1-year risk of all-cause mortality from the Clinical Practice Research Datalink. A variety of treatment model specifications were considered, inducing different degrees of near practical non-positivity. Our simulation study showed that the performance of the TMLE and IPW estimators was comparable when the dimension of the fitted treatment model was small to moderate; however, they differed when a large number of covariates was considered. When a rich outcome model was included in the TMLE, estimators were unbiased. In some cases, we found irregular bias and large standard errors with both methods even with a correctly specified high-dimensional treatment model. The IPW estimator showed a slightly better root MSE with high-dimensional treatment model specifications in our simulation setting.
In conclusion, for estimation of the marginal expectation of the outcome under a fixed treatment, TMLE and IPW estimators employing the same treatment model specification may perform differently due to differential sensitivity to practical positivity violations; however, TMLE, being doubly robust, shows improved performance with richer specifications of the outcome model. Although TMLE is appealing for its double robustness property, such violations in a high-dimensional covariate setting are problematic for both methods.
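The sensitivity to near positivity violations discussed in this abstract can be seen directly in the inverse probability weights. The sketch below is an assumed toy setup, not the study's plasmode design: a single simulated confounder drives treatment with increasing strength, and for each strength the code reports the largest weight and Kish's effective sample size. That diagnostic is a common choice but is not taken from the abstract. All names and values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000
confounder = rng.normal(size=n)  # single hypothetical confounder
u = rng.uniform(size=n)          # shared draws, so scenarios differ only in strength

def ipw_diagnostics(strength):
    """Return (max weight, effective sample size) for a given confounder strength."""
    ps = 1.0 / (1.0 + np.exp(-strength * confounder))  # true propensity score
    treated = u < ps
    # Unstabilised weights: 1/ps for treated, 1/(1 - ps) for untreated.
    weights = np.where(treated, 1.0 / ps, 1.0 / (1.0 - ps))
    ess = weights.sum() ** 2 / np.sum(weights ** 2)     # Kish effective sample size
    return weights.max(), ess

results = {s: ipw_diagnostics(s) for s in (0.5, 2.0, 4.0)}
for s, (max_w, ess) in results.items():
    print(f"strength={s}: max weight={max_w:.1f}, effective sample size={ess:.0f}")
```

As the propensity score approaches 0 or 1 for some subjects, a few observations carry most of the weight, which is one way the large standard errors reported for high-dimensional treatment models can arise.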

