Aggregating Knockoffs for False Discovery Rate Control with an Application to Gut Microbiome Data

Fang Xie; Johannes Lederer

doi:10.3390/e23020230

Entropy ◽

10.3390/e23020230 ◽

2021 ◽

Vol 23 (2) ◽

pp. 230

Author(s):

Fang Xie ◽

Johannes Lederer

Keyword(s):

Microbial Diversity ◽

Rate Control ◽

Gut Microbiome ◽

High Dimensional ◽

Health And Wellbeing ◽

False Discovery ◽

Recent Approach ◽

Lasso Estimator ◽

Microbiome Data ◽

False Discoveries

Recent discoveries suggest that our gut microbiome plays an important role in our health and wellbeing. However, the gut microbiome data are intricate; for example, the microbial diversity in the gut makes the data high-dimensional. While there are dedicated high-dimensional methods, such as the lasso estimator, they always come with the risk of false discoveries. Knockoffs are a recent approach to control the number of false discoveries. In this paper, we show that knockoffs can be aggregated to increase power while retaining sharp control over the false discoveries. We support our method both in theory and simulations, and we show that it can lead to new discoveries on microbiome data from the American Gut Project. In particular, our results indicate that several phyla that have been overlooked so far are associated with obesity.

False discovery control for penalized variable selections with high-dimensional covariates

Statistical Applications in Genetics and Molecular Biology ◽

10.1515/sagmb-2018-0038 ◽

2018 ◽

Vol 17 (6) ◽

Cited By ~ 1

Author(s):

Kevin He ◽

Xiang Zhou ◽

Hui Jiang ◽

Xiaoquan Wen ◽

Yi Li

Keyword(s):

Variable Selection ◽

Linear Models ◽

Broad Class ◽

High Dimensional ◽

High Throughput Data ◽

False Discovery ◽

Dimensional Variable ◽

Selection Algorithms ◽

False Discoveries ◽

Linear Regressions

Abstract: Modern biotechnologies have produced a vast amount of high-throughput data in which the number of predictors far exceeds the sample size. Penalized variable selection has emerged as a powerful and efficient dimension reduction tool. However, control of false discoveries (i.e. inclusion of irrelevant variables) for penalized high-dimensional variable selection presents serious challenges. To effectively control the fraction of false discoveries for penalized variable selections, we propose a false discovery controlling procedure. The proposed method is general and flexible, and can work with a broad class of variable selection algorithms, not only for linear regressions, but also for generalized linear models and survival analysis.

False discovery rate control for high dimensional networks of quantile associations conditioning on covariates

Journal of the Royal Statistical Society Series B (Statistical Methodology) ◽

10.1111/rssb.12288 ◽

2018 ◽

Vol 80 (5) ◽

pp. 1015-1034 ◽

Cited By ~ 2

Author(s):

Jichun Xie ◽

Ruosha Li

Keyword(s):

False Discovery Rate ◽

Rate Control ◽

High Dimensional ◽

False Discovery Rate Control ◽

False Discovery

Joint testing and false discovery rate control in high-dimensional multivariate regression

Biometrika ◽

10.1093/biomet/asx085 ◽

2018 ◽

Vol 105 (2) ◽

pp. 249-269 ◽

Cited By ~ 1

Author(s):

Yin Xia ◽

T Tony Cai ◽

Hongzhe Li

Keyword(s):

False Discovery Rate ◽

Rate Control ◽

Multivariate Regression ◽

High Dimensional ◽

False Discovery Rate Control ◽

False Discovery

Hypothesis testing for high-dimensional multivariate regression with false discovery rate control

Communications in Statistics - Theory and Methods ◽

10.1080/03610926.2021.1873378 ◽

2021 ◽

pp. 1-20

Author(s):

Yunlong Zhu

Keyword(s):

Hypothesis Testing ◽

False Discovery Rate ◽

Rate Control ◽

Multivariate Regression ◽

High Dimensional ◽

False Discovery Rate Control ◽

False Discovery

A practical guide to methods controlling false discoveries in computational biology

10.1101/458786 ◽

2018 ◽

Cited By ~ 1

Author(s):

Keegan Korthauer ◽

Patrick K Kimes ◽

Claire Duvallet ◽

Alejandro Reyes ◽

Ayshwarya Subramanian ◽

...

Keyword(s):

Computational Biology ◽

Rate Control ◽

Ease Of Use ◽

Simulation Studies ◽

Complementary Information ◽

Practical Guide ◽

False Discovery ◽

Modern Methods ◽

Error Rate Control ◽

False Discoveries

Abstract: Background: In high-throughput studies, hundreds to millions of hypotheses are typically tested. Statistical methods that control the false discovery rate (FDR) have emerged as popular and powerful tools for error rate control. While classic FDR methods use only p-values as input, more modern FDR methods have been shown to increase power by incorporating complementary information as “informative covariates” to prioritize, weight, and group hypotheses. However, there is currently no consensus on how the modern methods compare to one another. We investigated the accuracy, applicability, and ease of use of two classic and six modern FDR-controlling methods by performing a systematic benchmark comparison using simulation studies as well as six case studies in computational biology. Results: Methods that incorporate informative covariates were modestly more powerful than classic approaches, and did not underperform classic approaches, even when the covariate was completely uninformative. The majority of methods were successful at controlling the FDR, with the exception of two modern methods under certain settings. Furthermore, we found the improvement of the modern FDR methods over the classic methods increased with the informativeness of the covariate, total number of hypothesis tests, and proportion of truly non-null hypotheses. Conclusions: Modern FDR methods that use an informative covariate provide advantages over classic FDR-controlling procedures, with the relative gain dependent on the application and informativeness of available covariates. We present our findings as a practical guide and provide recommendations to aid researchers in their choice of methods to correct for false discoveries.
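
For readers unfamiliar with FDR methods that take only p-values as input, the following is a minimal sketch of the Benjamini-Hochberg step-up procedure, one well-known method of that kind. The abstract above does not name the methods it benchmarks, so this is an illustration rather than a reproduction of that study; the function name `benjamini_hochberg` and the example p-values are made up for this sketch.

```python
import numpy as np


def benjamini_hochberg(p_values, q=0.1):
    """Return a boolean mask of rejected hypotheses at target FDR level q.

    Illustrative sketch of the Benjamini-Hochberg step-up rule; the
    function name and default level are assumptions of this example.
    """
    p = np.asarray(p_values, dtype=float)
    m = p.size
    reject = np.zeros(m, dtype=bool)
    if m == 0:
        return reject

    order = np.argsort(p)                      # indices of p-values, ascending
    thresholds = q * np.arange(1, m + 1) / m   # k * q / m for k = 1..m
    below = p[order] <= thresholds

    if below.any():
        # Step-up: find the largest k with p_(k) <= k * q / m and
        # reject all hypotheses with the k smallest p-values.
        k = np.max(np.nonzero(below)[0])
        reject[order[: k + 1]] = True
    return reject


if __name__ == "__main__":
    example_p = [0.01, 0.04, 0.03, 0.9]        # hypothetical p-values
    print(benjamini_hochberg(example_p, q=0.1))
```

The covariate-aware methods described in the abstract would need an additional input per hypothesis, which this sketch deliberately leaves out.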

Constructing Long Short-Term Memory Networks to Predict Ulcerative Colitis Progression from Longitudinal Gut Microbiome Profiles

University of Toronto Journal of Public Health ◽

10.33137/utjph.v2i2.36763 ◽

2021 ◽

Vol 2 (2) ◽

Author(s):

Xu Li ◽

Pingzhao Hu

Keyword(s):

Ulcerative Colitis ◽

Gut Microbiome ◽

Short Term Memory ◽

High Dimensional Data ◽

High Dimensional ◽

Short Term ◽

Term Memory ◽

Long Short Term Memory ◽

Lstm Network ◽

Microbiome Data

Introduction & Objective: Ulcerative colitis (UC) is an intestinal disorder with an erratic progression, in which patients experience unpredictable remissions and varying severity. The lack of a prognosis for UC progression can lead to irrational treatments that adversely affect the patients’ quality of life. Existing studies have reported a connection between gut microbiomes and UC progression. We aim to construct Long Short-Term Memory (LSTM) networks to predict UC progression (remission & severity) from longitudinal gut microbiome data. Methods: Using one-step and two-step modelling strategies, we develop a standard LSTM network, an encoder-decoder LSTM network, a convolutional LSTM network, and several benchmarking classifiers such as random forests. For high-dimensional data, we also implement an auto-encoder to select variables in addition to baseline procedures like principal component analysis. We train each model using longitudinal microbiome data, and validate them via a 10-round set splitting approach. Results: Each proposed model shows the potential to predict UC progression, but none reaches an optimal level for medical use. The encoder-decoder LSTM demonstrates superiority over the other classifiers, while the auto-encoder outperforms the baseline variable selectors. Conclusion: We support the capacity of Long Short-Term Memory (LSTM) networks to predict UC progression from longitudinal microbiome data, and verify the strength of auto-encoder networks in selecting features from high-dimensional data.

Semi-penalized inference with direct false discovery rate control for high-dimensional AFT model

2017 IEEE 2nd International Conference on Big Data Analysis (ICBDA) ◽

10.1109/icbda.2017.8078704 ◽

2017 ◽

Author(s):

Chi Ma

Keyword(s):

False Discovery Rate ◽

Rate Control ◽

High Dimensional ◽

False Discovery Rate Control ◽

Aft Model ◽

False Discovery

Testing and support recovery of multiple high-dimensional covariance matrices with false discovery rate control

Test ◽

10.1007/s11749-017-0533-7 ◽

2017 ◽

Vol 26 (4) ◽

pp. 782-801 ◽

Cited By ~ 2

Author(s):

Yin Xia

Keyword(s):

False Discovery Rate ◽

Rate Control ◽

Covariance Matrices ◽

High Dimensional ◽

False Discovery Rate Control ◽

False Discovery ◽

Support Recovery

Highly reproducible 16S sequencing facilitates measurement of host genetic influences on the stickleback gut microbiome

10.1101/497792 ◽

2018 ◽

Cited By ~ 1

Author(s):

Clayton M. Small ◽

Mark Currey ◽

Emily A. Beck ◽

Susan Bassham ◽

William A. Cresko

Keyword(s):

Microbial Diversity ◽

High Throughput ◽

Gut Microbiome ◽

Dna Isolation ◽

Minor Effect ◽

Tissue Preparation ◽

Host Genotype ◽

16S Sequencing ◽

Host Microbe Interactions ◽

Microbiome Data

Abstract: Multicellular organisms interact with resident microbes in important ways, and a better understanding of host-microbe interactions is aided by tools such as high-throughput 16S sequencing. However, rigorous evaluation of the veracity of these tools in a different context from which they were developed has often lagged behind. Our goal was to perform one such critical test by examining how variation in tissue preparation and DNA isolation could affect inferences about gut microbiome variation between two genetically divergent lines of threespine stickleback fish maintained in the same lab environment. Using careful experimental design and intensive sampling of individuals, we addressed technical and biological sources of variation in 16S-based estimates of microbial diversity. After employing a two-tiered bead beating approach consisting of tissue homogenization followed by microbial lysis in subsamples, we found an extremely minor effect of DNA isolation protocol relative to among-host microbial diversity differences. Individual abundance estimates for rare OTUs, however, showed much lower reproducibility. We found that the stickleback gut microbiome was highly variable, even among siblings housed together, but that an effect of host genotype (stickleback lineage) was detectable for some microbial taxa. Our findings demonstrate the importance of appropriately quantifying biological and technical variance components when attempting to understand major influences on high-throughput microbiome data.

Compositional knockoff filter for high-dimensional regression analysis of microbiome data

10.1101/851337 ◽

2019 ◽

Author(s):

Arun Srinivasan ◽

Lingzhou Xue ◽

Xiang Zhan

Keyword(s):

Regression Analysis ◽

Compositional Data ◽

Asymptotic Properties ◽

High Dimensional ◽

Finite Sample ◽

Gene Expressions ◽

False Discovery ◽

Step Procedure ◽

Finite Sample Properties ◽

Microbiome Data

Summary: A critical task in microbiome data analysis is to explore the association between a scalar response of interest and a large number of microbial taxa that are summarized as compositional data at different taxonomic levels. Motivated by fine-mapping of the microbiome, we propose a two-step compositional knockoff filter (CKF) to provide the effective finite-sample false discovery rate (FDR) control in high-dimensional linear log-contrast regression analysis of microbiome compositional data. In the first step, we employ the compositional screening procedure to remove insignificant microbial taxa while retaining the essential sum-to-zero constraint. In the second step, we extend the knockoff filter to identify the significant microbial taxa in the sparse regression model for compositional data. Thereby, a subset of the microbes is selected from the high-dimensional microbial taxa as related to the response using a pre-specified FDR threshold. We study the asymptotic properties of the proposed two-step procedure, including both sure screening and effective false discovery control. We demonstrate the finite-sample properties in simulation studies, which show the gain in the empirical power while controlling the nominal FDR. The potential usefulness of the proposed method is also illustrated with application to an inflammatory bowel disease dataset to identify microbial taxa that influence host gene expressions.
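
To make the final selection step concrete, the following is a minimal sketch of how a generic knockoff filter turns feature statistics into a selected set at a pre-specified FDR level. It uses the standard "knockoff+" threshold from the general knockoff literature, which may differ from the exact rule used by CKF. It assumes the statistics `W` have already been computed (large positive values favour the original variable over its knockoff copy); the compositional screening and knockoff construction are not shown. The function name and the example values are hypothetical.

```python
import numpy as np


def knockoff_plus_select(W, q=0.1):
    """Return indices selected by the knockoff+ threshold at level q.

    Sketch only: W holds one precomputed knockoff statistic per feature.
    The threshold is the smallest t among the nonzero |W_j| such that
    (1 + #{j: W_j <= -t}) / max(1, #{j: W_j >= t}) <= q.
    """
    W = np.asarray(W, dtype=float)
    candidates = np.unique(np.abs(W[W != 0]))  # sorted ascending by np.unique

    for t in candidates:
        n_negative = np.sum(W <= -t)           # proxy for false discoveries
        n_positive = np.sum(W >= t)            # features that would be selected
        fdp_estimate = (1 + n_negative) / max(1, n_positive)
        if fdp_estimate <= q:
            return np.flatnonzero(W >= t)

    # No threshold meets the target level: select nothing.
    return np.array([], dtype=int)


if __name__ == "__main__":
    example_W = [5, 4, 3, 2.5, 2, -1, 1.5, 0.5, -0.2, 6]  # hypothetical statistics
    print(knockoff_plus_select(example_W, q=0.2))
```

The `1 +` in the numerator makes the estimate conservative, so with few features or a small `q` the sketch can return an empty set.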