Aggregating Knockoffs for False Discovery Rate Control with an Application to Gut Microbiome Data

Entropy ◽  
2021 ◽  
Vol 23 (2) ◽  
pp. 230
Author(s):  
Fang Xie ◽  
Johannes Lederer

Recent discoveries suggest that our gut microbiome plays an important role in our health and wellbeing. However, gut microbiome data are intricate; for example, the microbial diversity in the gut makes the data high-dimensional. While there are dedicated high-dimensional methods, such as the lasso estimator, they always come with the risk of false discoveries. Knockoffs are a recent approach to controlling the number of false discoveries. In this paper, we show that knockoffs can be aggregated to increase power while retaining sharp control over the false discoveries. We support our method both in theory and in simulations, and we show that it can lead to new discoveries on microbiome data from the American Gut Project. In particular, our results indicate that several phyla that have been overlooked so far are associated with obesity.
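To make the selection step behind knockoffs concrete, the following is a minimal sketch of the standard knockoff+ threshold for a single run, not the aggregation scheme proposed in the paper (which the abstract does not spell out). The function name `knockoff_plus_select`, the input `W` (one knockoff statistic per variable, where large positive values favor the original variable over its knockoff copy), and the example values are assumptions made for illustration.

```python
import numpy as np

def knockoff_plus_select(W, q=0.1):
    """Select variables with the knockoff+ threshold at target FDR level q.

    W is a hypothetical vector of knockoff statistics, one per variable.
    """
    W = np.asarray(W, dtype=float)
    # Candidate thresholds are the distinct nonzero magnitudes of W.
    candidates = np.unique(np.abs(W[W != 0]))  # np.unique returns sorted values
    for t in candidates:
        # Negative statistics below -t estimate the number of false positives
        # among the statistics above +t; the "+1" is the knockoff+ correction.
        fdp_hat = (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if fdp_hat <= q:
            # The smallest threshold meeting the bound gives the selection.
            return np.flatnonzero(W >= t)
    # No threshold satisfies the bound: select nothing.
    return np.array([], dtype=int)

# Illustrative statistics (made up for this sketch).
W_example = [5, 4, 3, 2.5, 2, -0.5, 1, 0.2, -0.1, 6]
selected = knockoff_plus_select(W_example, q=0.2)
```

With few variables, the "+1" in the numerator makes small target levels hard to reach, which is one motivation for trying to gain power by combining several knockoff runs.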

Author(s):  
Kevin He ◽  
Xiang Zhou ◽  
Hui Jiang ◽  
Xiaoquan Wen ◽  
Yi Li

Abstract: Modern biotechnologies have produced vast amounts of high-throughput data in which the number of predictors far exceeds the sample size. Penalized variable selection has emerged as a powerful and efficient dimension reduction tool. However, control of false discoveries (i.e. inclusion of irrelevant variables) in penalized high-dimensional variable selection presents serious challenges. To effectively control the fraction of false discoveries in penalized variable selection, we propose a false discovery controlling procedure. The proposed method is general and flexible, and can work with a broad class of variable selection algorithms, not only for linear regression, but also for generalized linear models and survival analysis.


2018 ◽  
Author(s):  
Keegan Korthauer ◽  
Patrick K Kimes ◽  
Claire Duvallet ◽  
Alejandro Reyes ◽  
Ayshwarya Subramanian ◽  
...  

Background: In high-throughput studies, hundreds to millions of hypotheses are typically tested. Statistical methods that control the false discovery rate (FDR) have emerged as popular and powerful tools for error rate control. While classic FDR methods use only p-values as input, more modern FDR methods have been shown to increase power by incorporating complementary information as “informative covariates” to prioritize, weight, and group hypotheses. However, there is currently no consensus on how the modern methods compare to one another. We investigated the accuracy, applicability, and ease of use of two classic and six modern FDR-controlling methods by performing a systematic benchmark comparison using simulation studies as well as six case studies in computational biology. Results: Methods that incorporate informative covariates were modestly more powerful than classic approaches, and did not underperform classic approaches, even when the covariate was completely uninformative. The majority of methods were successful at controlling the FDR, with the exception of two modern methods under certain settings. Furthermore, we found the improvement of the modern FDR methods over the classic methods increased with the informativeness of the covariate, total number of hypothesis tests, and proportion of truly non-null hypotheses. Conclusions: Modern FDR methods that use an informative covariate provide advantages over classic FDR-controlling procedures, with the relative gain dependent on the application and informativeness of available covariates. We present our findings as a practical guide and provide recommendations to aid researchers in their choice of methods to correct for false discoveries.
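As an illustration of what "classic FDR methods use only p-values as input" means in practice, here is a minimal sketch of the Benjamini-Hochberg step-up procedure. The abstract does not name the classic methods it benchmarks, so treating Benjamini-Hochberg as representative is an assumption, as are the function name `benjamini_hochberg` and the example p-values.

```python
import numpy as np

def benjamini_hochberg(pvalues, q=0.05):
    """Return a boolean rejection mask at target FDR level q."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)            # indices that sort the p-values ascending
    ranked = p[order]
    # Compare the i-th smallest p-value with its step-up threshold q * i / m.
    below = ranked <= q * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        # Step-up rule: reject everything up to the LARGEST rank that passes,
        # even if some smaller ranks individually fail their own threshold.
        k = np.max(np.flatnonzero(below))
        reject[order[:k + 1]] = True
    return reject

# Illustrative p-values (made up for this sketch).
p_example = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
mask = benjamini_hochberg(p_example, q=0.05)
```

The covariate-aware methods described in the abstract differ in that they take a second input per hypothesis in addition to the p-value; this sketch has no such input.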


Author(s):  
Xu Li ◽  
Pingzhao Hu

Introduction & Objective: Ulcerative colitis (UC) is an intestinal disorder with an erratic progression in which patients experience unpredictable remissions and fluctuating severity. The lack of a prognosis for UC progression can lead to inappropriate treatments that adversely affect patients’ quality of life. Existing studies have reported a connection between the gut microbiome and UC progression. We aim to construct Long Short-Term Memory (LSTM) networks to predict UC progression (remission & severity) from longitudinal gut microbiome data. Methods: Using one-step and two-step modelling strategies, we develop a standard LSTM network, an encoder-decoder LSTM network, a convolutional LSTM network, and several benchmarking classifiers such as random forests. For high-dimensional data, we also implement an auto-encoder to select variables, in addition to baseline procedures like principal component analysis. We train each model using longitudinal microbiome data, and validate the models via a 10-round set splitting approach. Results: Each proposed model shows the potential to predict UC progression, but none reaches a level suitable for clinical use. The encoder-decoder LSTM demonstrates superiority over the other classifiers, while the auto-encoder outperforms the baseline variable selectors. Conclusion: Our results support the capacity of LSTM networks to predict UC progression from longitudinal microbiome data, and confirm the strength of auto-encoder networks in selecting features from high-dimensional data.


2018 ◽  
Author(s):  
Clayton M. Small ◽  
Mark Currey ◽  
Emily A. Beck ◽  
Susan Bassham ◽  
William A. Cresko

Abstract: Multicellular organisms interact with resident microbes in important ways, and a better understanding of host-microbe interactions is aided by tools such as high-throughput 16S sequencing. However, rigorous evaluation of the veracity of these tools in contexts different from those in which they were developed has often lagged behind. Our goal was to perform one such critical test by examining how variation in tissue preparation and DNA isolation could affect inferences about gut microbiome variation between two genetically divergent lines of threespine stickleback fish maintained in the same lab environment. Using careful experimental design and intensive sampling of individuals, we addressed technical and biological sources of variation in 16S-based estimates of microbial diversity. After employing a two-tiered bead beating approach consisting of tissue homogenization followed by microbial lysis in subsamples, we found an extremely minor effect of DNA isolation protocol relative to among-host microbial diversity differences. Individual abundance estimates for rare OTUs, however, showed much lower reproducibility. We found that the stickleback gut microbiome was highly variable, even among siblings housed together, but that an effect of host genotype (stickleback lineage) was detectable for some microbial taxa. Our findings demonstrate the importance of appropriately quantifying biological and technical variance components when attempting to understand major influences on high-throughput microbiome data.


2019 ◽  
Author(s):  
Arun Srinivasan ◽  
Lingzhou Xue ◽  
Xiang Zhan

Summary: A critical task in microbiome data analysis is to explore the association between a scalar response of interest and a large number of microbial taxa that are summarized as compositional data at different taxonomic levels. Motivated by fine-mapping of the microbiome, we propose a two-step compositional knockoff filter (CKF) to provide effective finite-sample false discovery rate (FDR) control in high-dimensional linear log-contrast regression analysis of microbiome compositional data. In the first step, we employ a compositional screening procedure to remove insignificant microbial taxa while retaining the essential sum-to-zero constraint. In the second step, we extend the knockoff filter to identify the significant microbial taxa in the sparse regression model for compositional data. Thereby, a subset of the microbes is selected from the high-dimensional microbial taxa as related to the response using a pre-specified FDR threshold. We study the asymptotic properties of the proposed two-step procedure, including both sure screening and effective false discovery control. We demonstrate the finite-sample properties in simulation studies, which show a gain in empirical power while controlling the nominal FDR. The potential usefulness of the proposed method is also illustrated with an application to an inflammatory bowel disease dataset to identify microbial taxa that influence host gene expression.
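For readers unfamiliar with the terminology, the linear log-contrast model and its sum-to-zero constraint are commonly written as below. This is the standard textbook form, given here as a sketch; the notation (response y_i, compositional proportions x_{ij}, coefficients beta_j, noise epsilon_i, with n samples and p taxa) is assumed for illustration and may differ from the paper's.

```latex
% Linear log-contrast regression for compositional covariates
% (assumed notation: x_{ij} > 0 and \sum_{j=1}^{p} x_{ij} = 1 for each sample i)
\[
  y_i \;=\; \sum_{j=1}^{p} \beta_j \log x_{ij} \;+\; \varepsilon_i ,
  \qquad i = 1, \dots, n,
  \qquad \text{subject to} \quad \sum_{j=1}^{p} \beta_j = 0 .
\]
```

The constraint makes the fit invariant to rescaling a sample's composition: multiplying every x_{ij} in sample i by a constant c adds (log c) times the sum of the beta_j to the linear predictor, which is zero under the constraint.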

