A direct approach to estimating false discovery rates conditional on covariates

PeerJ ◽  
2018 ◽  
Vol 6 ◽  
pp. e6035 ◽  
Author(s):  
Simina M. Boca ◽  
Jeffrey T. Leek

Modern scientific studies from many diverse areas of research abound with multiple hypothesis testing concerns. The false discovery rate (FDR) is one of the most commonly used approaches for measuring and controlling error rates when performing multiple tests. Adaptive FDRs rely on an estimate of the proportion of null hypotheses among all the hypotheses being tested. This proportion is typically estimated once for each collection of hypotheses. Here, we propose a regression framework to estimate the proportion of null hypotheses conditional on observed covariates. This may then be used as a multiplication factor with the Benjamini–Hochberg adjusted p-values, leading to a plug-in FDR estimator. We apply our method to a genome-wide association meta-analysis for body mass index. In our framework, we are able to use the sample sizes for the individual genomic loci and the minor allele frequencies as covariates. We further evaluate our approach via a number of simulation scenarios. We provide an implementation of this novel method for estimating the proportion of null hypotheses in a regression framework as part of the Bioconductor package swfdr.
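A minimal sketch of the plug-in idea, assuming a logistic regression of the thresholding indicator 1{p > lambda} on the covariates with an illustrative lambda = 0.8; this is a sketch of the general approach, not the swfdr implementation:

```python
import numpy as np
from statsmodels.api import GLM, families, add_constant

def plug_in_fdr(pvals, covariates, lam=0.8):
    """Sketch of a covariate-conditional plug-in FDR estimate.

    Estimates pi0(x) by regressing the indicator 1{p > lambda}, rescaled by
    1/(1 - lambda), on the covariates, then multiplies the Benjamini-Hochberg
    adjusted p-values by the fitted pi0(x).
    """
    pvals = np.asarray(pvals, dtype=float)
    X = add_constant(np.asarray(covariates, dtype=float))

    # Indicator that a p-value exceeds lambda; under the null, P(p > lambda) = 1 - lambda.
    y = (pvals > lam).astype(float)
    fit = GLM(y, X, family=families.Binomial()).fit()
    pi0_hat = np.clip(fit.fittedvalues / (1.0 - lam), 0.0, 1.0)

    # Standard Benjamini-Hochberg adjusted p-values.
    m = len(pvals)
    order = np.argsort(pvals)
    ranked = pvals[order] * m / np.arange(1, m + 1)
    bh = np.minimum.accumulate(ranked[::-1])[::-1]
    bh_adjusted = np.empty(m)
    bh_adjusted[order] = np.minimum(bh, 1.0)

    # Plug-in estimator: covariate-specific null proportion times the BH value.
    return np.minimum(pi0_hat * bh_adjusted, 1.0)
```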

2015 ◽  
Author(s):  
Simina M. Boca ◽  
Jeffrey T. Leek

Abstract: Modern scientific studies from many diverse areas of research abound with multiple hypothesis testing concerns. The false discovery rate is one of the most commonly used error rates for measuring and controlling rates of false discoveries when performing multiple tests. Adaptive false discovery rates rely on an estimate of the proportion of null hypotheses among all the hypotheses being tested. This proportion is typically estimated once for each collection of hypotheses. Here we propose a regression framework to estimate the proportion of null hypotheses conditional on observed covariates. This may then be used as a multiplication factor with the Benjamini-Hochberg adjusted p-values, leading to a plug-in false discovery rate estimator. Our case study concerns a genome-wide association meta-analysis which considers associations with body mass index. In our framework, we are able to use the sample sizes for the individual genomic loci and the minor allele frequencies as covariates. We further evaluate our approach via a number of simulation scenarios.


2021 ◽  
Vol 2 (2) ◽  
pp. p1
Author(s):  
Kirk Davis ◽  
Rodney Maiden

Although the limitations of null hypothesis significance testing (NHST) are well documented in the psychology literature, the accuracy paradox, which concisely states an important limitation of published research, is never mentioned. The accuracy paradox arises when a test with higher accuracy does a poorer job of correctly classifying a particular outcome than a test with lower accuracy, which suggests that accuracy is not always the best metric of a test's usefulness. Since accuracy is a function of type I and II error rates, it can be misleading to interpret a study's results as accurate simply because these errors are minimized. Once a decision has been made regarding statistical significance, type I and II error rates are not directly informative to the reader. Instead, false discovery and false omission rates are more informative when evaluating the results of a study. Given the prevalence of publication bias and small effect sizes in the literature, the possibility of a false discovery is especially important to consider. When false discovery rates are estimated, it is easy to understand why many studies in psychology cannot be replicated.
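A short worked example (the numbers are illustrative, not from the article) shows how a test can be highly accurate while a large share of its discoveries are false once the base rate of true effects is low:

```python
def discovery_rates(base_rate, alpha, power):
    """Return accuracy, false discovery rate, and false omission rate
    for a significance test applied to a population of hypotheses."""
    tp = base_rate * power            # true effects correctly detected
    fn = base_rate * (1 - power)      # true effects missed
    fp = (1 - base_rate) * alpha      # null effects wrongly declared significant
    tn = (1 - base_rate) * (1 - alpha)
    accuracy = tp + tn
    fdr = fp / (fp + tp)              # share of significant results that are false
    false_omission = fn / (fn + tn)   # share of non-significant results that are false
    return accuracy, fdr, false_omission

# With a 10% base rate of true effects, alpha = 0.05 and power = 0.80:
print(discovery_rates(0.10, 0.05, 0.80))
# -> (0.935, ~0.36, ~0.023): 93.5% accurate, yet about 36% of discoveries are false.
```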


2018 ◽  
Author(s):  
Tuomas Puoliväli ◽  
Satu Palva ◽  
J. Matias Palva

Background: Reproducibility of research findings has recently been questioned in many fields of science, including psychology and the neurosciences. One factor influencing reproducibility is the simultaneous testing of multiple hypotheses, which increases the number of false positive findings unless the p-values are carefully corrected. While this multiple testing problem is well known and has been studied for decades, it continues to be both a theoretical and practical problem.
New Method: Here we assess the reproducibility of research involving multiple testing corrected for the family-wise error rate (FWER) or the false discovery rate (FDR) by techniques based on random field theory (RFT), cluster-mass based permutation testing, adaptive FDR, and several classical methods. We also investigate the performance of these methods under two different models.
Results: We found that permutation testing is the most powerful of the considered approaches to multiple testing, and that grouping hypotheses based on prior knowledge can improve power. We also found that emphasizing primary and follow-up studies equally produced the most reproducible outcomes.
Comparison with Existing Methods: We extend the use of the two-group and separate-classes models for analyzing reproducibility and provide new open-source software, "MultiPy", for multiple hypothesis testing.
Conclusions: Our results suggest that performing strict corrections for multiple testing is not sufficient to improve the reproducibility of neuroimaging experiments. The methods are freely available as the Python toolkit "MultiPy"; we intend this study to help improve statistical data analysis practices and to assist in conducting power and reproducibility analyses for new experiments.
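The toolkit itself is not shown here; the following self-contained sketch illustrates the kind of comparison the abstract describes, contrasting FWER control (Bonferroni) with FDR control (Benjamini-Hochberg) under a two-group model with an assumed 10% signal fraction. This is a generic illustration, not MultiPy's API:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Two-group model: a fraction pi1 of tests carry a true effect (shifted mean).
m, pi1, effect = 10_000, 0.1, 3.0
is_signal = rng.random(m) < pi1
z = rng.normal(loc=np.where(is_signal, effect, 0.0))
pvals = stats.norm.sf(z)                       # one-sided p-values

# Family-wise error rate control: Bonferroni.
bonf_reject = pvals < 0.05 / m

# False discovery rate control: Benjamini-Hochberg step-up.
order = np.argsort(pvals)
below = pvals[order] <= 0.05 * np.arange(1, m + 1) / m
k = below.nonzero()[0].max() + 1 if below.any() else 0
bh_reject = np.zeros(m, dtype=bool)
bh_reject[order[:k]] = True

for name, rej in [("Bonferroni", bonf_reject), ("BH", bh_reject)]:
    fdp = (rej & ~is_signal).sum() / max(rej.sum(), 1)
    power = (rej & is_signal).sum() / is_signal.sum()
    print(f"{name}: rejections={rej.sum()}, FDP={fdp:.3f}, power={power:.3f}")
```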


2013 ◽  
Vol 36 (1) ◽  
pp. 33-36
Author(s):  
A. Martínez-Abraín

Hypothesis testing is commonly used in ecology and conservation biology as a tool for testing properties of statistical-population parameters against null hypotheses. The tool was originally devised by laboratory biologists and statisticians to handle experimental data for which the magnitude of biologically relevant effects was known beforehand. That history often makes it a poor fit in ecology, because field ecologists usually work with observational data and seldom know the magnitude of biologically relevant effects. This prevents us from using hypothesis testing in the intended way, namely posing informed null hypotheses and running a priori power tests to calculate the necessary sample sizes, and instead forces us to test null hypotheses of exactly zero effect, which are of little use to field ecologists because we know beforehand that zero effects do not exist in nature. As a result, ecologists seek only 'positive' (statistically significant) results, since negative results merely reflect a lack of power to detect small (usually biologically irrelevant) effects. Even so, 'negative' results should be published, because they matter in the context of meta-analysis (which accounts for uncertainty by weighting individual studies by sample size) and so support proper decision-making. The use of multiple hypothesis testing and Bayesian statistics puts an end to this black-or-white dichotomy and moves us towards a more realistic continuum of grey tones.
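For readers outside ecology, the a priori power test the author refers to looks like the sketch below, which asks how many samples per group are needed to detect a prespecified, biologically relevant effect; the Cohen's d value is an assumed, illustrative choice:

```python
from statsmodels.stats.power import TTestIndPower

# A priori power analysis: how many samples per group are needed to detect
# a prespecified effect of Cohen's d = 0.5 (illustrative value) with 80%
# power at alpha = 0.05, using a two-sided two-sample t-test?
analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80,
                                   alternative="two-sided")
print(f"Required sample size per group: {n_per_group:.0f}")  # about 64
```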


2018 ◽  
Author(s):  
Martin J. Zhang ◽  
Fei Xia ◽  
James Zou

Multiple hypothesis testing is an essential component of modern data science. Its goal is to maximize the number of discoveries while controlling the fraction of false discoveries. In many settings, additional information/covariates for each hypothesis are available beyond the p-value. For example, in eQTL studies, each hypothesis tests the correlation between a variant and the expression of a gene. We also have additional covariates such as the location, conservation, and chromatin status of the variant, which could inform how likely the association is to be due to noise. However, popular multiple hypothesis testing approaches, such as the Benjamini-Hochberg procedure (BH) and independent hypothesis weighting (IHW), either ignore these covariates or assume the covariate to be univariate. We introduce AdaFDR, a fast and flexible method that adaptively learns the optimal p-value threshold from covariates to significantly improve detection power. On eQTL analysis of the GTEx data, AdaFDR discovers 32% and 27% more associations than BH and IHW, respectively, at the same false discovery rate. We prove that AdaFDR controls the false discovery proportion, and show in extensive experiments that it makes substantially more discoveries while controlling FDR. AdaFDR is computationally efficient, can process more than 100 million hypotheses within an hour, and supports multi-dimensional covariates with both numeric and categorical values. It also provides exploratory plots that help the user interpret how each covariate affects the significance of hypotheses, making it broadly useful across many applications.
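AdaFDR's learning procedure is not reproduced here; as a toy illustration of why covariate information can raise power, the sketch below applies a weighted Benjamini-Hochberg rule in which hypotheses in a "promising" covariate bin receive larger weights. The weights, data, and bin choice are ad hoc illustrative assumptions, not AdaFDR's algorithm:

```python
import numpy as np

def weighted_bh(pvals, weights, alpha=0.05):
    """Weighted Benjamini-Hochberg: reject based on p_i / w_i, with the
    weights normalized to average one so the alpha budget is preserved."""
    pvals, weights = np.asarray(pvals, float), np.asarray(weights, float)
    m = len(pvals)
    w = weights * m / weights.sum()            # normalize to mean 1
    q = pvals / np.maximum(w, 1e-12)           # weighted p-values
    order = np.argsort(q)
    below = q[order] <= alpha * np.arange(1, m + 1) / m
    k = below.nonzero()[0].max() + 1 if below.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True
    return reject

# Toy data: 100 signal-like p-values in a favorable covariate bin, 900 nulls.
rng = np.random.default_rng(1)
pvals = np.concatenate([rng.beta(0.2, 5.0, 100), rng.uniform(size=900)])
covariate = np.concatenate([np.ones(100), np.zeros(900)])
weights = np.where(covariate == 1, 3.0, 0.5)   # ad hoc, illustrative weights
print(weighted_bh(pvals, weights).sum(), "rejections")
```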


Author(s):  
Amir Hassan Ghaseminejad Tafreshi

This paper identifies a criterion for choosing the largest set of rejected hypotheses in high-dimensional data analysis, where multiple hypothesis testing is used in exploratory research to identify significant associations among many variables. The method neither requires predetermined thresholds for the level of significance nor uses presumed thresholds for the false discovery rate. The upper limit on the number of rejected hypotheses is determined by finding the maximum difference between the expected numbers of true and false hypotheses among all possible sets of rejected hypotheses. Methods for choosing a reasonable number of rejected hypotheses, and an application to non-parametric analysis of ordinal survey data, are presented.
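The criterion is stated only verbally in the abstract; the sketch below records one possible reading of it, in which rejecting the k smallest p-values is assumed to incur about m * p_(k) expected false rejections (an all-null approximation) and k is chosen to maximize the difference between expected true and expected false rejections. Both the interpretation and the approximation are assumptions, not the author's stated formula:

```python
import numpy as np

def max_difference_cutoff(pvals):
    """One possible reading of the criterion: for each k, approximate the
    expected number of false rejections among the k smallest p-values by
    m * p_(k), take the rest as expected true rejections, and pick the k
    that maximizes the difference between the two."""
    p = np.sort(np.asarray(pvals, dtype=float))
    m = len(p)
    k = np.arange(1, m + 1)
    expected_false = m * p                 # all-null approximation
    expected_true = k - expected_false     # remaining rejections
    best = int(np.argmax(expected_true - expected_false)) + 1
    return best, p[best - 1]               # number rejected and the p-value cutoff
```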


Biometrika ◽  
2021 ◽  
Author(s):  
J H Loper ◽ 
L Lei ◽  
W Fithian ◽  
W Tansey

Summary: We consider the problem of multiple hypothesis testing when there is a logical nested structure to the hypotheses. When one hypothesis is nested inside another, the outer hypothesis must be false if the inner hypothesis is false. We model the nested structure as a directed acyclic graph, including chain and tree graphs as special cases. Each node in the graph is a hypothesis, and rejecting a node requires also rejecting all of its ancestors. We propose a general framework for adjusting node-level test statistics using the known logical constraints. Within this framework, we study a smoothing procedure that combines each node with all of its descendants to form a more powerful statistic. We prove that a broad class of smoothing strategies can be used with existing selection procedures to control the familywise error rate, false discovery exceedance rate, or false discovery rate, so long as the original test statistics are independent under the null. When the null statistics are not independent but are derived from positively correlated normal observations, we prove control for all three error rates when the smoothing method is arithmetic averaging of the observations. Simulations and an application to a real biology dataset demonstrate that smoothing leads to substantial power gains.
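A minimal sketch of the arithmetic-averaging smoother on a DAG; the child-map representation and the toy chain example are assumptions for illustration, and the selection step that enforces the ancestor constraints and controls the error rates is omitted:

```python
from functools import lru_cache

def smooth_by_descendant_averaging(children, stats):
    """For each node in a DAG, replace its statistic by the arithmetic mean
    of its own observation and those of all of its descendants.
    `children` maps node -> list of child nodes; `stats` maps node -> value."""
    @lru_cache(maxsize=None)
    def descendants(node):
        # Nodes must be hashable; collect all descendants recursively.
        out = set()
        for child in children.get(node, ()):
            out.add(child)
            out |= descendants(child)
        return frozenset(out)

    return {node: (stats[node] + sum(stats[d] for d in descendants(node)))
                  / (1 + len(descendants(node)))
            for node in stats}

# Example: a chain A -> B -> C, where rejecting C requires rejecting B and A.
children = {"A": ["B"], "B": ["C"], "C": []}
stats = {"A": 0.5, "B": 2.0, "C": 3.0}
print(smooth_by_descendant_averaging(children, stats))
# A is averaged with B and C, B with C, and C stays unchanged.
```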


2020 ◽  
Author(s):  
Yubo Wang ◽  
Liguan Li ◽  
Yu Xia ◽  
Feng Ju ◽  
Tong Zhang

Abstract: Neither the abundance of exo/endoglucanase GH modules nor taxonomic affiliation is informative enough to infer whether a genome belongs to a potentially cellulolytic microbe. By interpreting the complete genomes of 2,642 microbial strains whose phenotypes have been well documented, we aim to reveal a more reliable genotype-phenotype correlation for the specific functional niche of cellulose hydrolysis. By incorporating automatic recognition of potential synergy machineries into the annotation approach, a more reliable prediction of a microbe's cellulolytic competency can be achieved. The potential cellulose-hydrolyzing microbes could be categorized into five groups according to the varying synergy machineries among the annotated carbohydrate-active modules/genes. Results of the meta-analysis on the 2,642 genomes revealed that some cellulosome gene clusters lacked the surface layer homology (SLH) module, and microbial strains annotated with such cellulosome gene clusters were not necessarily cellulolytic. We hypothesized that cellulosome-independent genes harboring both the SLH module and a cellulose-binding carbohydrate-binding module (CBM) are likely an alternative gene apparatus initiating the formation of cellulose-enzyme-microbe (CEM) complexes; their role is especially important for cellulolytic anaerobes without cellulosome gene clusters.
Importance: In genome-centric prediction of a microbe's cellulolytic activity, recognizing the synergy machineries, which include but are not limited to cellulosome gene clusters, is as important as annotating the individual carbohydrate-active modules or genes. This is the first pipeline developed for automatic recognition of synergy among the annotated carbohydrate-active units. With promising resolution and reliability, this pipeline should be a good addition to the bioinformatic tools for genome-centric interpretation of the specific functional niche of cellulose hydrolysis.
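The pipeline itself is not reproduced here; the sketch below is a heavily simplified, rule-based stand-in showing what automatic recognition of synergy machineries from annotated modules might look like. The labels and rules are illustrative placeholders inferred from the abstract, not the article's five categories:

```python
def classify_cellulolytic_machinery(annotations):
    """Toy rule-based recognition of synergy machineries for one genome.
    `annotations` is a set of module/gene labels; the labels and rules are
    simplified placeholders, not the pipeline's actual categories."""
    has_cellulosome = "cellulosome_gene_cluster" in annotations
    has_slh = "SLH" in annotations
    has_cbm = "cellulose_binding_CBM" in annotations
    has_gh = "exo_endoglucanase_GH" in annotations

    if has_cellulosome and has_slh:
        return "cellulosome anchored via SLH (likely CEM-forming)"
    if has_cellulosome:
        return "cellulosome gene cluster lacking SLH (cellulolytic activity uncertain)"
    if has_gh and has_slh and has_cbm:
        return "cellulosome-independent CEM candidate (SLH plus cellulose-binding CBM)"
    if has_gh and has_cbm:
        return "free enzymes with cellulose-binding modules"
    return "no recognized cellulose-hydrolysis synergy machinery"

print(classify_cellulolytic_machinery({"exo_endoglucanase_GH", "SLH",
                                       "cellulose_binding_CBM"}))
```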

