A direct approach to estimating false discovery rates conditional on covariates

PeerJ ◽  
2018 ◽  
Vol 6 ◽  
pp. e6035 ◽  
Author(s):  
Simina M. Boca ◽  
Jeffrey T. Leek

Modern scientific studies from many diverse areas of research abound with multiple hypothesis testing concerns. The false discovery rate (FDR) is one of the most commonly used approaches for measuring and controlling error rates when performing multiple tests. Adaptive FDRs rely on an estimate of the proportion of null hypotheses among all the hypotheses being tested. This proportion is typically estimated once for each collection of hypotheses. Here, we propose a regression framework to estimate the proportion of null hypotheses conditional on observed covariates. This may then be used as a multiplication factor with the Benjamini–Hochberg adjusted p-values, leading to a plug-in FDR estimator. We apply our method to a genome-wide association meta-analysis for body mass index. In our framework, we are able to use the sample sizes for the individual genomic loci and the minor allele frequencies as covariates. We further evaluate our approach via a number of simulation scenarios. We provide an implementation of this novel method for estimating the proportion of null hypotheses in a regression framework as part of the Bioconductor package swfdr.
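A minimal sketch of the plug-in idea, assuming a logistic regression of the thresholding indicator 1{p > lambda} on the covariates with an illustrative lambda = 0.8; this is a sketch of the general approach, not the swfdr implementation:

```python
import numpy as np
from statsmodels.api import GLM, families, add_constant

def plug_in_fdr(pvals, covariates, lam=0.8):
    """Sketch of a covariate-conditional plug-in FDR estimate.

    Estimates pi0(x) by regressing the indicator 1{p > lambda}, rescaled by
    1/(1 - lambda), on the covariates, then multiplies the Benjamini-Hochberg
    adjusted p-values by the fitted pi0(x).
    """
    pvals = np.asarray(pvals, dtype=float)
    X = add_constant(np.asarray(covariates, dtype=float))

    # Indicator that a p-value exceeds lambda; under the null, P(p > lambda) = 1 - lambda.
    y = (pvals > lam).astype(float)
    fit = GLM(y, X, family=families.Binomial()).fit()
    pi0_hat = np.clip(fit.fittedvalues / (1.0 - lam), 0.0, 1.0)

    # Standard Benjamini-Hochberg adjusted p-values.
    m = len(pvals)
    order = np.argsort(pvals)
    ranked = pvals[order] * m / np.arange(1, m + 1)
    bh = np.minimum.accumulate(ranked[::-1])[::-1]
    bh_adjusted = np.empty(m)
    bh_adjusted[order] = np.minimum(bh, 1.0)

    # Plug-in estimator: covariate-specific null proportion times the BH value.
    return np.minimum(pi0_hat * bh_adjusted, 1.0)
```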

2015 ◽  
Author(s):  
Simina M. Boca ◽  
Jeffrey T. Leek

Abstract: Modern scientific studies from many diverse areas of research abound with multiple hypothesis testing concerns. The false discovery rate is one of the most commonly used error rates for measuring and controlling rates of false discoveries when performing multiple tests. Adaptive false discovery rates rely on an estimate of the proportion of null hypotheses among all the hypotheses being tested. This proportion is typically estimated once for each collection of hypotheses. Here we propose a regression framework to estimate the proportion of null hypotheses conditional on observed covariates. This may then be used as a multiplication factor with the Benjamini-Hochberg adjusted p-values, leading to a plug-in false discovery rate estimator. Our case study concerns a genome-wide association meta-analysis which considers associations with body mass index. In our framework, we are able to use the sample sizes for the individual genomic loci and the minor allele frequencies as covariates. We further evaluate our approach via a number of simulation scenarios.


2021 ◽  
Vol 2 (2) ◽  
pp. p1
Author(s):  
Kirk Davis ◽  
Rodney Maiden

Although the limitations of null hypothesis significance testing (NHST) are well documented in the psychology literature, the accuracy paradox, which concisely states an important limitation of published research, is never mentioned. The accuracy paradox arises when a test with higher accuracy does a poorer job of correctly classifying a particular outcome than a test with lower accuracy, which suggests that accuracy is not always the best metric of a test's usefulness. Since accuracy is a function of type I and II error rates, it can be misleading to interpret a study's results as accurate simply because these errors are minimized. Once a decision has been made regarding statistical significance, type I and II error rates are not directly informative to the reader. Instead, false discovery and false omission rates are more informative when evaluating the results of a study. Given the prevalence of publication bias and small effect sizes in the literature, the possibility of a false discovery is especially important to consider. When false discovery rates are estimated, it is easy to understand why many studies in psychology cannot be replicated.
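A short worked example (the numbers are illustrative, not from the article) shows how a test can be highly accurate while a large share of its discoveries are false once the base rate of true effects is low:

```python
def discovery_rates(base_rate, alpha, power):
    """Return accuracy, false discovery rate, and false omission rate
    for a significance test applied to a population of hypotheses."""
    tp = base_rate * power            # true effects correctly detected
    fn = base_rate * (1 - power)      # true effects missed
    fp = (1 - base_rate) * alpha      # null effects wrongly declared significant
    tn = (1 - base_rate) * (1 - alpha)
    accuracy = tp + tn
    fdr = fp / (fp + tp)              # share of significant results that are false
    false_omission = fn / (fn + tn)   # share of non-significant results that are false
    return accuracy, fdr, false_omission

# With a 10% base rate of true effects, alpha = 0.05 and power = 0.80:
print(discovery_rates(0.10, 0.05, 0.80))
# -> (0.935, ~0.36, ~0.023): 93.5% accurate, yet about 36% of discoveries are false.
```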


2018 ◽  
Author(s):  
Tuomas Puoliväli ◽  
Satu Palva ◽  
J. Matias Palva

Background: Reproducibility of research findings has recently been questioned in many fields of science, including psychology and the neurosciences. One factor influencing reproducibility is the simultaneous testing of multiple hypotheses, which increases the number of false positive findings unless the p-values are carefully corrected. While this multiple testing problem is well known and has been studied for decades, it continues to be both a theoretical and practical problem.
New Method: Here we assess the reproducibility of research involving multiple testing corrected for the family-wise error rate (FWER) or the false discovery rate (FDR) by techniques based on random field theory (RFT), cluster-mass based permutation testing, adaptive FDR, and several classical methods. We also investigate the performance of these methods under two different models.
Results: We found that permutation testing is the most powerful of the considered approaches to multiple testing, and that grouping hypotheses based on prior knowledge can improve power. We also found that emphasizing primary and follow-up studies equally produced the most reproducible outcomes.
Comparison with Existing Methods: We extend the use of the two-group and separate-classes models for analyzing reproducibility and provide new open-source software, "MultiPy", for multiple hypothesis testing.
Conclusions: Our results suggest that performing strict corrections for multiple testing is not sufficient to improve the reproducibility of neuroimaging experiments. The methods are freely available as the Python toolkit "MultiPy"; we intend this study to help improve statistical data analysis practices and to assist in conducting power and reproducibility analyses for new experiments.
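The toolkit itself is not shown here; the following self-contained sketch illustrates the kind of comparison the abstract describes, contrasting FWER control (Bonferroni) with FDR control (Benjamini-Hochberg) under a two-group model with an assumed 10% signal fraction. This is a generic illustration, not MultiPy's API:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Two-group model: a fraction pi1 of tests carry a true effect (shifted mean).
m, pi1, effect = 10_000, 0.1, 3.0
is_signal = rng.random(m) < pi1
z = rng.normal(loc=np.where(is_signal, effect, 0.0))
pvals = stats.norm.sf(z)                       # one-sided p-values

# Family-wise error rate control: Bonferroni.
bonf_reject = pvals < 0.05 / m

# False discovery rate control: Benjamini-Hochberg step-up.
order = np.argsort(pvals)
below = pvals[order] <= 0.05 * np.arange(1, m + 1) / m
k = below.nonzero()[0].max() + 1 if below.any() else 0
bh_reject = np.zeros(m, dtype=bool)
bh_reject[order[:k]] = True

for name, rej in [("Bonferroni", bonf_reject), ("BH", bh_reject)]:
    fdp = (rej & ~is_signal).sum() / max(rej.sum(), 1)
    power = (rej & is_signal).sum() / is_signal.sum()
    print(f"{name}: rejections={rej.sum()}, FDP={fdp:.3f}, power={power:.3f}")
```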


2013 ◽  
Vol 36 (1) ◽  
pp. 33-36
Author(s):  
A. Martínez-Abraín

Hypothesis testing is commonly used in ecology and conservation biology as a tool for testing properties of statistical-population parameters against null hypotheses. The tool was originally devised by laboratory biologists and statisticians to handle experimental data for which the magnitude of biologically relevant effects was known beforehand. That history often makes it a poor fit in ecology, because field ecologists usually work with observational data and seldom know the magnitude of biologically relevant effects. This prevents us from using hypothesis testing in the intended way, namely posing informed null hypotheses and running a priori power tests to calculate the necessary sample sizes, and instead forces us to test null hypotheses of exactly zero effect, which are of little use to field ecologists because we know beforehand that zero effects do not exist in nature. As a result, ecologists seek only 'positive' (statistically significant) results, since negative results merely reflect a lack of power to detect small (usually biologically irrelevant) effects. Even so, 'negative' results should be published, because they matter in the context of meta-analysis (which accounts for uncertainty by weighting individual studies by sample size) and so support proper decision-making. The use of multiple hypothesis testing and Bayesian statistics puts an end to this black-or-white dichotomy and moves us towards a more realistic continuum of grey tones.
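For readers outside ecology, the a priori power test the author refers to looks like the sketch below, which asks how many samples per group are needed to detect a prespecified, biologically relevant effect; the Cohen's d value is an assumed, illustrative choice:

```python
from statsmodels.stats.power import TTestIndPower

# A priori power analysis: how many samples per group are needed to detect
# a prespecified effect of Cohen's d = 0.5 (illustrative value) with 80%
# power at alpha = 0.05, using a two-sided two-sample t-test?
analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80,
                                   alternative="two-sided")
print(f"Required sample size per group: {n_per_group:.0f}")  # about 64
```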


2018 ◽  
Author(s):  
Martin J. Zhang ◽  
Fei Xia ◽  
James Zou

Multiple hypothesis testing is an essential component of modern data science. Its goal is to maximize the number of discoveries while controlling the fraction of false discoveries. In many settings, additional information/covariates for each hypothesis are available beyond the p-value. For example, in eQTL studies, each hypothesis tests the correlation between a variant and the expression of a gene. We also have additional covariates such as the location, conservation, and chromatin status of the variant, which could inform how likely the association is to be due to noise. However, popular multiple hypothesis testing approaches, such as the Benjamini-Hochberg procedure (BH) and independent hypothesis weighting (IHW), either ignore these covariates or assume the covariate to be univariate. We introduce AdaFDR, a fast and flexible method that adaptively learns the optimal p-value threshold from covariates to significantly improve detection power. On eQTL analysis of the GTEx data, AdaFDR discovers 32% and 27% more associations than BH and IHW, respectively, at the same false discovery rate. We prove that AdaFDR controls the false discovery proportion, and show in extensive experiments that it makes substantially more discoveries while controlling FDR. AdaFDR is computationally efficient, can process more than 100 million hypotheses within an hour, and supports multi-dimensional covariates with both numeric and categorical values. It also provides exploratory plots that help the user interpret how each covariate affects the significance of hypotheses, making it broadly useful across many applications.
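AdaFDR's learning procedure is not reproduced here; as a toy illustration of why covariate information can raise power, the sketch below applies a weighted Benjamini-Hochberg rule in which hypotheses in a "promising" covariate bin receive larger weights. The weights, data, and bin choice are ad hoc illustrative assumptions, not AdaFDR's algorithm:

```python
import numpy as np

def weighted_bh(pvals, weights, alpha=0.05):
    """Weighted Benjamini-Hochberg: reject based on p_i / w_i, with the
    weights normalized to average one so the alpha budget is preserved."""
    pvals, weights = np.asarray(pvals, float), np.asarray(weights, float)
    m = len(pvals)
    w = weights * m / weights.sum()            # normalize to mean 1
    q = pvals / np.maximum(w, 1e-12)           # weighted p-values
    order = np.argsort(q)
    below = q[order] <= alpha * np.arange(1, m + 1) / m
    k = below.nonzero()[0].max() + 1 if below.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True
    return reject

# Toy data: 100 signal-like p-values in a favorable covariate bin, 900 nulls.
rng = np.random.default_rng(1)
pvals = np.concatenate([rng.beta(0.2, 5.0, 100), rng.uniform(size=900)])
covariate = np.concatenate([np.ones(100), np.zeros(900)])
weights = np.where(covariate == 1, 3.0, 0.5)   # ad hoc, illustrative weights
print(weighted_bh(pvals, weights).sum(), "rejections")
```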


Author(s):  
Amir Hassan Ghaseminejad Tafreshi

This paper identifies a criterion for choosing the largest set of rejected hypotheses in high-dimensional data analysis, where multiple hypothesis testing is used in exploratory research to identify significant associations among many variables. The method neither requires predetermined thresholds for the level of significance nor uses presumed thresholds for the false discovery rate. The upper limit on the number of rejected hypotheses is determined by finding the maximum difference between the expected numbers of true and false hypotheses among all possible sets of rejected hypotheses. Methods for choosing a reasonable number of rejected hypotheses, and an application to non-parametric analysis of ordinal survey data, are presented.
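The criterion is stated only verbally in the abstract; the sketch below records one possible reading of it, in which rejecting the k smallest p-values is assumed to incur about m * p_(k) expected false rejections (an all-null approximation) and k is chosen to maximize the difference between expected true and expected false rejections. Both the interpretation and the approximation are assumptions, not the author's stated formula:

```python
import numpy as np

def max_difference_cutoff(pvals):
    """One possible reading of the criterion: for each k, approximate the
    expected number of false rejections among the k smallest p-values by
    m * p_(k), take the rest as expected true rejections, and pick the k
    that maximizes the difference between the two."""
    p = np.sort(np.asarray(pvals, dtype=float))
    m = len(p)
    k = np.arange(1, m + 1)
    expected_false = m * p                 # all-null approximation
    expected_true = k - expected_false     # remaining rejections
    best = int(np.argmax(expected_true - expected_false)) + 1
    return best, p[best - 1]               # number rejected and the p-value cutoff
```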


Biometrika ◽  
2021 ◽  
Author(s):  
J H Loper ◽ 
L Lei ◽  
W Fithian ◽  
W Tansey

Summary: We consider the problem of multiple hypothesis testing when there is a logical nested structure to the hypotheses. When one hypothesis is nested inside another, the outer hypothesis must be false if the inner hypothesis is false. We model the nested structure as a directed acyclic graph, including chain and tree graphs as special cases. Each node in the graph is a hypothesis, and rejecting a node requires also rejecting all of its ancestors. We propose a general framework for adjusting node-level test statistics using the known logical constraints. Within this framework, we study a smoothing procedure that combines each node with all of its descendants to form a more powerful statistic. We prove that a broad class of smoothing strategies can be used with existing selection procedures to control the familywise error rate, false discovery exceedance rate, or false discovery rate, so long as the original test statistics are independent under the null. When the null statistics are not independent but are derived from positively correlated normal observations, we prove control for all three error rates when the smoothing method is arithmetic averaging of the observations. Simulations and an application to a real biology dataset demonstrate that smoothing leads to substantial power gains.
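A minimal sketch of the arithmetic-averaging smoother on a DAG; the child-map representation and the toy chain example are assumptions for illustration, and the selection step that enforces the ancestor constraints and controls the error rates is omitted:

```python
from functools import lru_cache

def smooth_by_descendant_averaging(children, stats):
    """For each node in a DAG, replace its statistic by the arithmetic mean
    of its own observation and those of all of its descendants.
    `children` maps node -> list of child nodes; `stats` maps node -> value."""
    @lru_cache(maxsize=None)
    def descendants(node):
        # Nodes must be hashable; collect all descendants recursively.
        out = set()
        for child in children.get(node, ()):
            out.add(child)
            out |= descendants(child)
        return frozenset(out)

    return {node: (stats[node] + sum(stats[d] for d in descendants(node)))
                  / (1 + len(descendants(node)))
            for node in stats}

# Example: a chain A -> B -> C, where rejecting C requires rejecting B and A.
children = {"A": ["B"], "B": ["C"], "C": []}
stats = {"A": 0.5, "B": 2.0, "C": 3.0}
print(smooth_by_descendant_averaging(children, stats))
# A is averaged with B and C, B with C, and C stays unchanged.
```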


2020 ◽  
Author(s):  
Yubo Wang ◽  
Liguan Li ◽  
Yu Xia ◽  
Feng Ju ◽  
Tong Zhang

Abstract: Neither the abundance of exo/endoglucanase GH modules nor taxonomic affiliation is informative enough to infer whether a genome belongs to a potentially cellulolytic microbe. By interpreting the complete genomes of 2,642 microbial strains whose phenotypes have been well documented, we aim to reveal a more reliable genotype-phenotype correlation for the specific functional niche of cellulose hydrolysis. By incorporating automatic recognition of potential synergy machineries into the annotation approach, a more reliable prediction of a microbe's cellulolytic competency can be achieved. The potential cellulose-hydrolyzing microbes could be categorized into five groups according to the varying synergy machineries among the annotated carbohydrate-active modules/genes. Results of the meta-analysis on the 2,642 genomes revealed that some cellulosome gene clusters lacked the surface layer homology (SLH) module, and microbial strains annotated with such cellulosome gene clusters were not necessarily cellulolytic. We hypothesized that cellulosome-independent genes harboring both the SLH module and a cellulose-binding carbohydrate-binding module (CBM) are likely an alternative gene apparatus initiating the formation of cellulose-enzyme-microbe (CEM) complexes; their role is especially important for cellulolytic anaerobes without cellulosome gene clusters.
Importance: In genome-centric prediction of a microbe's cellulolytic activity, recognizing the synergy machineries, which include but are not limited to cellulosome gene clusters, is as important as annotating the individual carbohydrate-active modules or genes. This is the first pipeline developed for automatic recognition of synergy among the annotated carbohydrate-active units. With promising resolution and reliability, this pipeline should be a good addition to the bioinformatic tools for genome-centric interpretation of the specific functional niche of cellulose hydrolysis.
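The pipeline itself is not reproduced here; the sketch below is a heavily simplified, rule-based stand-in showing what automatic recognition of synergy machineries from annotated modules might look like. The labels and rules are illustrative placeholders inferred from the abstract, not the article's five categories:

```python
def classify_cellulolytic_machinery(annotations):
    """Toy rule-based recognition of synergy machineries for one genome.
    `annotations` is a set of module/gene labels; the labels and rules are
    simplified placeholders, not the pipeline's actual categories."""
    has_cellulosome = "cellulosome_gene_cluster" in annotations
    has_slh = "SLH" in annotations
    has_cbm = "cellulose_binding_CBM" in annotations
    has_gh = "exo_endoglucanase_GH" in annotations

    if has_cellulosome and has_slh:
        return "cellulosome anchored via SLH (likely CEM-forming)"
    if has_cellulosome:
        return "cellulosome gene cluster lacking SLH (cellulolytic activity uncertain)"
    if has_gh and has_slh and has_cbm:
        return "cellulosome-independent CEM candidate (SLH plus cellulose-binding CBM)"
    if has_gh and has_cbm:
        return "free enzymes with cellulose-binding modules"
    return "no recognized cellulose-hydrolysis synergy machinery"

print(classify_cellulolytic_machinery({"exo_endoglucanase_GH", "SLH",
                                       "cellulose_binding_CBM"}))
```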

