An illustration of reproducibility in neuroscience research in the absence of selective reporting

The earth is flat (p > 0.05): significance thresholds and the crisis of unreplicable research

PeerJ ◽

10.7717/peerj.3544 ◽

2017 ◽

Vol 5 ◽

pp. e3544 ◽

Cited By ~ 87

Author(s):

Valentin Amrhein ◽

Fränzi Korner-Nievergelt ◽

Tobias Roth

Keyword(s):

Publication Bias ◽

Null Hypothesis ◽

Statistical Power ◽

Statistical Significance ◽

Practical Importance ◽

Effect Sizes ◽

Point Estimate ◽

Selective Reporting ◽

Interval Estimate ◽

True Effect

The widespread use of ‘statistical significance’ as a license for making a claim of a scientific finding leads to considerable distortion of the scientific process (according to the American Statistical Association). We review why degradingp-values into ‘significant’ and ‘nonsignificant’ contributes to making studies irreproducible, or to making them seem irreproducible. A major problem is that we tend to take smallp-values at face value, but mistrust results with largerp-values. In either case,p-values tell little about reliability of research, because they are hardly replicable even if an alternative hypothesis is true. Also significance (p ≤ 0.05) is hardly replicable: at a good statistical power of 80%, two studies will be ‘conflicting’, meaning that one is significant and the other is not, in one third of the cases if there is a true effect. A replication can therefore not be interpreted as having failed only because it is nonsignificant. Many apparent replication failures may thus reflect faulty judgment based on significance thresholds rather than a crisis of unreplicable research. Reliable conclusions on replicability and practical importance of a finding can only be drawn using cumulative evidence from multiple independent studies. However, applying significance thresholds makes cumulative knowledge unreliable. One reason is that with anything but ideal statistical power, significant effect sizes will be biased upwards. Interpreting inflated significant results while ignoring nonsignificant results will thus lead to wrong conclusions. But current incentives to hunt for significance lead to selective reporting and to publication bias against nonsignificant findings. Data dredging,p-hacking, and publication bias should be addressed by removing fixed significance thresholds. Consistent with the recommendations of the late Ronald Fisher,p-values should be interpreted as graded measures of the strength of evidence against the null hypothesis. Also largerp-values offer some evidence against the null hypothesis, and they cannot be interpreted as supporting the null hypothesis, falsely concluding that ‘there is no effect’. Information on possible true effect sizes that are compatible with the data must be obtained from the point estimate, e.g., from a sample average, and from the interval estimate, such as a confidence interval. We review how confusion about interpretation of largerp-values can be traced back to historical disputes among the founders of modern statistics. We further discuss potential arguments against removing significance thresholds, for example that decision rules should rather be more stringent, that sample sizes could decrease, or thatp-values should better be completely abandoned. We conclude that whatever method of statistical inference we use, dichotomous threshold thinking must give way to non-automated informed judgment.

Download Full-text

The earth is flat (p>0.05): Significance thresholds and the crisis of unreplicable research

10.7287/peerj.preprints.2921 ◽

2017 ◽

Author(s):

Valentin Amrhein ◽

Fränzi Korner-Nievergelt ◽

Tobias Roth

Keyword(s):

Publication Bias ◽

Null Hypothesis ◽

Statistical Power ◽

Statistical Significance ◽

Practical Importance ◽

Effect Sizes ◽

Selective Reporting ◽

Interval Estimate ◽

P Values ◽

True Effect

The widespread use of 'statistical significance' as a license for making a claim of a scientific finding leads to considerable distortion of the scientific process (according to the American Statistical Association). We review why degrading p-values into 'significant' and 'nonsignificant' contributes to making studies irreproducible, or to making them seem irreproducible. A major problem is that we tend to take small p-values at face value, but mistrust results with larger p-values. In either case, p-values tell little about reliability of research, because they are hardly replicable even if an alternative hypothesis is true. Also significance (p≤0.05) is hardly replicable: at a good statistical power of 80%, two studies will be 'conflicting', meaning that one is significant and the other is not, in one third of the cases if there is a true effect. A replication can therefore not be interpreted as having failed only because it is nonsignificant. Many apparent replication failures may thus reflect faulty judgment based on significance thresholds rather than a crisis of unreplicable research. Reliable conclusions on replicability and practical importance of a finding can only be drawn using cumulative evidence from multiple independent studies. However, applying significance thresholds makes cumulative knowledge unreliable. One reason is that with anything but ideal statistical power, significant effect sizes will be biased upwards. Interpreting inflated significant results while ignoring nonsignificant results will thus lead to wrong conclusions. But current incentives to hunt for significance lead to selective reporting and to publication bias against nonsignificant findings. Data dredging, p-hacking, and publication bias should be addressed by removing fixed significance thresholds. Consistent with the recommendations of the late Ronald Fisher, p-values should be interpreted as graded measures of the strength of evidence against the null hypothesis. Also larger p-values offer some evidence against the null hypothesis, and they cannot be interpreted as supporting the null hypothesis, falsely concluding that 'there is no effect'. Information on possible true effect sizes that are compatible with the data must be obtained from the point estimate, e.g., from a sample average, and from the interval estimate, such as a confidence interval. We review how confusion about interpretation of larger p-values can be traced back to historical disputes among the founders of modern statistics. We further discuss potential arguments against removing significance thresholds, for example that decision rules should rather be more stringent, that sample sizes could decrease, or that p-values should better be completely abandoned. We conclude that whatever method of statistical inference we use, dichotomous threshold thinking must give way to non-automated informed judgment.

Download Full-text

The earth is flat (p>0.05): Significance thresholds and the crisis of unreplicable research

10.7287/peerj.preprints.2921v2 ◽

2017 ◽

Author(s):

Valentin Amrhein ◽

Fränzi Korner-Nievergelt ◽

Tobias Roth

Keyword(s):

Publication Bias ◽

Null Hypothesis ◽

Statistical Power ◽

Statistical Significance ◽

Practical Importance ◽

Effect Sizes ◽

Selective Reporting ◽

Interval Estimate ◽

P Values ◽

True Effect

The widespread use of 'statistical significance' as a license for making a claim of a scientific finding leads to considerable distortion of the scientific process (according to the American Statistical Association). We review why degrading p-values into 'significant' and 'nonsignificant' contributes to making studies irreproducible, or to making them seem irreproducible. A major problem is that we tend to take small p-values at face value, but mistrust results with larger p-values. In either case, p-values tell little about reliability of research, because they are hardly replicable even if an alternative hypothesis is true. Also significance (p≤0.05) is hardly replicable: at a good statistical power of 80%, two studies will be 'conflicting', meaning that one is significant and the other is not, in one third of the cases if there is a true effect. A replication can therefore not be interpreted as having failed only because it is nonsignificant. Many apparent replication failures may thus reflect faulty judgment based on significance thresholds rather than a crisis of unreplicable research. Reliable conclusions on replicability and practical importance of a finding can only be drawn using cumulative evidence from multiple independent studies. However, applying significance thresholds makes cumulative knowledge unreliable. One reason is that with anything but ideal statistical power, significant effect sizes will be biased upwards. Interpreting inflated significant results while ignoring nonsignificant results will thus lead to wrong conclusions. But current incentives to hunt for significance lead to selective reporting and to publication bias against nonsignificant findings. Data dredging, p-hacking, and publication bias should be addressed by removing fixed significance thresholds. Consistent with the recommendations of the late Ronald Fisher, p-values should be interpreted as graded measures of the strength of evidence against the null hypothesis. Also larger p-values offer some evidence against the null hypothesis, and they cannot be interpreted as supporting the null hypothesis, falsely concluding that 'there is no effect'. Information on possible true effect sizes that are compatible with the data must be obtained from the point estimate, e.g., from a sample average, and from the interval estimate, such as a confidence interval. We review how confusion about interpretation of larger p-values can be traced back to historical disputes among the founders of modern statistics. We further discuss potential arguments against removing significance thresholds, for example that decision rules should rather be more stringent, that sample sizes could decrease, or that p-values should better be completely abandoned. We conclude that whatever method of statistical inference we use, dichotomous threshold thinking must give way to non-automated informed judgment.

Download Full-text

Power-Enhanced Funnel Plots for Meta-Analysis

Zeitschrift für Psychologie ◽

10.1027/2151-2604/a000392 ◽

2020 ◽

Vol 228 (1) ◽

pp. 43-49 ◽

Cited By ~ 1

Author(s):

Michael Kossmeier ◽

Ulrich S. Tran ◽

Martin Voracek

Keyword(s):

Statistical Power ◽

Meta Analysis ◽

Relevant Information ◽

Selective Reporting ◽

Graphical Display ◽

Funnel Plot ◽

Evidential Value ◽

Graphical Displays ◽

Funnel Plots ◽

Meta Analyses

Abstract. Currently, dedicated graphical displays to depict study-level statistical power in the context of meta-analysis are unavailable. Here, we introduce the sunset (power-enhanced) funnel plot to visualize this relevant information for assessing the credibility, or evidential value, of a set of studies. The sunset funnel plot highlights the statistical power of primary studies to detect an underlying true effect of interest in the well-known funnel display with color-coded power regions and a second power axis. This graphical display allows meta-analysts to incorporate power considerations into classic funnel plot assessments of small-study effects. Nominally significant, but low-powered, studies might be seen as less credible and as more likely being affected by selective reporting. We exemplify the application of the sunset funnel plot with two published meta-analyses from medicine and psychology. Software to create this variation of the funnel plot is provided via a tailored R function. In conclusion, the sunset (power-enhanced) funnel plot is a novel and useful graphical display to critically examine and to present study-level power in the context of meta-analysis.

Download Full-text

How to Detect Publication Bias in Psychological Research

Zeitschrift für Psychologie ◽

10.1027/2151-2604/a000386 ◽

2019 ◽

Vol 227 (4) ◽

pp. 261-279 ◽

Cited By ~ 2

Author(s):

Frank Renkewitz ◽

Melanie Keiner

Keyword(s):

Publication Bias ◽

Effect Size ◽

Statistical Power ◽

Type I Error ◽

Psychological Research ◽

Type I ◽

True Effect Size ◽

Questionable Research Practices ◽

True Effect ◽

Meta Analyses

Abstract. Publication biases and questionable research practices are assumed to be two of the main causes of low replication rates. Both of these problems lead to severely inflated effect size estimates in meta-analyses. Methodologists have proposed a number of statistical tools to detect such bias in meta-analytic results. We present an evaluation of the performance of six of these tools. To assess the Type I error rate and the statistical power of these methods, we simulated a large variety of literatures that differed with regard to true effect size, heterogeneity, number of available primary studies, and sample sizes of these primary studies; furthermore, simulated studies were subjected to different degrees of publication bias. Our results show that across all simulated conditions, no method consistently outperformed the others. Additionally, all methods performed poorly when true effect sizes were heterogeneous or primary studies had a small chance of being published, irrespective of their results. This suggests that in many actual meta-analyses in psychology, bias will remain undiscovered no matter which detection method is used.

Download Full-text

A Collaborative Study to Establish the Second International Standard for Streptokinase

Thrombosis and Haemostasis ◽

10.1055/s-0038-1647298 ◽

1990 ◽

Vol 64 (02) ◽

pp. 267-269 ◽

Cited By ~ 7

Author(s):

A B Heath ◽

P J Gaffney

Keyword(s):

High Purity ◽

Collaborative Study ◽

World Health Organisation ◽

World Health ◽

Clot Lysis ◽

Current Standard ◽

International Standard ◽

Accelerated Degradation ◽

The World ◽

International Collaborative Study

SummaryAn International Standard for Streptokinase - Streptodomase (62/7) has been used to calibrate high purity clinical batches of SK since 1965. An international collaborative study, involving six laboratories, was undertaken to replace this standard with a high purity standard for SK. Two candidate preparations (88/826 and 88/824) were compared by a clot lysis assay with the current standard (62/7). Potencies of 671 i.u. and 461 i.u. were established for preparations A (88/826) and B (88/824), respectively.Either preparation appeared suitable to serve as a standard for SK. However, each ampoule of preparation A (88/826) contains a more appropriate amount of SK activity for potency testing, and is therefore preferred. Accelerated degradation tests indicate that preparation A (88/826) is very stable.The high purity streptokinase preparation, coded 88/826, has been established by the World Health Organisation as the 2nd International Standard for Streptokinase, with an assigned potency of 700 i.u. per ampoule.

Download Full-text

Restrained Eating Is Associated with Lower Cortical Thickness in the Inferior Frontal Gyrus in Adolescents

Brain Sciences ◽

10.3390/brainsci11080978 ◽

2021 ◽

Vol 11 (8) ◽

pp. 978

Author(s):

Isabel García-García ◽

Maite Garolera ◽

Jonatan Ottino-González ◽

Xavier Prats-Soteras ◽

Anna Prunell-Castañé ◽

...

Keyword(s):

Cortical Thickness ◽

Eating Behaviors ◽

Inferior Frontal Gyrus ◽

Brain Mri ◽

Neural Mechanism ◽

Restrained Eating ◽

Eating Patterns ◽

Restrictive Eating ◽

And Performance ◽

Uncontrolled Eating

Some eating patterns, such as restrained eating and uncontrolled eating, are risk factors for eating disorders. However, it is not yet clear whether they are associated with neurocognitive differences. In the current study, we analyzed whether eating patterns can be used to classify participants into meaningful clusters, and we examined whether there are neurocognitive differences between the clusters. Adolescents (n = 108; 12 to 17 years old) and adults (n = 175, 18 to 40 years old) completed the Three Factor Eating Questionnaire, which was used to classify participants according to their eating profile using k means clustering. Participants also completed personality questionnaires and a neuropsychological examination. A subsample of participants underwent a brain MRI acquisition. In both samples, we obtained a cluster characterized by high uncontrolled eating patterns, a cluster with high scores in restrictive eating, and a cluster with low scores in problematic eating behaviors. The clusters were equivalent with regards to personality and performance in executive functions. In adolescents, the cluster with high restrictive eating showed lower cortical thickness in the inferior frontal gyrus compared to the other two clusters. We hypothesize that this difference in cortical thickness represents an adaptive neural mechanism that facilitates inhibition processes.

Download Full-text

Homeopathy: statistical significance versus the sample size in experiments with Toxoplasma gondii

International Journal of High Dilution Research - ISSN 1982-6206 ◽

10.51910/ijhdr.v10i36.466 ◽

2021 ◽

Vol 10 (36) ◽

pp. 115-118

Author(s):

ÃƒÆ’Ã¢â‚¬Â°rika Cristina Ferreira ◽

Paula Fernanda Massini ◽

Caroline Felicio Braga ◽

Ricardo Nascimento Drozino ◽

Neide Martins Moreira ◽

...

Keyword(s):

Toxoplasma Gondii ◽

Sample Size ◽

Biological Effects ◽

Statistical Significance ◽

Rare Event ◽

Public Health Problem ◽

Control Group ◽

Bootstrap Analysis ◽

Significant Difference ◽

The World

Introduction: Toxoplasmosis is a zoonosis that represents a serious public health problem, caused by Toxoplasma gondii, which affects 20-90% of the world human population [1,2]. It is a serious problem especially when considering the congenital transmission due to congenital sequels. Treatment with highly diluted substances is one of the alternative/complementary medicines most employed in the world [3,4]. The current ethical rules regarding the number of animals used in animal experimental protocols with the use of more conservative statistical methods [5] can not enhance the biological effects of highly diluted substances observed by the experience of the researcher. Aim: To evaluate the minimum number of animals per group to achieve a significant difference among the groups of animals treated with biotherapic T. gondii and infected with the protozoan regarding the number of cysts observed in the brain. Material and methods: A blind randomized controlled trial was performed using eleven Swiss male mice, aged 57 days, divided into two groups: BIOT-200DH - treated with biotherapic (n=6) and CONTROL - treated with hydroalcoholic solution 7% (n=7).The animals of the group BIOT-200DH were treated for 3 consecutive days in a single dose 0.1ml/dose/day. The animals of BIOT ÃƒÂ¢Ã¢â€šÂ¬Ã¢â‚¬Å“ 200DH group were orally infected with 20 cysts of ME49-T. gondii. The animals of the control group were treated with cereal alcohol 7% (n=7) for 3 consecutive days and then were infected with 20 cysts of ME49 -T. gondii orally. The biotherapic 200DH T. gondii was prepared with homogenized mouse brain, with 20 cysts of T. gondii / 100ÃƒÅ½Ã‚Â¼L according to the Brazilian Homeopathic Pharmacopoeia [6] in laminar flow. After 60 days post-infection the animals were killed in a chamber saturated with halothane, the brains were homogenized and resuspended in 1 ml of saline solution. Cysts were counted in 25 ml of this suspension, covered with a 24x24 mm coverglass, examined in its full length. This study was approved by the Ethics Committee for animal experimentation of the UEM - Protocol 036/2009. The data were compared using the tests Mann Whitney and Bootstrap [7] with the statistical software BioStat 5.0. Results and discussion: There was no significant difference when analyzed with the Mann-Whitney, even multiplying the "n" ten times (p=0.0618). The number of cysts observed in BIOT 200DH group was 4.5 Ãƒâ€šÃ‚Â± 3.3 and 12.8 Ãƒâ€šÃ‚Â± 9.7 in the CONTROL group. Table 1 shows the results obtained using the bootstrap analysis for each data changed from 2n until 2n+5, and their respective p-values. With the inclusion of more elements in the different groups, tested one by one, randomly, increasing gradually the samples, we observed the sample size needed to statistically confirm the results seen experimentally. Using 17 mice in group BIOT 200DH and 19 in the CONTROL group we have already observed statistical significance. This result suggests that experiments involving highly diluted substances and infection of mice with T. gondii should work with experimental groups with 17 animals at least. Despite the current and relevant ethical discussions about the number of animals used for experimental procedures the number of animals involved in each experiment must meet the characteristics of each item to be studied. In the case of experiments involving highly diluted substances, experimental animal models are still rudimentary and the biological effects observed appear to be also individualized, as described in literature for homeopathy [8]. The fact that the statistical significance was achieved by increasing the sample observed in this trial, tell us about a rare event, with a strong individual behavior, difficult to demonstrate in a result set, treated simply with a comparison of means or medians. Conclusion: Bootstrap seems to be an interesting methodology for the analysis of data obtained from experiments with highly diluted substances. Experiments involving highly diluted substances and infection of mice with T. gondii should be better work with experimental groups using 17 animals at least.

Download Full-text

Statistical Reliability of 10 Years of Cyber Security User Studies

Lecture Notes in Computer Science - Socio-Technical Aspects in Security and Trust ◽

10.1007/978-3-030-79318-0_10 ◽

2021 ◽

pp. 171-190

Author(s):

Thomas Groß

Keyword(s):

Publication Bias ◽

Cyber Security ◽

Power Distribution ◽

Statistical Power ◽

Total Sample ◽

User Studies ◽

Effect Sizes ◽

Standard Errors ◽

Statistical Reliability ◽

Statistical Inferences

AbstractBackground. In recent years, cyber security user studies have been appraised in meta-research, mostly focusing on the completeness of their statistical inferences and the fidelity of their statistical reporting. However, estimates of the field’s distribution of statistical power and its publication bias have not received much attention.Aim. In this study, we aim to estimate the effect sizes and their standard errors present as well as the implications on statistical power and publication bias.Method. We built upon a published systematic literature review of 146 user studies in cyber security (2006–2016). We took into account 431 statistical inferences including t-, $$\chi ^2$$ χ 2 -, r-, one-way F-tests, and Z-tests. In addition, we coded the corresponding total sample sizes, group sizes and test families. Given these data, we established the observed effect sizes and evaluated the overall publication bias. We further computed the statistical power vis-à-vis of parametrized population thresholds to gain unbiased estimates of the power distribution.Results. We obtained a distribution of effect sizes and their conversion into comparable log odds ratios together with their standard errors. We, further, gained funnel-plot estimates of the publication bias present in the sample as well as insights into the power distribution and its consequences.Conclusions. Through the lenses of power and publication bias, we shed light on the statistical reliability of the studies in the field. The upshot of this introspection is practical recommendations on conducting and evaluating studies to advance the field.

Download Full-text

Improvement of Statistical Power to Detect Publication Bias in Meta-analysis Using the Clinical Trial Registration System

Japanese Journal of Biometrics ◽

10.5691/jjb.32.13 ◽

2011 ◽

Vol 32 (1) ◽

pp. 13-31

Author(s):

Nobushige Matsuoka ◽

Hiroshi Horio ◽

Chikuma Hamada

Keyword(s):

Clinical Trial ◽

Publication Bias ◽

Statistical Power ◽

Meta Analysis ◽

Trial Registration ◽

Clinical Trial Registration ◽

Registration System

Download Full-text