Damaging Real Lives through Obstinacy: Re-Emphasising Why Significance Testing is Wrong

2016 ◽  
Vol 21 (1) ◽  
pp. 102-115 ◽  
Author(s):  
Stephen Gorard

This paper reminds readers of the absurdity of statistical significance testing, despite its continued widespread use as a supposed method for analysing numeric data. There have been complaints about the poor quality of research employing significance tests for a hundred years, and repeated calls for researchers to stop using and reporting them. There have even been attempted bans. Many thousands of papers have now been written, in all areas of research, explaining why significance tests do not work. There are too many for all to be cited here. This paper summarises the logical problems as described in over 100 of these prior pieces. It then presents a series of demonstrations showing that significance tests do not work in practice. In fact, they are more likely to produce the wrong answer than a right one. The confused use of significance testing has practical and damaging consequences for people's lives. Ending the use of significance tests is a pressing ethical issue for research. Anyone knowing the problems, as described over the past hundred years, who continues to teach, use or publish significance tests is acting unethically, and knowingly risking the damage that ensues.
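The claim that significance tests are more likely to produce a wrong answer than a right one can be illustrated with a short simulation in the spirit of the paper's demonstrations. The base rate of true effects, the power and the alpha level below are illustrative assumptions, not figures taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical scenario: 10% of tested effects are real, the rest are
# null; alpha = 0.05 and power = 0.5 for the real effects.  These
# figures are assumptions for illustration, not estimates for any
# actual literature.
n_tests = 100_000
prior_real = 0.10
alpha, power = 0.05, 0.50

real = rng.random(n_tests) < prior_real
# Real effects reach significance with probability `power`;
# null effects reach it with probability `alpha`.
significant = np.where(real,
                       rng.random(n_tests) < power,
                       rng.random(n_tests) < alpha)

# Among "significant" results, what fraction are false positives?
false_discovery = (significant & ~real).sum() / significant.sum()
print(f"share of significant results that are false: {false_discovery:.2f}")
```

Under these assumed numbers, roughly half of all "significant" results are false positives, even though every individual test was run at the conventional 5% level.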

2013 ◽  
Vol 12 (3) ◽  
pp. 345-351 ◽  
Author(s):  
Jessica Middlemis Maher ◽  
Jonathan C. Markey ◽  
Diane Ebert-May

Statistical significance testing is the cornerstone of quantitative research, but studies that fail to report measures of effect size are potentially missing a robust part of the analysis. We provide a rationale for why effect size measures should be included in quantitative discipline-based education research. Examples from both biological and educational research demonstrate the utility of effect size for evaluating practical significance. We also provide details about some effect size indices that are paired with common statistical significance tests used in educational research and offer general suggestions for interpreting effect size measures. Finally, we discuss some inherent limitations of effect size measures and provide further recommendations about reporting confidence intervals.
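The distinction the authors draw between statistical and practical significance can be sketched with simulated data: with a large enough sample, a trivially small effect is "statistically significant", while the effect size (here Cohen's d) makes its practical irrelevance visible. The sample sizes and true effect below are hypothetical:

```python
import math
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical samples whose true means differ by only 0.1 SD.
# With n = 10,000 per group the test comes out "significant" even
# though the standardized effect is, by convention, trivially small.
a = rng.normal(0.0, 1.0, 10_000)
b = rng.normal(0.1, 1.0, 10_000)

# Two-sample t statistic (equal n, pooled variance); at this sample
# size the normal approximation to its null distribution is accurate.
n = a.size
pooled_var = (a.var(ddof=1) + b.var(ddof=1)) / 2
t = (b.mean() - a.mean()) / math.sqrt(2 * pooled_var / n)
p = math.erfc(abs(t) / math.sqrt(2))  # two-sided p, normal approximation

# Cohen's d: the same mean difference scaled by the pooled SD.
d = (b.mean() - a.mean()) / math.sqrt(pooled_var)

print(f"p = {p:.2g}, Cohen's d = {d:.2f}")
```

The p-value is far below any conventional threshold, while d stays near 0.1, well under the usual "small effect" benchmark of 0.2 — exactly the gap between statistical and practical significance the abstract describes.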


1998 ◽  
Vol 21 (2) ◽  
pp. 205-206 ◽  
Author(s):  
John F. Kihlstrom

Statistical significance testing has its problems, but so do the alternatives that are proposed; and the alternatives may be both more cumbersome and less informative. Significance tests remain legitimate aspects of the rhetoric of scientific persuasion.


2020 ◽  
pp. bmjebm-2019-111257 ◽  
Author(s):  
Phoebe Rose Marson Smith ◽  
Lynda Ware ◽  
Clive Adams ◽  
Iain Chalmers

Estimates of treatment effects/differences derived from controlled comparisons are subject to uncertainty, both because of the quality of the data and the play of chance. Despite this, authors sometimes use statistical significance testing to make definitive statements that ‘no difference exists between’ treatments. A survey to assess abstracts of Cochrane reviews published in 2001/2002 identified unqualified claims of ‘no difference’ or ‘no effect’ in 259 (21.3%) out of 1212 review abstracts surveyed. We have repeated the survey to assess the frequency of such claims among the abstracts of Cochrane and other systematic reviews published in 2017. We surveyed the 643 Cochrane review abstracts published in 2017 and a random sample of 643 abstracts of other systematic reviews published in the same year. We excluded review abstracts that referred only to a protocol, lacked a conclusion or did not contain any relevant information. We took steps to reduce biases during our survey. ‘No difference/no effect’ was claimed in the abstracts of 36 (7.8%) of 460 Cochrane reviews and in the abstracts of 13 (6.0%) of 218 other systematic reviews. Incorrect claims of no difference/no effect of treatments were substantially less common in Cochrane reviews published in 2017 than they were in abstracts of reviews published in 2001/2002. We hope that this reflects greater efforts to reduce biases and inconsistent judgements in the later survey as well as more careful wording of review abstracts. There are numerous other ways of wording treatment claims incorrectly. These must be addressed because they can have adverse effects on healthcare and health research.


2021 ◽  
pp. 174569162097060 ◽ 
Author(s):  
Klaus Fiedler ◽  
Linda McCaughey ◽  
Johannes Prager

The current debate about how to improve the quality of psychological science revolves, almost exclusively, around the subordinate level of statistical significance testing. In contrast, research design and strict theorizing, which are superordinate to statistics in the methods hierarchy, are sorely neglected. The present article is devoted to the key role assigned to manipulation checks (MCs) for scientific quality control. MCs not only afford a critical test of the premises of hypothesis testing but also (a) prompt clever research design and validity control, (b) carry over to refined theorizing, and (c) have important implications for other facets of methodology, such as replication science. On the basis of an analysis of the reality of MCs reported in current issues of the Journal of Personality and Social Psychology, we propose a future methodology for the post p < .05 era that replaces scrutiny in significance testing with refined validity control and diagnostic research designs.


2020 ◽  
Author(s):  
Jan Benjamin Vornhagen ◽  
April Tyack ◽  
Elisa D Mekler

Statistical Significance Testing, or Null Hypothesis Significance Testing (NHST), is common to quantitative CHI PLAY research. Drawing from recent work in HCI and psychology promoting transparent statistics and the reduction of questionable research practices, we systematically review the reporting quality of 119 CHI PLAY papers using NHST (data and analysis plan at https://osf.io/4mcbn/). We find that over half of these papers employ NHST without specific statistical hypotheses or research questions, which may risk the proliferation of false positive findings. Moreover, we observe inconsistencies in the reporting of sample sizes and statistical tests. These issues reflect fundamental incompatibilities between NHST and the frequently exploratory work common to CHI PLAY. We discuss the complementary roles of exploratory and confirmatory research, and provide a template for more transparent research and reporting practices.


1983 ◽  
Vol 20 (2) ◽  
pp. 122-133 ◽  
Author(s):  
Alan G. Sawyer ◽  
J. Paul Peter

Classical statistical significance testing is the primary method by which marketing researchers empirically test hypotheses and draw inferences about theories. The authors discuss the interpretation and value of classical statistical significance tests and suggest that classical inferential statistics may be misinterpreted and overvalued by marketing researchers in judging research results. Replication, Bayesian hypothesis testing, meta-analysis, and strong inference are examined as approaches for augmenting conventional statistical analyses.
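Of the augmenting approaches the authors examine, meta-analysis is the most mechanical to illustrate: rather than asking whether any single result is "significant", it pools effect estimates across replications. A minimal fixed-effect (inverse-variance) sketch, with hypothetical study estimates, might look like:

```python
import math

# Hypothetical replications of one marketing effect: each tuple is
# (effect estimate, standard error).  These numbers are illustrative
# assumptions, not results from any actual studies.
studies = [(0.30, 0.15), (0.10, 0.10), (0.22, 0.12)]

# Fixed-effect meta-analysis: weight each study by the inverse of its
# sampling variance, then pool.
weights = [1 / se**2 for _, se in studies]
est = sum(w * e for (e, _), w in zip(studies, weights)) / sum(weights)
se = math.sqrt(1 / sum(weights))
ci = (est - 1.96 * se, est + 1.96 * se)

print(f"pooled effect = {est:.3f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

The pooled estimate and its confidence interval summarise the evidence across studies directly, which is the kind of cumulative inference the authors suggest as a complement to one-off significance tests.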


2020 ◽  
pp. 34-36 ◽ 
Author(s):  
M. A. Pokhaznikova ◽  
E. A. Andreeva ◽  
O. Yu. Kuznetsova

The article discusses the experience of training general practitioners in spirometry and of conducting spirometry as part of the RESPECT study (RESearch on the PrEvalence and the diagnosis of COPD and its Tobacco-related aetiology). A total of 33 general practitioners trained in spirometry examined 3119 patients. 84.1% of the spirometric studies met the quality criteria. An analysis of the most common mistakes made by doctors during the forced expiratory maneuver is included. The most frequent errors were an exhalation lasting less than 6 s (54%), non-maximal effort throughout the test, and lack of reproducibility (11.3%). Independent predictors of poor spirogram quality were male gender, obstruction (FEV1/FVC < 0.7), and the center where the study was performed. The proportion of good-quality spirograms ranged from 96.1% (95% CI 83.2–110.4) to 59.8% (95% CI 49.6–71.4) depending on the center. Subsequently, the reasons for the poor quality of the studies at individual centers were analysed and the identified shortcomings were eliminated. Poor-quality spirograms were associated either with errors by the doctors who performed the study or with technical malfunctions of the spirometer.


2020 ◽  
Vol 103 (6) ◽  
pp. 548-552 ◽ 

Objective: To predict the quality of anticoagulation control in patients with atrial fibrillation (AF) receiving warfarin in Thailand. Materials and Methods: The present study retrospectively recruited Thai AF patients who had received warfarin for three months or longer between June 2012 and December 2017 at the Central Chest Institute of Thailand. The patients were classified into those with a SAMe-TT₂R₂ score of 2 or less and those with a score of 3 or more. The Chi-square test or Fisher’s exact test was used to compare the proportion of patients with poor time in therapeutic range (TTR) between the two SAMe-TT₂R₂ score groups. The discrimination performance of the SAMe-TT₂R₂ score was assessed with c-statistics. Results: Ninety AF patients were enrolled. The average age was 69.89±10.04 years. Most patients had persistent AF. The average CHA₂DS₂-VASc, SAMe-TT₂R₂, and HAS-BLED scores were 3.68±1.51, 3.26±0.88, and 1.98±0.85, respectively. The proportion of AF patients with poor TTR increased with higher SAMe-TT₂R₂ scores. AF patients with a SAMe-TT₂R₂ score of 3 or more had a significantly larger proportion with poor TTR than those with a score of 2 or less when TTR was below 70% (p=0.03) and 65% (p=0.04), respectively. The discrimination performance of the SAMe-TT₂R₂ score yielded c-statistics of 0.60, 0.59, and 0.55 when TTR was below 70%, 65%, and 60%, respectively. Conclusion: Thai AF patients receiving warfarin had a larger proportion of patients with poor TTR when the SAMe-TT₂R₂ score was higher. A score of 3 or more could predict poor quality of anticoagulation control in these patients. Keywords: Time in therapeutic range, Poor quality of anticoagulation control, Warfarin, SAMe-TT₂R₂, Labile INR
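The Fisher's exact comparison of poor-TTR proportions described in this abstract can be sketched as follows. The 2×2 counts below are hypothetical, not the study's data, and the test is implemented from scratch using only the standard library:

```python
from math import comb

# Hypothetical 2x2 table (illustrative counts, not the study's data):
# rows = SAMe-TT2R2 score group (<=2 vs >=3),
# columns = TTR status (good vs poor).
table = [[20, 10],   # score <= 2: 10/30 patients with poor TTR
         [25, 35]]   # score >= 3: 35/60 patients with poor TTR

def fisher_exact_two_sided(t):
    """Two-sided Fisher's exact test for a 2x2 table: sum the
    hypergeometric probabilities of every table with the same margins
    that is no more likely than the observed one."""
    (a, b), (c, d) = t
    row1, col1, n = a + b, a + c, a + b + c + d

    def p_table(x):  # probability that cell (0, 0) equals x
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    p_obs = p_table(a)
    lo, hi = max(0, row1 + col1 - n), min(row1, col1)
    return sum(p_table(x) for x in range(lo, hi + 1)
               if p_table(x) <= p_obs * (1 + 1e-9))

p = fisher_exact_two_sided(table)
print(f"two-sided p = {p:.3f}")
```

With these made-up counts (33% vs 58% poor TTR) the test returns a p-value below 0.05, the same style of group comparison the abstract reports; in practice a library routine such as SciPy's `fisher_exact` would normally be used instead.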

