Damaging Real Lives through Obstinacy: Re-Emphasising Why Significance Testing is Wrong

2016 ◽  
Vol 21 (1) ◽  
pp. 102-115 ◽  
Author(s):  
Stephen Gorard

This paper reminds readers of the absurdity of statistical significance testing, despite its continued widespread use as a supposed method for analysing numeric data. There have been complaints about the poor quality of research employing significance tests for a hundred years, and repeated calls for researchers to stop using and reporting them. There have even been attempted bans. Many thousands of papers have now been written, in all areas of research, explaining why significance tests do not work. There are too many for all to be cited here. This paper summarises the logical problems as described in over 100 of these prior pieces. It then presents a series of demonstrations showing that significance tests do not work in practice. In fact, they are more likely to produce the wrong answer than a right one. The confused use of significance testing has practical and damaging consequences for people's lives. Ending the use of significance tests is a pressing ethical issue for research. Anyone knowing the problems, as described over the past hundred years, who continues to teach, use or publish significance tests is acting unethically, and knowingly risking the damage that ensues.
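The claim that significance tests are more likely to produce a wrong answer than a right one can be illustrated with a short simulation in the spirit of the paper's demonstrations. The base rate of true effects, the power and the alpha level below are illustrative assumptions, not figures taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical scenario: 10% of tested effects are real, the rest are
# null; alpha = 0.05 and power = 0.5 for the real effects.  These
# figures are assumptions for illustration, not estimates for any
# actual literature.
n_tests = 100_000
prior_real = 0.10
alpha, power = 0.05, 0.50

real = rng.random(n_tests) < prior_real
# Real effects reach significance with probability `power`;
# null effects reach it with probability `alpha`.
significant = np.where(real,
                       rng.random(n_tests) < power,
                       rng.random(n_tests) < alpha)

# Among "significant" results, what fraction are false positives?
false_discovery = (significant & ~real).sum() / significant.sum()
print(f"share of significant results that are false: {false_discovery:.2f}")
```

Under these assumed numbers, roughly half of all "significant" results are false positives, even though every individual test was run at the conventional 5% level.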

2013 ◽  
Vol 12 (3) ◽  
pp. 345-351 ◽  
Author(s):  
Jessica Middlemis Maher ◽  
Jonathan C. Markey ◽  
Diane Ebert-May

Statistical significance testing is the cornerstone of quantitative research, but studies that fail to report measures of effect size are potentially missing a robust part of the analysis. We provide a rationale for why effect size measures should be included in quantitative discipline-based education research. Examples from both biological and educational research demonstrate the utility of effect size for evaluating practical significance. We also provide details about some effect size indices that are paired with common statistical significance tests used in educational research and offer general suggestions for interpreting effect size measures. Finally, we discuss some inherent limitations of effect size measures and provide further recommendations about reporting confidence intervals.
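The distinction the authors draw between statistical and practical significance can be sketched with simulated data: with a large enough sample, a trivially small effect is "statistically significant", while the effect size (here Cohen's d) makes its practical irrelevance visible. The sample sizes and true effect below are hypothetical:

```python
import math
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical samples whose true means differ by only 0.1 SD.
# With n = 10,000 per group the test comes out "significant" even
# though the standardized effect is, by convention, trivially small.
a = rng.normal(0.0, 1.0, 10_000)
b = rng.normal(0.1, 1.0, 10_000)

# Two-sample t statistic (equal n, pooled variance); at this sample
# size the normal approximation to its null distribution is accurate.
n = a.size
pooled_var = (a.var(ddof=1) + b.var(ddof=1)) / 2
t = (b.mean() - a.mean()) / math.sqrt(2 * pooled_var / n)
p = math.erfc(abs(t) / math.sqrt(2))  # two-sided p, normal approximation

# Cohen's d: the same mean difference scaled by the pooled SD.
d = (b.mean() - a.mean()) / math.sqrt(pooled_var)

print(f"p = {p:.2g}, Cohen's d = {d:.2f}")
```

The p-value is far below any conventional threshold, while d stays near 0.1, well under the usual "small effect" benchmark of 0.2 — exactly the gap between statistical and practical significance the abstract describes.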


1998 ◽  
Vol 21 (2) ◽  
pp. 205-206 ◽  
Author(s):  
John F. Kihlstrom

Statistical significance testing has its problems, but so do the alternatives that are proposed; and the alternatives may be both more cumbersome and less informative. Significance tests remain legitimate aspects of the rhetoric of scientific persuasion.


2020 ◽  
pp. bmjebm-2019-111257 ◽  
Author(s):  
Phoebe Rose Marson Smith ◽  
Lynda Ware ◽  
Clive Adams ◽  
Iain Chalmers

Estimates of treatment effects/differences derived from controlled comparisons are subject to uncertainty, both because of the quality of the data and the play of chance. Despite this, authors sometimes use statistical significance testing to make definitive statements that ‘no difference exists between’ treatments. A survey to assess abstracts of Cochrane reviews published in 2001/2002 identified unqualified claims of ‘no difference’ or ‘no effect’ in 259 (21.3%) out of 1212 review abstracts surveyed. We have repeated the survey to assess the frequency of such claims among the abstracts of Cochrane and other systematic reviews published in 2017. We surveyed the 643 Cochrane review abstracts published in 2017 and a random sample of 643 abstracts of other systematic reviews published in the same year. We excluded review abstracts that referred only to a protocol, lacked a conclusion or did not contain any relevant information. We took steps to reduce biases during our survey. ‘No difference/no effect’ was claimed in the abstracts of 36 (7.8%) of 460 Cochrane reviews and in the abstracts of 13 (6.0%) of 218 other systematic reviews. Incorrect claims of no difference/no effect of treatments were substantially less common in Cochrane reviews published in 2017 than they were in abstracts of reviews published in 2001/2002. We hope that this reflects greater efforts to reduce biases and inconsistent judgements in the later survey as well as more careful wording of review abstracts. There are numerous other ways of wording treatment claims incorrectly. These must be addressed because they can have adverse effects on healthcare and health research.


2021 ◽  
pp. 174569162097060 ◽ 
Author(s):  
Klaus Fiedler ◽  
Linda McCaughey ◽  
Johannes Prager

The current debate about how to improve the quality of psychological science revolves, almost exclusively, around the subordinate level of statistical significance testing. In contrast, research design and strict theorizing, which are superordinate to statistics in the methods hierarchy, are sorely neglected. The present article is devoted to the key role assigned to manipulation checks (MCs) for scientific quality control. MCs not only afford a critical test of the premises of hypothesis testing but also (a) prompt clever research design and validity control, (b) carry over to refined theorizing, and (c) have important implications for other facets of methodology, such as replication science. On the basis of an analysis of the reality of MCs reported in current issues of the Journal of Personality and Social Psychology, we propose a future methodology for the post p < .05 era that replaces scrutiny in significance testing with refined validity control and diagnostic research designs.


2020 ◽  
Author(s):  
Jan Benjamin Vornhagen ◽  
April Tyack ◽  
Elisa D Mekler

Statistical Significance Testing, or Null Hypothesis Significance Testing (NHST), is common to quantitative CHI PLAY research. Drawing from recent work in HCI and psychology promoting transparent statistics and the reduction of questionable research practices, we systematically review the reporting quality of 119 CHI PLAY papers using NHST (data and analysis plan at https://osf.io/4mcbn/). We find that over half of these papers employ NHST without specific statistical hypotheses or research questions, which may risk the proliferation of false positive findings. Moreover, we observe inconsistencies in the reporting of sample sizes and statistical tests. These issues reflect fundamental incompatibilities between NHST and the frequently exploratory work common to CHI PLAY. We discuss the complementary roles of exploratory and confirmatory research, and provide a template for more transparent research and reporting practices.


1983 ◽  
Vol 20 (2) ◽  
pp. 122-133 ◽  
Author(s):  
Alan G. Sawyer ◽  
J. Paul Peter

Classical statistical significance testing is the primary method by which marketing researchers empirically test hypotheses and draw inferences about theories. The authors discuss the interpretation and value of classical statistical significance tests and suggest that classical inferential statistics may be misinterpreted and overvalued by marketing researchers in judging research results. Replication, Bayesian hypothesis testing, meta-analysis, and strong inference are examined as approaches for augmenting conventional statistical analyses.
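Of the augmenting approaches the authors examine, meta-analysis is the most mechanical to illustrate: rather than asking whether any single result is "significant", it pools effect estimates across replications. A minimal fixed-effect (inverse-variance) sketch, with hypothetical study estimates, might look like:

```python
import math

# Hypothetical replications of one marketing effect: each tuple is
# (effect estimate, standard error).  These numbers are illustrative
# assumptions, not results from any actual studies.
studies = [(0.30, 0.15), (0.10, 0.10), (0.22, 0.12)]

# Fixed-effect meta-analysis: weight each study by the inverse of its
# sampling variance, then pool.
weights = [1 / se**2 for _, se in studies]
est = sum(w * e for (e, _), w in zip(studies, weights)) / sum(weights)
se = math.sqrt(1 / sum(weights))
ci = (est - 1.96 * se, est + 1.96 * se)

print(f"pooled effect = {est:.3f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

The pooled estimate and its confidence interval summarise the evidence across studies directly, which is the kind of cumulative inference the authors suggest as a complement to one-off significance tests.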


2020 ◽  
pp. 34-36 ◽ 
Author(s):  
M. A. Pokhaznikova ◽  
E. A. Andreeva ◽  
O. Yu. Kuznetsova

The article discusses the experience of training general practitioners in spirometry and of conducting spirometry as part of the RESPECT study (RESearch on the PrEvalence and the diagnosis of COPD and its Tobacco-related aetiology). A total of 33 general practitioners trained in spirometry examined 3119 patients. 84.1% of the spirometric studies met the quality criteria. An analysis of the most common mistakes made by doctors during the forced expiratory maneuver is included. The most frequent errors were an exhalation lasting less than 6 s (54%), non-maximal effort throughout the test, and lack of reproducibility (11.3%). Independent predictors of poor spirogram quality were male gender, obstruction (FEV1/FVC < 0.7), and the center where the study was performed. The proportion of good-quality spirograms ranged from 96.1% (95% CI 83.2–110.4) to 59.8% (95% CI 49.6–71.4) depending on the center. Subsequently, the reasons for the poor quality of the studies at individual centers were analysed and the identified shortcomings were eliminated. Poor-quality spirograms were associated either with errors by the doctors who performed the study or with technical malfunctions of the spirometer.


2020 ◽  
Vol 103 (6) ◽  
pp. 548-552 ◽ 

Objective: To predict the quality of anticoagulation control in patients with atrial fibrillation (AF) receiving warfarin in Thailand. Materials and Methods: The present study retrospectively recruited Thai AF patients who had received warfarin for three months or longer between June 2012 and December 2017 at the Central Chest Institute of Thailand. The patients were classified into those with a SAMe-TT₂R₂ score of 2 or less and those with a score of 3 or more. The Chi-square test or Fisher’s exact test was used to compare the proportion of patients with poor time in therapeutic range (TTR) between the two SAMe-TT₂R₂ score groups. The discrimination performance of the SAMe-TT₂R₂ score was assessed with c-statistics. Results: Ninety AF patients were enrolled. The average age was 69.89±10.04 years. Most patients had persistent AF. The average CHA₂DS₂-VASc, SAMe-TT₂R₂, and HAS-BLED scores were 3.68±1.51, 3.26±0.88, and 1.98±0.85, respectively. The proportion of AF patients with poor TTR increased with higher SAMe-TT₂R₂ scores. AF patients with a SAMe-TT₂R₂ score of 3 or more had a significantly larger proportion with poor TTR than those with a score of 2 or less when TTR was below 70% (p=0.03) and 65% (p=0.04), respectively. The discrimination performance of the SAMe-TT₂R₂ score yielded c-statistics of 0.60, 0.59, and 0.55 when TTR was below 70%, 65%, and 60%, respectively. Conclusion: Thai AF patients receiving warfarin had a larger proportion of patients with poor TTR when the SAMe-TT₂R₂ score was higher. A score of 3 or more could predict poor quality of anticoagulation control in these patients. Keywords: Time in therapeutic range, Poor quality of anticoagulation control, Warfarin, SAMe-TT₂R₂, Labile INR
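The Fisher's exact comparison of poor-TTR proportions described in this abstract can be sketched as follows. The 2×2 counts below are hypothetical, not the study's data, and the test is implemented from scratch using only the standard library:

```python
from math import comb

# Hypothetical 2x2 table (illustrative counts, not the study's data):
# rows = SAMe-TT2R2 score group (<=2 vs >=3),
# columns = TTR status (good vs poor).
table = [[20, 10],   # score <= 2: 10/30 patients with poor TTR
         [25, 35]]   # score >= 3: 35/60 patients with poor TTR

def fisher_exact_two_sided(t):
    """Two-sided Fisher's exact test for a 2x2 table: sum the
    hypergeometric probabilities of every table with the same margins
    that is no more likely than the observed one."""
    (a, b), (c, d) = t
    row1, col1, n = a + b, a + c, a + b + c + d

    def p_table(x):  # probability that cell (0, 0) equals x
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    p_obs = p_table(a)
    lo, hi = max(0, row1 + col1 - n), min(row1, col1)
    return sum(p_table(x) for x in range(lo, hi + 1)
               if p_table(x) <= p_obs * (1 + 1e-9))

p = fisher_exact_two_sided(table)
print(f"two-sided p = {p:.3f}")
```

With these made-up counts (33% vs 58% poor TTR) the test returns a p-value below 0.05, the same style of group comparison the abstract reports; in practice a library routine such as SciPy's `fisher_exact` would normally be used instead.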

