Beyond the t-Test: Statistical Equivalence Testing

2005 ◽  
Vol 77 (11) ◽  
pp. 221A-226A ◽  
Author(s):  
Giselle B. Limentani ◽  
Moira C. Ringo ◽  
Feng Ye ◽  
Mandy L. Bergquist ◽  
Ellen O. McSorley


2020 ◽  
Author(s):  
Anthony Schmidt

Intensive English programs (IEPs) offer an additional pathway into higher education for international students who need further language support before full matriculation. Despite their long history in higher education, there is little research on the effectiveness of these programs. The current research examines the effectiveness of an IEP by comparing IEP students to directly-admitted international students. Results from regression models on first-semester and first-year GPA indicated no significant differences between these two student groups. Follow-up equivalence testing indicated statistical equivalence in several cases. The findings lead to the conclusion that the IEP is effective in helping students perform on par with directly-admitted international students. These findings provide further support for IEPs and alternative pathways to direct admission.
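The follow-up step this abstract describes — moving from a non-significant difference to positive evidence of equivalence — can be sketched with a two one-sided tests (TOST) procedure. The GPAs, group sizes, and the ±0.3-grade-point margin below are illustrative assumptions, not values from the study:

```python
import numpy as np
from scipy import stats

# Simulated first-year GPAs; means, SDs, and group sizes are invented
# for illustration and do not come from the study.
rng = np.random.default_rng(4)
iep = rng.normal(3.10, 0.5, size=120)     # IEP completers
direct = rng.normal(3.15, 0.5, size=150)  # directly-admitted students

diff = iep.mean() - direct.mean()
se = np.sqrt(iep.var(ddof=1) / iep.size + direct.var(ddof=1) / direct.size)

# TOST: reject H0 "|true difference| >= margin" when both one-sided
# tests are significant (normal approximation, reasonable at these n).
margin = 0.3  # hypothetical equivalence margin in grade points
p = max(stats.norm.sf((diff + margin) / se),
        stats.norm.cdf((diff - margin) / se))
equivalent = p < 0.05
print(p, equivalent)
```

A non-significant ordinary t-test on the same data would only say the difference is not detectably nonzero; the TOST p-value supports the stronger claim that it is smaller than the chosen margin.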


2019 ◽  
Vol 47 (6) ◽  
pp. 3031-3045
Author(s):  
Michael Siebert ◽  
David Ellenberger

Abstract Automatic passenger counting (APC) in public transport was introduced in the 1970s and has been expanding rapidly in recent years. Still, real-world applications continue to face events that are difficult to classify. The induced imprecision needs to be handled as statistical noise, and methods have therefore been defined to ensure that measurement errors do not exceed certain bounds. Various recommendations for such an APC validation have been made to establish criteria that limit the bias and the variability of the measurement errors. In those works, the misinterpretation of non-significance in statistical hypothesis tests for the detection of differences (e.g., Student’s t-test) proves to be prevalent, although existing methods developed under the term equivalence testing in biostatistics (i.e., bioequivalence trials; Schuirmann in J Pharmacokinet Pharmacodyn 15(6):657–680, 1987) would be appropriate instead. This heavily affects the calibration and validation process of APC systems and has been the reason for unexpected results when the sample sizes were not suitably chosen: large sample sizes were assumed to improve the assessment of systematic measurement errors of the devices from a user’s perspective as well as from a manufacturer’s perspective, but the regular t-test fails to achieve that. We introduce a variant of the t-test, the revised t-test, which addresses both type I and type II errors appropriately and allows a comprehensible transition from the long-established t-test in a widely used industrial recommendation. This test is appealing but susceptible to numerical instability. Finally, we analytically reformulate it as a numerically stable equivalence test, which is thus easier to use. Our results therefore allow an equivalence test to be derived from a t-test and increase the comparability of the two tests, especially for decision makers.
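The hypothesis-role swap at issue here — testing whether a systematic counting error lies within a bound, rather than whether it is zero — is the classic Schuirmann TOST. A minimal one-sample sketch with simulated per-trip errors and an invented ±0.5-passenger bound (not the recommendation's actual limits, and not the paper's revised t-test):

```python
import numpy as np
from scipy import stats

def tost_one_sample(errors, bound, alpha=0.05):
    """One-sample TOST (Schuirmann, 1987): conclude |mean error| < bound
    when both one-sided t-tests reject at level alpha. Generic sketch,
    not the revised t-test introduced in the paper."""
    e = np.asarray(errors, dtype=float)
    se = e.std(ddof=1) / np.sqrt(e.size)
    df = e.size - 1
    p_low = stats.t.sf((e.mean() + bound) / se, df)    # H0: mean <= -bound
    p_high = stats.t.cdf((e.mean() - bound) / se, df)  # H0: mean >= +bound
    return max(p_low, p_high) < alpha

# Simulated per-trip counting errors (automatic minus manual count)
rng = np.random.default_rng(1)
errors = rng.normal(0.1, 1.0, size=200)
ok = tost_one_sample(errors, bound=0.5)
print(ok)
```

Unlike the regular t-test, enlarging the sample here genuinely helps: a larger n shrinks the standard error and makes it easier, not harder, to demonstrate that the systematic error stays within the bound.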


2017 ◽  
Vol 43 (4) ◽  
pp. 407-439 ◽  
Author(s):  
Jodi M. Casabianca ◽  
Charles Lewis

The null hypothesis test used in differential item functioning (DIF) detection tests for a subgroup difference in item-level performance—if the null hypothesis of “no DIF” is rejected, the item is flagged for DIF. Conversely, an item is kept in the test form if there is insufficient evidence of DIF. We present frequentist and empirical Bayes approaches for implementing statistical equivalence testing for DIF using the Mantel–Haenszel (MH) DIF statistic. With these approaches, rejection of the null hypothesis of “DIF” allows the conclusion of statistical equivalence, a more stringent criterion for keeping items. In other words, the roles of the null and alternative hypotheses are interchanged in order to have positive evidence that the DIF of an item is small. A simulation study compares the equivalence testing approaches to the traditional MH DIF detection method with the Educational Testing Service classification system. We illustrate the methods with item response data from the 2012 Programme for International Student Assessment.
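The interchanged hypotheses can be sketched directly on the Mantel–Haenszel log-odds-ratio scale. The estimate, standard error, and margin below are invented; the 0.43 margin in log-odds units corresponds roughly to 1.0 on the ETS delta scale (D-DIF = −2.35 × log OR), and this plain normal-approximation TOST stands in for, but is not, the authors' frequentist or empirical Bayes procedures:

```python
from scipy import stats

def mh_equivalence_p(log_or, se, margin):
    """TOST with the hypothesis roles swapped relative to DIF detection:
    H0: |log OR| >= margin  vs  H1: |log OR| < margin.
    Rejection gives positive evidence that the item's DIF is small."""
    p_low = stats.norm.sf((log_or + margin) / se)
    p_high = stats.norm.cdf((log_or - margin) / se)
    return max(p_low, p_high)

# Hypothetical item: small estimated DIF with a tight standard error
p = mh_equivalence_p(log_or=0.05, se=0.08, margin=0.43)
print(p)
```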


2015 ◽  
Author(s):  
Heath R Pardoe ◽  
Gary Cutter ◽  
Rachel A Alter ◽  
Rebecca Kucharsky Hiess ◽  
Mira Semmelroch ◽  
...  

Changes in hardware or image processing settings are a common issue for large multi-center studies. In order to pool MRI data acquired under these changed conditions, it is necessary to demonstrate that the changes do not affect MRI-based measurements. In these circumstances classical inference testing is inappropriate because it is designed to detect differences, not prove similarity. We used a method known as statistical equivalence testing to address this limitation. Equivalence testing was carried out on three datasets: (i) cortical thickness and automated hippocampal volume estimates obtained from healthy individuals imaged using different multi-channel head coils; (ii) manual hippocampal volumetry obtained using two readers; and (iii) corpus callosum area estimates obtained using an automated method with manual cleanup carried out by two readers. Equivalence testing was carried out using the “two one-sided tests” (TOST) approach. Power analyses of the two one-sided tests were used to estimate sample sizes required for well-powered equivalence testing analyses. Mean and standard deviation estimates from the automated hippocampal volume dataset were used to carry out an example power analysis. Cortical thickness values were found to be equivalent over 61% of the cortex when different head coils were used (q < 0.05, FDR correction). Automated hippocampal volume estimates obtained using the same two coils were statistically equivalent (TOST p = 4.28 × 10⁻¹⁵). Manual hippocampal volume estimates obtained using two readers were not statistically equivalent (TOST p = 0.97). The use of different readers to carry out limited correction of automated corpus callosum segmentations yielded equivalent area estimates (TOST p = 1.28 × 10⁻¹⁴). Power analysis of simulated and automated hippocampal volume data demonstrated that the equivalence margin affects the number of subjects required for well-powered equivalence tests. 
We have presented a statistical method for determining if morphometric measures obtained under variable conditions can be pooled. The equivalence testing technique is applicable for analyses in which experimental conditions vary over the course of the study.
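The paired comparisons in this abstract (two coils, two readers measuring the same subjects) fit a paired TOST. A minimal sketch with simulated volumes and an invented ±100 mm³ margin — not the study's data or margins:

```python
import numpy as np
from scipy import stats

def tost_paired(x, y, margin):
    """Paired TOST: H0 is |mean(x - y)| >= margin; equivalence is
    concluded when both one-sided t-tests reject. Returns the larger
    of the two one-sided p-values."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    se = d.std(ddof=1) / np.sqrt(d.size)
    df = d.size - 1
    p_low = stats.t.sf((d.mean() + margin) / se, df)
    p_high = stats.t.cdf((d.mean() - margin) / se, df)
    return max(p_low, p_high)

# Simulated hippocampal volumes (mm^3) from two closely agreeing
# acquisitions; sample size and margin are illustrative assumptions.
rng = np.random.default_rng(0)
coil_a = rng.normal(4200.0, 300.0, size=40)
coil_b = coil_a + rng.normal(0.0, 20.0, size=40)
p = tost_paired(coil_a, coil_b, margin=100.0)
print(p < 0.05)
```

The same decision can be read off a (1 − 2α) confidence interval for the mean difference: equivalence holds exactly when that interval lies entirely inside ±margin.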


2007 ◽  
Vol 12 (4) ◽  
pp. 514-533 ◽  
Author(s):  
LeAnna G. Stork ◽  
Chris Gennings ◽  
Walter H. Carter ◽  
Robert E. Johnson ◽  
Darcy P. Mays ◽  
...  

2018 ◽  
Vol 84 (9) ◽  
Author(s):  
Heman Shakeri ◽  
Victoriya Volkova ◽  
Xuesong Wen ◽  
Andrea Deters ◽  
Charley Cull ◽  
...  

ABSTRACT To assess phenotypic bacterial antimicrobial resistance (AMR) in different strata (e.g., host populations, environmental areas, manure, or sewage effluents) for epidemiological purposes, isolates of target bacteria can be obtained from a stratum using various sample types. Also, different sample processing methods can be applied. The MIC of each target antimicrobial drug for each isolate is measured. Statistical equivalence testing of the MIC data for the isolates allows evaluation of whether different sample types or sample processing methods yield equivalent estimates of the bacterial antimicrobial susceptibility in the stratum. We demonstrate this approach on the antimicrobial susceptibility estimates for (i) nontyphoidal Salmonella spp. from ground or trimmed meat versus cecal content samples of cattle in processing plants in 2013-2014 and (ii) nontyphoidal Salmonella spp. from urine, fecal, and blood human samples in 2015 (U.S. National Antimicrobial Resistance Monitoring System data). We found that the sample types for cattle yielded nonequivalent susceptibility estimates for several antimicrobial drug classes and thus may gauge distinct subpopulations of salmonellae. The quinolone and fluoroquinolone susceptibility estimates for nontyphoidal salmonellae from human blood are nonequivalent to those from urine or feces, conjecturally due to the fluoroquinolone (ciprofloxacin) use to treat infections caused by nontyphoidal salmonellae. We also demonstrate statistical equivalence testing for comparing sample processing methods for fecal samples (culturing one versus multiple aliquots per sample) to assess AMR in fecal Escherichia coli. These methods yield equivalent results, except for tetracyclines. Importantly, statistical equivalence testing provides the MIC difference at which the data from two sample types or sample processing methods differ statistically. Data users (e.g., microbiologists and epidemiologists) may then interpret the practical relevance of the difference.

IMPORTANCE Bacterial antimicrobial resistance (AMR) needs to be assessed in different populations or strata for the purposes of surveillance and determination of the efficacy of interventions to halt AMR dissemination. To assess phenotypic antimicrobial susceptibility, isolates of target bacteria can be obtained from a stratum using different sample types or employing different sample processing methods in the laboratory. The MIC of each target antimicrobial drug for each of the isolates is measured, yielding the MIC distribution across the isolates from each sample type or sample processing method. We describe statistical equivalence testing for the MIC data for evaluating whether two sample types or sample processing methods yield equivalent estimates of the bacterial phenotypic antimicrobial susceptibility in the stratum. This includes estimating the MIC difference at which the data from the two approaches differ statistically. Data users (e.g., microbiologists, epidemiologists, and public health professionals) can then interpret whether that difference is practically relevant.
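Because MICs advance in doubling dilutions, equivalence testing of this kind is naturally framed on the log2(MIC) scale. A Welch-style two-sample TOST sketch with simulated readings and a one-dilution-step margin — the data and margin are hypothetical, and real NARMS MICs are interval-censored, which this sketch ignores:

```python
import numpy as np
from scipy import stats

def tost_welch(x, y, margin, alpha=0.05):
    """Two-sample Welch TOST: conclude the mean log2 MICs of two sample
    types differ by less than `margin` dilution steps when both
    one-sided tests reject at level alpha."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    vx, vy = x.var(ddof=1) / x.size, y.var(ddof=1) / y.size
    se = np.sqrt(vx + vy)
    # Welch-Satterthwaite degrees of freedom
    df = (vx + vy) ** 2 / (vx ** 2 / (x.size - 1) + vy ** 2 / (y.size - 1))
    diff = x.mean() - y.mean()
    p = max(stats.t.sf((diff + margin) / se, df),
            stats.t.cdf((diff - margin) / se, df))
    return p < alpha

# Simulated log2 MICs for one drug from two sample types (integers,
# mimicking doubling-dilution readings); not NARMS data.
rng = np.random.default_rng(3)
cecal = rng.normal(2.0, 1.0, size=120).round()
meat = rng.normal(2.1, 1.0, size=120).round()
equiv = tost_welch(cecal, meat, margin=1.0)
print(equiv)
```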


2017 ◽  
Vol 9 (5) ◽  
pp. 450-455 ◽  
Author(s):  
Nicole Stegmeier ◽  
Sameer R. Oak ◽  
Colin O’Rourke ◽  
Greg Strnad ◽  
Kurt P. Spindler ◽  
...  

Background: Two versions of the International Knee Documentation Committee (IKDC) Subjective Knee Evaluation form currently exist: the original version (1999) and a recently modified pediatric-specific version (2011). Comparison of the pediatric IKDC with the adult version in the adult population may reveal that either version could be used longitudinally. Hypothesis: We hypothesize that the scores for the adult IKDC and pediatric IKDC will not be clinically different among adult patients aged 18 to 50 years. Study Design: Randomized crossover study design. Level of Evidence: Level 2. Methods: The study consisted of 100 participants, aged 18 to 50 years, who presented to orthopaedic outpatient clinics with knee problems. All participants completed both adult and pediatric versions of the IKDC in random order with a 10-minute break in between. We used a paired t test to test for a difference between the scores and a Welch’s 2-sample t test to test for equivalence. A least-squares regression model was used to model adult scores as a function of pediatric scores, and vice versa. Results: A paired t test revealed a statistically significant 1.6-point difference between the mean adult and pediatric scores. However, the 95% confidence interval (0.54-2.66) for this difference did not exceed our a priori threshold of 5 points, indicating that this difference was not clinically important. Equivalence testing with an equivalence region of 5 points further supported this finding. The adult and pediatric scores had a linear relationship and were highly correlated with an R2 of 92.6%. Conclusion: There is no clinically relevant difference between the scores of the adult and pediatric IKDC forms in adults, aged 18 to 50 years, with knee conditions. Clinical Relevance: Either form, adult or pediatric, of the IKDC can be used in this population for longitudinal studies. If the pediatric version is administered in adolescence, it can be used for follow-up into adulthood.
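The decision rule this abstract applies — check that the confidence interval for the mean paired difference stays inside a ±5-point region — is the confidence-interval form of TOST: a (1 − 2α) CI inside the margin is equivalent to both one-sided tests rejecting at level α. A sketch on simulated scores, not the study's data:

```python
import numpy as np
from scipy import stats

# Simulated paired IKDC scores for n = 100 adults; the ~1.6-point mean
# offset mirrors the abstract, but all values are invented.
rng = np.random.default_rng(2)
adult = np.clip(rng.normal(60.0, 15.0, size=100), 0, 100)
pediatric = np.clip(adult + rng.normal(1.6, 5.0, size=100), 0, 100)

d = pediatric - adult
se = d.std(ddof=1) / np.sqrt(d.size)
t_crit = stats.t.ppf(0.95, df=d.size - 1)  # 90% CI <-> TOST at alpha = 0.05
ci = (d.mean() - t_crit * se, d.mean() + t_crit * se)
equivalent = (-5.0 < ci[0]) and (ci[1] < 5.0)  # +/-5-point margin
print(ci, equivalent)
```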


2015 ◽  
Author(s):  
Heath R Pardoe ◽  
Gary Cutter ◽  
Rachel A Alter ◽  
Rebecca Kucharsky Hiess ◽  
Mira Semmelroch ◽  
...  

Changes in hardware or image processing settings are a common issue for large multi-center studies. In order to pool MRI data acquired under these changed conditions, it is necessary to demonstrate that the changes do not affect MRI-based measurements. In these circumstances classical inference testing is inappropriate because it is designed to detect differences, not prove similarity. We used a method known as statistical equivalence testing to address this limitation. Equivalence testing was carried out on three datasets: (i) cortical thickness and automated hippocampal volume estimates obtained from 16 healthy individuals imaged using different multi-channel head coils; (ii) manual hippocampal volumetry obtained using two readers; and (iii) corpus callosum area estimates obtained using an automated method with manual cleanup carried out by two readers. Equivalence testing was carried out using the “two one-sided tests” approach. Cortical thickness values were found to be equivalent over 78% of the cortex when different head coils were used (p = 0.024). Automated hippocampal volume estimates obtained using the same two coils were statistically equivalent (p = 4.28 × 10⁻¹⁵). Manual hippocampal volume estimates obtained using two readers were not statistically equivalent (p = 0.97). The use of different readers to carry out limited correction of automated corpus callosum segmentations yielded equivalent area estimates (p = 1.28 × 10⁻¹⁴). We have presented a statistical method for determining if morphometric measures obtained under variable conditions can be pooled. The equivalence testing technique is applicable for analyses in which experimental conditions vary over the course of the study.

