Beyond the t-Test: Statistical Equivalence Testing

2005 ◽  
Vol 77 (11) ◽  
pp. 221A-226A ◽  
Author(s):  
Giselle B. Limentani ◽  
Moira C. Ringo ◽  
Feng Ye ◽  
Mandy L. Bergquist ◽  
Ellen O. McSorley


2020 ◽  
Author(s):  
Anthony Schmidt

Intensive English programs (IEPs) offer an additional pathway into higher education for international students who need further language support before full matriculation. Despite their long history in higher education, there is little research on the effectiveness of these programs. The current research examines the effectiveness of an IEP by comparing IEP students to directly-admitted international students. Results from regression models on first-semester and first-year GPA indicated no significant differences between these two student groups. Follow-up equivalence testing indicated statistical equivalence in several cases. The findings lead to the conclusion that the IEP is effective in helping students perform on par with directly-admitted international students. These findings provide further support for IEPs and alternative pathways to direct admission.
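The follow-up step this abstract describes — moving from a non-significant difference to positive evidence of equivalence — can be sketched with a two one-sided tests (TOST) procedure. The GPAs, group sizes, and the ±0.3-grade-point margin below are illustrative assumptions, not values from the study:

```python
import numpy as np
from scipy import stats

# Simulated first-year GPAs; means, SDs, and group sizes are invented
# for illustration and do not come from the study.
rng = np.random.default_rng(4)
iep = rng.normal(3.10, 0.5, size=120)     # IEP completers
direct = rng.normal(3.15, 0.5, size=150)  # directly-admitted students

diff = iep.mean() - direct.mean()
se = np.sqrt(iep.var(ddof=1) / iep.size + direct.var(ddof=1) / direct.size)

# TOST: reject H0 "|true difference| >= margin" when both one-sided
# tests are significant (normal approximation, reasonable at these n).
margin = 0.3  # hypothetical equivalence margin in grade points
p = max(stats.norm.sf((diff + margin) / se),
        stats.norm.cdf((diff - margin) / se))
equivalent = p < 0.05
print(p, equivalent)
```

A non-significant ordinary t-test on the same data would only say the difference is not detectably nonzero; the TOST p-value supports the stronger claim that it is smaller than the chosen margin.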


2019 ◽  
Vol 47 (6) ◽  
pp. 3031-3045
Author(s):  
Michael Siebert ◽  
David Ellenberger

Abstract Automatic passenger counting (APC) in public transport was introduced in the 1970s and has been expanding rapidly in recent years. Still, real-world applications continue to face events that are difficult to classify. The induced imprecision needs to be handled as statistical noise, and methods have therefore been defined to ensure that measurement errors do not exceed certain bounds. Various recommendations for such an APC validation have been made to establish criteria that limit the bias and the variability of the measurement errors. In those works, the misinterpretation of non-significance in statistical hypothesis tests for the detection of differences (e.g., Student’s t-test) proves to be prevalent, although existing methods developed under the term equivalence testing in biostatistics (i.e., bioequivalence trials; Schuirmann in J Pharmacokinet Pharmacodyn 15(6):657–680, 1987) would be appropriate instead. This heavily affects the calibration and validation process of APC systems and has been the reason for unexpected results when the sample sizes were not suitably chosen: large sample sizes were assumed to improve the assessment of systematic measurement errors of the devices from a user’s perspective as well as from a manufacturer’s perspective, but the regular t-test fails to achieve that. We introduce a variant of the t-test, the revised t-test, which addresses both type I and type II errors appropriately and allows a comprehensible transition from the long-established t-test in a widely used industrial recommendation. This test is appealing but susceptible to numerical instability. Finally, we analytically reformulate it as a numerically stable equivalence test, which is thus easier to use. Our results therefore allow an equivalence test to be derived from a t-test and increase the comparability of the two tests, especially for decision makers.
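The hypothesis-role swap at issue here — testing whether a systematic counting error lies within a bound, rather than whether it is zero — is the classic Schuirmann TOST. A minimal one-sample sketch with simulated per-trip errors and an invented ±0.5-passenger bound (not the recommendation's actual limits, and not the paper's revised t-test):

```python
import numpy as np
from scipy import stats

def tost_one_sample(errors, bound, alpha=0.05):
    """One-sample TOST (Schuirmann, 1987): conclude |mean error| < bound
    when both one-sided t-tests reject at level alpha. Generic sketch,
    not the revised t-test introduced in the paper."""
    e = np.asarray(errors, dtype=float)
    se = e.std(ddof=1) / np.sqrt(e.size)
    df = e.size - 1
    p_low = stats.t.sf((e.mean() + bound) / se, df)    # H0: mean <= -bound
    p_high = stats.t.cdf((e.mean() - bound) / se, df)  # H0: mean >= +bound
    return max(p_low, p_high) < alpha

# Simulated per-trip counting errors (automatic minus manual count)
rng = np.random.default_rng(1)
errors = rng.normal(0.1, 1.0, size=200)
ok = tost_one_sample(errors, bound=0.5)
print(ok)
```

Unlike the regular t-test, enlarging the sample here genuinely helps: a larger n shrinks the standard error and makes it easier, not harder, to demonstrate that the systematic error stays within the bound.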


2017 ◽  
Vol 43 (4) ◽  
pp. 407-439 ◽  
Author(s):  
Jodi M. Casabianca ◽  
Charles Lewis

The null hypothesis test used in differential item functioning (DIF) detection tests for a subgroup difference in item-level performance—if the null hypothesis of “no DIF” is rejected, the item is flagged for DIF. Conversely, an item is kept in the test form if there is insufficient evidence of DIF. We present frequentist and empirical Bayes approaches for implementing statistical equivalence testing for DIF using the Mantel–Haenszel (MH) DIF statistic. With these approaches, rejection of the null hypothesis of “DIF” allows the conclusion of statistical equivalence, a more stringent criterion for keeping items. In other words, the roles of the null and alternative hypotheses are interchanged in order to have positive evidence that the DIF of an item is small. A simulation study compares the equivalence testing approaches to the traditional MH DIF detection method with the Educational Testing Service classification system. We illustrate the methods with item response data from the 2012 Programme for International Student Assessment.
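The interchanged hypotheses can be sketched directly on the Mantel–Haenszel log-odds-ratio scale. The estimate, standard error, and margin below are invented; the 0.43 margin in log-odds units corresponds roughly to 1.0 on the ETS delta scale (D-DIF = −2.35 × log OR), and this plain normal-approximation TOST stands in for, but is not, the authors' frequentist or empirical Bayes procedures:

```python
from scipy import stats

def mh_equivalence_p(log_or, se, margin):
    """TOST with the hypothesis roles swapped relative to DIF detection:
    H0: |log OR| >= margin  vs  H1: |log OR| < margin.
    Rejection gives positive evidence that the item's DIF is small."""
    p_low = stats.norm.sf((log_or + margin) / se)
    p_high = stats.norm.cdf((log_or - margin) / se)
    return max(p_low, p_high)

# Hypothetical item: small estimated DIF with a tight standard error
p = mh_equivalence_p(log_or=0.05, se=0.08, margin=0.43)
print(p)
```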


2015 ◽  
Author(s):  
Heath R Pardoe ◽  
Gary Cutter ◽  
Rachel A Alter ◽  
Rebecca Kucharsky Hiess ◽  
Mira Semmelroch ◽  
...  

Changes in hardware or image processing settings are a common issue for large multi-center studies. In order to pool MRI data acquired under these changed conditions, it is necessary to demonstrate that the changes do not affect MRI-based measurements. In these circumstances classical inference testing is inappropriate because it is designed to detect differences, not prove similarity. We used a method known as statistical equivalence testing to address this limitation. Equivalence testing was carried out on three datasets: (i) cortical thickness and automated hippocampal volume estimates obtained from healthy individuals imaged using different multi-channel head coils; (ii) manual hippocampal volumetry obtained using two readers; and (iii) corpus callosum area estimates obtained using an automated method with manual cleanup carried out by two readers. Equivalence testing was carried out using the “two one-sided tests” (TOST) approach. Power analyses of the two one-sided tests were used to estimate sample sizes required for well-powered equivalence testing analyses. Mean and standard deviation estimates from the automated hippocampal volume dataset were used to carry out an example power analysis. Cortical thickness values were found to be equivalent over 61% of the cortex when different head coils were used (q < 0.05, FDR correction). Automated hippocampal volume estimates obtained using the same two coils were statistically equivalent (TOST p = 4.28 × 10⁻¹⁵). Manual hippocampal volume estimates obtained using two readers were not statistically equivalent (TOST p = 0.97). The use of different readers to carry out limited correction of automated corpus callosum segmentations yielded equivalent area estimates (TOST p = 1.28 × 10⁻¹⁴). Power analysis of simulated and automated hippocampal volume data demonstrated that the equivalence margin affects the number of subjects required for well-powered equivalence tests. 
We have presented a statistical method for determining if morphometric measures obtained under variable conditions can be pooled. The equivalence testing technique is applicable for analyses in which experimental conditions vary over the course of the study.
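The paired comparisons in this abstract (two coils, two readers measuring the same subjects) fit a paired TOST. A minimal sketch with simulated volumes and an invented ±100 mm³ margin — not the study's data or margins:

```python
import numpy as np
from scipy import stats

def tost_paired(x, y, margin):
    """Paired TOST: H0 is |mean(x - y)| >= margin; equivalence is
    concluded when both one-sided t-tests reject. Returns the larger
    of the two one-sided p-values."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    se = d.std(ddof=1) / np.sqrt(d.size)
    df = d.size - 1
    p_low = stats.t.sf((d.mean() + margin) / se, df)
    p_high = stats.t.cdf((d.mean() - margin) / se, df)
    return max(p_low, p_high)

# Simulated hippocampal volumes (mm^3) from two closely agreeing
# acquisitions; sample size and margin are illustrative assumptions.
rng = np.random.default_rng(0)
coil_a = rng.normal(4200.0, 300.0, size=40)
coil_b = coil_a + rng.normal(0.0, 20.0, size=40)
p = tost_paired(coil_a, coil_b, margin=100.0)
print(p < 0.05)
```

The same decision can be read off a (1 − 2α) confidence interval for the mean difference: equivalence holds exactly when that interval lies entirely inside ±margin.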


2007 ◽  
Vol 12 (4) ◽  
pp. 514-533 ◽  
Author(s):  
LeAnna G. Stork ◽  
Chris Gennings ◽  
Walter H. Carter ◽  
Robert E. Johnson ◽  
Darcy P. Mays ◽  
...  

2018 ◽  
Vol 84 (9) ◽  
Author(s):  
Heman Shakeri ◽  
Victoriya Volkova ◽  
Xuesong Wen ◽  
Andrea Deters ◽  
Charley Cull ◽  
...  

ABSTRACT To assess phenotypic bacterial antimicrobial resistance (AMR) in different strata (e.g., host populations, environmental areas, manure, or sewage effluents) for epidemiological purposes, isolates of target bacteria can be obtained from a stratum using various sample types. Also, different sample processing methods can be applied. The MIC of each target antimicrobial drug for each isolate is measured. Statistical equivalence testing of the MIC data for the isolates allows evaluation of whether different sample types or sample processing methods yield equivalent estimates of the bacterial antimicrobial susceptibility in the stratum. We demonstrate this approach on the antimicrobial susceptibility estimates for (i) nontyphoidal Salmonella spp. from ground or trimmed meat versus cecal content samples of cattle in processing plants in 2013-2014 and (ii) nontyphoidal Salmonella spp. from urine, fecal, and blood human samples in 2015 (U.S. National Antimicrobial Resistance Monitoring System data). We found that the sample types for cattle yielded nonequivalent susceptibility estimates for several antimicrobial drug classes and thus may gauge distinct subpopulations of salmonellae. The quinolone and fluoroquinolone susceptibility estimates for nontyphoidal salmonellae from human blood are nonequivalent to those from urine or feces, conjecturally due to the fluoroquinolone (ciprofloxacin) use to treat infections caused by nontyphoidal salmonellae. We also demonstrate statistical equivalence testing for comparing sample processing methods for fecal samples (culturing one versus multiple aliquots per sample) to assess AMR in fecal Escherichia coli. These methods yield equivalent results, except for tetracyclines. Importantly, statistical equivalence testing provides the MIC difference at which the data from two sample types or sample processing methods differ statistically. Data users (e.g., microbiologists and epidemiologists) may then interpret the practical relevance of the difference.

IMPORTANCE Bacterial antimicrobial resistance (AMR) needs to be assessed in different populations or strata for the purposes of surveillance and determination of the efficacy of interventions to halt AMR dissemination. To assess phenotypic antimicrobial susceptibility, isolates of target bacteria can be obtained from a stratum using different sample types or employing different sample processing methods in the laboratory. The MIC of each target antimicrobial drug for each of the isolates is measured, yielding the MIC distribution across the isolates from each sample type or sample processing method. We describe statistical equivalence testing for the MIC data for evaluating whether two sample types or sample processing methods yield equivalent estimates of the bacterial phenotypic antimicrobial susceptibility in the stratum. This includes estimating the MIC difference at which the data from the two approaches differ statistically. Data users (e.g., microbiologists, epidemiologists, and public health professionals) can then interpret whether that difference is practically relevant.
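Because MICs advance in doubling dilutions, equivalence testing of this kind is naturally framed on the log2(MIC) scale. A Welch-style two-sample TOST sketch with simulated readings and a one-dilution-step margin — the data and margin are hypothetical, and real NARMS MICs are interval-censored, which this sketch ignores:

```python
import numpy as np
from scipy import stats

def tost_welch(x, y, margin, alpha=0.05):
    """Two-sample Welch TOST: conclude the mean log2 MICs of two sample
    types differ by less than `margin` dilution steps when both
    one-sided tests reject at level alpha."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    vx, vy = x.var(ddof=1) / x.size, y.var(ddof=1) / y.size
    se = np.sqrt(vx + vy)
    # Welch-Satterthwaite degrees of freedom
    df = (vx + vy) ** 2 / (vx ** 2 / (x.size - 1) + vy ** 2 / (y.size - 1))
    diff = x.mean() - y.mean()
    p = max(stats.t.sf((diff + margin) / se, df),
            stats.t.cdf((diff - margin) / se, df))
    return p < alpha

# Simulated log2 MICs for one drug from two sample types (integers,
# mimicking doubling-dilution readings); not NARMS data.
rng = np.random.default_rng(3)
cecal = rng.normal(2.0, 1.0, size=120).round()
meat = rng.normal(2.1, 1.0, size=120).round()
equiv = tost_welch(cecal, meat, margin=1.0)
print(equiv)
```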


2017 ◽  
Vol 9 (5) ◽  
pp. 450-455 ◽  
Author(s):  
Nicole Stegmeier ◽  
Sameer R. Oak ◽  
Colin O’Rourke ◽  
Greg Strnad ◽  
Kurt P. Spindler ◽  
...  

Background: Two versions of the International Knee Documentation Committee (IKDC) Subjective Knee Evaluation form currently exist: the original version (1999) and a recently modified pediatric-specific version (2011). Comparison of the pediatric IKDC with the adult version in the adult population may reveal that either version could be used longitudinally. Hypothesis: We hypothesize that the scores for the adult IKDC and pediatric IKDC will not be clinically different among adult patients aged 18 to 50 years. Study Design: Randomized crossover study design. Level of Evidence: Level 2. Methods: The study consisted of 100 participants, aged 18 to 50 years, who presented to orthopaedic outpatient clinics with knee problems. All participants completed both adult and pediatric versions of the IKDC in random order with a 10-minute break in between. We used a paired t test to test for a difference between the scores and a Welch’s 2-sample t test to test for equivalence. A least-squares regression model was used to model adult scores as a function of pediatric scores, and vice versa. Results: A paired t test revealed a statistically significant 1.6-point difference between the mean adult and pediatric scores. However, the 95% confidence interval (0.54-2.66) for this difference did not exceed our a priori threshold of 5 points, indicating that this difference was not clinically important. Equivalence testing with an equivalence region of 5 points further supported this finding. The adult and pediatric scores had a linear relationship and were highly correlated with an R2 of 92.6%. Conclusion: There is no clinically relevant difference between the scores of the adult and pediatric IKDC forms in adults, aged 18 to 50 years, with knee conditions. Clinical Relevance: Either form, adult or pediatric, of the IKDC can be used in this population for longitudinal studies. If the pediatric version is administered in adolescence, it can be used for follow-up into adulthood.
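The decision rule this abstract applies — check that the confidence interval for the mean paired difference stays inside a ±5-point region — is the confidence-interval form of TOST: a (1 − 2α) CI inside the margin is equivalent to both one-sided tests rejecting at level α. A sketch on simulated scores, not the study's data:

```python
import numpy as np
from scipy import stats

# Simulated paired IKDC scores for n = 100 adults; the ~1.6-point mean
# offset mirrors the abstract, but all values are invented.
rng = np.random.default_rng(2)
adult = np.clip(rng.normal(60.0, 15.0, size=100), 0, 100)
pediatric = np.clip(adult + rng.normal(1.6, 5.0, size=100), 0, 100)

d = pediatric - adult
se = d.std(ddof=1) / np.sqrt(d.size)
t_crit = stats.t.ppf(0.95, df=d.size - 1)  # 90% CI <-> TOST at alpha = 0.05
ci = (d.mean() - t_crit * se, d.mean() + t_crit * se)
equivalent = (-5.0 < ci[0]) and (ci[1] < 5.0)  # +/-5-point margin
print(ci, equivalent)
```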


2015 ◽  
Author(s):  
Heath R Pardoe ◽  
Gary Cutter ◽  
Rachel A Alter ◽  
Rebecca Kucharsky Hiess ◽  
Mira Semmelroch ◽  
...  

Changes in hardware or image processing settings are a common issue for large multi-center studies. In order to pool MRI data acquired under these changed conditions, it is necessary to demonstrate that the changes do not affect MRI-based measurements. In these circumstances classical inference testing is inappropriate because it is designed to detect differences, not prove similarity. We used a method known as statistical equivalence testing to address this limitation. Equivalence testing was carried out on three datasets: (i) cortical thickness and automated hippocampal volume estimates obtained from 16 healthy individuals imaged using different multi-channel head coils; (ii) manual hippocampal volumetry obtained using two readers; and (iii) corpus callosum area estimates obtained using an automated method with manual cleanup carried out by two readers. Equivalence testing was carried out using the “two one-sided tests” approach. Cortical thickness values were found to be equivalent over 78% of the cortex when different head coils were used (p = 0.024). Automated hippocampal volume estimates obtained using the same two coils were statistically equivalent (p = 4.28 × 10⁻¹⁵). Manual hippocampal volume estimates obtained using two readers were not statistically equivalent (p = 0.97). The use of different readers to carry out limited correction of automated corpus callosum segmentations yielded equivalent area estimates (p = 1.28 × 10⁻¹⁴). We have presented a statistical method for determining if morphometric measures obtained under variable conditions can be pooled. The equivalence testing technique is applicable for analyses in which experimental conditions vary over the course of the study.

