Equivalence Testing and the Second Generation P-Value

Meta-Psychology ◽

10.15626/mp.2018.933 ◽

2020 ◽

Vol 4 ◽

Author(s):

Daniël Lakens ◽

Marie Delacre

Keyword(s):

Confidence Intervals ◽

Second Generation ◽

Testing Procedure ◽

Error Rates ◽

P Value ◽

Type I ◽

Equivalence Testing ◽

Equivalence Tests ◽

Significant Difference ◽

Range Of Values

To move beyond the limitations of null-hypothesis tests, statistical approaches have been developed where the observed data are compared against a range of values that are equivalent to the absence of a meaningful effect. Specifying a range of values around zero allows researchers to statistically reject the presence of effects large enough to matter, and prevents practically insignificant effects from being interpreted as a statistically significant difference. We compare the behavior of the recently proposed second generation p-value (Blume, D’Agostino McGowan, Dupont, & Greevy, 2018) with the more established Two One-Sided Tests (TOST) equivalence testing procedure (Schuirmann, 1987). We show that the two approaches yield almost identical results under optimal conditions. Under suboptimal conditions (e.g., when the confidence interval is wider than the equivalence range, or when confidence intervals are asymmetric) the second generation p-value becomes difficult to interpret. The second generation p-value (SGPV) is interpretable in a dichotomous manner (i.e., when the SGPV equals 0 or 1 because the confidence interval lies completely within or outside of the equivalence range), but this dichotomous interpretation does not require calculations. We conclude that equivalence tests yield more consistent p-values, distinguish between datasets that yield the same second generation p-value, and allow for easier control of Type I and Type II error rates.
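The comparison described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' code. It assumes a normally distributed estimate with a known standard error (a z-based simplification of TOST) and uses the SGPV formula as published by Blume et al. (2018), including the correction applied when the interval is more than twice as wide as the equivalence range. The function names `tost_p_value` and `sgpv` and all arguments are illustrative.

```python
from statistics import NormalDist


def tost_p_value(estimate, se, low, high):
    """TOST p-value under a normal approximation (assumed, not from the paper).

    Runs two one-sided tests against the equivalence bounds and returns the
    larger p-value; equivalence is declared when it falls below alpha.
    """
    z = NormalDist()
    p_lower = 1.0 - z.cdf((estimate - low) / se)  # H0: effect <= low
    p_upper = z.cdf((estimate - high) / se)       # H0: effect >= high
    return max(p_lower, p_upper)


def sgpv(ci_low, ci_high, low, high):
    """Second generation p-value for an interval estimate.

    Proportion of the interval that overlaps the equivalence range,
    multiplied by a correction factor for very wide intervals.
    """
    ci_len = ci_high - ci_low
    null_len = high - low
    overlap = max(0.0, min(ci_high, high) - max(ci_low, low))
    return (overlap / ci_len) * max(ci_len / (2.0 * null_len), 1.0)
```

With equivalence bounds of (-0.5, 0.5), an interval of (-0.1, 0.1) gives an SGPV of 1 and an interval of (0.6, 1.0) gives 0. Under this formula the wide intervals (-2, 2) and (-3, 3) both give 0.5, which matches the abstract's point that datasets with different precision can share the same SGPV while their TOST p-values differ.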


Equivalence Testing and the Second Generation P-Value

10.31234/osf.io/7k6ay ◽

2018 ◽

Cited By ~ 1

Author(s):

Daniel Lakens ◽

Marie Delacre

Keyword(s):

Confidence Intervals ◽

Second Generation ◽

Testing Procedure ◽

Error Rates ◽

P Value ◽

Type I ◽

Equivalence Testing ◽

Equivalence Tests ◽

Significant Difference ◽

Range Of Values

To move beyond the limitations of null-hypothesis tests, statistical approaches have been developed where the observed data are compared against a range of values that are equivalent to the absence of a meaningful effect. Specifying a range of values around zero allows researchers to statistically reject the presence of effects large enough to matter, and prevents practically insignificant effects from being interpreted as a statistically significant difference. We compare the behavior of the recently proposed second generation p-value (Blume, D’Agostino McGowan, Dupont, & Greevy, 2018) with the more established Two One-Sided Tests (TOST) equivalence testing procedure (Schuirmann, 1987). We show that the two approaches yield almost identical results under optimal conditions. Under suboptimal conditions (e.g., when the confidence interval is wider than the equivalence range, or when confidence intervals are asymmetric) the second generation p-value becomes difficult to interpret as a descriptive statistic. The second generation p-value (SGPV) is interpretable in a dichotomous manner (i.e., when the SGPV equals 0 or 1 because the confidence interval lies completely within or outside of the equivalence range), but this dichotomous interpretation does not require calculations. We conclude that equivalence tests yield more consistent p-values, distinguish between datasets that yield the same second generation p-value, and allow for easier control of Type I and Type II error rates.


Analyzing Nested Experimental Designs - A User-Friendly Resampling Method to Determine Experimental Significance

10.1101/2021.06.29.450439 ◽

2021 ◽

Author(s):

Rishikesh U Kulkarni ◽

Catherine L Wang ◽

Carolyn R Bertozzi

Keyword(s):

Confidence Intervals ◽

Biomedical Research ◽

Test Construction ◽

Error Rates ◽

Experimental Designs ◽

Statistical Hypothesis ◽

P Value ◽

Type I ◽

Hypothesis Tests ◽

Python Package

While hierarchical experimental designs are near-ubiquitous in neuroscience and biomedical research, researchers often do not take the structure of their datasets into account while performing statistical hypothesis tests. Resampling-based methods are a flexible strategy for performing these analyses but are difficult due to the lack of open-source software to automate test construction and execution. To address this, we report Hierarch, a Python package to perform hypothesis tests and compute confidence intervals on hierarchical experimental designs. Using a combination of permutation resampling and bootstrap aggregation, Hierarch can be used to perform hypothesis tests that maintain nominal Type I error rates and generate confidence intervals that maintain the nominal coverage probability without making distributional assumptions about the dataset of interest. Hierarch makes use of the Numba JIT compiler to reduce p-value computation times to under one second for typical datasets in biomedical research. Hierarch also enables researchers to construct user-defined resampling plans that take advantage of Hierarch's Numba-accelerated functions. Hierarch is freely available as a Python package at https://github.com/rishi-kulkarni/hierarch.
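The idea of respecting the nesting structure during resampling can be sketched in plain Python. The example below is not Hierarch's API or its algorithm, which combines permutation resampling with bootstrap aggregation. It is a simplified exact permutation test that assumes treatment is assigned at the top level (for example, per animal), averages the repeated measurements within each top-level unit, and permutes labels only across those units. The function name `cluster_permutation_p` and the data layout are illustrative.

```python
from itertools import combinations
from statistics import mean


def cluster_permutation_p(treated, control):
    """Exact two-sided permutation p-value for a two-level nested design.

    Each argument is a list of clusters (e.g. animals); each cluster is a
    list of repeated measurements. Labels are permuted across clusters, not
    across individual measurements, so pseudo-replication is avoided.
    """
    cluster_means = [mean(c) for c in treated + control]
    n_treated = len(treated)
    n_total = len(cluster_means)
    total = sum(cluster_means)

    def diff(idx):
        # Difference in mean cluster mean: labelled-treated minus the rest.
        t_sum = sum(cluster_means[i] for i in idx)
        return t_sum / n_treated - (total - t_sum) / (n_total - n_treated)

    observed = abs(diff(range(n_treated)))
    perms = [abs(diff(idx)) for idx in combinations(range(n_total), n_treated)]
    # Small tolerance so floating-point ties count as "at least as extreme".
    return sum(d >= observed - 1e-12 for d in perms) / len(perms)
```

Full enumeration is only practical for small numbers of top-level units; larger designs would need random resampling, which is the situation a dedicated package is meant to handle.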


Bayesian Hodges-Lehmann tests for statistical equivalence in the two-sample setting: Power analysis, type I error rates and equivalence boundary selection in biomedical research

BMC Medical Research Methodology ◽

10.1186/s12874-021-01341-7 ◽

2021 ◽

Vol 21 (1) ◽

Author(s):

Riko Kelter

Keyword(s):

Biomedical Research ◽

Bayes Factor ◽

Type I Error ◽

Error Rates ◽

Type I ◽

Equivalence Testing ◽

Type I Error Rates ◽

Equivalence Tests ◽

Statistical Equivalence ◽

The Relationship

Background: Null hypothesis significance testing (NHST) is among the most frequently employed methods in the biomedical sciences. However, the problems of NHST and p-values have been discussed widely and various Bayesian alternatives have been proposed. Some proposals focus on equivalence testing, which aims at testing an interval hypothesis instead of a precise hypothesis. An interval hypothesis includes a small range of parameter values instead of a single null value, and the idea goes back to Hodges and Lehmann. As researchers can always expect to observe some (although often negligibly small) effect size, interval hypotheses are more realistic for biomedical research. However, the selection of an equivalence region (the interval boundaries) often seems arbitrary, and several Bayesian approaches to equivalence testing coexist. Methods: A new proposal is made for determining the equivalence region for Bayesian equivalence tests based on objective criteria such as type I error rate and power. Existing approaches to Bayesian equivalence testing in the two-sample setting are discussed with a focus on the Bayes factor and the region of practical equivalence (ROPE). A simulation study derives the results needed to use the new method in the two-sample setting, which is among the most frequently carried out procedures in biomedical research. Results: Bayesian Hodges-Lehmann tests for statistical equivalence differ in their sensitivity to the prior modeling, power, and the associated type I error rates. The relationship between type I error rates, power and sample sizes for existing Bayesian equivalence tests is identified in the two-sample setting. The results allow the equivalence region to be determined with the new method by incorporating such objective criteria. Importantly, the results show that not only can prior selection influence the type I error rate and power, but the relationship is even reversed between the Bayes factor and ROPE based equivalence tests. Conclusion: Based on the results, researchers can select between the existing Bayesian Hodges-Lehmann tests for statistical equivalence and determine the equivalence region based on objective criteria, thus improving the reproducibility of biomedical research.


Type I Error Rates, Coverage of Confidence Intervals, and Variance Estimation in Propensity-Score Matched Analyses

The International Journal of Biostatistics ◽

10.2202/1557-4679.1146 ◽

2009 ◽

Vol 5 (1) ◽

Cited By ~ 65

Author(s):

Peter C Austin

Keyword(s):

Propensity Score ◽

Confidence Intervals ◽

Variance Estimation ◽

Type I Error ◽

Error Rates ◽

Type I ◽

Type I Error Rates


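Latihan Fisik Intesitas Submaksimal dan Kalsitonin Salmon Meningkatkan Kepadatan Tulang Tikus Masa Pertumbuhan [Submaximal-Intensity Physical Exercise and Salmon Calcitonin Increase Bone Density in Growing Rats]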

10.31227/osf.io/35sfn ◽

2018 ◽

Author(s):

Heru Syarli Lesmana ◽

Gadis Meinar Sari ◽

Choesnan Effendi ◽

Shinta Arisant

Keyword(s):

Bone Density ◽

Salmon Calcitonin ◽

Bone Matrix ◽

Growth Period ◽

Control Group ◽

P Value ◽

Type I ◽

Femur Bone ◽

Risk Increase ◽

Significant Difference

Bone is a complex tissue consisting of cells and matrix. Bone mass and thickness change dynamically through remodeling, in which bone matrix is resorbed and formed again. Bone is formed by osteoblasts and resorbed by osteoclasts. Osteoblasts produce osteoid matrix, which is composed mainly of type I collagen, and osteoclasts remove bone tissue by dissolving its mineralized matrix and breaking down the organic bone. The purpose of bone remodeling is to maintain the shape and structure of bone. The purpose of this study was to show that submaximal-intensity exercise and salmon calcitonin improve bone density in growing rats. The study used a randomized posttest-only control group design. We compared femur bone density in 24 male Rattus norvegicus rats aged 6 weeks that were divided into 4 groups: control, calcitonin, exercise, and combined. The exercise group swam 3 times a week at submaximal intensity, the calcitonin group was injected with synthetic salmon calcitonin at 2 IU/100 g of body weight every day, and the combined group received both. After 8 weeks, femur bone density was measured using ultrasound. There was a significant difference in mean bone density between group 1 (control) and group 4 (combined), with p = 0.001 (p < 0.05). Comparisons among the other groups found no significant differences (p > 0.05). The study indicates that salmon calcitonin and submaximal-intensity exercise increase bone density during the growth period. High bone density means the bone is strong and healthy rather than porous and fragile, which decreases the risk of fracture. Increasing bone density during the growth period helps bone reach its best mass and helps avoid early osteoporosis.


Prevalence of elongated styloid process in Saudi population of Aseer region

European Journal of Dentistry ◽

10.4103/1305-7456.120687 ◽

2013 ◽

Vol 07 (04) ◽

pp. 449-454 ◽

Cited By ~ 11

Author(s):

Mohammed Asif Shaik ◽

Sultan Mohammed Kaleem ◽

Abdul Wahab ◽

Shahul Hameed ◽

Keyword(s):

Age Groups ◽

Styloid Process ◽

Frequent Pattern ◽

P Value ◽

Type I ◽

Saudi Population ◽

Radiographic Appearance ◽

Panoramic Radiographs ◽

Elongated Styloid Process ◽

Significant Difference

ABSTRACT Objective: The study was performed to investigate the prevalence, morphology and calcification pattern of elongated styloid process in the Saudi population of the Aseer (Southern) region and its relation to gender and sub-age groups. Materials and Methods: This study analyzed digital panoramic radiographs of 1,162 adults. Any radiograph with a questionable styloid process was excluded from the study. The apparent length of the styloid process was measured by a single experienced dental and maxillofacial radiologist. The elongated styloid process was classified by radiographic appearance based on the morphology and calcification pattern. The data were analyzed using Student's t-test and the Chi-square test, with a P value less than 0.05 considered significant. Results: A total of 1,085 digital panoramic radiographs showed elongated styloid process, of which 686 (63.2%) were in male and 399 (36.8%) in female patients. There was a statistically significant difference in the mean elongated styloid process between the 20-29, 50-59, and 60 years and above sub-age groups. The elongated styloid process was more prevalent in elderly male patients (P < 0.05). Type I morphology with calcified outline (a) was the most frequent pattern of calcification noticed in the present study. Conclusion: Panoramic radiographs are an economical, easily accessible and useful diagnostic tool for early detection of elongated styloid process with or without symptoms. However, studies with larger sample sizes would further help to assess the prevalence of elongated styloid process in the Saudi population of various other regions.


Pairwise Multiple Comparisons in Repeated Measures Designs

Journal of Educational Statistics ◽

10.3102/10769986005003269 ◽

1980 ◽

Vol 5 (3) ◽

pp. 269-287 ◽

Cited By ~ 53

Author(s):

Scott E. Maxwell

Keyword(s):

Repeated Measures ◽

Mixed Model ◽

Multiple Comparisons ◽

Error Rates ◽

Type I ◽

Omnibus Test ◽

Mixed Model Approach ◽

Significant Difference ◽

Repeated Measures Designs ◽

Necessary And Sufficient

Five methods of performing pairwise multiple comparisons in repeated measures designs were investigated. Tukey's Wholly Significant Difference (WSD) test, recommended by most experimental design texts, requires that all differences between pairs of means have a common variance. However, this assumption is equivalent to the sphericity condition that is necessary and sufficient for the validity of the mixed-model approach to the omnibus test. Monte Carlo methods revealed that Tukey's WSD leads to an inflated alpha level when the sphericity assumption is not met. Consideration of both Type I and Type II error rates found in the simulated conditions for the five procedures suggests that a Bonferroni method utilizing a separate error term for each comparison should be employed.


Diastolic dysfunction in cases with Type II diabetes mellitus.

The Professional Medical Journal ◽

10.29309/tpmj/2019.26.12.229 ◽

2019 ◽

Vol 26 (12) ◽

pp. 2040-2043

Author(s):

Munir Ahmed ◽

Abdul Hayee ◽

Shahla Afsheen Memon ◽

Ismail Salim Memon ◽

Abdul Qayoom Memon

Keyword(s):

Diabetes Mellitus ◽

Diastolic Dysfunction ◽

Type Ii Diabetes ◽

Female Gender ◽

Type Ii Diabetes Mellitus ◽

P Value ◽

Type I ◽

Type Ii ◽

Cross Sectional ◽

Significant Difference

Objectives: To determine the frequency of diastolic dysfunction in patients presenting with type II diabetes mellitus (DM). Study Design: Cross-sectional study. Setting: Sheikh Zayed Hospital, Rahim Yar Khan. Period: From 01-01-2017 to 30-06-2017. Material & Methods: Cases of both genders, older than 40 years and with type II DM of more than 2 years' duration, were selected via non-probability consecutive sampling. Cases with type I DM, gestational DM, hypertension (HTN), or end-stage kidney or liver failure were excluded. Transthoracic echocardiography was performed, and diastolic dysfunction was recorded as present when the E/A ratio was <0.8. The data were analysed using the chi-square test, and a p value less than 0.05 was taken as significant. Results: In this study, 100 cases of type II DM were included, with a mean age of 51.31±7.89 years at presentation. There were 61% males and 39% females. Diastolic dysfunction was observed in 53% of the cases. There was no significant difference by gender; it affected 56.41% of females (p = 0.92). Diastolic dysfunction was more frequent in cases with a DM duration of more than 3 years, affecting 48 (70.58%) cases (p = 0.001), and was also significantly more frequent in cases with BMI more than 30, where it was seen in 40 (70.17%) cases (p = 0.001). Conclusion: Diastolic dysfunction was seen in about half of the cases with type II DM and was significantly more frequent in cases with a DM duration of more than 3 years and BMI more than 30.


Analyzing the biochemical, clinical, and hormonal characteristics of patients with polycystic ovary syndrome

Journal of Experimental and Clinical Medicine ◽

10.52142/omujecm.38.4.15 ◽

2021 ◽

Vol 38 (4) ◽

pp. 478-484

Author(s):

Tuğba GÜRBÜZ ◽

Şebnem ALANYA TOSUN

Keyword(s):

Polycystic Ovary Syndrome ◽

Blood Sugar ◽

Polycystic Ovary ◽

Fasting Blood Sugar ◽

P Value ◽

Type I ◽

Ovary Syndrome ◽

Gynecology And Obstetrics ◽

Significant Difference ◽

The Mean

This study aimed to analyze the biochemical, clinical, and hormonal characteristics of patients with four phenotypes of polycystic ovary syndrome (PCOS). A total of 225 participants were included: patients diagnosed with PCOS who were admitted to the Medistate Kavacık Hospital Gynecology and Obstetrics outpatient clinic and the Giresun University Faculty of Medicine Gynecology and Obstetrics clinic between January 2019 and January 2020, and healthy controls. The revised Rotterdam criteria were applied to diagnose PCOS. The patients with PCOS were divided into Type I classic, Type II classic, Ovulatory and Normoandrogenic PCOS. Biochemical, clinical, and hormonal values were compared. The mean age of the participants was 28 (±5.7) and the mean body mass index (BMI) was 26.15 (±5.36). The mean Ferriman Gallwey Score (FGS) was 7.4 (±5.4), which is normal. There was a statistically significant difference between the four PCOS groups and the control group in terms of age (p-value = 0.000), BMI (p-value = 0.000), Luteinizing hormone / Follicle stimulating hormone (LH/FSH) (p-value = 0.000), and fasting blood sugar (p-value = 0.01). There was a statistically significant difference among the four phenotypes in terms of BMI (p-value = 0.002), LH/FSH (p-value = 0.000), LH (p-value = 0.000), free T4 (p-value = 0.01), fasting insulin (p-value = 0.001), total testosterone (p-value = 0.000), FGS (p-value = 0.000), etc. Age, BMI, LH/FSH, FSH, LH, fasting blood sugar, and hirsutism are good predictors of PCOS.


Statistical Conclusion Validity

10.1093/oso/9780190661557.003.0006 ◽

2017 ◽

Author(s):

Richard McCleary ◽

David McDowall ◽

Bradley J. Bartos

Keyword(s):

Hypothesis Testing ◽

Null Hypothesis ◽

Hypothesis Test ◽

Model Misspecification ◽

Internal Validity ◽

Error Rates ◽

P Value ◽

Type I ◽

Null Hypothesis Testing ◽

Statistical Conclusion

Chapter 6 addresses the sub-category of internal validity defined by Shadish et al. as statistical conclusion validity, or “validity of inferences about the correlation (covariance) between treatment and outcome.” The common threats to statistical conclusion validity can arise, or become plausible, through either model misspecification or hypothesis testing. The risk of a serious model misspecification is inversely proportional to the length of the time series, for example, and so is the risk of misstating the Type I and Type II error rates. Threats to statistical conclusion validity arise from the classical and modern hybrid significance testing structures; the serious threats that weigh heavily in p-value tests are shown to be undefined in Bayesian tests. While the particularly vexing threats raised by modern null hypothesis testing are resolved through the elimination of the modern null hypothesis test, threats to statistical conclusion validity would inevitably persist and new threats would arise.
