SENSITIVITY OF MANTEL HAENSZEL MODEL AND RASCH MODEL AS VIEWED FROM SAMPLE SIZE

2017 ◽  
Vol 2 (1) ◽  
pp. 18
Author(s):  
IDRUS ALWI

The aim of this research is to compare the sensitivity of the Mantel-Haenszel method and the Rasch model for detecting differential item functioning (DIF), viewed from the sample size. The two DIF methods were compared using simulated binary item response data sets of varying sample size; 200 and 400 examinees were used in the analyses, with DIF detection based on gender difference. Each test condition was replicated 4 times. For both DIF detection methods, a test length of 42 items was sufficient for satisfactory DIF detection, with the detection rate increasing as sample size increased. The empirical results show that the Rasch model is more sensitive in detecting DIF than the Mantel-Haenszel method. With reference to these findings, the use of the Rasch model is recommended for evaluation activities with multiple-choice tests. For this purpose, every school needs teachers who are skilled in analyzing test results using modern methods (item response theory).
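For readers unfamiliar with the Mantel-Haenszel side of this comparison, the following minimal sketch (the simulated data and item setup are illustrative assumptions, not the study's actual analysis) computes the MH common odds ratio for one dichotomous item, matching examinees on their rest score:

```python
import numpy as np

def mantel_haenszel_odds_ratio(responses, group, item, n_strata=5):
    """MH common odds ratio (alpha_MH) for one dichotomous item.

    responses : (n_examinees, n_items) 0/1 matrix
    group     : boolean array, True = focal group
    Examinees are stratified by their score on the remaining items.
    alpha_MH = 1 means no DIF; > 1 favours the reference group.
    """
    rest = np.delete(responses, item, axis=1).sum(axis=1)
    edges = np.quantile(rest, np.linspace(0, 1, n_strata + 1))
    strata = np.digitize(rest, edges[1:-1])
    num = den = 0.0
    for s in np.unique(strata):
        m = strata == s
        a = np.sum(responses[m & ~group, item] == 1)  # reference right
        b = np.sum(responses[m & ~group, item] == 0)  # reference wrong
        c = np.sum(responses[m & group, item] == 1)   # focal right
        d = np.sum(responses[m & group, item] == 0)   # focal wrong
        n = a + b + c + d
        if n > 0:
            num += a * d / n
            den += b * c / n
    return num / den

# Simulate 400 examinees on a 10-item Rasch test; item 0 is one logit
# harder for the focal group (uniform DIF).
rng = np.random.default_rng(0)
n = 400
group = np.arange(n) % 2 == 1
ability = rng.normal(size=n)
difficulty = np.linspace(-1.5, 1.5, 10)
p = 1 / (1 + np.exp(-(ability[:, None] - difficulty)))
p[group, 0] = 1 / (1 + np.exp(-(ability[group] - difficulty[0] - 1.0)))
responses = (rng.random((n, 10)) < p).astype(int)

alpha = mantel_haenszel_odds_ratio(responses, group, item=0)
```

A common effect-size transform of this statistic is the ETS delta scale, ΔMH = −2.35·ln(αMH).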

2007 ◽  
Vol 10 (3) ◽  
pp. 309-324 ◽  
Author(s):  
John Brodersen ◽  
David Meads ◽  
Svend Kreiner ◽  
Hanne Thorsen ◽  
Lynda Doward ◽  
...  

2021 ◽  
Vol 10 (2) ◽  
pp. 270-281
Author(s):  
P. Susongko ◽  
Y. Arfiani ◽  
M. Kusuma

The emergence of differential item functioning (DIF) indicates an external bias in an item. This study aims to identify items on the scientific literacy skills with integrated science (SLiSIS) test that exhibit DIF based on gender. Moreover, it analyzes the emergence of DIF, especially in relation to the measured test construct, and draws conclusions about how far the SLiSIS test satisfies construct validity of the consequential type. The study was conducted with a quantitative approach using a survey, or non-experimental, method. The sample consisted of responses to the SLiSIS test from 310 eleventh-grade high school students in the science program at SMA 2 and SMA 3 Tegal. The DIF analysis used the Wald test with the Rasch model. Eight items contained DIF at the 95% confidence level. At the 99% confidence level, three items (items 1, 6, and 38, or 7%) contained DIF. The DIF is caused by differences in test-takers' ability on the measured construct, so it is not a test bias. Thus, the emergence of DIF on SLiSIS test items does not threaten the construct validity of the consequential type.
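The Wald approach used here compares group-specific item difficulty estimates. The sketch below is a rough numpy approximation (a PROX-style log-odds shortcut on simulated data, not the full Rasch estimation the study used):

```python
import numpy as np

def rasch_wald_dif(responses, group):
    """Crude Wald z for uniform DIF per item under a Rasch-type model.

    Each group's item difficulty is approximated by the log-odds of an
    incorrect response (a PROX-style shortcut, not conditional maximum
    likelihood), centred within group so ability shifts cancel out.
    """
    est = []
    for g_resp in (responses[~group], responses[group]):
        right = g_resp.sum(axis=0).astype(float)
        wrong = g_resp.shape[0] - right
        b = np.log(wrong / right)              # logit difficulty
        b -= b.mean()                          # centre the scale
        est.append((b, 1.0 / right + 1.0 / wrong))
    (b_r, v_r), (b_f, v_f) = est
    return (b_f - b_r) / np.sqrt(v_r + v_f)

# 600 examinees, 8 items; item 0 is 1.2 logits harder for the focal group.
rng = np.random.default_rng(1)
n = 600
group = np.arange(n) % 2 == 1
theta = rng.normal(size=n)
diffs = np.linspace(-1, 1, 8)
eta = theta[:, None] - diffs
eta[group, 0] -= 1.2
resp = (rng.random((n, 8)) < 1 / (1 + np.exp(-eta))).astype(int)

z = rasch_wald_dif(resp, group)   # |z| > 1.96 flags DIF at the 95% level
```

In practice, dedicated software (e.g. Winsteps or an R package such as eRm) would estimate the difficulties and their standard errors properly before forming the Wald statistic.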


The purpose of this study was to examine differences in the sensitivity of three methods, IRT-Likelihood Ratio (IRT-LR), Mantel-Haenszel (MH), and Logistic Regression (LR), in detecting gender differential item functioning (DIF) on the National Mathematics Examination (Ujian Nasional: UN) for the 2014/2015 academic year in North Sumatera Province, Indonesia. A DIF item is unfair: it advantages test takers from one group and disadvantages those from another even when they have the same ability. The presence of DIF was examined in a grouping by gender, with men as the reference group (R) and women as the focal group (F). This study used an experimental method with a 3x1 design: one factor (method) at three levels, in the form of three different DIF detection methods. There are 5 packages of the 2015 UN Mathematics test (codes: 1107, 2207, 3307, 4407, and 5507). The 2207 package code was taken as the sample data, consisting of 5000 participants (3067 women, 1933 men) on 40 UN items. Item selection based on classical test theory (CTT) on the 40 UN items produced 32 qualifying items, and selection based on item response theory (IRT) produced 18 qualifying items. Using R 3.3.3 and IRTLRDIF 2.0, 5 items were detected as DIF by the IRT-Likelihood Ratio method (IRT-LR), 4 items by the Logistic Regression method (LR), and 3 items by the Mantel-Haenszel method (MH). To test the sensitivity of the three methods, a single DIF detection run is not enough; six analysis groups were formed: (4400,40), (4400,32), (4400,18), (3000,40), (3000,32), and (3000,18), with 40 random data sets (without repetition) generated in each group and DIF detection conducted on the items in each data set. Although the data lack model fit, the 3-parameter logistic model (3PL) was chosen as the most suitable model.
With Tukey's HSD post hoc test, the IRT-LR method was found to be more sensitive than the MH and LR methods in the (4400,40) and (3000,40) groups. The IRT-LR method is no longer more sensitive than LR in the (4400,32) and (3000,32) groups, but it is still more sensitive than MH. In the (4400,18) and (3000,18) groups, the IRT-LR method is more sensitive than LR, but not significantly more sensitive than MH. The LR method was consistently more sensitive than the MH method across all analysis groups.
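Of the three methods, logistic regression is the simplest to sketch. The following minimal numpy implementation of the uniform-DIF model-comparison test uses hypothetical simulated data (the study itself ran R 3.3.3 and IRTLRDIF 2.0):

```python
import numpy as np

def logistic_fit(X, y, n_iter=25):
    """Newton-Raphson logistic regression; returns deviance at the MLE."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-X @ beta))
        H = X.T @ (X * (p * (1 - p))[:, None]) + 1e-8 * np.eye(X.shape[1])
        beta += np.linalg.solve(H, X.T @ (y - p))
    p = np.clip(1 / (1 + np.exp(-X @ beta)), 1e-12, 1 - 1e-12)
    return -2 * np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def lr_dif_chi2(item, total, group):
    """Uniform-DIF test in the Swaminathan-Rogers style: the deviance
    drop when a group term joins the regression of the item response on
    the matching score. Compare to chi-square, 1 df (3.84 at 5%)."""
    ones = np.ones_like(total)
    base = np.column_stack([ones, total])
    full = np.column_stack([ones, total, group.astype(float)])
    return logistic_fit(base, item) - logistic_fit(full, item)

# 500 examinees, 10 items; item 0 carries one logit of uniform DIF.
rng = np.random.default_rng(7)
n, k = 500, 10
group = np.arange(n) % 2 == 1
theta = rng.normal(size=n)
eta = theta[:, None] - np.linspace(-1, 1, k)
eta[group, 0] -= 1.0
resp = (rng.random((n, k)) < 1 / (1 + np.exp(-eta))).astype(int)

total = resp.sum(axis=1).astype(float)
total = (total - total.mean()) / total.std()   # standardised matching score
chi2 = lr_dif_chi2(resp[:, 0].astype(float), total, group)
```

Adding an interaction term (total × group) to the full model and testing with 2 df extends this to nonuniform DIF.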


2013 ◽  
Vol 93 (11) ◽  
pp. 1507-1519 ◽  
Author(s):  
Clayon B. Hamilton ◽  
Bert M. Chesworth

Background. The original 20-item Upper Extremity Functional Index (UEFI) has not undergone Rasch validation.
Objective. The purpose of this study was to determine whether Rasch analysis supports the UEFI as a measure of a single construct (ie, upper extremity function) and whether a Rasch-validated UEFI has adequate reproducibility for individual-level patient evaluation.
Design. This was a secondary analysis of data from a repeated-measures study designed to evaluate the measurement properties of the UEFI over a 3-week period.
Methods. Patients (n=239) with musculoskeletal upper extremity disorders were recruited from 17 physical therapy clinics across 4 Canadian provinces. Rasch analysis of the UEFI measurement properties was performed. If the UEFI did not fit the Rasch model, misfitting patients were deleted, items with poor response structure were corrected, and misfitting items and redundant items were deleted. The impact of differential item functioning on the ability estimate of patients was investigated.
Results. A 15-item modified UEFI was derived to achieve fit to the Rasch model, where the total score was supported as a measure of upper extremity function only. The resultant UEFI-15 interval-level scale (0–100, worst to best state) demonstrated excellent internal consistency (person separation index=0.94) and test-retest reliability (intraclass correlation coefficient [2,1]=.95). The minimal detectable change at the 90% confidence interval was 8.1.
Limitations. Patients who were ambidextrous or bilaterally affected were excluded to allow for the analysis of differential item functioning due to limb involvement and arm dominance.
Conclusion. Rasch analysis did not support the validity of the 20-item UEFI. However, the UEFI-15 was a valid and reliable interval-level measure of a single dimension, upper extremity function. Rasch analysis supports using the UEFI-15 in physical therapist practice to quantify upper extremity function in patients with musculoskeletal disorders of the upper extremity.


1995 ◽  
Vol 80 (3_suppl) ◽  
pp. 1071-1074 ◽  
Author(s):  
Thomas Uttaro

The Mantel-Haenszel chi-square (χ²MH) is widely used to detect differential item functioning (item bias) between ethnic and gender-based subgroups on educational and psychological tests. The empirical behavior of χ²MH has been incompletely understood, and previous research is inconclusive. The present simulation study explored the effects of sample size, number of items, and trait distributions on the power of χ²MH to detect modeled differential item functioning. A significant effect was obtained for sample size, with unacceptably low power for 250 subjects each in the focal and reference groups. The discussion supports the 1990 recommendations of Swaminathan and Rogers and opposes the 1993 view of Zieky that a sample size of 250 per group is adequate.
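The χ²MH statistic itself, and a miniature version of such a power study, can be sketched as follows (simulated Rasch data; the design, DIF magnitude, and replication count here are illustrative assumptions, not the article's):

```python
import numpy as np

def mh_chi2(tables):
    """Mantel-Haenszel chi-square with continuity correction.
    tables: stratum-level 2x2 counts [[a, b], [c, d]],
    rows = reference/focal group, columns = right/wrong."""
    A = E = V = 0.0
    for (a, b), (c, d) in tables:
        n = a + b + c + d
        if n < 2:
            continue
        A += a
        E += (a + b) * (a + c) / n
        V += (a + b) * (c + d) * (a + c) * (b + d) / (n * n * (n - 1))
    return (abs(A - E) - 0.5) ** 2 / V

def mh_power(n_per_group, dif=0.4, n_items=20, reps=100, rng=None):
    """Monte Carlo rejection rate of chi2_MH (1 df, 5% level) for one
    item with `dif` logits of uniform DIF, using rest-score quintiles
    as strata."""
    if rng is None:
        rng = np.random.default_rng(3)
    hits = 0
    for _ in range(reps):
        n = 2 * n_per_group
        group = np.arange(n) % 2 == 1
        theta = rng.normal(size=n)
        eta = theta[:, None] - np.linspace(-1, 1, n_items)
        eta[group, 0] -= dif
        resp = (rng.random((n, n_items)) < 1 / (1 + np.exp(-eta))).astype(int)
        rest = resp[:, 1:].sum(axis=1)
        strata = np.digitize(rest, np.quantile(rest, [0.2, 0.4, 0.6, 0.8]))
        tables = []
        for s in range(5):
            ref, foc = (strata == s) & ~group, (strata == s) & group
            a = resp[ref, 0].sum()
            c = resp[foc, 0].sum()
            tables.append([[a, ref.sum() - a], [c, foc.sum() - c]])
        hits += mh_chi2(tables) > 3.84
    return hits / reps

power_250 = mh_power(250)
power_1000 = mh_power(1000)
```

As the abstract's finding suggests, the estimated rejection rate climbs with the per-group sample size.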


2016 ◽  
Vol 2016 ◽  
pp. 1-8 ◽  
Author(s):  
Elahe Allahyari ◽  
Peyman Jafari ◽  
Zahra Bagheri

Objective. The present study uses simulated data to find the optimal number of response categories for achieving adequate power in the ordinal logistic regression (OLR) model for differential item functioning (DIF) analysis in psychometric research. Methods. A hypothetical ten-item quality-of-life scale with three, four, and five response categories was simulated. The power and type I error rates of the OLR model for detecting uniform DIF were investigated under different combinations of ability distribution (θ), sample size, sample size ratio, and the magnitude of uniform DIF across reference and focal groups. Results. When θ was distributed identically in the reference and focal groups, increasing the number of response categories from 3 to 5 resulted in an increase of approximately 8% in the power of the OLR model for detecting uniform DIF. The power of OLR was less than 0.36 when the ability distributions in the reference and focal groups were highly skewed to the left and right, respectively. Conclusions. The clearest conclusion from this research is that the minimum number of response categories for DIF analysis using OLR is five. However, the impact of the number of response categories in detecting DIF was lower than might be expected.
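A sketch of the OLR uniform-DIF test on a single ordinal item: a proportional-odds (cumulative logit) likelihood written from scratch and optimized with scipy. The simulated data, and the use of the standardized latent trait as the matching score, are simplifying assumptions, not the study's design.

```python
import numpy as np
from scipy.optimize import minimize

def polr_nll(params, X, y, n_cat):
    """Negative log-likelihood of a proportional-odds (cumulative logit)
    model. params = [first cut, log gaps..., slopes]; y in 0..n_cat-1."""
    k = n_cat - 1
    cuts = np.cumsum(np.concatenate([params[:1], np.exp(params[1:k])]))
    eta = X @ params[k:]
    cum = 1 / (1 + np.exp(-(cuts[None, :] - eta[:, None])))   # P(y <= j)
    cum = np.hstack([np.zeros((len(y), 1)), cum, np.ones((len(y), 1))])
    p = cum[np.arange(len(y)), y + 1] - cum[np.arange(len(y)), y]
    return -np.log(np.clip(p, 1e-12, None)).sum()

def olr_uniform_dif(y, score, group, n_cat):
    """Likelihood-ratio chi-square (1 df) for uniform DIF: the fit gain
    from adding a group term to the item-on-score regression."""
    k = n_cat - 1
    def fit(X):
        x0 = np.concatenate([[-1.0], np.zeros(k - 1), np.zeros(X.shape[1])])
        return minimize(polr_nll, x0, args=(X, y, n_cat),
                        method="Nelder-Mead",
                        options={"maxiter": 20000, "fatol": 1e-8}).fun
    return 2 * (fit(score[:, None]) -
                fit(np.column_stack([score, group.astype(float)])))

# 600 respondents; a 4-category item shifted 0.8 logits against the
# focal group (uniform DIF).
rng = np.random.default_rng(11)
n = 600
group = np.arange(n) % 2 == 1
theta = rng.normal(size=n)
cuts = np.array([-1.0, 0.0, 1.0])
cum = 1 / (1 + np.exp(-(cuts[None, :] - (theta - 0.8 * group)[:, None])))
y = (rng.random(n)[:, None] > cum).sum(axis=1)   # ordinal in {0,1,2,3}
score = (theta - theta.mean()) / theta.std()

chi2 = olr_uniform_dif(y, score, group, n_cat=4)  # > 3.84 flags uniform DIF
```

A production analysis would use an established proportional-odds implementation (e.g. statsmodels' OrderedModel or R's MASS::polr) rather than this hand-rolled optimizer.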

