SENSITIVITY OF MANTEL HAENSZEL MODEL AND RASCH MODEL AS VIEWED FROM SAMPLE SIZE

2017 ◽  
Vol 2 (1) ◽  
pp. 18
Author(s):  
IDRUS ALWI

The aim of this research is to compare the sensitivity of the Mantel-Haenszel method and the Rasch model for detecting differential item functioning (DIF), viewed from the sample size. The two DIF methods were compared using simulated binary item response data sets of varying sample size; 200 and 400 examinees were used in the analyses, with DIF detection based on gender difference. Each test condition was replicated 4 times. For both DIF detection methods, a test length of 42 items was sufficient for satisfactory DIF detection, with the detection rate increasing as sample size increased. The empirical results show that the Rasch model is more sensitive in detecting DIF than the Mantel-Haenszel method. With reference to these findings, the use of the Rasch model is recommended for evaluation activities with multiple-choice tests. For this purpose, every school needs teachers who are skilled in analyzing test results using modern methods (item response theory).
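For readers unfamiliar with the Mantel-Haenszel side of this comparison, the following minimal sketch (the simulated data and item setup are illustrative assumptions, not the study's actual analysis) computes the MH common odds ratio for one dichotomous item, matching examinees on their rest score:

```python
import numpy as np

def mantel_haenszel_odds_ratio(responses, group, item, n_strata=5):
    """MH common odds ratio (alpha_MH) for one dichotomous item.

    responses : (n_examinees, n_items) 0/1 matrix
    group     : boolean array, True = focal group
    Examinees are stratified by their score on the remaining items.
    alpha_MH = 1 means no DIF; > 1 favours the reference group.
    """
    rest = np.delete(responses, item, axis=1).sum(axis=1)
    edges = np.quantile(rest, np.linspace(0, 1, n_strata + 1))
    strata = np.digitize(rest, edges[1:-1])
    num = den = 0.0
    for s in np.unique(strata):
        m = strata == s
        a = np.sum(responses[m & ~group, item] == 1)  # reference right
        b = np.sum(responses[m & ~group, item] == 0)  # reference wrong
        c = np.sum(responses[m & group, item] == 1)   # focal right
        d = np.sum(responses[m & group, item] == 0)   # focal wrong
        n = a + b + c + d
        if n > 0:
            num += a * d / n
            den += b * c / n
    return num / den

# Simulate 400 examinees on a 10-item Rasch test; item 0 is one logit
# harder for the focal group (uniform DIF).
rng = np.random.default_rng(0)
n = 400
group = np.arange(n) % 2 == 1
ability = rng.normal(size=n)
difficulty = np.linspace(-1.5, 1.5, 10)
p = 1 / (1 + np.exp(-(ability[:, None] - difficulty)))
p[group, 0] = 1 / (1 + np.exp(-(ability[group] - difficulty[0] - 1.0)))
responses = (rng.random((n, 10)) < p).astype(int)

alpha = mantel_haenszel_odds_ratio(responses, group, item=0)
```

A common effect-size transform of this statistic is the ETS delta scale, ΔMH = −2.35·ln(αMH).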

2007 ◽  
Vol 10 (3) ◽  
pp. 309-324 ◽  
Author(s):  
John Brodersen ◽  
David Meads ◽  
Svend Kreiner ◽  
Hanne Thorsen ◽  
Lynda Doward ◽  
...  

2021 ◽  
Vol 10 (2) ◽  
pp. 270-281
Author(s):  
P. Susongko ◽  
Y. Arfiani ◽  
M. Kusuma

The emergence of differential item functioning (DIF) indicates an external bias in an item. This study aims to identify items on the scientific literacy skills with integrated science (SLiSIS) test that exhibit DIF based on gender. Moreover, it analyzes the emergence of DIF, especially in relation to the measured test construct, and draws conclusions about how far the SLiSIS test satisfies construct validity of the consequential type. The study was conducted with a quantitative approach using a survey, or non-experimental, method. The sample consisted of responses to the SLiSIS test from 310 eleventh-grade high school students in the science program at SMA 2 and SMA 3 Tegal. The DIF analysis used the Wald test with the Rasch model. Eight items contained DIF at the 95% confidence level. At the 99% confidence level, three items (items 1, 6, and 38, or 7%) contained DIF. The DIF is caused by differences in test-takers' ability on the measured construct, so it is not a test bias. Thus, the emergence of DIF on SLiSIS test items does not threaten the construct validity of the consequential type.
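The Wald approach used here compares group-specific item difficulty estimates. The sketch below is a rough numpy approximation (a PROX-style log-odds shortcut on simulated data, not the full Rasch estimation the study used):

```python
import numpy as np

def rasch_wald_dif(responses, group):
    """Crude Wald z for uniform DIF per item under a Rasch-type model.

    Each group's item difficulty is approximated by the log-odds of an
    incorrect response (a PROX-style shortcut, not conditional maximum
    likelihood), centred within group so ability shifts cancel out.
    """
    est = []
    for g_resp in (responses[~group], responses[group]):
        right = g_resp.sum(axis=0).astype(float)
        wrong = g_resp.shape[0] - right
        b = np.log(wrong / right)              # logit difficulty
        b -= b.mean()                          # centre the scale
        est.append((b, 1.0 / right + 1.0 / wrong))
    (b_r, v_r), (b_f, v_f) = est
    return (b_f - b_r) / np.sqrt(v_r + v_f)

# 600 examinees, 8 items; item 0 is 1.2 logits harder for the focal group.
rng = np.random.default_rng(1)
n = 600
group = np.arange(n) % 2 == 1
theta = rng.normal(size=n)
diffs = np.linspace(-1, 1, 8)
eta = theta[:, None] - diffs
eta[group, 0] -= 1.2
resp = (rng.random((n, 8)) < 1 / (1 + np.exp(-eta))).astype(int)

z = rasch_wald_dif(resp, group)   # |z| > 1.96 flags DIF at the 95% level
```

In practice, dedicated software (e.g. Winsteps or an R package such as eRm) would estimate the difficulties and their standard errors properly before forming the Wald statistic.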


The purpose of this study was to examine differences in the sensitivity of three methods, IRT-Likelihood Ratio (IRT-LR), Mantel-Haenszel (MH), and Logistic Regression (LR), in detecting gender differential item functioning (DIF) on the National Mathematics Examination (Ujian Nasional: UN) for the 2014/2015 academic year in North Sumatera Province, Indonesia. A DIF item is unfair: it advantages test takers from one group and disadvantages those from another even when they have the same ability. The presence of DIF was examined in a grouping by gender, with men as the reference group (R) and women as the focal group (F). This study used an experimental method with a 3x1 design: one factor (method) at three levels, in the form of three different DIF detection methods. There are 5 packages of the 2015 UN Mathematics test (codes: 1107, 2207, 3307, 4407, and 5507). The 2207 package code was taken as the sample data, consisting of 5000 participants (3067 women, 1933 men) on 40 UN items. Item selection based on classical test theory (CTT) on the 40 UN items produced 32 qualifying items, and selection based on item response theory (IRT) produced 18 qualifying items. Using R 3.3.3 and IRTLRDIF 2.0, 5 items were detected as DIF by the IRT-Likelihood Ratio method (IRT-LR), 4 items by the Logistic Regression method (LR), and 3 items by the Mantel-Haenszel method (MH). To test the sensitivity of the three methods, a single DIF detection run is not enough; six analysis groups were formed: (4400,40), (4400,32), (4400,18), (3000,40), (3000,32), and (3000,18), with 40 random data sets (without repetition) generated in each group and DIF detection conducted on the items in each data set. Although the data lack model fit, the 3-parameter logistic model (3PL) was chosen as the most suitable model.
With Tukey's HSD post hoc test, the IRT-LR method was found to be more sensitive than the MH and LR methods in the (4400,40) and (3000,40) groups. The IRT-LR method is no longer more sensitive than LR in the (4400,32) and (3000,32) groups, but it is still more sensitive than MH. In the (4400,18) and (3000,18) groups, the IRT-LR method is more sensitive than LR, but not significantly more sensitive than MH. The LR method was consistently more sensitive than the MH method across all analysis groups.
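Of the three methods, logistic regression is the simplest to sketch. The following minimal numpy implementation of the uniform-DIF model-comparison test uses hypothetical simulated data (the study itself ran R 3.3.3 and IRTLRDIF 2.0):

```python
import numpy as np

def logistic_fit(X, y, n_iter=25):
    """Newton-Raphson logistic regression; returns deviance at the MLE."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-X @ beta))
        H = X.T @ (X * (p * (1 - p))[:, None]) + 1e-8 * np.eye(X.shape[1])
        beta += np.linalg.solve(H, X.T @ (y - p))
    p = np.clip(1 / (1 + np.exp(-X @ beta)), 1e-12, 1 - 1e-12)
    return -2 * np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def lr_dif_chi2(item, total, group):
    """Uniform-DIF test in the Swaminathan-Rogers style: the deviance
    drop when a group term joins the regression of the item response on
    the matching score. Compare to chi-square, 1 df (3.84 at 5%)."""
    ones = np.ones_like(total)
    base = np.column_stack([ones, total])
    full = np.column_stack([ones, total, group.astype(float)])
    return logistic_fit(base, item) - logistic_fit(full, item)

# 500 examinees, 10 items; item 0 carries one logit of uniform DIF.
rng = np.random.default_rng(7)
n, k = 500, 10
group = np.arange(n) % 2 == 1
theta = rng.normal(size=n)
eta = theta[:, None] - np.linspace(-1, 1, k)
eta[group, 0] -= 1.0
resp = (rng.random((n, k)) < 1 / (1 + np.exp(-eta))).astype(int)

total = resp.sum(axis=1).astype(float)
total = (total - total.mean()) / total.std()   # standardised matching score
chi2 = lr_dif_chi2(resp[:, 0].astype(float), total, group)
```

Adding an interaction term (total × group) to the full model and testing with 2 df extends this to nonuniform DIF.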


2013 ◽  
Vol 93 (11) ◽  
pp. 1507-1519 ◽  
Author(s):  
Clayon B. Hamilton ◽  
Bert M. Chesworth

Background. The original 20-item Upper Extremity Functional Index (UEFI) has not undergone Rasch validation.
Objective. The purpose of this study was to determine whether Rasch analysis supports the UEFI as a measure of a single construct (ie, upper extremity function) and whether a Rasch-validated UEFI has adequate reproducibility for individual-level patient evaluation.
Design. This was a secondary analysis of data from a repeated-measures study designed to evaluate the measurement properties of the UEFI over a 3-week period.
Methods. Patients (n=239) with musculoskeletal upper extremity disorders were recruited from 17 physical therapy clinics across 4 Canadian provinces. Rasch analysis of the UEFI measurement properties was performed. If the UEFI did not fit the Rasch model, misfitting patients were deleted, items with poor response structure were corrected, and misfitting items and redundant items were deleted. The impact of differential item functioning on the ability estimate of patients was investigated.
Results. A 15-item modified UEFI was derived to achieve fit to the Rasch model, where the total score was supported as a measure of upper extremity function only. The resultant UEFI-15 interval-level scale (0–100, worst to best state) demonstrated excellent internal consistency (person separation index=0.94) and test-retest reliability (intraclass correlation coefficient [2,1]=.95). The minimal detectable change at the 90% confidence interval was 8.1.
Limitations. Patients who were ambidextrous or bilaterally affected were excluded to allow for the analysis of differential item functioning due to limb involvement and arm dominance.
Conclusion. Rasch analysis did not support the validity of the 20-item UEFI. However, the UEFI-15 was a valid and reliable interval-level measure of a single dimension, upper extremity function. Rasch analysis supports using the UEFI-15 in physical therapist practice to quantify upper extremity function in patients with musculoskeletal disorders of the upper extremity.


1995 ◽  
Vol 80 (3_suppl) ◽  
pp. 1071-1074 ◽  
Author(s):  
Thomas Uttaro

The Mantel-Haenszel chi-square (χ²MH) is widely used to detect differential item functioning (item bias) between ethnic and gender-based subgroups on educational and psychological tests. The empirical behavior of χ²MH has been incompletely understood, and previous research is inconclusive. The present simulation study explored the effects of sample size, number of items, and trait distributions on the power of χ²MH to detect modeled differential item functioning. A significant effect was obtained for sample size, with unacceptably low power for 250 subjects each in the focal and reference groups. The discussion supports the 1990 recommendations of Swaminathan and Rogers and opposes the 1993 view of Zieky that a sample size of 250 per group is adequate.
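The χ²MH statistic itself, and a miniature version of such a power study, can be sketched as follows (simulated Rasch data; the design, DIF magnitude, and replication count here are illustrative assumptions, not the article's):

```python
import numpy as np

def mh_chi2(tables):
    """Mantel-Haenszel chi-square with continuity correction.
    tables: stratum-level 2x2 counts [[a, b], [c, d]],
    rows = reference/focal group, columns = right/wrong."""
    A = E = V = 0.0
    for (a, b), (c, d) in tables:
        n = a + b + c + d
        if n < 2:
            continue
        A += a
        E += (a + b) * (a + c) / n
        V += (a + b) * (c + d) * (a + c) * (b + d) / (n * n * (n - 1))
    return (abs(A - E) - 0.5) ** 2 / V

def mh_power(n_per_group, dif=0.4, n_items=20, reps=100, rng=None):
    """Monte Carlo rejection rate of chi2_MH (1 df, 5% level) for one
    item with `dif` logits of uniform DIF, using rest-score quintiles
    as strata."""
    if rng is None:
        rng = np.random.default_rng(3)
    hits = 0
    for _ in range(reps):
        n = 2 * n_per_group
        group = np.arange(n) % 2 == 1
        theta = rng.normal(size=n)
        eta = theta[:, None] - np.linspace(-1, 1, n_items)
        eta[group, 0] -= dif
        resp = (rng.random((n, n_items)) < 1 / (1 + np.exp(-eta))).astype(int)
        rest = resp[:, 1:].sum(axis=1)
        strata = np.digitize(rest, np.quantile(rest, [0.2, 0.4, 0.6, 0.8]))
        tables = []
        for s in range(5):
            ref, foc = (strata == s) & ~group, (strata == s) & group
            a = resp[ref, 0].sum()
            c = resp[foc, 0].sum()
            tables.append([[a, ref.sum() - a], [c, foc.sum() - c]])
        hits += mh_chi2(tables) > 3.84
    return hits / reps

power_250 = mh_power(250)
power_1000 = mh_power(1000)
```

As the abstract's finding suggests, the estimated rejection rate climbs with the per-group sample size.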


2016 ◽  
Vol 2016 ◽  
pp. 1-8 ◽  
Author(s):  
Elahe Allahyari ◽  
Peyman Jafari ◽  
Zahra Bagheri

Objective. The present study uses simulated data to find the optimal number of response categories for achieving adequate power in the ordinal logistic regression (OLR) model for differential item functioning (DIF) analysis in psychometric research. Methods. A hypothetical ten-item quality-of-life scale with three, four, and five response categories was simulated. The power and type I error rates of the OLR model for detecting uniform DIF were investigated under different combinations of ability distribution (θ), sample size, sample size ratio, and the magnitude of uniform DIF across reference and focal groups. Results. When θ was distributed identically in the reference and focal groups, increasing the number of response categories from 3 to 5 resulted in an increase of approximately 8% in the power of the OLR model for detecting uniform DIF. The power of OLR was less than 0.36 when the ability distributions in the reference and focal groups were highly skewed to the left and right, respectively. Conclusions. The clearest conclusion from this research is that the minimum number of response categories for DIF analysis using OLR is five. However, the impact of the number of response categories in detecting DIF was lower than might be expected.
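A sketch of the OLR uniform-DIF test on a single ordinal item: a proportional-odds (cumulative logit) likelihood written from scratch and optimized with scipy. The simulated data, and the use of the standardized latent trait as the matching score, are simplifying assumptions, not the study's design.

```python
import numpy as np
from scipy.optimize import minimize

def polr_nll(params, X, y, n_cat):
    """Negative log-likelihood of a proportional-odds (cumulative logit)
    model. params = [first cut, log gaps..., slopes]; y in 0..n_cat-1."""
    k = n_cat - 1
    cuts = np.cumsum(np.concatenate([params[:1], np.exp(params[1:k])]))
    eta = X @ params[k:]
    cum = 1 / (1 + np.exp(-(cuts[None, :] - eta[:, None])))   # P(y <= j)
    cum = np.hstack([np.zeros((len(y), 1)), cum, np.ones((len(y), 1))])
    p = cum[np.arange(len(y)), y + 1] - cum[np.arange(len(y)), y]
    return -np.log(np.clip(p, 1e-12, None)).sum()

def olr_uniform_dif(y, score, group, n_cat):
    """Likelihood-ratio chi-square (1 df) for uniform DIF: the fit gain
    from adding a group term to the item-on-score regression."""
    k = n_cat - 1
    def fit(X):
        x0 = np.concatenate([[-1.0], np.zeros(k - 1), np.zeros(X.shape[1])])
        return minimize(polr_nll, x0, args=(X, y, n_cat),
                        method="Nelder-Mead",
                        options={"maxiter": 20000, "fatol": 1e-8}).fun
    return 2 * (fit(score[:, None]) -
                fit(np.column_stack([score, group.astype(float)])))

# 600 respondents; a 4-category item shifted 0.8 logits against the
# focal group (uniform DIF).
rng = np.random.default_rng(11)
n = 600
group = np.arange(n) % 2 == 1
theta = rng.normal(size=n)
cuts = np.array([-1.0, 0.0, 1.0])
cum = 1 / (1 + np.exp(-(cuts[None, :] - (theta - 0.8 * group)[:, None])))
y = (rng.random(n)[:, None] > cum).sum(axis=1)   # ordinal in {0,1,2,3}
score = (theta - theta.mean()) / theta.std()

chi2 = olr_uniform_dif(y, score, group, n_cat=4)  # > 3.84 flags uniform DIF
```

A production analysis would use an established proportional-odds implementation (e.g. statsmodels' OrderedModel or R's MASS::polr) rather than this hand-rolled optimizer.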

