Detecting Noneffortful Responses Based on a Residual Method Using an Iterative Purification Process

2021
pp. 107699862199436
Author(s): Yue Liu, Hongyun Liu

The prevalence and serious consequences of noneffortful responses from unmotivated examinees are well known in educational measurement. In this study, we propose an iterative purification process based on a response time residual method with fixed item parameter estimates to detect noneffortful responses. The proposed method is compared with the traditional residual method and the noniterative method with fixed item parameters in two simulation studies in terms of noneffort detection accuracy and parameter recovery. The results show that when the severity of noneffort is high, the proposed method yields a much higher true positive rate at the cost of only a small increase in the false discovery rate. In addition, parameter estimation is substantially improved by fixing the item parameters and iteratively cleansing the data. These results suggest that the proposed method is a promising way to reduce the impact of data contamination due to severely low test-taking effort and to obtain more accurate parameter estimates. An empirical study is also conducted to illustrate the differences in detection rates and parameter estimates among the approaches.
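
To make the flagged-response logic concrete, here is a minimal sketch of the purification loop, assuming van der Linden's lognormal response time model with item time parameters fixed at precalibrated values; the function name, the normal-theory cutoff, and the simple person-speed estimator are illustrative assumptions, not the authors' exact specification.

```python
import numpy as np

def purify(log_rt, beta, sigma, z_cut=-1.645, max_iter=20):
    """Iteratively flag noneffortful responses via response time residuals.

    log_rt : (N, J) matrix of log response times
    beta, sigma : fixed lognormal time intensity / SD per item (length J),
                  e.g., precalibrated on a clean sample
    A response is flagged when its standardized residual falls below
    z_cut, i.e., it is implausibly fast for that item and person.
    """
    flags = np.zeros_like(log_rt, dtype=bool)
    for _ in range(max_iter):
        # Re-estimate each person's speed from currently unflagged responses
        tau = np.where(flags, np.nan, beta[None, :] - log_rt)
        tau_hat = np.nanmean(tau, axis=1, keepdims=True)
        # Standardized residual of log RT under the lognormal model
        z = (log_rt - (beta[None, :] - tau_hat)) / sigma[None, :]
        new_flags = z < z_cut
        if np.array_equal(new_flags, flags):
            break  # purification has converged
        flags = new_flags
    return flags
```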

2021
Author(s): Angély Loubert, Antoine Regnault, Véronique Sébille, Jean-Benoit Hardouin

Abstract

Background: In the analysis of clinical trial endpoints, calibration of patient-reported outcome (PRO) instruments ensures that the resulting "scores" represent the same quantity of the measured concept across applications. Rasch measurement theory (RMT) is a psychometric approach that guarantees the algebraic separation of person and item parameter estimates, allowing formal calibration of PRO instruments. In the RMT framework, calibration is performed using the item parameter estimates obtained from a previous "calibration" study. But if calibration is based on poorly estimated item parameters (e.g., because the sample size of the calibration study was small), it may hamper the ability to detect a treatment effect, and direct estimation of the item parameters from the trial data (non-calibration) may then be preferred. The objective of this simulation study was to assess the impact of calibration on the comparison of PRO results between treatment groups, using different analysis methods.

Methods: PRO results were simulated following a polytomous Rasch model for a calibration sample and a trial sample. Scenarios included varying sample sizes, instruments with varying numbers of items and response categories, and varying item parameter distributions. Different treatment effect sizes and distributions of the two patient samples were also explored. Treatment groups were compared using different methods based on a random-effects Rasch model. The calibrated and non-calibrated approaches were compared in terms of type I error, power, bias, and variance of the estimates of the between-group difference.

Results: The calibration approach had no impact on type I error, power, bias, or dispersion of the estimates. Among other findings, mistargeting between the PRO instrument and the patients in the trial sample (with respect to the level of the measured concept) resulted in lower power and greater position bias than appropriate targeting.

Conclusions: Calibrating PROs in clinical trials does not compromise the ability to accurately assess a treatment effect and is essential for properly interpreting PRO results. Given this important added value, calibration should always be performed in the RMT framework when a PRO instrument is used as an endpoint in a clinical trial.
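
For reference, the polytomous (partial credit) Rasch model underlying such simulations is commonly written as below, where θ_i is the person parameter and δ_jk are the item threshold parameters; in the calibrated approach, the δ_jk are fixed at the estimates from the calibration study. This is the textbook parameterization, not necessarily the authors' exact notation.

$$
P(X_{ij} = x \mid \theta_i) \;=\; \frac{\exp \sum_{k=0}^{x} (\theta_i - \delta_{jk})}{\sum_{m=0}^{M_j} \exp \sum_{k=0}^{m} (\theta_i - \delta_{jk})},
\qquad x = 0, 1, \ldots, M_j, \quad \delta_{j0} \equiv 0.
$$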


2018
Vol. 43 (3), pp. 195-210
Author(s): Chen-Wei Liu, Wen-Chung Wang

It is well known that respondents exhibit different response styles when responding to Likert-type items. For example, some respondents tend to select the extreme categories (e.g., strongly disagree and strongly agree), whereas others tend to select the middle categories (e.g., disagree, neutral, and agree). Furthermore, some respondents tend to disagree with every item (e.g., strongly disagree and disagree), whereas others tend to agree with every item (e.g., agree and strongly agree). In such cases, fitting standard unfolding item response theory (IRT) models that assume no response style will yield a poor fit and biased parameter estimates. Although there have been attempts to develop dominance IRT models that accommodate various response styles, such models are usually restricted to a specific response style and cannot be used for unfolding data. In this study, a general unfolding IRT model is proposed that can be combined with a softmax function to accommodate various response styles via scoring functions. The parameters of the new model can be estimated using Bayesian Markov chain Monte Carlo algorithms. An empirical data set is used for demonstration, followed by simulation studies that assess the parameter recovery of the new model as well as the consequences of ignoring response styles by fitting standard unfolding IRT models. The results suggest that the new model exhibits good parameter recovery and that estimates are seriously biased when response styles are ignored.
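
As a rough illustration of the mechanism, the sketch below combines an ideal-point (unfolding) utility with a softmax whose category weights encode a response style; the particular utility function and every parameter name here are illustrative assumptions, not the authors' model.

```python
import numpy as np

def unfolding_softmax_probs(theta, delta, alpha, tau, weights):
    """Category probabilities for an unfolding-plus-response-style sketch.

    Unfolding part: category k is most attractive when the person-item
    distance |theta - delta| is near that category's ideal distance tau[k]
    (agreement categories have small tau, disagreement categories large).
    Response-style part: a softmax reweighted by person-specific category
    weights, e.g., inflated weights on the end categories for an extreme
    response style.
    """
    utility = -alpha * np.abs(np.abs(theta - delta) - tau)
    logits = utility + np.log(weights)   # response-style reweighting
    p = np.exp(logits - logits.max())    # numerically stable softmax
    return p / p.sum()

# Example: 3 categories, extreme response style (heavier end weights)
probs = unfolding_softmax_probs(theta=0.5, delta=1.2, alpha=1.5,
                                tau=np.array([2.0, 1.0, 0.0]),
                                weights=np.array([2.0, 1.0, 2.0]))
```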


2021
pp. 014662162199074
Author(s): Carolin Strobl, Julia Kopf, Lucas Kohler, Timo von Oertzen, Achim Zeileis

For detecting differential item functioning (DIF) between two or more groups of test takers in the Rasch model, their item parameters need to be placed on the same scale. Typically, this is done by choosing a set of so-called anchor items based on statistical tests or heuristics. Here the authors suggest an alternative strategy: using an inequality criterion from economics, the Gini Index, the item parameters are shifted to an optimal position where the item parameter estimates of the groups best overlap. Several toy examples, extensive simulation studies, and two empirical application examples illustrate the properties of the Gini Index as an anchor point selection criterion and compare it to the criterion used in the alignment approach of Asparouhov and Muthén. In particular, the authors show that, in addition to identifying the globally optimal position for the anchor point, the criterion plot contains valuable additional information and may help discover unaccounted-for DIF-inducing multidimensionality. They further provide mathematical results that enable an efficient sparse grid optimization and make it feasible to extend the approach, for example, to multiple-group scenarios.
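
The core computation is easy to sketch: shift one group's item parameters by a candidate constant and score the shift by the Gini Index of the absolute parameter differences, keeping the shift that maximizes inequality, i.e., where most differences are near zero and only the DIF items stand out. The plain grid search below (with an assumed grid range) is an illustrative stand-in for the sparse grid optimization the authors derive.

```python
import numpy as np

def gini(x):
    """Gini index of a nonnegative vector (standard mean-difference form)."""
    x = np.sort(np.abs(x))
    n = len(x)
    cum = np.cumsum(x)
    return (n + 1 - 2 * np.sum(cum / cum[-1])) / n

def anchor_shift(beta_ref, beta_foc, grid=np.linspace(-2, 2, 401)):
    """Return the shift c of the focal group's item parameters that
    maximizes the Gini index of |beta_ref - (beta_foc + c)|."""
    scores = [gini(beta_ref - (beta_foc + c)) for c in grid]
    return grid[int(np.argmax(scores))]
```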


2020
Vol. 80 (4), pp. 775-807
Author(s): Yue Liu, Ying Cheng, Hongyun Liu

Responses from non-effortful test-takers can have serious consequences, as they impair model calibration and latent trait inferences. This article introduces a mixture model that uses both response accuracy and response time information to help differentiate non-effortful from effortful individuals and to improve item parameter estimation based on the effortful group. Two mixture approaches are compared with the traditional response time mixture model (TMM) method and the normative threshold 10 (NT10) method, which relies on response-behavior effort criteria, in four simulation scenarios with regard to item parameter recovery and classification accuracy. The results demonstrate that the mixture methods and the TMM method can reduce the bias in item parameter estimates caused by non-effortful individuals, with the mixture methods showing greater advantages when non-effort severity is high or when the response times are not lognormally distributed. An illustrative example is also provided.
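
To illustrate how accuracy and response time jointly inform the classification, here is a hedged sketch of the posterior probability that a single response is non-effortful under a two-class mixture; all parameter names are illustrative, and the actual models in the article are fit jointly rather than response by response.

```python
from scipy.stats import norm

def noneffort_posterior(x, log_t, p_eff, mu_e, sd_e, mu_n, sd_n, pi_n, k=4):
    """Posterior probability that one response is non-effortful.

    Effortful class:     correct with model-implied probability p_eff,
                         log RT ~ Normal(mu_e, sd_e).
    Non-effortful class: correct by random guessing (1/k for k options),
                         log RT ~ Normal(mu_n, sd_n), typically faster.
    pi_n is the mixing proportion of non-effortful responses.
    """
    p_guess = 1.0 / k
    lik_e = (p_eff if x else 1 - p_eff) * norm.pdf(log_t, mu_e, sd_e)
    lik_n = (p_guess if x else 1 - p_guess) * norm.pdf(log_t, mu_n, sd_n)
    return pi_n * lik_n / (pi_n * lik_n + (1 - pi_n) * lik_e)
```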


2019
Vol. 34 (6), pp. 873-873
Author(s): W. Goette, A. Schmitt, J. Nici

Abstract

Objective: To obtain item parameter estimates for the Halstead Category Test (HCT). Previous item-level analyses of the HCT have been conducted, but without item response theory methods.

Method: Data were collected from a diagnostically heterogeneous sample of 211 adults (110 males, 101 females) referred for neuropsychological evaluation. The sample had an average educational attainment of 14.18 years (SD = 3.05) and an average age of 59.75 years (SD = 18.28). Responses to items on Subtests III-VII were dichotomously coded (0 = incorrect, 1 = correct). A two-parameter, hierarchical, logistic item response model was fit to the data in Stan, which uses an adaptive variant of Hamiltonian Monte Carlo.

Results: The model converged appropriately, with posterior estimates of item parameters all demonstrating adequate effective sample sizes (min. = 3485.74) and Rhat values (max. = 1.002). Posterior difficulty estimates ranged from -1.06 to 2.07 (Subtest III), -1.67 to 1.92 (IV), -3.80 to 2.62 (V), -2.35 to 4.38 (VI), and -2.28 to 1.80 (VII). Posterior discrimination estimates ranged from 0.20 to 5.41 (III), 0.35 to 8.17 (IV), 0.11 to 4.14 (V), 0.69 to 5.88 (VI), and 0.53 to 2.83 (VII).

Conclusions: The HCT demonstrates a wide range of item difficulties, with few items being excessively difficult, although some such items were identified in Subtest VI. The ranges of item discriminations are also wide, with some very high estimates that may reflect the smaller sample size for a two-parameter model or less-than-ideal item functioning. These findings support the longstanding sensitivity of the HCT to a variety of neurological conditions across the severity spectrum.
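
For context, the two-parameter logistic (2PL) model fit here specifies the probability of a correct response as

$$
P(X_{ij} = 1 \mid \theta_i) = \frac{1}{1 + \exp[-a_j(\theta_i - b_j)]},
$$

where a_j is the discrimination and b_j the difficulty of item j; in the hierarchical version, the item parameters share common priors whose hyperparameters are also estimated.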


2016
Vol. 77 (3), pp. 389-414
Author(s): Yin Lin, Anna Brown

A fundamental assumption in computerized adaptive testing is that item parameters are invariant with respect to context—items surrounding the administered item. This assumption, however, may not hold in forced-choice (FC) assessments, where explicit comparisons are made between items included in the same block. We empirically examined the influence of context on item parameters by comparing parameter estimates from two FC instruments. The first instrument was composed of blocks of three items, whereas in the second, the context was manipulated by adding one item to each block, resulting in blocks of four. The item parameter estimates were highly similar. However, a small number of significant deviations were observed, confirming the importance of context when designing adaptive FC assessments. Two patterns of such deviations were identified, and methods to reduce their occurrences in an FC computerized adaptive testing setting were proposed. It was shown that with a small proportion of violations of the parameter invariance assumption, score estimation remained stable.


2021
Vol. 11
Author(s): Sedat Sen, Allan S. Cohen

Results of a comprehensive simulation study are reported investigating the effects of sample size, test length, number of attributes, and base rate of mastery on item parameter recovery and classification accuracy for four diagnostic classification models (DCMs): the C-RUM, DINA, DINO, and reduced LCDM. Effects were evaluated using the bias and RMSE between the true (i.e., generating) and estimated parameters. Effects of the simulated factors on attribute assignment were also evaluated using the percentage of correct classifications. More precise item parameter estimates were obtained with larger sample sizes and longer tests. Recovery of item parameters decreased as the number of attributes increased from three to five, whereas the base rate of mastery had a varying effect on item recovery. Item parameter recovery and classification accuracy were higher for the DINA and DINO models.
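
As one concrete example of these models, the DINA model condenses attributes conjunctively: an examinee answers an item correctly with probability 1 - s_j if they master every attribute the Q-matrix requires for that item, and with guessing probability g_j otherwise. A minimal sketch with illustrative variable names:

```python
import numpy as np

def dina_prob(alpha, q, slip, guess):
    """Correct-response probabilities under the DINA model.

    alpha : (K,) binary attribute-mastery vector for one examinee
    q     : (J, K) binary Q-matrix of item-attribute requirements
    slip, guess : (J,) item slip and guessing parameters
    """
    # eta_j = 1 iff the examinee masters all attributes required by item j
    eta = np.all(alpha[None, :] >= q, axis=1)
    return np.where(eta, 1 - slip, guess)
```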


2020
Author(s): Joseph Rios, Jim Soland

As low-stakes testing contexts become more common, low test-taking effort may pose a serious validity threat. One common solution is to identify noneffortful responses and treat them as missing during parameter estimation via the effort-moderated IRT (EM-IRT) model. Although this model has been shown to outperform traditional IRT models (e.g., the 2PL) in parameter estimation under simulated conditions, prior research has not examined its performance under violations of the model's assumptions. Therefore, the objective of this simulation study was to examine item and mean ability parameter recovery when violating the assumptions that noneffortful responding occurs randomly (assumption #1) and is unrelated to the underlying ability of examinees (assumption #2). Results demonstrated that, across conditions, the EM-IRT model provided item parameter estimates that were robust to violations of assumption #1. However, bias values greater than 0.20 SDs were observed for the EM-IRT model when assumption #2 was violated; nonetheless, these values were still lower than those from the 2PL model. In terms of mean ability estimates, the two models performed equally across conditions. For both models, mean ability estimates were biased by more than 0.25 SDs when assumption #2 was violated. However, our accompanying empirical study suggested that this bias arose under extreme conditions that may not be present in some operational settings. Overall, these results suggest that, under realistic conditions with model violations, the EM-IRT model provides superior item parameter estimates and mean ability estimates equal to those of the 2PL model.
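
A minimal sketch of the EM-IRT idea described above: responses flagged as noneffortful (in practice, usually by a response time threshold) are dropped from the 2PL likelihood so that they do not inform parameter estimation. The flagging rule and function names are illustrative.

```python
import numpy as np

def em_irt_loglik(theta, x, flags, a, b):
    """Effort-moderated 2PL log-likelihood for one examinee.

    x     : (J,) binary responses
    flags : (J,) True where the response was classified as noneffortful
    a, b  : (J,) 2PL discrimination and difficulty parameters
    Flagged responses are treated as missing, i.e., excluded from the sum.
    """
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    ll = x * np.log(p) + (1 - x) * np.log(1 - p)
    return ll[~flags].sum()
```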

