Detecting Noneffortful Responses Based on a Residual Method Using an Iterative Purification Process

2021
pp. 107699862199436
Author(s): Yue Liu, Hongyun Liu

The prevalence and serious consequences of noneffortful responses from unmotivated examinees are well known in educational measurement. In this study, we propose an iterative purification process based on a response time residual method with fixed item parameter estimates to detect noneffortful responses. The proposed method is compared with the traditional residual method and the noniterative method with fixed item parameters in two simulation studies in terms of noneffort detection accuracy and parameter recovery. The results show that when the severity of noneffort is high, the proposed method yields a much higher true positive rate at the cost of only a small increase in the false discovery rate. In addition, parameter estimation is substantially improved by fixing the item parameters and iteratively cleansing the data. These results suggest that the proposed method is a promising way to reduce the impact of data contamination due to severely low test-taking effort and to obtain more accurate parameter estimates. An empirical study is also conducted to illustrate the differences in detection rates and parameter estimates among the approaches.
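
To make the flagged-response logic concrete, here is a minimal sketch of the purification loop, assuming van der Linden's lognormal response time model with item time parameters fixed at precalibrated values; the function name, the normal-theory cutoff, and the simple person-speed estimator are illustrative assumptions, not the authors' exact specification.

```python
import numpy as np

def purify(log_rt, beta, sigma, z_cut=-1.645, max_iter=20):
    """Iteratively flag noneffortful responses via response time residuals.

    log_rt : (N, J) matrix of log response times
    beta, sigma : fixed lognormal time intensity / SD per item (length J),
                  e.g., precalibrated on a clean sample
    A response is flagged when its standardized residual falls below
    z_cut, i.e., it is implausibly fast for that item and person.
    """
    flags = np.zeros_like(log_rt, dtype=bool)
    for _ in range(max_iter):
        # Re-estimate each person's speed from currently unflagged responses
        tau = np.where(flags, np.nan, beta[None, :] - log_rt)
        tau_hat = np.nanmean(tau, axis=1, keepdims=True)
        # Standardized residual of log RT under the lognormal model
        z = (log_rt - (beta[None, :] - tau_hat)) / sigma[None, :]
        new_flags = z < z_cut
        if np.array_equal(new_flags, flags):
            break  # purification has converged
        flags = new_flags
    return flags
```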

2021
Author(s): Angély Loubert, Antoine Regnault, Véronique Sébille, Jean-Benoit Hardouin

Abstract

Background: In the analysis of clinical trial endpoints, calibration of patient-reported outcome (PRO) instruments ensures that the resulting "scores" represent the same quantity of the measured concept across applications. Rasch measurement theory (RMT) is a psychometric approach that guarantees the algebraic separation of person and item parameter estimates, allowing formal calibration of PRO instruments. In the RMT framework, calibration is performed using the item parameter estimates obtained from a previous "calibration" study. But if calibration is based on poorly estimated item parameters (e.g., because the sample size of the calibration study was small), it may hamper the ability to detect a treatment effect, and direct estimation of the item parameters from the trial data (non-calibration) may then be preferred. The objective of this simulation study was to assess the impact of calibration on the comparison of PRO results between treatment groups, using different analysis methods.

Methods: PRO results were simulated following a polytomous Rasch model for a calibration sample and a trial sample. Scenarios included varying sample sizes, instruments with varying numbers of items and response categories, and varying item parameter distributions. Different treatment effect sizes and distributions of the two patient samples were also explored. Treatment groups were compared using different methods based on a random-effects Rasch model. The calibrated and non-calibrated approaches were compared in terms of type I error, power, bias, and variance of the estimates of the between-group difference.

Results: The calibration approach had no impact on type I error, power, bias, or dispersion of the estimates. Among other findings, mistargeting between the PRO instrument and the patients in the trial sample (with respect to the level of the measured concept) resulted in lower power and greater position bias than appropriate targeting.

Conclusions: Calibrating PROs in clinical trials does not compromise the ability to accurately assess a treatment effect and is essential for properly interpreting PRO results. Given this important added value, calibration should always be performed in the RMT framework when a PRO instrument is used as an endpoint in a clinical trial.
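
For reference, the polytomous (partial credit) Rasch model underlying such simulations is commonly written as below, where θ_i is the person parameter and δ_jk are the item threshold parameters; in the calibrated approach, the δ_jk are fixed at the estimates from the calibration study. This is the textbook parameterization, not necessarily the authors' exact notation.

$$
P(X_{ij} = x \mid \theta_i) \;=\; \frac{\exp \sum_{k=0}^{x} (\theta_i - \delta_{jk})}{\sum_{m=0}^{M_j} \exp \sum_{k=0}^{m} (\theta_i - \delta_{jk})},
\qquad x = 0, 1, \ldots, M_j, \quad \delta_{j0} \equiv 0.
$$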


2018
Vol. 43 (3), pp. 195-210
Author(s): Chen-Wei Liu, Wen-Chung Wang

It is well known that respondents exhibit different response styles when responding to Likert-type items. For example, some respondents tend to select the extreme categories (e.g., strongly disagree and strongly agree), whereas others tend to select the middle categories (e.g., disagree, neutral, and agree). Furthermore, some respondents tend to disagree with every item (e.g., strongly disagree and disagree), whereas others tend to agree with every item (e.g., agree and strongly agree). In such cases, fitting standard unfolding item response theory (IRT) models that assume no response style will yield a poor fit and biased parameter estimates. Although there have been attempts to develop dominance IRT models that accommodate various response styles, such models are usually restricted to a specific response style and cannot be used for unfolding data. In this study, a general unfolding IRT model is proposed that can be combined with a softmax function to accommodate various response styles via scoring functions. The parameters of the new model can be estimated using Bayesian Markov chain Monte Carlo algorithms. An empirical data set is used for demonstration, followed by simulation studies that assess the parameter recovery of the new model as well as the consequences of ignoring response styles by fitting standard unfolding IRT models. The results suggest that the new model exhibits good parameter recovery and that estimates are seriously biased when response styles are ignored.
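
As a rough illustration of the mechanism, the sketch below combines an ideal-point (unfolding) utility with a softmax whose category weights encode a response style; the particular utility function and every parameter name here are illustrative assumptions, not the authors' model.

```python
import numpy as np

def unfolding_softmax_probs(theta, delta, alpha, tau, weights):
    """Category probabilities for an unfolding-plus-response-style sketch.

    Unfolding part: category k is most attractive when the person-item
    distance |theta - delta| is near that category's ideal distance tau[k]
    (agreement categories have small tau, disagreement categories large).
    Response-style part: a softmax reweighted by person-specific category
    weights, e.g., inflated weights on the end categories for an extreme
    response style.
    """
    utility = -alpha * np.abs(np.abs(theta - delta) - tau)
    logits = utility + np.log(weights)   # response-style reweighting
    p = np.exp(logits - logits.max())    # numerically stable softmax
    return p / p.sum()

# Example: 3 categories, extreme response style (heavier end weights)
probs = unfolding_softmax_probs(theta=0.5, delta=1.2, alpha=1.5,
                                tau=np.array([2.0, 1.0, 0.0]),
                                weights=np.array([2.0, 1.0, 2.0]))
```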


2021
pp. 014662162199074
Author(s): Carolin Strobl, Julia Kopf, Lucas Kohler, Timo von Oertzen, Achim Zeileis

For detecting differential item functioning (DIF) between two or more groups of test takers in the Rasch model, their item parameters need to be placed on the same scale. Typically, this is done by choosing a set of so-called anchor items based on statistical tests or heuristics. Here the authors suggest an alternative strategy: using an inequality criterion from economics, the Gini Index, the item parameters are shifted to an optimal position where the item parameter estimates of the groups best overlap. Several toy examples, extensive simulation studies, and two empirical application examples illustrate the properties of the Gini Index as an anchor point selection criterion and compare it to the criterion used in the alignment approach of Asparouhov and Muthén. In particular, the authors show that, in addition to identifying the globally optimal position for the anchor point, the criterion plot contains valuable additional information and may help discover unaccounted-for DIF-inducing multidimensionality. They further provide mathematical results that enable an efficient sparse grid optimization and make it feasible to extend the approach, for example, to multiple-group scenarios.
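
The core computation is easy to sketch: shift one group's item parameters by a candidate constant and score the shift by the Gini Index of the absolute parameter differences, keeping the shift that maximizes inequality, i.e., where most differences are near zero and only the DIF items stand out. The plain grid search below (with an assumed grid range) is an illustrative stand-in for the sparse grid optimization the authors derive.

```python
import numpy as np

def gini(x):
    """Gini index of a nonnegative vector (standard mean-difference form)."""
    x = np.sort(np.abs(x))
    n = len(x)
    cum = np.cumsum(x)
    return (n + 1 - 2 * np.sum(cum / cum[-1])) / n

def anchor_shift(beta_ref, beta_foc, grid=np.linspace(-2, 2, 401)):
    """Return the shift c of the focal group's item parameters that
    maximizes the Gini index of |beta_ref - (beta_foc + c)|."""
    scores = [gini(beta_ref - (beta_foc + c)) for c in grid]
    return grid[int(np.argmax(scores))]
```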


2020
Vol. 80 (4), pp. 775-807
Author(s): Yue Liu, Ying Cheng, Hongyun Liu

Responses from non-effortful test-takers can have serious consequences, as they impair model calibration and latent trait inferences. This article introduces a mixture model that uses both response accuracy and response time information to help differentiate non-effortful from effortful individuals and to improve item parameter estimation based on the effortful group. Two mixture approaches are compared with the traditional response time mixture model (TMM) method and the normative threshold 10 (NT10) method, which relies on response-behavior effort criteria, in four simulation scenarios with regard to item parameter recovery and classification accuracy. The results demonstrate that the mixture methods and the TMM method can reduce the bias in item parameter estimates caused by non-effortful individuals, with the mixture methods showing greater advantages when non-effort severity is high or when the response times are not lognormally distributed. An illustrative example is also provided.
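
To illustrate how accuracy and response time jointly inform the classification, here is a hedged sketch of the posterior probability that a single response is non-effortful under a two-class mixture; all parameter names are illustrative, and the actual models in the article are fit jointly rather than response by response.

```python
from scipy.stats import norm

def noneffort_posterior(x, log_t, p_eff, mu_e, sd_e, mu_n, sd_n, pi_n, k=4):
    """Posterior probability that one response is non-effortful.

    Effortful class:     correct with model-implied probability p_eff,
                         log RT ~ Normal(mu_e, sd_e).
    Non-effortful class: correct by random guessing (1/k for k options),
                         log RT ~ Normal(mu_n, sd_n), typically faster.
    pi_n is the mixing proportion of non-effortful responses.
    """
    p_guess = 1.0 / k
    lik_e = (p_eff if x else 1 - p_eff) * norm.pdf(log_t, mu_e, sd_e)
    lik_n = (p_guess if x else 1 - p_guess) * norm.pdf(log_t, mu_n, sd_n)
    return pi_n * lik_n / (pi_n * lik_n + (1 - pi_n) * lik_e)
```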


2019
Vol. 34 (6), pp. 873-873
Author(s): W. Goette, A. Schmitt, J. Nici

Abstract

Objective: To obtain item parameter estimates for the Halstead Category Test (HCT). Previous item-level analyses of the HCT have been conducted, but without item response theory methods.

Method: Data were collected from a diagnostically heterogeneous sample of 211 adults (110 males, 101 females) referred for neuropsychological evaluation. The sample had an average educational attainment of 14.18 years (SD = 3.05) and an average age of 59.75 years (SD = 18.28). Responses to items on Subtests III-VII were dichotomously coded (0 = incorrect, 1 = correct). A two-parameter, hierarchical, logistic item response model was fit to the data in Stan, which uses an adaptive variant of Hamiltonian Monte Carlo.

Results: The model converged appropriately, with posterior estimates of item parameters all demonstrating adequate effective sample sizes (min. = 3485.74) and Rhat values (max. = 1.002). Posterior difficulty estimates ranged from -1.06 to 2.07 (Subtest III), -1.67 to 1.92 (IV), -3.80 to 2.62 (V), -2.35 to 4.38 (VI), and -2.28 to 1.80 (VII). Posterior discrimination estimates ranged from 0.20 to 5.41 (III), 0.35 to 8.17 (IV), 0.11 to 4.14 (V), 0.69 to 5.88 (VI), and 0.53 to 2.83 (VII).

Conclusions: The HCT demonstrates a wide range of item difficulties, with few items being excessively difficult, although some such items were identified in Subtest VI. The ranges of item discriminations are also wide, with some very high estimates that may reflect the smaller sample size for a two-parameter model or less-than-ideal item functioning. These findings support the longstanding sensitivity of the HCT to a variety of neurological conditions across the severity spectrum.
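
For context, the two-parameter logistic (2PL) model fit here specifies the probability of a correct response as

$$
P(X_{ij} = 1 \mid \theta_i) = \frac{1}{1 + \exp[-a_j(\theta_i - b_j)]},
$$

where a_j is the discrimination and b_j the difficulty of item j; in the hierarchical version, the item parameters share common priors whose hyperparameters are also estimated.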


2016
Vol. 77 (3), pp. 389-414
Author(s): Yin Lin, Anna Brown

A fundamental assumption in computerized adaptive testing is that item parameters are invariant with respect to context—items surrounding the administered item. This assumption, however, may not hold in forced-choice (FC) assessments, where explicit comparisons are made between items included in the same block. We empirically examined the influence of context on item parameters by comparing parameter estimates from two FC instruments. The first instrument was composed of blocks of three items, whereas in the second, the context was manipulated by adding one item to each block, resulting in blocks of four. The item parameter estimates were highly similar. However, a small number of significant deviations were observed, confirming the importance of context when designing adaptive FC assessments. Two patterns of such deviations were identified, and methods to reduce their occurrences in an FC computerized adaptive testing setting were proposed. It was shown that with a small proportion of violations of the parameter invariance assumption, score estimation remained stable.


2021
Vol. 11
Author(s): Sedat Sen, Allan S. Cohen

Results of a comprehensive simulation study are reported investigating the effects of sample size, test length, number of attributes, and base rate of mastery on item parameter recovery and classification accuracy for four diagnostic classification models (DCMs): the C-RUM, DINA, DINO, and reduced LCDM. Effects were evaluated using the bias and RMSE between the true (i.e., generating) and estimated parameters. Effects of the simulated factors on attribute assignment were also evaluated using the percentage of correct classifications. More precise item parameter estimates were obtained with larger sample sizes and longer tests. Recovery of item parameters decreased as the number of attributes increased from three to five, whereas the base rate of mastery had a varying effect on item recovery. Item parameter recovery and classification accuracy were higher for the DINA and DINO models.
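
As one concrete example of these models, the DINA model condenses attributes conjunctively: an examinee answers an item correctly with probability 1 - s_j if they master every attribute the Q-matrix requires for that item, and with guessing probability g_j otherwise. A minimal sketch with illustrative variable names:

```python
import numpy as np

def dina_prob(alpha, q, slip, guess):
    """Correct-response probabilities under the DINA model.

    alpha : (K,) binary attribute-mastery vector for one examinee
    q     : (J, K) binary Q-matrix of item-attribute requirements
    slip, guess : (J,) item slip and guessing parameters
    """
    # eta_j = 1 iff the examinee masters all attributes required by item j
    eta = np.all(alpha[None, :] >= q, axis=1)
    return np.where(eta, 1 - slip, guess)
```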


2020
Author(s): Joseph Rios, Jim Soland

As low-stakes testing contexts become more common, low test-taking effort may pose a serious validity threat. One common solution is to identify noneffortful responses and treat them as missing during parameter estimation via the effort-moderated IRT (EM-IRT) model. Although this model has been shown to outperform traditional IRT models (e.g., the 2PL) in parameter estimation under simulated conditions, prior research has not examined its performance under violations of the model's assumptions. Therefore, the objective of this simulation study was to examine item and mean ability parameter recovery when violating the assumptions that noneffortful responding occurs randomly (assumption #1) and is unrelated to the underlying ability of examinees (assumption #2). Results demonstrated that, across conditions, the EM-IRT model provided item parameter estimates that were robust to violations of assumption #1. However, bias values greater than 0.20 SDs were observed for the EM-IRT model when assumption #2 was violated; nonetheless, these values were still lower than those from the 2PL model. In terms of mean ability estimates, the two models performed equally across conditions. For both models, mean ability estimates were biased by more than 0.25 SDs when assumption #2 was violated. However, our accompanying empirical study suggested that this bias arose under extreme conditions that may not be present in some operational settings. Overall, these results suggest that, under realistic conditions with model violations, the EM-IRT model provides superior item parameter estimates and mean ability estimates equal to those of the 2PL model.
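
A minimal sketch of the EM-IRT idea described above: responses flagged as noneffortful (in practice, usually by a response time threshold) are dropped from the 2PL likelihood so that they do not inform parameter estimation. The flagging rule and function names are illustrative.

```python
import numpy as np

def em_irt_loglik(theta, x, flags, a, b):
    """Effort-moderated 2PL log-likelihood for one examinee.

    x     : (J,) binary responses
    flags : (J,) True where the response was classified as noneffortful
    a, b  : (J,) 2PL discrimination and difficulty parameters
    Flagged responses are treated as missing, i.e., excluded from the sum.
    """
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    ll = x * np.log(p) + (1 - x) * np.log(1 - p)
    return ll[~flags].sum()
```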

