rater drift
Recently Published Documents

TOTAL DOCUMENTS: 7 (FIVE YEARS: 0)
H-INDEX: 2 (FIVE YEARS: 0)
2017 ◽ Vol 42 (4) ◽ pp. 307-320
Author(s): Adrienne Sgammato, John R. Donoghue
When constructed response items are administered repeatedly, “trend scoring” can be used to test for rater drift. In trend scoring, raters rescore responses from the previous administration. Two simulation studies evaluated the utility of Stuart’s Q measure of marginal homogeneity as a way of evaluating rater drift when monitoring trend scoring. In the first study, data were generated based on trend scoring tables obtained from an operational assessment. The second study tightly controlled table margins to disentangle certain features present in the empirical data. In addition to Q, the paired t test was included as a comparison, because of its widespread use in monitoring trend scoring. Sample size, number of score categories, interrater agreement, and symmetry/asymmetry of the margins were manipulated. For identical margins, both statistics had good Type I error control. For a unidirectional shift in margins, both statistics had good power. As expected, when shifts in the margins were balanced across categories, the t test had little power. Q demonstrated good power for all conditions and identified almost all items identified by the t test. Q shows substantial promise for monitoring of trend scoring.
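Stuart's Q (the Stuart–Maxwell test) checks whether the two sets of score margins in a square original-score × rescore table differ, using a quadratic form in the margin differences with a chi-square reference distribution. A minimal sketch in Python; the 3-category table below is hypothetical and illustrative only, not data from the study:

```python
import numpy as np
from scipy import stats

def stuart_maxwell_q(table):
    """Stuart's Q test of marginal homogeneity for a square K x K table
    of (original score, rescore) counts.  Returns (Q, p-value)."""
    table = np.asarray(table, dtype=float)
    k = table.shape[0]
    d = table.sum(axis=1) - table.sum(axis=0)   # row margin minus column margin
    # covariance matrix of d under the null hypothesis of homogeneity
    s = -(table + table.T)
    np.fill_diagonal(s, table.sum(axis=1) + table.sum(axis=0) - 2 * np.diag(table))
    # drop one category so the covariance matrix is invertible
    d, s = d[:-1], s[:-1, :-1]
    q = d @ np.linalg.solve(s, d)
    p = stats.chi2.sf(q, df=k - 1)
    return q, p

# Hypothetical trend-scoring table (rows = original scores, cols = rescores):
table = [[40, 10,  2],
         [ 5, 50, 12],
         [ 1,  4, 30]]
q, p = stuart_maxwell_q(table)  # Q ≈ 5.46, p ≈ 0.065
```

Because Q pools margin shifts in all directions into one statistic, it retains power when upward and downward drift cancel across categories, which is exactly the case where the paired t test fails.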


2011 ◽ Vol 21 ◽ pp. S357-S358
Author(s): A. DeFries, B. Rothman, C. Yavorsky, M. Opler, J. Gordon, ...

2011 ◽ Vol 26 (S2) ◽ pp. 683-683
Author(s): B. Rothman, C. Yavorsky, A. De Fries, J. Gordon, M. Opler

Introduction/objectives/aims: Though rater drift in clinical trials has long been understood to negatively impact trial results, few studies have systematically quantified it. We examined training data for the HAM-D (Hamilton Depression Scale, 17-item version) at two time points to measure the impact.
Methods: Raters participating in a standardized training scored the HAM-D based on two videotaped interviews of depressed patients. To assess drift, data from an initial, post-online training session were compared with data obtained 12 months later. Intra-class correlation coefficients (Shrout & Fleiss, 1979) and concordance with expert ratings were compared.
Results: Intra-class correlation coefficients (ICC) for raters (n = 167) following initial training were good to excellent for individual raters (.695–.976, p < .0001) and good for the overall cohort (.752, p < .0001). Concordance with expert ratings was excellent at 99.3%. The overall ICC fell to .730 at the second assessment, and although the upper bound of individual performance remained in the good to excellent range, the frequency of scores in the poor to fair range (< .65) increased. Concordance also fell slightly, to 87%.
Conclusions: Rater drift occurred over 12 months, as gauged by reliability and concordance. Drift was apparent in only a limited portion of the cohort but still lowered the overall ICC at the second time point. Because studies are generally powered on the assumption that the ICC remains stable, this has implications for both the power calculation and the required sample size.
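The Shrout & Fleiss (1979) ICC(2,1) cited above treats both subjects and raters as random effects and is built from the mean squares of a two-way ANOVA. A minimal sketch in Python; the HAM-D totals below are hypothetical, invented only to exercise the formula:

```python
import numpy as np

def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater,
    following Shrout & Fleiss (1979).  `ratings` is n subjects x k raters."""
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand = x.mean()
    row_means = x.mean(axis=1)                  # per-subject means
    col_means = x.mean(axis=0)                  # per-rater means
    msr = k * np.sum((row_means - grand) ** 2) / (n - 1)   # subjects MS
    msc = n * np.sum((col_means - grand) ** 2) / (k - 1)   # raters MS
    sse = np.sum((x - row_means[:, None] - col_means[None, :] + grand) ** 2)
    mse = sse / ((n - 1) * (k - 1))                        # residual MS
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Hypothetical HAM-D totals: 6 videotaped interviews scored by 4 raters.
ratings = [[20, 21, 19, 21],
           [14, 15, 15, 14],
           [25, 24, 26, 25],
           [ 9, 10,  9, 11],
           [17, 18, 16, 17],
           [22, 22, 23, 21]]
icc = icc_2_1(ratings)
```

With large between-subject spread and small rater disagreement, as here, the estimate lands near the top of the "excellent" range; drift of the kind the abstract describes would show up as this value falling between the two assessments.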


2010 ◽ Vol 20 ◽ pp. S513-S514
Author(s): A. DeFries, C. Yavorsky, M. Opler, L. Ramadhar, E. Ivanova, ...

2009 ◽ Vol 46 (1) ◽ pp. 43-58
Author(s): Polina Harik, Brian E. Clauser, Irina Grabovsky, Ronald J. Nungester, Dave Swanson, ...
