Testing the Significance of the Agreement Among Observers

1968 ◽  
Vol 11 (1) ◽  
pp. 5-17 ◽  
Author(s):  
Martin A. Young ◽  
Tom D. Downs

Ratings by observers are often used in speech pathology to measure complex speech dimensions; this seems reasonable since a speech “disorder” represents the product of an observer’s evaluation and a speaker’s performance. An index of the validity of these evaluations may be estimated by the amount of agreement among the observers. In this paper, the semi-interquartile range and the intraclass correlation are discussed as possible indices of agreement, and another index is suggested, based on the range of observer ratings. Under the assumption that the distribution of ratings is uniform when ratings are randomly assigned, that is, the observers show no agreement, tables were constructed to indicate the probability of any range for selected numbers of observers and rating scale categories. Some applications for this index concern the training of observers, estimating the number of observers needed, and the construction of master scales.
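The null model described above — each observer independently assigning a rating category uniformly at random — lends itself to a quick Monte Carlo check. The sketch below (an illustrative function, not the paper's exact tables) estimates the probability of observing a given range of ratings by chance:

```python
import random

def range_agreement_pvalue(observed_range, n_observers, n_categories,
                           n_sims=20000, seed=0):
    """Estimate P(range <= observed_range) under the null hypothesis that
    each observer picks a category uniformly at random (no agreement).
    A small probability means the observed agreement is unlikely by chance."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        ratings = [rng.randint(1, n_categories) for _ in range(n_observers)]
        if max(ratings) - min(ratings) <= observed_range:
            hits += 1
    return hits / n_sims

# Example: 5 observers on a 7-point scale all rate within 1 category of each other.
p = range_agreement_pvalue(observed_range=1, n_observers=5, n_categories=7)
```

Under these assumed settings the estimate comes out around 0.01, i.e., agreement that tight would rarely arise from random rating.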

2021 ◽  
Vol 21 (1) ◽  
Author(s):  
Sooyoung Cho ◽  
Youn Jin Kim ◽  
Minjin Lee ◽  
Jae Hee Woo ◽  
Hyun Jung Lee

Abstract Background Pain assessment and management are important in postoperative circumstances because overdosing of opioids can induce respiratory depression and other critical consequences. This study aimed to assess the reliability of commonly used pain scales in a postoperative setting among Korean adults. We also intended to determine the cut-off points between mild and moderate pain and between moderate and severe pain, which can help guide decisions about pain medication. Methods A total of 180 adult patients undergoing elective non-cardiac surgery were included. Postoperative pain intensity was rated with a visual analog scale (VAS), numeric rating scale (NRS), faces pain scale revised (FPS-R), and verbal rating scale (VRS). The VRS rated pain according to four grades: none, mild, moderate, and severe. Pain assessments were performed twice: when the patients were alert enough to communicate after arrival at the postoperative care unit (PACU) and 30 min after arrival at the PACU. The levels of agreement among the scores were evaluated using intraclass correlation coefficients (ICCs). The cut-off points were determined by receiver operating characteristic curves. Results The ICCs among the VAS, NRS, and FPS-R were consistently high (0.839–0.945). The pain categories were as follows: mild ≤ 5.3 / moderate 5.4–7.1 / severe ≥ 7.2 on the VAS; mild ≤ 5 / moderate 6–7 / severe ≥ 8 on the NRS; mild ≤ 4 / moderate 6 / severe 8 and 10 on the FPS-R. The cut-off points for analgesic request were VAS ≥ 5.5, NRS ≥ 6, FPS-R ≥ 6, and VRS ≥ 2 (moderate or severe pain). Conclusions During the immediate postoperative period, the VAS, NRS, and FPS-R were well correlated. The boundary between mild and moderate pain was around five on the 10-point scales, and it corresponded to the cut-off point for analgesic request. Healthcare providers should consider the VRS and other patient-specific signs to avoid undertreatment of pain or overdosing of pain medication.
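Cut-off points of this kind are commonly derived from the receiver operating characteristic curve by maximizing Youden's J (sensitivity + specificity − 1). A minimal sketch with toy NRS scores and hypothetical analgesic-request labels (not the study's data):

```python
def youden_cutoff(scores, labels):
    """Find the score threshold maximizing Youden's J = sensitivity +
    specificity - 1, a common way to pick a cut-off from an ROC curve.
    labels: 1 = analgesic requested, 0 = not requested."""
    best_j, best_cut = -1.0, None
    for cut in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cut and y == 1)
        fn = sum(1 for s, y in zip(scores, labels) if s < cut and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < cut and y == 0)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cut and y == 0)
        sens = tp / (tp + fn) if tp + fn else 0.0
        spec = tn / (tn + fp) if tn + fp else 0.0
        j = sens + spec - 1
        if j > best_j:
            best_j, best_cut = j, cut
    return best_cut, best_j

# Toy NRS scores paired with invented analgesic-request labels.
nrs = [2, 3, 4, 5, 5, 6, 6, 7, 8, 9]
req = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
cut, j = youden_cutoff(nrs, req)  # for this toy data, cut lands at NRS >= 6
```

With these invented labels the optimal threshold coincides with the study's reported NRS cut-off of 6, though that is a property of the toy data, not a reproduction of the analysis.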


2020 ◽  
Vol 80 (4) ◽  
pp. 808-820
Author(s):  
Cindy M. Walker ◽  
Sakine Göçer Şahin

The purpose of this study was to investigate a new way of evaluating interrater reliability that can allow one to determine if two raters differ with respect to their rating on a polytomous rating scale or constructed response item. Specifically, differential item functioning (DIF) analyses were used to assess interrater reliability and compared with traditional interrater reliability measures. Three different procedures that can be used as measures of interrater reliability were compared: (1) intraclass correlation coefficient (ICC), (2) Cohen’s kappa statistic, and (3) DIF statistic obtained from Poly-SIBTEST. The results of this investigation indicated that DIF procedures appear to be a promising alternative to assess the interrater reliability of constructed response items, or other polytomous types of items, such as rating scales. Furthermore, using DIF to assess interrater reliability does not require a fully crossed design and allows one to determine if a rater is either more severe, or more lenient, in their scoring of each individual polytomous item on a test or rating scale.
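Of the traditional interrater measures compared here, Cohen's kappa is the simplest to state: it corrects raw percent agreement for the chance agreement implied by each rater's marginal distribution. A minimal sketch with made-up ratings from two raters (not data from the study):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa: chance-corrected agreement between two raters.
    kappa = (observed agreement - expected agreement) / (1 - expected)."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement from the product of the raters' marginal proportions.
    expected = sum(counts_a[c] * counts_b.get(c, 0) for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Invented scores from two raters on a 4-category rating scale.
a = [1, 2, 2, 3, 3, 3, 4, 4, 1, 2]
b = [1, 2, 3, 3, 3, 2, 4, 4, 1, 2]
kappa = cohens_kappa(a, b)  # about 0.73 for this toy data
```

Note that, unlike the DIF-based approach described above, kappa requires both raters to score the same responses (a fully crossed design).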


2021 ◽  
pp. 1-21
Author(s):  
Elsa Arrua-Duarte ◽  
Marta Migoya-Borja ◽  
Igor Barahona ◽  
Lena C. Quilty ◽  
Sakina J. Rizvi ◽  
...  

Abstract Objective: The Dimensional Anhedonia Rating Scale (DARS) is a recently validated questionnaire for assessing anhedonia. In this work, we aim to study the equivalence between the traditional paper-and-pencil format and the digital format of the DARS. Methods: 69 patients completed the DARS in both the paper-based and digital versions. We assessed differences between formats (Wilcoxon test), validity of the scales (kappa and intraclass correlation coefficients), and reliability (Cronbach's alpha and Guttman's coefficient). We calculated the Comparative Fit Index and the root mean squared error associated with the proposed one-factor structure. Results: Total scores were higher for the paper-based format. Significant differences between the two formats were found for three items. The weighted kappa coefficient was approximately 0.40 for most of the items. Internal consistency was greater than 0.94, and the intraclass correlation coefficient was 0.95 for the digital version and 0.94 for the paper-and-pencil version (F = 16.7, p < 0.001). The Comparative Fit Index was 0.97 for both the digital and the paper-and-pencil DARS, and the root mean squared error was 0.11 for the digital DARS and 0.10 for the paper-and-pencil DARS. Conclusion: The digital DARS is consistent in many respects with the paper-and-pencil questionnaire, but equivalence between the two formats cannot be assumed without caution.


10.2196/20172 ◽  
2021 ◽  
Vol 4 (1) ◽  
pp. e20172
Author(s):  
Masanori Tanaka ◽  
Manabu Saito ◽  
Michio Takahashi ◽  
Masaki Adachi ◽  
Kazuhiko Nakamura

Background Early detection and intervention for neurodevelopmental disorders are effective. Several types of paper questionnaires have been developed to assess these conditions in early childhood; however, the psychometric equivalence between the web-based and paper versions of these questionnaires is unknown. Objective This study examined the interformat reliability of the web-based parent-rated versions of the Autism Spectrum Screening Questionnaire (ASSQ), Attention-Deficit/Hyperactivity Disorder Rating Scale (ADHD-RS), Developmental Coordination Disorder Questionnaire 2007 (DCDQ), and Strengths and Difficulties Questionnaire (SDQ) among Japanese preschoolers in a community developmental health check-up setting. Methods A set of paper-based questionnaires was distributed for voluntary completion to parents of children aged 5 years. The package of paper-format questionnaires included the ASSQ, ADHD-RS, DCDQ, parent-reported SDQ (P-SDQ), and several additional demographic questions. Responses were received from 508 parents of children who agreed to participate in the study. After 3 months, 300 parents from among the initial responders were randomly selected and asked to complete the web-based versions of these questionnaires. A total of 140 parents replied to the web-based format and were included as the final sample in this study. Results We obtained the McDonald ω coefficients for both the web-based and paper formats of the ASSQ (web-based: ω=.90; paper: ω=.86), ADHD-RS total and subscales (web-based: ω=.88-.94; paper: ω=.87-.93), DCDQ total and subscales (web-based: ω=.82-.94; paper: ω=.74-.92), and P-SDQ total and subscales (web-based: ω=.55-.81; paper: ω=.52-.80). 
The intraclass correlation coefficients between the web-based and paper formats were all significant at the .001 level: ASSQ (r=0.66, P<.001); ADHD-RS total and subscales (r=0.66-0.74, P<.001); DCDQ total and subscales (r=0.66-0.71, P<.001); P-SDQ Total Difficulties and subscales (r=0.55-0.73, P<.001). There were no significant differences between the web-based and paper formats for the total mean score of the ASSQ (P=.76), the total (P=.12) and subscale (P=.11-.47) mean scores of the DCDQ, and the P-SDQ Total Difficulties mean score (P=.20) and mean subscale scores (P=.28-.79). Although significant differences were found between the web-based and paper formats for mean ADHD-RS scores (total: t132=2.83, P=.005; Inattention subscale: t133=2.15, P=.03; Hyperactivity/Impulsivity subscale: t133=3.21, P=.002), the effect sizes were small (Cohen d=0.18-0.22). Conclusions These results suggest that the web-based versions of the ASSQ, ADHD-RS, DCDQ, and P-SDQ were equivalent to the paper versions, with the same level of internal consistency and intrarater reliability, indicating the applicability of the web-based versions of these questionnaires for assessing neurodevelopmental disorders.
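Interformat agreement of this kind is typically quantified with a two-way intraclass correlation. As an illustrative sketch, the function below computes ICC(2,1) (two-way random effects, absolute agreement, single measure) from first principles; the scores are invented, not the study's:

```python
def icc2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single measure.
    ratings: one inner list per subject, one column per format/rater."""
    n = len(ratings)      # subjects
    k = len(ratings[0])   # formats (here: web-based vs paper)
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(ratings[i][j] for i in range(n)) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)               # between-subjects mean square
    msc = ss_cols / (k - 1)               # between-formats mean square
    mse = ss_err / ((n - 1) * (k - 1))    # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Invented total scores for six children, paper format vs web-based format.
paper = [10, 14, 18, 22, 25, 30]
web = [11, 13, 19, 21, 26, 29]
icc = icc2_1([[p, w] for p, w in zip(paper, web)])  # close to 1 here
```

Because the two columns track each other closely relative to the between-subject spread, the toy ICC comes out near 0.99; real interformat ICCs, like the 0.55–0.74 reported above, are lower because the repeated measurements disagree more.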


2019 ◽  
Author(s):  
Marco Bardus ◽  
Nathalie Awada ◽  
Lilian A Ghandour ◽  
Elie-Jacques Fares ◽  
Tarek Gherbal ◽  
...  

BACKGROUND With thousands of health apps in app stores globally, it is crucial to evaluate their quality systematically and thoroughly, given their potential influence on health decisions and outcomes. The Mobile App Rating Scale (MARS) is the only currently available tool that provides a comprehensive, multidimensional evaluation of app quality; it has been used to compare medical apps from American and European app stores in various areas and is available in English, Italian, Spanish, and German. However, this tool is not available in Arabic. OBJECTIVE This study aimed to translate and adapt the MARS to Arabic and validate the tool with a sample of health apps aimed at managing or preventing obesity and associated disorders. METHODS We followed a well-established and defined “universalist” process of cross-cultural adaptation using a mixed methods approach. Initial translations of the tool were refined through two rounds of separate discussions to confirm the contents, culminating in a final version, which was then back-translated into English. Two trained researchers piloted the Arabic MARS (MARS-Ar) with a sample of 10 weight management apps obtained from Google Play and the App Store. Interrater reliability was established using intraclass correlation coefficients (ICCs). After reliability was ascertained, the two researchers independently evaluated an additional set of 56 apps. RESULTS The MARS-Ar was highly aligned with the original English version. The ICCs for the MARS-Ar (0.836, 95% CI 0.817-0.853) and MARS English (0.838, 95% CI 0.819-0.855) were good. The MARS-Ar subscales were highly correlated with their original counterparts (P<.001). The lowest correlation was observed for usability (r=0.685), followed by aesthetics (r=0.827), information quality (r=0.854), engagement (r=0.894), and total app quality (r=0.897). Subjective quality was also highly correlated (r=0.820). 
CONCLUSIONS MARS-Ar is a valid instrument to assess app quality among trained Arabic-speaking users of health and fitness apps. Researchers and public health professionals in the Arab world can use the overall MARS score and its subscales to reliably evaluate the quality of weight management apps. Further research is necessary to test the MARS-Ar on apps addressing various health issues, such as attention or anxiety prevention, or sexual and reproductive health.


2015 ◽  
Vol 28 (5) ◽  
pp. 845-851 ◽  
Author(s):  
James Fitzgerald ◽  
Niamh O’Regan ◽  
Dimitrios Adamis ◽  
Suzanne Timmons ◽  
Colum Dunne ◽  
...  

ABSTRACT Background: Delirium is a common neuropsychiatric syndrome that includes clinical subtypes identified by the Delirium Motor Subtyping Scale (DMSS). We explored the concordance between the DMSS and an abbreviated 4-item version in elderly medical inpatients. Methods: Elderly general medical admissions (n = 145) were assessed for delirium using the Revised Delirium Rating Scale (DRS-R98). Clinical subtype was assessed with the DMSS (which includes the four items of the DMSS-4). Motor subtypes were generated for all patient assessments using both versions of the scale. The concordance of the original and abbreviated DMSS was examined. Results: The agreement between the DMSS and DMSS-4 was high, both at initial and subsequent assessments (κ range 0.75–0.91). The intraclass correlation coefficient (ICC) across all three raters was high for the DMSS (0.70) and moderate for the DMSS-4 (0.59). Analysis of the agreement between raters on individual DMSS items found higher concordance for hypoactive features than for hyperactive features. Conclusions: The DMSS-4 allows rapid assessment of clinical subtype in delirium and has high concordance with the longer, well-validated DMSS, including over longitudinal assessment. There is good inter-rater reliability between medical and nursing staff. More consistent clinical subtyping can facilitate better delirium management and more focused research effort.


2015 ◽  
Vol 2015 ◽  
pp. 1-7 ◽  
Author(s):  
Nathalie Rommel ◽  
Charlotte Borgers ◽  
Dirk Van Beckevoort ◽  
Ann Goeleven ◽  
Eddy Dejaeger ◽  
...  

Background. We aimed to validate an easy-to-use videofluoroscopic analysis tool, the bolus residue scale (BRS), for detection and classification of pharyngeal retention in the valleculae, the piriform sinuses, and/or on the posterior pharyngeal wall. Methods. 50 randomly selected videofluoroscopic images of 10 mL swallows (recorded in 18 dysphagia patients and 8 controls) were analyzed by 4 expert and 6 nonexpert observers. A score from 1 to 6 was assigned according to the number of structures affected by residue. Inter- and intrarater reliabilities were assessed by calculating intraclass correlation coefficients (ICCs) for expert and nonexpert observers. Sensitivity, specificity, and interrater agreement were analyzed for different BRS levels. Results. Intrarater reproducibility was almost perfect for experts (mean ICC 0.972) and ranged from substantial to almost perfect for nonexperts (mean ICC 0.835). Interjudge agreement among the experts ranged from substantial to almost perfect (mean ICC 0.780), whereas interrater reliability among nonexperts ranged from substantial to good (mean ICC 0.719). For experts, the BRS showed high sensitivity and specificity; for nonexperts, low sensitivity but high specificity. Conclusions. The BRS is a simple, easy-to-use, and accessible rating scale for locating pharyngeal retention on videofluoroscopic images, with good specificity and reproducibility for observers of different expertise levels.


2018 ◽  
Vol 25 (6) ◽  
Author(s):  
A.T.P.M. Brands-Appeldoorn ◽  
A.J.G. Maaskant-Braat ◽  
W.A.R Zwaans ◽  
J.P. Dieleman ◽  
K.E. Schenk ◽  
...  

Background In the present study, we set out to compare patient-reported outcomes with professional judgment of cosmesis after breast-conserving therapy (BCT) and to evaluate which items (position of the nipple, color, scar, size, shape, and firmness) correlate best with the subjective outcome. Methods Dutch patients treated with BCT between 2008 and 2009 were analyzed. Exclusion criteria were prior amputation or BCT of the contralateral breast, metastatic disease, local recurrence, or any prior cosmetic breast surgery. Structured questionnaires and standardized six-view photographs were obtained with a minimum of 3 years' follow-up. Cosmetic outcome was judged by the patients and, based on the photographs, by 5 different medical professionals using 3 different scoring systems: the Harvard scale, the Sneeuw questionnaire, and a numeric rating scale. Agreement was scored using the intraclass correlation coefficient (ICC). The association between items of the Sneeuw questionnaire and a fair–poor Harvard score was estimated using logistic regression analysis. Results The study included 108 female patients (age: 40–91 years). Based on the Harvard scale, agreement on cosmetic outcome between the professionals was good (ICC: 0.78). In contrast, agreement between the professionals as a group and the patients was found to be fair to moderate (ICC range: 0.38–0.50). The items “size” and “shape” were identified as the strongest determinants of cosmetic outcome. Conclusions Cosmetic outcome was scored differently by patients and professionals. Agreement was greater between the professionals than between the patients and the professionals as a group. In general, size and shape were the most prominent items on which cosmetic outcome was judged by patients and professionals alike.


2019 ◽  
Vol 65 (4) ◽  
pp. 237-244 ◽  
Author(s):  
Clément Dondé ◽  
Frédéric Haesebaert ◽  
Emmanuel Poulet ◽  
Marine Mondino ◽  
Jérôme Brunelin

Objective: The aim of this study was to validate the French version of the 7-item Auditory Hallucination Rating Scale (AHRS) so as to facilitate fine-grained assessment of auditory hallucinations (AH) in native French-speaking patients with schizophrenia (SZ) in clinical settings and studies. Method: Patients ( N = 66) were diagnosed with SZ according to the Diagnostic and Statistical Manual of Mental Disorders. The French version of the AHRS was developed using a forward–backward translation procedure. Psychometric properties of the French version of the AHRS were tested, including (i) construct validity with a confirmatory one-factor analysis, (ii) internal validity with Pearson correlations and Cronbach α coefficients, and (iii) external validity by correlations with the Scale for Assessment of Positive Symptoms (SAPS-H1), the Positive and Negative Syndrome Scale (PANSS-P3; concurrent), the PANSS-Negative subscale and age of subjects (divergent), and inter-rater intraclass correlation coefficients (ICCs). Results: (i) The confirmatory one-factor analysis found a root mean square error of approximation (RMSEA) = 0.00, 90% confidence interval = [0.000 to 0.011], and a comparative fit index = 0.994. (ii) Correlations between AHRS total score and individual items were mostly ≥0.4. The Cronbach α coefficient was 0.61. (iii) Correlations with PANSS-P3 and SAPS-H1 were 0.42 and 0.53, respectively. In a subset of participants ( N = 16), ICC values were extremely high and significant for AHRS total and individual item scores (ICCs range 0.899 to 0.996). Conclusion: The French version of the AHRS is a psychometrically acceptable instrument for the evaluation of AH severity in French-speaking patients with SZ.
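The internal-consistency coefficient used above, Cronbach's α, is computed directly from the item variances and the variance of the total score. A minimal sketch with invented item-by-subject ratings (not the study's data):

```python
def cronbach_alpha(items):
    """Cronbach's alpha: internal consistency of a set of scale items.
    items: one inner list per item, each holding the scores of all subjects."""
    k = len(items)       # number of items
    n = len(items[0])    # number of subjects

    def variance(xs):
        # Population variance, the usual convention for alpha.
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    item_var_sum = sum(variance(item) for item in items)
    totals = [sum(item[i] for item in items) for i in range(n)]
    return k / (k - 1) * (1 - item_var_sum / variance(totals))

# Three invented items rated by five subjects on a 1-5 scale.
items = [
    [3, 4, 2, 5, 4],
    [3, 5, 2, 4, 4],
    [2, 4, 3, 5, 3],
]
alpha = cronbach_alpha(items)  # about 0.87 for this toy data
```

When the items covary strongly, as in this toy example, α is high; a value like the 0.61 reported above reflects weaker inter-item covariance.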


2019 ◽  
Vol 91 (1) ◽  
pp. 75-81 ◽  
Author(s):  
Leonhard A Bakker ◽  
Carin D Schröder ◽  
Harold H G Tan ◽  
Simone M A G Vugts ◽  
Ruben P A van Eijk ◽  
...  

Objective The Amyotrophic Lateral Sclerosis Functional Rating Scale-Revised (ALSFRS-R) is widely applied to assess disease severity and progression in patients with motor neuron disease (MND). The objective of this study is to assess the inter-rater and intra-rater reproducibility, i.e., the inter-rater and intra-rater reliability and agreement, of a self-administration version of the ALSFRS-R for use in apps, online platforms, clinical care, and trials. Methods The self-administration version of the ALSFRS-R was developed based on both patient and expert feedback. To assess inter-rater reproducibility, 59 patients with MND filled out the ALSFRS-R online and were subsequently assessed on the ALSFRS-R by three raters. To assess intra-rater reproducibility, patients were invited on two occasions to complete the ALSFRS-R online. Reliability was assessed with intraclass correlation coefficients, agreement was assessed with Bland-Altman plots and paired-samples t-tests, and internal consistency was examined with Cronbach's coefficient alpha. Results The self-administration version of the ALSFRS-R demonstrated excellent inter-rater and intra-rater reliability. The assessment of inter-rater agreement demonstrated small systematic differences between patients and raters and acceptable limits of agreement. The assessment of intra-rater agreement demonstrated no systematic changes between time points; limits of agreement were 4.3 points for the total score and ranged from 1.6 to 2.4 points for the domain scores. Coefficient alpha values were acceptable. Discussion The self-administration version of the ALSFRS-R demonstrates high reproducibility and can be used in apps and online portals for both individual comparisons, facilitating the management of clinical care, and group comparisons in clinical trials.
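The Bland-Altman agreement analysis referred to above reduces to the mean of the paired differences (the bias) and its 95% limits of agreement, bias ± 1.96 SD of the differences. A minimal sketch with invented self-rated and rater-assessed total scores (not the study's data):

```python
from statistics import mean, stdev

def bland_altman_limits(scores_a, scores_b):
    """Bland-Altman analysis: mean paired difference (bias) and the 95%
    limits of agreement (bias +/- 1.96 * SD of the differences)."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    bias = mean(diffs)
    sd = stdev(diffs)  # sample SD of the paired differences
    return bias, (bias - 1.96 * sd, bias + 1.96 * sd)

# Invented ALSFRS-R-style total scores: self-administered vs rater-assessed.
self_rated = [40, 38, 35, 30, 44, 28, 41, 36]
rater = [41, 37, 36, 31, 43, 29, 40, 37]
bias, (lo, hi) = bland_altman_limits(self_rated, rater)
```

A bias near zero with narrow limits, as in this toy data, is the pattern the abstract describes as "no systematic changes between time points" with "acceptable limits of agreement."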

