MP23: Giving medical students what they deserve - a rigorous, equitable and defensible CaRMS selection process

CJEM ◽  
2019 ◽  
Vol 21 (S1) ◽  
pp. S50
Author(s):  
Q. Paterson ◽  
R. Hartmann ◽  
R. Woods ◽  
L. Martin ◽  
B. Thoma

Innovation Concept: The fairness of the Canadian Residency Matching Service (CaRMS) selection process has been called into question by rising rates of unmatched medical students and reports of bias and subjectivity. We outline how the University of Saskatchewan Royal College emergency medicine program evaluates CaRMS applications in a standardized, rigorous, equitable and defensible manner. Methods: Our CaRMS applicant evaluation methods were first utilized in the 2017 CaRMS cycle, based on published Best Practices, and have been refined yearly to ensure validity, standardization, defensibility, rigour, and to improve the speed and flow of data processing. To determine the reliability of the total application scores for each rater, single measures intraclass correlation coefficients (ICCs) were calculated using a random effects model in 2017 and 2018. Curriculum, Tool or Material: A secure, online spreadsheet was created that includes applicant names, reviewer assignments, data entry boxes, and formulas. Each file reviewer entered data in a dedicated sheet within the document. Each application was reviewed by two staff physicians and two to four residents. File reviewers used a standardized, criterion-based scoring rubric for each application component. The file score for each reviewer-applicant pair was converted into a z-score based on each reviewer's distribution of scores. Z-scores of all reviewers for a single applicant were then combined by weighted average, with the group of staff and group of residents each being weighted to represent half of the final file score. The ICC for the total raw scores improved from 0.38 (poor) in 2017 to 0.52 (moderate) in 2018. The data from each reviewer was amalgamated into a master sheet where applicants were sorted by final file score and heat-mapped to offer a visual aid regarding differences in ratings. 
Conclusion: Our innovation uses heat-mapped and formula-populated spreadsheets, scoring rubrics, and z-scores to normalize variation in scoring trends between reviewers. We believe this approach provides a rigorous, defensible, and reproducible process by which Canadian residency programs can appraise applicants and create a rank order list.
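The normalization step described in the abstract (per-reviewer z-scores, then a weighted average in which staff and residents each contribute half) can be sketched as follows. This is a minimal illustration, not the authors' spreadsheet formulas: the function names, the use of the population standard deviation, the simple mean within each group, and all scores are assumptions.

```python
from statistics import mean, pstdev

def reviewer_z_scores(raw_scores):
    """Standardize one reviewer's raw file scores against that reviewer's own distribution."""
    values = list(raw_scores.values())
    mu = mean(values)
    sd = pstdev(values)  # assumption: population SD; the abstract does not say which SD was used
    if sd == 0:
        raise ValueError("reviewer gave every applicant the same score")
    return {applicant: (score - mu) / sd for applicant, score in raw_scores.items()}

def final_file_score(applicant, staff_z, resident_z):
    """Average within each group, then weight the staff and resident groups at one half each."""
    staff = [z[applicant] for z in staff_z if applicant in z]
    residents = [z[applicant] for z in resident_z if applicant in z]
    return 0.5 * mean(staff) + 0.5 * mean(residents)

# Hypothetical raw scores: a lenient and a harsh staff reviewer, plus one resident.
lenient = reviewer_z_scores({"A": 70, "B": 80, "C": 90})
harsh = reviewer_z_scores({"A": 30, "B": 40, "C": 50})
resident = reviewer_z_scores({"A": 6, "B": 9, "C": 7})

# Sort applicants by final file score, highest first.
ranking = sorted(
    "ABC",
    key=lambda a: final_file_score(a, [lenient, harsh], [resident]),
    reverse=True,
)
print(ranking)
```

In this toy data the lenient and harsh reviewers produce identical z-scores even though their raw scores differ by 40 points, which is the kind of between-reviewer variation the z-score step is meant to remove.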

1999 ◽  
Vol 8 (4) ◽  
pp. 254-261 ◽  
Author(s):  
J Powers ◽  
SJ Bennett

BACKGROUND: Dyspnea, or difficult breathing, is common in patients receiving mechanical ventilation; however, dyspnea is not routinely or systematically measured. OBJECTIVE: The primary purpose of this methodological study was to evaluate the test-retest reliability of 5 dyspnea rating scales and the criterion validity of 4 dyspnea rating scales in patients receiving mechanical ventilation. The secondary purpose was to examine the correlations between each of these 5 rating scales and physiological measures of respiratory function. METHODS: The convenience sample consisted of 28 patients on mechanical ventilation during their hospitalization in the intensive care units of a large, inner-city hospital. Patients rated their dyspnea twice at 30-minute intervals on the visual analogue scale, the vertical analogue dyspnea scale, the modified Borg scale, the numerical scale, and the faces scale. Test-retest reliability was computed by using the intraclass correlation coefficient. Criterion validity was evaluated by using the Spearman rank-order correlation coefficient. RESULTS: The 5 rating scales had acceptable test-retest reliabilities, with intraclass correlation coefficients ranging from 0.81 to 0.97. Criterion validity of the 4 scales also was acceptable, with Spearman rank-order correlation coefficients from 0.76 to 0.96. The rating scales were not correlated with most of the physiological variables. At least half of the patients reported moderate to severe dyspnea. CONCLUSION: The scales showed acceptable reliability and validity, and they will be useful in quantifying dyspnea experienced by patients receiving mechanical ventilation. Further work is needed to evaluate the extent and the severity of dyspnea in such patients in order to evaluate the effectiveness of interventions.


2004 ◽  
Vol 84 (10) ◽  
pp. 906-918 ◽  
Author(s):  
Diane M Wrisley ◽  
Gregory F Marchetti ◽  
Diane K Kuharsky ◽  
Susan L Whitney

Background and Purpose. The Functional Gait Assessment (FGA) is a 10-item gait assessment based on the Dynamic Gait Index. The purpose of this study was to evaluate the reliability, internal consistency, and validity of data obtained with the FGA when used with people with vestibular disorders. Subjects. Seven physical therapists from various practice settings, 3 physical therapist students, and 6 patients with vestibular disorders volunteered to participate. Methods. All raters were given 10 minutes to review the instructions, the test items, and the grading criteria for the FGA. The 10 raters concurrently rated the performance of the 6 patients on the FGA. Patients completed the FGA twice, with an hour's rest between sessions. Reliability of total FGA scores was assessed using intraclass correlation coefficients (2,1). Internal consistency of the FGA was assessed using the Cronbach alpha and confirmatory factor analysis. Concurrent validity was assessed using the correlation of the FGA scores with balance and gait measurements. Results. Intraclass correlation coefficients of .86 and .74 were found for interrater and intrarater reliability of the total FGA scores. Internal consistency of the FGA scores was .79. Spearman rank order correlation coefficients of the FGA scores with balance measurements ranged from .11 to .67. Discussion and Conclusion. The FGA demonstrates what we believe is acceptable reliability, internal consistency, and concurrent validity with other balance measures used for patients with vestibular disorders.
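For readers unfamiliar with the "(2,1)" notation, the sketch below computes a single-rater, two-way random-effects, absolute-agreement ICC from a subjects-by-raters table. It assumes the abstract's "(2,1)" refers to the Shrout and Fleiss ICC(2,1) form; the function name and the ratings are hypothetical and are not data from the study.

```python
def icc_2_1(ratings):
    """ICC(2,1) from a table with one row per subject and one column per rater."""
    n = len(ratings)      # number of subjects
    k = len(ratings[0])   # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)                # between-subjects mean square
    ms_cols = ss_cols / (k - 1)                # between-raters mean square
    ms_error = ss_error / ((n - 1) * (k - 1))  # residual mean square

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )

# Hypothetical total scores for 4 patients, each rated by 3 raters.
scores = [
    [22, 24, 23],
    [15, 17, 14],
    [28, 27, 29],
    [19, 21, 20],
]
print(round(icc_2_1(scores), 2))
```

Because the rater mean square appears in the denominator, systematic differences between raters lower this coefficient, which is why it is described as an absolute-agreement measure.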


2012 ◽  
Vol 92 (6) ◽  
pp. 841-852 ◽  
Author(s):  
Alexandra De Kegel ◽  
Tina Baetens ◽  
Wim Peersman ◽  
Leen Maes ◽  
Ingeborg Dhooge ◽  
...  

Background Balance is a fundamental component of movement. Early identification of balance problems is important to plan early intervention. The Ghent Developmental Balance Test (GDBT) is a new assessment tool designed to monitor balance from the initiation of independent walking to 5 years of age. Objective The purpose of this study was to establish the psychometric characteristics of the GDBT. Methods To evaluate test-retest reliability, 144 children were tested twice on the GDBT by the same examiner, and to evaluate interrater reliability, videotaped GDBT sessions of 22 children were rated by 3 different raters. To evaluate the known-group validity of GDBT scores, z scores on the GDBT were compared between a clinical group (n=20) and a matched control group (n=20). Concurrent validity of GDBT scores with the subscale standardized scores of the Movement Assessment Battery for Children–Second Edition (M-ABC-2), the Peabody Developmental Motor Scales–Second Edition (PDMS-2), and the balance subscale of the Bruininks-Oseretsky Test–Second Edition (BOT-2) was evaluated in a combined group of the 20 children from the clinical group and 74 children who were developing typically. Results Test-retest and interrater reliability were excellent for the GDBT total scores, with intraclass correlation coefficients of .99 and .98, standard error of measurement values of 0.21 and 0.78, and small minimal detectable differences of 0.58 and 2.08, respectively. The GDBT was able to distinguish between the clinical group and the control group (t38=5.456, P<.001). Pearson correlations between the z scores on GDBT and the standardized scores of specific balance subscales of the M-ABC-2, PDMS-2, and BOT-2 were moderate to high, whereas correlations with subscales measuring constructs other than balance were low. Conclusions The GDBT is a reliable and valid clinical assessment tool for the evaluation of balance in toddlers and preschool-aged children.
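The standard error of measurement (SEM) and the minimal detectable difference (MDD) are conventionally derived from the ICC as SEM = SD × √(1 − ICC) and MDD = 1.96 × √2 × SEM. The sketch below applies the second formula to the SEM values reported in the abstract. Whether the authors used exactly these formulas is an assumption, and the function names are hypothetical.

```python
import math

def sem_from_icc(sd, icc):
    """Standard error of measurement from a sample SD and a reliability coefficient."""
    return sd * math.sqrt(1 - icc)

def mdd95(sem):
    """Minimal detectable difference at the 95% confidence level."""
    return 1.96 * math.sqrt(2) * sem

print(round(mdd95(0.21), 2))  # test-retest SEM from the abstract
print(round(mdd95(0.78), 2))  # interrater SEM from the abstract
```

Under these assumptions the test-retest value reproduces the reported 0.58, while the interrater value comes out at 2.16 rather than the reported 2.08, so the authors' exact computation or rounding may differ from this sketch.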


2016 ◽  
Vol 75 (1) ◽  
Author(s):  
Nishanee Rampersad ◽  
Rekha Hansraj

Background: Accurate assessment of corneal thickness is essential in corneal refractive surgery, contact lens wear and corneal pathology. Aim: To assess the repeatability (intra-observer, inter-observer and inter-session) of central (0 mm – 2 mm), mid-peripheral (2 mm – 5 mm) and peripheral (5 mm – 6 mm) corneal thickness measurements using the iVue 100 spectral domain optical coherence tomographer (SD-OCT). Setting: Optometry Eye Clinic at the University of KwaZulu-Natal (UKZN). Methods: Corneal thickness measurements were taken on 50 healthy participants by two observers independently. A second set of readings was taken by one observer in a separate session. Repeatability was assessed using Bland–Altman analysis, the intraclass correlation coefficient, the coefficient of variation and one-way analysis of variance (ANOVA). Results: For all corneal regions, the intraclass correlation coefficients for observer one ranged from 0.942 to 0.999 and those for observer two ranged from 0.946 to 0.999, indicating good intra-observer repeatability. Using linear regression, the corneal thickness measurements were found to be comparable (within 1 µm of each other) in all regions with the exception of the nasal and temporal mid-periphery and periphery. The inter-session repeatability was based on the measurements of observer one only, with the mean differences ranging from 0.02 µm to 0.63 µm. Linear regression revealed no significant differences between session 1 and session 2 (p > 0.05) except for the measurement of minimum corneal thickness. Conclusion: This study found evidence of good intra-observer, inter-observer and inter-session repeatability of central, mid-peripheral and peripheral corneal measurements with the iVue 100 SD-OCT.
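A Bland–Altman analysis of the kind named in the Methods summarizes agreement between two sets of paired readings as a mean difference (bias) with 95% limits of agreement at bias ± 1.96 standard deviations of the differences. The sketch below is a minimal illustration; the function name and the thickness readings are hypothetical and are not data from the study.

```python
from statistics import mean, stdev

def bland_altman(readings_a, readings_b):
    """Return (bias, lower limit, upper limit) for paired measurements."""
    diffs = [a - b for a, b in zip(readings_a, readings_b)]
    bias = mean(diffs)
    half_width = 1.96 * stdev(diffs)  # sample SD of the paired differences
    return bias, bias - half_width, bias + half_width

# Hypothetical central corneal thickness readings in µm from two observers.
observer_1 = [540, 552, 530, 561, 548]
observer_2 = [542, 550, 531, 559, 549]

bias, lower, upper = bland_altman(observer_1, observer_2)
print(f"bias = {bias:.2f} µm, limits of agreement = [{lower:.2f}, {upper:.2f}] µm")
```

Narrow limits centred near zero indicate that the two observers can be used interchangeably for practical purposes; how narrow is narrow enough is a clinical judgment rather than a statistical one.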


2020 ◽  
pp. bmjstel-2020-000705
Author(s):  
Benjamin Clarke ◽  
Samantha E Smith ◽  
Emma Claire Phillips ◽  
Ailsa Hamilton ◽  
Joanne Kerins ◽  
...  

Introduction: Non-technical skills are recognised to play an integral part in safe and effective patient care. Medi-StuNTS (Medical Students’ Non-Technical Skills) is a behavioural marker system developed to enable assessment of medical students’ non-technical skills. This study aimed to assess whether newly trained raters with high levels of clinical experience could achieve reliability coefficients of >0.7 and to compare differences in inter-rater reliability of raters with varying clinical experience. Methods: Forty-four raters attended a workshop on Medi-StuNTS before independently rating three videos of medical students participating in immersive simulation scenarios. Data were grouped by raters’ levels of clinical experience. Inter-rater reliability was assessed by calculating intraclass correlation coefficients (ICC). Results: Eleven raters with more than 10 years of clinical experience achieved single-measure ICC of 0.37 and average-measures ICC of 0.87. Fourteen raters with more than or equal to 5 years and less than 10 years of clinical experience achieved single-measure ICC of 0.09 and average-measures ICC of 0.59. Nineteen raters with less than 5 years of clinical experience achieved single-measure ICC of 0.09 and average-measures ICC of 0.65. Conclusions: Using 11 newly trained raters with high levels of clinical experience produced highly reliable ratings that surpassed the prespecified inter-rater reliability standard; however, a single rater from this group would not achieve sufficiently reliable ratings. This is consistent with previous studies using other medical behavioural marker systems. This study demonstrated a decrease in inter-rater reliability of raters with lower levels of clinical experience, suggesting caution when using this population as raters for assessment of non-technical skills.
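The gap between the single-measure and average-measures ICCs follows from the Spearman-Brown relation, ICC_avg = k × ICC_single / (1 + (k − 1) × ICC_single), where k is the number of raters. The sketch below applies it to the figures reported in the abstract; the function name is hypothetical, and the assumption is that both coefficients in each group come from the same ICC model.

```python
def average_measures_icc(single_icc, k):
    """Spearman-Brown step-up: reliability of the mean of k raters."""
    return k * single_icc / (1 + (k - 1) * single_icc)

# (group, single-measure ICC, number of raters) as reported in the abstract.
groups = [
    (">10 years", 0.37, 11),
    ("5 to <10 years", 0.09, 14),
    ("<5 years", 0.09, 19),
]

for label, single, k in groups:
    print(label, round(average_measures_icc(single, k), 2))
# prints 0.87, 0.58 and 0.65
```

The first and third values match the reported 0.87 and 0.65. The middle value comes out at 0.58 against the reported 0.59, a gap that is consistent with the single-measure ICC having been rounded to two decimals.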


2010 ◽  
Vol 7 (5) ◽  
pp. 649-657 ◽  
Author(s):  
Kelley K. Pettee Gabriel ◽  
Rebecca L. Rankin ◽  
Chong Lee ◽  
Mary E. Charlton ◽  
Pamela D. Swan ◽  
...  

Background: The 400 m walk test has been used in older adults; however, the applicability in middle-aged populations is unknown. Methods: Data were obtained from the Evaluation of Physical Activity Measures in Middle-Aged Women (PAW) Study and included 66 women (52.6 ± 5.4 years). Participants were instructed to walk at a brisk, maintainable pace; time taken to complete the 400 m was recorded in seconds. Intraclass correlation coefficients (ICC) were used to assess test-retest reliability. Spearman rank order correlation coefficients were used to examine the concurrent validity of the walk test with cardiorespiratory fitness and associations with physical activity, body composition, flexibility, static balance, and muscular fitness, adjusted for age and body mass index. Results: Participants completed the walk at visits 4 and 5 in 248.0 and 245.0 seconds, respectively. The walk test had excellent reproducibility [ICC = 0.95 (95% CI: 0.92, 0.97)] and was significantly associated with estimated (ρ = −0.43; P < 0.0001) and measured (ρ = −0.56; P < 0.001) VO2max. The walk test was also significantly related to physical activity, body composition, flexibility, and balance. Conclusions: These findings support the utility of the 400 m walk test to estimate cardiorespiratory fitness and reflect free-living physical activity in healthy, middle-aged women.


2002 ◽  
Vol 11 (3) ◽  
pp. 190-201 ◽  
Author(s):  
Michael A. Tabor ◽  
George J. Davies ◽  
Thomas W. Kernozek ◽  
Rodney J. Negrete ◽  
Vincent Hudson

Context: Many clinicians use functional-performance tests to determine an athlete’s readiness to resume activity; however, research demonstrating reliability of these tests is limited. Objective: To introduce the Lower Extremity Functional Test (LEFT) and establish it as a reliable assessment tool. Design: Week 1: Subjects participated in a training session. Week 2: Initial maximal-effort time measurements were recorded. Week 3: Retest time measurements were recorded. Setting: The University of Wisconsin–La Crosse (UW-L) and the University of Central Florida (UCF). Subjects: 27 subjects from UW-L and 30 from UCF. Main Outcome Measures: Time measurements were analyzed using intraclass correlation coefficients (ICCs). Results: ICC values of .95 and .97 were established at UW-L and UCF, respectively. Conclusions: The LEFT is a reliable assessment tool.


1998 ◽  
Vol 32 (3) ◽  
pp. 201-208 ◽  
Author(s):  
Darci N. Santos ◽  
Robert Blizard ◽  
Anthony H. Mann

INTRODUCTION: Among psychiatric disorders, schizophrenia is often said to be the condition with the most disputed definition. The Bleulerian and Schneiderian approaches have given rise to diagnostic formulations that have varied with time and place. Controversies over the concept of schizophrenia were examined within European/North American settings in the early 1970s, but little has since been reported on the views of psychiatrists in developing countries. In Brazil both concepts are referred to in the literature. A scale was developed to measure adherence to Bleulerian and Schneiderian concepts among psychiatrists working in S. Paulo. METHODOLOGY: A self-reported questionnaire comprising seventeen visual analogue-scale statements related to Bleulerian and Schneiderian definitions of schizophrenia, plus sociodemographic and training characteristics, was distributed to a non-randomised sample of 150 psychiatrists. The two sub-scales were assessed by psychometric methods for internal consistency, sub-scale structure and test-retest reliability. Items selected according to internal consistency were examined by a two-factor model exploratory factor analysis. Intraclass correlation coefficients described the stability of the scale. RESULTS: Replies were received from 117 psychiatrists (mean age 36 (SD 7.9)), 74% of whom were male and 26% female. The Schneiderian scale showed better overall internal consistency than the Bleulerian scale. Intraclass correlation coefficients for test-retest comparisons were between 0.5 and 0.7 for Schneiderian items and between 0.2 and 0.7 for Bleulerian items. There was no negative association between Bleulerian and Schneiderian scale scores, suggesting that respondents may hold both concepts. Place of training was significantly associated with the respondent's opinion; disagreement with a Bleulerian standpoint predominated among those trained at the University of S. Paulo.
CONCLUSIONS: The less satisfactory reliability of the Bleulerian sub-scale limits confidence in the whole scale. Nevertheless, this questionnaire contributes to the understanding of the controversy over Bleulerian and Schneiderian models for the conceptualisation of schizophrenia, the former requiring more inference and therefore being more prone to unreliability.


HortScience ◽  
2017 ◽  
Vol 52 (11) ◽  
pp. 1490-1495 ◽  
Author(s):  
Zachary Stansell ◽  
Thomas Björkman ◽  
Sandra Branham ◽  
David Couillard ◽  
Mark W. Farnham

Selection of superior broccoli hybrids involves multiple considerations, including optimization of head quality traits. Quality assessment of broccoli heads is often confounded by relatively subjective human preferences for optimal appearance of heads. To assist the selection process, we assessed five candidate head quality indices that make use of a set of individual and distinct ratings for traits such as head color, head smoothness, bead size, bead uniformity, and others. The head quality indices were tested for both a) the ability to reduce interobserver rating variability and b) the ability to emphasize specific attributes that display the greatest associations with overall horticultural quality of heads. Index development was based on datasets generated from quality evaluations by three independent raters of two replicated variety trials in Spring 2014. Relative-importance analysis was used to identify specific traits most associated with overall quality. Developed models were subsequently tested and compared using data collected by three raters evaluating two similar trials in Spring 2015. Head smoothness, bead uniformity, head color, and holding ability were found to account for 78% of the model variation in overall head quality. Intraclass correlation coefficients (ICCs), which measure the degree of concordance among raters, were increased from 0.71 to 0.88 (P < 0.05) in one 2015 trial and from 0.67 to 0.80 (P < 0.05) in the second when comparing the simple overall quality assessment to the use of the index weighted by the most important individual head attributes. Thus, results showed that a quality index taking into account the relative importance of individual traits should enhance the identification of the best hybrids adapted to target conditions. This method can be used to improve concordance for subjective ratings in general.


1991 ◽  
Vol 34 (5) ◽  
pp. 989-999 ◽  
Author(s):  
Stephanie Shaw ◽  
Truman E. Coggins

This study examines whether observers reliably categorize selected speech production behaviors in hearing-impaired children. A group of experienced speech-language pathologists was trained to score the elicited imitations of 5 profoundly and 5 severely hearing-impaired subjects using the Phonetic Level Evaluation (Ling, 1976). Interrater reliability was calculated using intraclass correlation coefficients. Overall, the magnitude of the coefficients was found to be considerably below what would be accepted in published behavioral research. Failure to obtain acceptably high levels of reliability suggests that the Phonetic Level Evaluation may not yet be an accurate and objective speech assessment measure for hearing-impaired children.

