Methodology for a statistically sound evaluation of clinical NLP systems (Preprint)
BACKGROUND Clinical Natural Language Processing (NLP) systems are increasingly important because they inform decisions about clinical practice. However, carrying out a sound evaluation of an NLP system is complex, and there is little guidance on how to approach it.
OBJECTIVE This research aims to provide a state-of-the-art methodology for evaluating a clinical NLP system, guiding NLP researchers through the process with the ultimate goal of ensuring that the reported performance metrics are robust and representative.
METHODS We developed a methodology that guides researchers through the process of evaluating a clinical NLP system, illustrated with Savana’s ‘EHRead technology’ applied to a real use case on chronic obstructive pulmonary disease (COPD). In addition, we introduce SLiCE, a software tool that assists NLP specialists in creating a statistically useful gold standard.
RESULTS The gold standard contained 49.6% positive and 50.4% negative examples for COPD. For the primary variable COPD, the confidence intervals (CIs) calculated with SLiCE demonstrated the tool’s usefulness, with CI widths of 0.074 for Precision, 0.046 for Recall, and 0.061 for F1.
CONCLUSIONS Our proposed methodology assists researchers in evaluating a clinical NLP system: they can follow the suggested steps and use SLiCE to statistically back up their gold standard. We successfully evaluated Savana’s ‘EHRead technology’ on a real use case using this methodology. We share here the outcome of our experience developing NLP solutions for the clinical domain, hoping that it helps others establish sound protocols for evaluating their own NLP systems.
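The CI widths reported above can be reproduced in spirit with standard interval estimators. As a minimal sketch (this is an illustration of the general technique, not the SLiCE implementation), precision and recall are proportions and admit a Wilson score interval, while F1 is not a simple proportion and can be handled with a percentile bootstrap over the paired gold/predicted labels:

```python
import math
import random

def wilson_ci(successes, total, z=1.96):
    """95% Wilson score interval for a proportion (e.g., Precision or Recall)."""
    if total == 0:
        return (0.0, 1.0)
    p = successes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2))
    return (max(0.0, center - half), min(1.0, center + half))

def bootstrap_f1_ci(gold, pred, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for F1 over paired binary (gold, predicted) labels."""
    rng = random.Random(seed)
    n = len(gold)
    scores = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample documents with replacement
        tp = sum(1 for i in idx if gold[i] == 1 and pred[i] == 1)
        fp = sum(1 for i in idx if gold[i] == 0 and pred[i] == 1)
        fn = sum(1 for i in idx if gold[i] == 1 and pred[i] == 0)
        if 2 * tp + fp + fn == 0:
            continue  # F1 undefined for this resample; skip it
        scores.append(2 * tp / (2 * tp + fp + fn))
    scores.sort()
    lo = scores[int((alpha / 2) * len(scores))]
    hi = scores[min(len(scores) - 1, int((1 - alpha / 2) * len(scores)))]
    return (lo, hi)
```

For example, `wilson_ci(tp, tp + fp)` gives a precision interval and `wilson_ci(tp, tp + fn)` a recall interval; the interval width shrinks as the gold standard grows, which is the property a tool like SLiCE exploits when sizing the evaluation set.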