Description of a Rule-based System for the i2b2 Challenge in Natural Language Processing for Clinical Data

L. C. Childs; R. Enelow; L. Simonsen; N. H. Heintzelman; K. M. Kowalski; R. J. Taylor

doi:10.1197/jamia.m3083

Clinical trial cohort selection based on multi-level rule-based natural language processing system

Journal of the American Medical Informatics Association ◽

10.1093/jamia/ocz109 ◽

2019 ◽

Vol 26 (11) ◽

pp. 1218-1226 ◽

Cited By ~ 7

Author(s):

Long Chen ◽

Yu Gu ◽

Xin Ji ◽

Chao Lou ◽

Zhiyong Sun ◽

...

Keyword(s):

Clinical Trials ◽

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Rule Based ◽

Language System ◽

Unified Medical Language System ◽

Rule Based System ◽

Medical Language ◽

Cohort Selection

Abstract Objective Identifying patients who meet selection criteria for clinical trials is typically challenging and time-consuming. In this article, we describe our clinical natural language processing (NLP) system to automatically assess patients’ eligibility based on their longitudinal medical records. This work was part of the 2018 National NLP Clinical Challenges (n2c2) Shared-Task and Workshop on Cohort Selection for Clinical Trials. Materials and Methods The authors developed an integrated rule-based clinical NLP system which employs a generic rule-based framework plugged in with lexical-, syntactic- and meta-level, task-specific knowledge inputs. In addition, the authors also implemented and evaluated a general clinical NLP (cNLP) system which is built with the Unified Medical Language System and Unstructured Information Management Architecture. Results and Discussion The systems were evaluated as part of the 2018 n2c2-1 challenge, and authors’ rule-based system obtained an F-measure of 0.9028, ranking fourth at the challenge and had less than 1% difference from the best system. While the general cNLP system didn’t achieve performance as good as the rule-based system, it did establish its own advantages and potential in extracting clinical concepts. Conclusion Our results indicate that a well-designed rule-based clinical NLP system is capable of achieving good performance on cohort selection even with a small training data set. In addition, the investigation of a Unified Medical Language System-based general cNLP system suggests that a hybrid system combining these 2 approaches is promising to surpass the state-of-the-art performance.

Download Full-text

Natural Language Processing for the Identification of Silent Brain Infarcts From Neuroimaging Reports (Preprint)

10.2196/preprints.12109 ◽

2018 ◽

Author(s):

Sunyang Fu ◽

Lester Y Leung ◽

Yanshan Wang ◽

Anne-Olivia Raulli ◽

David F Kallmes ◽

...

Keyword(s):

Machine Learning ◽

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Mayo Clinic ◽

Predictive Value ◽

Support Vector ◽

Rule Based ◽

Rule Based System ◽

Sensitivity Specificity

BACKGROUND Silent brain infarction (SBI) is defined as the presence of 1 or more brain lesions, presumed to be because of vascular occlusion, found by neuroimaging (magnetic resonance imaging or computed tomography) in patients without clinical manifestations of stroke. It is more common than stroke and can be detected in 20% of healthy elderly people. Early detection of SBI may mitigate the risk of stroke by offering preventative treatment plans. Natural language processing (NLP) techniques offer an opportunity to systematically identify SBI cases from electronic health records (EHRs) by extracting, normalizing, and classifying SBI-related incidental findings interpreted by radiologists from neuroimaging reports. OBJECTIVE This study aimed to develop NLP systems to determine individuals with incidentally discovered SBIs from neuroimaging reports at 2 sites: Mayo Clinic and Tufts Medical Center. METHODS Both rule-based and machine learning approaches were adopted in developing the NLP system. The rule-based system was implemented using the open source NLP pipeline MedTagger, developed by Mayo Clinic. Features for rule-based systems, including significant words and patterns related to SBI, were generated using pointwise mutual information. The machine learning models adopted convolutional neural network (CNN), random forest, support vector machine, and logistic regression. The performance of the NLP algorithm was compared with a manually created gold standard. RESULTS A total of 5 reports were removed due to invalid scan types. The interannotator agreements across Mayo and Tufts neuroimaging reports were 0.87 and 0.91, respectively. The rule-based system yielded the best performance of predicting SBI with an accuracy, sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) of 0.991, 0.925, 1.000, 1.000, and 0.990, respectively. The CNN achieved the best score on predicting white matter disease (WMD) with an accuracy, sensitivity, specificity, PPV, and NPV of 0.994, 0.994, 0.994, 0.994, and 0.994, respectively. CONCLUSIONS We adopted a standardized data abstraction and modeling process to developed NLP techniques (rule-based and machine learning) to detect incidental SBIs and WMDs from annotated neuroimaging reports. Validation statistics suggested a high feasibility of detecting SBIs and WMDs from EHRs using NLP.

Download Full-text

Natural language processing versus rule-based text analysis: Comparing BERT score and readability indices to predict crowdfunding outcomes

Journal of Business Venturing Insights ◽

10.1016/j.jbvi.2021.e00276 ◽

2021 ◽

Vol 16 ◽

pp. e00276

Author(s):

C.S. Richard Chan ◽

Charuta Pethe ◽

Steven Skiena

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Text Analysis ◽

Rule Based

Download Full-text

Triage and diagnosis of COVID-19 from medical social media (Preprint)

10.2196/preprints.30397 ◽

2021 ◽

Author(s):

Abul Hasan ◽

Mark Levene ◽

David Weston ◽

Renate Fromson ◽

Nicolas Koslover ◽

...

Keyword(s):

Machine Learning ◽

Social Media ◽

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Learning Models ◽

Rule Based ◽

Additional Information ◽

Processing Pipeline ◽

Machine Learning Models

BACKGROUND The COVID-19 pandemic has created a pressing need for integrating information from disparate sources, in order to assist decision makers. Social media is important in this respect, however, to make sense of the textual information it provides and be able to automate the processing of large amounts of data, natural language processing methods are needed. Social media posts are often noisy, yet they may provide valuable insights regarding the severity and prevalence of the disease in the population. In particular, machine learning techniques for triage and diagnosis could allow for a better understanding of what social media may offer in this respect. OBJECTIVE This study aims to develop an end-to-end natural language processing pipeline for triage and diagnosis of COVID-19 from patient-authored social media posts, in order to provide researchers and other interested parties with additional information on the symptoms, severity and prevalence of the disease. METHODS The text processing pipeline first extracts COVID-19 symptoms and related concepts such as severity, duration, negations, and body parts from patients’ posts using conditional random fields. An unsupervised rule-based algorithm is then applied to establish relations between concepts in the next step of the pipeline. The extracted concepts and relations are subsequently used to construct two different vector representations of each post. These vectors are applied separately to build support vector machine learning models to triage patients into three categories and diagnose them for COVID-19. RESULTS We report that Macro- and Micro-averaged F_{1\ }scores in the range of 71-96% and 61-87%, respectively, for the triage and diagnosis of COVID-19, when the models are trained on human labelled data. Our experimental results indicate that similar performance can be achieved when the models are trained using predicted labels from concept extraction and rule-based classifiers, thus yielding end-to-end machine learning. Also, we highlight important features uncovered by our diagnostic machine learning models and compare them with the most frequent symptoms revealed in another COVID-19 dataset. In particular, we found that the most important features are not always the most frequent ones. CONCLUSIONS Our preliminary results show that it is possible to automatically triage and diagnose patients for COVID-19 from natural language narratives using a machine learning pipeline, in order to provide additional information on the severity and prevalence of the disease through the eyes of social media.

Download Full-text

An approach to natural language processing in the rule-based expert system

Proceedings of the 1990 ACM annual conference on Cooperation - CSC '90 ◽

10.1145/100348.100381 ◽

1990 ◽

Cited By ~ 1

Author(s):

Jan Kazimierczak

Keyword(s):

Natural Language Processing ◽

Expert System ◽

Natural Language ◽

Language Processing ◽

Rule Based

Download Full-text

Assessment of Natural Language Processing Methods for Ascertaining the Expanded Disability Status Scale Score From the Electronic Health Records of Patients With Multiple Sclerosis: Algorithm Development and Validation Study

JMIR Medical Informatics ◽

10.2196/25157 ◽

2022 ◽

Vol 10 (1) ◽

pp. e25157

Author(s):

Zhen Yang ◽

Chloé Pou-Prom ◽

Ashley Jones ◽

Michaelia Banning ◽

David Dai ◽

...

Keyword(s):

Natural Language Processing ◽

Electronic Health Records ◽

Natural Language ◽

Language Processing ◽

Expanded Disability Status Scale ◽

Health Records ◽

Rule Based ◽

Disability Status ◽

Edss Score ◽

Electronic Health

Background The Expanded Disability Status Scale (EDSS) score is a widely used measure to monitor disability progression in people with multiple sclerosis (MS). However, extracting and deriving the EDSS score from unstructured electronic health records can be time-consuming. Objective We aimed to compare rule-based and deep learning natural language processing algorithms for detecting and predicting the total EDSS score and EDSS functional system subscores from the electronic health records of patients with MS. Methods We studied 17,452 electronic health records of 4906 MS patients followed at one of Canada’s largest MS clinics between June 2015 and July 2019. We randomly divided the records into training (80%) and test (20%) data sets, and compared the performance characteristics of 3 natural language processing models. First, we applied a rule-based approach, extracting the EDSS score from sentences containing the keyword “EDSS.” Next, we trained a convolutional neural network (CNN) model to predict the 19 half-step increments of the EDSS score. Finally, we used a combined rule-based–CNN model. For each approach, we determined the accuracy, precision, recall, and F-score compared with the reference standard, which was manually labeled EDSS scores in the clinic database. Results Overall, the combined keyword-CNN model demonstrated the best performance, with accuracy, precision, recall, and an F-score of 0.90, 0.83, 0.83, and 0.83 respectively. Respective figures for the rule-based and CNN models individually were 0.57, 0.91, 0.65, and 0.70, and 0.86, 0.70, 0.70, and 0.70. Because of missing data, the model performance for EDSS subscores was lower than that for the total EDSS score. Performance improved when considering notes with known values of the EDSS subscores. Conclusions A combined keyword-CNN natural language processing model can extract and accurately predict EDSS scores from patient records. This approach can be automated for efficient information extraction in clinical and research settings.

Download Full-text

Designing an openEHR-Based Pipeline for Extracting and Standardizing Unstructured Clinical Data Using Natural Language Processing

Methods of Information in Medicine ◽

10.1055/s-0040-1716403 ◽

2020 ◽

Vol 59 (S 02) ◽

pp. e64-e78

Author(s):

Antje Wulff ◽

Marcel Mast ◽

Marcus Hassler ◽

Sara Montag ◽

Michael Marschollek ◽

...

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Clinical Data ◽

Expert Knowledge ◽

Clinical Information ◽

Structured Data ◽

Unstructured Data ◽

Mapping Rules ◽

Medical Histories

Abstract Background Merging disparate and heterogeneous datasets from clinical routine in a standardized and semantically enriched format to enable a multiple use of data also means incorporating unstructured data such as medical free texts. Although the extraction of structured data from texts, known as natural language processing (NLP), has been researched at least for the English language extensively, it is not enough to get a structured output in any format. NLP techniques need to be used together with clinical information standards such as openEHR to be able to reuse and exchange still unstructured data sensibly. Objectives The aim of the study is to automatically extract crucial information from medical free texts and to transform this unstructured clinical data into a standardized and structured representation by designing and implementing an exemplary pipeline for the processing of pediatric medical histories. Methods We constructed a pipeline that allows reusing medical free texts such as pediatric medical histories in a structured and standardized way by (1) selecting and modeling appropriate openEHR archetypes as standard clinical information models, (2) defining a German dictionary with crucial text markers serving as expert knowledge base for a NLP pipeline, and (3) creating mapping rules between the NLP output and the archetypes. The approach was evaluated in a first pilot study by using 50 manually annotated medical histories from the pediatric intensive care unit of the Hannover Medical School. Results We successfully reused 24 existing international archetypes to represent the most crucial elements of unstructured pediatric medical histories in a standardized form. The self-developed NLP pipeline was constructed by defining 3.055 text marker entries, 132 text events, 66 regular expressions, and a text corpus consisting of 776 entries for automatic correction of spelling mistakes. A total of 123 mapping rules were implemented to transform the extracted snippets to an openEHR-based representation to be able to store them together with other structured data in an existing openEHR-based data repository. In the first evaluation, the NLP pipeline yielded 97% precision and 94% recall. Conclusion The use of NLP and openEHR archetypes was demonstrated as a viable approach for extracting and representing important information from pediatric medical histories in a structured and semantically enriched format. We designed a promising approach with potential to be generalized, and implemented a prototype that is extensible and reusable for other use cases concerning German medical free texts. In a long term, this will harness unstructured clinical data for further research purposes such as the design of clinical decision support systems. Together with structured data already integrated in openEHR-based representations, we aim at developing an interoperable openEHR-based application that is capable of automatically assessing a patient's risk status based on the patient's medical history at time of admission.

Download Full-text

A rule-based grapheme-phone converter and stress determination for Brazilian Portuguese natural language processing

2006 International Telecommunications Symposium ◽

10.1109/its.2006.4433335 ◽

2006 ◽

Cited By ~ 5

Author(s):

Denilson C. Silva ◽

Amaro A. de Lima ◽

R. Maia ◽

Daniela Braga ◽

Joao F. de Moraes ◽

...

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Brazilian Portuguese ◽

Stress Determination ◽

Rule Based

Download Full-text

Machine translation using natural language processing

MATEC Web of Conferences ◽

10.1051/matecconf/201927702004 ◽

2019 ◽

Vol 277 ◽

pp. 02004

Author(s):

Middi Venkata Sai Rishita ◽

Middi Appala Raju ◽

Tanvir Ahmed Harris

Keyword(s):

Neural Network ◽

Neural Networks ◽

Natural Language Processing ◽

Natural Language ◽

Machine Translation ◽

Language Processing ◽

Deep Neural Network ◽

English Text ◽

French Translation ◽

Rule Based

Machine Translation is the translation of text or speech by a computer with no human involvement. It is a popular topic in research with different methods being created, like rule-based, statistical and examplebased machine translation. Neural networks have made a leap forward to machine translation. This paper discusses the building of a deep neural network that functions as a part of end-to-end translation pipeline. The completed pipeline would accept English text as input and return the French Translation. The project has three main parts which are preprocessing, creation of models and Running the model on English Text.

Download Full-text

Natural Language Processing and the Representation of Clinical Data

Journal of the American Medical Informatics Association ◽

10.1136/jamia.1994.95236145 ◽

1994 ◽

Vol 1 (2) ◽

pp. 142-160 ◽

Cited By ~ 87

Author(s):

N. Sager ◽

M. Lyman ◽

C. Bucknall ◽

N. Nhan ◽

L. J. Tick

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Clinical Data

Download Full-text