Developing a RadLex-based Named Entity Recognition Tool for Mining Textual Radiology Reports (Preprint)

Author(s):  
Shintaro Tsuji ◽  
Andrew Wen ◽  
Naoki Takahashi ◽  
Hongjian Zhang ◽  
Katsuhiko Ogasawara ◽  
...  
2020

BACKGROUND Named entity recognition (NER) plays an important role in extracting descriptive features when mining free-text radiology reports. However, the performance of existing NER tools is limited because the entities they can recognize depend on dictionary lookup. Recognizing compound terms is especially difficult because they follow a wide variety of patterns. OBJECTIVE The objective of this study was to develop and evaluate an NER tool that handles compound terms, using RadLex, for mining free-text radiology reports. METHODS We leveraged the clinical Text Analysis and Knowledge Extraction System (cTAKES) to develop customized pipelines using both RadLex and SentiWordNet (a general-purpose dictionary, GPD). We manually annotated 400 radiology reports for compound terms (Cts) in noun phrases and used them as the gold standard for performance evaluation (precision, recall, and F-measure). We also created a compound-term-enhanced dictionary (CtED) by analyzing false negatives (FNs) and false positives (FPs), and applied it to another 100 radiology reports for validation. In addition, we evaluated the stem terms of compound terms by defining two measures: an occurrence ratio (OR) and a matching ratio (MR). RESULTS The F-measure of cTAKES+RadLex+GPD was 32.2% (precision 92.1%, recall 19.6%), and that of the pipeline combined with the CtED was 67.1% (precision 98.1%, recall 51.0%). The OR indicated that the stem terms "effusion", "node", "tube", and "disease" were used frequently, yet Cts were still not adequately captured. The MR showed that 71.9% of stem terms matched those of the ontologies, and that RadLex improved the MR by about 22% over the cTAKES default dictionary. The OR and MR revealed that the characteristics of stem terms have the potential to help generate synonymous phrases using ontologies.
CONCLUSIONS We developed a RadLex-based customized pipeline for parsing radiology reports and demonstrated that the CtED and stem term analysis have the potential to improve dictionary-based NER performance, as a step toward expanding vocabularies.
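
The reported F-measures can be checked against the reported precision and recall. The sketch below assumes the F-measure is the balanced F1 score (the harmonic mean of precision and recall), which the abstract does not state explicitly; the function name `f_measure` is illustrative only.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): the harmonic mean of precision and recall.

    Assumption: the abstract's "F-measure" is F1; it does not say so explicitly.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall (in percent) as reported in the abstract.
baseline = f_measure(92.1, 19.6)   # cTAKES+RadLex+GPD
with_cted = f_measure(98.1, 51.0)  # pipeline combined with the CtED

print(f"baseline:  {baseline:.1f}%")
print(f"with CtED: {with_cted:.1f}%")
```

Under this assumption the CtED figure reproduces the reported 67.1%, while the baseline comes out at about 32.3% rather than 32.2%, a gap that is plausibly due to rounding of the published precision and recall. The low baseline F-measure is driven almost entirely by recall: precision is above 90% in both configurations.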


2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Irene Pérez-Díez ◽  
Raúl Pérez-Moraga ◽  
Adolfo López-Cerdán ◽  
Jose-Maria Salinas-Serrano ◽  
María de la Iglesia-Vayá

Abstract Background Medical texts such as radiology reports or electronic health records are a powerful source of data for researchers. Anonymization methods must be developed to de-identify documents containing personal information from both patients and medical staff. Although several anonymization strategies currently exist for the English language, they are language-dependent. Here, we introduce a named entity recognition strategy for Spanish medical texts that is translatable to other languages. Results We tested four neural networks on our radiology reports dataset, achieving a recall of 97.18% of the identifying entities. In addition, we developed a randomization algorithm to substitute the detected entities with new ones from the same category, making it virtually impossible to differentiate real data from synthetic data. The three best architectures were tested with the MEDDOCAN challenge dataset of electronic health records as an external test, achieving a recall of 69.18%. Conclusions The proposed strategy, combining named entity recognition tasks with randomization of entities, is suitable for Spanish radiology reports. It does not require a large training corpus, so it could be easily extended to other languages and medical texts, such as electronic health records.
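
The substitution step described above can be illustrated with a minimal sketch. It is not the authors' algorithm: the surrogate pools, the category labels, the example sentence, and the helper names `span` and `randomize_entities` are all hypothetical, and the entity spans are hard-coded where a trained NER model would normally supply them.

```python
import random

# Hypothetical surrogate pools, one per entity category (assumption).
SURROGATES = {
    "NAME": ["Ana García", "Luis Romero", "Marta Vidal"],
    "HOSPITAL": ["Hospital Central", "Clínica del Norte"],
    "DATE": ["03/02/2015", "21/11/2018"],
}


def span(text, surface, category):
    """Build a (start, end, category) span for the first occurrence of surface."""
    start = text.index(surface)
    return (start, start + len(surface), category)


def randomize_entities(text, entities, rng):
    """Replace each detected span with a random surrogate of the same category."""
    pieces = []
    cursor = 0
    for start, end, category in sorted(entities):
        pieces.append(text[cursor:start])                  # untouched text
        pieces.append(rng.choice(SURROGATES[category]))    # same-category surrogate
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


# Invented example sentence; the spans stand in for NER model output.
report = "Paciente Juan Pérez, visto en Hospital San Rafael el 12/05/2019."
entities = [
    span(report, "Juan Pérez", "NAME"),
    span(report, "Hospital San Rafael", "HOSPITAL"),
    span(report, "12/05/2019", "DATE"),
]

print(randomize_entities(report, entities, random.Random(0)))
```

Replacing entities with realistic values of the same category, rather than masking them, means that any identifier the recognizer misses is harder to tell apart from the synthetic ones around it.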


2020 ◽  
Author(s):  
Liliya Akhtyamova ◽  
Paloma Martínez ◽  
Karin Verspoor ◽  
John Cardiff

Abstract Background: In the Big Data era there is an increasing need to fully exploit and analyse the huge quantity of available health information. Natural Language Processing (NLP) technologies can help extract relevant information from unstructured data contained in Electronic Health Records (EHR), such as clinical notes, patient discharge summaries, and radiology reports. The extracted information could support health-related decision making. Named entity recognition (NER), which detects important concepts in text (diseases, symptoms, drugs, etc.), is a crucial task in information extraction, especially for languages other than English. In this work, we develop a deep learning-based NLP pipeline for biomedical entity extraction in Spanish clinical narrative. Methods: We explore the use of contextualized word embeddings to enhance named entity recognition in Spanish-language clinical text, particularly of pharmacological substances, compounds, and proteins. Various combinations of word and sense embeddings were tested on the evaluation corpus of the PharmacoNER 2019 task, the Spanish Clinical Case Corpus (SPACCC). This data set consists of clinical case sections derived from open access Spanish-language medical publications. Results: The NER system integrates in-domain pre-trained Flair and FastText word embeddings, byte-pair-encoded embeddings, and bi-LSTM-based character embeddings. The system achieved its best performance with an F-score of 90.84%. Error analysis showed that the main source of errors for the best model was newly detected false-positive entities, half of which were detected entities longer than the actual ones.
Conclusions: Our study shows that our deep-learning-based system, which combines domain-specific contextualized embeddings with stacking of complementary embeddings, outperforms a system with integrated standard and general-domain word embeddings. With this system, we achieve performance competitive with the state of the art.
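
Stacking embeddings, as the term is commonly used, means concatenating the vectors that several embedding sources produce for the same token. The sketch below illustrates only that idea with toy stand-ins; `toy_word_embedding`, `toy_char_embedding`, and `stack` are hypothetical names, and none of this reproduces the pre-trained Flair, FastText, or byte-pair embeddings used in the study.

```python
from typing import Callable, List

Embedder = Callable[[str], List[float]]


def toy_word_embedding(token: str) -> List[float]:
    # Stand-in for a pre-trained word vector (2 dimensions, assumption).
    return [float(len(token)), 1.0 if token[:1].isupper() else 0.0]


def toy_char_embedding(token: str) -> List[float]:
    # Stand-in for a character-level representation (3 dimensions, assumption).
    vowels = sum(ch in "aeiouáéíóú" for ch in token.lower())
    digits = sum(ch.isdigit() for ch in token)
    return [float(vowels), float(digits), float(len(token) - vowels - digits)]


def stack(embedders: List[Embedder]) -> Embedder:
    """Return an embedder that concatenates the outputs of all given embedders."""
    def stacked(token: str) -> List[float]:
        vector: List[float] = []
        for embed in embedders:
            vector.extend(embed(token))
        return vector
    return stacked


embed = stack([toy_word_embedding, toy_char_embedding])
vector = embed("paracetamol")
print(len(vector), vector)
```

The stacked vector has one dimension per component dimension (here 2 + 3 = 5), so a downstream tagger sees every representation of the token at once.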


2019 ◽  
Vol 58 (02/03) ◽  
pp. 094-106 ◽  
Author(s):  
Zhe Xie ◽  
Yuanyuan Yang ◽  
Mingqing Wang ◽  
Ming Li ◽  
Haozhe Huang ◽  
...  

Abstract Background Radiology reports are a permanent record of a patient's health information and are often used in clinical practice and research. Reading radiology reports is routine for clinicians and radiologists. However, it is laborious and time-consuming when the number of reports to be read is large. Helping clinicians locate and assimilate the key information in reports is of great significance for improving reading efficiency. There are few studies on information extraction from Chinese medical texts and its application in radiology information systems (RIS) for efficiency improvement. Objectives The purpose of this study was to explore methods for extracting, grouping, ranking, delivering, and displaying medical named entities in radiology reports that can improve efficiency in RISs. Methods A total of 5,000 reports were obtained from two medical institutions for this study. We proposed a neural network model called Multi-Embedding-BGRU-CRF (bidirectional gated recurrent unit-conditional random field) for medical named entity recognition and rule-based methods for entity grouping and ranking. Furthermore, a methodology for delivering and displaying entities in RISs was presented. Results The proposed neural named entity recognition model achieved an F1 score of 95.88%. Entity ranking achieved an accuracy of 99.23%. The weakest component of the system is the entity grouping approach, which yielded an accuracy of 91.03%. The effectiveness of the overall solution was demonstrated by an evaluation task performed by two clinicians in a setup based on actual clinical practice. Conclusions The neural model shows great potential in extracting medical named entities from radiology reports, especially for languages that lack lexicons and natural language processing tools.
The pipeline of extracting, grouping, ranking, delivering, and displaying medical-named entities could be a feasible solution to enhance RIS functionality by information extraction. The integration of information extraction and RIS has been demonstrated to be effective in improving the efficiency of reading radiology reports.


2020 ◽  
Author(s):  
Irene Pérez-Díez ◽  
Raúl Pérez-Moraga ◽  
Adolfo López-Cerdán ◽  
Jose-Maria Salinas-Serrano ◽  
María de la Iglesia-Vayá

Medical texts such as radiology reports or electronic health records are a powerful source of data for researchers. Anonymization methods must be developed to de-identify documents containing personal information from both patients and medical staff. Although several anonymization strategies currently exist for the English language, they are language-dependent. Here, we introduce a named entity recognition strategy for Spanish medical texts that is translatable to other languages. We tested four neural networks on our radiology reports dataset, achieving a recall of 97.18% of the identifying entities. In addition, we developed a randomization algorithm to substitute the detected entities with new ones from the same category, making it virtually impossible to differentiate real data from synthetic data. The three best architectures were tested with the MEDDOCAN challenge dataset of electronic health records as an external test, achieving a recall of 69.18%. The proposed strategy, combining named entity recognition tasks with randomization of entities, is suitable for Spanish radiology reports. It does not require a large training corpus, so it can be easily extended to other languages and medical texts, such as electronic health records.

