Combining Contextualized Embeddings and Prior Knowledge for Clinical Named Entity Recognition: Evaluation Study

Min Jiang; Todd Sanger; Xiong Liu

doi:10.2196/14850

JMIR Medical Informatics ◽

10.2196/14850 ◽

2019 ◽

Vol 7 (4) ◽

pp. e14850 ◽

Cited By ~ 4

Author(s):

Min Jiang ◽

Todd Sanger ◽

Xiong Liu

Keyword(s):

Deep Learning ◽

Prior Knowledge ◽

Language Processing ◽

Named Entity Recognition ◽

Word Embedding ◽

Training Data ◽

Entity Recognition ◽

Named Entities ◽

Clinical Text ◽

Named Entity

Background: Named entity recognition (NER) is a key step in clinical natural language processing (NLP). Traditionally, rule-based systems leverage prior knowledge to define rules to identify named entities. Recently, deep learning–based NER systems have become increasingly popular. Contextualized word embedding, a new type of word representation, has been proposed to dynamically capture word sense using context information and has proven successful in many deep learning–based systems in both the general and medical domains. However, very few studies have investigated the effects of combining multiple contextualized embeddings and prior knowledge on the clinical NER task. Objective: This study aims to improve the performance of NER in clinical text by combining multiple contextual embeddings and prior knowledge. Methods: We investigate the effects of combining multiple contextualized word embeddings with classic word embedding in deep neural networks to predict named entities in clinical text. We also investigate whether using a semantic lexicon could further improve the performance of the clinical NER system. Results: By combining contextualized embeddings such as ELMo and Flair, our system achieves an F-1 score of 87.30% when trained on only a portion of the 2010 Informatics for Integrating Biology and the Bedside NER task dataset. After incorporating the medical lexicon into the word embedding, the F-1 score increased further to 87.44%. Our system could still achieve an F-1 score of 85.36% when the size of the training data was reduced to 40%. Conclusions: Combined contextualized embedding could be beneficial for the clinical NER task. Moreover, the semantic lexicon could be used to further improve the performance of the clinical NER system.
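The abstract does not say how the embeddings and the lexicon are combined. A common approach, assumed here, is to concatenate the per-token vectors from each source and append lexicon-derived features. The sketch below shows only that concatenation step. The two toy embedders stand in for a classic and a contextualized model, and the lexicon entries, semantic types, and all function names are hypothetical.

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]
Embedder = Callable[[Sequence[str]], List[Vector]]


def classic_embedder(tokens: Sequence[str]) -> List[Vector]:
    """Toy static embedding: a word gets the same vector in every context."""
    table: Dict[str, Vector] = {"aspirin": [0.9, 0.1], "pain": [0.2, 0.8]}
    return [table.get(tok.lower(), [0.0, 0.0]) for tok in tokens]


def contextual_embedder(tokens: Sequence[str]) -> List[Vector]:
    """Toy stand-in for a contextual model: the vector depends on neighbours."""
    vectors = []
    for i, tok in enumerate(tokens):
        left = len(tokens[i - 1]) if i > 0 else 0
        right = len(tokens[i + 1]) if i + 1 < len(tokens) else 0
        vectors.append([float(len(tok)), float(left), float(right)])
    return vectors


# Hypothetical semantic lexicon (prior knowledge): surface form -> semantic type.
LEXICON = {"aspirin": "treatment", "pain": "problem"}
SEMANTIC_TYPES = ["problem", "treatment", "test"]


def lexicon_features(tokens: Sequence[str]) -> List[Vector]:
    """One-hot vector over semantic types; all zeros if the token is unknown."""
    features = []
    for tok in tokens:
        sem_type = LEXICON.get(tok.lower())
        features.append([1.0 if sem_type == s else 0.0 for s in SEMANTIC_TYPES])
    return features


def combine(tokens: Sequence[str], embedders: Sequence[Embedder]) -> List[Vector]:
    """Concatenate, token by token, the vectors produced by every source."""
    per_source = [embed(tokens) for embed in embedders]
    return [sum((src[i] for src in per_source), []) for i in range(len(tokens))]


if __name__ == "__main__":
    sources = [classic_embedder, contextual_embedder, lexicon_features]
    for token, vector in zip(["Aspirin", "for", "pain"],
                             combine(["Aspirin", "for", "pain"], sources)):
        print(token, vector)
```

In this arrangement the downstream tagger sees one wider vector per token, so adding or removing a source changes only the input dimension.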

Combining Contextualized Embeddings and Prior Knowledge for Clinical Named Entity Recognition: Evaluation Study (Preprint)

10.2196/preprints.14850 ◽

2019 ◽

Author(s):

Min Jiang ◽

Todd Sanger ◽

Xiong Liu

Keyword(s):

Deep Learning ◽

Prior Knowledge ◽

Named Entity Recognition ◽

Word Embedding ◽

Training Data ◽

Entity Recognition ◽

Named Entities ◽

Clinical Text ◽

Named Entity ◽

System P

BACKGROUND: Named entity recognition (NER) is a key step in clinical natural language processing (NLP). Traditionally, rule-based systems leverage prior knowledge to define rules to identify named entities. Recently, deep learning–based NER systems have become increasingly popular. Contextualized word embedding, a new type of word representation, has been proposed to dynamically capture word sense using context information and has proven successful in many deep learning–based systems in both the general and medical domains. However, very few studies have investigated the effects of combining multiple contextualized embeddings and prior knowledge on the clinical NER task. OBJECTIVE: This study aims to improve the performance of NER in clinical text by combining multiple contextual embeddings and prior knowledge. METHODS: We investigate the effects of combining multiple contextualized word embeddings with classic word embedding in deep neural networks to predict named entities in clinical text. We also investigate whether using a semantic lexicon could further improve the performance of the clinical NER system. RESULTS: By combining contextualized embeddings such as ELMo and Flair, our system achieves an F-1 score of 87.30% when trained on only a portion of the 2010 Informatics for Integrating Biology and the Bedside NER task dataset. After incorporating the medical lexicon into the word embedding, the F-1 score increased further to 87.44%. Our system could still achieve an F-1 score of 85.36% when the size of the training data was reduced to 40%. CONCLUSIONS: Combined contextualized embedding could be beneficial for the clinical NER task. Moreover, the semantic lexicon could be used to further improve the performance of the clinical NER system.

DeIDNER Corpus: Annotation of Clinical Discharge Summary Notes for Named Entity Recognition Using BRAT Tool

Studies in Health Technology and Informatics - Public Health and Informatics ◽

10.3233/shti210195 ◽

2021 ◽

Author(s):

Mahanazuddin Syed ◽

Shaymaa Al-Shukri ◽

Shorabuddin Syed ◽

Kevin Sexton ◽

Melody L. Greer ◽

...

Keyword(s):

Language Processing ◽

Named Entity Recognition ◽

Discharge Summary ◽

Training Data ◽

Entity Recognition ◽

Named Entities ◽

Named Entity ◽

Domain Specific ◽

Model Training ◽

Clinical Domain

Named Entity Recognition (NER), which aims to identify and classify entities into predefined categories, is a critical pre-processing task in Natural Language Processing (NLP) pipelines. Readily available off-the-shelf NER algorithms or programs are trained on a general corpus and often need to be retrained when applied to a different domain. The end model's performance depends on the quality of the named entities generated by the NER models used in the NLP task. To improve NER model accuracy, researchers build domain-specific corpora for both model training and evaluation. However, in the clinical domain there is a dearth of training data because of privacy concerns, forcing many studies to use NER models trained in the non-clinical domain to generate the NER feature set. This in turn affects the performance of downstream NLP tasks such as information extraction and de-identification. In this paper, our objective is to create a high-quality annotated clinical corpus for training NER models that generalize easily and can be used in a downstream de-identification task to generate the named entity feature set.

Clinical Named Entity Recognition based on Deep Learning with Pre-trained Word Embedding for Colonoscopy Reports: Model Development and Performance Evaluation (Preprint)

10.2196/preprints.27256 ◽

2021 ◽

Author(s):

Donghyeong Seong ◽

Yoonho Choi ◽

Sungwon Jung ◽

Sungchul Bae ◽

Soo-Yong Shin ◽

...

Keyword(s):

Colorectal Cancer ◽

Deep Learning ◽

Language Processing ◽

Short Term Memory ◽

Medical Center ◽

Named Entity Recognition ◽

Word Embedding ◽

Entity Recognition ◽

Screening Tests ◽

Named Entity

BACKGROUND Colorectal cancer is a leading cause of cancer deaths. Several screening tests such as colonoscopy can be used to find polyps or colorectal cancer. Colonoscopy reports are often written in unstructured narrative text. The information embedded in the reports can be used for various purposes, including colorectal cancer risk prediction, follow-up recommendation, and quality measurement. However, the availability and accessibility of the unstructured text data are still very low despite the large amounts of accumulated data. OBJECTIVE We aimed to develop a deep learning-based natural language processing (NLP) method for named entity recognition (NER) in colonoscopy reports. To the best of our knowledge, no previous studies on clinical NLP for colonoscopy reports have applied deep learning techniques. METHODS This study proposed a method to apply pre-trained word embedding to a deep learning-based NER model using a large set of unlabeled colonoscopy reports. Approximately 280,668 colonoscopy reports were extracted from the clinical data warehouse of the Samsung Medical Center. For 5,000 reports, procedural information and colonoscopic findings were manually annotated with 17 labels. We compared variants of the long short-term memory (LSTM) model to select the one with the best performance for colonoscopy reports, which was the bidirectional LSTM with conditional random fields. Then, we applied word embedding pre-trained on the large unlabeled dataset (280,668 reports) to the selected model. RESULTS The NER model with pre-trained word embedding performed better for most labels than the model with one-hot encoding. The F1 scores for colonoscopic findings were: 0.9564 for lesions, 0.9722 for locations, 0.9809 for shapes, 0.9720 for colors, 0.9862 for sizes, and 0.9717 for numbers. CONCLUSIONS In this study, clinical NER was applied to extract meaningful information from colonoscopy reports. We proposed a deep learning-based NER model with pre-trained word embedding. The proposed method achieved promising results, which suggest that it can be applied to various practical purposes.
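Per-label F1 scores like those reported above are typically computed over entity spans rather than individual tokens. The abstract does not state the tagging scheme or the matching rule, so the sketch below assumes BIO tags and exact span matching. The label names echo the abstract, while the tag sequences and function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Sequence, Set, Tuple

Span = Tuple[str, int, int]  # (label, start index, exclusive end index)


def spans(tags: Sequence[str]) -> Set[Span]:
    """Extract labelled spans from a BIO sequence; stray I- tags are ignored."""
    found: Set[Span] = set()
    start, label = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            found.add((label, start, i))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return found


def per_label_f1(gold: Sequence[str], pred: Sequence[str]) -> Dict[str, float]:
    """F1 per label, counting a prediction as correct only on an exact span match."""
    gold_spans, pred_spans = spans(gold), spans(pred)
    hits = Counter(label for label, _, _ in gold_spans & pred_spans)
    n_gold = Counter(label for label, _, _ in gold_spans)
    n_pred = Counter(label for label, _, _ in pred_spans)
    scores = {}
    for label in sorted(set(n_gold) | set(n_pred)):
        precision = hits[label] / n_pred[label] if n_pred[label] else 0.0
        recall = hits[label] / n_gold[label] if n_gold[label] else 0.0
        total = precision + recall
        scores[label] = 2 * precision * recall / total if total else 0.0
    return scores


if __name__ == "__main__":
    gold = ["B-SIZE", "I-SIZE", "O", "B-LESION", "O", "B-LOCATION", "I-LOCATION"]
    pred = ["B-SIZE", "I-SIZE", "O", "B-LESION", "O", "B-LOCATION", "O"]
    print(per_label_f1(gold, pred))
```

Under exact matching, the truncated LOCATION span in the example counts as both a false positive and a false negative, which is why span-level scores are usually stricter than token-level ones.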

An Annotated Corpus of Crime-Related Portuguese Documents for NLP and Machine Learning Processing

Data ◽

10.3390/data6070071 ◽

2021 ◽

Vol 6 (7) ◽

pp. 71

Author(s):

Gonçalo Carnaz ◽

Mário Antunes ◽

Vitor Beires Nogueira

Keyword(s):

Machine Learning ◽

Language Processing ◽

Named Entity Recognition ◽

Entity Recognition ◽

Automatic Identification ◽

Named Entities ◽

Related Data ◽

Named Entity ◽

Chain Of Custody ◽

Evidence Collection

Criminal investigations collect and analyze the facts related to a crime, from which the investigators can deduce evidence to be used in court. It is a multidisciplinary and applied science, which includes interviews, interrogations, evidence collection, preservation of the chain of custody, and other methods and techniques of investigation. These techniques produce both digital and paper documents that have to be carefully analyzed to identify correlations and interactions among suspects, places, license plates, and other entities that are mentioned in the investigation. The computerized processing of these documents aids the criminal investigation, as it allows the automatic identification of entities and their relations, some of which are difficult to identify manually. A wide set of dedicated tools exists, but they have a major limitation: they are unable to process criminal reports in the Portuguese language, as an annotated corpus for that purpose does not exist. This paper presents an annotated corpus, composed of a collection of anonymized crime-related documents, which were extracted from official and open sources. The dataset was produced as the result of an exploratory initiative to collect crime-related data from websites and conditioned-access police reports. The dataset was evaluated, and a mean precision of 0.808, recall of 0.722, and F1-score of 0.733 were obtained in the classification of the annotated named entities present in the crime-related documents. This corpus can be employed to benchmark Machine Learning (ML) and Natural Language Processing (NLP) methods and tools to detect and correlate entities in the documents. Some examples are sentence detection, named-entity recognition, and identification of terms related to the criminal domain.
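The reported mean F1-score (0.733) is lower than the harmonic mean of the reported mean precision and recall (about 0.763). That pattern is what one would expect if each figure were averaged separately over entity classes, although the abstract does not say how the means were taken, so this reading is an assumption. The sketch below illustrates the effect with hypothetical per-class counts and class names.

```python
from typing import Dict, Tuple


def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def prf(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for one class from its error counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_from(precision, recall)


def macro_average(counts: Dict[str, Tuple[int, int, int]]) -> Tuple[float, float, float]:
    """Average precision, recall, and F1 independently over all classes."""
    rows = [prf(*c) for c in counts.values()]
    n = len(rows)
    return tuple(sum(row[i] for row in rows) / n for i in range(3))


if __name__ == "__main__":
    # Hypothetical (true positive, false positive, false negative) counts.
    counts = {
        "PERSON": (90, 10, 10),
        "LOCATION": (30, 5, 45),
        "LICENSE_PLATE": (8, 8, 2),
    }
    macro_p, macro_r, macro_f1 = macro_average(counts)
    print(f"macro P={macro_p:.3f} R={macro_r:.3f} F1={macro_f1:.3f}")
    print(f"harmonic mean of macro P and R: {f1_from(macro_p, macro_r):.3f}")
    print(f"harmonic mean of 0.808 and 0.722: {f1_from(0.808, 0.722):.3f}")
```

Because F1 is computed per class before averaging, classes with unbalanced precision and recall pull the mean F1 below the value implied by the averaged precision and recall.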

A Survey of Arabic Named Entity Recognition and Classification

Computational Linguistics ◽

10.1162/coli_a_00178 ◽

2014 ◽

Vol 40 (2) ◽

pp. 469-510 ◽

Cited By ~ 62

Author(s):

Khaled Shaalan

Keyword(s):

Language Processing ◽

Named Entity Recognition ◽

Relevant Information ◽

Arabic Language ◽

Entity Recognition ◽

Named Entities ◽

Linguistic Resources ◽

Named Entity ◽

To Receive ◽

Made In

As more and more Arabic textual information becomes available through the Web in homes and businesses, via Internet and Intranet services, there is an urgent need for technologies and tools to process the relevant information. Named Entity Recognition (NER) is an Information Extraction task that has become an integral part of many other Natural Language Processing (NLP) tasks, such as Machine Translation and Information Retrieval. Arabic NER has begun to receive attention in recent years. The characteristics and peculiarities of Arabic, a member of the Semitic language family, make dealing with NER a challenge. The performance of an Arabic NER component affects the overall performance of the NLP system in a positive manner. This article attempts to describe and detail the recent increase in interest and progress made in Arabic NER research. The importance of the NER task is demonstrated, the main characteristics of the Arabic language are highlighted, and the aspects of standardization in annotating named entities are illustrated. Moreover, the different Arabic linguistic resources are presented and the approaches used in the Arabic NER field are explained. The features of common tools used in Arabic NER are described, and standard evaluation metrics are illustrated. In addition, the state of the art of Arabic NER research is reviewed. Finally, we present our conclusions. Throughout the presentation, illustrative examples are used for clarification.

Evaluating named entity recognition tools for extracting social networks from novels

PeerJ Computer Science ◽

10.7717/peerj-cs.189 ◽

2019 ◽

Vol 5 ◽

pp. e189 ◽

Cited By ~ 2

Author(s):

Niels Dekker ◽

Tobias Kuhn ◽

Marieke van Erp

Keyword(s):

Social Networks ◽

Social Interactions ◽

Language Processing ◽

Named Entity Recognition ◽

Entity Recognition ◽

Computer Assisted ◽

Early 20th Century ◽

Automatic Extraction ◽

Named Entities ◽

Named Entity

The analysis of literary works has experienced a surge in computer-assisted processing. To obtain insights into the community structures and social interactions portrayed in novels, the creation of social networks from novels has gained popularity. Many methods rely on identifying named entities and relations for the construction of these networks, but many of these tools are not specifically created for the literary domain. Furthermore, many of the studies on information extraction from literature typically focus on 19th and early 20th century source material. Because of this, it is unclear if these techniques are as suitable to modern-day literature as they are to those older novels. We present a study in which we evaluate natural language processing tools for the automatic extraction of social networks from novels as well as their network structure. We find that there are no significant differences between old and modern novels but that both are subject to a large amount of variance. Furthermore, we identify several issues that complicate named entity recognition in our set of novels and we present methods to remedy these. We see this work as a step in creating more culturally-aware AI systems.

Statistical Method for Named Entity Recognition in Telugu, an Indian Language

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.b3500.078219 ◽

2019 ◽

Vol 8 (2) ◽

pp. 4211-4216

Keyword(s):

Language Processing ◽

Conditional Random Fields ◽

Named Entity Recognition ◽

Entity Recognition ◽

Semantic Features ◽

Indian Language ◽

Named Entities ◽

Maximum Entropy Models ◽

Named Entity ◽

Proper Nouns

One of the important tasks of Natural Language Processing (NLP) is Named Entity Recognition (NER). The primary operation of NER is to identify proper nouns, i.e., to locate all the named entities in the text and tag them with named entity categories such as Entity, Time expression, and Numeric expression. In previous work, NER for the Telugu language has been addressed with Conditional Random Fields (CRF) and Maximum Entropy models; however, these failed to handle ambiguous named entity tags for the same named entity. This paper presents a hybrid statistical system for Named Entity Recognition in the Telugu language in which named entities are identified by both a dictionary-based approach and a statistical Hidden Markov Model (HMM). The proposed method uses a lexicon-lookup dictionary and contexts based on semantic features for predicting named entity tags. Further, the HMM is used to resolve the named entity ambiguities in the predicted tags. The present work reports an average accuracy of 86.3% for finding the named entities.
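The two-stage idea described above, where a dictionary proposes candidate tags and an HMM resolves ambiguity, can be sketched as Viterbi decoding restricted to the lexicon's candidates. This is a minimal illustration and not the authors' system. The lexicon entries, tag set, and probabilities are hypothetical, and emissions are treated as uniform over a token's candidates, so only transition probabilities decide between them.

```python
import math
from typing import Dict, List, Sequence

# Hypothetical lexicon: one surface form may carry several candidate tags.
LEXICON: Dict[str, List[str]] = {
    "rama": ["PER"],
    "krishna": ["PER", "LOC"],  # ambiguous: a person or a river
    "river": ["LOC"],
}

# Hypothetical HMM parameters: P(first tag) and P(tag | previous tag).
START = {"PER": 0.3, "LOC": 0.2, "O": 0.5}
TRANS = {
    "PER": {"PER": 0.35, "LOC": 0.05, "O": 0.60},
    "LOC": {"PER": 0.20, "LOC": 0.50, "O": 0.30},
    "O": {"PER": 0.30, "LOC": 0.30, "O": 0.40},
}


def candidates(token: str) -> List[str]:
    """Dictionary lookup; tokens missing from the lexicon default to 'O'."""
    return LEXICON.get(token.lower(), ["O"])


def viterbi(tokens: Sequence[str]) -> List[str]:
    """Most probable tag path, searching only over lexicon candidates."""
    if not tokens:
        return []
    # Each column maps tag -> (log-probability of best path, previous tag).
    columns = [{t: (math.log(START[t]), None) for t in candidates(tokens[0])}]
    for token in tokens[1:]:
        column = {}
        for tag in candidates(token):
            prev, score = max(
                ((p, s + math.log(TRANS[p][tag])) for p, (s, _) in columns[-1].items()),
                key=lambda item: item[1],
            )
            column[tag] = (score, prev)
        columns.append(column)
    # Backtrack from the best final tag.
    tag = max(columns[-1], key=lambda t: columns[-1][t][0])
    path = [tag]
    for column in reversed(columns[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]


if __name__ == "__main__":
    print(viterbi(["Krishna", "river"]))
    print(viterbi(["Krishna", "said"]))
```

With these toy parameters the ambiguous token is tagged differently depending on its neighbour, which is the kind of disambiguation the abstract attributes to the HMM stage.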

Evaluating social network extraction for classic and modern fiction literature

10.7287/peerj.preprints.27263 ◽

2018 ◽

Author(s):

Niels Dekker ◽

Tobias Kuhn ◽

Marieke van Erp

Keyword(s):

Social Networks ◽

Science Fiction ◽

19th Century ◽

Language Processing ◽

Named Entity Recognition ◽

Entity Recognition ◽

Computer Assisted ◽

Named Entities ◽

Named Entity ◽

Modern Fiction

The analysis of literary works has experienced a surge in computer-assisted processing. To obtain insights into the community structures and social interactions portrayed in novels, the creation of social networks from novels has gained popularity. Many methods rely on identifying named entities and relations for the construction of these networks, but many of these tools are not specifically created for the literary domain. Furthermore, many of the studies on information extraction from literature typically focus on 19th century source material. Because of this, it is unclear if these techniques are as suitable to modern-day science fiction and fantasy literature as they are to those 19th century classics. We present a study to compare classic literature to modern literature in terms of performance of natural language processing tools for the automatic extraction of social networks as well as their network structure. We find that there are no significant differences between the two sets of novels but that both are subject to a high amount of variance. Furthermore, we identify several issues that complicate named entity recognition in modern novels and we present methods to remedy these.

Techniques for Named Entity Recognition on Arabic-English Code-Mixed Data

International Journal of Robotic Computing ◽

10.35708/tai1868-126245 ◽

2019 ◽

pp. 44-63

Author(s):

Caroline Sabty ◽

Ahmed Sherif ◽

Mohamed Elmahdy ◽

Slim Abdennadher

Keyword(s):

Language Processing ◽

Named Entity Recognition ◽

Arab Countries ◽

Entity Recognition ◽

Mixed Data ◽

Word Embeddings ◽

Named Entities ◽

Named Entity ◽

Code Mixing ◽

Social Media Platforms

As a result of globalization and better quality of education, a significant percentage of the population in Arab countries has become bilingual or multilingual. This has raised the frequency of code-switching and code-mixing among Arabs in daily communication. Consequently, a huge amount of Code-Mixed (CM) content can be found on different social media platforms. Such data could be analyzed and used in different Natural Language Processing (NLP) tasks to tackle the challenges emerging due to this multilingual phenomenon. Named-Entity Recognition (NER) is one of the major tasks for several NLP systems. It is the process of identifying named entities in text. However, there is a lack of annotated CM data and resources for such a task. This work aims at collecting and building the first annotated CM Arabic-English corpus for NER. Furthermore, we constructed a baseline NER system using deep neural networks and word embeddings for Arabic-English CM text. Moreover, we investigated the usage of different types of classical and contextual pre-trained word embeddings in our system. The best-performing NER system achieved an F1-score of 77.69% by combining classical and contextual word embeddings.

Bidirectional Recurrent Neural Network Approach for Arabic Named Entity Recognition

Future Internet ◽

10.3390/fi10120123 ◽

2018 ◽

Vol 10 (12) ◽

pp. 123 ◽

Cited By ~ 7

Author(s):

Mohammed Ali ◽

Guanzheng Tan ◽

Aamir Hussain

Keyword(s):

Neural Network ◽

Language Processing ◽

Recurrent Neural Network ◽

Short Term Memory ◽

Named Entity Recognition ◽

Recognition Task ◽

Word Embedding ◽

Entity Recognition ◽

Named Entity ◽

LSTM Network

Recurrent neural networks (RNNs) have achieved remarkable success in sequence labeling tasks that require memory. An RNN can remember previous information in a sequence and can thus be used to solve natural language processing (NLP) tasks. Named entity recognition (NER) is a common NLP task and can be considered a classification problem. We propose a bidirectional long short-term memory (LSTM) model for this entity recognition task on Arabic text. The LSTM network can process sequences and relate each part of a sequence to the others, which makes it useful for the NER task. Moreover, we use pre-trained word embedding to train the inputs that are fed into the LSTM network. The proposed model is evaluated on a popular dataset called “ANERcorp.” Experimental results show that the model with word embedding achieves a high F-score of approximately 88.01%.
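To illustrate what "bidirectional" adds, the sketch below runs a recurrence left to right and right to left and pairs the two states at each position. For brevity it uses a plain one-unit tanh recurrence, not an LSTM cell, and scalar inputs in place of word embeddings. The weights and function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def rnn_pass(inputs: Sequence[float], w_in: float, w_rec: float) -> List[float]:
    """Run a one-unit tanh recurrence and return the hidden state at every step."""
    state, states = 0.0, []
    for x in inputs:
        state = math.tanh(w_in * x + w_rec * state)
        states.append(state)
    return states


def bidirectional(
    inputs: Sequence[float], w_in: float = 0.5, w_rec: float = 0.8
) -> List[Tuple[float, float]]:
    """Pair each position's forward state (past context) with its backward state (future context)."""
    forward = rnn_pass(inputs, w_in, w_rec)
    backward = rnn_pass(list(inputs)[::-1], w_in, w_rec)[::-1]
    return list(zip(forward, backward))


if __name__ == "__main__":
    # A signal at the first position reaches the last position only through the
    # forward pass; a signal at the last position reaches the first only backward.
    for name, seq in [("early signal", [1.0, 0.0, 0.0]), ("late signal", [0.0, 0.0, 1.0])]:
        print(name, [(round(f, 3), round(b, 3)) for f, b in bidirectional(seq)])
```

A tagger that reads both states at a position can therefore use context on either side of a word, which a single left-to-right pass cannot provide.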
