Family History Extraction Using Deep Biaffine Attention (Preprint)

Author(s):  
Kecheng Zhan ◽  
Weihua Peng ◽  
Ying Xiong ◽  
Huhao Fu ◽  
Qingcai Chen ◽  
...  

BACKGROUND Family history (FH) information, including family members, side of family, living status of family members, and observations of family members, plays a significant role in disease diagnosis and treatment. FH information extraction aims to extract FH information from semi-structured/unstructured text in electronic health records (EHRs). It is a challenging task combining named entity recognition (NER) and relation extraction (RE), where named entities are family members, living statuses, and observations, and relations hold between family members and living statuses and between family members and observations. OBJECTIVE This study aims to explore ways to effectively extract family history information from clinical text. METHODS Inspired by dependency parsing, we design a novel graph-based schema to represent FH information and introduce deep biaffine attention to extract it from clinical text. In the deep biaffine attention model, we use a CNN-BiLSTM (Convolutional Neural Network-Bidirectional Long Short-Term Memory network) and BERT (Bidirectional Encoder Representations from Transformers) to encode input sentences, and deploy a biaffine classifier to extract FH information. In addition, we develop a post-processing module to adjust the results. A system based on the proposed method was developed for the 2019 n2c2/OHNLP shared task track on FH information extraction, which includes two subtasks on entity recognition and relation extraction, respectively. RESULTS We conducted experiments on the corpus provided by the 2019 n2c2/OHNLP shared task track on FH information extraction. Our system achieved the highest F1-scores of 0.8823 on subtask 1 and 0.7048 on subtask 2, respectively, establishing new benchmark results on the 2019 n2c2/OHNLP corpus. CONCLUSIONS This study designed a novel graph-based schema to represent FH information and applied deep biaffine attention to extract it. Experimental results show the effectiveness of deep biaffine attention for FH information extraction.
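The biaffine classifier at the core of this approach scores ordered pairs of token representations. A minimal NumPy sketch, with random vectors standing in for the CNN-BiLSTM/BERT encoder outputs (all names and dimensions here are illustrative, not from the paper):

```python
import numpy as np

def biaffine_score(head, dep, U, W, b):
    """Biaffine score s = head^T U dep + W @ [head; dep] + b.

    head, dep: (d,) token representations from an encoder (e.g. BiLSTM/BERT).
    U: (d, d) bilinear term; W: (2d,) linear term; b: scalar bias.
    """
    return head @ U @ dep + W @ np.concatenate([head, dep]) + b

# Score every ordered token pair in a toy 3-token sentence.
rng = np.random.default_rng(0)
d = 4
H = rng.normal(size=(3, d))          # encoder outputs, one row per token
U = rng.normal(size=(d, d))
W = rng.normal(size=(2 * d,))
b = 0.1
scores = np.array([[biaffine_score(H[i], H[j], U, W, b)
                    for j in range(3)] for i in range(3)])
print(scores.shape)  # one score per (head, dependent) pair
```

In a full model there would be one such scorer per relation label, trained end-to-end with the encoder.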

Author(s):  
Xue Shi ◽  
Dehuan Jiang ◽  
Yuanhang Huang ◽  
Xiaolong Wang ◽  
Qingcai Chen ◽  
...  

Abstract Background Family history (FH) information, including family members, side of family (i.e., maternal or paternal), living status of family members, and observations (diseases) of family members, is very important in the decision-making process of disorder diagnosis and treatment. However, FH information cannot be used directly by computers, as it is usually embedded in unstructured text in electronic health records (EHRs). Extracting FH information from clinical text therefore requires natural language processing (NLP). The BioCreative/OHNLP 2018 challenge included a task on FH extraction (i.e., task 1) with two subtasks: (1) entity identification, identifying family members and their observations (diseases) mentioned in clinical text; and (2) family history extraction, extracting the side of family, living status, and observations of family members. For this task, we propose a system based on deep joint learning methods to extract FH information. Our system achieves the highest F1-scores of 0.8901 on subtask 1 and 0.6359 on subtask 2, respectively.


2019 ◽  
Vol 26 (12) ◽  
pp. 1584-1591 ◽  
Author(s):  
Xue Shi ◽  
Yingping Yi ◽  
Ying Xiong ◽  
Buzhou Tang ◽  
Qingcai Chen ◽  
...  

Abstract Objective Extracting clinical entities and their attributes is a fundamental task of natural language processing (NLP) in the medical domain. This task is typically recognized as 2 sequential subtasks in a pipeline, clinical entity or attribute recognition followed by entity-attribute relation extraction. One problem of pipeline methods is that errors from entity recognition are unavoidably passed to relation extraction. We propose a novel joint deep learning method to recognize clinical entities or attributes and extract entity-attribute relations simultaneously. Materials and Methods The proposed method integrates 2 state-of-the-art methods for named entity recognition and relation extraction, namely bidirectional long short-term memory with conditional random field and bidirectional long short-term memory, into a unified framework. In this method, relation constraints between clinical entities and attributes and weights of the 2 subtasks are also considered simultaneously. We compare the method with other related methods (i.e., pipeline methods and other joint deep learning methods) on an existing English corpus from SemEval-2015 and a newly developed Chinese corpus. Results Our proposed method achieves the best F1 of 74.46% on entity recognition and the best F1 of 50.21% on relation extraction on the English corpus, and 89.32% and 88.13% on the Chinese corpus, respectively, which outperform the other methods on both tasks. Conclusions The joint deep learning–based method could improve both entity recognition and relation extraction from clinical text in both English and Chinese, indicating that the approach is promising.
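The weighted combination of the two subtask losses described above can be illustrated with a toy sketch; the softmax heads, dimensions, and weight values below are assumed for illustration, not taken from the paper:

```python
import numpy as np

def cross_entropy(logits, gold):
    """Negative log-likelihood of the gold class under a softmax."""
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[gold]

# One shared encoder state feeds two task-specific heads (hypothetical weights).
rng = np.random.default_rng(1)
h = rng.normal(size=8)                        # shared token representation
W_ner = rng.normal(size=(5, 8))               # entity-tag head
W_re = rng.normal(size=(3, 8))                # relation head
loss_ner = cross_entropy(W_ner @ h, gold=2)
loss_re = cross_entropy(W_re @ h, gold=0)
w_ner, w_re = 0.6, 0.4                        # subtask weights (assumed values)
total = w_ner * loss_ner + w_re * loss_re     # joint training objective
```

Because both heads read the same representation, gradients from either subtask update the shared encoder, which is what distinguishes joint training from a pipeline.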


2020 ◽  
Author(s):  
Maciej Rybinski ◽  
Xiang Dai ◽  
Sonit Singh ◽  
Sarvnaz Karimi ◽  
Anthony Nguyen

BACKGROUND The prognosis, diagnosis, and treatment of many genetic disorders and familial diseases significantly improve if the family history (FH) of a patient is known. Such information is often written in the free text of clinical notes. OBJECTIVE The aim of this study is to develop automated methods that enable access to FH data through natural language processing. METHODS We performed information extraction by using transformers to extract disease mentions from notes. We also experimented with rule-based methods for extracting family member (FM) information from text and coreference resolution techniques. We evaluated different transfer learning strategies to improve the annotation of diseases. We provided a thorough error analysis of the contributing factors that affect such information extraction systems. RESULTS Our experiments showed that the combination of domain-adaptive pretraining and intermediate-task pretraining achieved an F1 score of 81.63% for the extraction of diseases and FMs from notes when it was tested on a public shared task data set from the National Natural Language Processing Clinical Challenges (N2C2), providing a statistically significant improvement over the baseline (P<.001). In comparison, in the 2019 N2C2/Open Health Natural Language Processing Shared Task, the median F1 score of all 17 participating teams was 76.59%. CONCLUSIONS Our approach, which leverages a state-of-the-art named entity recognition model for disease mention detection coupled with a hybrid method for FM mention detection, achieved an effectiveness that was close to that of the top 3 systems participating in the 2019 N2C2 FH extraction challenge, with only the top system convincingly outperforming our approach in terms of precision.


2020 ◽  
Author(s):  
Feichen Shen ◽  
Sijia Liu ◽  
Sunyang Fu ◽  
Yanshan Wang ◽  
Samuel Henry ◽  
...  

BACKGROUND As a risk factor for many diseases, family history captures both shared genetic variations and living environments among family members. Though there are several systems focusing on family history extraction (FHE) using natural language processing (NLP) techniques, the evaluation protocol for such systems has not been standardized. OBJECTIVE The n2c2/OHNLP 2019 FHE track aims to encourage community efforts toward a standard evaluation and system development for FHE from synthetic clinical narratives. METHODS We organized the first BioCreative/OHNLP FHE shared task in 2018. We continued the shared task in 2019 in collaboration with the n2c2 and OHNLP consortium and organized the 2019 n2c2/OHNLP FHE track. The shared task comprises two subtasks. Subtask 1 focuses on identifying family member entities and clinical observations (diseases), and subtask 2 expects the associations of living status, side of family, and clinical observations with family members to be extracted. Subtask 2 is an end-to-end task that builds on the results of subtask 1. We manually curated a de-identified corpus from family history sections of clinical notes at Mayo Clinic Rochester, the content of which is highly relevant to patients' family history. RESULTS 17 teams from all over the world participated in the n2c2/OHNLP FHE shared task; 38 runs were submitted for subtask 1 and 21 runs for subtask 2. For subtask 1, the top three runs were generated by Harbin Institute of Technology, ezDI, Inc., and the Medical University of South Carolina with F1 scores of 0.8745, 0.8225, and 0.8130, respectively. For subtask 2, the top three runs were from Harbin Institute of Technology, ezDI, Inc., and the University of Florida with F1 scores of 0.6810, 0.6586, and 0.6544, respectively. The workshop was held in conjunction with the AMIA 2019 Fall Symposium.
CONCLUSIONS A wide variety of methods were used by different teams in both subtasks, such as BERT, CNN, Bi-LSTM, CRF, SVM, and rule-based strategies. System performances show that relation extraction from family history is a more challenging task than entity identification. The corpus we prepared, along with the shared task, has encouraged participants internationally to develop FHE systems for understanding clinical narratives. The optimal F1 scores for subtask 1 and subtask 2 were 0.8745 and 0.6810, respectively. We also observed that most of the typical errors made by the submitted systems are related to coreference resolution. The corpus can be viewed as a valuable resource for researchers working to improve systems for family history analysis.
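The rankings above rest on F1 over the extracted entities or relations. A minimal sketch of an exact-match span F1, using a hypothetical (start, end, type) span format:

```python
def span_f1(gold, pred):
    """Exact-match F1 over (start, end, type) spans."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)                       # spans matching exactly
    p = tp / len(pred) if pred else 0.0         # precision
    r = tp / len(gold) if gold else 0.0         # recall
    return 2 * p * r / (p + r) if p + r else 0.0

gold = [(0, 1, "FamilyMember"), (4, 6, "Observation")]
pred = [(0, 1, "FamilyMember"), (4, 5, "Observation")]   # one boundary error
print(round(span_f1(gold, pred), 4))  # → 0.5
```

Under exact matching, a single boundary error costs both a false positive and a false negative, which is part of why relation-level scores in such tasks trail entity-level ones.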


2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Pål H. Brekke ◽  
Taraka Rama ◽  
Ildikó Pilán ◽  
Øystein Nytrø ◽  
Lilja Øvrelid

Abstract Background The limited availability of clinical texts for Natural Language Processing purposes is hindering the progress of the field. This article investigates the use of synthetic data for the annotation and automated extraction of family history information from Norwegian clinical text. We make use of incrementally developed synthetic clinical text describing patients' family history relating to cases of cardiac disease, and present a general methodology which integrates the synthetically produced clinical statements with annotation guideline development. The resulting synthetic corpus contains 477 sentences and 6030 tokens. In this work we experimentally assess the validity and applicability of the annotated synthetic corpus using machine learning techniques, and furthermore evaluate a system trained on synthetic text on a corpus of real clinical text, consisting of de-identified records for patients with genetic heart disease. Results For entity recognition, an SVM trained on synthetic data had class-weighted precision, recall and F1-scores of 0.83, 0.81 and 0.82, respectively. For relation extraction, precision, recall and F1-scores were 0.74, 0.75 and 0.74. Conclusions A system for extraction of family history information developed on synthetic data generalizes well to real clinical notes with a small loss of accuracy. The methodology outlined in this paper may be useful in other situations where limited availability of clinical text hinders NLP tasks. Both the annotation guidelines and the annotated synthetic corpus are made freely available and as such constitute the first publicly available resource of Norwegian clinical text.


2020 ◽  
Author(s):  
Youngjun Kim ◽  
Paul M Heider ◽  
Isabel R H Lally ◽  
Stéphane M Meystre

BACKGROUND Family history information is important to assess the risk of inherited medical conditions. Natural language processing has the potential to extract this information from unstructured free-text notes to improve patient care and decision-making. We describe the end-to-end information extraction system the Medical University of South Carolina team developed when participating in the 2019 n2c2/OHNLP shared task. OBJECTIVE This task involves identifying mentions of family members and observations in electronic health record text notes, and recognizing the relations between family members, observations, and living status. Our system aims to achieve a high level of performance by integrating heuristics and advanced information extraction methods. Our efforts also include improving the performance of two subtasks by exploiting additional labeled data and clinical text-based embedding models. METHODS We present a hybrid method that combines machine learning and rule-based approaches. We implemented an end-to-end system with multiple information extraction and attribute classification components. For entity identification, we trained bidirectional long short-term memory deep learning models. These models incorporated static word embeddings and context-dependent embeddings. We created a voting ensemble that combined the predictions of all individual models. For relation extraction, we trained two relation extraction models. The first model determined the living status of each family member. The second model identified observations associated with each family member. We implemented online gradient descent models to extract related entity pairs. As part of post-challenge efforts, we used the BioCreative/OHNLP 2018 corpus and trained new models with the union of these two data sets. We also pre-trained language models using clinical notes from the MIMIC-III clinical database. RESULTS The voting ensemble achieved better performance than individual classifiers. 
In the entity identification task, the best performing system reached a precision of 78.90% and a recall of 83.84%. Our NLP system for entity identification and relation extraction ranked 3rd and 4th, respectively, in the challenge. Our end-to-end pipeline system substantially benefited from the combination of the two data sets. Compared to our official submission, the revised system yielded significantly better performance (p < 0.05), with F1-scores of 86.02% and 72.48% for entity identification and relation extraction, respectively. CONCLUSIONS We demonstrated that a hybrid model can be used to successfully extract family history information recorded in unstructured free-text notes. In this study, our approach of treating entity identification as a sequence labeling problem produced satisfactory results. Our post-challenge efforts significantly improved performance by leveraging additional labeled data and using word vector representations learned from large collections of clinical notes.
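The voting ensemble described above can be sketched as a per-token majority vote over the label sequences of the individual models; the BIO-style labels below are illustrative:

```python
from collections import Counter

def vote(predictions):
    """Majority label per token across model outputs.

    Ties go to the label seen first, per Counter.most_common ordering.
    """
    return [Counter(labels).most_common(1)[0][0] for labels in zip(*predictions)]

# Label sequences from three hypothetical taggers for the same 3-token input.
model_a = ["B-FM", "O", "B-OBS"]
model_b = ["B-FM", "O", "O"]
model_c = ["B-FM", "B-OBS", "B-OBS"]
print(vote([model_a, model_b, model_c]))  # → ['B-FM', 'O', 'B-OBS']
```

Voting over diverse models (different embeddings, different seeds) tends to smooth out individual errors, which matches the ensemble gains reported here.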


2019 ◽  
Vol 9 (19) ◽  
pp. 3945 ◽  
Author(s):  
Houssem Gasmi ◽  
Jannik Laval ◽  
Abdelaziz Bouras

Extracting cybersecurity entities and the relationships between them from online textual resources such as articles, bulletins, and blogs, and converting these resources into more structured and formal representations, has important applications in cybersecurity research and is valuable for professional practitioners. Previous works on this task were mainly based on feature-based models. Feature-based models are time-consuming and need labor-intensive feature engineering to describe the properties of entities, domain knowledge, entity context, and linguistic characteristics. Therefore, to alleviate the need for feature engineering, we propose the use of neural network models, specifically long short-term memory (LSTM) models, to accomplish the tasks of Named Entity Recognition (NER) and Relation Extraction (RE). We evaluated the proposed models on two tasks. The first task is performing NER and evaluating the results against the state-of-the-art Conditional Random Fields (CRF) method. The second task is performing RE using three LSTM models and comparing their results to assess which model is more suitable for the domain of cybersecurity. The proposed models achieved competitive performance with less feature-engineering work. We demonstrate that exploiting neural network models in cybersecurity text mining is effective and practical.


Author(s):  
Jun Xu ◽  
Zhiheng Li ◽  
Qiang Wei ◽  
Yonghui Wu ◽  
Yang Xiang ◽  
...  

Abstract Background To detect attributes of medical concepts in clinical text, a traditional method often consists of two steps: named entity recognition of attributes and then relation classification between medical concepts and attributes. Here we present a novel solution, in which attribute detection for given concepts is converted into a sequence labeling problem, so that attribute entity recognition and relation classification are done simultaneously in one step. Methods A neural architecture combining bidirectional Long Short-Term Memory networks and Conditional Random Fields (Bi-LSTM-CRF) was adopted to detect various medical concept-attribute pairs in an efficient way. We then compared our deep learning-based sequence labeling approach with traditional two-step systems for three different attribute detection tasks: disease-modifier, medication-signature, and lab test-value. Results Our results show that the proposed method achieved higher accuracy than the traditional methods for all three medical concept-attribute detection tasks. Conclusions This study demonstrates the efficacy of our sequence labeling approach using Bi-LSTM-CRF on the attribute detection task, indicating its potential to speed up practical clinical NLP applications.
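Casting attribute detection as sequence labeling means the attribute spans and their relation to the given concept are encoded in a single tag sequence. A toy sketch, with hypothetical attribute types and a simple BIO decoder (the tag scheme here is illustrative, not the paper's exact scheme):

```python
# Hypothetical example: "asthma" is the given disease concept, and the
# model tags its attributes directly, so no separate relation step is needed.
tokens = ["mild", "persistent", "asthma", "since", "childhood"]
tags   = ["B-Severity", "I-Severity", "O", "O", "B-Temporal"]

def decode_spans(tokens, tags):
    """Collect (attribute_type, text) pairs from a BIO tag sequence."""
    spans, cur_type, cur_toks = [], None, []
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):                      # new span begins
            if cur_type:
                spans.append((cur_type, " ".join(cur_toks)))
            cur_type, cur_toks = tag[2:], [tok]
        elif tag.startswith("I-") and cur_type == tag[2:]:
            cur_toks.append(tok)                      # span continues
        else:                                         # span (if any) ends
            if cur_type:
                spans.append((cur_type, " ".join(cur_toks)))
            cur_type, cur_toks = None, []
    if cur_type:
        spans.append((cur_type, " ".join(cur_toks)))
    return spans

print(decode_spans(tokens, tags))
# → [('Severity', 'mild persistent'), ('Temporal', 'childhood')]
```

Each decoded span is already tied to the target concept by construction, which is how the one-step formulation avoids pipeline error propagation.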


Author(s):  
V. A. Ivanin ◽  
E. L. Artemova ◽  
T. V. Batura ◽  
V. V. Ivanov ◽  
...  

In this paper, we present a shared task on core information extraction problems: named entity recognition and relation extraction. In contrast to popular shared tasks on related problems, we try to move away from strict academic rigor and instead model a business case. As a source of textual data we chose the corpus of Russian strategic documents, which we annotated according to our own annotation scheme. To speed up the annotation process, we exploited various active learning techniques. In total we ended up with more than two hundred annotated documents; thus we managed to create a high-quality data set in a short time. The shared task consisted of three tracks, devoted to 1) named entity recognition, 2) relation extraction, and 3) joint named entity recognition and relation extraction. We provided the annotated texts as well as a set of unannotated texts, which could be used in any way to improve solutions. In the paper we overview and compare the solutions submitted by the shared task participants. We release both raw and annotated corpora along with annotation guidelines, evaluation scripts and results at https://github.com/dialogue-evaluation/RuREBus.
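One common active learning technique of the kind used to speed up annotation here is uncertainty sampling: send the examples the current model is least sure about to the annotators first. A minimal sketch, with illustrative document IDs and class distributions:

```python
import math

def entropy(probs):
    """Shannon entropy of a class distribution (higher = more uncertain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_for_annotation(pool, k):
    """Rank unlabeled examples by model uncertainty and take the top k."""
    return sorted(pool, key=lambda ex: entropy(ex[1]), reverse=True)[:k]

# (doc_id, model's predicted class distribution) — values are illustrative.
pool = [("d1", [0.9, 0.05, 0.05]),   # confident prediction
        ("d2", [0.34, 0.33, 0.33]),  # near-uniform, most uncertain
        ("d3", [0.6, 0.3, 0.1])]
print([doc for doc, _ in select_for_annotation(pool, 2)])  # → ['d2', 'd3']
```

Annotating the most uncertain examples first tends to improve the model faster per labeled document than random selection.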


2015 ◽  
Vol 8 (2) ◽  
pp. 1-15 ◽  
Author(s):  
Aicha Ghoulam ◽  
Fatiha Barigou ◽  
Ghalem Belalem

Information Extraction (IE) is a natural language processing (NLP) task whose aim is to analyse texts written in natural language to extract structured and useful information such as named entities and semantic relations between them. Information extraction is an important task in a diverse set of applications, such as bio-medical literature mining, customer care, community websites, and personal information management. In this paper, the authors focus only on information extraction from clinical reports. The two most fundamental tasks in information extraction are discussed, namely named entity recognition and relation extraction. The authors give details about the most used rule/pattern-based and machine learning techniques for each task. They also compare these techniques and summarize the advantages and disadvantages of each one.
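A rule/pattern-based approach to the named entity recognition task surveyed here can be as simple as a lexicon-driven regular expression; the family-member lexicon and pattern below are a toy assumption, not taken from any of the systems above:

```python
import re

# Toy rule-based recognizer for family member mentions in clinical text.
FAMILY_TERMS = r"(?:mother|father|sister|brother|aunt|uncle|grandmother|grandfather)"
PATTERN = re.compile(rf"\b(?:maternal|paternal)?\s*{FAMILY_TERMS}\b", re.IGNORECASE)

def extract_family_members(text):
    """Return family member mentions, keeping an optional side-of-family modifier."""
    return [m.group(0).strip() for m in PATTERN.finditer(text)]

note = "Her maternal grandmother had diabetes; her father died of MI."
print(extract_family_members(note))  # → ['maternal grandmother', 'father']
```

Such patterns are precise but brittle, which is exactly the trade-off against machine learning techniques that the survey discusses.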

