Extraction of Family History Information From Clinical Notes: Deep Learning and Heuristics Approach

10.2196/22898 ◽  
2020 ◽  
Vol 8 (12) ◽  
pp. e22898
Author(s):  
João Figueira Silva ◽  
João Rafael Almeida ◽  
Sérgio Matos

Background: Electronic health records store large amounts of patient clinical data. Despite efforts to structure patient data, clinical notes containing rich patient information remain stored as free text, which greatly limits their exploitation. This includes family history, which is highly relevant for applications such as diagnosis and prognosis.
Objective: This study aims to develop automatic strategies for annotating family history information in clinical notes, focusing not only on the extraction of relevant entities such as family members and disease mentions but also on the extraction of relations between the identified entities.
Methods: This study extends a previous contribution to the 2019 National NLP Clinical Challenges (n2c2) track on family history extraction by improving a previously developed rule-based engine, using deep learning (DL) approaches for the extraction of entities from clinical notes, and combining both approaches in a hybrid end-to-end system that extracts family member and observation entities and the relations between those entities. Furthermore, this study analyzes the impact of factors such as the use of external resources and different types of embeddings on the performance of DL models.
Results: The approaches were evaluated on two tasks: entity extraction and relation extraction. The proposed DL approach improved observation extraction, obtaining F1 scores of 0.8688 and 0.7907 in the training and test sets, respectively. However, DL approaches have limitations in the extraction of family members. The rule-based engine was adjusted to generalize better and achieved family member extraction F1 scores of 0.8823 and 0.8092 in the training and test sets, respectively. The resulting hybrid system obtained F1 scores of 0.8743 and 0.7979 in the training and test sets, respectively. For the second task, the original evaluator was adjusted to perform a more exact evaluation, and the hybrid system obtained F1 scores of 0.6480 and 0.5082 in the training and test sets, respectively.
Conclusions: We evaluated the impact of several factors on the performance of DL models, and we present an end-to-end system for extracting family history information from clinical notes, which can help in the structuring and reuse of this type of information. The final hybrid solution is provided in a publicly available code repository.
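
The F1 scores reported above combine precision and recall into a single figure. As a minimal sketch of how such a score can be computed at the entity level, the Python snippet below compares a set of predicted annotations against a gold set. The function name, the annotation format (entity type plus normalized text), and the example data are assumptions made for illustration; they are not taken from the paper or from the challenge evaluator.

```python
def precision_recall_f1(gold, predicted):
    """Micro precision, recall and F1 over two sets of annotations."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical annotations: (entity type, normalized text) pairs.
gold = {
    ("FamilyMember", "mother"),
    ("FamilyMember", "brother"),
    ("Observation", "breast cancer"),
    ("Observation", "diabetes"),
}
predicted = {
    ("FamilyMember", "mother"),
    ("Observation", "breast cancer"),
    ("Observation", "hypertension"),
}

precision, recall, f1 = precision_recall_f1(gold, predicted)
print(f"P={precision:.4f} R={recall:.4f} F1={f1:.4f}")
# prints: P=0.6667 R=0.5000 F1=0.5714
```

A stricter evaluator would add more fields to each tuple (for example, character offsets or the side of the family), so that a prediction counts as correct only when all of them match.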

2020 ◽  
Author(s):  
Youngjun Kim ◽  
Paul M Heider ◽  
Isabel R H Lally ◽  
Stéphane M Meystre

BACKGROUND: Family history information is important to assess the risk of inherited medical conditions. Natural language processing (NLP) has the potential to extract this information from unstructured free-text notes to improve patient care and decision-making. We describe the end-to-end information extraction system that the Medical University of South Carolina team developed for the 2019 n2c2/OHNLP shared task.
OBJECTIVE: This task involves identifying mentions of family members and observations in electronic health record text notes, and recognizing the relations between family members, observations, and living status. Our system aims to achieve a high level of performance by integrating heuristics and advanced information extraction methods. Our efforts also include improving the performance of two subtasks by exploiting additional labeled data and clinical text-based embedding models.
METHODS: We present a hybrid method that combines machine learning and rule-based approaches. We implemented an end-to-end system with multiple information extraction and attribute classification components. For entity identification, we trained bidirectional long short-term memory deep learning models. These models incorporated static word embeddings and context-dependent embeddings. We created a voting ensemble that combined the predictions of all individual models. For relation extraction, we trained two models. The first model determined the living status of each family member. The second model identified observations associated with each family member. We implemented online gradient descent models to extract related entity pairs. As part of post-challenge efforts, we used the BioCreative/OHNLP 2018 corpus and trained new models with the union of these two data sets. We also pre-trained language models using clinical notes from the MIMIC-III clinical database.
RESULTS: The voting ensemble achieved better performance than individual classifiers. In the entity identification task, the best performing system reached a precision of 78.90% and a recall of 83.84%. Our NLP system for entity identification and relation extraction ranked third and fourth, respectively, in the challenge. Our end-to-end pipeline system substantially benefited from the combination of the two data sets. Compared to our official submission, the revised system yielded significantly better performance (p < 0.05), with F1-scores of 86.02% and 72.48% for entity identification and relation extraction, respectively.
CONCLUSIONS: We demonstrated that a hybrid model could be used to successfully extract family history information recorded in unstructured free-text notes. In this study, our approach of treating entity identification as a sequence labeling problem produced satisfactory results. Our post-challenge efforts significantly improved performance by leveraging additional labeled data and using word vector representations learned from large collections of clinical notes.
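
The abstract above mentions a voting ensemble over several sequence labeling models but does not say how the votes are combined. The Python sketch below shows one simple possibility, a per-token plurality vote. The BIO-style tag names, the tie-breaking rule, and the three example prediction lists are assumptions made for illustration and do not come from the paper.

```python
from collections import Counter


def majority_vote(model_predictions):
    """Combine per-token label sequences from several models by plurality vote.

    Ties are broken in favor of the model listed first (an assumption).
    """
    if not model_predictions:
        return []
    length = len(model_predictions[0])
    if any(len(seq) != length for seq in model_predictions):
        raise ValueError("all models must label the same tokens")
    voted = []
    for labels in zip(*model_predictions):
        counts = Counter(labels)
        best = max(counts.values())
        # The first label (in model order) that reaches the top count wins.
        voted.append(next(label for label in labels if counts[label] == best))
    return voted


# Hypothetical predictions for the tokens: "Mother had breast cancer"
model_a = ["B-FamilyMember", "O", "B-Observation", "I-Observation"]
model_b = ["B-FamilyMember", "O", "O", "B-Observation"]
model_c = ["O", "O", "B-Observation", "I-Observation"]

print(majority_vote([model_a, model_b, model_c]))
# prints: ['B-FamilyMember', 'O', 'B-Observation', 'I-Observation']
```

Voting token by token can produce label sequences that no single model predicted, so a practical system might add a repair step for invalid tag transitions; that step is omitted here.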


2011 ◽  
Vol 42 (5) ◽  
pp. 296-308
Author(s):  
Ridgely Fisk Green ◽  
Joan Ehrhardt ◽  
Margaret F. Ruttenber ◽  
Richard S. Olney

1991 ◽  
Vol 133 (8) ◽  
pp. 757-765 ◽  
Author(s):  
Pamela H. Phillips ◽  
Martha S. Linet ◽  
Emily L. Harris

2021 ◽  
Vol 297 ◽  
pp. 01072
Author(s):  
Rajae Bensoltane ◽  
Taher Zaki

Aspect category detection (ACD) is a subtask of aspect-based sentiment analysis (ABSA) that aims to identify, from a predefined list of categories, the category discussed in a given review or sentence. ABSA tasks have been widely studied in English; however, studies in low-resource languages such as Arabic are still limited. Moreover, most existing Arabic ABSA work relies on rule-based or feature-based machine learning models, which require tedious feature engineering and the use of external resources such as lexicons. Therefore, the aim of this paper is to overcome these shortcomings by handling the ACD task with a deep learning method based on a bidirectional gated recurrent unit model. Additionally, we examine the impact of different vector representation models on the performance of the proposed model. The experimental results show that our model significantly outperforms the baseline and related work models, improving the F1-score by more than 7%.


2002 ◽  
Vol 20 (2) ◽  
pp. 528-537 ◽  
Author(s):  
Kevin M. Sweet ◽  
Terry L. Bradley ◽  
Judith A. Westman

PURPOSE: Obtaining a family history and assessing it accurately are essential for identifying families at risk for hereditary cancer. Our study compared the extent to which the family cancer history in the physician medical record reflected that entered by patients directly into a touch-screen family history computer program.
PATIENTS AND METHODS: The study cohort consisted of 362 patients seen at a comprehensive cancer center ambulatory clinic over a 1-year period who voluntarily used the computer program and were a mixture of new and return patients. The computer entry was assessed by genetics staff and then compared with the medical record for corroboration of family history information and appropriate physician risk assessment.
RESULTS: Family history information from the medical record was available for comparison to the computer entry in 69% of cases. The family history was most often completed on new patients only and not routinely updated. Of the 362 computer entries, 101 were assigned to a high-risk category. Evidence in the records confirmed 69 high-risk individuals. Documentation of physician risk assessment (ie, notation of significant family cancer history or hereditary risk) was found in only 14 of the high-risk charts. Only seven high-risk individuals (6.9%) had evidence of referral for genetic consultation.
CONCLUSION: This study demonstrates the need to collect family history information on all new and established patients in order to perform adequate cancer risk assessment. The lack of identification of patients at highest risk seems to be directly correlated with insufficient data collection, risk assessment, and documentation by medical staff.

