Inferring Drug-Protein–Side Effect Relationships from Biomedical Text

Min Song; Seung Han Baek; Go Eun Heo; Jeong-Hoon Lee

doi:10.3390/genes10020159

Inferring Drug-Protein–Side Effect Relationships from Biomedical Text

Genes ◽

10.3390/genes10020159 ◽

2019 ◽

Vol 10 (2) ◽

pp. 159 ◽

Cited By ~ 4

Author(s):

Min Song ◽

Seung Han Baek ◽

Go Eun Heo ◽

Jeong-Hoon Lee

Keyword(s):

Side Effects ◽

Text Mining ◽

Semantic Similarity ◽

Side Effect ◽

Relation Extraction ◽

Ranking Function ◽

Entity Recognition ◽

Free Text ◽

Pubmed Database ◽

Biomedical Texts

Background: Although there are many studies of drugs and their side effects, the underlying mechanisms of these side effects are not well understood. It is also difficult to understand the specific pathways between drugs and side effects. Objective: The present study seeks to construct putative paths between drugs and their side effects by applying text-mining techniques to free text of biomedical studies, and to develop ranking metrics that could identify the most likely paths. Materials and Methods: We extracted three types of relationships (drug-protein, protein-protein, and protein–side effect) from biomedical texts by using text mining and predefined relation-extraction rules. Based on the extracted relationships, we constructed whole drug-protein–side effect paths. For each path, we calculated its ranking score with a new ranking function that combines corpus- and ontology-based semantic similarity as well as co-occurrence frequency. Results: We extracted 13 plausible biomedical paths connecting drugs and their side effects from cancer-related abstracts in the PubMed database. The top 20 paths were examined, and the proposed ranking function outperformed the other methods tested, including co-occurrence, COALS, and UMLS, as measured by P@5 through P@20. In addition, we confirmed that the paths are novel hypotheses that are worth investigating further. Discussion: The risk of side effects has been an important issue for the US Food and Drug Administration (FDA). However, the causes and mechanisms of such side effects have not been fully elucidated. This study extends previous research on understanding drug side effects by using various techniques such as Named Entity Recognition (NER), Relation Extraction (RE), and semantic similarity. Conclusion: It is not easy to reveal the biomedical mechanisms of side effects due to a huge number of possible paths. However, we automatically generated predictable paths using the proposed approach, which could provide meaningful information to biomedical researchers to generate plausible hypotheses for the understanding of such mechanisms.
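The path construction and ranking described in this abstract can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration: the entity names, the choice to chain exactly one relation of each type into a path, the equal weights in `path_score`, and the helper names themselves. The abstract does not give the actual ranking formula, so this is not the authors' function, only one plausible shape for a score that combines the three signals, plus the standard precision-at-k measure used in the evaluation.

```python
# Hypothetical extracted relations (illustrative names, not taken from the paper).
drug_protein = {("drugA", "P1"), ("drugA", "P2")}
protein_protein = {("P1", "P3"), ("P2", "P3")}
protein_side_effect = {("P3", "nausea"), ("P1", "rash")}


def build_paths(dp, pp, pse):
    """Chain drug-protein, protein-protein, and protein-side effect pairs.

    Assumption: a path uses exactly one relation of each type, giving
    (drug, protein, protein, side effect) tuples.
    """
    paths = []
    for drug, p1 in sorted(dp):
        for a, p2 in sorted(pp):
            if a != p1:
                continue
            for b, effect in sorted(pse):
                if b == p2:
                    paths.append((drug, p1, p2, effect))
    return paths


def path_score(corpus_sim, ontology_sim, cooccurrence, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Combine three signals in [0, 1] into one score (assumed weighted sum)."""
    w_corpus, w_onto, w_cooc = weights
    return w_corpus * corpus_sim + w_onto * ontology_sim + w_cooc * cooccurrence


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked items that are judged relevant."""
    return sum(1 for item in ranked[:k] if item in relevant) / k


paths = build_paths(drug_protein, protein_protein, protein_side_effect)
```

With the toy relations above, `paths` contains the two chains that end in "nausea"; the "rash" relation is unused because no protein-protein pair leads to P1.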


Mining microbe–disease interactions from literature via a transfer learning model

BMC Bioinformatics ◽

10.1186/s12859-021-04346-7 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Chengkun Wu ◽

Xinyi Xiao ◽

Canqun Yang ◽

JinXiang Chen ◽

Jiacai Yi ◽

...

Keyword(s):

Text Mining ◽

Large Scale ◽

Named Entity Recognition ◽

Learning Model ◽

Biomedical Literature ◽

Fine Tuning ◽

Entity Recognition ◽

Interaction Extraction ◽

Biomedical Texts ◽

Data Browsing

Abstract Background Interactions of microbes and diseases are of great importance for biomedical research. However, a large number of microbe–disease interactions are hidden in the biomedical literature, and structured databases for microbe–disease interactions are limited in number. In this paper, we aim to construct a large-scale database for microbe–disease interactions automatically. We attained this goal by applying text mining methods based on a deep learning model with a moderate curation cost. We also built a user-friendly web interface that allows researchers to navigate and query required information. Results Firstly, we manually constructed a gold-standard corpus and a silver-standard corpus (SSC) for microbe–disease interactions for curation. Moreover, we proposed a text mining framework for microbe–disease interaction extraction based on a pretrained model BERE. We applied named entity recognition tools to detect microbe and disease mentions from the free biomedical texts. After that, we fine-tuned the pretrained model BERE, which was originally built for drug–target interactions or drug–drug interactions, to recognize relations between targeted entities. The introduction of SSC for model fine-tuning greatly improved detection performance for microbe–disease interactions, with an average reduction in error of approximately 10%. The MDIDB website offers data browsing, custom searching for specific diseases or microbes, and batch downloading. Conclusions Evaluation results demonstrate that our method outperforms the baseline model (rule-based PKDE4J) with an average F1-score of 73.81%. For further validation, we randomly sampled nearly 1000 interactions predicted by our model and manually checked the correctness of each interaction, which gave an accuracy of 73%. The MDIDB website is freely available through http://dbmdi.com/index/


A Knowledge-Driven Approach to Extract Disease-Related Biomarkers from the Literature

BioMed Research International ◽

10.1155/2014/253128 ◽

2014 ◽

Vol 2014 ◽

pp. 1-11 ◽

Cited By ~ 31

Author(s):

À. Bravo ◽

M. Cases ◽

N. Queralt-Rosinach ◽

F. Sanz ◽

L. I. Furlong

Keyword(s):

Text Mining ◽

Named Entity Recognition ◽

Relation Extraction ◽

Recognition System ◽

Biomedical Literature ◽

Entity Recognition ◽

Scientific Publications ◽

Positive Ratio ◽

Related Information ◽

Mesh Terms

The biomedical literature represents a rich source of biomarker information. However, both the size of literature databases and their lack of standardization hamper the automatic exploitation of the information contained in these resources. Text mining approaches have proven to be useful for exploiting the information contained in scientific publications. Here, we show that a knowledge-driven text mining approach can exploit a large literature database to extract a dataset of biomarkers related to diseases covering all therapeutic areas. Our methodology takes advantage of the annotation of MEDLINE publications pertaining to biomarkers with MeSH terms, narrowing the search to specific publications and, therefore, minimizing the false positive ratio. It is based on a dictionary-based named entity recognition system and a relation extraction module. The application of this methodology resulted in the identification of 131,012 disease-biomarker associations between 2,803 genes and 2,751 diseases, and represents a valuable knowledge base for those interested in disease-related biomarkers. Additionally, we present a bibliometric analysis of the journals reporting biomarker-related information during the last 40 years.
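A dictionary-based named entity recognition step followed by relation extraction, as named in this abstract, might look roughly like the Python sketch below. The two dictionaries, the concept labels, and the sentence-level co-occurrence rule are all assumptions chosen to keep the example small. The abstract does not say how the relation extraction module decides that a gene and a disease are associated, so co-occurrence is a stand-in here, not the published method.

```python
import re

# Hypothetical toy dictionaries mapping surface forms to normalized concepts.
GENE_TERMS = {"brca1": "BRCA1", "tp53": "TP53"}
DISEASE_TERMS = {"breast cancer": "Breast Neoplasms"}


def find_mentions(sentence, dictionary):
    """Return (start, end, concept) for every whole-word dictionary match."""
    lowered = sentence.lower()
    found = []
    for term, concept in dictionary.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, lowered):
            found.append((match.start(), match.end(), concept))
    return sorted(found)


def cooccurrence_pairs(sentence):
    """Pair every gene with every disease mentioned in the same sentence.

    Assumption: plain co-occurrence stands in for a real relation
    extraction module.
    """
    genes = {concept for _, _, concept in find_mentions(sentence, GENE_TERMS)}
    diseases = {concept for _, _, concept in find_mentions(sentence, DISEASE_TERMS)}
    return sorted((gene, disease) for gene in genes for disease in diseases)


pairs = cooccurrence_pairs("BRCA1 is a biomarker for breast cancer.")
```

Restricting the input to publications already annotated with relevant MeSH terms, as the abstract describes, would happen before this step and is what keeps such a simple matcher from producing too many false positives.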


Data Processing and Text Mining Technologies on Electronic Medical Records: A Review

Journal of Healthcare Engineering ◽

10.1155/2018/4302425 ◽

2018 ◽

Vol 2018 ◽

pp. 1-9 ◽

Cited By ~ 35

Author(s):

Wencheng Sun ◽

Zhiping Cai ◽

Yangyang Li ◽

Fang Liu ◽

Shengqun Fang ◽

...

Keyword(s):

Data Mining ◽

Text Mining ◽

Large Scale ◽

Named Entity Recognition ◽

Relation Extraction ◽

Entity Recognition ◽

Processing Technologies ◽

Research Issues ◽

Depth Study ◽

Integration Data

Currently, medical institutes generally use EMR to record patients’ conditions, including diagnostic information, procedures performed, and treatment results. EMR has been recognized as a valuable resource for large-scale analysis. However, EMR data are diverse, incomplete, redundant, and privacy-sensitive, which makes it difficult to carry out data mining and analysis directly. Therefore, it is necessary to preprocess the source data in order to improve data quality and the data mining results. Different types of data require different processing technologies. Most structured data need classic preprocessing technologies, including data cleansing, data integration, data transformation, and data reduction. Semistructured or unstructured data, such as medical text, contain more health information and require more complex and challenging processing methods. The task of information extraction for medical texts mainly includes NER (named-entity recognition) and RE (relation extraction). This paper focuses on the process of EMR processing and analyzes the key techniques in detail. In addition, we study in depth the applications developed based on text mining, together with the open challenges and research issues for future work.


Deep learning with language models improves named entity recognition for PharmaCoNER

BMC Bioinformatics ◽

10.1186/s12859-021-04260-y ◽

2021 ◽

Vol 22 (S1) ◽

Author(s):

Cong Sun ◽

Zhihao Yang ◽

Lei Wang ◽

Yin Zhang ◽

Hongfei Lin ◽

...

Keyword(s):

Deep Learning ◽

Language Processing ◽

Domain Knowledge ◽

Named Entity Recognition ◽

Model Performance ◽

Relation Extraction ◽

Entity Recognition ◽

Language Models ◽

Named Entity ◽

Biomedical Texts

Abstract Background The recognition of pharmacological substances, compounds and proteins is essential for biomedical relation extraction, knowledge graph construction, drug discovery, as well as medical question answering. Although considerable efforts have been made to recognize biomedical entities in English texts, to date, only a few limited attempts have been made to recognize them in biomedical texts in other languages. PharmaCoNER is a named entity recognition challenge to recognize pharmacological entities from Spanish texts. Because there are currently abundant resources in the field of natural language processing, how to leverage these resources for the PharmaCoNER challenge is a question worth studying. Methods Inspired by the success of deep learning with language models, we compare and explore various representative BERT models to promote the development of the PharmaCoNER task. Results The experimental results show that deep learning with language models can effectively improve model performance on the PharmaCoNER dataset. Our method achieves state-of-the-art performance on the PharmaCoNER dataset, with a max F1-score of 92.01%. Conclusion For the BERT models on the PharmaCoNER dataset, biomedical domain knowledge has a greater impact on model performance than the native language (i.e., Spanish). The BERT models can obtain competitive performance by using WordPiece to alleviate the out-of-vocabulary limitation. The performance of the BERT model can be further improved by constructing a specific vocabulary based on domain knowledge. Moreover, the character case also has a certain impact on model performance.
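The role of WordPiece in easing the out-of-vocabulary problem can be seen in a small Python sketch of greedy longest-match-first subword splitting, the general scheme used by BERT-style tokenizers. The toy vocabulary and the example word are assumptions chosen for illustration, and the function is a simplified stand-in, not the tokenizer used in the paper.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into subword pieces by greedy longest-match-first.

    Pieces after the first carry the "##" continuation prefix. If some
    position cannot be matched, the whole word maps to the unknown token.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary; a domain-specific one would hold more drug-like pieces.
toy_vocab = {"para", "##ce", "##tam", "##ol"}
tokens = wordpiece("paracetamol", toy_vocab)
```

A word absent from the vocabulary as a whole is still represented by known pieces, which is why a vocabulary built from domain text (so that pharmacological fragments exist as pieces) can matter for performance.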


Automatic extraction of microorganisms and their habitats from free text using text mining workflows

Journal of Integrative Bioinformatics ◽

10.1515/jib-2011-184 ◽

2011 ◽

Vol 8 (2) ◽

pp. 176-186 ◽

Cited By ~ 2

Author(s):

BalaKrishna Kolluru ◽

Sirintra Nakjang ◽

Robert P. Hirt ◽

Anil Wipat ◽

Sophia Ananiadou

Keyword(s):

Text Mining ◽

Random Field ◽

Conditional Random Field ◽

Relation Extraction ◽

Free Text ◽

Automatic Extraction

Summary In this paper, we illustrate the use of text mining workflows to automatically extract instances of microorganisms and their habitats from free text; these entries can then be curated and added to different databases. To this end, we use a Conditional Random Field (CRF) based classifier, as part of the workflows, to extract mentions of microorganisms and habitats and the relations between organisms and their habitats. Results indicate good performance for the extraction of microorganisms and for the relation extraction aspects of the task (with a precision of over 80%), while habitat recognition is only moderate (a precision of about 65%). We also conjecture that pdf-to-text conversion can be quite noisy and that this implicitly affects any sentence-based relation extraction algorithm.


Enriching contextualized language model from knowledge graph for biomedical information extraction

Briefings in Bioinformatics ◽

10.1093/bib/bbaa110 ◽

2020 ◽

Author(s):

Hao Fei ◽

Yafeng Ren ◽

Yue Zhang ◽

Donghong Ji ◽

Xiaohui Liang

Keyword(s):

Information Extraction ◽

Large Scale ◽

Language Model ◽

Relation Extraction ◽

Event Extraction ◽

Entity Recognition ◽

Language Models ◽

Training Procedure ◽

Biomedical Knowledge ◽

Biomedical Texts

Abstract Biomedical information extraction (BioIE) is an important task. The aim is to analyze biomedical texts and extract structured information such as named entities and semantic relations between them. In recent years, pre-trained language models have largely improved the performance of BioIE. However, they neglect to incorporate external structural knowledge, which can provide rich factual information to support the underlying understanding and reasoning for biomedical information extraction. In this paper, we first evaluate current extraction methods, including vanilla neural networks, general language models and pre-trained contextualized language models, on biomedical information extraction tasks, including named entity recognition, relation extraction and event extraction. We then propose to enrich a contextualized language model by integrating large-scale biomedical knowledge graphs (namely, BioKGLM). In order to effectively encode knowledge, we explore a three-stage training procedure and introduce different fusion strategies to facilitate knowledge injection. Experimental results on multiple tasks show that BioKGLM consistently outperforms state-of-the-art extraction models. A further analysis proves that BioKGLM can capture the underlying relations between biomedical knowledge concepts, which are crucial for BioIE.


KLOSURE: Closing in on open–ended patient questionnaires with text mining

Journal of Biomedical Semantics ◽

10.1186/s13326-019-0215-3 ◽

2019 ◽

Vol 10 (S1) ◽

Cited By ~ 3

Author(s):

Irena Spasić ◽

David Owen ◽

Andrew Smith ◽

Kate Button

Keyword(s):

Feature Extraction ◽

Text Mining ◽

Clinical Decision Making ◽

Named Entity Recognition ◽

Clinical Decision ◽

Entity Recognition ◽

Free Text ◽

Feature Vectors ◽

Patient Questionnaires

Abstract Background Knee injury and Osteoarthritis Outcome Score (KOOS) is an instrument used to quantify patients’ perceptions about their knee condition and associated problems. It is administered as a 42-item closed-ended questionnaire in which patients are asked to self-assess five outcomes: pain, other symptoms, activities of daily living, sport and recreation activities, and quality of life. We developed KLOG as a 10-item open-ended version of the KOOS questionnaire in an attempt to obtain deeper insight into patients’ opinions including their unmet needs. However, the open-ended nature of the questionnaire incurs analytical overhead associated with the interpretation of responses. The goal of this study was to automate such analysis. We implemented KLOSURE as a system for mining free-text responses to the KLOG questionnaire. It consists of two subsystems, one concerned with feature extraction and the other with classification of feature vectors. Feature extraction is performed by a set of four modules whose main functionalities are linguistic pre-processing, sentiment analysis, named entity recognition and lexicon lookup, respectively. Outputs produced by each module are combined into feature vectors. The structure of feature vectors varies across the KLOG questions. Finally, Weka, a machine learning workbench, was used for classification of feature vectors. Results The precision of the system varied between 62.8 and 95.3%, whereas the recall varied from 58.3 to 87.6% across the 10 questions. The overall performance in terms of F-measure varied between 59.0 and 91.3% with an average of 74.4% and a standard deviation of 8.8. Conclusions We demonstrated the feasibility of mining open-ended patient questionnaires. By automatically mapping free text answers onto a Likert scale, we can effectively measure the progress of rehabilitation over time. In comparison to traditional closed-ended questionnaires, our approach offers much richer information that can be utilised to support clinical decision making. In conclusion, we demonstrated how text mining can be used to combine the benefits of qualitative and quantitative analysis of patient experiences.


Text Mining and Hub Gene Network Analysis of Endometriosis

BioMed Research International ◽

10.1155/2021/5517145 ◽

2021 ◽

Vol 2021 ◽

pp. 1-10

Author(s):

Yinuo Wang ◽

Songbiao Zhu ◽

Chengcheng Liu ◽

Haiteng Deng ◽

Zhenyu Zhang

Keyword(s):

Text Mining ◽

Interaction Analysis ◽

Named Entity Recognition ◽

Entity Recognition ◽

Hub Genes ◽

Targeted Interventions ◽

Protein Protein Interaction ◽

Named Entity ◽

Pubmed Database ◽

Gene Network Analysis

This study is aimed at systematically characterizing the endometriosis-associated genes based on text mining and at annotating the functions, pathways, and networks of endometriosis-associated hub genes. We extracted endometriosis-associated abstracts published between 1970 and 2020 from the PubMed database. A neural named entity recognition and multitype normalization tool for biomedical text mining was used to recognize and normalize the genes and proteins embedded in the abstracts. Gene Ontology and Kyoto Encyclopedia of Genes and Genomes pathway enrichment analyses were conducted to annotate the functions and pathways of recognized genes. Protein-protein interaction analysis was conducted on the genes significantly cooccurring with endometriosis to identify the endometriosis-associated hub genes. A total of 433 genes were recognized as endometriosis-associated genes (P < 0.05), and 154 pathways were significantly enriched (P < 0.05). A network of endometriosis-associated genes with 278 gene nodes and 987 interaction links was established. The 15 proteins that interacted with 20 or more other proteins were identified as the hub proteins of the endometriosis-associated protein network. This study provides novel insights into the hub genes that play key roles in the development of endometriosis and have implications for developing targeted interventions for endometriosis.
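The hub criterion stated in this abstract (a protein interacting with 20 or more other proteins) amounts to a degree threshold on the interaction network. The Python sketch below shows that idea on a toy edge list. The gene names, the `find_hubs` helper, and the lowered threshold used for the toy data are assumptions for illustration, and the actual study may have derived its network and degrees with dedicated tools.

```python
from collections import defaultdict


def find_hubs(edges, min_degree=20):
    """Return nodes that interact with at least `min_degree` distinct partners.

    Edges are treated as undirected; self-loops and duplicate edges are ignored.
    """
    neighbors = defaultdict(set)
    for a, b in edges:
        if a == b:
            continue
        neighbors[a].add(b)
        neighbors[b].add(a)
    return sorted(node for node, partners in neighbors.items() if len(partners) >= min_degree)


# Hypothetical toy network: G0 touches three partners, the others fewer.
toy_edges = [("G0", "G1"), ("G0", "G2"), ("G0", "G3"), ("G1", "G2")]
hubs = find_hubs(toy_edges, min_degree=3)
```

On the full network described above, the same call with the default `min_degree=20` would correspond to the stated cutoff.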


Cutaneous ulcers following hydroxyurea therapy

Phlebologie ◽

10.1055/s-0037-1621559 ◽

2004 ◽

Vol 33 (06) ◽

pp. 202-205 ◽

Cited By ~ 2

Author(s):

K. Hartmann ◽

S. Nagel ◽

T. Erichsen ◽

E. Rabe ◽

K. H. Grips ◽

...

Keyword(s):

Side Effects ◽

Side Effect ◽

Antineoplastic Agent ◽

Limited Time ◽

Myeloproliferative Diseases ◽

Dermatological Side Effects ◽

Chronic Myeloproliferative Diseases ◽

Hydroxyurea Therapy

Summary Hydroxyurea (HU) is usually a well-tolerated antineoplastic agent and is commonly used in the treatment of chronic myeloproliferative diseases. Dermatological side effects are frequently seen in patients receiving long-term HU therapy. Cutaneous ulcers have been reported occasionally. We report on four patients with cutaneous ulcers whilst on long-term hydroxyurea therapy for myeloproliferative diseases. In all patients we were able to reduce the dose or stop HU altogether, and their ulcers markedly improved. Our observations suggest that cutaneous ulcers should be considered as a possible side effect of long-term HU therapy and that healing of the ulcers can be achieved not only by cessation of the HU treatment, but also by reducing the dose of hydroxyurea for a limited time.


Developing a RadLex-based Named Entity Recognition Tool for Mining Textual Radiology Reports (Preprint)

10.2196/preprints.25378 ◽

2020 ◽

Author(s):

Shintaro Tsuji ◽

Andrew Wen ◽

Naoki Takahashi ◽

Hongjian Zhang ◽

Katsuhiko Ogasawara ◽

...

Keyword(s):

Named Entity Recognition ◽

Noun Phrases ◽

General Purpose ◽

Entity Recognition ◽

Free Text ◽

Clinical Text ◽

Named Entity ◽

Radiology Reports ◽

Two Measures ◽

F Measure

BACKGROUND Named entity recognition (NER) plays an important role in extracting the features of descriptions for mining free-text radiology reports. However, the performance of existing NER tools is limited because the number of entities depends on their dictionary lookup. In particular, the recognition of compound terms is very complicated because there are a variety of patterns. OBJECTIVE The objective of the study is to develop and evaluate an NER tool concerned with compound terms using the RadLex for mining free-text radiology reports. METHODS We leveraged the clinical Text Analysis and Knowledge Extraction System (cTAKES) to develop customized pipelines using both RadLex and SentiWordNet (a general-purpose dictionary, GPD). We manually annotated 400 radiology reports for compound terms (Cts) in noun phrases and used them as the gold standard for the performance evaluation (precision, recall, and F-measure). Additionally, we created a compound-term-enhanced dictionary (CtED) by analyzing false negatives (FNs) and false positives (FPs), and applied it to another 100 radiology reports for validation. We also evaluated the stem terms of compound terms by defining two measures: an occurrence ratio (OR) and a matching ratio (MR). RESULTS The F-measure of the cTAKES+RadLex+GPD was 32.2% (Precision 92.1%, Recall 19.6%) and that of the pipeline combined with the CtED was 67.1% (Precision 98.1%, Recall 51.0%). The OR indicated that the stem terms "effusion", "node", "tube", and "disease" were used frequently, but it still fell short in capturing Cts. The MR showed that 71.9% of stem terms matched those of ontologies, and RadLex improved the MR by about 22% over the cTAKES default dictionary. The OR and MR revealed that the characteristics of stem terms would have the potential to help generate synonymous phrases using ontologies. CONCLUSIONS We developed a RadLex-based customized pipeline for parsing radiology reports and demonstrated that CtED and stem term analysis have the potential to improve dictionary-based NER performance toward expanding vocabularies.
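The F-measures reported in this abstract can be checked against the stated precision and recall with the usual harmonic-mean formula, F = 2PR / (P + R). The Python sketch below does that arithmetic. It assumes the balanced F1 variant, which the abstract does not state explicitly, and the function name is illustrative.

```python
def f_measure(precision, recall):
    """Balanced F-measure (harmonic mean of precision and recall)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values in percent, as reported in the abstract.
baseline = f_measure(92.1, 19.6)   # cTAKES + RadLex + GPD
enhanced = f_measure(98.1, 51.0)   # pipeline combined with the CtED
```

Recomputing from the rounded inputs gives roughly 32.3 for the first pipeline and 67.1 for the second. The small gap to the reported 32.2 is consistent with the precision and recall having been rounded before publication. The numbers also show why the harmonic mean is used: very high precision cannot compensate for a recall below 20%.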
