Automated literature mining and hypothesis generation through a network of Medical Subject Headings

Mapping Intimacies ◽

10.1101/403667 ◽

2018 ◽

Author(s):

Stephen Joseph Wilson ◽

Angela Dawn Wilkins ◽

Matthew V. Holt ◽

Byung Kwon Choi ◽

Daniel Konecki ◽

...

Keyword(s):

Language Processing ◽

Scientific Literature ◽

Growth Factor Receptor ◽

Direct Interaction ◽

Hypothesis Generation ◽

Supervised Machine Learning ◽

Literature Mining ◽

Cancer Genes ◽

Medical Subject Headings ◽

Mesh Terms

ABSTRACT The scientific literature is vast, growing, and increasingly specialized, making it difficult to connect disparate observations across subfields. To address this problem, we sought to develop automated hypothesis generation by networking at scale the MeSH terms curated by the National Library of Medicine. The result is a Mesh Term Objective Reasoning (MeTeOR) approach that tallies associations among genes, drugs and diseases from PubMed and predicts new ones. Comparisons to reference databases and algorithms show MeTeOR tends to be more reliable. We also show that many predictions based on the literature prior to 2014 were published subsequently. In a practical application, we validated experimentally a surprising new association found by MeTeOR between novel Epidermal Growth Factor Receptor (EGFR) associations and CDK2. We conclude that MeTeOR generates useful hypotheses from the literature (http://meteor.lichtargelab.org/).

AUTHOR SUMMARY The large size and exponential expansion of the scientific literature forms a bottleneck to accessing and understanding published findings. Manual curation and Natural Language Processing (NLP) aim to address this bottleneck by summarizing and disseminating the knowledge within articles as key relationships (e.g. TP53 relates to Cancer). However, these methods compromise on either coverage or accuracy, respectively. To mitigate this compromise, we proposed using manually-assigned keywords (MeSH terms) to extract relationships from the publications and demonstrated a comparable coverage but higher accuracy than current NLP methods. Furthermore, we combined the extracted knowledge with semi-supervised machine learning to create hypotheses to guide future work and discovered a direct interaction between two important cancer genes.
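The idea of tallying associations from the terms assigned to each article can be illustrated with a minimal sketch. It only counts how often two terms are assigned to the same article; it is not the MeTeOR implementation, and the function name `cooccurrence_counts` and the toy term lists are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(articles):
    """Count how many articles list each unordered pair of terms."""
    counts = Counter()
    for terms in articles:
        # Deduplicate and sort so each pair is counted once per article
        # and always appears in the same order.
        for pair in combinations(sorted(set(terms)), 2):
            counts[pair] += 1
    return counts


# Hypothetical toy data: each inner list stands for the terms of one article.
articles = [
    ["EGFR", "CDK2", "Neoplasms"],
    ["EGFR", "Neoplasms"],
    ["CDK2", "Cell Cycle"],
]

counts = cooccurrence_counts(articles)
for pair, n in counts.most_common():
    print(pair, n)
```

Pairs with high counts are well-supported associations in this toy network; predicting new associations, as the abstract describes, would need an additional model on top of these counts.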


Modeling the co-citation dependence on semantic layers of co-cited documents

Online Information Review ◽

10.1108/oir-04-2020-0126 ◽

2021 ◽

Vol ahead-of-print (ahead-of-print) ◽

Author(s):

Maryam Yaghtin ◽

Hajar Sotudeh ◽

Alireza Nikseresht ◽

Mahdieh Mirzabeigi

Keyword(s):

Peer Review ◽

Language Processing ◽

Opinion Mining ◽

Citation Network ◽

Test Collection ◽

Medical Subject Headings ◽

Content Type ◽

Citation Frequency ◽

Mesh Terms ◽

Citation Measures

Purpose: Co-citation frequency, defined as the number of documents co-citing two articles, is considered a quantitative, and thus efficient, proxy of the subject relatedness or prestige of the co-cited articles. Despite its quantitative nature, it has been found effective in retrieving and evaluating documents, signifying its linkage with the related documents' contents. To better understand the dynamism of the citation network, the present study aims to investigate various content features giving rise to the measure. Design/methodology/approach: The present study examined the interaction of different co-citation features in explaining the co-citation frequency. The features include the co-cited works' similarities in their full-texts, Medical Subject Headings (MeSH) terms, co-citation proximity, opinions and co-citances. A test collection was built using the CITREC dataset. The data were analyzed using natural language processing (NLP) and opinion mining techniques. A linear model was developed to regress the objective and subjective content-based co-citation measures against the natural log of the co-citation frequency. Findings: The dimensions of co-citation similarity, either subjective or objective, play significant roles in predicting co-citation frequency. The model can predict about half of the co-citation variance. The interaction of co-opinionatedness and non-co-opinionatedness is the strongest factor in the model. Originality/value: This is the first study to reveal that both the objective and subjective similarities could significantly predict the co-citation frequency. The findings re-confirm the citation analysis assumption claiming the connection between the cognitive layers of cited documents and citation measures in general and the co-citation frequency in particular. Peer review: The peer review history for this article is available at https://publons.com/publon/10.1108/OIR-04-2020-0126.
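The dependent variable described above, the natural log of the co-citation frequency, can be computed as in the following minimal sketch. The reference lists are hypothetical toy data, and the regression step itself is not reproduced here.

```python
import math
from collections import Counter
from itertools import combinations


def cocitation_frequency(reference_lists):
    """Number of citing documents whose reference list contains both articles."""
    freq = Counter()
    for refs in reference_lists:
        # A pair counts once per citing document, regardless of repeats.
        for pair in combinations(sorted(set(refs)), 2):
            freq[pair] += 1
    return freq


# Hypothetical toy data: each inner list is one citing document's references.
citing_docs = [
    ["A", "B", "C"],
    ["A", "B"],
    ["B", "C"],
    ["A", "B"],
]

freq = cocitation_frequency(citing_docs)
# Natural-log transform, as used for the regression target in the abstract.
log_freq = {pair: math.log(n) for pair, n in freq.items()}
print(freq[("A", "B")], log_freq[("A", "B")])
```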


Toward Preparing a Knowledge Base to Explore Potential Drugs and Biomedical Entities Related to COVID-19: Automated Computational Approach (Preprint)

10.2196/preprints.21648 ◽

2020 ◽

Author(s):

Junaed Younus Khan ◽

Md Tawkat Islam Khondaker ◽

Iram Tazim Hoque ◽

Hamada R H Al-Absi ◽

Mohammad Saifur Rahman ◽

...

Keyword(s):

Language Processing ◽

Scientific Literature ◽

Treatment Plan ◽

Scientific Evidence ◽

Clinical Status ◽

Confidence Score ◽

Computational Approach ◽

Computational Method ◽

Literature Mining ◽

Novel Approach

BACKGROUND Novel coronavirus disease 2019 (COVID-19) is taking a huge toll on public health. Along with non-therapeutic preventive measures, scientific efforts are currently focused mainly on the development of vaccines and pharmacological treatment with existing drugs. Summarizing evidence from the scientific literature on the discovery of a treatment plan for COVID-19 under one platform would help the scientific community to explore the opportunities in a systematic fashion. OBJECTIVE The aim of this study is to explore the potential drugs and biomedical entities related to coronavirus-related diseases, including COVID-19, that are mentioned in the scientific literature through an automated computational approach. METHODS We mined the information from publicly available scientific literature and related public resources. Six topic-specific dictionaries, including human genes, human miRNAs, diseases, Protein Databank, drugs, and drug side effects, were integrated to mine all scientific evidence related to COVID-19. We employed an automated literature mining and labeling system through a novel approach to measure the effectiveness of drugs against diseases based on natural language processing, sentiment analysis, and deep learning. We also applied the concept of cosine similarity to confidently infer the associations between diseases and genes. RESULTS Based on the literature mining, we identified 1805 diseases, 2454 drugs, and 1910 genes that are related to coronavirus-related diseases including COVID-19. Integrating the extracted information, we developed the first knowledgebase platform dedicated to COVID-19, which highlights a potential list of drugs and related biomedical entities. For COVID-19, we highlighted multiple case studies on existing drugs along with a confidence score for their applicability in the treatment plan. Based on our computational method, we found Remdesivir, Statins, Dexamethasone, and Ivermectin could be considered as potentially effective drugs to improve clinical status and lower mortality in patients hospitalized with COVID-19. We also found that Hydroxychloroquine could not be considered as an effective drug for COVID-19. The resulting knowledgebase is made available as an open source tool, named COVID-19Base. CONCLUSIONS Proper investigation of the mined biomedical entities along with the identified interactions among those would help the research community to discover possible ways for the therapeutic treatment of COVID-19.
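The abstract mentions cosine similarity for inferring disease-gene associations but does not say how the vectors are built. The sketch below only shows the similarity measure itself; the vectors are hypothetical term-count profiles, an assumption made for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Convention chosen for this sketch: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical profiles: counts of three context terms around each entity.
disease_vec = [3, 0, 1]
gene_vec = [6, 0, 2]        # same direction as disease_vec, different scale
unrelated_vec = [0, 5, 0]   # shares no context terms with disease_vec

print(cosine_similarity(disease_vec, gene_vec))
print(cosine_similarity(disease_vec, unrelated_vec))
```

Because the measure depends only on direction, the first pair scores close to 1 despite the difference in magnitude, while the second pair scores 0.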


Extracting knowledge networks from plant scientific literature: potato tuber flesh color as an exemplary trait

BMC Plant Biology ◽

10.1186/s12870-021-02943-5 ◽

2021 ◽

Vol 21 (1) ◽

Author(s):

Gurnoor Singh ◽

Evangelia A. Papoutsoglou ◽

Frederique Keijts-Lalleman ◽

Bilyana Vencheva ◽

Mark Rice ◽

...

Keyword(s):

Language Processing ◽

Scientific Literature ◽

Relevant Information ◽

Hypothesis Generation ◽

Knowledge Networks ◽

Free Text ◽

Flesh Color ◽

Biological Phenomena ◽

Structured Information ◽

Biological Entities

Abstract Background: Scientific literature carries a wealth of information crucial for research, but only a fraction of it is present as structured information in databases and therefore can be analyzed using traditional data analysis tools. Natural language processing (NLP) is often and successfully employed to support humans by distilling relevant information from large corpora of free text and structuring it in a way that lends itself to further computational analyses. For this pilot, we developed a pipeline that uses NLP on biological literature to produce knowledge networks. We focused on the flesh color of potato, a well-studied trait with known associations, and we investigated whether these knowledge networks can assist us in formulating new hypotheses on the underlying biological processes. Results: We trained an NLP model based on a manually annotated corpus of 34 full-text potato articles, to recognize relevant biological entities and relationships between them in text (genes, proteins, metabolites and traits). This model detected the number of biological entities with a precision of 97.65% and a recall of 88.91% on the training set. We conducted a time series analysis on 4023 PubMed abstracts of plant genetics-based articles which focus on 4 major Solanaceous crops (tomato, potato, eggplant and capsicum), to determine that the networks contained both previously known and contemporaneously unknown leads to subsequently discovered biological phenomena relating to flesh color. A novel time-based analysis of these networks indicates a connection between our trait and a candidate gene (zeaxanthin epoxidase) already two years prior to explicit statements of that connection in the literature. Conclusions: Our time-based analysis indicates that network-assisted hypothesis generation shows promise for knowledge discovery, data integration and hypothesis generation in scientific research.
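The precision and recall figures quoted above follow the usual definitions for entity recognition. A minimal sketch, assuming exact-match scoring of (start, end, type) spans, which the abstract does not specify; the spans below are hypothetical.

```python
def precision_recall(predicted, gold):
    """Exact-match precision and recall over two collections of entity spans."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical annotations: (start offset, end offset, entity type).
gold = {(0, 4, "gene"), (10, 20, "trait"), (25, 33, "metabolite"), (40, 47, "protein")}
predicted = {(0, 4, "gene"), (10, 20, "trait"), (25, 33, "metabolite"), (50, 55, "gene")}

precision, recall = precision_recall(predicted, gold)
print(precision, recall)
```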


Toward Preparing a Knowledge Base to Explore Potential Drugs and Biomedical Entities Related to COVID-19: Automated Computational Approach

JMIR Medical Informatics ◽

10.2196/21648 ◽

2020 ◽

Vol 8 (11) ◽

pp. e21648

Author(s):

Junaed Younus Khan ◽

Md Tawkat Islam Khondaker ◽

Iram Tazim Hoque ◽

Hamada R H Al-Absi ◽

Mohammad Saifur Rahman ◽

...

Keyword(s):

Language Processing ◽

Scientific Literature ◽

Treatment Plan ◽

Scientific Evidence ◽

Clinical Status ◽

Confidence Score ◽

Computational Approach ◽

Computational Method ◽

Literature Mining ◽

Novel Approach

Background Novel coronavirus disease 2019 (COVID-19) is taking a huge toll on public health. Along with non-therapeutic preventive measures, scientific efforts are currently focused mainly on the development of vaccines and pharmacological treatment with existing drugs. Summarizing evidence from the scientific literature on the discovery of a treatment plan for COVID-19 under one platform would help the scientific community to explore the opportunities in a systematic fashion. Objective The aim of this study is to explore the potential drugs and biomedical entities related to coronavirus-related diseases, including COVID-19, that are mentioned in the scientific literature through an automated computational approach. Methods We mined the information from publicly available scientific literature and related public resources. Six topic-specific dictionaries, including human genes, human miRNAs, diseases, Protein Databank, drugs, and drug side effects, were integrated to mine all scientific evidence related to COVID-19. We employed an automated literature mining and labeling system through a novel approach to measure the effectiveness of drugs against diseases based on natural language processing, sentiment analysis, and deep learning. We also applied the concept of cosine similarity to confidently infer the associations between diseases and genes. Results Based on the literature mining, we identified 1805 diseases, 2454 drugs, and 1910 genes that are related to coronavirus-related diseases including COVID-19. Integrating the extracted information, we developed the first knowledgebase platform dedicated to COVID-19, which highlights a potential list of drugs and related biomedical entities. For COVID-19, we highlighted multiple case studies on existing drugs along with a confidence score for their applicability in the treatment plan. Based on our computational method, we found Remdesivir, Statins, Dexamethasone, and Ivermectin could be considered as potentially effective drugs to improve clinical status and lower mortality in patients hospitalized with COVID-19. We also found that Hydroxychloroquine could not be considered as an effective drug for COVID-19. The resulting knowledgebase is made available as an open source tool, named COVID-19Base. Conclusions Proper investigation of the mined biomedical entities along with the identified interactions among those would help the research community to discover possible ways for the therapeutic treatment of COVID-19.


Trends in biomedical informatics: automated topic analysis of JAMIA articles

Journal of the American Medical Informatics Association ◽

10.1093/jamia/ocv157 ◽

2015 ◽

Vol 22 (6) ◽

pp. 1153-1163 ◽

Cited By ~ 10

Author(s):

Dong Han ◽

Shuang Wang ◽

Chao Jiang ◽

Xiaoqian Jiang ◽

Hyeon-Eui Kim ◽

...

Keyword(s):

Language Processing ◽

Biomedical Informatics ◽

Tensor Decomposition ◽

Medical Subject Headings ◽

Topic Analysis ◽

Interdisciplinary Field ◽

Mesh Terms ◽

Wide Range ◽

Number Of Publications ◽

Number Of Citations

Abstract Biomedical Informatics is a growing interdisciplinary field in which research topics and citation trends have been evolving rapidly in recent years. To analyze these data in a fast, reproducible manner, automation of certain processes is needed. JAMIA is a “generalist” journal for biomedical informatics. Its articles reflect the wide range of topics in informatics. In this study, we retrieved Medical Subject Headings (MeSH) terms and citations of JAMIA articles published between 2009 and 2014. We used tensors (i.e., multidimensional arrays) to represent the interaction among topics, time and citations, and applied tensor decomposition to automate the analysis. The trends represented by tensors were then carefully interpreted and the results were compared with previous findings based on manual topic analysis. A list of most cited JAMIA articles, their topics, and publication trends over recent years is presented. The analyses confirmed previous studies and showed that, from 2012 to 2014, the number of articles related to MeSH terms Methods, Organization & Administration, and Algorithms increased significantly both in number of publications and citations. Citation trends varied widely by topic, with Natural Language Processing having a large number of citations in particular years, and Medical Record Systems, Computerized remaining a very popular topic in all years.
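As an illustration of representing topics, time and citations as a three-way array, the sketch below builds a count tensor indexed by topic, year and citation bin. It covers only the construction step, not the decomposition, and the binning threshold and records are hypothetical; the abstract does not describe the authors' actual encoding.

```python
from bisect import bisect_right


def build_count_tensor(records, topics, years, bin_edges):
    """Return tensor[topic][year][citation_bin] = number of matching articles."""
    topic_index = {t: i for i, t in enumerate(topics)}
    year_index = {y: i for i, y in enumerate(years)}
    n_bins = len(bin_edges) + 1
    tensor = [[[0] * n_bins for _ in years] for _ in topics]
    for topic, year, citations in records:
        # bisect_right maps a citation count to the bin it falls into.
        citation_bin = bisect_right(bin_edges, citations)
        tensor[topic_index[topic]][year_index[year]][citation_bin] += 1
    return tensor


topics = ["Algorithms", "Natural Language Processing"]
years = [2012, 2013, 2014]
bin_edges = [10]  # assumed split: bin 0 is fewer than 10 citations, bin 1 is 10 or more

# Hypothetical records: (MeSH topic, publication year, citation count).
records = [
    ("Algorithms", 2012, 3),
    ("Algorithms", 2014, 12),
    ("Natural Language Processing", 2013, 25),
    ("Natural Language Processing", 2013, 4),
]

tensor = build_count_tensor(records, topics, years, bin_edges)
print(tensor)
```

A decomposition routine from a numerical library would then take an array like this as input.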


Extracting Knowledge Networks from Plant Scientific Literature: Potato Tuber Flesh Color as an Exemplary Trait

10.21203/rs.3.rs-74928/v1 ◽

2020 ◽

Author(s):

Gurnoor Singh ◽

Evangelia Papoutsoglou ◽

Frederique Keijts-Lalleman ◽

Bilyana Vencheva ◽

Mark Rice ◽

...

Keyword(s):

Language Processing ◽

Scientific Literature ◽

Relevant Information ◽

Hypothesis Generation ◽

Knowledge Networks ◽

Free Text ◽

Color Analysis ◽

Flesh Color ◽

Biological Phenomena ◽

Biological Entities

Abstract Background: Scientific literature carries a wealth of information crucial for research, but only a fraction of it is present as structured information in databases and therefore can be analyzed using traditional data analysis tools. Natural language processing (NLP) is often and successfully employed to support humans by distilling relevant information from large corpora of free text and structuring it in a way that lends itself to further computational analyses. For this pilot, we developed a pipeline that uses NLP on biological literature to produce knowledge networks. We focused on the flesh color of potato, a well-studied trait with known associations, and we investigated whether these knowledge networks can assist us in formulating new hypotheses on the underlying biological processes. Results: We trained an NLP model based on a manually annotated corpus of 34 full-text potato articles, to recognize relevant biological entities and relationships between them in text (genes, proteins, metabolites and traits). This model detected the number of biological entities with a precision of 97.65% and a recall of 88.91% on the training set. We conducted a time series analysis on 4023 PubMed abstracts of plant genetics-based articles which focus on 4 major Solanaceous crops (tomato, potato, eggplant and capsicum), to determine that the networks contained both previously known and contemporaneously unknown leads to subsequently discovered biological phenomena relating to flesh color. Analysis of these networks indicates a connection between our trait and a candidate gene (zeaxanthin epoxidase) already two years prior to explicit statements of that connection in the literature. Conclusions: Our time-based analysis indicates that network-assisted hypothesis generation shows promise for knowledge discovery, data integration and hypothesis generation in scientific research.


Mol2vec: Unsupervised Machine Learning Approach with Chemical Intuition

10.26434/chemrxiv.5513581.v1 ◽

2017 ◽

Author(s):

Sabrina Jaeger ◽

Simone Fulle ◽

Samo Turk

Keyword(s):

Machine Learning ◽

Language Processing ◽

Supervised Machine Learning ◽

Learning Approach ◽

Learning Approaches ◽

Unsupervised Machine Learning ◽

Feature Representations ◽

Machine Learning Approach ◽

The Individual ◽

Vector Representations

Inspired by natural language processing techniques, we here introduce Mol2vec, which is an unsupervised machine learning approach to learn vector representations of molecular substructures. Similarly to the Word2vec models, where vectors of closely related words are in close proximity in the vector space, Mol2vec learns vector representations of molecular substructures that point in similar directions for chemically related substructures. Compounds can finally be encoded as vectors by summing up the vectors of the individual substructures and, for instance, fed into supervised machine learning approaches to predict compound properties. The underlying substructure vector embeddings are obtained by training an unsupervised machine learning approach on a so-called corpus of compounds that consists of all available chemical matter. The resulting Mol2vec model is pre-trained once, yields dense vector representations and overcomes drawbacks of common compound feature representations such as sparseness and bit collisions. The prediction capabilities are demonstrated on several compound property and bioactivity data sets and compared with results obtained for Morgan fingerprints as reference compound representation. Mol2vec can be easily combined with ProtVec, which employs the same Word2vec concept on protein sequences, resulting in a proteochemometric approach that is alignment independent and can thus also be easily used for proteins with low sequence similarities.
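The encoding step described above, summing substructure vectors into one compound vector, can be sketched as follows. This does not use the Mol2vec package; the embedding table, the substructure identifiers and the choice to skip unknown substructures are all assumptions made for illustration.

```python
def compound_vector(substructure_ids, embeddings, dim):
    """Sum the embedding vectors of a compound's substructures."""
    vec = [0.0] * dim
    for sid in substructure_ids:
        sub = embeddings.get(sid)
        if sub is None:
            # Assumption of this sketch: substructures without an embedding are skipped.
            continue
        for i, x in enumerate(sub):
            vec[i] += x
    return vec


# Hypothetical 2-dimensional embeddings keyed by substructure identifier.
embeddings = {
    "s1": [1.0, 0.5],
    "s2": [0.0, 2.0],
    "s3": [-1.0, 1.0],
}

# A compound is represented by the identifiers of its substructures (repeats allowed).
compound = ["s1", "s2", "s1", "unseen"]
vec = compound_vector(compound, embeddings, dim=2)
print(vec)
```

The resulting fixed-length vector is what a downstream supervised model would receive as its feature input.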


Liraglutide for the Treatment of Obesity: Analyzing Published Reviews

Current Pharmaceutical Design ◽

10.2174/1381612825666190701155737 ◽

2019 ◽

Vol 25 (15) ◽

pp. 1783-1790 ◽

Cited By ~ 3

Author(s):

Rosario Pastor ◽

Josep A. Tur

Keyword(s):

Weight Loss ◽

Clinical Trials ◽

Systematic Reviews ◽

Randomized Clinical Trials ◽

Statistical Tests ◽

Overweight And Obesity ◽

Cochrane Library ◽

Medical Subject Headings ◽

Mesh Terms ◽

Treatment Of Obesity

Background: Several drugs are currently approved for the treatment of obesity. The pharmacokinetics of liraglutide, as well as its use in the treatment of type 2 diabetes mellitus, have been widely described. Objective: To analyze the published systematic reviews on the use of liraglutide for the treatment of obesity. Methods: Systematic reviews were identified through MEDLINE searches, through EBSCOhost and the Cochrane Library, based on the following terms: "liraglutide" as major term and using the following Medical Subject Headings (MeSH) terms: "obesity", "overweight", "weight loss". A total of 3 systematic reviews were finally included for analysis. Results: Of the three systematic reviews selected, two included only randomized clinical trials, while the third reviewed both randomized and non-randomized clinical trials. Only one review performed statistical tests of heterogeneity and a meta-analysis, combining the results of individual studies. Another review showed the results of individual studies with odds ratios and confidence intervals, but a second one showed just the means and confidence intervals. In all studies, weight loss was registered in persons treated with liraglutide in a dose-dependent manner, reaching a plateau at the 3.0 mg dose, which was reached only in men. The most usual adverse events were gastrointestinal. Conclusion: More powerful and prospective studies are needed to assess all aspects related to liraglutide in the treatment of overweight and obesity.


The Role of Leukotrienes Inhibitors in the Management of Chronic Inflammatory Diseases

Recent Patents on Inflammation & Allergy Drug Discovery ◽

10.2174/1872213x14666200130095040 ◽

2020 ◽

Vol 14 (1) ◽

pp. 15-31

Author(s):

Deepak Meshram ◽

Khushbo Bhardwaj ◽

Charulata Rathod ◽

Gail B. Mahady ◽

Kapil K. Soni

Keyword(s):

Respiratory Diseases ◽

Inflammatory Diseases ◽

Leukotriene B4 ◽

Published Data ◽

Free Text ◽

Plant Origin ◽

Medical Subject Headings ◽

Leukotriene Antagonists ◽

Mesh Terms ◽

Mediators Of Inflammation

Background: Leukotrienes are powerful mediators of inflammation and interact with specific receptors in the target cell membrane to initiate an inflammatory response. Thus, leukotrienes (LTs) are considered to be potent mediators of inflammatory diseases including allergic rhinitis, inflammatory bowel disease and asthma. Leukotriene B4 and the series of cysteinyl leukotrienes (C4, D4, and E4) are products of arachidonic acid metabolism that cause inflammation. The cysteinyl LTs are known to increase vascular permeability, bronchoconstriction and mucus secretion. Objectives: To review the published data for leukotriene inhibitors of plant origin and the recent patents for leukotriene inhibitors, as well as their role in the management of inflammatory diseases. Methods: Published data for leukotriene antagonists of plant origin were searched from 1938 to 2019, without language restrictions, using relevant keywords in both free text and Medical Subject Headings (MeSH terms) format. Literature and patent searches in the field of leukotriene inhibitors were carried out by using numerous scientific databases including Science Direct, PubMed, MEDLINE, Google Patents, US Patents, US Patent Applications, Abstract of Japan, German Patents, European Patents, WIPO and NAPRALERT. Finally, data from these information resources were analyzed and reported in the present study. Results: Currently, numerous anti-histaminic medicines are available including chlorpheniramine, brompheniramine, cetirizine, and clemastine. Furthermore, specific leukotriene antagonists from allopathic medicines are also available, including zileuton, montelukast, pranlukast and zafirlukast, and are considered effective and safe medicines as compared to the first generation medicines. The present study reports leukotriene antagonistic agents of natural products and certain recent patents that could be an alternative medicine in the management of inflammation in respiratory diseases. Conclusion: The present study highlights recent updates on the pharmacology and patents on leukotriene antagonists in the management of inflammatory respiratory diseases.


Osteopathic empirical research: a bibliometric analysis from 1966 to 2018

BMC Complementary Medicine and Therapies ◽

10.1186/s12906-021-03366-3 ◽

2021 ◽

Vol 21 (1) ◽

Author(s):

Chantal Morin ◽

Isabelle Gaboury

Keyword(s):

Bibliometric Analysis ◽

Empirical Research ◽

Scientific Literature ◽

Case Reports ◽

Empirical Studies ◽

Body Part ◽

The United States ◽

Mesh Terms ◽

Research Designs ◽

System Body

Abstract Background Despite the increasing use of osteopathy, a manipulative complementary and alternative medicine therapy, in the general population, its efficacy continues to be debated. In this era of evidence-based practice, no studies have previously reviewed the scientific literature in the field to identify published knowledge, trends and gaps in empirical research. The aims of this bibliometric analysis are to describe characteristics of articles published on the efficacy of osteopathic interventions and to provide an overall portrait of their impacts in the scientific literature. Methods A bibliometric analysis approach was used. Articles were identified with searches using a combination of relevant MeSH terms and indexing keywords about osteopathy and research designs in MEDLINE and CINAHL databases. The following indicators were extracted: country of primary author, year of publication, journals, impact factor of the journal, number of citations, research design, participants’ age group, system/body part addressed, primary outcome, indexing keywords and types of techniques. Results A total of 389 articles met the inclusion criteria. The number of empirical studies doubled every 5 years, with the United States, Italy, Spain, and United Kingdom being the most productive countries. Twenty-three articles were cited over 100 times. Articles were published in 103 different indexed journals, but more than half (53.7%) of articles were published in one of three osteopathy-focused readership journals. Randomized controlled trials (n = 145; 37.3%) and case reports (n = 142; 36.5%) were the most common research designs. A total of 187 (48.1%) studies examined the effects of osteopathic interventions using a combination of techniques that belonged to two or all of the classic fields of osteopathic interventions (musculoskeletal, cranial, and visceral). Conclusion The number of osteopathy empirical studies increased significantly from 1980 to 2014. The productivity appears to be very much in sync with practice development and innovations; however, the articles were mainly published in osteopathic journals targeting a limited, disciplinary-focused readership.
