scholarly journals Semi-Automated Identification of Ontological Labels in the Biomedical Literature with goldi

2016 ◽  
Author(s):  
Christopher B. Cole ◽  
Sejal Patel ◽  
Leon French ◽  
Jo Knight

AbstractRecent growth in both the scale and the scope of large publicly available ontologies has spurred the development of computational methodologies which can leverage structured information to answer important questions. However, ontological labels, or “terms” have thus far proved difficult to use in practice; text mining, one crucial aspect of electronically understanding and parsing the biomedical literature, has historically had difficulty identifying “terms” in literature. In this article, we present goldi, an open source R package whose goal it is to identify terms of variable length in free form text. It is available at https://github.com/Chris1221/goldi or through CRAN. The algorithm works through identifying words or synonyms of words present in individual terms and comparing the number of present words to an acceptance function for decision making. In this article we present the theoretical rationale behind the algorithm, as well as practical advice for its usage applied to Gene Ontology term identification and quantification. We additionally detail the options available and describe their respective computational efficiencies.

2018 ◽  
Author(s):  
Orges Koci ◽  
Michael Logan ◽  
Vaios Svolos ◽  
Richard K. Russell ◽  
Konstantinos Gerasimidis ◽  
...  

With an unprecedented growth in the biomedical literature, keeping up to date with the new developments presents an immense challenge. Publications are often studied in isolation of the established literature, with interpretation being subjective and often introducing human bias. With ontology-driven annotation of biomedical data gaining popularity in recent years and online databases offering metatags with rich textual information, it is now possible to automatically text-mine ontological terms and complement the laborious task of manual management, interpretation, and analysis of the accumulated literature with downstream statistical analysis. In this paper, we have formulated an automated workflow through which we have identified ontological information, including nutrition-related terms in PubMed abstracts (from 1991 until 2016) for two main types of Inflammatory Bowel Diseases: Crohn's Disease and Ulcerative Colitis; and two other gastrointestinal diseases, namely, Coeliac Disease and Irritable Bowel Syndrome. Our analysis reveals unique clustering patterns as well as spatial and temporal trends inherent to the considered gastrointestinal diseases in terms of literature that has been accumulated so far. Although automated interpretation cannot replace human judgement, the developed workflow shows promising results and can be a useful tool in systematic literature reviews. The workflow is available at https://github.com/KociOrges/pytag .


PeerJ ◽  
2018 ◽  
Vol 6 ◽  
pp. e5047 ◽  
Author(s):  
Orges Koci ◽  
Michael Logan ◽  
Vaios Svolos ◽  
Richard K. Russell ◽  
Konstantinos Gerasimidis ◽  
...  

With an unprecedented growth in the biomedical literature, keeping up to date with the new developments presents an immense challenge. Publications are often studied in isolation of the established literature, with interpretation being subjective and often introducing human bias. With ontology-driven annotation of biomedical data gaining popularity in recent years and online databases offering metatags with rich textual information, it is now possible to automatically text-mine ontological terms and complement the laborious task of manual management, interpretation, and analysis of the accumulated literature with downstream statistical analysis. In this paper, we have formulated an automated workflow through which we have identified ontological information, including nutrition-related terms in PubMed abstracts (from 1991 to 2016) for two main types of Inflammatory Bowel Diseases: Crohn’s Disease and Ulcerative Colitis; and two other gastrointestinal (GI) diseases, namely, Coeliac Disease and Irritable Bowel Syndrome. Our analysis reveals unique clustering patterns as well as spatial and temporal trends inherent to the considered GI diseases in terms of literature that has been accumulated so far. Although automated interpretation cannot replace human judgement, the developed workflow shows promising results and can be a useful tool in systematic literature reviews. The workflow is available at https://github.com/KociOrges/pytag.


2019 ◽  
Vol 18 (4) ◽  
pp. 439-446 ◽  
Author(s):  
James T. Jarc

This article presents a theoretical rationale and practical advice for using wiki collaborative authorship technology in a communication or media classroom. The author’s primary thesis is that the use of a wiki in a course helps students develop digital literacy and subject matter expertise, while simultaneously participating in a specific classroom culture that is fostered in part by the use of the wiki. That ethos, the author suggests, challenges students to think critically, work effectively with others, value transparency and accountability, and practice co-created learning. Finally, the article includes some historical background for the platform, along with a cursory overview of the technical aspects of the platform.


2014 ◽  
Vol 12 (06) ◽  
pp. 1442008 ◽  
Author(s):  
Jung-Hsien Chiang ◽  
Jiun-Huang Ju

Protein–protein interactions (PPIs) are involved in the majority of biological processes. Identification of PPIs is therefore one of the key aims of biological research. Although there are many databases of PPIs, many other unidentified PPIs could be buried in the biomedical literature. Therefore, automated identification of PPIs from biomedical literature repositories could be used to discover otherwise hidden interactions. Search engines, such as Google, have been successfully applied to measure the relatedness among words. Inspired by such approaches, we propose a novel method to identify PPIs through semantic similarity measures among protein mentions. We define six semantic similarity measures as features based on the page counts retrieved from the MEDLINE database. A machine learning classifier, Random Forest, is trained using the above features. The proposed approach achieve an averaged micro-F of 71.28% and an averaged macro-F of 64.03% over five PPI corpora, an improvement over the results of using only the conventional co-occurrence feature (averaged micro-F of 68.79% and an averaged macro-F of 60.49%). A relation-word reinforcement further improves the averaged micro-F to 71.3% and averaged macro-F to 65.12%. Comparing the results of the current work with other studies on the AIMed corpus (ranging from 77.58% to 85.1% in micro-F, 62.18% to 76.27% in macro-F), we show that the proposed approach achieves micro-F of 81.88% and macro-F of 64.01% without the use of sophisticated feature extraction. Finally, we manually examine the newly discovered PPI pairs based on a literature review, and the results suggest that our approach could extract novel protein–protein interactions.


2021 ◽  
Author(s):  
Anthony Xu ◽  
Aravind Venkateswaran ◽  
Lianguizi Zhou ◽  
Andreas Zankl

Knowledge curation from the biomedical literature is very valuable but can be a repetitive and laborious process. The paucity of user-friendly tools is one of the reasons for the lack of widespread adoption of good biomedical knowledge curation practices. Here we present Ontoclick, a Chrome web browser extension that streamlines the process of annotating a text span with a relevant ontology term. We hope this tool will make biocuration more accessible to a wider audience of biomedical researchers. Ontoclick is freely available under the GPL-3.0 license on the Chrome Web Store. Source code and documentation are available at: https://github.com/azankl/Ontoclick Contact: [email protected]


2019 ◽  
Author(s):  
Alice Yue ◽  
Cedric Chauve ◽  
Maxwell Libbrecht ◽  
Ryan R. Brinkman

AbstractWe introduce a new cell population score called SpecEnr (specific enrichment) and describe a method that discovers robust and accurate candidate biomarkers from flow cytometry data. Our approach identifies a new class of candidate biomarkers we define as driver cell populations, whose abundance is associated with a sample class (e.g. disease), but not as a result of a change in a related population. We show that the driver cell populations we find are also easily interpretable using a lattice-based visualization tool. Our method is implemented in the R package flowGraph, freely available on GitHub (github.com/aya49/flowGraph) and will be available BioConductor.


2021 ◽  
Author(s):  
Helena Balabin ◽  
Charles Tapley Hoyt ◽  
Colin Birkenbihl ◽  
Benjamin M. Gyori ◽  
John A. Bachman ◽  
...  

The majority of biomedical knowledge is stored in structured databases or as unstructured text in scientific publications. This vast amount of information has led to numerous machine learning-based biological applications using either text through natural language processing (NLP) or structured data through knowledge graph embedding models (KGEMs). However, representations based on a single modality are inherently limited. To generate better representations of biological knowledge, we propose STonKGs, a Sophisticated Transformer trained on biomedical text and Knowledge Graphs. This multimodal Transformer uses combined input sequences of structured information from KGs and unstructured text data from biomedical literature to learn joint representations. First, we pre-trained STonKGs on a knowledge base assembled by the Integrated Network and Dynamical Reasoning Assembler (INDRA) consisting of millions of text-triple pairs extracted from biomedical literature by multiple NLP systems. Then, we benchmarked STonKGs against two baseline models trained on either one of the modalities (i.e., text or KG) across eight different classification tasks, each corresponding to a different biological application. Our results demonstrate that STonKGs outperforms both baselines, especially on the more challenging tasks with respect to the number of classes, improving upon the F1-score of the best baseline by up to 0.083. Additionally, our pre-trained model as well as the model architecture can be adapted to various other transfer learning applications. Finally, the source code and pre-trained STonKGs models are available at https://github.com/stonkgs/stonkgs and https://huggingface.co/stonkgs/stonkgs-150k.


2004 ◽  
Vol 37 (6) ◽  
pp. 512-526 ◽  
Author(s):  
Michael Krauthammer ◽  
Goran Nenadic

2018 ◽  
Vol 6 ◽  
pp. e21282 ◽  
Author(s):  
Maria Mora ◽  
José Araya

Taxonomic literature keeps records of the planet's biodiversity and gives access to the knowledge needed for its sustainable management. Unfortunately, most of the taxonomic information is available in scientific publications in text format. The amount of publications generated is very large; therefore, to process it in order to obtain high structured texts would be complex and very expensive. Approaches like citizen science may help the process by selecting whole fragments of texts dealing with morphological descriptions; but a deeper analysis, compatible with accepted ontologies, will require specialised tools. The Biodiversity Heritage Library (BHL) estimates that there are more than 120 million pages published in over 5.4 million books since 1469, plus about 800,000 monographs and 40,000 journal titles (12,500 of these are current titles).It is necessary to develop standards and software tools to extract, integrate and publish this information into existing free and open access repositories of biodiversity knowledge to support science, education and biodiversity conservation.This document presents an algorithm based on computational linguistics techniques to extract structured information from morphological descriptions of plants written in Spanish. The developed algorithm is based on the work of Dr. Hong Cui from the University of Arizona; it uses semantic analysis, ontologies and a repository of knowledge acquired from the same descriptions. The algorithm was applied to the books Trees of Costa Rica Volume III (TCRv3), Trees of Costa Rica Volume IV (TCRv4) and to a subset of descriptions of the Manual of Plants of Costa Rica (MPCR) with very competitive results (more than 92.5% of average performance). The system receives the morphological descriptions in tabular format and generates XML documents. The XML schema allows documenting structures, characters and relations between characters and structures. Each extracted object is associated with attributes like name, value, modifiers, restrictions, ontology term id, amongst other attributes.The implemented tool is free software. It was developed using Java and integrates existing technology as FreeLing, the Plant Ontology (PO), the Plant Glossary, the Ontology Term Organizer (OTO) and the Flora Mesoamericana English-Spanish Glossary.


PeerJ ◽  
2021 ◽  
Vol 9 ◽  
pp. e11071
Author(s):  
Joshua L. Schoenbachler ◽  
Jacob J. Hughey

PubMed is an invaluable resource for the biomedical community. Although PubMed is freely available, the existing API is not designed for large-scale analyses and the XML structure of the underlying data is inconvenient for complex queries. We developed an R package called pmparser to convert the data in PubMed to a relational database. Our implementation of the database, called PMDB, currently contains data on over 31 million PubMed Identifiers (PMIDs) and is updated regularly. Together, pmparser and PMDB can enable large-scale, reproducible, and transparent analyses of the biomedical literature. pmparser is licensed under GPL-2 and available at https://pmparser.hugheylab.org. PMDB is available in both PostgreSQL (DOI 10.5281/zenodo.4008109) and Google BigQuery (https://console.cloud.google.com/bigquery?project=pmdb-bq&d=pmdb).


Sign in / Sign up

Export Citation Format

Share Document