Semi-Automated Identification of Ontological Labels in the Biomedical Literature with goldi

Mapping Intimacies ◽

10.1101/073460 ◽

2016 ◽

Cited By ~ 1

Author(s):

Christopher B. Cole ◽

Sejal Patel ◽

Leon French ◽

Jo Knight

Keyword(s):

R Package ◽

Biomedical Literature ◽

Free Form ◽

Automated Identification ◽

Ontology Term ◽

Theoretical Rationale ◽

Practical Advice ◽

Structured Information ◽

Acceptance Function ◽

Term Identification

Recent growth in both the scale and the scope of large publicly available ontologies has spurred the development of computational methodologies which can leverage structured information to answer important questions. However, ontological labels, or “terms”, have thus far proved difficult to use in practice; text mining, one crucial aspect of electronically understanding and parsing the biomedical literature, has historically had difficulty identifying “terms” in literature. In this article, we present goldi, an open source R package whose goal is to identify terms of variable length in free form text. It is available at https://github.com/Chris1221/goldi or through CRAN. The algorithm works by identifying words, or synonyms of words, present in individual terms and comparing the number of present words to an acceptance function for decision making. In this article we present the theoretical rationale behind the algorithm, as well as practical advice for its usage applied to Gene Ontology term identification and quantification. We additionally detail the options available and describe their respective computational efficiencies.
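The matching idea described in this abstract (count how many words of a term, or their synonyms, occur in the text, then compare that count with an acceptance function) can be sketched as follows. This is a minimal illustration in Python, not the goldi API: the function names, the tokenizer, and the particular acceptance rule (short terms must match fully, longer terms may miss one word) are all assumptions made for the example.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def acceptance_threshold(n_words):
    """Hypothetical acceptance function (an assumption, not goldi's).

    Terms of one or two words must match completely; longer terms
    are accepted even if one word is missing.
    """
    return n_words if n_words <= 2 else n_words - 1

def match_terms(text, terms, synonyms=None):
    """Return the terms whose word overlap with the text is accepted.

    synonyms maps a lowercase term word to an iterable of lowercase
    alternatives that count as that word being present.
    """
    synonyms = synonyms or {}
    doc_words = tokenize(text)
    hits = []
    for term in terms:
        term_words = tokenize(term)
        present = 0
        for word in term_words:
            candidates = {word} | set(synonyms.get(word, ()))
            if candidates & doc_words:
                present += 1
        if term_words and present >= acceptance_threshold(len(term_words)):
            hits.append(term)
    return hits
```

For example, with the text "We observed cell death regulation in treated samples", the four-word term "regulation of cell death" is accepted because three of its words are present, while "immune response" is rejected. A stricter or looser acceptance function changes only `acceptance_threshold`.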

An automated identification and analysis of ontological terms in gastrointestinal diseases and nutrition-related literature provides useful insights

10.7287/peerj.preprints.26869 ◽

2018 ◽

Author(s):

Orges Koci ◽

Michael Logan ◽

Vaios Svolos ◽

Richard K. Russell ◽

Konstantinos Gerasimidis ◽

...

Keyword(s):

Temporal Trends ◽

Gastrointestinal Diseases ◽

Biomedical Literature ◽

Biomedical Data ◽

Automated Identification ◽

Online Databases ◽

Bowel Diseases ◽

Inflammatory Bowel ◽

Irritable Bowel ◽

Spatial And Temporal Trends

With an unprecedented growth in the biomedical literature, keeping up to date with the new developments presents an immense challenge. Publications are often studied in isolation of the established literature, with interpretation being subjective and often introducing human bias. With ontology-driven annotation of biomedical data gaining popularity in recent years and online databases offering metatags with rich textual information, it is now possible to automatically text-mine ontological terms and complement the laborious task of manual management, interpretation, and analysis of the accumulated literature with downstream statistical analysis. In this paper, we have formulated an automated workflow through which we have identified ontological information, including nutrition-related terms in PubMed abstracts (from 1991 until 2016) for two main types of Inflammatory Bowel Diseases: Crohn's Disease and Ulcerative Colitis; and two other gastrointestinal diseases, namely, Coeliac Disease and Irritable Bowel Syndrome. Our analysis reveals unique clustering patterns as well as spatial and temporal trends inherent to the considered gastrointestinal diseases in terms of literature that has been accumulated so far. Although automated interpretation cannot replace human judgement, the developed workflow shows promising results and can be a useful tool in systematic literature reviews. The workflow is available at https://github.com/KociOrges/pytag .

An automated identification and analysis of ontological terms in gastrointestinal diseases and nutrition-related literature provides useful insights

PeerJ ◽

10.7717/peerj.5047 ◽

2018 ◽

Vol 6 ◽

pp. e5047 ◽

Cited By ~ 1

Author(s):

Orges Koci ◽

Michael Logan ◽

Vaios Svolos ◽

Richard K. Russell ◽

Konstantinos Gerasimidis ◽

...

Keyword(s):

Temporal Trends ◽

Gastrointestinal Diseases ◽

Biomedical Literature ◽

Biomedical Data ◽

Automated Identification ◽

Online Databases ◽

Bowel Diseases ◽

Inflammatory Bowel ◽

Irritable Bowel ◽

Spatial And Temporal Trends

With an unprecedented growth in the biomedical literature, keeping up to date with the new developments presents an immense challenge. Publications are often studied in isolation of the established literature, with interpretation being subjective and often introducing human bias. With ontology-driven annotation of biomedical data gaining popularity in recent years and online databases offering metatags with rich textual information, it is now possible to automatically text-mine ontological terms and complement the laborious task of manual management, interpretation, and analysis of the accumulated literature with downstream statistical analysis. In this paper, we have formulated an automated workflow through which we have identified ontological information, including nutrition-related terms in PubMed abstracts (from 1991 to 2016) for two main types of Inflammatory Bowel Diseases: Crohn’s Disease and Ulcerative Colitis; and two other gastrointestinal (GI) diseases, namely, Coeliac Disease and Irritable Bowel Syndrome. Our analysis reveals unique clustering patterns as well as spatial and temporal trends inherent to the considered GI diseases in terms of literature that has been accumulated so far. Although automated interpretation cannot replace human judgement, the developed workflow shows promising results and can be a useful tool in systematic literature reviews. The workflow is available at https://github.com/KociOrges/pytag.

Why a wiki works: Leveraging technology’s sociotechnical affordances to support classroom culture and enhanced student learning

Explorations in Media Ecology ◽

10.1386/eme_00015_1 ◽

2019 ◽

Vol 18 (4) ◽

pp. 439-446 ◽

Cited By ~ 1

Author(s):

James T. Jarc

Keyword(s):

Student Learning ◽

Subject Matter ◽

Digital Literacy ◽

Classroom Culture ◽

Historical Background ◽

Technical Aspects ◽

Theoretical Rationale ◽

Practical Advice ◽

Collaborative Authorship ◽

Subject Matter Expertise

This article presents a theoretical rationale and practical advice for using wiki collaborative authorship technology in a communication or media classroom. The author’s primary thesis is that the use of a wiki in a course helps students develop digital literacy and subject matter expertise, while simultaneously participating in a specific classroom culture that is fostered in part by the use of the wiki. That ethos, the author suggests, challenges students to think critically, work effectively with others, value transparency and accountability, and practice co-created learning. Finally, the article includes some historical background for the platform, along with a cursory overview of the technical aspects of the platform.

Discovering novel protein–protein interactions by measuring the protein semantic similarity from the biomedical literature

Journal of Bioinformatics and Computational Biology ◽

10.1142/s0219720014420086 ◽

2014 ◽

Vol 12 (06) ◽

pp. 1442008 ◽

Cited By ~ 4

Author(s):

Jung-Hsien Chiang ◽

Jiun-Huang Ju

Keyword(s):

Semantic Similarity ◽

Protein Interactions ◽

Similarity Measures ◽

Biomedical Literature ◽

Biological Research ◽

Protein Protein Interactions ◽

Automated Identification ◽

Learning Classifier ◽

Novel Method ◽

Novel Protein

Protein–protein interactions (PPIs) are involved in the majority of biological processes. Identification of PPIs is therefore one of the key aims of biological research. Although there are many databases of PPIs, many other unidentified PPIs could be buried in the biomedical literature. Therefore, automated identification of PPIs from biomedical literature repositories could be used to discover otherwise hidden interactions. Search engines, such as Google, have been successfully applied to measure the relatedness among words. Inspired by such approaches, we propose a novel method to identify PPIs through semantic similarity measures among protein mentions. We define six semantic similarity measures as features based on the page counts retrieved from the MEDLINE database. A machine learning classifier, Random Forest, is trained using the above features. The proposed approach achieves an averaged micro-F of 71.28% and an averaged macro-F of 64.03% over five PPI corpora, an improvement over the results of using only the conventional co-occurrence feature (averaged micro-F of 68.79% and an averaged macro-F of 60.49%). A relation-word reinforcement further improves the averaged micro-F to 71.3% and averaged macro-F to 65.12%. Comparing the results of the current work with other studies on the AIMed corpus (ranging from 77.58% to 85.1% in micro-F, 62.18% to 76.27% in macro-F), we show that the proposed approach achieves micro-F of 81.88% and macro-F of 64.01% without the use of sophisticated feature extraction. Finally, we manually examine the newly discovered PPI pairs based on a literature review, and the results suggest that our approach could extract novel protein–protein interactions.
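The abstract does not name its six page-count measures, so the sketch below only illustrates the general idea with two widely used ones: the Jaccard coefficient and the Normalized Google Distance. Choosing these two is an assumption, as are the function and parameter names. The inputs are the number of records mentioning protein x (`fx`), protein y (`fy`), both together (`fxy`), and the size of the whole collection (`n_total`).

```python
import math

def jaccard(fx, fy, fxy):
    """Jaccard coefficient from page counts: co-mentions over the union."""
    union = fx + fy - fxy
    return fxy / union if union > 0 else 0.0

def normalized_google_distance(fx, fy, fxy, n_total):
    """Normalized Google Distance from page counts.

    Returns 0.0 when the two mentions always co-occur and math.inf
    when either mention, or their co-occurrence, has no hits.
    """
    if fx == 0 or fy == 0 or fxy == 0:
        return math.inf
    log_x, log_y, log_xy = math.log(fx), math.log(fy), math.log(fxy)
    denominator = math.log(n_total) - min(log_x, log_y)
    if denominator <= 0:
        return 0.0
    return (max(log_x, log_y) - log_xy) / denominator
```

Values such as these could then be assembled into a feature vector per protein pair and passed to a classifier; the classifier step is omitted here.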

Ontoclick: a Chrome web browser extension to facilitate biomedical knowledge curation

10.1101/2021.03.04.433993 ◽

2021 ◽

Author(s):

Anthony Xu ◽

Aravind Venkateswaran ◽

Lianguizi Zhou ◽

Andreas Zankl

Keyword(s):

Source Code ◽

Biomedical Literature ◽

Biomedical Knowledge ◽

Web Browser ◽

Ontology Term ◽

Browser Extension ◽

Widespread Adoption ◽

User Friendly

Knowledge curation from the biomedical literature is very valuable but can be a repetitive and laborious process. The paucity of user-friendly tools is one of the reasons for the lack of widespread adoption of good biomedical knowledge curation practices. Here we present Ontoclick, a Chrome web browser extension that streamlines the process of annotating a text span with a relevant ontology term. We hope this tool will make biocuration more accessible to a wider audience of biomedical researchers. Ontoclick is freely available under the GPL-3.0 license on the Chrome Web Store. Source code and documentation are available at https://github.com/azankl/Ontoclick.

Automated identification of maximal differential cell populations in flow cytometry data

10.1101/837765 ◽

2019 ◽

Author(s):

Alice Yue ◽

Cedric Chauve ◽

Maxwell Libbrecht ◽

Ryan R. Brinkman

Keyword(s):

Flow Cytometry ◽

Cell Population ◽

R Package ◽

Cell Populations ◽

Visualization Tool ◽

Automated Identification ◽

Flow Cytometry Data ◽

New Class ◽

Differential Cell ◽

Related Population

We introduce a new cell population score called SpecEnr (specific enrichment) and describe a method that discovers robust and accurate candidate biomarkers from flow cytometry data. Our approach identifies a new class of candidate biomarkers we define as driver cell populations, whose abundance is associated with a sample class (e.g. disease), but not as a result of a change in a related population. We show that the driver cell populations we find are also easily interpretable using a lattice-based visualization tool. Our method is implemented in the R package flowGraph, freely available on GitHub (github.com/aya49/flowGraph), and will be available on Bioconductor.

STonKGs: A Sophisticated Transformer Trained on Biomedical Text and Knowledge Graphs

10.1101/2021.08.17.456616 ◽

2021 ◽

Author(s):

Helena Balabin ◽

Charles Tapley Hoyt ◽

Colin Birkenbihl ◽

Benjamin M. Gyori ◽

John A. Bachman ◽

...

Keyword(s):

Language Processing ◽

Biomedical Literature ◽

Biological Knowledge ◽

Biomedical Text ◽

Biomedical Knowledge ◽

Scientific Publications ◽

Integrated Network ◽

Unstructured Text ◽

Structured Information ◽

Knowledge Graphs

The majority of biomedical knowledge is stored in structured databases or as unstructured text in scientific publications. This vast amount of information has led to numerous machine learning-based biological applications using either text through natural language processing (NLP) or structured data through knowledge graph embedding models (KGEMs). However, representations based on a single modality are inherently limited. To generate better representations of biological knowledge, we propose STonKGs, a Sophisticated Transformer trained on biomedical text and Knowledge Graphs. This multimodal Transformer uses combined input sequences of structured information from KGs and unstructured text data from biomedical literature to learn joint representations. First, we pre-trained STonKGs on a knowledge base assembled by the Integrated Network and Dynamical Reasoning Assembler (INDRA) consisting of millions of text-triple pairs extracted from biomedical literature by multiple NLP systems. Then, we benchmarked STonKGs against two baseline models trained on either one of the modalities (i.e., text or KG) across eight different classification tasks, each corresponding to a different biological application. Our results demonstrate that STonKGs outperforms both baselines, especially on the more challenging tasks with respect to the number of classes, improving upon the F1-score of the best baseline by up to 0.083. Additionally, our pre-trained model as well as the model architecture can be adapted to various other transfer learning applications. Finally, the source code and pre-trained STonKGs models are available at https://github.com/stonkgs/stonkgs and https://huggingface.co/stonkgs/stonkgs-150k.

Term identification in the biomedical literature

Journal of Biomedical Informatics ◽

10.1016/j.jbi.2004.08.004 ◽

2004 ◽

Vol 37 (6) ◽

pp. 512-526 ◽

Cited By ~ 137

Author(s):

Michael Krauthammer ◽

Goran Nenadic

Keyword(s):

Biomedical Literature ◽

Term Identification

Semi-automatic Extraction of Plants Morphological Characters from Taxonomic Descriptions Written in Spanish

Biodiversity Data Journal ◽

10.3897/bdj.6.e21282 ◽

2018 ◽

Vol 6 ◽

pp. e21282 ◽

Cited By ~ 1

Author(s):

Maria Mora ◽

José Araya

Keyword(s):

Costa Rica ◽

Computational Linguistics ◽

Semantic Analysis ◽

Morphological Characters ◽

Scientific Publications ◽

Ontology Term ◽

Biodiversity Knowledge ◽

Taxonomic Descriptions ◽

Plant Ontology ◽

Structured Information

Taxonomic literature keeps records of the planet's biodiversity and gives access to the knowledge needed for its sustainable management. Unfortunately, most of the taxonomic information is available in scientific publications in text format. The number of publications generated is very large; therefore, processing them to obtain highly structured texts would be complex and very expensive. Approaches like citizen science may help the process by selecting whole fragments of texts dealing with morphological descriptions; but a deeper analysis, compatible with accepted ontologies, will require specialised tools. The Biodiversity Heritage Library (BHL) estimates that there are more than 120 million pages published in over 5.4 million books since 1469, plus about 800,000 monographs and 40,000 journal titles (12,500 of these are current titles). It is necessary to develop standards and software tools to extract, integrate and publish this information into existing free and open access repositories of biodiversity knowledge to support science, education and biodiversity conservation. This document presents an algorithm based on computational linguistics techniques to extract structured information from morphological descriptions of plants written in Spanish. The developed algorithm is based on the work of Dr. Hong Cui from the University of Arizona; it uses semantic analysis, ontologies and a repository of knowledge acquired from the same descriptions. The algorithm was applied to the books Trees of Costa Rica Volume III (TCRv3), Trees of Costa Rica Volume IV (TCRv4) and to a subset of descriptions of the Manual of Plants of Costa Rica (MPCR) with very competitive results (more than 92.5% of average performance). The system receives the morphological descriptions in tabular format and generates XML documents. The XML schema allows documenting structures, characters and relations between characters and structures. Each extracted object is associated with attributes like name, value, modifiers, restrictions and ontology term id, amongst others. The implemented tool is free software. It was developed using Java and integrates existing technology such as FreeLing, the Plant Ontology (PO), the Plant Glossary, the Ontology Term Organizer (OTO) and the Flora Mesoamericana English-Spanish Glossary.

pmparser and PMDB: resources for large-scale, open studies of the biomedical literature

PeerJ ◽

10.7717/peerj.11071 ◽

2021 ◽

Vol 9 ◽

pp. e11071

Author(s):

Joshua L. Schoenbachler ◽

Jacob J. Hughey

Keyword(s):

Relational Database ◽

Large Scale ◽

R Package ◽

Biomedical Literature ◽

Complex Queries ◽

Biomedical Community

PubMed is an invaluable resource for the biomedical community. Although PubMed is freely available, the existing API is not designed for large-scale analyses and the XML structure of the underlying data is inconvenient for complex queries. We developed an R package called pmparser to convert the data in PubMed to a relational database. Our implementation of the database, called PMDB, currently contains data on over 31 million PubMed Identifiers (PMIDs) and is updated regularly. Together, pmparser and PMDB can enable large-scale, reproducible, and transparent analyses of the biomedical literature. pmparser is licensed under GPL-2 and available at https://pmparser.hugheylab.org. PMDB is available in both PostgreSQL (DOI 10.5281/zenodo.4008109) and Google BigQuery (https://console.cloud.google.com/bigquery?project=pmdb-bq&d=pmdb).
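The general idea of turning PubMed XML into relational tables can be illustrated with a small Python sketch. It is not pmparser, which is an R package, and it does not reproduce the PMDB schema: the single `article` table, its columns, the `load_articles` function and the sample records are all assumptions made for the example. Only the element names (`PubmedArticle`, `MedlineCitation`, `PMID`, `Article`, `ArticleTitle`, `Journal`, `Title`) follow the PubMed XML format.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Two made-up records in the shape of PubMed XML (illustrative data only).
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>1</PMID>
      <Article>
        <Journal><Title>Example Journal A</Title></Journal>
        <ArticleTitle>First example title</ArticleTitle>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>2</PMID>
      <Article>
        <Journal><Title>Example Journal B</Title></Journal>
        <ArticleTitle>Second example title</ArticleTitle>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""

def load_articles(xml_text, conn):
    """Parse PubMed-style XML and store one row per article.

    Returns the number of rows written. The table layout is hypothetical.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS article "
        "(pmid INTEGER PRIMARY KEY, title TEXT, journal TEXT)"
    )
    root = ET.fromstring(xml_text)
    rows = []
    for article in root.iter("PubmedArticle"):
        pmid = article.findtext("MedlineCitation/PMID")
        title = article.findtext("MedlineCitation/Article/ArticleTitle")
        journal = article.findtext("MedlineCitation/Article/Journal/Title")
        rows.append((int(pmid), title, journal))
    conn.executemany("INSERT OR REPLACE INTO article VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)
```

Once loaded, queries that are awkward against nested XML become ordinary SQL, for example `SELECT journal, COUNT(*) FROM article GROUP BY journal`.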
