The Pensoft Annotator: A new tool for text annotation with ontology terms

Author(s):  
Mariya Dimitrova ◽  
Georgi Zhelezov ◽  
Teodor Georgiev ◽  
Lyubomir Penev

Introduction: Digitisation of biodiversity knowledge from collections, scholarly literature and various research documents is an ongoing mission of the Biodiversity Information Standards (TDWG) community. Organisations such as the Biodiversity Heritage Library make historical biodiversity literature openly available and develop tools that allow biodiversity data reuse and interoperability. For instance, Plazi transforms free text into machine-readable formats, extracts collection data and feeds it into the Global Biodiversity Information Facility (GBIF) and other aggregators. All of these digitisation workflows require considerable effort to develop and implement in practice. In essence, these digitisation activities map free text to concepts from recognised vocabularies or ontologies in order to make the content understandable to computers.

Aim: We aim to address the problem of mapping free text to ontological terms ("strings to things") with our tool for text-to-ontology mapping: the Pensoft Annotator.

Methods & Implementation: The Annotator is a web application that performs direct text matching against terms from any ontology or vocabulary list given as input. The term 'ontology' is used loosely here and means a collection of terms and their synonyms, where each term is uniquely identified via a Uniform Resource Identifier (URI). The Annotator accepts several ontology formats (OBO, OWL, RDF/XML, etc.) but does not require a proper ontology structure (logical statements). We use the ROBOT command line tool to convert any of these formats to JSON. After the upload of a new ontology, the Annotator processes the ontology terms by normalising all exact synonyms and removing all other synonyms (related, narrow and broad synonyms). This is done to limit the number of false positive matches and to preserve the semantic similarity between the matched ontology term and the text.
After matching the words in the input text against the ontology term labels, the Pensoft Annotator returns a table of matched ontology terms with the following fields: the identifier of the ontology term; the ontology term label or the label of the synonym; the starting position of the matched term in the text; the term context (words surrounding the matched term in the text); the type of ontology term (class or property); the ontology from which the matched term originates; and the number of times a given term is mentioned in the text. The Pensoft Annotator allows simultaneous annotation with multiple ontologies. To show which ontology a matching term comes from, terms are highlighted in different colours depending on the ontology. The Pensoft Annotator is also accessible programmatically via an Application Programming Interface (API), documented at https://annotator.pensoft.net/api.

Discussion & Use Cases: The Pensoft Annotator provides functionality that aids the transformation of free text into collections of semantic resources. However, it still requires expert knowledge to use, as the ontologies need to be selected carefully. Some false positive matches are possible because we do not perform semantic analysis of the texts. False negatives are also possible, since word forms that are not direct matches to ontology terms (e.g. 'wolf' and 'wolves') will be missed. For this reason, matched terms can be reviewed and removed from the results within the web interface of the Pensoft Annotator; removed terms will not be present in the downloaded results. The Pensoft Annotator can be used to annotate biodiversity and taxonomic literature to help with the extraction of biodiversity knowledge (e.g. species habitat preferences, species interaction data, localities, biogeographic data).
The existence of domain- and taxon-specific ontologies, such as the Hymenoptera Anatomy Ontology, provides further opportunities for context-specific annotation. Semantic analysis of unstructured texts could be applied in addition to ontology annotation to improve the accuracy of ontology term matching and to filter out mismatched terms. Annotation of structured or semi-structured text (e.g. tables) achieves better results. A recent example demonstrates the use of the Annotator to extract biotic interactions from tables (Dimitrova et al. 2020). The Annotator could also be used for ontology analysis and comparison. Annotation of text can help to discover gaps in ontologies as well as inaccurate synonyms. For instance, a certain word could be recognised as an ontology term match because it is listed as an exact synonym in the ontology, when in reality it would be more accurate to mark it as a related synonym. In addition, annotation with multiple ontologies can help to elucidate links between ontologies.
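The direct text matching described above can be sketched as a minimal exact-match annotator. This is a simplified illustration only, not the Pensoft Annotator's actual implementation; the one-term ontology, its URI, and the crude lowercasing stand-in for normalisation are all assumptions:

```python
import re

def annotate(text, ontology):
    """Return exact matches of ontology term labels and exact synonyms.

    `ontology` maps a term URI to {'label': ..., 'exact_synonyms': [...]}.
    Matching is case-insensitive and whole-word, a crude stand-in for the
    synonym normalisation the Annotator performs on upload.
    """
    matches = []
    lowered = text.lower()
    for uri, term in ontology.items():
        for label in [term["label"], *term.get("exact_synonyms", [])]:
            pattern = r"\b" + re.escape(label.lower()) + r"\b"
            for m in re.finditer(pattern, lowered):
                start = m.start()
                # Words surrounding the match, like the Annotator's "context" field
                context = text[max(0, start - 20):start + len(label) + 20]
                matches.append({"uri": uri, "label": label,
                                "start": start, "context": context})
    return matches

# Hypothetical one-term ontology (URI and labels are invented)
ontology = {
    "http://example.org/ENVO_0000001": {
        "label": "forest",
        "exact_synonyms": ["woodland"],
    }
}
result = annotate("The species inhabits dense forest and open woodland.", ontology)
```

A real annotator would additionally report the term type (class or property), the source ontology, and per-term mention counts, as listed in the output table above.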

2018 ◽  
Vol 6 ◽  
pp. e21282 ◽  
Author(s):  
Maria Mora ◽  
José Araya

Taxonomic literature keeps records of the planet's biodiversity and gives access to the knowledge needed for its sustainable management. Unfortunately, most taxonomic information is available in scientific publications in text format. The number of publications generated is very large; therefore, processing them to obtain highly structured texts would be complex and very expensive. Approaches like citizen science may help the process by selecting whole fragments of text dealing with morphological descriptions, but a deeper analysis, compatible with accepted ontologies, requires specialised tools. The Biodiversity Heritage Library (BHL) estimates that there are more than 120 million pages published in over 5.4 million books since 1469, plus about 800,000 monographs and 40,000 journal titles (12,500 of these are current titles). It is necessary to develop standards and software tools to extract, integrate and publish this information into existing free and open access repositories of biodiversity knowledge to support science, education and biodiversity conservation.

This document presents an algorithm based on computational linguistics techniques to extract structured information from morphological descriptions of plants written in Spanish. The algorithm is based on the work of Dr. Hong Cui from the University of Arizona; it uses semantic analysis, ontologies and a repository of knowledge acquired from the same descriptions. The algorithm was applied to the books Trees of Costa Rica Volume III (TCRv3) and Trees of Costa Rica Volume IV (TCRv4), and to a subset of descriptions from the Manual of Plants of Costa Rica (MPCR), with very competitive results (more than 92.5% average performance). The system receives the morphological descriptions in tabular format and generates XML documents. The XML schema allows documenting structures, characters and relations between characters and structures. Each extracted object is associated with attributes such as name, value, modifiers, restrictions and ontology term id. The implemented tool is free software. It was developed in Java and integrates existing technologies such as FreeLing, the Plant Ontology (PO), the Plant Glossary, the Ontology Term Organizer (OTO) and the Flora Mesoamericana English-Spanish Glossary.
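The kind of XML output described can be sketched as follows. This is a hypothetical fragment: the element and attribute names, and the ontology term id, are assumptions based on the description above, not the actual schema:

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment of the kind of XML the system could emit for one
# structure ("leaf") with one character ("shape") and a relation back to it.
description = ET.Element("description")
structure = ET.SubElement(description, "structure",
                          {"id": "s1", "name": "leaf",
                           "ontology_term_id": "PO:0025034"})  # hypothetical PO id
ET.SubElement(structure, "character",
              {"id": "c1", "name": "shape", "value": "ovate",
               "modifier": "usually"})
# A relation linking the character to the structure it describes
ET.SubElement(description, "relation",
              {"name": "describes", "from": "c1", "to": "s1"})

xml_text = ET.tostring(description, encoding="unicode")
```

The point of the nesting is that characters hang off the structures they qualify, while relations reference both ends by id, which matches the schema's stated goal of documenting structures, characters and their relations.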


Author(s):  
Lidia Ogiela ◽  
Ryszard Tadeusiewicz ◽  
Marek R. Ogiela

This publication presents cognitive systems designed for analysing economic data. Such systems have been created as the next step in the development of classical DSS (Decision Support Systems), which are currently the most widespread tools providing computer support for economic decision-making. The increasing complexity of decision-making processes in business, combined with the increasing demands that managers place on IT tools supporting management, is causing DSS systems to evolve into intelligent information systems. This publication defines a new category of systems - UBMSS (Understanding Based Management Support Systems) - which conduct in-depth analyses of data using an apparatus for linguistic and meaning-based interpretation and reasoning. This type of interpretation and reasoning is inherent in the human way of perceiving the world. This is why the authors of this publication have striven to perfect the scope and depth of computer interpretation of economic information based on human processes of cognitive data analysis. As a result, they have created UBMSS systems for the automatic analysis and interpretation of economic data. The essence of the proposed approach to the cognitive analysis of economic data is the use of an apparatus for the linguistic description of data and for semantic analysis. This type of analysis is based on expectations generated automatically by a system which collects resources of expert knowledge, taking into account the information which can significantly characterise the analysed data. In this publication, the processes of classical data description and analysis are extended to include cognitive processes as well as reasoning and forecasting mechanisms. As a result of the analyses shown, we present a new class of UBMSS cognitive economic information systems which automatically perform a semantic analysis of business data.


2020 ◽  
Vol 11 (03) ◽  
pp. 415-426
Author(s):  
Eva S. Klappe ◽  
Nicolette F. de Keizer ◽  
Ronald Cornet

Abstract Background: Problem-oriented electronic health record (EHR) systems can help physicians to track a patient's status and progress and to organize clinical documentation, which could help improve the quality of clinical data and enable data reuse. The problem list is central in a problem-oriented medical record. However, current problem lists remain incomplete because of a lack of end-user training and inaccurate content of underlying terminologies. This leads to modifications of diagnosis code descriptions and use of free-text notes, limiting reuse of data. Objectives: We aimed to investigate factors that influence acceptance and actual use of the problem list, and used these to propose recommendations to increase the value of problem lists for (re)use. Methods: Semi-structured interviews were conducted with physicians, heads of medical departments, and data quality experts, who were invited through snowball sampling. The interviews were transcribed and coded. Comments were fitted into constructs of the validated unified theory of acceptance and use of technology (UTAUT) framework, and were discussed in terms of facilitators and barriers. Results: In total, 24 interviews were conducted. We found large variability in attitudes toward problem list use. Barriers included uncertainty about the responsibility for maintaining the problem list and few perceived benefits. Facilitators included the (re)design of policies, improved (peer-to-peer) training to increase motivation, and positive peer feedback and monitoring. Motivation is best increased through sharing benefits relevant to the care process, such as providing overview, timely generation of discharge or referral letters, and reuse of data. Furthermore, the content of the underlying terminology should be improved and the problem list should be better presented in the EHR system. Conclusion: For physicians to accept and use the problem list, policies and guidelines should be redesigned, and prioritized by supervising staff.
Additionally, peer-to-peer training on the benefits of using the problem list is needed.


1997 ◽  
Vol 4 (3) ◽  
pp. 46-68
Author(s):  
LG Moseley ◽  
FA Murphy

2018 ◽  
Vol 27 (01) ◽  
pp. 091-097 ◽  
Author(s):  
Werner Hackl ◽  
Alexander Hoerbst ◽  

Objective: To summarize recent research and to propose a selection of best papers published in 2017 in the field of Clinical Information Systems (CIS). Method: Each year, a systematic process is carried out to retrieve articles and to select a set of best papers for the CIS section of the International Medical Informatics Association (IMIA) Yearbook of Medical Informatics. The query aiming at identifying relevant publications in the field of CIS was refined by the section editors over the past years and has now been stable for three years. It comprises search terms from the Medical Subject Headings (MeSH) thesaurus as well as additional free-text search terms from PubMed and Web of Science®. The retrieved articles were categorized in a multi-pass review carried out by the two section editors. The final selection of candidate papers was then peer-reviewed by Yearbook editors and external reviewers. Based on the review results, the best papers were selected by the IMIA Yearbook editorial board. Text mining and term co-occurrence mapping techniques were used to get an overview of the content of the retrieved articles. Results: The query was carried out in mid-January 2018, yielding a consolidated result set of 2,255 articles published in 939 different journals. Of these, 15 papers were nominated as candidate best papers and four were finally selected as best papers in the CIS section. Again, the content analysis of the articles revealed the broad spectrum of topics covered by CIS research. Conclusions: Modern clinical information systems serve as the backbone of a very complex, trans-institutional information logistics process. Data that is produced by, documented in, shared via, organized in, presented by, and stored within clinical information systems is increasingly reused for multiple purposes.
We found many examples showing the benefits of such data reuse, with various novel approaches implemented to tackle the challenges of this process. We also found that the patient is moving into the focus of CIS research. So the loop of information logistics begins to close: data from patients is used to produce value for patients.
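Term co-occurrence mapping of the kind used in this review can be sketched minimally as counting how often pairs of terms appear together in the same article. The keyword sets below are invented for illustration; the Yearbook's actual tooling is not specified here:

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(articles):
    """Count how often each unordered pair of terms co-occurs in one article.

    `articles` is a list of term sets, e.g. extracted keywords per paper.
    Pairs are stored in sorted order so (a, b) and (b, a) are counted once.
    """
    counts = Counter()
    for terms in articles:
        for pair in combinations(sorted(set(terms)), 2):
            counts[pair] += 1
    return counts

# Hypothetical keyword sets for three retrieved articles
articles = [
    {"EHR", "data reuse", "interoperability"},
    {"EHR", "data reuse"},
    {"clinical decision support", "EHR"},
]
counts = cooccurrence_counts(articles)
```

Mapping tools then typically render these counts as a network in which frequently co-occurring terms cluster together, giving the topical overview mentioned above.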


Author(s):  
Jeremy Miller ◽  
Yanell Braumuller ◽  
Puneet Kishor ◽  
David Shorthouse ◽  
Mariya Dimitrova ◽  
...  

A vast amount of biodiversity data is reported in the primary taxonomic literature. In the past, we have demonstrated the use of semantic enhancement to extract data from taxonomic literature and make it available to a network of databases (Miller et al. 2015). For technical reasons, semantic enhancement of taxonomic literature is most efficient when customized to the format of a particular journal. This journal-based approach captures and disseminates data on whatever taxa happen to be published therein. But if we want to extract all treatments of a particular taxon of interest, these are likely to be spread across multiple journals. Fortunately, the GoldenGATE Imagine document editor (Sautter 2019) is flexible enough to parse most taxonomic literature. Tyrannosaurus rex is an iconic dinosaur with broad public appeal, as well as the subject of more than a century of scholarship. The Naturalis Biodiversity Center recently acquired a specimen that has become a major attraction in the public exhibit space. For most species on earth, the primary taxonomic literature contains nearly everything that is known about them, and every described species is the subject of one or more taxonomic treatments. A taxon-based approach to semantic enhancement can mobilize all this knowledge using the network of databases and resources that comprise the modern biodiversity informatics infrastructure, and can be a powerful tool for scholarship and communication when a particular species is of special interest. In light of this, we resolved to semantically enhance all taxonomic treatments on T. rex. Our objective was to make these treatments and associated data available to the broad range of stakeholders who might have an interest in this animal, including professional paleontologists, the curious public, and museum exhibit and public communications personnel.
As part of the routine parsing and data sharing activities in the Plazi workflow (Agosti and Egloff 2009), taxonomic treatments, as well as cited figures, were deposited in the Biodiversity Literature Repository (BLR), and occurrence records were shared with the Global Biodiversity Information Facility (GBIF). Treatment citations were enhanced with hyperlinks to the cited treatment on TreatmentBank, and specimen citations were linked to their entries in public-facing collections databases. We used the OpenBiodiv biodiversity knowledge graph (Senderov et al. 2017) to discover other taxa mentioned together with T. rex, and created a timeline of T. rex research to evaluate the impact of individual researchers and specimen repositories on T. rex research. We contributed treatment links to WikiData, and queried WikiData to discover identifiers to different platforms holding data about T. rex. We used bloodhound-tracker.net to disambiguate human agents, such as collectors, identifiers, and authors. We evaluate the adequacy of the fields currently available for extracting data from taxonomic treatments, and make recommendations for future standards.
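A WikiData query of the kind described, discovering which external identifiers WikiData holds for a taxon, could be sketched as below. This is a hedged illustration, not the query used in this work: the reliance on the taxon-name property (P225) and the external-id filter are assumptions about how one might formulate it:

```python
def build_identifier_query(taxon_name):
    """Build a SPARQL query listing external-identifier statements for items
    whose taxon name (P225) matches `taxon_name`. Property choices are
    illustrative; the query could be POSTed to the public endpoint at
    https://query.wikidata.org/sparql."""
    return f"""
    SELECT ?item ?propertyLabel ?value WHERE {{
      ?item wdt:P225 "{taxon_name}" .
      ?item ?p ?value .
      ?property wikibase:directClaim ?p ;
                wikibase:propertyType wikibase:ExternalId .
      SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
    }}
    """

query = build_identifier_query("Tyrannosaurus rex")
```

Each result row would pair a human-readable property label (e.g. a database name) with the identifier value held for T. rex on that platform.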


2019 ◽  
Author(s):  
Ravi Jandhyala

Abstract Background: Methodologies used to gain consensus among healthcare professionals, including variations of the Delphi technique and the RAND/UCLA appropriateness method, all force a consensus, corrupting the original opinion through the consensus-generating process. Furthermore, none assess the knowledge awareness of the experts prior to the consensus process. Methods: Four case studies about X-linked hypophosphatemia (XLH) are reported to demonstrate the principle of group ‘awareness’ of items, consensus and the concept of prompted agreement. The novel methodology consisted of two surveys: Round 1 was an item-generation round in which participants were asked an open-ended question. Responses to Round 1 were collated into themes and developed into mutually exclusive items. Item generation was also performed using systematic literature reviews when appropriate. The items generated were used to develop a structured questionnaire (Round 2) comprising statements for which each participant indicated their level of agreement on a five-point Likert scale. All responses were analysed anonymously. Item awareness, observed agreement consensus and prompted agreement were objectively measured. Results: The free-text responses to the item-generation round tested the awareness of specific concepts or items regarding setting up a European registry for XLH, the limitations of empirical treatment for XLH in children and adults, and triggers for treatment of XLH in adults. The four case studies showed different levels of item awareness, observed consensus and various degrees of prompted agreement. All participants agreed or strongly agreed with statements based on the most frequent items listed in Round 1. Less frequent items generated during Round 1 had various degrees of prompted agreement consensus, and some did not reach the consensus threshold of >50% agreement by the participants.
Conclusions: Observed proportional group awareness and consensus is a relatively quick process compared with the Delphi technique and its variants, providing objective assessment of expert knowledge and standardized categorization of items with regard to awareness, consensus and prompting. It offers the opportunity for tailored management of each item or concept in terms of educational need and further investigation.
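The consensus measure described, i.e. the proportion of participants agreeing or strongly agreeing on a five-point Likert scale, checked against a >50% threshold, can be sketched as follows. The scoring convention (4 and 5 count as agreement) and the example responses are assumptions for illustration:

```python
def observed_agreement(responses, threshold=0.5):
    """Return (proportion agreeing, whether consensus is reached).

    `responses` are five-point Likert scores (1 = strongly disagree ...
    5 = strongly agree); scores of 4 or 5 count as agreement, and
    consensus requires strictly more than `threshold` agreement.
    """
    agreeing = sum(1 for r in responses if r >= 4)
    proportion = agreeing / len(responses)
    return proportion, proportion > threshold

# Hypothetical responses from ten participants to one Round 2 statement
prop, consensus = observed_agreement([5, 4, 4, 5, 3, 2, 4, 5, 4, 1])
```

Because the threshold is strict, a statement on which exactly half the participants agree would not reach consensus, matching the ">50% agreement" criterion in the abstract.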


2018 ◽  
Vol 2 ◽  
pp. e26328
Author(s):  
Boikhutso Lerato Rapalai

The Botswana National Museum is mandated to protect, preserve and promote Botswana’s cultural and natural heritage for its sustainable utilization by collecting, researching, conserving and exhibiting for public education and appreciation. The Entomology Section of the museum aims to become the national center for entomology collections, as well as to contribute to the monitoring and enhancement of natural heritage sites in Botswana. The Botswana National Museum entomology collection was assembled over more than three decades by a succession of collectors, curators and technical officers. Specimens are carefully prepared and preserved, labelled with field data, sorted and safely stored. The collection is preserved wet (in ethanol) or as dry pinned specimens in drawers. This collection is invaluable for reference, research, baseline data and educational purposes. To mobilize insect biodiversity data and make it available online for conservation efforts and decision-making processes, in 2016 the Botswana National Museum collaborated with five other African states to implement the African Insect Atlas project (https://www.gbif.org/project/82632/african-insect-atlas), funded by Biodiversity Information for Development (BID) and the Global Biodiversity Information Facility (GBIF). This collaborative project was initiated to move biodiversity knowledge out of select insect collections into the hands of a new generation of global biodiversity researchers interested in direct outcomes. To date, through this project, the Botswana National Museum has been instrumental in storing, maintaining and mobilizing digital insect collections and making the data available online through the GBIF platform.


2015 ◽  
Author(s):  
Carsten Meyer ◽  
Holger Kreft ◽  
Robert P Guralnick ◽  
Walter Jetz

Severe gaps and biases in digital accessible information (DAI) of species distributions hamper prospects of safeguarding biodiversity and ecosystem services and reliably addressing central questions in ecology and evolution. Accordingly, governments have agreed on improving and sharing biodiversity knowledge by 2020 (United Nations Convention on Biological Diversity’s Aichi target 19). To achieve this target, gaps in DAI must be identified, and actions prioritized to address their root causes. We take terrestrial vertebrates, an iconic and comparatively well-studied group, as a model and present the first globally comprehensive assessment of patterns and drivers of gaps in DAI, based on an integration of 157 million validated point records with 21,170 expert-based distribution maps. We demonstrate that outside a few well-sampled regions, DAI provides a very limited and spatially highly biased inventory of actual biodiversity. Coarser spatial grains result in more complete inventories, but provide insufficient detail for conservation and resource management. Surprisingly, large emerging economies are particularly under-represented in global DAI, even more so than species-rich, developing countries in the tropics. Multi-model inference reveals that completeness is mainly limited by distance to researchers, locally available research funding, and political participation in data-sharing networks, rather than transportation infrastructure, or size and funding of Western data contributors as often assumed. Our study provides an empirical baseline to advance strategies of enhancing the global information basis of biodiversity. In particular, our results highlight the need for targeted data integration from non-Western data holders and intensified cooperation to more effectively address societal biodiversity information needs.
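The grain-dependent completeness pattern reported above, whereby coarser spatial grains yield more complete inventories, can be illustrated with a toy calculation that aggregates point records into grid cells and compares recorded species against expert-expected species lists. All cell layouts and species numbers below are invented for illustration:

```python
def completeness(records, expected):
    """Inventory completeness per cell: recorded species / expected species.

    `records` maps cell -> set of species with point records in that cell;
    `expected` maps cell -> set of species whose expert range maps cover it.
    """
    return {cell: len(records.get(cell, set()) & species) / len(species)
            for cell, species in expected.items()}

def coarsen(mapping, merge):
    """Merge fine cells into coarse cells (merge: fine cell -> coarse cell)."""
    out = {}
    for cell, species in mapping.items():
        out.setdefault(merge[cell], set()).update(species)
    return out

# Two fine cells, each with records for only one of its two expected species
records = {"A": {"sp1"}, "B": {"sp2"}}
expected = {"A": {"sp1", "sp2"}, "B": {"sp1", "sp2"}}
fine = completeness(records, expected)
# Merging the cells pools the records: completeness rises with grain size
merge = {"A": "AB", "B": "AB"}
coarse = completeness(coarsen(records, merge), coarsen(expected, merge))
```

The trade-off the abstract points to falls out directly: the coarse cell looks fully inventoried, yet the fine-grained detail needed for conservation and resource management is lost.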

