Current progress in the development of taxonomic and anatomical ontologies within the scope of BIOfid

2018 ◽  
Vol 2 ◽  
pp. e25585
Author(s):  
Markus Koch ◽  
Christine Driller ◽  
Marco Schmidt ◽  
Thomas Hörnschemeyer ◽  
Claus Weiland ◽  
...  

The Specialized Information Service Biodiversity Research (BIOfid; http://biofid.de/) has recently been launched to mobilize valuable biodiversity data hidden in German print sources of the past 250 years. The partners involved in this project have started digitising the literature corpus envisaged for the pilot stage and have provided novel applications for natural language processing and visualization. To foster the development of new text mining tools, the Senckenberg Biodiversity Informatics team focuses on the design of ontologies for taxa and their anatomy. We present our progress for the taxa prioritized by the target group for the pilot stage, i.e. vascular plants, moths and butterflies, and birds. With regard to our text corpus, a key aspect of our taxonomic ontologies is the inclusion of German vernacular names. For this purpose we assembled a taxonomy ontology for vascular plants by synchronizing taxon lists from the Global Biodiversity Information Facility (GBIF) and the Integrated Taxonomic Information System (ITIS) with K.P. Buttler’s Florenliste von Deutschland (http://www.kp-buttler.de/florenliste/). The hierarchical classification of taxonomic names and the class relationships focus on rank and status (validity vs. synonymy). All classes are additionally annotated with details on scientific name, taxonomic authorship, and source. Taxonomic names for birds are compiled mainly from ITIS and the International Ornithological Congress (IOC) World Bird List, and for moths and butterflies mainly from GBIF; both lists are classified and annotated accordingly. We intend to cross-link our taxonomy ontologies with the Environment Ontology (ENVO) and anatomy ontologies such as the Flora Phenotype Ontology (FLOPO). For moths and butterflies we have started to design the Lepidoptera Anatomy Ontology (LepAO) on the basis of the already available Hymenoptera Anatomy Ontology (HAO). 
LepAO is planned to be interoperable with other ontologies in the framework of the OBO Foundry. A main modification of HAO is the inclusion of German anatomical terms from published glossaries, which we add as scientific and vernacular synonyms in order to reuse the identifiers (URIs) already available for the corresponding English terms. International collaboration with the founders of HAO and with teams focusing on other insect orders, such as beetles (ColAO), aims at the development of a unified Insect Anatomy Ontology. Restricted to terms applicable to all insects, the unified Insect Anatomy Ontology is intended to establish a basis for accelerating the design of more specific anatomy ontologies for any particular insect order. The advancement of such ontologies aligns with current needs to make the knowledge accumulated in descriptive studies on the systematics of organisms accessible to other domains. In the context of BIOfid, our ontologies exemplify how semantic queries of as yet untapped data relevant to biodiversity studies can be achieved for literature in non-English languages. Furthermore, BIOfid will serve as an open access platform for professional international journals, facilitating non-commercial publishing of biodiversity and biodiversity-related data.
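The synchronization of taxon lists described above can be illustrated with a small sketch. This is not the actual BIOfid data model; the merge function, field names and example records are hypothetical, showing only the general idea of reconciling two checklists and attaching German vernacular names:

```python
# Illustrative sketch: merging taxon records from two checklists and
# attaching German vernacular names. All field names and records are
# invented examples, not the actual BIOfid ontology structure.

def merge_taxa(primary, secondary, vernaculars):
    """Merge two {scientific_name: record} checklists; the primary list
    wins on conflicting fields. Vernacular names are attached last."""
    merged = {}
    for source in (secondary, primary):   # primary applied last, so it wins
        for name, record in source.items():
            merged.setdefault(name, {}).update(record)
    for name, record in merged.items():
        record["vernacular"] = vernaculars.get(name, [])
    return merged

gbif_like = {"Quercus robur": {"rank": "species", "status": "accepted",
                               "authorship": "L.", "source": "GBIF"}}
itis_like = {"Quercus robur": {"status": "accepted", "source": "ITIS"},
             "Quercus pedunculata": {"rank": "species", "status": "synonym",
                                     "accepted_name": "Quercus robur",
                                     "source": "ITIS"}}
german_names = {"Quercus robur": ["Stieleiche"]}

ontology = merge_taxa(gbif_like, itis_like, german_names)
```

Each merged record carries rank, status (accepted vs. synonym), authorship and source, mirroring the annotations the abstract describes for the ontology classes.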

Author(s):  
Christine Driller ◽  
Markus Koch ◽  
Giuseppe Abrami ◽  
Wahed Hemati ◽  
Andy Lücking ◽  
...  

The storage of data in public repositories such as the Global Biodiversity Information Facility (GBIF) or the National Center for Biotechnology Information (NCBI) is nowadays stipulated in the policies of many publishers in order to facilitate data replication or proliferation. Species occurrence records contained in legacy printed literature are no exception. The extent of their digital, machine-readable availability, however, is still far from matching the existing data volume (Thessen and Parr 2014). Yet precisely these data are becoming more and more relevant to the investigation of the ongoing loss of biodiversity. To extract species occurrence records at a larger scale from available publications, one has to apply specialised text mining tools. Such tools, however, are in short supply, especially for scientific literature in the German language. The Specialised Information Service Biodiversity Research BIOfid (Koch et al. 2017) aims at reducing this deficit, inter alia, by preparing a searchable text corpus semantically enriched by a new kind of multi-label annotation. For this purpose, we feed manual annotations into automatic, machine-learning annotators. This mixture of automatic and manual methods is needed because BIOfid approaches a new application area with respect to language (mainly 19th-century German), text type (biological reports), and linguistic focus (technical and everyday language). We will present current results on the performance of BIOfid’s semantic search engine and the application of independent natural language processing (NLP) tools. Most of these are freely available online, such as TextImager (Hemati et al. 2016). We will show how TextImager is tied into the BIOfid pipeline and how it is made scalable (e.g. extendible by further modules) and usable on different systems (Docker containers). 
Further, we will provide a short introduction to generating machine-learning training data with TextAnnotator (Abrami et al. 2019) for multi-label annotation. Annotation reproducibility can be assessed through the implementation of inter-annotator agreement methods (Abrami et al. 2020). Beyond taxon recognition and entity linking, we place particular emphasis on location and time information. For this purpose, our annotation tag-set combines general and biology-specific categories (including taxonomic names) with location and time ontologies. The application of the annotation categories is governed by annotation guidelines (Lücking et al. 2020). Within the next years, our deliverable will be a semantically accessible, data-extractable text corpus of around two million pages. In this way, BIOfid is creating a valuable new resource that expands our knowledge of biodiversity and its determinants.
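Inter-annotator agreement of the kind mentioned above is commonly quantified with Cohen's kappa, which corrects raw agreement for chance. A minimal sketch, with invented label sequences for two annotators over the same six tokens:

```python
# Sketch of an inter-annotator agreement check (Cohen's kappa), as used
# to assess annotation reproducibility. The label sequences are invented.
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators labelling the same items."""
    assert len(a) == len(b)
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n          # raw agreement
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[l] * cb[l] for l in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)             # chance-corrected

ann1 = ["TAXON", "LOC", "O", "TAXON", "TIME", "O"]
ann2 = ["TAXON", "LOC", "O", "O",     "TIME", "O"]
kappa = cohens_kappa(ann1, ann2)   # ~0.77: substantial agreement
```

Here the annotators disagree on one token (TAXON vs. O), giving a raw agreement of 5/6 but a lower kappa once chance agreement is subtracted.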


Author(s):  
Adrian Pachzelt ◽  
Gerwin Kasperek ◽  
Andy Lücking ◽  
Giuseppe Abrami ◽  
Christine Driller

Nowadays, obtaining information by entering queries into a web search engine is routine behaviour. With its search portal, the Specialised Information Service Biodiversity Research (BIOfid) brings the exploration of legacy biodiversity literature and data extraction up to current standards (Driller et al. 2020). In this presentation, we introduce the BIOfid search portal and its functionalities in a short how-to guide. To this end, we adopted a knowledge graph representation of our thematic focus: Central European, primarily German-language, biodiversity literature of the 19th and 20th centuries. Users can now search our text-mined corpus, which to date contains more than 8,700 full-text articles from 68 journals, focussing particularly on birds, lepidopterans and vascular plants. The texts are automatically preprocessed by the natural language processing provider TextImager (Hemati et al. 2016) and will be linked to various databases such as Wikidata, Wikipedia, the Global Biodiversity Information Facility (GBIF), the Encyclopedia of Life (EoL), GeoNames, the Integrated Authority File (GND) and WordNet. For data retrieval, users can filter search results and download the article metadata as well as text annotations and database links in JavaScript Object Notation (JSON) format. For example, literature that mentions taxa from certain decades, or co-occurrences of species, can be searched. Our search engine recognises scientific and vernacular taxon names based on the GBIF Backbone Taxonomy and offers search suggestions to support the user. The semantic network of the BIOfid search portal is also enriched with data from the EoL trait bank, so that trait data can be included in search queries. Thus, scientists can enhance their own data sets with the search results and feed them into the relevant biodiversity data repositories, sustainably expanding the corresponding knowledge graphs with reliable data. 
Since BIOfid applies standard ontology terms, all data mobilized from the literature can be combined with data on natural history collection objects or data from current research projects in order to generate more comprehensive knowledge. Furthermore, the taxonomy, ecology and trait ontologies that have been built or extended within this project will be made available through appropriate platforms such as the Open Biological and Biomedical Ontology (OBO) Foundry and the Terminology Service of the German Federation for Biological Data (GFBio).
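The JSON download mentioned above lends itself to simple local post-processing. The record and field names below are invented for illustration; the real BIOfid export schema may differ:

```python
# Hypothetical sketch of filtering a downloaded BIOfid-style JSON export
# locally, e.g. collecting taxon mentions from articles of certain decades.
# The schema (titles, years, annotation fields) is invented, not the
# actual export format.
import json

export = json.loads("""
[
 {"title": "Notes on local avifauna", "year": 1897,
  "annotations": [{"type": "Taxon", "text": "Parus major"}]},
 {"title": "Flora of a river valley", "year": 1923,
  "annotations": [{"type": "Taxon", "text": "Salix alba"}]}
]""")

def taxa_mentioned_before(records, year):
    """Collect taxon annotations from articles published before `year`."""
    return [ann["text"]
            for rec in records if rec["year"] < year
            for ann in rec["annotations"] if ann["type"] == "Taxon"]

early_taxa = taxa_mentioned_before(export, 1900)   # only the 1897 article
```

Filtering by decade or cross-tabulating co-occurring taxa follows the same pattern of iterating over the downloaded records and their annotation lists.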


Data ◽  
2021 ◽  
Vol 6 (7) ◽  
pp. 71
Author(s):  
Gonçalo Carnaz ◽  
Mário Antunes ◽  
Vitor Beires Nogueira

Criminal investigations collect and analyze the facts related to a crime, from which investigators can deduce evidence to be used in court. Criminal investigation is a multidisciplinary, applied science that includes interviews, interrogations, evidence collection, preservation of the chain of custody, and other methods and techniques of investigation. These techniques produce both digital and paper documents that have to be carefully analyzed to identify correlations and interactions among suspects, places, license plates, and other entities mentioned in the investigation. The computerized processing of these documents aids the criminal investigation, as it allows the automatic identification of entities and their relations, some of which are difficult to identify manually. A wide set of dedicated tools exists, but they share a major limitation: they are unable to process criminal reports in the Portuguese language, as no annotated corpus for that purpose exists. This paper presents an annotated corpus composed of a collection of anonymized crime-related documents extracted from official and open sources. The dataset was produced as the result of an exploratory initiative to collect crime-related data from websites and conditioned-access police reports. The dataset was evaluated, and a mean precision of 0.808, recall of 0.722, and F1-score of 0.733 were obtained for the classification of the annotated named entities present in the crime-related documents. This corpus can be employed to benchmark Machine Learning (ML) and Natural Language Processing (NLP) methods and tools that detect and correlate entities in documents. Examples include sentence detection, named-entity recognition, and identification of terms related to the criminal domain.
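The precision, recall and F1 figures reported above follow from standard counts of correct and missed entity predictions. A worked sketch with invented counts (not the corpus's actual confusion counts):

```python
# Worked sketch of the evaluation metrics reported for NER benchmarks
# (precision, recall, F1), computed from invented entity counts.

def prf(true_positives, false_positives, false_negatives):
    """Precision, recall and F1 from entity-level counts."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# e.g. a tagger that predicted 90 entities, 72 of them correct,
# while the gold annotation contains 100 entities in total
p, r, f1 = prf(true_positives=72, false_positives=18, false_negatives=28)
```

Note that F1 is the harmonic mean of precision and recall, so it sits below their arithmetic mean whenever the two diverge, as in the corpus's reported 0.808/0.722/0.733.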


2021 ◽  
Vol 8 (1) ◽  
Author(s):  
Rafaël Govaerts ◽  
Eimear Nic Lughadha ◽  
Nicholas Black ◽  
Robert Turner ◽  
Alan Paton

The World Checklist of Vascular Plants (WCVP) is a comprehensive list of scientifically described plant species, compiled over four decades from peer-reviewed literature, authoritative scientific databases, herbaria and observations, then reviewed by experts. It is a vital tool to facilitate plant diversity research, conservation and effective management, including sustainable use and equitable sharing of benefits. To maximise utility, such lists should be accessible, explicitly evidence-based, transparent, expert-reviewed, and regularly updated, incorporating new evidence and emerging scientific consensus. WCVP largely meets these criteria, being continuously updated and freely available online. Users can browse, search, or download a user-defined subset of accepted species with corresponding synonyms and bibliographic details, or a date-stamped full dataset. To facilitate appropriate data reuse by individual researchers and global initiatives including the Global Biodiversity Information Facility, Catalogue of Life and World Flora Online, we document the data collation and review processes, the underlying data structure, and the international data standards and technical validation that ensure data quality and integrity. We also address the questions most frequently received from users.
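The accepted-name/synonym lookup that such a downloadable subset supports can be sketched as follows. The rows and field names are simplified illustrations of a checklist record, not the actual WCVP column schema:

```python
# Sketch of resolving a queried scientific name to its accepted name
# using a WCVP-style checklist subset. Field names are simplified
# illustrations, not the real WCVP download columns.

wcvp_like = [
    {"name": "Betula pendula", "status": "Accepted", "accepted_name": None},
    {"name": "Betula alba", "status": "Synonym",
     "accepted_name": "Betula pendula"},
]

def resolve(name, checklist):
    """Return the accepted name for a queried scientific name,
    or None if the name is absent from the checklist."""
    for row in checklist:
        if row["name"] == name:
            return row["name"] if row["status"] == "Accepted" \
                   else row["accepted_name"]
    return None
```

This is exactly the kind of synonymy resolution that downstream aggregators perform when reconciling occurrence records against a checklist.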


Author(s):  
Sergei Tarasov ◽  
Istvan Miko ◽  
Matthew Yoder ◽  
Josef Uyeda

Ancestral character state reconstruction has long been used to gain insight into the evolution of individual traits in organisms. However, organismal anatomies (i.e. entire phenotypes) are not merely ensembles of individual traits; rather, they are complex systems in which traits interact with each other through anatomical dependencies (when one trait depends on the presence of another) and developmental constraints. Comparative phylogenetics has largely lacked a method for reconstructing the evolution of entire organismal anatomies or of organismal body regions. Here, we present a new approach named PARAMO (Phylogenetic Ancestral Reconstruction of Anatomy by Mapping Ontologies; Tarasov and Uyeda 2019) that takes anatomical dependencies into account and uses stochastic maps (i.e. phylogenetic trees with an instance of the mapped evolutionary history of characters; Huelsenbeck et al. 2003) along with anatomy ontologies to reconstruct organismal anatomies. Our approach treats the entire phenotype, or its component body regions, as single complex characters and allows exploring and comparing phenotypic evolution at different levels of the anatomical hierarchy. These complex characters are constructed by ontology-informed amalgamation of elementary characters (i.e. those coded in the character matrix) using stochastic maps. In our approach, characters are linked with terms from an anatomy ontology, which allows viewing them not just as ensembles of character state tokens but as entities with their own biological meaning, provided by the ontology. This ontology-informed framework provides new opportunities for tracking phenotypic radiations and the anatomical evolution of organisms, which we explore using a large dataset for the insect order Hymenoptera (sawflies, wasps, ants and bees).
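The ontology-informed amalgamation idea can be reduced to a toy sketch: elementary characters annotated with anatomy-ontology terms are grouped by term and their states combined into one complex state per body region. The characters, terms and states below are invented, and a single state per character stands in for a full stochastic-map sample along a branch:

```python
# Simplified sketch of PARAMO-style amalgamation: elementary characters
# linked to anatomy-ontology terms are combined into one complex
# character per body region. Real stochastic maps assign state histories
# along every branch of a tree; here one state per character stands in
# for a single sampled history. All data are invented.

characters = {
    "char1": {"ontology_term": "head",   "state": "0"},
    "char2": {"ontology_term": "head",   "state": "1"},
    "char3": {"ontology_term": "thorax", "state": "0"},
}

def amalgamate(chars):
    """Concatenate elementary character states into one complex state
    per anatomy-ontology term (deterministic order via sorting)."""
    regions = {}
    for char in sorted(chars):
        term = chars[char]["ontology_term"]
        regions[term] = regions.get(term, "") + chars[char]["state"]
    return regions

complex_states = amalgamate(characters)   # one combined state per region
```

Because the grouping is driven by the ontology rather than by the matrix layout, the same elementary characters can be re-amalgamated at coarser or finer levels of the anatomical hierarchy.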


Author(s):  
Amy Davis ◽  
Tim Adriaens ◽  
Rozemien De Troch ◽  
Peter Desmet ◽  
Quentin Groom ◽  
...  

To support invasive alien species risk assessments, the Tracking Invasive Alien Species (TrIAS) project has developed an automated, open workflow that incorporates state-of-the-art species distribution modelling practices to create risk maps, using the open source language R. It is based on Global Biodiversity Information Facility (GBIF) data and openly published environmental data layers characterizing climate and land cover. Our workflow requires only a species name and generates an ensemble of machine-learning algorithms (Random Forest, Boosted Regression Trees, K-Nearest Neighbors and AdaBoost) stacked together as a meta-model to produce the final risk map at 1 km² resolution (Fig. 1). Risk maps are generated automatically for standard Intergovernmental Panel on Climate Change (IPCC) greenhouse gas emission scenarios and are accompanied by maps illustrating the confidence of each individual prediction across space, enabling an intuitive visualization and understanding of how the confidence of the model varies across space and scenarios (Fig. 2). The effects of sampling bias are accounted for by providing options to use the sampling effort of the higher taxon to which the modelled species belongs (e.g., vascular plants) and to thin species occurrences. The risk maps generated by our workflow are defensible and repeatable and provide forecasts of alien species distributions under future climate change scenarios. They can be used to support risk assessments and to guide surveillance efforts for alien species in Europe. The detailed modelling framework and code are available on GitHub: https://github.com/trias-project.
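The stacking structure described above (base learners feeding a meta-model) can be shown schematically. The actual TrIAS workflow is implemented in R with trained Random Forest, Boosted Regression Trees, KNN and AdaBoost models; this toy Python version substitutes trivial threshold "learners" purely to show how base-model outputs are combined by a stacked layer:

```python
# Schematic of stacking base learners into a meta-model. The four
# threshold functions below are stand-ins for trained base models
# (Random Forest, BRT, KNN, AdaBoost in the real R workflow); the
# meta-model here is a simple weighted average, not the fitted
# meta-learner TrIAS actually uses.

def make_threshold_learner(threshold):
    # stands in for a trained base model mapping a covariate to a risk score
    return lambda x: 1.0 if x > threshold else 0.0

base_models = [make_threshold_learner(t) for t in (0.2, 0.4, 0.6, 0.8)]

def meta_model(base_scores, weights):
    """Weighted combination of base-model outputs (the 'stacked' layer)."""
    return sum(w * s for w, s in zip(weights, base_scores)) / sum(weights)

def predict_risk(x, weights=(1.0, 1.0, 1.0, 1.0)):
    base_scores = [m(x) for m in base_models]   # level-0 predictions
    return meta_model(base_scores, weights)     # level-1 combination

risk = predict_risk(0.5)   # two of four base learners fire
```

In a real stacked ensemble the weights are themselves fitted on held-out predictions of the base models, which is what distinguishes stacking from simple averaging.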


2020 ◽  
Vol 24 ◽  
pp. 00042
Author(s):  
Nataliya Kovtonyuk ◽  
Irina Han ◽  
Evgeniya Gatilova ◽  
Nikolai Friesen

Two herbarium collections (NS and NSK) of the Central Siberian Botanical Garden SB RAS hold about 740,000 specimens of vascular plants collected in Siberia, the Russian Far East, Europe, Asia and North America. The genus Allium s. lat. is represented by 6,224 herbarium sheets, all of which were scanned following international standards: a resolution of 600 dpi, a barcode for each specimen, a 24-color scale and a scale bar. Images and metadata are stored in the CSBG SB RAS Digital Herbarium, generated by the ScanWizard Botany and MiVapp Botany software (Microtek, Taiwan). Datasets were published via IPT on the Global Biodiversity Information Facility portal (gbif.org). In total, 207 species of the genus Allium are included in the CSBG Digital Herbarium, representing 13 subgenera and 49 sections of the genus. 35 type specimens of 18 species and subspecies of the genus Allium are held in the CSBG Herbarium collections.


PhytoKeys ◽  
2018 ◽  
Vol 109 ◽  
pp. 1-16
Author(s):  
Raoufou Radji ◽  
Kossi Adjonou ◽  
Quashie Marie-Luce Akossiwoa ◽  
Komlan Edjèdu Sodjinou ◽  
Francisco Pando ◽  
...  

This article describes the herbarium database of the University of Lomé. The database provides a good representation of the current knowledge of the flora of Togo. The herbarium of the University of Lomé, also known as Herbarium togoense, is the national herbarium and is registered in Index Herbariorum under the abbreviation TOGO. It contains 15,000 specimens of vascular plants, coming mostly from Togo and covering all of its ecofloristic regions. Less than one percent of the specimens are from neighbouring countries such as Ghana, Benin and Burkina Faso. Collecting-site details are specified on more than 97% of the sheet labels, but only about 50% contain geographic coordinates. Besides being a research resource, the herbarium constitutes an educational collection. The dataset described in this paper is registered with GBIF and accessible at https://www.gbif.org/dataset/b05dd467-aaf8-4c67-843c-27f049057b78. It was developed with the RIHA software (Réseau Informatique des Herbiers d'Afrique). The RIHA system (Chevillotte and Florence 2006, Radji et al. 2009) allows the capture of label data and associated information such as synonyms, vernacular names, taxonomic hierarchy and references.


Author(s):  
Thanh Thi Nguyen

Artificial intelligence (AI) has been applied widely in our daily lives in a variety of ways, with numerous success stories. AI has also contributed to tackling the coronavirus disease (COVID-19) pandemic that has unfolded around the globe. This paper presents a survey of AI methods used in various applications in the fight against the COVID-19 outbreak and outlines the crucial roles of AI research in this unprecedented battle. We touch on a number of areas where AI serves as an essential component, from medical image processing, data analytics, text mining and natural language processing, and the Internet of Things, to computational biology and medicine. A summary of COVID-19-related data sources available for research purposes is also presented. Research directions for exploring the potential of AI and enhancing its capabilities and power in this battle are thoroughly discussed. We highlight 13 groups of problems related to the COVID-19 pandemic and point out promising AI methods and tools that can be used to solve them. It is envisaged that this study will provide AI researchers and the wider community with an overview of the current status of AI applications and motivate researchers to harness the potential of AI in the fight against COVID-19.


Author(s):  
Donat Agosti ◽  
Marcus Guidoti ◽  
Terry Catapano ◽  
Alexandros Ioannidis-Pantopikos ◽  
Guido Sautter

As part of the CETAF COVID-19 task force, Plazi liberated taxonomic treatments, figures, observation records, biotic interactions, taxonomic names, and collection and specimen codes involving bats and viruses from scholarly publications, with the intention of creating findable, accessible, interoperable and reusable (FAIR) open access data. The data are accessible via TreatmentBank and the Biodiversity Literature Repository (BLR), and are continually harvested and reused by the Global Biodiversity Information Facility (GBIF) and Global Biotic Interactions (GloBI). The data were processed, enhanced and liberated by the Plazi workflow, which involves a dedicated infrastructure including a desktop application (GoldenGate Imagine) that converts Portable Document Format (PDF) files into a dedicated open compressed file format, the Image Markup File (IMF), which supports the data enhancement. To enhance the data contained in the publications, including the biological interactions, a series of standards and vocabularies is used. With the exception of TaxPub, a taxonomy-specific extension of the U.S. National Center for Biotechnology Information's (NCBI) Journal Article Tag Suite (JATS), all vocabularies used had been previously proposed, in line with Plazi’s mission to reuse existing standards wherever available. The following standards and vocabularies are used: the Metadata Object Description Schema (MODS) to model article metadata in Plazi’s XML files; Darwin Core for taxonomic ranks and materials-citation data; the Open Biological and Biomedical Ontology (OBO); and the Relations Ontology for biological interactions between organisms. The latter two are also used in the custom metadata of the Biodiversity Literature Repository at Zenodo. In this presentation we will provide an overview of the different types of data, followed by the standards or vocabularies applied to each of them and their parts. 
The goal is to provide context on how the data liberated by Plazi are described, as these data are extensively reused by third-party applications such as GBIF and GloBI. The use of these standards allows fully automated, daily data ingests by GBIF.
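The kind of markup TaxPub adds to a publication can be sketched with a minimal fragment. Both the fragment and the namespace URI below are invented simplifications for illustration, not an actual Plazi/TreatmentBank document:

```python
# Minimal sketch of extracting a taxonomic name from TaxPub-style markup
# (TaxPub extends NCBI's JATS with taxonomic elements). The XML fragment
# and the namespace URI are invented, simplified examples.
import xml.etree.ElementTree as ET

TP = "urn:example:taxpub"   # placeholder namespace, not the real TaxPub URI

snippet = f"""
<tp:taxon-treatment xmlns:tp="{TP}">
  <tp:nomenclature>
    <tp:taxon-name>
      <tp:taxon-name-part taxon-name-part-type="genus">Rhinolophus</tp:taxon-name-part>
      <tp:taxon-name-part taxon-name-part-type="species">ferrumequinum</tp:taxon-name-part>
    </tp:taxon-name>
  </tp:nomenclature>
</tp:taxon-treatment>
"""

root = ET.fromstring(snippet)
parts = [e.text for e in root.iter(f"{{{TP}}}taxon-name-part")]
binomial = " ".join(parts)   # document order: genus then species
```

Because the name parts are explicit elements rather than plain text, downstream harvesters can extract taxon names, ranks and materials citations without re-running named-entity recognition on the article.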

