Semantic Search in Legacy Biodiversity Literature: Integrating data from different data infrastructures

Author(s):  
Adrian Pachzelt ◽  
Gerwin Kasperek ◽  
Andy Lücking ◽  
Giuseppe Abrami ◽  
Christine Driller

Nowadays, obtaining information by entering queries into a web search engine is routine behaviour. With its search portal, the Specialised Information Service Biodiversity Research (BIOfid) brings the exploration of legacy biodiversity literature and data extraction up to current standards (Driller et al. 2020). In this presentation, we introduce the BIOfid search portal and its functionalities in a short how-to guide. To this end, we adapted a knowledge graph representation of our thematic focus: Central European, primarily German-language, biodiversity literature of the 19th and 20th centuries. Users can now search our text-mined corpus, which to date contains more than 8,700 full-text articles from 68 journals and focusses particularly on birds, lepidopterans and vascular plants. The texts are automatically preprocessed by the Natural Language Processing provider TextImager (Hemati et al. 2016) and linked to various databases such as Wikidata, Wikipedia, the Global Biodiversity Information Facility (GBIF), Encyclopedia of Life (EoL), GeoNames, the Integrated Authority File (GND) and WordNet. For data retrieval, users can filter search results and download the article metadata as well as text annotations and database links in JavaScript Object Notation (JSON) format. For example, users can search for literature that mentions taxa in certain decades, or for co-occurrences of species. Our search engine recognises scientific and vernacular taxon names based on the GBIF Backbone Taxonomy and offers search suggestions to support the user. The semantic network of the BIOfid search portal is also enriched with data from the EoL trait bank, so that trait data can be included in search queries. Thus, scientists can enhance their own data sets with the search results and feed them into the relevant biodiversity data repositories, sustainably expanding the corresponding knowledge graphs with reliable data.
Since BIOfid applies standard ontology terms, all data mobilized from literature can be combined with data on natural history collection objects or data from current research projects in order to generate more comprehensive knowledge. Furthermore, taxonomy, ecology and trait ontologies that have been built or extended within this project will be made available through appropriate platforms such as The Open Biological and Biomedical Ontology (OBO) Foundry and the Terminology Service of The German Federation for Biological Data (GFBio).
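A minimal sketch of working with such a JSON export, assuming a hypothetical record structure (the portal's actual schema may differ):

```python
import json

# Hypothetical example of a BIOfid-style JSON export: article metadata plus
# text annotations with database links. The schema is an assumption made
# for this sketch, not the portal's documented format.
record_json = """
{
  "article": {"title": "Beobachtungen an Tagfaltern", "year": 1903},
  "annotations": [
    {"text": "Vanessa atalanta", "type": "Taxon",
     "links": {"gbif": "https://www.gbif.org/species/0000000"}},
    {"text": "Frankfurt am Main", "type": "Location",
     "links": {"geonames": "https://www.geonames.org/0000000"}}
  ]
}
"""

record = json.loads(record_json)

# Collect all taxon mentions together with their database links.
taxa = [(a["text"], a["links"]) for a in record["annotations"]
        if a["type"] == "Taxon"]
```

Filtering by annotation type in this way is how exported annotations could be fed into a user's own occurrence dataset before submission to a repository such as GBIF.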

Mathematics ◽  
2020 ◽  
Vol 8 (11) ◽  
pp. 2090
Author(s):  
Addi Ait-Mlouk ◽  
Xuan-Son Vu ◽  
Lili Jiang

The huge amount of heterogeneous data stored in different locations needs to be federated and semantically interconnected for further use. This paper introduces WINFRA, a comprehensive open-access platform for semantic web data and advanced analytics based on natural language processing (NLP) and data mining techniques (e.g., association rules, clustering, and classification based on associations). The system is designed to facilitate federated data analysis, knowledge discovery, information retrieval, and new techniques for dealing with semantic web and knowledge graph representation. The processing step integrates data from multiple sources by creating virtual databases. Afterwards, the RDF Generator developed for the platform produces RDF files for the different data sources, together with SPARQL queries, to support semantic data search and knowledge graph representation. Furthermore, several application cases are provided to demonstrate how the platform facilitates advanced data analytics over semantic data and to showcase our proposed approach toward semantic association rules.
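As a toy illustration of the triple-based querying that RDF files and SPARQL enable (vocabulary and data invented for this sketch; a real deployment would use an RDF store and a SPARQL engine):

```python
# RDF-style triples as plain (subject, predicate, object) tuples.
triples = {
    ("ex:alice", "ex:worksAt", "ex:acme"),
    ("ex:bob",   "ex:worksAt", "ex:acme"),
    ("ex:acme",  "ex:locatedIn", "ex:sweden"),
}

def select(pattern, graph):
    """Bindings for variables (terms starting with '?') in a single triple
    pattern -- a miniature stand-in for a one-pattern SPARQL SELECT."""
    results = []
    for triple in graph:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                binding[term] = value
            elif term != value:
                break  # constant mismatch: this triple does not match
        else:
            results.append(binding)
    return results

# Analogous to: SELECT ?who WHERE { ?who ex:worksAt ex:acme }
workers = select(("?who", "ex:worksAt", "ex:acme"), triples)
```

The point of the sketch is only the data model: once heterogeneous sources are expressed as triples, one query mechanism serves them all.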


2018 ◽  
Vol 2 ◽  
pp. e25585
Author(s):  
Markus Koch ◽  
Christine Driller ◽  
Marco Schmidt ◽  
Thomas Hörnschemeyer ◽  
Claus Weiland ◽  
...  

The Specialized Information Service Biodiversity Research (BIOfid; http://biofid.de/) has recently been launched to mobilize valuable biodiversity data hidden in German print sources of the past 250 years. The partners involved in this project have started digitising the literature corpus envisaged for the pilot stage and have provided novel applications for natural language processing and visualization. In order to foster the development of new text mining tools, the Senckenberg Biodiversity Informatics team focuses on the design of ontologies for taxa and their anatomy. We present our progress for the taxa prioritized by the target group for the pilot stage, i.e. vascular plants, moths and butterflies, as well as birds. With regard to our text corpus, a key aspect of our taxonomic ontologies is the inclusion of German vernacular names. For this purpose, we assembled a taxonomy ontology for vascular plants by synchronizing taxon lists from the Global Biodiversity Information Facility (GBIF) and the Integrated Taxonomic Information System (ITIS) with K.P. Buttler’s Florenliste von Deutschland (http://www.kp-buttler.de/florenliste/). Hierarchical classification of the taxonomic names and class relationships focus on rank and status (validity vs. synonymy). All classes are additionally annotated with details on scientific name, taxonomic authorship, and source. Taxonomic names for birds are mainly compiled from ITIS and the International Ornithological Congress (IOC) World Bird List, and for moths and butterflies mainly from GBIF, both lists being classified and annotated accordingly. We intend to cross-link our taxonomy ontologies with the Environment Ontology (ENVO) and anatomy ontologies such as the Flora Phenotype Ontology (FLOPO). For moths and butterflies, we have started to design the Lepidoptera Anatomy Ontology (LepAO) on the basis of the already available Hymenoptera Anatomy Ontology (HAO).
LepAO is planned to be interoperable with other ontologies in the framework of the OBO Foundry. A main modification of HAO is the inclusion of German anatomical terms from published glossaries, which we add as scientific and vernacular synonyms in order to make use of the already available identifiers (URIs) for the corresponding English terms. International collaboration with the founders of HAO and with teams focusing on other insect orders, such as beetles (ColAO), aims at the development of a unified Insect Anatomy Ontology. By restricting itself to terms applicable to all insects, the unified Insect Anatomy Ontology is intended to establish a basis for accelerating the design of more specific anatomy ontologies for any particular insect order. The advancement of such ontologies aligns with current needs to make the knowledge accumulated in descriptive studies on the systematics of organisms accessible to other domains. In the context of BIOfid, our ontologies provide exemplars of how semantic queries of yet untapped data relevant for biodiversity studies can be achieved for literature in non-English languages. Furthermore, BIOfid will serve as an open-access platform for professional international journals, facilitating non-commercial publishing of biodiversity and biodiversity-related data.
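The synonym handling described above, in which vernacular names reuse the identifier of their scientific counterpart, can be sketched as a simple lookup (names and URIs below are illustrative only, not ontology content):

```python
# Each scientific concept has one URI; vernacular names map onto the
# scientific name and therefore resolve to the same identifier.
taxon_uris = {
    "Parus major": "http://example.org/taxon/parus_major",
}
vernacular = {
    "Kohlmeise": "Parus major",  # German vernacular -> scientific name
}

def resolve(name):
    """Resolve a scientific or vernacular name to its concept URI."""
    scientific = vernacular.get(name, name)
    return taxon_uris.get(scientific)
```

The design choice mirrors the text: adding German terms as synonyms avoids minting new identifiers, so queries in either language land on the same class.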


Author(s):  
Christine Driller ◽  
Markus Koch ◽  
Giuseppe Abrami ◽  
Wahed Hemati ◽  
Andy Lücking ◽  
...  

The storage of data in public repositories such as the Global Biodiversity Information Facility (GBIF) or the National Center for Biotechnology Information (NCBI) is nowadays stipulated in the policies of many publishers in order to facilitate data replication or proliferation. Species occurrence records contained in legacy printed literature are no exception to this. The extent of their digital and machine-readable availability, however, is still far from matching the existing data volume (Thessen and Parr 2014). Yet precisely these data are becoming more and more relevant to the investigation of the ongoing loss of biodiversity. In order to extract species occurrence records at a larger scale from available publications, one has to apply specialised text mining tools. However, such tools are in short supply, especially for scientific literature in the German language. The Specialised Information Service Biodiversity Research BIOfid (Koch et al. 2017) aims at reducing this gap, inter alia, by preparing a searchable text corpus semantically enriched by a new kind of multi-label annotation. For this purpose, we feed manual annotations into automatic, machine-learning annotators. This mixture of automatic and manual methods is needed because BIOfid approaches a new application area with respect to language (mainly 19th-century German), text type (biological reports), and linguistic focus (technical and everyday language). We will present current results on the performance of BIOfid’s semantic search engine and the application of independent natural language processing (NLP) tools. Most of these are freely available online, such as TextImager (Hemati et al. 2016). We will show how TextImager is tied into the BIOfid pipeline and how it is made scalable (e.g. extendible by further modules) and usable on different systems (Docker containers).
Further, we will provide a short introduction to generating machine-learning training data using TextAnnotator (Abrami et al. 2019) for multi-label annotation. Annotation reproducibility can be assessed by implementing inter-annotator agreement methods (Abrami et al. 2020). Beyond taxon recognition and entity linking, we place particular emphasis on location and time information. For this purpose, our annotation tag-set combines general categories and biology-specific categories (including taxonomic names) with location and time ontologies. The application of the annotation categories is governed by annotation guidelines (Lücking et al. 2020). Within the next few years, our deliverable will be a semantically accessible and data-extractable text corpus of around two million pages. In this way, BIOfid is creating a valuable new resource that expands our knowledge of biodiversity and its determinants.
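One standard inter-annotator agreement measure such an assessment might use is Cohen's kappa; a minimal implementation follows (the abstract does not specify which measures Abrami et al. 2020 implement, so this is a generic illustration):

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators labelling the same n items:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    expected = sum(counts_a[lab] * counts_b[lab]
                   for lab in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)

# Toy labels, e.g. T = taxon, L = location, O = other.
k_perfect = cohens_kappa(["T", "L", "T", "O"], ["T", "L", "T", "O"])
k_partial = cohens_kappa(["T", "T", "L", "L"], ["T", "L", "L", "L"])
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when label distributions are skewed, as they typically are in entity annotation.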


2019 ◽  
Vol 47 (W1) ◽  
pp. W636-W641 ◽  
Author(s):  
Fábio Madeira ◽  
Young mi Park ◽  
Joon Lee ◽  
Nicola Buso ◽  
Tamer Gur ◽  
...  

The EMBL-EBI provides free access to popular bioinformatics sequence analysis applications as well as to a full-featured text search engine with powerful cross-referencing and data retrieval capabilities. Access to these services is provided via user-friendly web interfaces and via established RESTful and SOAP Web Services APIs (https://www.ebi.ac.uk/seqdb/confluence/display/JDSAT/EMBL-EBI+Web+Services+APIs+-+Data+Retrieval). Both systems have been developed with the same core principles that allow them to integrate an ever-increasing volume of biological data, making them an integral part of many popular data resources provided at the EMBL-EBI. Here, we describe the latest improvements made to the frameworks which enhance the interconnectivity between public EMBL-EBI resources and ultimately enhance biological data discoverability, accessibility, interoperability and reusability.
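A sketch of how a client might construct a request URL for the EBI Search REST interface (no request is sent here; the parameter names follow common EBI Search usage but should be verified against the official Web Services documentation linked above):

```python
from urllib.parse import urlencode

# Base endpoint of the EBI Search REST service (assumed from public docs).
BASE = "https://www.ebi.ac.uk/ebisearch/ws/rest"

def search_url(domain, query, fmt="json", size=10):
    """Build a search URL for a given data domain, e.g. 'uniprot'."""
    params = urlencode({"query": query, "format": fmt, "size": size})
    return f"{BASE}/{domain}?{params}"

url = search_url("uniprot", "p53 AND organism:9606")
```

Building the URL separately from issuing the request keeps the example self-contained and makes the query parameters easy to inspect or log.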


2021 ◽  
Vol 4 ◽  
Author(s):  
Cameron Ogle ◽  
David Reddick ◽  
Coleman McKnight ◽  
Tyler Biggs ◽  
Rini Pauly ◽  
...  

Advanced imaging and DNA sequencing technologies now enable the diverse biology community to routinely generate and analyze terabytes of high-resolution biological data. The community is rapidly heading toward the petascale in single-investigator laboratory settings. As evidence, the NCBI SRA central DNA sequence repository alone contains over 45 petabytes of biological data. Given the geometric growth of this and other genomics repositories, an exabyte of mineable biological data is imminent. The challenges of effectively utilizing these datasets are enormous, as they are not only large in size but also stored in geographically distributed repositories such as the National Center for Biotechnology Information (NCBI), the DNA Data Bank of Japan (DDBJ), the European Bioinformatics Institute (EBI), and NASA’s GeneLab. In this work, we first systematically point out the data-management challenges of the genomics community. We then introduce Named Data Networking (NDN), a novel but well-researched Internet architecture that is capable of solving these challenges at the network layer. NDN performs all operations such as forwarding requests to data sources, content discovery, access, and retrieval using content names (similar to traditional filenames or filepaths) and eliminates the need for a location layer (the IP address) in data management. Utilizing NDN for genomics workflows simplifies data discovery, speeds up data retrieval using in-network caching of popular datasets, and allows the community to create infrastructure that supports operations such as creating federations of content repositories, retrieval from multiple sources, remote data subsetting, and others. Name-based operations also streamline the deployment and integration of workflows with various cloud platforms.
Our contributions in this work are as follows: 1) we enumerate the cyberinfrastructure challenges of the genomics community that NDN can alleviate; 2) we describe our efforts in applying NDN to a contemporary genomics workflow (GEMmaker) and quantify the improvements, with a preliminary evaluation showing a sixfold speed-up in data insertion into the workflow; and 3) as a pilot, we have used an NDN naming scheme (agreed upon by the community and discussed in Section 4) to publish data from broadly used data repositories, including the NCBI SRA. We have loaded the NDN testbed with these pre-processed genomes, which can be accessed over NDN and used by anyone interested in those datasets. Finally, we discuss our continued effort in integrating NDN with cloud computing platforms, such as the Pacific Research Platform (PRP). The reader should note that the goal of this paper is to introduce NDN to the genomics community and to discuss NDN’s properties that can benefit that community. We do not present an extensive performance evaluation of NDN; we are working on extending and evaluating our pilot deployment and will present systematic results in future work.
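A hypothetical example of hierarchical, location-independent content naming in the NDN spirit (this is an invented scheme for illustration, not the community scheme discussed in Section 4 of the paper):

```python
# Build an NDN-style hierarchical content name for a genomics dataset.
# Consumers request this name directly; the network, not the client,
# decides which repository or cache actually serves the data.
def ndn_name(repo, organism, accession, filetype):
    components = ["genomics", repo, organism, accession, filetype]
    return "/" + "/".join(
        c.strip("/").lower().replace(" ", "-") for c in components
    )

name = ndn_name("ncbi-sra", "Homo sapiens", "SRR000001", "fastq")
```

Because the name carries no host or IP, the same request can be satisfied by the origin repository, a mirror, or an in-network cache, which is exactly the property the abstract credits with speeding up retrieval.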


2005 ◽  
Vol 11 (4) ◽  
pp. 435-438 ◽  
Author(s):  
ROBERT DALE

Suppose you're a corporate vice president at a well-known international software company, and you want to check on the visibility of one of your leading researchers in the outside world. You're sitting at your desk, so the most obvious thing to do is to enter their name into a search engine. If the well-known international software company happened to be Microsoft, and if the leading researcher happened to be Microsoft's Susan Dumais, and if the search engine you decided to use happened to be Google, you might be surprised to find that the sponsored link that comes atop the search results is actually from Google itself, exhorting you to ‘Work on NLP at Google’, and alerting you to the fact that ‘Google is hiring experts in statistical language processing’.


2017 ◽  
Vol 9 (1) ◽  
pp. 19-24 ◽  
Author(s):  
David Domarco ◽  
Ni Made Satvika Iswari

Technology development has affected many areas of life, especially the entertainment field. One of the fastest-growing entertainment industries is anime. Anime has evolved into a trend and a hobby, especially for populations in the regions of Asia. The number of anime fans grows every year, and fans try to dig up as much information as they can about their favorite anime. Therefore, a chatbot application was developed in this study as an anime information retrieval medium using the regular expression pattern matching method. This application is intended to make it easier for anime fans to search for information about the anime they like. By using this application, users gain a convenient and interactive means of anime data retrieval that can’t be found when searching for information via search engines. The chatbot application successfully met the standards of an information retrieval engine with very good results: 72% precision and 100% recall, yielding a harmonic mean of 83.7%. As a hedonic application, the chatbot influenced Behavioral Intention to Use by 83% and Immersion by 82%. Index Terms—anime, chatbot, information retrieval, Natural Language Processing (NLP), Regular Expression Pattern Matching
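A minimal sketch of regular expression pattern matching for chatbot intents (the patterns and replies below are invented for illustration, not taken from the study):

```python
import re

# Each intent is a compiled pattern plus a reply template; the first
# pattern that matches the user's message wins.
PATTERNS = [
    (re.compile(r"\bwho (?:is|are) the main characters? (?:of|in) (.+?)\??$", re.I),
     "Looking up the main characters of {0}..."),
    (re.compile(r"\bwhen did (.+) air\??$", re.I),
     "Looking up the air date of {0}..."),
]

def reply(message):
    for pattern, template in PATTERNS:
        m = pattern.search(message)
        if m:
            return template.format(m.group(1).rstrip("?").strip())
    return "Sorry, I don't understand."
```

The reported harmonic mean also checks out against the stated precision and recall: 2 × 0.72 × 1.00 / (0.72 + 1.00) ≈ 0.837, i.e. 83.7%.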


2021 ◽  
Vol 5 (2) ◽  
Author(s):  
Hannah C Cai ◽  
Leanne E King ◽  
Johanna T Dwyer

ABSTRACT We assessed the quality of online health and nutrition information using a Google™ search on “supplements for cancer”. Search results were scored using the Health Information Quality Index (HIQI), a quality-rating tool consisting of 12 objective criteria related to website domain, lack of commercial aspects, and authoritative nature of the health and nutrition information provided. Possible scores ranged from 0 (lowest) to 12 (“perfect” or highest quality). After eliminating irrelevant results, the remaining 160 search results had median and mean scores of 8. One-quarter of the results were of high quality (score of 10–12). There was no correlation between high-quality scores and early appearance in the sequence of search results, where results are presumably more visible. Also, 496 advertisements, over twice the number of search results, appeared. We conclude that the Google™ search engine may have shortcomings when used to obtain information on dietary supplements and cancer.


2021 ◽  
pp. 089443932110068
Author(s):  
Aleksandra Urman ◽  
Mykola Makhortykh ◽  
Roberto Ulloa

We examine how six search engines filter and rank information in relation to queries on the U.S. 2020 presidential primary elections under default (that is, nonpersonalized) conditions. For this, we utilize an algorithmic auditing methodology that uses virtual agents to conduct large-scale analysis of algorithmic information curation in a controlled environment. Specifically, we look at the text search results for the queries “us elections,” “donald trump,” “joe biden,” and “bernie sanders” on Google, Baidu, Bing, DuckDuckGo, Yahoo, and Yandex during the 2020 primaries. Our findings indicate substantial differences in the search results between search engines and multiple discrepancies within the results generated for different agents using the same search engine. This highlights that whether users see certain information is decided by chance due to the inherent randomization of search results. We also find that some search engines prioritize different categories of information sources with respect to specific candidates. These observations demonstrate that algorithmic curation of political information can create information inequalities between search engine users even under nonpersonalized conditions. Such inequalities are particularly troubling considering that search results are highly trusted by the public and can shift the opinions of undecided voters, as demonstrated by previous research.
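One simple way to quantify discrepancies between the result lists of different agents or engines (the study's actual comparison metrics are not specified here) is set overlap, for example the Jaccard index:

```python
# Jaccard overlap between two top-N result lists: 1.0 means identical
# sets of results, 0.0 means no result in common. Rank order is ignored.
def jaccard(a, b):
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

# Invented top-4 result lists for two virtual agents issuing the same query.
agent1 = ["cnn.com", "nytimes.com", "foxnews.com", "wikipedia.org"]
agent2 = ["cnn.com", "wikipedia.org", "politico.com", "nytimes.com"]

overlap = jaccard(agent1, agent2)
```

Rank-aware measures (e.g. rank-biased overlap) would weight top positions more heavily, which matters for search results since users rarely scroll past the first few entries.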


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Rakesh David ◽  
Rhys-Joshua D. Menezes ◽  
Jan De Klerk ◽  
Ian R. Castleden ◽  
Cornelia M. Hooper ◽  
...  

The increased diversity and scale of published biological data has led to a growing appreciation of the application of machine learning and statistical methodologies to gain new insights. Key to achieving this aim is solving the Relationship Extraction problem, which specifies the semantic interaction between two or more biological entities in a published study. Here, we employed two deep neural network natural language processing (NLP) methods, namely the continuous bag of words (CBOW) and the bi-directional long short-term memory (bi-LSTM). These methods were employed to predict relations between entities that describe protein subcellular localisation in plants. We applied our system to 1700 published Arabidopsis protein subcellular studies from the manually curated SUBA dataset. The system combines pre-processing of full-text articles in a machine-readable format with relevant sentence extraction for downstream NLP analysis. Using the SUBA corpus, the neural network classifier predicted interactions between protein name, subcellular localisation and experimental methodology with average precision, recall, accuracy and F1 scores of 95.1%, 82.8%, 89.3% and 88.4%, respectively (n = 30). Comparable scoring metrics were obtained using the CropPAL database, which stores protein subcellular localisation in crop species, as an independent testing dataset, demonstrating the wide applicability of the prediction model. We provide a framework for extracting protein functional features from unstructured text in the literature with high accuracy, improving data dissemination and unlocking the potential of big data text analytics for generating new hypotheses.
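The relevant-sentence-extraction step that precedes the neural classifier can be sketched as a simple lexical filter (the term lists are toy examples; the actual system relies on curated resources and the CBOW/bi-LSTM models for classification):

```python
import re

# Keep only sentences that mention both a protein of interest and a
# subcellular compartment; these become candidates for relation extraction.
PROTEINS = {"AtTOC33", "PIN1"}
COMPARTMENTS = {"chloroplast", "plasma membrane", "mitochondrion"}

def relevant_sentences(text):
    sentences = re.split(r"(?<=[.!?])\s+", text)
    keep = []
    for s in sentences:
        has_protein = any(p in s for p in PROTEINS)
        has_compartment = any(c in s.lower() for c in COMPARTMENTS)
        if has_protein and has_compartment:
            keep.append(s)
    return keep

text = ("GFP-tagged AtTOC33 localised to the chloroplast envelope. "
        "The growth chamber was kept at 22 degrees.")
hits = relevant_sentences(text)
```

Filtering out sentences that cannot possibly express a localisation relation keeps the downstream classifier focused and reduces the volume of text it must score.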

