Fast and Easy Access to Central European Biodiversity Data with BIOfid

Author(s):  
Christine Driller ◽  
Markus Koch ◽  
Giuseppe Abrami ◽  
Wahed Hemati ◽  
Andy Lücking ◽  
...  

The storage of data in public repositories such as the Global Biodiversity Information Facility (GBIF) or the National Center for Biotechnology Information (NCBI) is nowadays stipulated in the policies of many publishers in order to facilitate data replication or proliferation. Species occurrence records contained in legacy printed literature are no exception to this. The extent of their digital and machine-readable availability, however, is still far from matching the existing data volume (Thessen and Parr 2014). Yet precisely these data are becoming increasingly relevant to the investigation of the ongoing loss of biodiversity. In order to extract species occurrence records at a larger scale from available publications, one has to apply specialised text mining tools. However, such tools are in short supply, especially for scientific literature in the German language. The Specialised Information Service Biodiversity Research BIOfid (Koch et al. 2017) aims to address this gap, inter alia, by preparing a searchable text corpus semantically enriched by a new kind of multi-label annotation. For this purpose, we feed manual annotations into automatic, machine-learning annotators. This mixture of automatic and manual methods is needed because BIOfid approaches a new application area with respect to language (mainly 19th-century German), text type (biological reports), and linguistic focus (technical and everyday language). We will present current results on the performance of BIOfid’s semantic search engine and the application of independent natural language processing (NLP) tools. Most of these are freely available online, such as TextImager (Hemati et al. 2016). We will show how TextImager is tied into the BIOfid pipeline and how it is made scalable (e.g. extendible by further modules) and usable on different systems (Docker containers). 
Further, we will provide a short introduction to generating machine-learning training data using TextAnnotator (Abrami et al. 2019) for multi-label annotation. Annotation reproducibility can be assessed through the implementation of inter-annotator agreement methods (Abrami et al. 2020). Beyond taxon recognition and entity linking, we place particular emphasis on location and time information. For this purpose, our annotation tag-set combines general categories and biology-specific categories (including taxonomic names) with location and time ontologies. The application of the annotation categories is governed by annotation guidelines (Lücking et al. 2020). Within the next few years, the project will deliver a semantically accessible and data-extractable text corpus of around two million pages. In this way, BIOfid is creating a valuable new resource that expands our knowledge of biodiversity and its determinants.
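Inter-annotator agreement of the kind referenced above (Abrami et al. 2020) is commonly quantified with chance-corrected measures such as Cohen's kappa. A minimal sketch, with illustrative labels rather than actual BIOfid annotation data:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators' label sequences."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's label distribution
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[l] * freq_b[l] for l in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Two annotators labelling six tokens with invented tag-set categories
a = ["TAXON", "OTHER", "LOCATION", "TAXON", "OTHER", "OTHER"]
b = ["TAXON", "OTHER", "LOCATION", "OTHER", "OTHER", "OTHER"]
print(round(cohens_kappa(a, b), 3))  # → 0.714
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when one label (here OTHER) dominates the distribution.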

Author(s):  
Barnaby Walker ◽  
Tarciso Leão ◽  
Steven Bachman ◽  
Eve Lucas ◽  
Eimear Nic Lughadha

Extinction risk assessments are increasingly important to many stakeholders (Bennun et al. 2017) but there remain large gaps in our knowledge about the status of many species. The IUCN Red List of Threatened Species (IUCN 2019, hereafter Red List) is the most comprehensive assessment of extinction risk. However, it includes assessments of just 7% of all vascular plants, while 18% of all assessed animals lack sufficient data to assign a conservation status. The wide availability of species occurrence information through digitised natural history collections and aggregators such as the Global Biodiversity Information Facility (GBIF), coupled with machine learning methods, provides an opportunity to fill these gaps in our knowledge. Machine learning approaches have already been proposed to guide conservation assessment efforts (Nic Lughadha et al. 2018), assign a conservation status to species with insufficient data for a full assessment (Bland et al. 2014), and predict the number of threatened species across the world (Pelletier et al. 2018). The wide range in sources of species occurrence records can lead to data quality issues, such as missing, imprecise, or mistaken information. These data quality issues may be compounded in databases that aggregate information from multiple sources: many such records derive from field observations (78% for plant species in GBIF; Meyer et al. 2016) largely unsupported by voucher specimens that would allow confirmation or correction of their identification. Even where voucher specimens do exist, different taxonomic or geographic information can be held for a single collection event represented by duplicate specimens deposited in different natural history collections. Tools are available to help clean species occurrence data, but these cannot deal with problems like specimen misidentification, which previous work (Nic Lughadha et al. 2019) has shown to have a large impact on preliminary assessments of conservation status. 
Machine learning models based on species occurrence records have been reported to predict with high accuracy the conservation status of species. However, given the black-box nature of some of the better machine learning models, it is unclear how well these accuracies apply beyond the data on which the models were trained. Practices for training machine learning models differ between studies, but more interrogation of these models is required if we are to know how much to trust their predictions. To address these problems, we compare predictions made by a machine learning model when trained on specimen occurrence records that have benefitted from minimal or more thorough cleaning, with those based on records from an expert-curated database. We then explore different techniques to interrogate machine learning models and quantify the uncertainty in their predictions.
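To make the object of this comparison concrete, here is a minimal sketch of the kind of model being interrogated: a random forest trained on occurrence-derived predictors. The features (record count, range size), the labelling rule, and all values are synthetic inventions for the example, not the authors' actual pipeline:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-ins for occurrence-derived predictors per species:
# number of occurrence records and a rough range size (in degrees).
n = 200
n_records = rng.integers(1, 500, n)
range_size = rng.uniform(0.1, 50.0, n)
X = np.column_stack([n_records, range_size])
# Invented labelling rule: narrow-ranged, modestly recorded species
# are marked "threatened" for the purpose of the demonstration.
y = (range_size < 10) & (n_records < 250)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X[:150], y[:150])
acc = clf.score(X[150:], y[150:])          # held-out accuracy
print(f"held-out accuracy: {acc:.2f}")
# One simple way to interrogate the model: feature importances
print(clf.feature_importances_)
```

Feature importances are only a first step; the abstract's point is precisely that such black-box models need deeper interrogation (and uncertainty quantification) before their accuracies are trusted beyond the training data.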


JAMIA Open ◽  
2020 ◽  
Vol 3 (2) ◽  
pp. 146-150
Author(s):  
Egoitz Laparra ◽  
Steven Bethard ◽  
Timothy A Miller

Abstract Building clinical natural language processing (NLP) systems that work on widely varying data is an absolute necessity because of the expense of obtaining new training data. While domain adaptation research can have a positive impact on this problem, the most widely studied paradigms do not take into account the realities of clinical data sharing. To address this issue, we lay out a taxonomy of domain adaptation, parameterizing by what data is shareable. We show that the most realistic settings for clinical use cases are seriously under-studied. To support research in these important directions, we make a series of recommendations, not just for domain adaptation but for clinical NLP in general, that ensure that data, shared tasks, and released models are broadly useful, and that initiate research directions where the clinical NLP community can lead the broader NLP and machine learning fields.


2015 ◽  
Vol 22 (3) ◽  
pp. 671-681 ◽  
Author(s):  
Azadeh Nikfarjam ◽  
Abeed Sarker ◽  
Karen O’Connor ◽  
Rachel Ginn ◽  
Graciela Gonzalez

Abstract Objective Social media is becoming increasingly popular as a platform for sharing personal health-related information. This information can be utilized for public health monitoring tasks, particularly for pharmacovigilance, via the use of natural language processing (NLP) techniques. However, the language in social media is highly informal, and user-expressed medical concepts are often nontechnical, descriptive, and challenging to extract. There has been limited progress in addressing these challenges, and thus far, advanced machine learning-based NLP techniques have been underutilized. Our objective is to design a machine learning-based approach to extract mentions of adverse drug reactions (ADRs) from highly informal text in social media. Methods We introduce ADRMine, a machine learning-based concept extraction system that uses conditional random fields (CRFs). ADRMine utilizes a variety of features, including a novel feature for modeling words’ semantic similarities. The similarities are modeled by clustering words based on unsupervised, pretrained word representation vectors (embeddings) generated from unlabeled user posts in social media using a deep learning technique. Results ADRMine outperforms several strong baseline systems in the ADR extraction task by achieving an F-measure of 0.82. Feature analysis demonstrates that the proposed word cluster features significantly improve extraction performance. Conclusion It is possible to extract complex medical concepts, with relatively high performance, from informal, user-generated content. Our approach is particularly scalable, suitable for social media mining, as it relies on large volumes of unlabeled data, thus diminishing the need for large, annotated training data sets.
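The word-cluster feature can be sketched as follows: word embeddings are clustered, and each token's cluster ID becomes a categorical feature for the CRF. The toy vocabulary, the 8-dimensional random vectors, and the tiny Lloyd's-algorithm k-means below are all illustrative stand-ins for ADRMine's pretrained embeddings and tooling:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy "pretrained" embeddings (ADRMine derives these from unlabeled posts).
vocab = ["headache", "migraine", "nausea", "aspirin", "ibuprofen", "tablet"]
emb = {w: rng.normal(size=8) for w in vocab}
# Nudge related words together so the clustering has some structure.
emb["migraine"] = emb["headache"] + 0.1 * rng.normal(size=8)
emb["ibuprofen"] = emb["aspirin"] + 0.1 * rng.normal(size=8)

def kmeans(X, k, iters=20, seed=0):
    """Plain Lloyd's algorithm: assign to nearest centroid, recompute means."""
    r = np.random.default_rng(seed)
    centroids = X[r.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centroids) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    return labels

X = np.array([emb[w] for w in vocab])
# The cluster ID is then emitted as a categorical feature per CRF token.
cluster_id = dict(zip(vocab, kmeans(X, k=3)))
print(cluster_id)
```

Because the clusters are learnt from unlabeled text, unseen but semantically similar words can share a feature value with words observed in training, which is what drives the reported gains.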


2018 ◽  
Vol 2 ◽  
pp. e25585
Author(s):  
Markus Koch ◽  
Christine Driller ◽  
Marco Schmidt ◽  
Thomas Hörnschemeyer ◽  
Claus Weiland ◽  
...  

The Specialized Information Service Biodiversity Research (BIOfid; http://biofid.de/) has recently been launched to mobilize valuable biodiversity data hidden in German print sources of the past 250 years. The partners involved in this project started digitisation of the literature corpus envisaged for the pilot stage and provided novel applications for natural language processing and visualization. In order to foster development of new text mining tools, the Senckenberg Biodiversity Informatics team focuses on the design of ontologies for taxa and their anatomy. We present our progress for the taxa prioritized by the target group for the pilot stage, i.e. for vascular plants, moths and butterflies, as well as birds. With regard to our text corpus a key aspect of our taxonomic ontologies is the inclusion of German vernacular names. For this purpose we assembled a taxonomy ontology for vascular plants by synchronizing taxon lists from the Global Biodiversity Information Facility (GBIF) and the Integrated Taxonomic Information System (ITIS) with K.P. Buttler’s Florenliste von Deutschland (http://www.kp-buttler.de/florenliste/). Hierarchical classification of the taxonomic names and class relationships focus on rank and status (validity vs. synonymy). All classes are additionally annotated with details on scientific name, taxonomic authorship, and source. Taxonomic names for birds are mainly compiled from ITIS and the International Ornithological Congress (IOC) World Bird List, for moths and butterflies mainly from GBIF, both lists being classified and annotated accordingly. We intend to cross-link our taxonomy ontologies with the Environment Ontology (ENVO) and anatomy ontologies such as the Flora Phenotype Ontology (FLOPO). For moths and butterflies we started to design the Lepidoptera Anatomy Ontology (LepAO) on the basis of the already available Hymenoptera Anatomy Ontology (HAO). 
LepAO is planned to be interoperable with other ontologies in the framework of the OBO Foundry. A main modification of HAO is the inclusion of German anatomical terms from published glossaries, which we add as scientific and vernacular synonyms in order to reuse the already available identifiers (URIs) of the corresponding English terms. International collaboration with the founders of HAO and with teams focusing on other insect orders such as beetles (ColAO) aims at the development of a unified Insect Anatomy Ontology. By restricting itself to terms applicable to all insects, the unified Insect Anatomy Ontology is intended to establish a basis for accelerating the design of more specific anatomy ontologies for any particular insect order. The advancement of such ontologies aligns with current needs to make the knowledge accumulated in descriptive studies on the systematics of organisms accessible to other domains. In the context of BIOfid, our ontologies provide exemplars of how semantic queries of yet untapped data relevant to biodiversity studies can be achieved for literature in non-English languages. Furthermore, BIOfid will serve as an open access platform for professional international journals, facilitating non-commercial publishing of biodiversity and biodiversity-related data.
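The synonym strategy described above can be sketched in miniature: the existing identifier of an English term is kept, and German vernacular names are attached as synonyms. Real HAO/LepAO curation uses OBO tooling, not plain dictionaries, and the identifier and labels below are placeholders:

```python
# A term record keyed by the (placeholder) identifier of the English term.
term = {
    "id": "HAO:0000000",   # placeholder, not a real HAO identifier
    "label": "antenna",
    "synonyms": [],
}

def add_vernacular(term, name, lang="de", scope="related"):
    """Attach a vernacular name as a synonym, keeping the original URI."""
    term["synonyms"].append({"name": name, "lang": lang, "scope": scope})
    return term

add_vernacular(term, "Fühler")
add_vernacular(term, "Antenne")
print([s["name"] for s in term["synonyms"]])
```

The design point is that German text mined by BIOfid resolves to the same identifiers as English text, so annotations in both languages remain cross-linkable.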


Author(s):  
Adrian Pachzelt ◽  
Gerwin Kasperek ◽  
Andy Lücking ◽  
Giuseppe Abrami ◽  
Christine Driller

Nowadays, obtaining information by entering queries into a web search engine is routine behaviour. With its search portal, the Specialised Information Service Biodiversity Research (BIOfid) adapts the exploration of legacy biodiversity literature and data extraction to current standards (Driller et al. 2020). In this presentation, we introduce the BIOfid search portal and its functionalities in a short how-to guide. To this end, we adapted a knowledge graph representation of our thematic focus: Central European, primarily German-language, biodiversity literature of the 19th and 20th centuries. Users can now search our text-mined corpus, which to date contains more than 8,700 full-text articles from 68 journals, with a particular focus on birds, lepidopterans and vascular plants. The texts are automatically preprocessed by the natural language processing provider TextImager (Hemati et al. 2016) and will be linked to various databases such as Wikidata, Wikipedia, the Global Biodiversity Information Facility (GBIF), Encyclopedia of Life (EoL), GeoNames, the Integrated Authority File (GND) and WordNet. For data retrieval, users can filter search results and download the article metadata as well as text annotations and database links in JavaScript Object Notation (JSON) format. For example, literature that mentions taxa from certain decades, or co-occurrences of species, can be searched. Our search engine recognises scientific and vernacular taxon names based on the GBIF Backbone Taxonomy and offers search suggestions to support the user. The semantic network of the BIOfid search portal is also enriched with data from the EoL trait bank, so that trait data can be included in search queries. Thus, scientists can enhance their own data sets with the search results and feed them into the relevant biodiversity data repositories, sustainably expanding the corresponding knowledge graphs with reliable data. 
Since BIOfid applies standard ontology terms, all data mobilized from literature can be combined with data on natural history collection objects or data from current research projects in order to generate more comprehensive knowledge. Furthermore, taxonomy, ecology and trait ontologies that have been built or extended within this project will be made available through appropriate platforms such as The Open Biological and Biomedical Ontology (OBO) Foundry and the Terminology Service of The German Federation for Biological Data (GFBio).
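Downloaded annotations in JSON format can be post-processed with a few lines of code. The record layout, field names, and identifier values below are hypothetical illustrations; the actual BIOfid download schema may differ:

```python
import json

# Hypothetical downloaded record; field names and IDs are illustrative only.
record = json.loads("""
{
  "article": {"journal": "...", "year": 1907},
  "annotations": [
    {"text": "Parus major", "type": "TAXON", "gbif_id": 2487880},
    {"text": "Frankfurt", "type": "LOCATION", "geonames_id": 2925533}
  ]
}
""")

# Filter the annotation list down to taxon mentions for further analysis.
taxa = [a for a in record["annotations"] if a["type"] == "TAXON"]
print([t["text"] for t in taxa])  # → ['Parus major']
```

Linked identifiers of this kind are what allow search results to be fed back into repositories such as GBIF without re-resolving taxon names by string matching.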


Author(s):  
Shishir K. Shandilya ◽  
Suresh Jain

The explosive increase in Internet usage has attracted technologies for automatically mining user-generated content (UGC) from Web documents. These UGC-rich resources have raised new opportunities and challenges for carrying out opinion extraction and mining tasks to produce opinion summaries. Opinion extraction technology allows users to retrieve and analyze people’s opinions scattered over Web documents. Opinion mining is concerned with the opinions consumers express about products; it aims at understanding, extracting, and classifying opinions scattered in the unstructured text of online resources. Search engines perform well when one wants to learn about a product before purchase, but filtering and analyzing the search results is often complex and time-consuming. This has generated the need for intelligent technologies that can process these unstructured online text documents through automatic classification, concept recognition, text summarization, etc. These tools are based on traditional natural language techniques, statistical analysis, and machine learning techniques. Automatic knowledge extraction over large text collections such as the Internet has been a challenging task due to constraints such as the need for large annotated training data sets, the requirement of extensive manual processing of data, and the huge number of domain-specific terms. Ambient Intelligence (AmI) in web-enabled technologies supports and promotes intelligent e-commerce services, enabling the provision of personalized, self-configurable, and intuitive applications that exploit UGC knowledge to build buying confidence. In this chapter, we discuss various approaches to opinion mining that combine Ambient Intelligence, natural language processing, and machine learning methods based on textual and grammatical clues.


Healthcare ◽  
2021 ◽  
Vol 9 (2) ◽  
pp. 156
Author(s):  
Abdullah Bin Shams ◽  
Ehsanul Hoque Apu ◽  
Ashiqur Rahman ◽  
Md. Mohsin Sarker Raihan ◽  
Nazeeba Siddika ◽  
...  

Misinformation from untrusted sources about coronavirus disease 2019 (COVID-19) drugs, vaccination, or treatment has had dramatic consequences for public health. Authorities have deployed several surveillance tools to detect and slow down the rapid spread of misinformation online. Large quantities of unverified information are available online, and at present there is no real-time tool to alert a user to false information during online health inquiries over a web search engine. To bridge this gap, we propose a web search engine misinformation notifier extension (SEMiNExt). Natural language processing (NLP) and a machine learning algorithm have been successfully integrated into the extension. This enables SEMiNExt to read the user query from the search bar, classify the veracity of the query and notify the user of its authenticity, all in real time, to prevent the spread of misinformation. Our results show that SEMiNExt performs best with an artificial neural network (ANN), achieving an accuracy of 93%, an F1-score of 92%, a precision of 92% and a recall of 93% when 80% of the data is used for training. Moreover, the ANN is able to predict with very high accuracy even for a small training data size. This is very important for early detection of new misinformation from the small data samples available online, which can significantly reduce the spread of misinformation and maximize public health safety. The SEMiNExt approach introduces the possibility of improving online health management systems by showing misinformation notifications in real time, enabling safer web-based searching on health-related issues.
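The core classification step — mapping a raw search-bar query to a veracity label — can be sketched with a bag-of-words neural classifier. The eight queries, their labels, and the model configuration below are toy inventions, not SEMiNExt's training data or architecture:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

# Toy labelled health queries (1 = likely misinformation), invented here.
queries = [
    "does drinking bleach cure covid",
    "can garlic prevent coronavirus infection",
    "5g towers spread covid virus",
    "covid vaccine microchip tracking",
    "covid 19 vaccination schedule for adults",
    "symptoms of coronavirus infection",
    "how effective are covid vaccines",
    "where to get a covid test nearby",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# TF-IDF features feeding a small neural network, trained end to end.
model = make_pipeline(
    TfidfVectorizer(),
    MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
)
model.fit(queries, labels)
print(model.predict(["is bleach a cure for covid"]))
```

A browser extension would run this classification on each query in real time and surface a notification whenever the predicted label crosses a confidence threshold.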


2020 ◽  
Author(s):  
Mayla R Boguslav ◽  
Negacy D Hailu ◽  
Michael Bada ◽  
William A Baumgartner ◽  
Lawrence E Hunter

Abstract Background Automated assignment of specific ontology concepts to mentions in text is a critical task in biomedical natural language processing, and the subject of many open shared tasks. Although the current state of the art involves the use of neural network language models as a post-processing step, the very large number of ontology classes to be recognized and the limited amount of gold-standard training data have impeded the creation of end-to-end systems based entirely on machine learning. Recently, Hailu et al. recast the concept recognition problem as a type of machine translation and demonstrated that sequence-to-sequence machine learning models had the potential to outperform multi-class classification approaches. Here we systematically characterize the factors that contribute to the accuracy and efficiency of several approaches to sequence-to-sequence machine learning. Results We report on our extensive studies of alternative methods and hyperparameter selections. The results not only identify the best-performing systems and parameters across a wide variety of ontologies but also illuminate the widely varying resource requirements and hyperparameter robustness of alternative approaches. Analysis of the strengths and weaknesses of such systems suggests promising avenues for future improvement as well as design choices that can increase computational efficiency at small costs in performance. Bidirectional Encoder Representations from Transformers for Biomedical Text Mining (BioBERT) for span detection (as previously found), along with the Open-source Toolkit for Neural Machine Translation (OpenNMT) for concept normalization, achieves state-of-the-art performance for most ontologies in the CRAFT Corpus. 
This approach uses substantially fewer computational resources, including hardware, memory, and time, than several alternative approaches. Conclusions Machine translation is a promising avenue for fully machine-learning-based concept recognition that achieves state-of-the-art results on the CRAFT Corpus, evaluated via a direct comparison to previous results from the 2019 CRAFT Shared Task. Experiments illuminating the reasons for the surprisingly good performance of sequence-to-sequence methods targeting ontology identifiers suggest that further progress may be possible by mapping to alternative target concept representations. All code and models can be found at: https://github.com/UCDenver-ccp/Concept-Recognition-as-Translation.
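The two-stage shape of the pipeline — span detection followed by "translation" of each mention into an ontology identifier — can be caricatured as below. Both stages are deliberately trivial lookups here; the paper uses BioBERT for the first stage and OpenNMT for the second, and the lexicon is a two-entry toy:

```python
# Toy lexicon standing in for learned models (GO IDs for nucleus and
# mitochondrion; real systems handle thousands of ontology classes).
LEXICON = {"cell nucleus": "GO:0005634", "mitochondrion": "GO:0005739"}

def detect_spans(text):
    """Stand-in for BioBERT span detection: substring match over the lexicon."""
    spans = []
    for mention in LEXICON:
        start = text.find(mention)
        if start != -1:
            spans.append((start, start + len(mention), mention))
    return sorted(spans)

def normalize(mention):
    """Stand-in for seq2seq normalization: 'translate' mention to an ID."""
    return LEXICON[mention]

text = "Proteins are imported into the cell nucleus and the mitochondrion."
concepts = [(m, normalize(m)) for _, _, m in detect_spans(text)]
print(concepts)
```

The "translation" framing matters because the second stage emits the identifier as an output sequence rather than choosing among a fixed set of classes, which is what lets it scale to very large ontologies.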


Terminology ◽  
2021 ◽  
Author(s):  
Ayla Rigouts Terryn ◽  
Véronique Hoste ◽  
Els Lefever

Abstract Automatic term extraction (ATE) is an important task within natural language processing, both in its own right and as a preprocessing step for other tasks. In recent years, research has moved far beyond the traditional hybrid approach where candidate terms are extracted based on part-of-speech patterns and filtered and sorted with statistical termhood and unithood measures. While there has been an explosion of different types of features and algorithms, including machine learning methodologies, some of the fundamental problems remain unsolved, such as the ambiguous nature of the concept “term”. This has been a hurdle in the creation of data for ATE, meaning that datasets for both training and testing are scarce, and system evaluations are often limited and rarely cover multiple languages and domains. The ACTER Annotated Corpora for Term Extraction Research contain manual term annotations in four domains and three languages and have been used to investigate a supervised machine learning approach for ATE, using a binary random forest classifier with multiple types of features. The resulting system, HAMLET (Hybrid Adaptable Machine Learning approach to Extract Terminology), provides detailed insights into its strengths and weaknesses. It highlights a certain unpredictability as an important drawback of machine learning methodologies, but also shows how the system appears to have learnt a robust definition of terms, producing results that are state-of-the-art and contain few errors that are not (part of) terms in any way. Both the amount and the relevance of the training data have a substantial effect on results, and by varying the training data it appears possible to adapt the system to various desired outputs, e.g. different types of terms. While certain issues remain difficult – such as the extraction of rare terms and multiword terms – this study shows that supervised machine learning is a promising methodology for ATE.
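One classic termhood measure of the kind such classifiers consume as a feature is the ratio of a candidate's relative frequency in a domain corpus to its relative frequency in a general reference corpus. A minimal sketch with two invented ten-word "corpora":

```python
from collections import Counter

# Toy corpora: a domain-specific text vs. a general reference text.
domain = "the heart valve regulates blood flow through the heart".split()
general = "the door opens and the cat walks through the door".split()

dom_freq, gen_freq = Counter(domain), Counter(general)

def termhood(word, smoothing=1e-6):
    """Domain-vs-reference relative frequency ratio for one candidate."""
    d = dom_freq[word] / len(domain)
    g = gen_freq[word] / len(general) + smoothing  # avoid division by zero
    return d / g

# Domain words score far above function words like "the".
for w in ["heart", "valve", "the"]:
    print(w, round(termhood(w), 2))
```

In a system like HAMLET, scores of this kind are only one column in the feature matrix, combined with part-of-speech patterns, unithood measures and other cues before the random forest decides term vs. non-term.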


2021 ◽  
Vol 15 ◽  
Author(s):  
Nora Hollenstein ◽  
Cedric Renggli ◽  
Benjamin Glaus ◽  
Maria Barrett ◽  
Marius Troendle ◽  
...  

Until recently, human behavioral data from reading has mainly been of interest to researchers seeking to understand human cognition. However, these human language processing signals can also be beneficial in machine learning-based natural language processing tasks. Using EEG brain activity for this purpose remains largely unexplored. In this paper, we present the first large-scale study to systematically analyze the potential of EEG brain activity data for improving natural language processing tasks, with a special focus on which features of the signal are most beneficial. We present a multi-modal machine learning architecture that learns jointly from textual input as well as from EEG features. We find that filtering the EEG signals into frequency bands is more beneficial than using the broadband signal. Moreover, for a range of word embedding types, EEG data improves binary and ternary sentiment classification and outperforms multiple baselines. For more complex tasks such as relation detection, only the contextualized BERT embeddings outperform the baselines in our experiments, which raises the need for further research. Finally, EEG data prove particularly promising when limited training data is available.
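The band-filtering idea can be sketched as extracting per-band power features from the EEG signal and concatenating them with a word embedding as joint input to a multi-modal model. The synthetic one-channel signal, the band definitions, and the 50-dimensional stand-in embedding below are illustrative, not the paper's data or architecture:

```python
import numpy as np

rng = np.random.default_rng(1)
fs = 250                     # sampling rate in Hz (assumed for the example)
t = np.arange(0, 2, 1 / fs)  # two seconds of signal
# Synthetic one-channel EEG: a 10 Hz alpha and a 6 Hz theta component plus noise.
eeg = (np.sin(2 * np.pi * 10 * t)
       + 0.5 * np.sin(2 * np.pi * 6 * t)
       + 0.1 * rng.normal(size=t.size))

def band_power(signal, fs, lo, hi):
    """Total spectral power of `signal` between `lo` and `hi` Hz."""
    freqs = np.fft.rfftfreq(signal.size, 1 / fs)
    power = np.abs(np.fft.rfft(signal)) ** 2
    return power[(freqs >= lo) & (freqs < hi)].sum()

bands = {"theta": (4, 8), "alpha": (8, 13), "beta": (13, 30)}
eeg_features = np.array([band_power(eeg, fs, lo, hi) for lo, hi in bands.values()])

# Joint input for a multi-modal model: word embedding + EEG band features.
word_embedding = rng.normal(size=50)   # stand-in for a pretrained embedding
joint = np.concatenate([word_embedding, eeg_features])
print(joint.shape)
```

Filtering into bands lets the model weight the frequency ranges that actually carry task-relevant information, which is the paper's explanation for why band features beat the broadband signal.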

