Wanted: Standards for FAIR taxonomic concept representations and relationships

Biodiversity Information Science and Standards ◽

10.3897/biss.5.75587 ◽

2021 ◽

Vol 5 ◽

Author(s):

Beckett Sterner ◽

Nathan Upham ◽

Prashant Gupta ◽

Caleb Powell ◽

Nico Franz

Keyword(s):

North American ◽

Multiple Sources ◽

Biodiversity Data ◽

Taxonomic Concept ◽

Reference Sequences ◽

Global Biodiversity ◽

Concept Representations ◽

Biodiversity Information ◽

Diagnostic Traits ◽

Occurrence Records

Making the most of biodiversity data requires linking observations of biological species from multiple sources both efficiently and accurately (Bisby 2000, Franz et al. 2016). Aggregating occurrence records using taxonomic names and synonyms is computationally efficient but known to experience significant limitations on accuracy when the assumption of one-to-one relationships between names and biological entities breaks down (Remsen 2016, Franz and Sterner 2018). Taxonomic treatments and checklists provide authoritative information about the correct usage of names for species, including operational representations of the meanings of those names in the form of range maps, reference genetic sequences, or diagnostic traits. They increasingly provide taxonomic intelligence in the form of precise description of the semantic relationships between different published names in the literature. Making this authoritative information Findable, Accessible, Interoperable, and Reusable (FAIR; Wilkinson et al. 2016) would be a transformative advance for biodiversity data sharing and help drive adoption and novel extensions of existing standards such as the Taxonomic Concept Schema and the OpenBiodiv Ontology (Kennedy et al. 2006, Senderov et al. 2018). We call for the greater, global Biodiversity Information Standards (TDWG) and taxonomy community to commit to extending and expanding on how FAIR applies to biodiversity data and include practical targets and criteria for the publication and digitization of taxonomic concept representations and alignments in taxonomic treatments, checklists, and backbones. As a motivating case, consider the abundantly sampled North American deer mouse—Peromyscus maniculatus (Wagner 1845)—which was recently split from one continental species into five more narrowly defined forms, so that the name P. maniculatus is now only applied east of the Mississippi River (Bradley et al. 2019, Greenbaum et al. 2019). 
That single change instantly rendered ambiguous ~7% of North American mammal records in the Global Biodiversity Information Facility (n=242,663, downloaded 2021-06-04; GBIF.org 2021) and ⅓ of all National Ecological Observatory Network (NEON) small mammal samples (n=10,256, downloaded 2021-06-27). While this type of ambiguity is common in name-based databases when species are split, the example of P. maniculatus is particularly striking for its impact upon biological questions ranging from hantavirus surveillance in North America to studies of climate change impacts upon rodent life-history traits. Of special relevance to NEON sampling is recent evidence suggesting deer mice potentially transmit SARS-CoV-2 (Griffin et al. 2021). Automating the updating of occurrence records in such cases and others will require operational representations of taxonomic concepts (e.g., range maps, reference sequences, and diagnostic traits) that are FAIR in addition to taxonomic concept alignment information (Franz and Peet 2009). Despite steady progress, it remains difficult to find, access, and reuse authoritative information about how to apply taxonomic names even when it is already digitized. It can also be difficult to tell without manual inspection whether similar types of concept representations derived from multiple sources, such as range maps or reference sequences selected from different research articles or checklists, are in fact interoperable for a particular application. The issue is therefore different from important ongoing efforts to digitize trait information in species circumscriptions, for example, and focuses on how already digitized knowledge can best be packaged to inform human experts and artificial intelligence applications (Sterner and Franz 2017). 
We therefore propose developing community guidelines and criteria for FAIR taxonomic concept representations as "semantic artefacts" of general relevance to linked open data and life sciences research (Le Franc et al. 2020).
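The kind of ambiguity described in this abstract can be illustrated with a minimal sketch. It assumes a hypothetical alignment table that maps a name, as used before a split, to the concepts it may now denote; the table, the concept labels, and the `classify` helper are illustrative assumptions and do not come from the abstract or from any existing standard.

```python
from collections import Counter

# Hypothetical alignment table: name as used before a split -> candidate concepts.
# The concept labels are placeholders, not published taxonomic concepts.
ALIGNMENTS = {
    "Peromyscus maniculatus": ["form_1", "form_2", "form_3", "form_4", "form_5"],
    "Unsplit name": ["form_6"],
}


def classify(records, alignments):
    """Label each record 'unambiguous', 'ambiguous', or 'unaligned'."""
    labelled = []
    for rec in records:
        candidates = alignments.get(rec["scientificName"], [])
        if len(candidates) == 1:
            status = "unambiguous"
        elif len(candidates) > 1:
            status = "ambiguous"  # the name alone cannot pick one concept
        else:
            status = "unaligned"  # no alignment information available
        labelled.append({**rec, "status": status, "candidates": candidates})
    return labelled


# Toy occurrence records keyed by the Darwin Core term scientificName.
records = [
    {"id": "r1", "scientificName": "Peromyscus maniculatus"},
    {"id": "r2", "scientificName": "Peromyscus maniculatus"},
    {"id": "r3", "scientificName": "Unsplit name"},
    {"id": "r4", "scientificName": "Unknown name"},
]

labelled = classify(records, ALIGNMENTS)
summary = Counter(rec["status"] for rec in labelled)
print(dict(summary))
```

Resolving the records flagged as ambiguous would need the additional concept representations the abstract calls for, such as range maps or reference sequences; the sketch only shows where name-based matching stops being sufficient.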


The Living Atlases community in action: the GBIF Benin data portal

Biodiversity Information Science and Standards ◽

10.3897/biss.2.25488 ◽

2018 ◽

Vol 2 ◽

pp. e25488

Author(s):

Anne-Sophie Archambeau ◽

Fabien Cavière ◽

Kourouma Koura ◽

Marie-Elise Lecoq ◽

Sophie Pamerlon ◽

...

Keyword(s):

African Country ◽

Biodiversity Data ◽

Global Biodiversity Information Facility ◽

Capacity Enhancement ◽

Support Programme ◽

Data Portal ◽

Global Biodiversity ◽

The University ◽

Biodiversity Information ◽

Occurrence Records

Atlas of Living Australia (ALA) (https://www.ala.org.au/) is the Global Biodiversity Information Facility (GBIF) node of Australia. They developed an open and free platform for sharing and exploring biodiversity data. All the modules are publicly available for reuse and customization on their GitHub account (https://github.com/AtlasOfLivingAustralia). GBIF Benin, hosted at the University of Abomey-Calavi, has published more than 338 000 occurrence records from 87 datasets and 2 checklists. Through the GBIF Capacity Enhancement Support Programme (https://www.gbif.org/programme/82219/capacity-enhancement-support-programme), GBIF Benin, with the help of GBIF France, is in the process of deploying the Beninese data portal using the GBIF France back-end architecture. Benin is the first African country to implement this module of the ALA infrastructure. In this presentation, we will give an overview of the registry and the occurrence search engine using the Beninese data portal. We will begin with the administration interface and how to manage metadata, then continue with the user interface of the registry and how to find Beninese occurrences through the hub.


An audit of some processing effects in aggregated occurrence records

ZooKeys ◽

10.3897/zookeys.751.24791 ◽

2018 ◽

Vol 751 ◽

pp. 129-146 ◽

Cited By ~ 7

Author(s):

Robert Mesibov

Keyword(s):

Data Loss ◽

Global Biodiversity Information Facility ◽

Australian Museum ◽

Darwin Core ◽

Species Groups ◽

Processing Effects ◽

Global Biodiversity ◽

Name Changes ◽

Biodiversity Information ◽

Occurrence Records

A total of ca 800,000 occurrence records from the Australian Museum (AM), Museums Victoria (MV) and the New Zealand Arthropod Collection (NZAC) were audited for changes in selected Darwin Core fields after processing by the Atlas of Living Australia (ALA; for AM and MV records) and the Global Biodiversity Information Facility (GBIF; for AM, MV and NZAC records). Formal taxon names in the genus- and species-groups were changed in 13–21% of AM and MV records, depending on dataset and aggregator. There was little agreement between the two aggregators on processed names, with names changed in two to three times as many records by one aggregator alone compared to records with names changed by both aggregators. The type status of specimen records did not change with name changes, resulting in confusion as to the name with which a type was associated. Data losses of up to 100% were found after processing in some fields, apparently due to programming errors. The taxonomic usefulness of occurrence records could be improved if aggregators included both original and the processed taxonomic data items for each record. It is recommended that end-users check original and processed records for data loss and name replacements after processing by aggregators.
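The check recommended to end-users can be sketched as a field-by-field comparison of paired records. This is a minimal illustration, not the method used in the audit: the record layout, the `audit` helper, and the sample values are assumptions, and only the field names (`scientificName`, `typeStatus`, `recordedBy`) are Darwin Core terms.

```python
def audit(original, processed, fields):
    """Count, per field, values that changed or were lost after processing.

    `original` and `processed` map a record identifier to a dict of field values.
    """
    report = {field: {"changed": 0, "lost": 0} for field in fields}
    for rec_id, before in original.items():
        after = processed.get(rec_id, {})
        for field in fields:
            old = (before.get(field) or "").strip()
            new = (after.get(field) or "").strip()
            if old and not new:
                report[field]["lost"] += 1      # value present before, empty after
            elif old != new:
                report[field]["changed"] += 1   # value replaced (e.g. a name change)
    return report


# Made-up sample records for illustration only.
original = {
    "r1": {"scientificName": "Aus bus", "typeStatus": "holotype", "recordedBy": "A. Collector"},
    "r2": {"scientificName": "Cus dus", "typeStatus": "", "recordedBy": "B. Collector"},
}
processed = {
    "r1": {"scientificName": "Aus cus", "typeStatus": "holotype", "recordedBy": "A. Collector"},
    "r2": {"scientificName": "Cus dus", "typeStatus": "", "recordedBy": ""},
}

report = audit(original, processed, ["scientificName", "typeStatus", "recordedBy"])
for field, counts in report.items():
    print(field, counts)
```

In this toy case record r1 keeps its type status while its name changes, which is the situation the abstract describes as causing confusion about the name a type is associated with.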


Data integration enables global biodiversity synthesis

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.2018093118 ◽

2021 ◽

Vol 118 (6) ◽

pp. e2018093118

Author(s):

J. Mason Heberling ◽

Joseph T. Miller ◽

Daniel Noesgaard ◽

Scott B. Weingart ◽

Dmitry Schigel

Keyword(s):

Data Integration ◽

Species Interactions ◽

Large Scale ◽

Data Use ◽

Biodiversity Data ◽

Global Biodiversity Information Facility ◽

Research Areas ◽

Global Biodiversity ◽

Biodiversity Information ◽

Global Data

The accessibility of global biodiversity information has surged in the past two decades, notably through widespread funding initiatives for museum specimen digitization and emergence of large-scale public participation in community science. Effective use of these data requires the integration of disconnected datasets, but the scientific impacts of consolidated biodiversity data networks have not yet been quantified. To determine whether data integration enables novel research, we carried out a quantitative text analysis and bibliographic synthesis of >4,000 studies published from 2003 to 2019 that use data mediated by the world’s largest biodiversity data network, the Global Biodiversity Information Facility (GBIF). Data available through GBIF increased 12-fold since 2007, a trend matched by global data use with roughly two publications using GBIF-mediated data per day in 2019. Data-use patterns were diverse by authorship, geographic extent, taxonomic group, and dataset type. Despite facilitating global authorship, legacies of colonial science remain. Studies involving species distribution modeling were most prevalent (31% of literature surveyed) but recently shifted in focus from theory to application. Topic prevalence was stable across the 17-y period for some research areas (e.g., macroecology), yet other topics proportionately declined (e.g., taxonomy) or increased (e.g., species interactions, disease). Although centered on biological subfields, GBIF-enabled research extends surprisingly across all major scientific disciplines. Biodiversity data mobilization through global data aggregation has enabled basic and applied research use at temporal, spatial, and taxonomic scales otherwise not possible, launching biodiversity sciences into a new era.


BIDDSAT: visualizing the content of biodiversity data publishers in the Global Biodiversity Information Facility network

Bioinformatics ◽

10.1093/bioinformatics/bts359 ◽

2012 ◽

Vol 28 (16) ◽

pp. 2207-2208 ◽

Cited By ~ 6

Author(s):

J. Otegui ◽

A. H. Arino

Keyword(s):

Biodiversity Data ◽

Global Biodiversity Information Facility ◽

Global Biodiversity ◽

Biodiversity Information


Unlocking the Entomological Collection of the Natural History Museum of Maputo, Mozambique

Biodiversity Data Journal ◽

10.3897/bdj.9.e64461 ◽

2021 ◽

Vol 9 ◽

Author(s):

Domingos Sandramo ◽

Enrico Nicosia ◽

Silvio Cianciullo ◽

Bernardo Muatinte ◽

Almeida Guissamulo

Keyword(s):

Natural History ◽

Crucial Role ◽

Development Programme ◽

Natural History Museum ◽

Biodiversity Data ◽

Global Biodiversity Information Facility ◽

History Museum ◽

Data Portal ◽

Global Biodiversity ◽

Biodiversity Information

The collections of the Natural History Museum of Maputo have a crucial role in the safeguarding of Mozambique's biodiversity, representing an important repository of data and materials regarding the natural heritage of the country. In this paper, a dataset is described, based on the Museum’s Entomological Collection recording 409 species belonging to seven orders and 48 families. Each specimen’s available data, such as geographical coordinates and taxonomic information, have been digitised to build the dataset. The specimens included in the dataset were obtained between 1914 and 2018 by collectors and researchers from the Natural History Museum of Maputo (once known as “Museu Álvaro de Castro”) in all the country’s provinces, with the exception of Cabo Delgado Province. This paper adds data to the Biodiversity Network of Mozambique and the Global Biodiversity Information Facility, within the objectives of the SECOSUD II Project and the Biodiversity Information for Development Programme. The aforementioned insect dataset is available on the GBIF Engine data portal (https://doi.org/10.15468/j8ikhb). Data were also shared on the Mozambican national portal of biodiversity data BioNoMo (https://bionomo.openscidata.org), developed by SECOSUD II Project.


Going Molecular: Sequence-based spatiotemporal biodiversity evidence in GBIF

Biodiversity Information Science and Standards ◽

10.3897/biss.3.37036 ◽

2019 ◽

Vol 3 ◽

Author(s):

Dmitry Schigel ◽

Thomas Jeppesen ◽

Robert Finn ◽

Guy Cochrane ◽

Urmas Kõljalg ◽

...

Keyword(s):

Dna Sequences ◽

Data Streams ◽

Large Scale ◽

Sequence Data ◽

Genetic Material ◽

Molecular Data ◽

Molecular Sequence ◽

Global Biodiversity ◽

Biodiversity Information ◽

Occurrence Records

The Global Biodiversity Information Facility (GBIF) was established by governments in 2001, largely through the initiative and leadership of the natural history collections community, following the 1999 recommendation by a working group under the Megascience Forum (predecessor of the Global Science Forum) of the Organization for Economic Cooperation and Development (OECD). Over 20 years, GBIF has helped develop standards and convened a global community of data-publishing institutions, aggregating over one billion specimen occurrence records freely and openly available for use in research and policy making. These GBIF-mediated data range from vouchered museum specimens to observation records generated by humans and machines. New data are being generated from integrated remote sensing, ecological sampling, and molecular sequencing that have strong geospatial components but lack traditional vouchers. GBIF is working with partners to develop best practices of bringing these data into the GBIF architecture. Following discussions during the second Global Biodiversity Information Conference in 2018, GBIF and the European Bioinformatics Institute (EMBL-EBI), supported by ELIXIR, have extended collaboration to share species occurrence records known only from their genetic material. When these data providers contribute data coordinates along with the sequences to the European Nucleotide Archive (ENA), the records will appear on GBIF maps and in spatial searches. This collaboration enables significant new molecular data streams to become discoverable through GBIF.org: by mid-March 2019, over 7.8m individual occurrence records via the ENA, and over 13.2m records as standardized Darwin Core sampling-event datasets via MGnify, a resource that provides taxonomic and functional annotations on sequences derived from environmental sequencing projects. 
Sequence-based occurrence records published by ENA and MGnify boost representation of microbial diversity, which was underrepresented at GBIF. The ELIXIR-ENA-MGnify-GBIF partnership is working on further refinement of the dynamic data linkages, frequency of updates and other improvements. The API-based tool that connects GBIF data infrastructures is open to new data contributors and for indexes of molecular occurrences. Indexing of these data streams is dependent on the presence of a name (any rank) with the sequence. Under the current Codes of nomenclature, animals, fungi, plants, and algae cannot be described based exclusively on sequence data. Yet, a significant volume of biodiversity data has only been represented by DNA sequences. Barcoding and sequence clustering procedures vary among taxa and research communities, but clusters can be related to a taxon with a Latin name. Many DNA similarity clusters do not contain a sequence from a formally described taxon; however, these sequence clusters provide provisional molecular names for nomenclatural communication. In the best cases, curated libraries of reference sequences, their metadata, clusters, alignments, and links to individuals and physical material become de facto naming conventions for certain taxonomic groups, and co-exist with Latin names. Integration of molecular names into the taxonomic backbone of GBIF started with Fungi and UNITE, a data management and identification environment for fungal ITS barcodes with 87,000+ fungal species hypotheses demarcating 800,000+ sequence specimens as of March 2019. Checklist publication of all names in UNITE through GBIF.org, including Linnaean names and stable, DOI-trackable molecular sequence based ‘species hypotheses’, enables indexing of fungal metabarcoding data worldwide, such as BIOWIDE. 
As names are currently essential to indexing the world’s occurrence data, GBIF will develop similar linkages with names in the Barcode of Life data system (BOLD) and in SILVA - a resource for high-quality ribosomal RNA sequence data and taxonomy, and welcomes other reference systems to this development. Expanding the molecular data streams (Fig. 1) allows GBIF to address spatial, temporal and taxonomic gaps and biases, and to support large-scale data-intensive research openly and worldwide.


Harvestmen occurrence database (Arachnida, Opiliones) of the Museu Paraense Emílio Goeldi, Brazil

Biodiversity Data Journal ◽

10.3897/bdj.7.e47456 ◽

2019 ◽

Vol 7 ◽

Author(s):

Valéria da Silva ◽

Manoel Aguiar-Neto ◽

Dan Teixeira ◽

Cleverson Santos ◽

Marcos de Sousa ◽

...

Keyword(s):

Public Consultation ◽

Biodiversity Data ◽

Global Biodiversity Information Facility ◽

The Third ◽

The World ◽

Northern Brazil ◽

Global Biodiversity ◽

The Government ◽

Biodiversity Information ◽

Brazilian Biodiversity

We present a dataset with information from the Opiliones collection of the Museu Paraense Emílio Goeldi, Northern Brazil. This collection currently has 6,400 specimens distributed in 13 families, 30 genera and 32 species and holotypes of four species: Imeri ajuba Coronato-Ribeiro, Pinto-da-Rocha & Rheims, 2013, Phareicranaus patauateua Pinto-da-Rocha & Bonaldo, 2011, Protimesius trocaraincola Pinto-da-Rocha, 1997 and Sickesia tremembe Pinto-da-Rocha & Carvalho, 2009. The material of the collection is exclusively from Brazil, mostly from the Amazon Region. The dataset is now available for public consultation on the Sistema de Informação sobre a Biodiversidade Brasileira (SiBBr) (https://ipt.sibbr.gov.br/goeldi/resource?r=museuparaenseemiliogoeldi-collection-aracnologiaopiliones). SiBBr is the Brazilian Biodiversity Information System, an initiative of the government and the Brazilian node of the Global Biodiversity Information Facility (GBIF), which aims to consolidate and make primary biodiversity data available on a platform (Dias et al. 2017). Harvestmen or Opiliones constitute the third largest arachnid order, with approximately 6,500 described species. Brazil is the holder of the greatest diversity in the world, with more than 1,000 described species, 95% (960 species) of which are endemic to the country. Of these, 32 species were identified and deposited in the collection of the Museu Paraense Emílio Goeldi.


The Living Atlases community in action: the NBN Atlas Spatial Portal and “Explore Your Region” module

Biodiversity Information Science and Standards ◽

10.3897/biss.2.25486 ◽

2018 ◽

Vol 2 ◽

pp. e25486

Author(s):

Nick dos Remedios ◽

Marie-Elise Lecoq ◽

David Martin ◽

Sophia Ratcliffe

Keyword(s):

Search Engine ◽

Biodiversity Data ◽

Global Biodiversity Information Facility ◽

Isle Of Man ◽

Global Biodiversity ◽

The Uk ◽

Biodiversity Information

Atlas of Living Australia (ALA) (https://www.ala.org.au/) is the Global Biodiversity Information Facility (GBIF) node of Australia. Since 2010, they have developed and improved a platform for sharing and exploring biodiversity information. All the modules are publicly available for reuse and customization on their GitHub account (https://github.com/AtlasOfLivingAustralia). The National Biodiversity Network, a registered charity, is the UK GBIF node and has been sharing biodiversity data since 2000. They have published more than 79 million occurrences from 818 datasets. In 2016, they launched the NBN Atlas Scotland (https://scotland.nbnatlas.org/) based on the Atlas of Living Australia infrastructure. Since then, they have released the NBN Atlas (https://nbnatlas.org/) and the NBN Atlas Wales (https://wales.nbnatlas.org/), and will soon release the NBN Atlas Isle of Man. In addition to the occurrence/species search engine and the metadata registry, they put in place several tools that help users to work with data published in the network: the spatial portal and "explore your region" module. Both elements are based on Atlas of Living Australia developments. Because the Atlas of Living Australia platform is powerful and reusable, we want to show you these two applications for geographical analysis. To do so, we will present the specifics of each component, with examples of some of their functionality.


BiGe-Onto: An ontology-based system for managing biodiversity and biogeography data

Applied Ontology ◽

10.3233/ao-200228 ◽

2020 ◽

Vol 15 (4) ◽

pp. 411-437 ◽

Cited By ~ 3

Author(s):

Marcos Zárate ◽

Germán Braun ◽

Pablo Fillottrani ◽

Claudio Delrieux ◽

Mirtha Lewis

Keyword(s):

Data Sources ◽

Biodiversity Data ◽

Global Biodiversity Information Facility ◽

Sparql Endpoint ◽

Darwin Core ◽

Metadata Standards ◽

Great Progress ◽

Global Biodiversity ◽

Research Domains ◽

Biodiversity Information

Great progress has been made recently in digitizing the world’s available Biodiversity and Biogeography data, but managing data from many different providers and research domains remains a challenge. A review of the current landscape of metadata standards and ontologies in Biodiversity sciences suggests that existing standards, such as the Darwin Core terminology, are inadequate for describing Biodiversity data in a semantically meaningful and computationally useful way. As a contribution to fill this gap, we present an ontology-based system, called BiGe-Onto, designed to manage data from Biodiversity and Biogeography together. As data sources, we use two internationally recognized repositories: the Global Biodiversity Information Facility (GBIF) and the Ocean Biogeographic Information System (OBIS). The BiGe-Onto system is composed of (i) the BiGe-Onto Architecture, (ii) a conceptual model called BiGe-Onto specified in OntoUML, (iii) an operational version of BiGe-Onto encoded in OWL 2, and (iv) an integrated dataset for its exploitation through a SPARQL endpoint. We will show use cases that allow researchers to answer questions that draw on information from both domains.


The Living Atlases Community: Communication and documentation

Biodiversity Information Science and Standards ◽

10.3897/biss.4.59273 ◽

2020 ◽

Vol 4 ◽

Author(s):

Marie-Elise Lecoq ◽

Vicente Ruiz Jurado

Keyword(s):

End Users ◽

The Internet ◽

Technical Documentation ◽

Biodiversity Data ◽

Global Biodiversity Information Facility ◽

Network Nodes ◽

Twitter Account ◽

Global Biodiversity ◽

One Year ◽

Biodiversity Information

Managers and developers from organizations within the Global Biodiversity Information Facility (GBIF) network nodes using the Atlas of Living Australia (ALA) modules have created the Living Atlases (LA) community. Since the beginning, two of our priorities have been the technical guides and communication inside and outside our network. A community cannot be sustainable without useful technical documentation, as members must work by themselves as much as possible. Without communication, a community cannot grow either. More than one year ago, the Living Atlases community hired a technical coordinator, Vicente J. Ruiz Jurado. With the help of other participants, he greatly improved our technical documentation with the Living Atlas Quick Start Guide and increased communication with remote support sessions. The helpdesk, through the use of the LA Slack channel, has been improved as well. We have also increased our visibility on the Internet with our website and our Twitter account. Over the last few years, we have focused our work on end-users, with dedicated workshops, including exercises made by participants for their users and two videos showing how a Living Atlas works (How to search and download biodiversity data in an Atlas and How to use regions/spatial module in an Atlas).
