Data Location Quality at GBIF

Author(s):  
John Waller

I will cover how the Global Biodiversity Information Facility (GBIF) handles data quality issues, with specific focus on coordinate location issues such as gridded datasets (Fig. 1) and country centroids. I will highlight the challenges GBIF faces in identifying potential data quality problems and what we and others (Zizka et al. 2019) are doing to discover and address them. GBIF is the largest open-data portal of biodiversity data: a large network of individual datasets (> 40,000) from various sources and publishers. Because these datasets are variable both internally and from dataset to dataset, they pose a challenge for users wanting to combine data collected from museums, smartphones, atlases, satellite tracking, DNA sequencing, and various other sources for research or analysis. Data quality at GBIF will always be a moving target (Chapman 2005), and GBIF already handles many obvious errors such as zero or impossible coordinates, empty or invalid data fields, and fuzzy taxon matching. Since GBIF primarily (but not exclusively) serves latitude-longitude location information, there is an expectation that occurrences fall somewhat close to where the species actually occurs. This is not always the case. Occurrence records can be hundreds of kilometers away from where the species naturally occurs, for several reasons that might not be obvious to users. One reason is that many GBIF datasets are gridded. Gridded datasets have low resolution because records are sampled at equally spaced points; this is a data quality issue when a user assumes an occurrence record was recorded exactly at its coordinates. Country centroids are another reason why a species occurrence record might be far from where it occurs naturally. GBIF does not yet flag country centroids: records where the dataset publisher has entered the latitude-longitude center of a country instead of leaving the coordinate fields blank.
I will discuss the challenges of locating these issues and the current solutions (such as the CoordinateCleaner R package). I will touch on how existing Darwin Core terms like coordinateUncertaintyInMeters and footprintWKT are being utilized to highlight low coordinate resolution. Finally, I will highlight some other emerging data quality issues and how GBIF is beginning to experiment with dataset-level flagging. Currently we have flagged around 500 datasets as gridded and around 400 datasets as citizen science, but there are many more potential dataset flags.
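The gridded-dataset heuristic described above can be sketched in a few lines: if most gaps between a dataset's unique coordinate values equal one modal spacing, the dataset is likely gridded. This is a simplified illustration (the CoordinateCleaner R package implements a more robust version); the rounding precision and the 0.8 threshold below are illustrative assumptions, not GBIF's actual parameters.

```python
from collections import Counter

def gridded_fraction(coords, decimals=4):
    """Fraction of gaps between consecutive unique sorted values
    that equal the most common (modal) gap."""
    vals = sorted(set(round(c, decimals) for c in coords))
    gaps = [round(b - a, decimals) for a, b in zip(vals, vals[1:])]
    if not gaps:
        return 0.0
    modal_count = Counter(gaps).most_common(1)[0][1]
    return modal_count / len(gaps)

def looks_gridded(lats, lons, threshold=0.8):
    """Flag a dataset when both axes are dominated by one regular spacing."""
    return (gridded_fraction(lats) >= threshold and
            gridded_fraction(lons) >= threshold)

# A toy 0.5-degree survey grid, the kind of equally-spaced sampling
# that produces a gridded dataset
grid_lats = [50.0 + 0.5 * i for i in range(10) for _ in range(3)]
grid_lons = [4.0 + 0.5 * j for _ in range(10) for j in range(3)]
print(looks_gridded(grid_lats, grid_lons))  # True: one regular spacing dominates both axes
```

Scattered GPS records, by contrast, produce irregular gaps and fall below the threshold, which is why the check operates per dataset rather than per record.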

2017 ◽  
Vol 08 (04) ◽  
pp. 1012-1021 ◽  
Author(s):  
Steven Johnson ◽  
Stuart Speedie ◽  
Gyorgy Simon ◽  
Vipin Kumar ◽  
Bonnie Westra

Objective: The objective of this study was to demonstrate the utility of a healthcare data quality framework by using it to measure the impact of synthetic data quality issues on the validity of an eMeasure (CMS178—urinary catheter removal after surgery).
Methods: Data quality issues were artificially created by systematically degrading the underlying quality of EHR data using two methods: independent and correlated degradation. A linear model that describes the change in the events included in the eMeasure quantifies the impact of each data quality issue.
Results: Catheter duration had the most impact on the CMS178 eMeasure, with every 1% reduction in data quality causing a 1.21% increase in the number of missing events. For birth date and admission type, every 1% reduction in data quality resulted in a 1% increase in missing events.
Conclusion: This research demonstrated that the impact of data quality issues can be quantified using a generalized process and that the CMS178 eMeasure, as currently defined, may not measure how well an organization is meeting the intended best practice goal. Secondary use of EHR data is warranted only if the data are of sufficient quality. The assessment approach described in this study demonstrates how the impact of data quality issues on an eMeasure can be quantified, and the approach can be generalized for other data analysis tasks. Healthcare organizations can prioritize data quality improvement efforts to focus on the areas that will have the most impact on validity and assess whether the values that are reported should be trusted.
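The independent-degradation method can be illustrated with a small simulation: blank a field in a random fraction of records, then count how many events drop out of the measure. The record fields and the inclusion rule below are simplified stand-ins for illustration, not the actual CMS178 logic or the study's data.

```python
import random

def degrade(records, field, fraction, rng):
    """Independent degradation: blank `field` in a random `fraction` of records."""
    out = [dict(r) for r in records]
    for r in rng.sample(out, int(fraction * len(out))):
        r[field] = None
    return out

def eligible_events(records):
    """Toy inclusion rule standing in for the real eMeasure logic:
    a record counts only if its required fields are populated."""
    return sum(1 for r in records
               if r["catheter_days"] is not None and r["birth_date"] is not None)

rng = random.Random(42)
base = [{"catheter_days": 1, "birth_date": "1950-01-01"} for _ in range(1000)]

for pct in (0.0, 0.01, 0.05, 0.10):
    degraded = degrade(base, "catheter_days", pct, rng)
    missing = eligible_events(base) - eligible_events(degraded)
    print(f"{pct:.0%} degradation -> {missing} events lost")
```

Repeating this across degradation levels and fields yields the points to which a linear model can be fit, giving the per-field slopes (e.g. 1.21% missing events per 1% quality loss for catheter duration) reported above.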


2021 ◽  
Vol 9 ◽  
Author(s):  
Domingos Sandramo ◽  
Enrico Nicosia ◽  
Silvio Cianciullo ◽  
Bernardo Muatinte ◽  
Almeida Guissamulo

The collections of the Natural History Museum of Maputo have a crucial role in the safeguarding of Mozambique's biodiversity, representing an important repository of data and materials regarding the natural heritage of the country. In this paper, a dataset is described, based on the Museum's Entomological Collection, recording 409 species belonging to seven orders and 48 families. Each specimen's available data, such as geographical coordinates and taxonomic information, have been digitised to build the dataset. The specimens included in the dataset were obtained between 1914 and 2018 by collectors and researchers from the Natural History Museum of Maputo (once known as "Museu Álvaro de Castro") in all the country's provinces, with the exception of Cabo Delgado Province. This paper adds data to the Biodiversity Network of Mozambique and the Global Biodiversity Information Facility, within the objectives of the SECOSUD II Project and the Biodiversity Information for Development Programme. The aforementioned insect dataset is available on the GBIF Engine data portal (https://doi.org/10.15468/j8ikhb). Data were also shared on BioNoMo (https://bionomo.openscidata.org), the Mozambican national portal of biodiversity data developed by the SECOSUD II Project.


2018 ◽  
Vol 2 ◽  
pp. e25488
Author(s):  
Anne-Sophie Archambeau ◽  
Fabien Cavière ◽  
Kourouma Koura ◽  
Marie-Elise Lecoq ◽  
Sophie Pamerlon ◽  
...  

The Atlas of Living Australia (ALA) (https://www.ala.org.au/) is the Global Biodiversity Information Facility (GBIF) node of Australia. ALA has developed an open and free platform for sharing and exploring biodiversity data. All the modules are publicly available for reuse and customization on their GitHub account (https://github.com/AtlasOfLivingAustralia). GBIF Benin, hosted at the University of Abomey-Calavi, has published more than 338,000 occurrence records from 87 datasets and 2 checklists. Through the GBIF Capacity Enhancement Support Programme (https://www.gbif.org/programme/82219/capacity-enhancement-support-programme), GBIF Benin, with the help of GBIF France, is in the process of deploying the Beninese data portal using the GBIF France back-end architecture. GBIF Benin is the first African country to implement this module of the ALA infrastructure. In this presentation, we will give an overview of the registry and the occurrence search engine using the Beninese data portal. We will begin with the administration interface and how to manage metadata, then continue with the user interface of the registry and how Beninese occurrences can be found through the hub.


2019 ◽  
Author(s):  
Jeremy R. deWaard ◽  
Sujeevan Ratnasingham ◽  
Evgeny V. Zakharov ◽  
Alex V. Borisenko ◽  
Dirk Steinke ◽  
...  

The reliable taxonomic identification of organisms through DNA sequence data requires a well parameterized library of curated reference sequences. However, it is estimated that just 15% of described animal species are represented in public sequence repositories. To begin to address this deficiency, we provide DNA barcodes for 1,500,003 animal specimens collected from 23 terrestrial and aquatic ecozones at sites across Canada, a nation that comprises 7% of the planet’s land surface. In total, 14 phyla, 43 classes, 163 orders, 1123 families, 6186 genera, and 64,264 Barcode Index Numbers (BINs; a proxy for species) are represented. Species-level taxonomy was available for 38% of the specimens, but higher proportions were assigned to a genus (69.5%) and a family (99.9%). Voucher specimens and DNA extracts are archived at the Centre for Biodiversity Genomics where they are available for further research. The corresponding sequence and taxonomic data can be accessed through the Barcode of Life Data System, GenBank, the Global Biodiversity Information Facility, and the Global Genome Biodiversity Network Data Portal.


2018 ◽  
Vol 2 ◽  
pp. e25990 ◽  
Author(s):  
Manuel Vargas ◽  
María Mora ◽  
William Ulate ◽  
José Cuadra

The Atlas of Living Costa Rica (http://www.crbio.cr/) is a biodiversity data portal, based on the Atlas of Living Australia (ALA), which provides integrated, free, and open access to data and information about Costa Rican biodiversity in order to support science, education, and conservation. It is managed by the Biodiversity Informatics Research Center (CRBio) and the National Biodiversity Institute (INBio). Currently, the Atlas of Living Costa Rica includes nearly 8 million georeferenced species occurrence records, mediated by the Global Biodiversity Information Facility (GBIF), which come from more than 900 databases and have been published by research centers in 36 countries. Half of those records are published by Costa Rican institutions. In addition, CRBio is making a special effort to enrich and share more than 5000 species pages, developed by INBio, about Costa Rican vertebrates, arthropods, molluscs, nematodes, plants and fungi. These pages contain information elements pertaining to, for instance, morphological descriptions, distribution, habitat, conservation status, management, nomenclature and multimedia. This effort is aligned with collaborations established by Costa Rica with other countries such as Spain, Mexico, Colombia and Brazil to standardize this type of information through Plinian Core (https://github.com/PlinianCore), a set of vocabulary terms that can be used to describe different aspects of biological species. The Biodiversity Information Explorer (BIE) is one of the modules made available by ALA; it indexes taxonomic and species content and provides a search interface for it. We will present how CRBio is implementing BIE as part of the Atlas of Living Costa Rica in order to share all the information elements contained in the Costa Rican species pages.



Author(s):  
Peter Desmet ◽  
Stijn Van Hoey ◽  
Lien Reyserhove ◽  
Dimitri Brosens ◽  
Damiano Oldoni ◽  
...  

The Research Institute for Nature and Forest (INBO) is co-managing three biologging networks as part of a terrestrial and freshwater observatory for LifeWatch Belgium. The networks are a GPS tracking network for large birds, an acoustic receiver network for fish, and a camera trap network for mammals. As part of our mission at the Open science lab for biodiversity, we are publishing the machine observations these networks generate as standardized, open data. One of the challenges, however, is finding the appropriate standards and platforms to do so. In this talk, we will present the three networks, the type of biologging data they collect, and how we (plan to) standardize these to specific community standards and to Darwin Core (Wieczorek et al. 2012). Data from the bird tracking network were published in 2014 as one of the first biologging datasets on the Global Biodiversity Information Facility (GBIF) (Stienen et al. 2014). We are now planning to upload the data to Movebank instead and contribute to a generic mapping between the Movebank format and Darwin Core. Data from the acoustic receiver network are being mapped using the Darwin Core guidelines proposed by the Machine Observations Interest Group of Biodiversity Information Standards (TDWG). Images generated by the camera trap network are managed in the annotation system Agouti, for which we plan to export the data in the Camera Trap Metadata Language (Forrester et al. 2016). We also aim to write a software package to deposit camera trap images and data on Zenodo and map the observation data to Darwin Core. We hope that our work will contribute to discussions and guidelines on how to best map biologging data to Darwin Core, which is one of the aims of the Machine Observations Interest Group of TDWG.
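The kind of mapping discussed above can be sketched as a simple field-to-term translation: one tracker fix becomes one Darwin Core occurrence. The Darwin Core terms used (occurrenceID, basisOfRecord, organismID, scientificName, eventDate, decimalLatitude, decimalLongitude) are real terms, but the input field names and the example values are hypothetical tracker output, not the Movebank format or INBO's actual mapping.

```python
def gps_fix_to_dwc(fix):
    """Map one GPS tracker fix to a Darwin Core occurrence record.
    Input field names are hypothetical tracker output."""
    return {
        "occurrenceID": f"{fix['tag_id']}:{fix['timestamp']}",  # stable, unique per fix
        "basisOfRecord": "MachineObservation",  # controlled DwC value for sensor data
        "organismID": fix["tag_id"],
        "scientificName": fix["species"],
        "eventDate": fix["timestamp"],
        "decimalLatitude": fix["lat"],
        "decimalLongitude": fix["lon"],
    }

fix = {"tag_id": "H123456", "species": "Larus argentatus",
       "timestamp": "2014-06-01T12:00:00Z", "lat": 51.33, "lon": 3.23}
print(gps_fix_to_dwc(fix)["basisOfRecord"])  # MachineObservation
```

A generic, shared mapping of this shape is what makes biologging data from different networks comparable once published to GBIF.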


Author(s):  
Nora Escribano ◽  
David Galicia ◽  
Arturo H. Ariño

Building on the development of Biodiversity Informatics, the Global Biodiversity Information Facility (GBIF) undertook the task of enabling access to the world’s wealth of biodiversity data via the Internet. To date, GBIF has become, in many respects, the most extensive biodiversity information exchange infrastructure in the world, opening up a full range of possibilities for science. Science has benefited from such access to biodiversity data in research areas ranging from the effects of environmental change on biodiversity to the spread of invasive species, among many others. As of this writing, more than 7,000 published items (scientific papers, reviews, conference proceedings) have been indexed in the GBIF Secretariat’s literature tracking programme. On the basis of this database, we will present trends in GBIF users’ behaviour over time regarding openness, social structure, and other features associated with such scientific production: what is the measurable impact of research using GBIF data? How is the GBIF community of users growing? Is the science made with, and enabled by, open data actually open? Mapping GBIF users’ choices will show how biodiversity research is evolving through time, synthesising past and current priorities of this community in an attempt to forecast whether summer—or winter—is coming.


2021 ◽  
Vol 9 ◽  
Author(s):  
Mariya Dimitrova ◽  
Viktor Senderov ◽  
Teodor Georgiev ◽  
Georgi Zhelezov ◽  
Lyubomir Penev

OpenBiodiv is a biodiversity knowledge graph containing a synthetic linked open dataset, OpenBiodiv-LOD, which combines knowledge extracted from academic literature with the taxonomic backbone used by the Global Biodiversity Information Facility. The linked open data is modelled according to the OpenBiodiv-O ontology, integrating semantic resource types from recognised biodiversity and publishing ontologies with OpenBiodiv-O resource types introduced to capture the semantics of resources not modelled before. We introduce the new release of the OpenBiodiv-LOD, attained through information extraction and modelling of additional biodiversity entities. It was achieved by further developments to OpenBiodiv-O, the data storage infrastructure, and the workflow and accompanying R software packages used for transformation of academic literature into Resource Description Framework (RDF). We discuss how to utilise the LOD in biodiversity informatics and give examples by providing solutions to several competency questions. We investigate performance issues that arise due to the large number of inferred statements in the graph and conclude that OWL Full inference is impractical for the project and that unnecessary inference should be avoided.
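A competency question against a knowledge graph amounts to pattern matching over subject-predicate-object triples. As a toy illustration (real OpenBiodiv queries run as SPARQL against the RDF graph; the predicate names and identifiers below are simplified stand-ins, not OpenBiodiv-O terms):

```python
# (subject, predicate, object) triples; identifiers are illustrative only
triples = [
    ("paper:1", "mentionsTaxon", "taxon:Harmonia_axyridis"),
    ("paper:2", "mentionsTaxon", "taxon:Harmonia_axyridis"),
    ("paper:2", "mentionsTaxon", "taxon:Coccinella_septempunctata"),
    ("taxon:Harmonia_axyridis", "hasRank", "species"),
]

def query(triples, pred, obj):
    """Competency question: which subjects stand in `pred` to `obj`?"""
    return sorted(s for s, p, o in triples if p == pred and o == obj)

# "Which papers mention this taxon?"
print(query(triples, "mentionsTaxon", "taxon:Harmonia_axyridis"))
```

Inference multiplies such triples (e.g. materialising superclass memberships), which is why the abstract notes that unrestricted OWL Full inference becomes impractical at graph scale.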

