Data Location Quality at GBIF

Author(s):  
John Waller

I will cover how the Global Biodiversity Information Facility (GBIF) handles data quality issues, with specific focus on coordinate location issues such as gridded datasets (Fig. 1) and country centroids. I will highlight the challenges GBIF faces in identifying potential data quality problems and what we and others (Zizka et al. 2019) are doing to discover and address them. GBIF is the largest open-data portal of biodiversity data: a large network of individual datasets (> 40,000) from various sources and publishers. Because these datasets are variable both internally and from dataset to dataset, they pose a challenge for users wanting to combine data collected from museums, smartphones, atlases, satellite tracking, DNA sequencing, and various other sources for research or analysis. Data quality at GBIF will always be a moving target (Chapman 2005), and GBIF already handles many obvious errors such as zero or impossible coordinates, empty or invalid data fields, and fuzzy taxon matching. Since GBIF primarily (but not exclusively) serves latitude-longitude location information, there is an expectation that occurrences fall somewhat close to where the species actually occurs. This is not always the case. Occurrence records can be hundreds of kilometers away from where the species naturally occurs, for several reasons that might not be obvious to users. One reason is that many GBIF datasets are gridded. Gridded datasets have low resolution because records are sampled at equally spaced points; this is a data quality issue when a user assumes an occurrence record was recorded exactly at its coordinates. Country centroids are another reason why a species occurrence record might be far from where it occurs naturally. GBIF does not yet flag country centroids: records where the dataset publisher has entered the latitude-longitude center of a country instead of leaving the coordinate fields blank.
I will discuss the challenges of locating these issues and the current solutions (such as the CoordinateCleaner R package). I will touch on how existing Darwin Core terms like coordinateUncertaintyInMeters and footprintWKT are being utilized to highlight low coordinate resolution. Finally, I will highlight some other emerging data quality issues and how GBIF is beginning to experiment with dataset-level flagging. Currently we have flagged around 500 datasets as gridded and around 400 datasets as citizen science, but there are many more potential dataset flags.
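The gridded-dataset heuristic described above can be sketched in a few lines: if most gaps between a dataset's unique coordinate values equal one modal spacing, the dataset is likely gridded. This is a simplified illustration (the CoordinateCleaner R package implements a more robust version); the rounding precision and the 0.8 threshold below are illustrative assumptions, not GBIF's actual parameters.

```python
from collections import Counter

def gridded_fraction(coords, decimals=4):
    """Fraction of gaps between consecutive unique sorted values
    that equal the most common (modal) gap."""
    vals = sorted(set(round(c, decimals) for c in coords))
    gaps = [round(b - a, decimals) for a, b in zip(vals, vals[1:])]
    if not gaps:
        return 0.0
    modal_count = Counter(gaps).most_common(1)[0][1]
    return modal_count / len(gaps)

def looks_gridded(lats, lons, threshold=0.8):
    """Flag a dataset when both axes are dominated by one regular spacing."""
    return (gridded_fraction(lats) >= threshold and
            gridded_fraction(lons) >= threshold)

# A toy 0.5-degree survey grid, the kind of equally-spaced sampling
# that produces a gridded dataset
grid_lats = [50.0 + 0.5 * i for i in range(10) for _ in range(3)]
grid_lons = [4.0 + 0.5 * j for _ in range(10) for j in range(3)]
print(looks_gridded(grid_lats, grid_lons))  # True: one regular spacing dominates both axes
```

Scattered GPS records, by contrast, produce irregular gaps and fall below the threshold, which is why the check operates per dataset rather than per record.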

2017 ◽  
Vol 08 (04) ◽  
pp. 1012-1021 ◽  
Author(s):  
Steven Johnson ◽  
Stuart Speedie ◽  
Gyorgy Simon ◽  
Vipin Kumar ◽  
Bonnie Westra

Objective: The objective of this study was to demonstrate the utility of a healthcare data quality framework by using it to measure the impact of synthetic data quality issues on the validity of an eMeasure (CMS178—urinary catheter removal after surgery).
Methods: Data quality issues were artificially created by systematically degrading the underlying quality of EHR data using two methods: independent and correlated degradation. A linear model that describes the change in the events included in the eMeasure quantifies the impact of each data quality issue.
Results: Catheter duration had the most impact on the CMS178 eMeasure, with every 1% reduction in data quality causing a 1.21% increase in the number of missing events. For birth date and admission type, every 1% reduction in data quality resulted in a 1% increase in missing events.
Conclusion: This research demonstrated that the impact of data quality issues can be quantified using a generalized process and that the CMS178 eMeasure, as currently defined, may not measure how well an organization is meeting the intended best practice goal. Secondary use of EHR data is warranted only if the data are of sufficient quality. The assessment approach described in this study demonstrates how the impact of data quality issues on an eMeasure can be quantified, and the approach can be generalized for other data analysis tasks. Healthcare organizations can prioritize data quality improvement efforts to focus on the areas that will have the most impact on validity and assess whether the values that are reported should be trusted.
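The independent-degradation method can be illustrated with a small simulation: blank a field in a random fraction of records, then count how many events drop out of the measure. The record fields and the inclusion rule below are simplified stand-ins for illustration, not the actual CMS178 logic or the study's data.

```python
import random

def degrade(records, field, fraction, rng):
    """Independent degradation: blank `field` in a random `fraction` of records."""
    out = [dict(r) for r in records]
    for r in rng.sample(out, int(fraction * len(out))):
        r[field] = None
    return out

def eligible_events(records):
    """Toy inclusion rule standing in for the real eMeasure logic:
    a record counts only if its required fields are populated."""
    return sum(1 for r in records
               if r["catheter_days"] is not None and r["birth_date"] is not None)

rng = random.Random(42)
base = [{"catheter_days": 1, "birth_date": "1950-01-01"} for _ in range(1000)]

for pct in (0.0, 0.01, 0.05, 0.10):
    degraded = degrade(base, "catheter_days", pct, rng)
    missing = eligible_events(base) - eligible_events(degraded)
    print(f"{pct:.0%} degradation -> {missing} events lost")
```

Repeating this across degradation levels and fields yields the points to which a linear model can be fit, giving the per-field slopes (e.g. 1.21% missing events per 1% quality loss for catheter duration) reported above.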


2021 ◽  
Vol 9 ◽  
Author(s):  
Domingos Sandramo ◽  
Enrico Nicosia ◽  
Silvio Cianciullo ◽  
Bernardo Muatinte ◽  
Almeida Guissamulo

The collections of the Natural History Museum of Maputo have a crucial role in the safeguarding of Mozambique's biodiversity, representing an important repository of data and materials regarding the natural heritage of the country. In this paper, a dataset is described, based on the Museum's Entomological Collection, recording 409 species belonging to seven orders and 48 families. Each specimen's available data, such as geographical coordinates and taxonomic information, have been digitised to build the dataset. The specimens included in the dataset were obtained between 1914 and 2018 by collectors and researchers from the Natural History Museum of Maputo (once known as "Museu Álvaro de Castro") in all the country's provinces, with the exception of Cabo Delgado Province. This paper adds data to the Biodiversity Network of Mozambique and the Global Biodiversity Information Facility, within the objectives of the SECOSUD II Project and the Biodiversity Information for Development Programme. The aforementioned insect dataset is available on the GBIF Engine data portal (https://doi.org/10.15468/j8ikhb). Data were also shared on BioNoMo (https://bionomo.openscidata.org), the Mozambican national portal of biodiversity data developed by the SECOSUD II Project.


2018 ◽  
Vol 2 ◽  
pp. e25488
Author(s):  
Anne-Sophie Archambeau ◽  
Fabien Cavière ◽  
Kourouma Koura ◽  
Marie-Elise Lecoq ◽  
Sophie Pamerlon ◽  
...  

The Atlas of Living Australia (ALA) (https://www.ala.org.au/) is the Global Biodiversity Information Facility (GBIF) node of Australia. ALA has developed an open and free platform for sharing and exploring biodiversity data. All the modules are publicly available for reuse and customization on their GitHub account (https://github.com/AtlasOfLivingAustralia). GBIF Benin, hosted at the University of Abomey-Calavi, has published more than 338,000 occurrence records from 87 datasets and 2 checklists. Through the GBIF Capacity Enhancement Support Programme (https://www.gbif.org/programme/82219/capacity-enhancement-support-programme), GBIF Benin, with the help of GBIF France, is in the process of deploying the Beninese data portal using the GBIF France back-end architecture. GBIF Benin is the first African country to implement this module of the ALA infrastructure. In this presentation, we will give an overview of the registry and the occurrence search engine using the Beninese data portal. We will begin with the administration interface and how to manage metadata, then continue with the user interface of the registry and how Beninese occurrences can be found through the hub.


2019 ◽  
Author(s):  
Jeremy R. deWaard ◽  
Sujeevan Ratnasingham ◽  
Evgeny V. Zakharov ◽  
Alex V. Borisenko ◽  
Dirk Steinke ◽  
...  

The reliable taxonomic identification of organisms through DNA sequence data requires a well parameterized library of curated reference sequences. However, it is estimated that just 15% of described animal species are represented in public sequence repositories. To begin to address this deficiency, we provide DNA barcodes for 1,500,003 animal specimens collected from 23 terrestrial and aquatic ecozones at sites across Canada, a nation that comprises 7% of the planet’s land surface. In total, 14 phyla, 43 classes, 163 orders, 1123 families, 6186 genera, and 64,264 Barcode Index Numbers (BINs; a proxy for species) are represented. Species-level taxonomy was available for 38% of the specimens, but higher proportions were assigned to a genus (69.5%) and a family (99.9%). Voucher specimens and DNA extracts are archived at the Centre for Biodiversity Genomics where they are available for further research. The corresponding sequence and taxonomic data can be accessed through the Barcode of Life Data System, GenBank, the Global Biodiversity Information Facility, and the Global Genome Biodiversity Network Data Portal.


2018 ◽  
Vol 2 ◽  
pp. e25990 ◽  
Author(s):  
Manuel Vargas ◽  
María Mora ◽  
William Ulate ◽  
José Cuadra

The Atlas of Living Costa Rica (http://www.crbio.cr/) is a biodiversity data portal, based on the Atlas of Living Australia (ALA), which provides integrated, free, and open access to data and information about Costa Rican biodiversity in order to support science, education, and conservation. It is managed by the Biodiversity Informatics Research Center (CRBio) and the National Biodiversity Institute (INBio). Currently, the Atlas of Living Costa Rica includes nearly 8 million georeferenced species occurrence records, mediated by the Global Biodiversity Information Facility (GBIF), which come from more than 900 databases and have been published by research centers in 36 countries. Half of those records are published by Costa Rican institutions. In addition, CRBio is making a special effort to enrich and share more than 5000 species pages, developed by INBio, about Costa Rican vertebrates, arthropods, molluscs, nematodes, plants and fungi. These pages contain information elements pertaining to, for instance, morphological descriptions, distribution, habitat, conservation status, management, nomenclature and multimedia. This effort is aligned with collaborations established by Costa Rica with other countries such as Spain, Mexico, Colombia and Brazil to standardize this type of information through Plinian Core (https://github.com/PlinianCore), a set of vocabulary terms that can be used to describe different aspects of biological species. The Biodiversity Information Explorer (BIE) is one of the modules made available by ALA; it indexes taxonomic and species content and provides a search interface for it. We will present how CRBio is implementing BIE as part of the Atlas of Living Costa Rica in order to share all the information elements contained in the Costa Rican species pages.



Author(s):  
Peter Desmet ◽  
Stijn Van Hoey ◽  
Lien Reyserhove ◽  
Dimitri Brosens ◽  
Damiano Oldoni ◽  
...  

The Research Institute for Nature and Forest (INBO) is co-managing three biologging networks as part of a terrestrial and freshwater observatory for LifeWatch Belgium. The networks are a GPS tracking network for large birds, an acoustic receiver network for fish, and a camera trap network for mammals. As part of our mission at the Open science lab for biodiversity, we are publishing the machine observations these networks generate as standardized, open data. One of the challenges, however, is finding the appropriate standards and platforms to do so. In this talk, we will present the three networks, the type of biologging data they collect, and how we (plan to) standardize these to specific community standards and to Darwin Core (Wieczorek et al. 2012). Data from the bird tracking network were published in 2014 as one of the first biologging datasets on the Global Biodiversity Information Facility (GBIF) (Stienen et al. 2014). We are now planning to upload the data to Movebank instead and contribute to a generic mapping between the Movebank format and Darwin Core. Data from the acoustic receiver network are being mapped using the Darwin Core guidelines proposed by the Machine Observations Interest Group of Biodiversity Information Standards (TDWG). Images generated by the camera trap network are managed in the annotation system Agouti, for which we plan to export the data in the Camera Trap Metadata Language (Forrester et al. 2016). We also aim to write a software package to deposit camera trap images and data on Zenodo and map the observation data to Darwin Core. We hope that our work will contribute to discussions and guidelines on how to best map biologging data to Darwin Core, which is one of the aims of the Machine Observations Interest Group of TDWG.
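The kind of mapping discussed above can be sketched as a simple field-to-term translation: one tracker fix becomes one Darwin Core occurrence. The Darwin Core terms used (occurrenceID, basisOfRecord, organismID, scientificName, eventDate, decimalLatitude, decimalLongitude) are real terms, but the input field names and the example values are hypothetical tracker output, not the Movebank format or INBO's actual mapping.

```python
def gps_fix_to_dwc(fix):
    """Map one GPS tracker fix to a Darwin Core occurrence record.
    Input field names are hypothetical tracker output."""
    return {
        "occurrenceID": f"{fix['tag_id']}:{fix['timestamp']}",  # stable, unique per fix
        "basisOfRecord": "MachineObservation",  # controlled DwC value for sensor data
        "organismID": fix["tag_id"],
        "scientificName": fix["species"],
        "eventDate": fix["timestamp"],
        "decimalLatitude": fix["lat"],
        "decimalLongitude": fix["lon"],
    }

fix = {"tag_id": "H123456", "species": "Larus argentatus",
       "timestamp": "2014-06-01T12:00:00Z", "lat": 51.33, "lon": 3.23}
print(gps_fix_to_dwc(fix)["basisOfRecord"])  # MachineObservation
```

A generic, shared mapping of this shape is what makes biologging data from different networks comparable once published to GBIF.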


Author(s):  
Nora Escribano ◽  
David Galicia ◽  
Arturo H. Ariño

Building on the development of Biodiversity Informatics, the Global Biodiversity Information Facility (GBIF) undertook the task of enabling access to the world’s wealth of biodiversity data via the Internet. To date, GBIF has become, in many respects, the most extensive biodiversity information exchange infrastructure in the world, opening up a full range of possibilities for science. Science has benefited from such access to biodiversity data in research areas ranging from the effects of environmental change on biodiversity to the spread of invasive species, among many others. As of this writing, more than 7,000 published items (scientific papers, reviews, conference proceedings) have been indexed in the GBIF Secretariat’s literature tracking programme. On the basis of this database, we will present trends in GBIF users’ behaviour over time regarding openness, social structure, and other features associated with such scientific production: what is the measurable impact of research using GBIF data? How is the GBIF community of users growing? Is the science made with, and enabled by, open data actually open? Mapping GBIF users’ choices will show how biodiversity research is evolving through time, synthesising past and current priorities of this community in an attempt to forecast whether summer—or winter—is coming.


2021 ◽  
Vol 9 ◽  
Author(s):  
Mariya Dimitrova ◽  
Viktor Senderov ◽  
Teodor Georgiev ◽  
Georgi Zhelezov ◽  
Lyubomir Penev

OpenBiodiv is a biodiversity knowledge graph containing a synthetic linked open dataset, OpenBiodiv-LOD, which combines knowledge extracted from academic literature with the taxonomic backbone used by the Global Biodiversity Information Facility. The linked open data is modelled according to the OpenBiodiv-O ontology, integrating semantic resource types from recognised biodiversity and publishing ontologies with OpenBiodiv-O resource types introduced to capture the semantics of resources not modelled before. We introduce the new release of the OpenBiodiv-LOD, attained through information extraction and modelling of additional biodiversity entities. It was achieved by further developments to OpenBiodiv-O, the data storage infrastructure, and the workflow and accompanying R software packages used for transformation of academic literature into Resource Description Framework (RDF). We discuss how to utilise the LOD in biodiversity informatics and give examples by providing solutions to several competency questions. We investigate performance issues that arise due to the large number of inferred statements in the graph and conclude that OWL Full inference is impractical for the project and that unnecessary inference should be avoided.
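A competency question against a knowledge graph amounts to pattern matching over subject-predicate-object triples. As a toy illustration (real OpenBiodiv queries run as SPARQL against the RDF graph; the predicate names and identifiers below are simplified stand-ins, not OpenBiodiv-O terms):

```python
# (subject, predicate, object) triples; identifiers are illustrative only
triples = [
    ("paper:1", "mentionsTaxon", "taxon:Harmonia_axyridis"),
    ("paper:2", "mentionsTaxon", "taxon:Harmonia_axyridis"),
    ("paper:2", "mentionsTaxon", "taxon:Coccinella_septempunctata"),
    ("taxon:Harmonia_axyridis", "hasRank", "species"),
]

def query(triples, pred, obj):
    """Competency question: which subjects stand in `pred` to `obj`?"""
    return sorted(s for s, p, o in triples if p == pred and o == obj)

# "Which papers mention this taxon?"
print(query(triples, "mentionsTaxon", "taxon:Harmonia_axyridis"))
```

Inference multiplies such triples (e.g. materialising superclass memberships), which is why the abstract notes that unrestricted OWL Full inference becomes impractical at graph scale.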

