Trait Data Integration from the Perspective of a Data Aggregator

Author(s):  
Jennifer Hammock ◽  
Katja Schulz

The Encyclopedia of Life currently hosts ~8M attribute records for ~400k taxa (March 2019, not including geographic categories, Fig. 1). Our aggregation priorities include Essential Biodiversity Variables (Kissling et al. 2018) and other global scale research data priorities. Our primary strategy remains partnership with specialist open data aggregators; we are also developing tools for the deployment of evolutionarily conserved attribute values that scale quickly for global taxonomic coverage, for instance: tissue mineralization type (aragonite, calcite, silica...); trophic guild in certain clades; sensory modalities. To support the aggregation and integration of trait information, data sets should be well structured, properly annotated and free of licensing or contractual restrictions so that they are ‘findable, accessible, interoperable, and reusable’ for both humans and machines (FAIR principles; Wilkinson et al. 2016). To this end, we are improving the documentation of protocols for the transformation, curation, and analysis of EOL data, and associated scripts and software are made available to ensure reproducibility. Proper acknowledgement of contributors and tracking of credit through derived data products promote both open data sharing and the use of aggregated resources. By exposing unique identifiers for data products, people, and institutions, data providers and aggregators can stimulate the development of automated solutions for the creation of contribution metrics. Since different aspects of provenance will be significant depending on the intended data use, better standardization of contributor roles (e.g., author, compiler, publisher, funder) is needed, as well as more detailed attribution guidance for data users. Global scale biodiversity data resources should resolve into a graph, linking taxa, specimens, occurrences, attributes, localities, and ecological interactions, as well as human agents, publications and institutions. 
Two key data categories for ensuring rich connectivity in the graph will be taxonomic and trait data. This graph can be supported by existing data hubs, if they share identifiers and/or create mappings between them, using standards and sharing practices developed by the biodiversity data community. Versioned archives of the combined graph could be published at intervals to appropriate open data repositories, and open source tools and training provided for researchers to access the combined graph of biodiversity knowledge from all sources. To achieve this, good communication among data hubs will be needed. We will need to share information about preferred vocabularies and identifier management practices, and collaborate on identifier mappings.
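The identifier sharing and mapping described above can be sketched in a few lines. This is a minimal illustration with invented identifiers and records, not EOL's actual data model: two hubs key the same taxon by different local IDs, and a shared mapping lets their records merge into one graph node.

```python
# Hub A holds trait data; Hub B holds occurrence data. Both describe the
# same taxon under different local identifiers (all values hypothetical).
hub_a_traits = {"A:1012": {"trophic_guild": "herbivore"}}
hub_b_occurrences = {"B:77": {"locality": "Lake Balaton"}}

# A shared mapping resolves both local IDs to one canonical identifier.
id_mapping = {"A:1012": "urn:example:taxon:42",
              "B:77": "urn:example:taxon:42"}

def merge_hubs(*hubs):
    """Combine records from several hubs under canonical identifiers."""
    graph = {}
    for hub in hubs:
        for local_id, record in hub.items():
            canonical = id_mapping[local_id]
            graph.setdefault(canonical, {}).update(record)
    return graph

graph = merge_hubs(hub_a_traits, hub_b_occurrences)
# The merged node now links trait and occurrence data for the same taxon.
```

Without the mapping, the two records would remain disconnected islands, which is exactly the connectivity problem the combined graph is meant to solve.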

Author(s):  
Lyubomir Penev ◽  
Teodor Georgiev ◽  
Viktor Senderov ◽  
Mariya Dimitrova ◽  
Pavel Stoev

As one of the first advocates of open access and open data in the field of biodiversity publishing, Pensoft has adopted a multiple data publishing model, resulting in the ARPHA-BioDiv toolbox (Penev et al. 2017). ARPHA-BioDiv consists of several data publishing workflows and tools described in the Strategies and Guidelines for Publishing of Biodiversity Data and elsewhere: (1) data underlying research results are deposited in an external repository and/or published as supplementary file(s) to the article and then linked/cited in the article text; supplementary files are published under their own DOIs and bear their own citation details; (2) data deposited in trusted repositories and/or supplementary files are described in data papers, which may be submitted in text format or converted into manuscripts from Ecological Metadata Language (EML) metadata; (3) integrated narrative and data publishing is realised by the Biodiversity Data Journal, where structured data are imported into the article text from tables or via web services and downloaded/distributed from the published article; (4) data are published in structured, semantically enriched, full-text XMLs, so that several data elements can thereafter easily be harvested by machines; (5) Linked Open Data (LOD) are extracted from literature, converted into interoperable RDF triples in accordance with the OpenBiodiv-O ontology (Senderov et al. 2018) and stored in the OpenBiodiv Biodiversity Knowledge Graph.
The above-mentioned approaches are supported by a whole ecosystem of additional workflows and tools, for example: (1) pre-publication data auditing, involving both human and machine data quality checks (workflow 2); (2) web-service integration with data repositories and data centres, such as the Global Biodiversity Information Facility (GBIF), Barcode of Life Data Systems (BOLD), Integrated Digitized Biocollections (iDigBio), Data Observation Network for Earth (DataONE), Long Term Ecological Research (LTER), PlutoF, Dryad, and others (workflows 1, 2); (3) semantic markup of article texts in the TaxPub format, facilitating further extraction, distribution and re-use of sub-article elements and data (workflows 3, 4); (4) server-to-server import of specimen data from GBIF, BOLD, iDigBio and PlutoF into manuscript text (workflow 3); (5) automated conversion of EML metadata into data paper manuscripts (workflow 2); (6) export of Darwin Core Archives and automated deposition in GBIF (workflow 3); (7) submission of individual images and supplementary data under their own DOIs to the Biodiversity Literature Repository (BLR) (workflows 1-3); (8) conversion of key data elements from TaxPub articles and taxonomic treatments extracted by Plazi into RDF handled by OpenBiodiv (workflow 5).
These approaches represent different aspects of the prospective scholarly publishing of biodiversity data which, in combination with text and data mining (TDM) technologies for legacy literature (PDF) developed by Plazi, lay the ground for an entire data publishing ecosystem for biodiversity, supplying FAIR (Findable, Accessible, Interoperable and Reusable) data to several interoperable overarching infrastructures, such as GBIF, BLR, Plazi TreatmentBank and OpenBiodiv, and to various end users.
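The LOD workflow above can be illustrated with a tiny sketch of how a data element extracted from an article becomes subject-predicate-object triples. The URIs and predicate names here are invented for illustration and are not the actual OpenBiodiv-O terms:

```python
# Hypothetical sketch: a treatment extracted from a TaxPub article is
# expressed as RDF-style triples. Predicate names ("ex:...") are made up;
# the real workflow uses the OpenBiodiv-O ontology.

def to_triples(article_doi, treatment):
    article = f"https://doi.org/{article_doi}"
    return [
        (article, "ex:containsTreatment", treatment["id"]),
        (treatment["id"], "ex:aboutTaxon", treatment["taxon"]),
        (treatment["id"], "ex:publishedIn", article),
    ]

triples = to_triples("10.3897/example.1",
                     {"id": "ex:treatment-7", "taxon": "ex:Aus_bus"})
# Each triple can then be loaded into a triple store such as the
# OpenBiodiv Biodiversity Knowledge Graph.
```

The point of the triple form is that treatments, taxa and articles all become addressable nodes, so queries can traverse from literature to taxon and back.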


2019 ◽  
Vol 2 ◽  
Author(s):  
Lyubomir Penev

"Data ownership" is actually an oxymoron: because there can be no copyright (ownership) of facts or ideas, no data ownership rights or law exist. The term refers to various kinds of data protection instruments: Intellectual Property Rights (IPR) (mostly copyright) asserted to indicate some kind of data ownership, confidentiality clauses/rules, database right protection (in the European Union only), or personal data protection (GDPR) (Scassa 2018). Data protection is often realised via different mechanisms of "data hoarding", that is, withholding access to data for various reasons (Sieber 1989). Data hoarding, however, does not put the data into someone's ownership. Nonetheless, access to and re-use of data, and biodiversity data in particular, is hampered by technical, economic, sociological, legal and other factors, although there should be no formal legal provisions related to copyright that prevent anyone who needs the data from using them (Egloff et al. 2014, Egloff et al. 2017, see also the Bouchout Declaration). One of the best ways to provide access to data is to publish them so that the data creators and holders are credited for their efforts. As one of the pioneers in biodiversity data publishing, Pensoft has adopted a multiple-approach data publishing model, resulting in the ARPHA-BioDiv toolbox and in extensive Strategies and Guidelines for Publishing of Biodiversity Data (Penev et al. 2017a, Penev et al. 2017b). ARPHA-BioDiv consists of several data publishing workflows: (1) deposition of underlying data in an external repository and/or their publication as supplementary file(s) to the related article, which are then linked and/or cited in-text; supplementary files are published under their own DOIs to increase citability; (2) description of data in data papers after they have been deposited in trusted repositories and/or as supplementary files; the system allows data papers to be submitted either as plain text or converted into manuscripts from Ecological Metadata Language (EML) metadata; (3) import of structured data into the article text from tables or via web services and their subsequent download/distribution from the published article, as part of the integrated narrative and data publishing workflow realised by the Biodiversity Data Journal; (4) publication of data in structured, semantically enriched, full-text XMLs where data elements are machine-readable and easy to harvest; (5) extraction of Linked Open Data (LOD) from literature, which are then converted into interoperable RDF triples in accordance with the OpenBiodiv-O ontology (Senderov et al. 2018) and stored in the OpenBiodiv Biodiversity Knowledge Graph.
In combination with text and data mining (TDM) technologies for legacy literature (PDF) developed by Plazi, these approaches show different angles to the future of biodiversity data publishing and lay the foundations of an entire data publishing ecosystem in the field, while also supplying FAIR (Findable, Accessible, Interoperable and Reusable) data to several interoperable overarching infrastructures, such as the Global Biodiversity Information Facility (GBIF), the Biodiversity Literature Repository (BLR), Plazi TreatmentBank and OpenBiodiv, as well as to various end users.


Author(s):  
Liah Shonhe

The main focus of the study was to explore the practices of open data sharing in the agricultural sector, including establishing the research outputs concerning open data in agriculture. The study adopted a desktop research methodology based on literature review and bibliographic data from the Web of Science (WoS) database. Bibliometric indicators discussed include yearly productivity, most prolific authors, and leading countries. Study findings revealed that research activity in the field of agriculture and open access is very low. There were 36 OA articles and only 6 publications had an open data badge. Most researchers do not yet embrace the need to openly publish their data sets despite the availability of numerous open data repositories. Unfortunately, most African countries are still lagging behind in the management of agricultural open data. The study therefore recommends that researchers publish their research data sets as OA. African countries need to put more effort into establishing open data repositories and implementing the necessary policies to facilitate OA.
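The bibliometric tallies named above (yearly productivity, prolific authors, open data badges) are simple frequency counts. A minimal sketch on made-up records, standing in for the actual WoS export:

```python
from collections import Counter

# Invented sample records; fields mirror the indicators discussed above.
records = [
    {"year": 2018, "authors": ["Doe", "Smith"], "open_data_badge": False},
    {"year": 2019, "authors": ["Doe"], "open_data_badge": True},
    {"year": 2019, "authors": ["Lee", "Doe"], "open_data_badge": False},
]

per_year = Counter(r["year"] for r in records)            # yearly productivity
per_author = Counter(a for r in records for a in r["authors"])  # prolific authors
with_badge = sum(r["open_data_badge"] for r in records)   # open data badges
```

The same counts, run over the real bibliographic export, yield figures like the 36 OA articles and 6 badged publications reported in the study.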


Author(s):  
Remy Jomier ◽  
Remy Poncet ◽  
Noemie Michez

As part of the Biodiversity Information System on Nature and Landscapes (SINP), the French National Museum of Natural History was appointed to develop biodiversity data exchange standards, with the goal of sharing French marine and terrestrial data nationally, meeting domestic and European requirements, e.g., the Infrastructure for Spatial Information in Europe Directive (INSPIRE Directive, European Commission 2007). Data standards are now recognised as useful to improve and share biodiversity knowledge (e.g., species distribution) and play a key role in data valorisation (e.g., vulnerability assessment, conservation policy). For example, in order to fulfill reporting obligations within the Fauna and Flora Habitats Directive (European Commission 1992) and the Marine Strategy Framework Directive (European Commission 2008), information about taxa and habitat occurrences is required periodically, involving data exchange and compilation at a national scale. National and international data exchange standards are focused on species, and only a few solutions exist when there is a need to deal with habitat data. Darwin Core was built to fit species data exchange needs and contains only one habitat attribute, which leaves some leeway for transferring such information but is deemed one of the least standardized fields. Moreover, Darwin Core does not allow for the transfer of habitat-only data, as the scientific name of the taxon is mandatory. The SINP standard for habitats was developed by a dedicated working group, representative of biodiversity stakeholders in France. This standard focuses on core attributes that characterize habitat observation and monitoring. Interoperability remains to be achieved with the Darwin Core standard, or something similar on a world scale (e.g., Humboldt Core), as habitat data are regularly gathered irrespective of whether taxon occurrences are associated with them.
The results of the French initiative proved useful to compile and share data nationally, bringing together data providers that otherwise would have been excluded. However, at a global scale, it faces some challenges that still need to be fully addressed, interoperability being the main one. Regardless of the problems that remain to be solved, some lessons can be learnt from this effort. With the ultimate goal of making biodiversity data readily available, these lessons should be kept in mind for future initiatives. The presentation deals with how this work was undertaken and how the required elements could be integrated into a French national standard to allow for comprehensive habitat data reporting. It will present hypotheses as to what could be added to the Darwin Core to allow for a better understanding of habitats with at least one taxon attached (or not) to them.
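The constraint at the heart of the abstract above, that a Darwin Core-style occurrence record requires a scientific name while habitat is a single free-text field, can be made concrete with a small sketch. The field names follow Darwin Core; the validator itself is illustrative, not part of any standard:

```python
# In Darwin Core occurrence data, dwc:scientificName is required and
# habitat is one loosely standardized free-text term, so a habitat-only
# observation cannot be expressed on its own.

REQUIRED = {"scientificName"}

def valid_occurrence(record):
    """Illustrative check: does the record carry the mandatory fields?"""
    return REQUIRED.issubset(record)

taxon_record = {"scientificName": "Puccinellia limosa",
                "habitat": "alkali grassland"}
habitat_only = {"habitat": "alkali grassland"}  # no taxon attached
```

`valid_occurrence(taxon_record)` passes while `valid_occurrence(habitat_only)` fails, which is precisely why the SINP working group needed a separate habitat standard.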


Author(s):  
Néstor Fernández ◽  
Simon Ferrier ◽  
Laetitia M. Navarro ◽  
Henrique M. Pereira

Essential biodiversity variables (EBVs) are designed to support the detection and quantification of biodiversity change and to define priorities in biodiversity monitoring. Unlike most primary observations of biodiversity phenomena, EBV products should provide information readily available to produce policy-relevant biodiversity indicators, ideally at multiple spatial scales, from global to subnational. This information is typically complex to produce from a single set of data or type of observation, thus requiring approaches that integrate multiple sources of in situ and remote sensing (RS) data. Here we present an up-to-date EBV concept for biodiversity data integration and discuss the critical components of workflows for EBV production. We argue that open and reproducible workflows for data integration are critical to ensure traceability and reproducibility so that each EBV endures and can be updated as novel biodiversity models are adopted, new observation systems become available, and new data sets are incorporated. Fulfilling the EBV vision requires strengthening efforts to mobilize massive amounts of in situ biodiversity data that are not yet publicly available and taking full advantage of emerging RS technologies, novel biodiversity models, and informatics infrastructures, in alignment with the development of a globally coordinated system for biodiversity monitoring.
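One way to make the traceability argument above concrete is a workflow in which every integration step is a named, versioned function and the run records its own provenance. This is a minimal sketch under invented step names and placeholder computations, not the authors' actual workflow:

```python
# Each step logs (name, version) when it runs, so the EBV product carries
# a record of exactly which processing chain produced it and can be re-run
# when models or inputs change. Step names and maths are placeholders.

steps = []

def step(name, version):
    def wrap(fn):
        def inner(data):
            steps.append((name, version))
            return fn(data)
        return inner
    return wrap

@step("harmonize_in_situ", "1.0")
def harmonize(data):
    return [x / max(data) for x in data]  # toy normalisation

@step("fuse_with_rs", "0.3")
def fuse(data):
    return sum(data) / len(data)  # stand-in for a real fusion model

ebv_value = fuse(harmonize([2.0, 4.0]))
# steps == [("harmonize_in_situ", "1.0"), ("fuse_with_rs", "0.3")]
```

Because the provenance log is data, it can be published alongside the EBV product, which is what makes an update reproducible rather than a fresh, untraceable estimate.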


Author(s):  
A. Zlinszky ◽  
B. Deák ◽  
A. Kania ◽  
A. Schroiff ◽  
N. Pfeifer

Biodiversity is an ecological concept, which essentially involves a complex sum of several indicators. One widely accepted such set of indicators is prescribed for habitat conservation status assessment within Natura 2000, a continental-scale conservation programme of the European Union. Essential Biodiversity Variables are a set of indicators designed to be relevant for biodiversity and suitable for global-scale operational monitoring. Here we revisit a study of Natura 2000 conservation status mapping via airborne LIDAR that develops individual remote sensing-derived proxies for every parameter required by the Natura 2000 manual, from the perspective of developing regional-scale Essential Biodiversity Variables. Based on leaf-on and leaf-off point clouds (10 pt/m²) collected in an alkali grassland area, a set of data products were calculated at 0.5 × 0.5 m resolution. These represent various aspects of radiometric and geometric texture. A Random Forest machine learning classifier was developed to create fuzzy vegetation maps of classes of interest based on these data products. In the next step, either classification results or LIDAR data products were selected as proxies for individual Natura 2000 conservation status variables, and fine-tuned based on field references. These proxies showed adequate performance and were summarized to deliver Natura 2000 conservation status with 80% overall accuracy compared to field references. This study draws attention to the potential of LIDAR for regional-scale Essential Biodiversity Variables, and also holds implications for global-scale mapping. These are (i) the use of sensor data products together with habitat-level classification, (ii) the utility of seasonal data, including for non-seasonal variables such as grassland canopy structure, and (iii) the potential of fuzzy mapping-derived class probabilities as proxies for species presence and absence.
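Point (iii) above, class probabilities as presence/absence proxies, can be sketched without any remote sensing machinery. Here the per-pixel probabilities are hand-made stand-ins for Random Forest outputs, and the class names are invented:

```python
# A "fuzzy map": each pixel carries class-membership probabilities
# (here fabricated, standing in for classifier outputs). Reading one
# class's probability against a threshold yields a presence proxy.

pixels = {
    (0, 0): {"open_grassland": 0.8, "dense_tall_grass": 0.2},
    (0, 1): {"open_grassland": 0.3, "dense_tall_grass": 0.7},
}

def presence_proxy(pixels, cls, threshold=0.5):
    """Binary presence layer derived from fuzzy class probabilities."""
    return {xy: probs[cls] >= threshold for xy, probs in pixels.items()}

layer = presence_proxy(pixels, "open_grassland")
```

Keeping the probabilities rather than only the hard labels is what lets the same classification support several downstream variables at different thresholds.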


2018 ◽  
Vol 2 ◽  
pp. e26808
Author(s):  
Donald Hobern ◽  
Andrea Hahn ◽  
Tim Robertson

The success of Darwin Core and ABCD Schema as flexible standards for sharing specimen data and species occurrence records has enabled GBIF to aggregate around one billion data records. At the same time, other thematic, national or regional aggregators have developed a wide range of other data indexes and portals, many of which enrich the data by interpreting and normalising elements not currently handled by GBIF or by linking other data from geospatial layers, trait databases, etc. Unfortunately, although each of these aggregators has specific strengths and supports particular audiences, this diversification produces many weaknesses and deficiencies for data publishers and for data users, including: incomplete and inconsistent inclusion of relevant datasets; proliferation of record identifiers; inconsistent and bespoke workflows to interpret and standardise data; absence of any shared basis for linked open data and annotations; divergent data formats and APIs; lack of clarity around provenance and impact; etc. The time is ripe for the global community to review these processes. From a technical standpoint, it would be feasible to develop a shared, integrated pipeline which harvested, validated and normalised all relevant biodiversity data records on behalf of all stakeholders. Such a system could build on TDWG expertise to standardise data checks and all stages in data transformation. It could incorporate a modular structure that allowed thematic, national or regional networks to generate additional data elements appropriate to the needs of their users, but for all of these elements to remain part of a single record with a single identifier, facilitating a much more rigorous approach to linked open data. Most of the other issues we currently face around fitness-for-use, predictability and repeatability, transparency and provenance could be supported much more readily under such a model. 
The key challenges that would need to be overcome would be around social factors, particularly to deliver a flexible and appropriate governance model and to allow research networks, national agencies, etc. to embed modular components within a shared workflow. Given the urgent need to improve data management to support Essential Biodiversity Variables and to deliver an effective global virtual natural history collection, we should review these challenges and seek to establish a data management and aggregation architecture that will support us for the coming decades.
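The modular architecture proposed above, one core pipeline assigning a single identifier per record, with thematic networks contributing enrichment modules to that same record, can be sketched as follows. The API here is hypothetical, invented purely to illustrate the design:

```python
import uuid

# Core pipeline: harvest raw records and mint exactly one identifier each.
def harvest(raw_records):
    return [{"id": str(uuid.uuid4()), "core": r, "extensions": {}}
            for r in raw_records]

# Modular enrichment: a thematic or national network adds its elements to
# the existing record instead of creating a new copy with a new identifier.
def add_module(records, name, enrich):
    for rec in records:
        rec["extensions"][name] = enrich(rec["core"])
    return records

records = harvest([{"scientificName": "Aus bus", "country": "FR"}])
add_module(records, "geo",
           lambda core: {"region": "Europe"} if core["country"] == "FR" else {})
```

Because every module writes into the same record under the same identifier, linked open data and annotations have a single stable target, which is the property the proliferation of per-aggregator identifiers currently destroys.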


2021 ◽  
Vol 6 ◽  
pp. 355
Author(s):  
Helen Buckley Woods ◽  
Stephen Pinfield

Background: Numerous mechanisms exist to incentivise researchers to share their data. This scoping review aims to identify and summarise evidence of the efficacy of different interventions to promote open data practices and provide an overview of current research. Methods: This scoping review is based on data identified from Web of Science and LISTA, limited from 2016 to 2021. A total of 1128 papers were screened, with 38 items being included. Items were selected if they focused on designing or evaluating an intervention or presenting an initiative to incentivise sharing. Items comprised a mixture of research papers, opinion pieces and descriptive articles. Results: Seven major themes in the literature were identified: publisher/journal data sharing policies, metrics, software solutions, research data sharing agreements in general, open science ‘badges’, funder mandates, and initiatives. Conclusions: A number of key messages for data sharing include: the need to build on existing cultures and practices, meeting people where they are and tailoring interventions to support them; the importance of publicising and explaining the policy/service widely; the need to have disciplinary data champions to model good practice and drive cultural change; the requirement to resource interventions properly; and the imperative to provide robust technical infrastructure and protocols, such as labelling of data sets, use of DOIs, data standards and use of data repositories.



Sensors ◽  
2021 ◽  
Vol 21 (15) ◽  
pp. 5204
Author(s):  
Anastasija Nikiforova

Nowadays, governments launch open government data (OGD) portals that provide data that can be accessed and used by everyone for their own needs. Although the potential economic value of open (government) data is estimated in the millions and billions, not all open data are reused. Moreover, the open (government) data initiative as well as users' intent for open (government) data are changing continuously, and today, in line with IoT and smart city trends, real-time data and sensor-generated data are of greater interest to users. These "smarter" open (government) data are also considered to be one of the crucial drivers for the sustainable economy, and might have an impact on information and communication technology (ICT) innovation and become a creativity bridge in developing a new ecosystem in Industry 4.0 and Society 5.0. The paper inspects OGD portals of 60 countries in order to understand the correspondence of their content to the Society 5.0 expectations. The paper reports on how well countries provide these data, focusing on some open (government) data success facilitating factors for both the portal in general and data sets of interest in particular. The presence of "smarter" data, their level of accessibility, availability, currency and timeliness, as well as support for users, are analyzed. A list of the most competitive countries by data category is provided. This makes it possible to understand which OGD portals react to users' needs and to Industry 4.0 and Society 5.0 requests by opening and updating data for further potential reuse, which is essential in the digital, data-driven world.
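The per-dataset assessment described above (accessibility, availability, currency/timeliness, "smarter" real-time data) amounts to scoring dataset metadata against a set of criteria. The criteria flags and the equal weighting below are invented for illustration and do not reproduce the paper's actual evaluation protocol:

```python
from datetime import date

def dataset_score(meta, today=date(2021, 1, 1)):
    """Toy score: one point per satisfied criterion (weights invented)."""
    score = 0
    score += meta.get("machine_readable", False)   # accessibility
    score += meta.get("api_available", False)      # availability
    days_old = (today - meta["last_updated"]).days
    score += days_old <= 30                        # currency / timeliness
    score += meta.get("real_time", False)          # "smarter" data
    return score

meta = {"machine_readable": True, "api_available": True,
        "last_updated": date(2020, 12, 20), "real_time": False}
```

Aggregating such scores per country and per data category is what yields the kind of competitiveness ranking the paper reports.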

