Trait Data Integration from the Perspective of a Data Aggregator

Author(s):  
Jennifer Hammock ◽  
Katja Schulz

The Encyclopedia of Life currently hosts ~8M attribute records for ~400k taxa (March 2019, not including geographic categories, Fig. 1). Our aggregation priorities include Essential Biodiversity Variables (Kissling et al. 2018) and other global scale research data priorities. Our primary strategy remains partnership with specialist open data aggregators; we are also developing tools for the deployment of evolutionarily conserved attribute values that scale quickly for global taxonomic coverage, for instance: tissue mineralization type (aragonite, calcite, silica...); trophic guild in certain clades; sensory modalities. To support the aggregation and integration of trait information, data sets should be well structured, properly annotated and free of licensing or contractual restrictions so that they are ‘findable, accessible, interoperable, and reusable’ for both humans and machines (FAIR principles; Wilkinson et al. 2016). To this end, we are improving the documentation of protocols for the transformation, curation, and analysis of EOL data, and associated scripts and software are made available to ensure reproducibility. Proper acknowledgement of contributors and tracking of credit through derived data products promote both open data sharing and the use of aggregated resources. By exposing unique identifiers for data products, people, and institutions, data providers and aggregators can stimulate the development of automated solutions for the creation of contribution metrics. Since different aspects of provenance will be significant depending on the intended data use, better standardization of contributor roles (e.g., author, compiler, publisher, funder) is needed, as well as more detailed attribution guidance for data users. Global scale biodiversity data resources should resolve into a graph, linking taxa, specimens, occurrences, attributes, localities, and ecological interactions, as well as human agents, publications and institutions. 
Two key data categories for ensuring rich connectivity in the graph will be taxonomic and trait data. This graph can be supported by existing data hubs, if they share identifiers and/or create mappings between them, using standards and sharing practices developed by the biodiversity data community. Versioned archives of the combined graph could be published at intervals to appropriate open data repositories, and open source tools and training provided for researchers to access the combined graph of biodiversity knowledge from all sources. To achieve this, good communication among data hubs will be needed. We will need to share information about preferred vocabularies and identifier management practices, and collaborate on identifier mappings.
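The identifier sharing and mapping described above can be sketched in a few lines. This is a minimal illustration with invented identifiers and records, not EOL's actual data model: two hubs key the same taxon by different local IDs, and a shared mapping lets their records merge into one graph node.

```python
# Hub A holds trait data; Hub B holds occurrence data. Both describe the
# same taxon under different local identifiers (all values hypothetical).
hub_a_traits = {"A:1012": {"trophic_guild": "herbivore"}}
hub_b_occurrences = {"B:77": {"locality": "Lake Balaton"}}

# A shared mapping resolves both local IDs to one canonical identifier.
id_mapping = {"A:1012": "urn:example:taxon:42",
              "B:77": "urn:example:taxon:42"}

def merge_hubs(*hubs):
    """Combine records from several hubs under canonical identifiers."""
    graph = {}
    for hub in hubs:
        for local_id, record in hub.items():
            canonical = id_mapping[local_id]
            graph.setdefault(canonical, {}).update(record)
    return graph

graph = merge_hubs(hub_a_traits, hub_b_occurrences)
# The merged node now links trait and occurrence data for the same taxon.
```

Without the mapping, the two records would remain disconnected islands, which is exactly the connectivity problem the combined graph is meant to solve.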

Author(s):  
Lyubomir Penev ◽  
Teodor Georgiev ◽  
Viktor Senderov ◽  
Mariya Dimitrova ◽  
Pavel Stoev

As one of the first advocates of open access and open data in the field of biodiversity publishing, Pensoft has adopted a multiple data publishing model, resulting in the ARPHA-BioDiv toolbox (Penev et al. 2017). ARPHA-BioDiv consists of several data publishing workflows and tools described in the Strategies and Guidelines for Publishing of Biodiversity Data and elsewhere: (1) data underlying research results are deposited in an external repository and/or published as supplementary file(s) to the article and then linked/cited in the article text; supplementary files are published under their own DOIs and bear their own citation details; (2) data deposited in trusted repositories and/or supplementary files are described in data papers, which may be submitted in text format or converted into manuscripts from Ecological Metadata Language (EML) metadata; (3) integrated narrative and data publishing is realised by the Biodiversity Data Journal, where structured data are imported into the article text from tables or via web services and downloaded/distributed from the published article; (4) data are published in structured, semantically enriched, full-text XMLs, so that several data elements can thereafter easily be harvested by machines; (5) Linked Open Data (LOD) are extracted from literature, converted into interoperable RDF triples in accordance with the OpenBiodiv-O ontology (Senderov et al. 2018) and stored in the OpenBiodiv Biodiversity Knowledge Graph.
The above-mentioned approaches are supported by a whole ecosystem of additional workflows and tools, for example: (1) pre-publication data auditing, involving both human and machine data quality checks (workflow 2); (2) web-service integration with data repositories and data centres, such as the Global Biodiversity Information Facility (GBIF), Barcode of Life Data Systems (BOLD), Integrated Digitized Biocollections (iDigBio), Data Observation Network for Earth (DataONE), Long Term Ecological Research (LTER), PlutoF, Dryad, and others (workflows 1, 2); (3) semantic markup of article texts in the TaxPub format, facilitating further extraction, distribution and re-use of sub-article elements and data (workflows 3, 4); (4) server-to-server import of specimen data from GBIF, BOLD, iDigBio and PlutoF into manuscript text (workflow 3); (5) automated conversion of EML metadata into data paper manuscripts (workflow 2); (6) export of Darwin Core Archives and automated deposition in GBIF (workflow 3); (7) submission of individual images and supplementary data under their own DOIs to the Biodiversity Literature Repository (BLR) (workflows 1-3); (8) conversion of key data elements from TaxPub articles and taxonomic treatments extracted by Plazi into RDF handled by OpenBiodiv (workflow 5).
These approaches represent different aspects of the prospective scholarly publishing of biodiversity data which, in combination with text and data mining (TDM) technologies for legacy literature (PDF) developed by Plazi, lay the ground for an entire data publishing ecosystem for biodiversity, supplying FAIR (Findable, Accessible, Interoperable and Reusable) data to several interoperable overarching infrastructures, such as GBIF, BLR, Plazi TreatmentBank and OpenBiodiv, and to various end users.
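The LOD workflow above can be illustrated with a tiny sketch of how a data element extracted from an article becomes subject-predicate-object triples. The URIs and predicate names here are invented for illustration and are not the actual OpenBiodiv-O terms:

```python
# Hypothetical sketch: a treatment extracted from a TaxPub article is
# expressed as RDF-style triples. Predicate names ("ex:...") are made up;
# the real workflow uses the OpenBiodiv-O ontology.

def to_triples(article_doi, treatment):
    article = f"https://doi.org/{article_doi}"
    return [
        (article, "ex:containsTreatment", treatment["id"]),
        (treatment["id"], "ex:aboutTaxon", treatment["taxon"]),
        (treatment["id"], "ex:publishedIn", article),
    ]

triples = to_triples("10.3897/example.1",
                     {"id": "ex:treatment-7", "taxon": "ex:Aus_bus"})
# Each triple can then be loaded into a triple store such as the
# OpenBiodiv Biodiversity Knowledge Graph.
```

The point of the triple form is that treatments, taxa and articles all become addressable nodes, so queries can traverse from literature to taxon and back.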


2019 ◽  
Vol 2 ◽  
Author(s):  
Lyubomir Penev

"Data ownership" is actually an oxymoron: because there can be no copyright (ownership) of facts or ideas, no data ownership rights or law exist. The term refers to various kinds of data protection instruments: Intellectual Property Rights (IPR) (mostly copyright) asserted to indicate some kind of data ownership, confidentiality clauses/rules, database right protection (in the European Union only), or personal data protection (GDPR) (Scassa 2018). Data protection is often realised via different mechanisms of "data hoarding", that is, withholding access to data for various reasons (Sieber 1989). Data hoarding, however, does not put the data into someone's ownership. Nonetheless, access to and re-use of data, and biodiversity data in particular, is hampered by technical, economic, sociological, legal and other factors, although there should be no formal legal provisions related to copyright that prevent anyone who needs the data from using them (Egloff et al. 2014, Egloff et al. 2017, see also the Bouchout Declaration). One of the best ways to provide access to data is to publish them so that the data creators and holders are credited for their efforts. As one of the pioneers in biodiversity data publishing, Pensoft has adopted a multiple-approach data publishing model, resulting in the ARPHA-BioDiv toolbox and in extensive Strategies and Guidelines for Publishing of Biodiversity Data (Penev et al. 2017a, Penev et al. 2017b). ARPHA-BioDiv consists of several data publishing workflows: (1) deposition of underlying data in an external repository and/or their publication as supplementary file(s) to the related article, which are then linked and/or cited in-text; supplementary files are published under their own DOIs to increase citability; (2) description of data in data papers after they have been deposited in trusted repositories and/or as supplementary files; the system allows data papers to be submitted either as plain text or converted into manuscripts from Ecological Metadata Language (EML) metadata; (3) import of structured data into the article text from tables or via web services and their subsequent download/distribution from the published article, as part of the integrated narrative and data publishing workflow realised by the Biodiversity Data Journal; (4) publication of data in structured, semantically enriched, full-text XMLs where data elements are machine-readable and easy to harvest; (5) extraction of Linked Open Data (LOD) from literature, which are then converted into interoperable RDF triples in accordance with the OpenBiodiv-O ontology (Senderov et al. 2018) and stored in the OpenBiodiv Biodiversity Knowledge Graph.
In combination with text and data mining (TDM) technologies for legacy literature (PDF) developed by Plazi, these approaches show different angles to the future of biodiversity data publishing and lay the foundations of an entire data publishing ecosystem in the field, while also supplying FAIR (Findable, Accessible, Interoperable and Reusable) data to several interoperable overarching infrastructures, such as the Global Biodiversity Information Facility (GBIF), the Biodiversity Literature Repository (BLR), Plazi TreatmentBank and OpenBiodiv, as well as to various end users.


Author(s):  
Liah Shonhe

The main focus of the study was to explore the practices of open data sharing in the agricultural sector, including establishing the research outputs concerning open data in agriculture. The study adopted a desktop research methodology based on literature review and bibliographic data from the Web of Science (WoS) database. Bibliometric indicators discussed include yearly productivity, most prolific authors, and leading countries. Study findings revealed that research activity in the field of agriculture and open access is very low. There were 36 OA articles and only 6 publications had an open data badge. Most researchers do not yet embrace the need to openly publish their data sets despite the availability of numerous open data repositories. Unfortunately, most African countries are still lagging behind in the management of agricultural open data. The study therefore recommends that researchers publish their research data sets as OA. African countries need to put more effort into establishing open data repositories and implementing the necessary policies to facilitate OA.
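The bibliometric tallies named above (yearly productivity, prolific authors, open data badges) are simple frequency counts. A minimal sketch on made-up records, standing in for the actual WoS export:

```python
from collections import Counter

# Invented sample records; fields mirror the indicators discussed above.
records = [
    {"year": 2018, "authors": ["Doe", "Smith"], "open_data_badge": False},
    {"year": 2019, "authors": ["Doe"], "open_data_badge": True},
    {"year": 2019, "authors": ["Lee", "Doe"], "open_data_badge": False},
]

per_year = Counter(r["year"] for r in records)            # yearly productivity
per_author = Counter(a for r in records for a in r["authors"])  # prolific authors
with_badge = sum(r["open_data_badge"] for r in records)   # open data badges
```

The same counts, run over the real bibliographic export, yield figures like the 36 OA articles and 6 badged publications reported in the study.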


Author(s):  
Remy Jomier ◽  
Remy Poncet ◽  
Noemie Michez

As part of the Biodiversity Information System on Nature and Landscapes (SINP), the French National Museum of Natural History was appointed to develop biodiversity data exchange standards, with the goal of sharing French marine and terrestrial data nationally, meeting domestic and European requirements, e.g., the Infrastructure for Spatial Information in Europe Directive (INSPIRE Directive, European Commission 2007). Data standards are now recognised as useful to improve and share biodiversity knowledge (e.g., species distribution) and play a key role in data valorisation (e.g., vulnerability assessment, conservation policy). For example, in order to fulfill reporting obligations within the Fauna and Flora Habitats Directive (European Commission 1992) and the Marine Strategy Framework Directive (European Commission 2008), information about taxa and habitat occurrences is required periodically, involving data exchange and compilation at a national scale. National and international data exchange standards are focused on species, and only a few solutions exist when there is a need to deal with habitat data. Darwin Core was built to fit species data exchange needs and contains only one habitat attribute, which leaves some leeway for transferring such information but is deemed one of the least standardized fields. Moreover, Darwin Core does not allow for the transfer of habitat-only data, as the scientific name of the taxon is mandatory. The SINP standard for habitats was developed by a dedicated working group, representative of biodiversity stakeholders in France. This standard focuses on core attributes that characterize habitat observation and monitoring. Interoperability remains to be achieved with the Darwin Core standard, or something similar on a world scale (e.g., Humboldt Core), as habitat data are regularly gathered irrespective of whether taxon occurrences are associated with them.
The results of the French initiative proved useful to compile and share data nationally, bringing together data providers that otherwise would have been excluded. However, at a global scale, it faces some challenges that still need to be fully addressed, interoperability being the main one. Regardless of the problems that remain to be solved, some lessons can be learnt from this effort. With the ultimate goal of making biodiversity data readily available, these lessons should be kept in mind for future initiatives. The presentation deals with how this work was undertaken and how the required elements could be integrated into a French national standard to allow for comprehensive habitat data reporting. It will present hypotheses as to what could be added to the Darwin Core to allow for a better understanding of habitats with at least one taxon attached (or not) to them.
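The constraint at the heart of the abstract above, that a Darwin Core-style occurrence record requires a scientific name while habitat is a single free-text field, can be made concrete with a small sketch. The field names follow Darwin Core; the validator itself is illustrative, not part of any standard:

```python
# In Darwin Core occurrence data, dwc:scientificName is required and
# habitat is one loosely standardized free-text term, so a habitat-only
# observation cannot be expressed on its own.

REQUIRED = {"scientificName"}

def valid_occurrence(record):
    """Illustrative check: does the record carry the mandatory fields?"""
    return REQUIRED.issubset(record)

taxon_record = {"scientificName": "Puccinellia limosa",
                "habitat": "alkali grassland"}
habitat_only = {"habitat": "alkali grassland"}  # no taxon attached
```

`valid_occurrence(taxon_record)` passes while `valid_occurrence(habitat_only)` fails, which is precisely why the SINP working group needed a separate habitat standard.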


Author(s):  
Néstor Fernández ◽  
Simon Ferrier ◽  
Laetitia M. Navarro ◽  
Henrique M. Pereira

Essential biodiversity variables (EBVs) are designed to support the detection and quantification of biodiversity change and to define priorities in biodiversity monitoring. Unlike most primary observations of biodiversity phenomena, EBV products should provide information readily available to produce policy-relevant biodiversity indicators, ideally at multiple spatial scales, from global to subnational. This information is typically complex to produce from a single set of data or type of observation, thus requiring approaches that integrate multiple sources of in situ and remote sensing (RS) data. Here we present an up-to-date EBV concept for biodiversity data integration and discuss the critical components of workflows for EBV production. We argue that open and reproducible workflows for data integration are critical to ensure traceability and reproducibility so that each EBV endures and can be updated as novel biodiversity models are adopted, new observation systems become available, and new data sets are incorporated. Fulfilling the EBV vision requires strengthening efforts to mobilize massive amounts of in situ biodiversity data that are not yet publicly available and taking full advantage of emerging RS technologies, novel biodiversity models, and informatics infrastructures, in alignment with the development of a globally coordinated system for biodiversity monitoring.
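One way to make the traceability argument above concrete is a workflow in which every integration step is a named, versioned function and the run records its own provenance. This is a minimal sketch under invented step names and placeholder computations, not the authors' actual workflow:

```python
# Each step logs (name, version) when it runs, so the EBV product carries
# a record of exactly which processing chain produced it and can be re-run
# when models or inputs change. Step names and maths are placeholders.

steps = []

def step(name, version):
    def wrap(fn):
        def inner(data):
            steps.append((name, version))
            return fn(data)
        return inner
    return wrap

@step("harmonize_in_situ", "1.0")
def harmonize(data):
    return [x / max(data) for x in data]  # toy normalisation

@step("fuse_with_rs", "0.3")
def fuse(data):
    return sum(data) / len(data)  # stand-in for a real fusion model

ebv_value = fuse(harmonize([2.0, 4.0]))
# steps == [("harmonize_in_situ", "1.0"), ("fuse_with_rs", "0.3")]
```

Because the provenance log is data, it can be published alongside the EBV product, which is what makes an update reproducible rather than a fresh, untraceable estimate.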


Author(s):  
A. Zlinszky ◽  
B. Deák ◽  
A. Kania ◽  
A. Schroiff ◽  
N. Pfeifer

Biodiversity is an ecological concept, which essentially involves a complex sum of several indicators. One widely accepted such set of indicators is prescribed for habitat conservation status assessment within Natura 2000, a continental-scale conservation programme of the European Union. Essential Biodiversity Variables are a set of indicators designed to be relevant for biodiversity and suitable for global-scale operational monitoring. Here we revisit a study of Natura 2000 conservation status mapping via airborne LIDAR that develops individual remote sensing-derived proxies for every parameter required by the Natura 2000 manual, from the perspective of developing regional-scale Essential Biodiversity Variables. Based on leaf-on and leaf-off point clouds (10 pt/m²) collected in an alkali grassland area, a set of data products were calculated at 0.5 × 0.5 m resolution. These represent various aspects of radiometric and geometric texture. A Random Forest machine learning classifier was developed to create fuzzy vegetation maps of classes of interest based on these data products. In the next step, either classification results or LIDAR data products were selected as proxies for individual Natura 2000 conservation status variables, and fine-tuned based on field references. These proxies showed adequate performance and were summarized to deliver Natura 2000 conservation status with 80% overall accuracy compared to field references. This study draws attention to the potential of LIDAR for regional-scale Essential Biodiversity Variables, and also holds implications for global-scale mapping. These are (i) the use of sensor data products together with habitat-level classification, (ii) the utility of seasonal data, including for non-seasonal variables such as grassland canopy structure, and (iii) the potential of fuzzy mapping-derived class probabilities as proxies for species presence and absence.
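Point (iii) above, class probabilities as presence/absence proxies, can be sketched without any remote sensing machinery. Here the per-pixel probabilities are hand-made stand-ins for Random Forest outputs, and the class names are invented:

```python
# A "fuzzy map": each pixel carries class-membership probabilities
# (here fabricated, standing in for classifier outputs). Reading one
# class's probability against a threshold yields a presence proxy.

pixels = {
    (0, 0): {"open_grassland": 0.8, "dense_tall_grass": 0.2},
    (0, 1): {"open_grassland": 0.3, "dense_tall_grass": 0.7},
}

def presence_proxy(pixels, cls, threshold=0.5):
    """Binary presence layer derived from fuzzy class probabilities."""
    return {xy: probs[cls] >= threshold for xy, probs in pixels.items()}

layer = presence_proxy(pixels, "open_grassland")
```

Keeping the probabilities rather than only the hard labels is what lets the same classification support several downstream variables at different thresholds.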


2018 ◽  
Vol 2 ◽  
pp. e26808
Author(s):  
Donald Hobern ◽  
Andrea Hahn ◽  
Tim Robertson

The success of Darwin Core and ABCD Schema as flexible standards for sharing specimen data and species occurrence records has enabled GBIF to aggregate around one billion data records. At the same time, other thematic, national or regional aggregators have developed a wide range of other data indexes and portals, many of which enrich the data by interpreting and normalising elements not currently handled by GBIF or by linking other data from geospatial layers, trait databases, etc. Unfortunately, although each of these aggregators has specific strengths and supports particular audiences, this diversification produces many weaknesses and deficiencies for data publishers and for data users, including: incomplete and inconsistent inclusion of relevant datasets; proliferation of record identifiers; inconsistent and bespoke workflows to interpret and standardise data; absence of any shared basis for linked open data and annotations; divergent data formats and APIs; lack of clarity around provenance and impact; etc. The time is ripe for the global community to review these processes. From a technical standpoint, it would be feasible to develop a shared, integrated pipeline which harvested, validated and normalised all relevant biodiversity data records on behalf of all stakeholders. Such a system could build on TDWG expertise to standardise data checks and all stages in data transformation. It could incorporate a modular structure that allowed thematic, national or regional networks to generate additional data elements appropriate to the needs of their users, but for all of these elements to remain part of a single record with a single identifier, facilitating a much more rigorous approach to linked open data. Most of the other issues we currently face around fitness-for-use, predictability and repeatability, transparency and provenance could be supported much more readily under such a model. 
The key challenges that would need to be overcome would be around social factors, particularly to deliver a flexible and appropriate governance model and to allow research networks, national agencies, etc. to embed modular components within a shared workflow. Given the urgent need to improve data management to support Essential Biodiversity Variables and to deliver an effective global virtual natural history collection, we should review these challenges and seek to establish a data management and aggregation architecture that will support us for the coming decades.
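The modular architecture proposed above, one core pipeline assigning a single identifier per record, with thematic networks contributing enrichment modules to that same record, can be sketched as follows. The API here is hypothetical, invented purely to illustrate the design:

```python
import uuid

# Core pipeline: harvest raw records and mint exactly one identifier each.
def harvest(raw_records):
    return [{"id": str(uuid.uuid4()), "core": r, "extensions": {}}
            for r in raw_records]

# Modular enrichment: a thematic or national network adds its elements to
# the existing record instead of creating a new copy with a new identifier.
def add_module(records, name, enrich):
    for rec in records:
        rec["extensions"][name] = enrich(rec["core"])
    return records

records = harvest([{"scientificName": "Aus bus", "country": "FR"}])
add_module(records, "geo",
           lambda core: {"region": "Europe"} if core["country"] == "FR" else {})
```

Because every module writes into the same record under the same identifier, linked open data and annotations have a single stable target, which is the property the proliferation of per-aggregator identifiers currently destroys.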


2021 ◽  
Vol 6 ◽  
pp. 355
Author(s):  
Helen Buckley Woods ◽  
Stephen Pinfield

Background: Numerous mechanisms exist to incentivise researchers to share their data. This scoping review aims to identify and summarise evidence of the efficacy of different interventions to promote open data practices and provide an overview of current research. Methods: This scoping review is based on data identified from Web of Science and LISTA, limited from 2016 to 2021. A total of 1128 papers were screened, with 38 items being included. Items were selected if they focused on designing or evaluating an intervention or presenting an initiative to incentivise sharing. Items comprised a mixture of research papers, opinion pieces and descriptive articles. Results: Seven major themes in the literature were identified: publisher/journal data sharing policies, metrics, software solutions, research data sharing agreements in general, open science ‘badges’, funder mandates, and initiatives. Conclusions: A number of key messages for data sharing include: the need to build on existing cultures and practices, meeting people where they are and tailoring interventions to support them; the importance of publicising and explaining the policy/service widely; the need to have disciplinary data champions to model good practice and drive cultural change; the requirement to resource interventions properly; and the imperative to provide robust technical infrastructure and protocols, such as labelling of data sets, use of DOIs, data standards and use of data repositories.



Sensors ◽  
2021 ◽  
Vol 21 (15) ◽  
pp. 5204
Author(s):  
Anastasija Nikiforova

Nowadays, governments launch open government data (OGD) portals that provide data that can be accessed and used by everyone for their own needs. Although the potential economic value of open (government) data is estimated in the millions and billions, not all open data are reused. Moreover, the open (government) data initiative as well as users' intent for open (government) data are changing continuously, and today, in line with IoT and smart city trends, real-time data and sensor-generated data are of greater interest to users. These "smarter" open (government) data are also considered to be one of the crucial drivers for the sustainable economy, and might have an impact on information and communication technology (ICT) innovation and become a creativity bridge in developing a new ecosystem in Industry 4.0 and Society 5.0. The paper inspects OGD portals of 60 countries in order to understand the correspondence of their content to the Society 5.0 expectations. The paper reports on how well countries provide these data, focusing on some open (government) data success facilitating factors for both the portal in general and data sets of interest in particular. The presence of "smarter" data, their level of accessibility, availability, currency and timeliness, as well as support for users, are analyzed. A list of the most competitive countries by data category is provided. This makes it possible to understand which OGD portals react to users' needs and to Industry 4.0 and Society 5.0 requests by opening and updating data for further potential reuse, which is essential in the digital, data-driven world.
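The per-dataset assessment described above (accessibility, availability, currency/timeliness, "smarter" real-time data) amounts to scoring dataset metadata against a set of criteria. The criteria flags and the equal weighting below are invented for illustration and do not reproduce the paper's actual evaluation protocol:

```python
from datetime import date

def dataset_score(meta, today=date(2021, 1, 1)):
    """Toy score: one point per satisfied criterion (weights invented)."""
    score = 0
    score += meta.get("machine_readable", False)   # accessibility
    score += meta.get("api_available", False)      # availability
    days_old = (today - meta["last_updated"]).days
    score += days_old <= 30                        # currency / timeliness
    score += meta.get("real_time", False)          # "smarter" data
    return score

meta = {"machine_readable": True, "api_available": True,
        "last_updated": date(2020, 12, 20), "real_time": False}
```

Aggregating such scores per country and per data category is what yields the kind of competitiveness ranking the paper reports.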

