Biodiversity Knowledge Graphs: Time to move up a gear!

Author(s):  
Franck Michel ◽  
Antonia Ettorre ◽  
Catherine Faron ◽  
Julien Kaplan ◽  
Olivier Gargominy

Harnessing worldwide biodiversity data requires integrating myriad pieces of information, often sparse and incomplete, into a global, coherent data space. To do so, projects like the Global Biodiversity Information Facility, Catalog of Life and Encyclopedia of Life have set up platforms that gather, consolidate, and centralize billions of records from multiple data sources. This approach lowers the entry barrier for scientists willing to consume aggregated biodiversity data but tends to build silos that hamper cross-platform interoperability. The Web of Data embodies a different approach underpinned by the Linked Open Data (LOD) principles (Heath and Bizer 2011). These principles bring about the building of a large, distributed, cross-domain knowledge graph (KG), wherein data description relies on vocabularies with shared, formal, machine-processable semantics. So far, however, little biodiversity data have been published this way. Early efforts focused primarily on taxonomic registers, such as NCBI, VTO and AGROVOC. More recent efforts have started paving the way for the publication of more diverse biodiversity KGs (Page 2019, Penev et al. 2019, Michel et al. 2017). Today, we believe that it is time for more biodiversity data producers to join in and start publishing connected KGs spanning a much broader set of domains, far beyond just taxonomic registers. In this talk, we wish to present an ongoing endeavor in line with this vision. In a previous work, we published TAXREF-LD (Michel et al. 2017), a LOD representation of the French taxonomic register developed and maintained by the French National Museum of Natural History. We modeled nomenclatural information as a thesaurus of scientific names, taxonomic information as an ontology of classes denoting taxa, and additional information such as ranks and vernacular names.
Recently, we have extended the scope of TAXREF-LD to represent and interlink data as varied as geographic locations, species interactions, development stages, trophic levels, as well as conservation, biogeographic, and legal status (regulations, protections, etc.). We put specific effort into working out a model that accurately accounts for the semantics of the data while respecting knowledge engineering practices. For instance, a common design shortcoming is to attach all information as properties of a taxon. This is a rightful choice for some properties, like a scientific name or conservation status, but properties that actually pertain to biological individuals themselves, e.g. habitat and trophic level, are better attached to class members. With the presentation of this work, we wish to advance the discussion about integration scenarios based on knowledge graphs with the different biodiversity data stakeholders.
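The modeling distinction described above — taxon-level properties asserted on the taxon concept, individual-level properties attached to class members — can be sketched with plain RDF-style triples. This is a minimal illustration using Python tuples; all URIs and property names are invented placeholders, not the actual TAXREF-LD vocabulary:

```python
# Sketch of the taxon-as-class modeling choice. Triples are plain
# (subject, predicate, object) tuples; every "ex:" name is a placeholder.

TAXON = "ex:Delphinus_delphis"  # a taxon, modeled as an OWL class

graph = set()

# Properties of the taxon concept itself: a scientific name and a
# conservation status rightly belong to the taxon.
graph.add((TAXON, "rdf:type", "owl:Class"))
graph.add((TAXON, "ex:scientificName", "Delphinus delphis"))
graph.add((TAXON, "ex:conservationStatus", "ex:LeastConcern"))

# Properties of biological individuals (e.g. habitat) are instead attached
# to members of the class, here via a class-level restriction rather than
# a direct property of the taxon concept.
graph.add((TAXON, "rdfs:subClassOf",
           ("owl:Restriction", "ex:hasHabitat", "ex:Marine")))

def taxon_level_properties(g, taxon):
    """Return the "ex:" predicates asserted directly on the taxon concept."""
    return {p for (s, p, o) in g if s == taxon and p.startswith("ex:")}
```

In a real RDF store the restriction would be a blank node with `owl:onProperty` and `owl:someValuesFrom` statements; the tuple stands in for that structure here.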

2021 ◽  
Vol 118 (6) ◽  
pp. e2018093118
Author(s):  
J. Mason Heberling ◽  
Joseph T. Miller ◽  
Daniel Noesgaard ◽  
Scott B. Weingart ◽  
Dmitry Schigel

The accessibility of global biodiversity information has surged in the past two decades, notably through widespread funding initiatives for museum specimen digitization and the emergence of large-scale public participation in community science. Effective use of these data requires the integration of disconnected datasets, but the scientific impacts of consolidated biodiversity data networks have not yet been quantified. To determine whether data integration enables novel research, we carried out a quantitative text analysis and bibliographic synthesis of >4,000 studies published from 2003 to 2019 that use data mediated by the world's largest biodiversity data network, the Global Biodiversity Information Facility (GBIF). Data available through GBIF increased 12-fold since 2007, a trend matched by global data use, with roughly two publications using GBIF-mediated data per day in 2019. Data-use patterns were diverse by authorship, geographic extent, taxonomic group, and dataset type. Despite facilitating global authorship, legacies of colonial science remain. Studies involving species distribution modeling were most prevalent (31% of literature surveyed) but recently shifted in focus from theory to application. Topic prevalence was stable across the 17-y period for some research areas (e.g., macroecology), yet other topics proportionately declined (e.g., taxonomy) or increased (e.g., species interactions, disease). Although centered on biological subfields, GBIF-enabled research extends surprisingly across all major scientific disciplines. Biodiversity data mobilization through global data aggregation has enabled basic and applied research use at temporal, spatial, and taxonomic scales otherwise not possible, launching biodiversity sciences into a new era.


Author(s):  
Dag Endresen ◽  
Armine Abrahamyan ◽  
Akobir Mirzorakhimov ◽  
Andreas Melikyan ◽  
Brecht Verstraete ◽  
...  

BioDATA (Biodiversity Data for Internationalisation in Higher Education) is an international project to develop and deliver biodiversity data training for undergraduate and postgraduate students from Armenia, Belarus, Tajikistan, and Ukraine. By training early career (student) biodiversity scholars, we aim to make the current academic and educational biodiversity landscape more open-data-friendly. Professional practitioners (researchers, museum curators, and collection managers involved in data publishing) from each country were also invited to join the project as assistant teachers (mentors). The project is developed by the Research School in Biosystematics - ForBio and the Norwegian GBIF-node, both at the Natural History Museum of the University of Oslo, in collaboration with the Secretariat of the Global Biodiversity Information Facility (GBIF) and partners from each of the target countries. The teaching material is based on the GBIF curriculum for data mobilization, and all students will have the opportunity to gain the respective GBIF certification. All materials are made freely available for reuse, and even in this very early phase of the project, we have already seen the first successful reuse of teaching materials among the project partners. The first BioDATA training event was organized in Minsk (Belarus) in February 2019 with the objective of training a minimum of four mentors from each target country. The mentor-trainees from this event will help us to teach the course to students in their home country together with teachers from the project team. BioDATA mentors will have the opportunity to gain GBIF certification as expert mentors, which will open opportunities to contribute to future training events in the larger GBIF network. The BioDATA training events for the students will take place in Dushanbe (Tajikistan) in June 2019, in Minsk (Belarus) in November 2019, in Yerevan (Armenia) in April 2020, and in Kiev (Ukraine) in October 2020.
Students from each country are invited to express their interest to participate by contacting their national project partner. We will close the project with a final symposium at the University of Oslo in March 2021. The project is funded by the Norwegian Agency for International Cooperation and Quality Enhancement in Higher Education (DIKU).


2018 ◽  
Vol 2 ◽  
pp. e27087
Author(s):  
Donald Hobern ◽  
Andrea Hahn ◽  
Tim Robertson

For more than a decade, the biodiversity informatics community has recognised the importance of stable resolvable identifiers to enable unambiguous references to data objects and the associated concepts and entities, including museum/herbarium specimens and, more broadly, all records serving as evidence of species occurrence in time and space. Early efforts built on the Darwin Core institutionCode, collectionCode and catalogueNumber terms, treated as a triple and expected to uniquely identify a specimen. Following a review of current technologies for globally unique identifiers, TDWG adopted Life Science Identifiers (LSIDs) (Pereira et al. 2009). Unfortunately, the key stakeholders in the LSID consortium soon withdrew support for the technology, leaving TDWG committed to a moribund technology. Subsequently, publishers of biodiversity data have adopted a range of technologies to provide unique identifiers, including (among others) HTTP Universal Resource Identifiers (URIs), Universal Unique Identifiers (UUIDs), Archival Resource Keys (ARKs), and Handles. Each of these technologies has merit, but they do not provide consistent guarantees of persistence or resolvability. More importantly, the heterogeneity of these solutions hampers delivery of services that can treat all of these data objects as part of a consistent linked-open-data domain. The geoscience community has established the System for Earth Sample Registration (SESAR) that enables collections to publish standard metadata records for their samples and for each of these to be associated with an International Geo Sample Number (IGSN http://www.geosamples.org/igsnabout). IGSNs follow a standard format, distribute responsibility for uniqueness between SESAR and the publishing collections, and support resolution via HTTP URI or Handles. Each IGSN resolves to a standard metadata page, roughly equivalent in detail to a Darwin Core specimen record.
The standardisation of identifiers has allowed the community to secure support from some journal publishers for promotion and use of IGSNs within articles. The biodiversity informatics community encompasses a much larger number of publishers and greater pre-existing variation in identifier formats. Nevertheless, it would be possible to deliver a shared global identifier scheme with the same features as IGSNs by building off the aggregation services offered by the Global Biodiversity Information Facility (GBIF). The GBIF data index includes normalised Darwin Core metadata for all data records from registered data sources and could serve as a platform for resolution of HTTP URIs and/or Handles for all specimens and for all occurrence records. The most significant trade-off requiring consideration would be between autonomy for collections and other publishers in how they format identifiers within their own data and the benefits that may arise from greater consistency and predictability in the form of resolvable identifiers.
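The two identifier strategies contrasted above — publisher-local Darwin Core triplets versus globally unique, resolvable identifiers — can be sketched briefly. The resolver base URL below is hypothetical, for illustration only:

```python
import uuid

def darwin_core_triplet(institution, collection, catalogue):
    """Legacy approach: institutionCode:collectionCode:catalogueNumber,
    expected (but not guaranteed) to uniquely identify a specimen."""
    return f"{institution}:{collection}:{catalogue}"

def mint_resolvable_uri(base="https://resolver.example.org/specimen/"):
    """Aggregator-style approach: a globally unique HTTP URI built on a
    UUID, independent of any publisher's local codes, so that a central
    index can resolve it to normalised metadata. The base URL is a
    made-up placeholder, not a real service."""
    return base + str(uuid.uuid4())

triplet = darwin_core_triplet("MNHN", "IM", "2013-12345")
uri = mint_resolvable_uri()
```

The trade-off discussed above shows up directly in the code: the triplet preserves publisher autonomy but collides whenever codes are reused or change, while the minted URI is predictable and resolvable but requires a central registration step.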


Author(s):  
Daniel Noesgaard

The work required to collect, clean and publish biodiversity datasets is significant, and those who do it deserve recognition for their efforts. Researchers publish studies using open biodiversity data available from GBIF—the Global Biodiversity Information Facility—at a rate of about two papers a day. These studies cover areas such as macroecology, evolution, climate change, and invasive alien species, relying on data sharing by hundreds of publishing institutions and the curatorial work of thousands of individual contributors. With more than 90 per cent of these datasets licensed under Creative Commons Attribution licenses (CC BY and CC BY-NC), data users are required to credit the dataset providers. For GBIF, it is crucial to link these scientific uses to the underlying data as one means of demonstrating the value and impact of open science, while seeking to ensure attribution of individual, organizational and national contributions to the global pool of open data about biodiversity. Every single authenticated download of occurrence records from GBIF.org is issued a unique Digital Object Identifier (DOI). These DOIs each resolve to a landing page that contains: the search parameters used to generate the download; a quantitative map of the underlying datasets that contributed to the download; and a simple citation to be included in works that rely on the data. When used properly by authors and deposited correctly by journals in the article metadata, the DOI citation establishes a direct link between a scientific paper and the underlying data. Crossref—the main DOI Registration Agency for academic literature—exposes such links in Event Data, which can be consumed programmatically to report direct use of individual datasets.
GBIF also records these links, permanently preserving the download archives while exposing a citation count on download landing pages that is also summarized on the landing pages of each contributing dataset and publisher. The citation counts can be expanded to produce lists of all papers unambiguously linked to use of specific datasets. In 2018, just 15 per cent of papers based on GBIF-mediated data used DOIs to cite or acknowledge the datasets used in the studies. To promote crediting of data publishers and digital recognition of data sharing, the GBIF Secretariat has been reaching out systematically to authors and publishers since April 2018 whenever a paper fails to include a proper data citation. While publishing lags may hinder immediate effects, preliminary findings suggest that uptake is improving, as the number of papers with DOI data citations during the first part of 2019 is up more than 60 per cent compared to 2018. Focusing on the value of linking scientific publications and data, this presentation will explore the potential for establishing automatic linkage through DOI metadata while demonstrating efforts to improve metrics of data use and attribution of data providers through outreach campaigns to authors and journal publishers.
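The roll-up from paper-to-download DOI links into per-dataset citation counts can be sketched as follows. All DOIs and dataset keys here are fabricated placeholders; the real linkage flows through Crossref Event Data and GBIF's own records:

```python
from collections import Counter

# Each authenticated download DOI maps to the datasets it drew records from.
download_datasets = {
    "10.15468/dl.aaa111": ["dataset-A", "dataset-B"],
    "10.15468/dl.bbb222": ["dataset-B"],
}

# Papers citing a download DOI, as deposited in article metadata.
paper_citations = [
    ("10.1000/paper1", "10.15468/dl.aaa111"),
    ("10.1000/paper2", "10.15468/dl.aaa111"),
    ("10.1000/paper3", "10.15468/dl.bbb222"),
]

def dataset_citation_counts(citations, downloads):
    """Credit every dataset that contributed to a cited download."""
    counts = Counter()
    for _paper, download_doi in citations:
        for dataset in downloads.get(download_doi, []):
            counts[dataset] += 1
    return counts

counts = dataset_citation_counts(paper_citations, download_datasets)
```

Because a single download DOI fans out to every contributing dataset, one properly deposited citation credits all upstream providers at once, which is the attribution mechanism the abstract describes.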


Author(s):  
Nora Escribano ◽  
David Galicia ◽  
Arturo H. Ariño

Building on the development of Biodiversity Informatics, the Global Biodiversity Information Facility (GBIF) undertook the task of enabling access to the world's wealth of biodiversity data via the Internet. To date, GBIF has become, in many respects, the most extensive biodiversity information exchange infrastructure in the world, opening up a full range of possibilities for science. Science has benefited from such access to biodiversity data in research areas ranging from the effects of environmental change on biodiversity to the spread of invasive species, among many others. As of this writing, more than 7,000 published items (scientific papers, reviews, conference proceedings) have been indexed in the GBIF Secretariat's literature tracking programme. On the basis of this database, we will present trends in GBIF users' behaviour over time regarding openness, social structure, and other features associated with such scientific production: what is the measurable impact of research using GBIF data? How is the GBIF community of users growing? Is the science made with, and enabled by, open data, actually open? Mapping GBIF users' choices will show how biodiversity research is evolving through time, synthesising past and current priorities of this community in an attempt to forecast whether summer—or winter—is coming.


2021 ◽  
Vol 13 (5) ◽  
pp. 124
Author(s):  
Jiseong Son ◽  
Chul-Su Lim ◽  
Hyoung-Seop Shim ◽  
Ji-Sun Kang

Despite the development of various technologies and systems using artificial intelligence (AI) to solve disaster-related problems, difficult challenges are still being encountered. Data are the foundation for solving diverse disaster problems using AI, big data analysis, and so on. Therefore, we must focus on these various data. Disaster data are domain-specific by disaster type, heterogeneous, and lack interoperability. In particular, in the case of open data related to disasters, the source and format of the data differ because the data are collected by various organizations. Moreover, the vocabularies used in each domain are inconsistent. This study proposes a knowledge graph to resolve the heterogeneity among various disaster data and provide interoperability among domains. Among disaster domains, we describe a knowledge graph for flooding disasters using Korean open datasets and cross-domain knowledge graphs. Furthermore, the proposed knowledge graph is used to help solve and manage disaster problems.
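The vocabulary inconsistency described above is the core problem a knowledge graph addresses: local terms from different agencies must be aligned onto shared concepts. A minimal sketch of such an alignment, with entirely invented dataset names, terms, and records:

```python
# Stand-in for the term alignment a knowledge graph provides: each
# (source dataset, local term) pair maps to a shared vocabulary term.
# All names below are fabricated for illustration.
shared_vocabulary = {
    ("rainfall_agency", "precip_mm"): "ex:precipitationAmount",
    ("river_agency", "rain"): "ex:precipitationAmount",
    ("river_agency", "stage_cm"): "ex:waterLevel",
}

def harmonize(source, record):
    """Rename a record's keys into the shared vocabulary where a mapping
    exists; unmapped keys are kept as-is for later curation."""
    return {shared_vocabulary.get((source, key), key): value
            for key, value in record.items()}

# Two heterogeneous flooding-related records become comparable once
# harmonized, despite differing local vocabularies.
merged = [
    harmonize("rainfall_agency", {"precip_mm": 12.5}),
    harmonize("river_agency", {"rain": 11.9, "stage_cm": 340}),
]
```

A real knowledge graph would encode these mappings as ontology axioms (e.g. `owl:equivalentProperty`) rather than a lookup table, but the interoperability gain is the same: queries over the shared terms reach all sources.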


2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Anna Åkesson ◽  
Alva Curtsdotter ◽  
Anna Eklöf ◽  
Bo Ebenman ◽  
Jon Norberg ◽  
...  

Eco-evolutionary dynamics are essential in shaping the biological response of communities to ongoing climate change. Here we develop a spatially explicit eco-evolutionary framework which features more detailed species interactions, integrating evolution and dispersal. We include species interactions within and between trophic levels, and additionally, we incorporate the feature that species' interspecific competition might change due to increasing temperatures and affect the impact of climate change on ecological communities. Our modeling framework captures previously reported ecological responses to climate change, and also reveals two key results. First, interactions between trophic levels as well as temperature-dependent competition within a trophic level mitigate the negative impact of climate change on biodiversity, emphasizing the importance of understanding biotic interactions in shaping climate change impact. Second, our trait-based perspective reveals a strong positive relationship between the within-community variation in preferred temperatures and the capacity to respond to climate change. Temperature-dependent competition consistently results both in higher trait variation and more responsive communities to altered climatic conditions. Our study demonstrates the importance of species interactions in an eco-evolutionary setting, further expanding our knowledge of the interplay between ecological and evolutionary processes.
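The trait-based ingredients mentioned above can be illustrated generically. This is not the authors' actual model, only a common sketch in which competition between two species weakens with the distance between their preferred temperatures (a Gaussian kernel of assumed width sigma), and the community's capacity to respond is proxied by the variance of those preferred temperatures:

```python
import math

def competition(z_i, z_j, sigma=2.0):
    """Illustrative temperature-dependent competition coefficient:
    strongest between species with similar preferred temperatures
    z_i, z_j, decaying as a Gaussian of assumed width sigma."""
    return math.exp(-((z_i - z_j) ** 2) / (2 * sigma ** 2))

def trait_variance(temps):
    """Within-community variance of preferred temperatures, the quantity
    the abstract links to the capacity to respond to climate change."""
    mean = sum(temps) / len(temps)
    return sum((t - mean) ** 2 for t in temps) / len(temps)

# A hypothetical four-species community (preferred temperatures in degrees C).
community = [14.0, 16.0, 18.5, 21.0]
alpha = [[competition(zi, zj) for zj in community] for zi in community]
var = trait_variance(community)
```

Under this sketch, self-competition is maximal (alpha[i][i] = 1) and a community with higher trait variance spreads competitive pressure across a wider temperature range, which is one intuition behind the reported variance-responsiveness relationship.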


Author(s):  
Lyubomir Penev ◽  
Teodor Georgiev ◽  
Viktor Senderov ◽  
Mariya Dimitrova ◽  
Pavel Stoev

As one of the first advocates of open access and open data in the field of biodiversity publishing, Pensoft has adopted a multiple data publishing model, resulting in the ARPHA-BioDiv toolbox (Penev et al. 2017). ARPHA-BioDiv consists of several data publishing workflows and tools described in the Strategies and Guidelines for Publishing of Biodiversity Data and elsewhere:

1. Data underlying research results are deposited in an external repository and/or published as supplementary file(s) to the article and then linked/cited in the article text; supplementary files are published under their own DOIs and bear their own citation details.
2. Data are deposited in trusted repositories and/or supplementary files and described in data papers; data papers may be submitted in text format or converted into manuscripts from Ecological Metadata Language (EML) metadata.
3. Integrated narrative and data publishing is realised by the Biodiversity Data Journal, where structured data are imported into the article text from tables or via web services and downloaded/distributed from the published article.
4. Data are published in structured, semantically enriched, full-text XMLs, so that several data elements can thereafter easily be harvested by machines.
5. Linked Open Data (LOD) are extracted from literature, converted into interoperable RDF triples in accordance with the OpenBiodiv-O ontology (Senderov et al. 2018), and stored in the OpenBiodiv Biodiversity Knowledge Graph.

The above-mentioned approaches are supported by a whole ecosystem of additional workflows and tools, for example: (1) pre-publication data auditing, involving both human and machine data quality checks (workflow 2); (2) web-service integration with data repositories and data centres, such as the Global Biodiversity Information Facility (GBIF), Barcode of Life Data Systems (BOLD), Integrated Digitized Biocollections (iDigBio), Data Observation Network for Earth (DataONE), Long Term Ecological Research (LTER), PlutoF, Dryad, and others (workflows 1, 2); (3) semantic markup of the article texts in the TaxPub format, facilitating further extraction, distribution and re-use of sub-article elements and data (workflows 3, 4); (4) server-to-server import of specimen data from GBIF, BOLD, iDigBio and PlutoF into manuscript text (workflow 3); (5) automated conversion of EML metadata into data paper manuscripts (workflow 2); (6) export of Darwin Core Archives and automated deposition in GBIF (workflow 3); (7) submission of individual images and supplementary data under their own DOIs to the Biodiversity Literature Repository, BLR (workflows 1-3); (8) conversion of key data elements from TaxPub articles and taxonomic treatments extracted by Plazi into RDF handled by OpenBiodiv (workflow 5).
These approaches represent different aspects of the prospective scholarly publishing of biodiversity data which, in combination with text and data mining (TDM) technologies for legacy literature (PDF) developed by Plazi, lay the groundwork for an entire data publishing ecosystem for biodiversity, supplying FAIR (Findable, Accessible, Interoperable and Reusable) data to several interoperable overarching infrastructures, such as GBIF, BLR, Plazi TreatmentBank, OpenBiodiv, and various end users.
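One of the workflows above, the automated conversion of EML metadata into a data paper manuscript, can be sketched with a toy example. The EML snippet and the output fields are greatly simplified and invented for illustration; real EML documents and ARPHA manuscripts are far richer:

```python
import xml.etree.ElementTree as ET

# A tiny, made-up EML-like fragment standing in for a dataset's metadata.
eml = """<eml>
  <dataset>
    <title>Freshwater molluscs of the Loire basin</title>
    <creator><individualName><surName>Dupont</surName></individualName></creator>
    <abstract>Occurrence records of freshwater molluscs.</abstract>
  </dataset>
</eml>"""

def eml_to_manuscript(eml_text):
    """Map metadata elements onto data-paper manuscript sections:
    dataset title -> article title, creators -> authors, and so on."""
    dataset = ET.fromstring(eml_text).find("dataset")
    return {
        "title": dataset.findtext("title"),
        "authors": [s.text for s in dataset.iter("surName")],
        "abstract": dataset.findtext("abstract"),
    }

manuscript = eml_to_manuscript(eml)
```

The appeal of this workflow is that metadata already curated for a repository (e.g. a GBIF dataset) seeds the manuscript, so the author edits a draft rather than starting from a blank page.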


Author(s):  
Erica Krimmel ◽  
Austin Mast ◽  
Deborah Paul ◽  
Robert Bruhn ◽  
Nelson Rios ◽  
...  

Genomic evidence suggests that the causative virus of COVID-19 (SARS-CoV-2) was introduced to humans from horseshoe bats (family Rhinolophidae) (Andersen et al. 2020) and that species in this family as well as in the closely related Hipposideridae and Rhinonycteridae families are reservoirs of several SARS-like coronaviruses (Gouilh et al. 2011). Specimens collected over the past 400 years and curated by natural history collections around the world provide an essential reference as we work to understand the distributions, life histories, and evolutionary relationships of these bats and their viruses. While the importance of biodiversity specimens to emerging infectious disease research is clear, empowering disease researchers with specimen data is a relatively new goal for the collections community (DiEuliis et al. 2016). Recognizing this, a team from Florida State University is collaborating with partners at GEOLocate, Bionomia, University of Florida, the American Museum of Natural History, and Arizona State University to produce a deduplicated, georeferenced, vetted, and versioned data product of the world's specimens of horseshoe bats and relatives for researchers studying COVID-19. The project will serve as a model for future rapid data product deployments about biodiversity specimens. The project underscores the value of biodiversity data aggregators iDigBio and the Global Biodiversity Information Facility (GBIF), which are sources for 58,617 and 79,862 records, respectively, as of July 2020, of horseshoe bat and relative specimens held by over one hundred natural history collections. Although much of the specimen-based biodiversity data served by iDigBio and GBIF is high quality, it can be considered raw data and therefore often requires additional wrangling, standardizing, and enhancement to be fit for specific applications. 
The project will create efficiencies for the coronavirus research community by producing an enhanced, research-ready data product, which will be versioned and published through Zenodo, an open-access repository (see doi.org/10.5281/zenodo.3974999). In this talk, we highlight lessons learned from the initial phases of the project, including deduplicating specimen records, standardizing country information, and enhancing taxonomic information. We also report on our progress to date, related to enhancing information about agents (e.g., collectors or determiners) associated with these specimens, and to georeferencing specimen localities. We also seek to explore how far the added agent information (i.e., ORCID iDs and Wikidata Q identifiers) can inform our georeferencing efforts and support crediting those who collected and identified the specimens. The project will georeference approximately one third of our specimen records, based on those lacking geospatial coordinates but containing textual locality descriptions. We furthermore provide an overview of our holistic approach to enhancing specimen records, which we hope will maximize the value of the bat specimens at the center of what has been recently termed the "extended specimen network" (Lendemer et al. 2020). The centrality of the physical specimen in the network reinforces the importance of archived materials for reproducible research. Recognizing this, we view the collections providing data to iDigBio and GBIF as essential partners, as we expect that they will be responsible for the long-term management of enhanced data associated with the physical specimens they curate. We hope that this project can provide a model for better facilitating the reintegration of enhanced data back into local specimen data management systems.
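Two of the wrangling steps mentioned above, deduplicating specimen records and standardizing country strings, can be sketched as follows. The records, identifiers, and synonym table are fabricated examples, not data from the project:

```python
# Illustrative synonym table for country standardization.
COUNTRY_SYNONYMS = {
    "USA": "United States",
    "U.S.A.": "United States",
    "Viet Nam": "Vietnam",
}

# Fabricated records, as they might arrive from different aggregators:
# the same AMNH specimen appears twice with differing country strings.
records = [
    {"institutionCode": "AMNH", "catalogNumber": "M-12345", "country": "Viet Nam"},
    {"institutionCode": "AMNH", "catalogNumber": "M-12345", "country": "Vietnam"},
    {"institutionCode": "UF", "catalogNumber": "7890", "country": "USA"},
]

def standardize(record):
    """Return a copy with the country string mapped to its standard form."""
    rec = dict(record)
    rec["country"] = COUNTRY_SYNONYMS.get(rec["country"], rec["country"])
    return rec

def deduplicate(recs):
    """Keep one record per (institutionCode, catalogNumber) key,
    standardizing countries first so duplicates collapse cleanly."""
    seen = {}
    for rec in map(standardize, recs):
        key = (rec["institutionCode"], rec["catalogNumber"])
        seen.setdefault(key, rec)
    return list(seen.values())

clean = deduplicate(records)
```

Real deduplication across iDigBio and GBIF is fuzzier than an exact key match (codes vary, catalog numbers get reformatted), but the ordering of steps, normalize then collapse, is the point of the sketch.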


Author(s):  
Gil Nelson ◽  
Deborah L Paul

Integrated Digitized Biocollections (iDigBio) is the United States' (US) national resource and coordinating center for biodiversity specimen digitization and mobilization. It was established in 2011 through the US National Science Foundation's (NSF) Advancing Digitization of Biodiversity Collections (ADBC) program, an initiative that grew from a working group of museum-based and other biocollections professionals working in concert with NSF to make collections' specimen data accessible for science, education, and public consumption. The working group, Network Integrated Biocollections Alliance (NIBA), released two reports (Beach et al. 2010, American Institute of Biological Sciences 2013) that provided the foundation for iDigBio and ADBC. iDigBio is restricted in focus to the ingestion of data generated by public, non-federal museum and academic collections. Its focus is on specimen-based (as opposed to observational) occurrence records. iDigBio currently serves about 118 million transcribed specimen-based records and 29 million specimen-based media records from approximately 1600 datasets. These digital objects have been contributed by about 700 collections representing nearly 400 institutions, making iDigBio the most comprehensive biodiversity data aggregator in the US. Currently, iDigBio, DiSSCo (Distributed System of Scientific Collections), GBIF (Global Biodiversity Information Facility), and the Atlas of Living Australia (ALA) are collaborating on a global framework to harmonize technologies towards standardizing and synchronizing ingestion strategies, data models and standards, cyberinfrastructure, APIs (application programming interface), specimen record identifiers, etc. in service to a developing consolidated global data product that can provide a common source for the world's digital biodiversity data.
The collaboration strives to harness and combine the unique strengths of its partners in ways that ensure the individual needs of each partner’s constituencies are met, design pathways for accommodating existing and emerging aggregators, simultaneously strengthen and enhance access to the world’s biodiversity data, and underscore the scope and importance of worldwide biodiversity informatics activities. Collaborators will share technology strategies and outputs, align conceptual understandings, and establish and draw from an international knowledge base. These collaborators, along with Biodiversity Information Standards (TDWG), will join iDigBio and the Smithsonian National Museum of Natural History as they host Biodiversity 2020 in Washington, DC. Biodiversity 2020 will combine an international celebration of the worldwide progress made in biodiversity data accessibility in the 21st century with a biodiversity data conference that extends the life of Biodiversity Next. It will provide a venue for the GBIF governing board meeting, TDWG annual meeting, and the annual iDigBio Summit as well as three days of plenary and concurrent sessions focused on the present and future of biodiversity data generation, mobilization, and use.

