Wikidata and the biodiversity knowledge graph

Author(s):  
Roderic Page

This talk explores the role Wikidata (Vrandečić and Krötzsch 2014) might play in the task of assembling biodiversity information into a single, richly annotated and cross-linked structure known as the biodiversity knowledge graph (Page 2016). Initially conceived as a language-independent data store of facts derived from Wikipedia, Wikidata has morphed into a global knowledge graph, complete with a user-friendly interface for data entry and a powerful implementation of the SPARQL query language. Wikidata already underpins projects such as Gene Wiki (Burgstaller-Muehlbacher et al. 2016) and Scholia (Nielsen et al. 2017). Much of the content of Wikispecies is being automatically added to Wikidata, hence many of the entities relevant to biodiversity (such as taxa, taxonomic publications, and taxonomists) are already well represented in Wikidata, making it even more attractive. Much of the data relevant to biodiversity is widely scattered in different locations, requiring considerable manual effort to collect and curate. Appeals to the taxonomic community to undertake these tasks have not always met with success. For example, the Global Registry of Biodiversity Repositories (GrBio) was an attempt to create a global list of biodiversity repositories, such as natural history museums and herbaria. An appeal by Schindel et al. (2016) for the taxonomic community to curate this list largely fell on deaf ears, and at the time of writing the GrBio project is moribund. Given that many repositories are housed in institutions that are the subject of articles in Wikipedia, many of these repositories already have entries in Wikidata. Hence, rather than follow the route GrBio took of building a resource and then hoping a community will assemble around that resource, we could go to Wikidata where there is an existing community and build the resource there. 
An impressive example of the potential for this is WikiCite, which initially had the goal of including in Wikidata every article cited in any of the Wikipedias. Taxonomic articles are highly cited in Wikipedia (Nielsen 2007), and hence already fall within the remit of WikiCite. Wikidata is therefore a candidate for the “bibliography of life” (King et al. 2011), a database of all taxonomic literature. Another important role Wikidata can play is to define the boundaries of a biodiversity knowledge graph. Entities such as journals, articles, people, museums, and herbaria are often already in Wikidata, hence we can delegate managing that content to the Wikidata community (bolstered by our own contributions), and focus instead on domain-specific entities such as DNA sequences, specimens, etc., or domain-specific attributes of those entities if they are already in Wikidata. This means we can avoid the inevitable “mission creep” that bedevils any attempt to link together information from multiple disciplines. These ideas are explored using examples based on content entirely within Wikidata (including entities such as publications, authorship, and natural history collections), as well as approaches that combine Wikidata with external knowledge graphs such as Ozymandias (Page 2018).
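The SPARQL queries mentioned above are central to how Wikidata exposes biodiversity data. As a minimal sketch, the snippet below builds the kind of query that lists taxa and their scientific names; P31 ("instance of"), Q16521 ("taxon") and P225 ("taxon name") are real Wikidata identifiers, and the endpoint URL is Wikidata's public SPARQL service (the request itself is not executed here).

```python
# Sketch: constructing a SPARQL query for taxa in Wikidata.
# P31 = "instance of", Q16521 = "taxon", P225 = "taxon name".
import urllib.parse

WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"

def taxa_query(limit=10):
    """Build a SPARQL query listing items that are instances of taxon."""
    return f"""
    SELECT ?taxon ?name WHERE {{
      ?taxon wdt:P31 wd:Q16521 ;   # instance of: taxon
             wdt:P225 ?name .      # taxon name (scientific name)
    }}
    LIMIT {limit}
    """

def query_url(sparql):
    """URL for a GET request returning JSON results (not fetched here)."""
    params = urllib.parse.urlencode({"query": sparql, "format": "json"})
    return f"{WIKIDATA_SPARQL}?{params}"
```

Pasting the generated query into the Wikidata Query Service interface returns taxa together with their scientific names.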

Author(s):  
Roderic Page

This talk explores different strategies for assembling the “biodiversity knowledge graph” (Page 2016). The first is a centralised, crowd-sourced approach using Wikidata as the foundation. Wikidata is becoming increasingly attractive as a knowledge graph for the life sciences (Waagmeester et al. 2020), and I will discuss some of its strengths and limitations, particularly as a source of bibliographic and taxonomic information. For example, Wikidata’s handling of taxonomy is somewhat problematic given the lack of clear separation of taxa and their names. A second approach is to build biodiversity knowledge graphs from scratch, such as OpenBioDiv (Penev et al. 2019) and my own Ozymandias (Page 2019). These approaches use either generalised vocabularies such as schema.org, or domain-specific ones such as TaxPub (Catapano 2010) and the Semantic Publishing and Referencing Ontologies (SPAR) (Peroni and Shotton 2018), and to date tend to have a restricted focus, whether geographic (e.g., Australian animals in Ozymandias) or temporal (recent taxonomic literature, OpenBioDiv). A growing number of data sources are now using schema.org to describe their data, including ORCID and Zenodo, and efforts to extend schema.org into biology (Bioschemas) suggest we may soon be able to build comprehensive knowledge graphs using just schema.org and its derivatives. A third approach is not to build an entire knowledge graph, but instead focus on constructing small pieces of the graph tightly linked to supporting evidence, for example via annotations. Annotations are increasingly used to mark up both the biomedical literature (e.g., Kim et al. 2015, Venkatesan et al. 2017) and the biodiversity literature (Batista-Navarro et al. 2017). 
One could argue that taxonomic databases are essentially lists of annotations (“this name appears in this publication on this page”), which suggests we could link literature projects such as the Biodiversity Heritage Library (BHL) to taxonomic databases via annotations. Given that the International Image Interoperability Framework (IIIF) provides a framework for treating publications themselves as a set of annotations (e.g., page images) upon which other annotations can be added (Zundert 2018), this suggests ways that knowledge graphs could lead directly to visualising the links between taxonomy and the taxonomic literature. All three approaches will be discussed, accompanied by working examples.
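The idea that a taxonomic database is a list of “this name appears in this publication on this page” assertions can be made concrete with a small data model. The sketch below is illustrative only (the field names are assumptions, not a formal W3C Web Annotation serialisation), but it shows how such annotations could link names to BHL-style page identifiers.

```python
# Sketch: a taxonomic-name usage modelled as an annotation-style record,
# linking a name (the body) to a page in a publication (the target).
from dataclasses import dataclass

@dataclass(frozen=True)
class NameAnnotation:
    name: str          # taxonomic name as it appears
    publication: str   # identifier of the containing publication
    page: int          # page number within that publication

    def to_triple(self):
        """Flatten to the 'name appears in publication on page' assertion."""
        return (self.name, self.publication, self.page)

def index_by_name(annotations):
    """Group annotations so a name lookup yields all its page occurrences."""
    index = {}
    for a in annotations:
        index.setdefault(a.name, []).append((a.publication, a.page))
    return index
```

An index built this way is essentially the nomenclator view of the literature: querying one name returns every page on which it has been recorded.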


Semantic Web ◽  
2020 ◽  
pp. 1-45
Author(s):  
Valentina Anita Carriero ◽  
Aldo Gangemi ◽  
Maria Letizia Mancinelli ◽  
Andrea Giovanni Nuzzolese ◽  
Valentina Presutti ◽  
...  

Ontology Design Patterns (ODPs) have become an established and recognised practice for guaranteeing good quality ontology engineering. There are several ODP repositories where ODPs are shared, as well as ontology design methodologies recommending their reuse. Performing rigorous testing is recommended as well, for supporting ontology maintenance and validating the resulting resource against its motivating requirements. Nevertheless, it is less than straightforward to find guidelines on how to apply such methodologies for developing domain-specific knowledge graphs. ArCo is the knowledge graph of Italian Cultural Heritage and has been developed by using eXtreme Design (XD), an ODP- and test-driven methodology. During its development, XD has been adapted to the needs of the Cultural Heritage (CH) domain: for example, requirements were gathered from an open, diverse community of consumers, a new ODP was defined, and many existing ones were specialised to address specific CH requirements. This paper presents ArCo and describes how to apply XD to the development and validation of a CH knowledge graph, also detailing the (intellectual) process implemented for matching the encountered modelling problems to ODPs. Relevant contributions also include a novel web tool for supporting unit-testing of knowledge graphs, a rigorous evaluation of ArCo, and a discussion of methodological lessons learned during ArCo’s development.
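The unit-testing idea behind XD can be sketched in a few lines: each motivating requirement (a competency question) becomes a test that must be satisfiable against the graph. This toy version represents the graph as a plain set of triples and requirements as match patterns; a real setup would instead run SPARQL queries against the published ArCo graph, and the example data below is invented.

```python
# Sketch: test-driven validation of a knowledge graph, in the spirit of
# XD unit tests. None in a pattern acts as a wildcard.
def match(graph, s=None, p=None, o=None):
    """Return triples matching a (subject, predicate, object) pattern."""
    return [t for t in graph
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

def run_unit_tests(graph, requirements):
    """Map each competency question to whether its pattern is satisfiable."""
    return {question: bool(match(graph, *pattern))
            for question, pattern in requirements.items()}
```

A failing question signals either missing data or a modelling gap, which is exactly the feedback loop a test-driven methodology relies on.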


Nature ◽  
2021 ◽  
Vol 598 (7879) ◽  
pp. 32-32
Author(s):  
Corrie S. Moreau ◽  
Jessica L. Ware

Author(s):  
Ming Sheng ◽  
Anqi Li ◽  
Yuelin Bu ◽  
Jing Dong ◽  
Yong Zhang ◽  
...  

Author(s):  
Lyubomir Penev ◽  
Teodor Georgiev ◽  
Viktor Senderov ◽  
Mariya Dimitrova ◽  
Pavel Stoev

As one of the first advocates of open access and open data in the field of biodiversity publishing, Pensoft has adopted a multiple data publishing model, resulting in the ARPHA-BioDiv toolbox (Penev et al. 2017). ARPHA-BioDiv consists of several data publishing workflows and tools described in the Strategies and Guidelines for Publishing of Biodiversity Data and elsewhere:

1. Data underlying research results are deposited in an external repository and/or published as supplementary file(s) to the article and then linked/cited in the article text; supplementary files are published under their own DOIs and bear their own citation details.
2. Data are deposited in trusted repositories and/or supplementary files and described in data papers; data papers may be submitted in text format or converted into manuscripts from Ecological Metadata Language (EML) metadata.
3. Integrated narrative and data publishing is realised by the Biodiversity Data Journal, where structured data are imported into the article text from tables or via web services and downloaded/distributed from the published article.
4. Data are published in structured, semantically enriched, full-text XML, so that several data elements can thereafter easily be harvested by machines.
5. Linked Open Data (LOD) are extracted from literature, converted into interoperable RDF triples in accordance with the OpenBiodiv-O ontology (Senderov et al. 2018), and stored in the OpenBiodiv Biodiversity Knowledge Graph.

The above-mentioned approaches are supported by a whole ecosystem of additional workflows and tools, for example: (1) pre-publication data auditing, involving both human and machine data quality checks (workflow 2); (2) web-service integration with data repositories and data centres, such as the Global Biodiversity Information Facility (GBIF), Barcode of Life Data Systems (BOLD), Integrated Digitized Biocollections (iDigBio), Data Observation Network for Earth (DataONE), Long Term Ecological Research (LTER), PlutoF, Dryad, and others (workflows 1, 2); (3) semantic markup of the article texts in the TaxPub format, facilitating further extraction, distribution and re-use of sub-article elements and data (workflows 3, 4); (4) server-to-server import of specimen data from GBIF, BOLD, iDigBio and PlutoF into manuscript text (workflow 3); (5) automated conversion of EML metadata into data paper manuscripts (workflow 2); (6) export of Darwin Core Archives and automated deposition in GBIF (workflow 3); (7) submission of individual images and supplementary data under their own DOIs to the Biodiversity Literature Repository (BLR) (workflows 1-3); (8) conversion of key data elements from TaxPub articles and taxonomic treatments extracted by Plazi into RDF handled by OpenBiodiv (workflow 5). 
These approaches represent different aspects of the prospective scholarly publishing of biodiversity data which, in combination with text and data mining (TDM) technologies for legacy literature (PDF) developed by Plazi, lay the ground for an entire data publishing ecosystem for biodiversity, supplying FAIR (Findable, Accessible, Interoperable and Reusable) data to several interoperable overarching infrastructures, such as GBIF, BLR, Plazi TreatmentBank and OpenBiodiv, as well as to various end users.
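Workflow 5 above hinges on serialising extracted article data as RDF triples. As a much-simplified sketch of that step, the code below emits N-Triples lines for an article and its taxonomic treatments; the predicate URIs are illustrative placeholders (example.org), not actual OpenBiodiv-O terms.

```python
# Sketch: emitting article metadata as N-Triples, analogous to (but far
# simpler than) the TaxPub-to-RDF conversion that feeds OpenBiodiv.
def ntriple(s, p, o):
    """Serialise one triple; literals are quoted, URIs angle-bracketed."""
    obj = f'"{o}"' if not o.startswith("http") else f"<{o}>"
    return f"<{s}> <{p}> {obj} ."

def article_to_triples(doi, title, treatment_uris):
    """Describe an article (by DOI) and link it to its treatments."""
    base = f"https://doi.org/{doi}"
    lines = [ntriple(base, "http://example.org/hasTitle", title)]
    for t in treatment_uris:
        lines.append(ntriple(base, "http://example.org/hasTreatment", t))
    return lines
```

Once in N-Triples form, such statements can be loaded into any RDF store and queried alongside the rest of a knowledge graph.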


PeerJ ◽  
2020 ◽  
Vol 8 ◽  
pp. e8225 ◽  
Author(s):  
Freek T. Bakker ◽  
Alexandre Antonelli ◽  
Julia A. Clarke ◽  
Joseph A. Cook ◽  
Scott V. Edwards ◽  
...  

Natural history museums are unique spaces for interdisciplinary research and educational innovation. Through extensive exhibits and public programming and by hosting rich communities of amateurs, students, and researchers at all stages of their careers, they can provide a place-based window to focus on integration of science and discovery, as well as a locus for community engagement. At the same time, like a synthesis radio telescope, when joined together through emerging digital resources, the global community of museums (the ‘Global Museum’) is more than the sum of its parts, allowing insights and answers to diverse biological, environmental, and societal questions at the global scale, across eons of time, and spanning vast diversity across the Tree of Life. We argue that, whereas natural history collections and museums began with a focus on describing the diversity and peculiarities of species on Earth, they are now increasingly leveraged in new ways that significantly expand their impact and relevance. These new directions include the possibility to ask new, often interdisciplinary questions in basic and applied science, such as in biomimetic design, and by contributing to solutions to climate change, global health and food security challenges. As institutions, they have long been incubators for cutting-edge research in biology while simultaneously providing core infrastructure for research on present and future societal needs. Here we explore how the intersection between pressing issues in environmental and human health and rapid technological innovation has reinforced the relevance of museum collections. We do this by providing examples as food for thought for both the broader academic community and museum scientists on the evolving role of museums. We also identify challenges to the realization of the full potential of natural history collections and the Global Museum to science and society and discuss the critical need to grow these collections. 
We then focus on mapping and modelling of museum data (including place-based approaches and discovery), and explore the main projects, platforms and databases enabling this growth. Finally, we aim to improve relevant protocols for the long-term storage of specimens and tissues, ensuring proper connection with tomorrow’s technologies and hence further increasing the relevance of natural history museums.


2020 ◽  
Vol 245 ◽  
pp. 04044
Author(s):  
Jérôme Fulachier ◽  
Jérôme Odier ◽  
Fabian Lambert

This document describes the design principles of the Metadata Querying Language (MQL) implemented in ATLAS Metadata Interface (AMI), a metadata-oriented domain-specific language that allows querying databases without knowing the relations between tables. With this simplified yet generic grammar, MQL permits writing complex queries more simply than with Structured Query Language (SQL).
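The core idea (the user names fields, and the engine derives the joins from a known schema) can be illustrated with a toy translator. The table and column names below are invented for illustration, and the real MQL grammar is far richer than this two-table sketch.

```python
# Sketch: translating an MQL-like query into SQL by inferring the join
# from a schema map, so the user never writes the table relations.
RELATIONS = {  # (table_a, table_b) -> join condition
    ("datasets", "files"): "datasets.id = files.dataset_id",
}

def mql_to_sql(select_fields, where):
    """Translate field names plus conditions into a joined SQL query."""
    tables = sorted({f.split(".")[0] for f in select_fields + list(where)})
    sql = f"SELECT {', '.join(select_fields)} FROM {tables[0]}"
    for t in tables[1:]:
        cond = RELATIONS.get((tables[0], t)) or RELATIONS.get((t, tables[0]))
        sql += f" JOIN {t} ON {cond}"
    sql += " WHERE " + " AND ".join(f"{k} {v}" for k, v in where.items())
    return sql
```

Because the join conditions live in the schema map rather than in each query, a user asking for `datasets.name` filtered on `files.size` never has to spell out how the two tables relate.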

