Unleash the Potential of your Website! 180,000 webpages from the French Natural History Museum marked up with Bioschemas/Schema.org biodiversity types

Author(s):  
Franck Michel ◽  
Olivier Gargominy ◽  
Benjamin Ledentec ◽  
The Bioschemas Community

The challenge of finding, retrieving and making sense of biodiversity data is being tackled by many different approaches. Projects like the Global Biodiversity Information Facility (GBIF) or the Encyclopedia of Life (EoL) adopt an integrative approach whereby they republish, in a uniform manner, records aggregated from multiple data sources. With this centralized, siloed approach, such projects stand as powerful one-stop shops, but tend to reduce the visibility of other data sources that are not (yet) aggregated. At the other end of the spectrum, the Web of Data promotes the building of a global, distributed knowledge graph consisting of datasets, such as Wikidata or DBpedia, published by independent institutions according to the Linked Open Data principles (Heath and Bizer 2011). Beyond these "sophisticated" infrastructures, websites remain the most common way of publishing and sharing scientific data at low cost. Thanks to web search engines, everyone can discover webpages. Yet, the summaries provided in result lists are often insufficiently informative to decide whether a webpage is relevant to a given research interest, making it hardly possible to integrate data published by a wealth of websites. One strategy to address this issue is to annotate webpages with structured, semantic metadata such as the Schema.org vocabulary (Guha et al. 2015). Webpages typically embed Schema.org annotations in the form of markup data (written in the RDFa or JSON-LD formats), which search engines harvest and exploit to improve ranking and provide more informative summaries. Bioschemas is a community effort working to extend Schema.org to support markup for Life Sciences websites (Michel and The Bioschemas Community 2018, Garcia et al. 2017). Bioschemas primarily re-uses existing terms from Schema.org, occasionally re-uses terms from third-party vocabularies, and, when necessary, proposes new terms to be endorsed by Schema.org. 
To date, Bioschemas's biodiversity group has proposed the Taxon type to support the annotation of any webpage denoting taxa, the TaxonName type to support more specifically the annotation of taxonomic name registries, and guidelines describing how to leverage existing vocabularies such as Darwin Core terms. To proceed further, the biodiversity community must now demonstrate its interest in having these terms endorsed by Schema.org: (1) through a critical mass of live markup deployments, and (2) through the development of applications capable of exploiting this markup data. Therefore, as a first step, the French National Museum of Natural History has marked up its natural heritage inventory website: over 180,000 webpages describing the species inventoried in French territories have been annotated with the Taxon and TaxonName types in the form of JSON-LD scripts (see example scripts). As an example, one can check the source of the Delphinus delphis page. In this presentation, by demonstrating that marking up existing webpages can be very inexpensive, we wish to encourage the biodiversity community to adopt this practice, engage in the discussion about biodiversity-related markup, and possibly propose new terms related, e.g., to traits or collections. We believe that generalizing the use of such markup by the many websites reporting checklists, museum collections, occurrences, life traits, etc. would be a major step towards the generalized adoption of the FAIR principles (Wilkinson 2016), would dramatically improve information discovery using search engines, and would be a key accelerator for the development of novel, web-scale biodiversity data integration scenarios.
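As a concrete illustration of this kind of markup, the sketch below builds a JSON-LD script of the sort a taxon webpage can embed (generated here with Python purely for readability). The property set is a minimal, illustrative reading of the Bioschemas Taxon and TaxonName types; the actual markup deployed on the inventory pages may use additional or different properties.

```python
import json

# A minimal sketch of the JSON-LD script embedded in a taxon webpage.
# Property names follow the Bioschemas Taxon/TaxonName proposal; the
# exact properties used on the real pages may differ.
taxon_markup = {
    "@context": "https://schema.org/",
    "@type": "Taxon",
    "name": "Delphinus delphis Linnaeus, 1758",
    "taxonRank": "species",
    "scientificName": {
        "@type": "TaxonName",
        "name": "Delphinus delphis",
        "author": "Linnaeus, 1758",
        "taxonRank": "species",
    },
    "parentTaxon": {
        "@type": "Taxon",
        "name": "Delphinus Linnaeus, 1758",
        "taxonRank": "genus",
    },
}

# Serialized, this becomes the body of a
# <script type="application/ld+json"> element in the page's HTML.
script_body = json.dumps(taxon_markup, indent=2)
print(script_body)
```

Search engines harvesting the page parse this script alongside the visible HTML, which is what makes the markup so inexpensive to add: the page content itself does not change.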

Author(s):  
Erica Krimmel ◽  
Austin Mast ◽  
Deborah Paul ◽  
Robert Bruhn ◽  
Nelson Rios ◽  
...  

Genomic evidence suggests that the causative virus of COVID-19 (SARS-CoV-2) was introduced to humans from horseshoe bats (family Rhinolophidae) (Andersen et al. 2020) and that species in this family as well as in the closely related Hipposideridae and Rhinonycteridae families are reservoirs of several SARS-like coronaviruses (Gouilh et al. 2011). Specimens collected over the past 400 years and curated by natural history collections around the world provide an essential reference as we work to understand the distributions, life histories, and evolutionary relationships of these bats and their viruses. While the importance of biodiversity specimens to emerging infectious disease research is clear, empowering disease researchers with specimen data is a relatively new goal for the collections community (DiEuliis et al. 2016). Recognizing this, a team from Florida State University is collaborating with partners at GEOLocate, Bionomia, University of Florida, the American Museum of Natural History, and Arizona State University to produce a deduplicated, georeferenced, vetted, and versioned data product of the world's specimens of horseshoe bats and relatives for researchers studying COVID-19. The project will serve as a model for future rapid data product deployments about biodiversity specimens. The project underscores the value of biodiversity data aggregators iDigBio and the Global Biodiversity Information Facility (GBIF), which are sources for 58,617 and 79,862 records, respectively, as of July 2020, of horseshoe bat and relative specimens held by over one hundred natural history collections. Although much of the specimen-based biodiversity data served by iDigBio and GBIF is high quality, it can be considered raw data and therefore often requires additional wrangling, standardizing, and enhancement to be fit for specific applications. 
The project will create efficiencies for the coronavirus research community by producing an enhanced, research-ready data product, which will be versioned and published through Zenodo, an open-access repository (see doi.org/10.5281/zenodo.3974999). In this talk, we highlight lessons learned from the initial phases of the project, including deduplicating specimen records, standardizing country information, and enhancing taxonomic information. We also report on our progress to date in enhancing information about agents (e.g., collectors or determiners) associated with these specimens and in georeferencing specimen localities. We further explore the extent to which the added agent information (i.e., ORCID iDs and Wikidata Q identifiers) can inform our georeferencing efforts and support crediting those who collected and identified the specimens. The project will georeference approximately one third of our specimen records, namely those lacking geospatial coordinates but containing textual locality descriptions. We furthermore provide an overview of our holistic approach to enhancing specimen records, which we hope will maximize the value of the bat specimens at the center of what has recently been termed the "extended specimen network" (Lendemer et al. 2020). The centrality of the physical specimen in the network reinforces the importance of archived materials for reproducible research. Recognizing this, we view the collections providing data to iDigBio and GBIF as essential partners, as we expect that they will be responsible for the long-term management of enhanced data associated with the physical specimens they curate. We hope that this project can provide a model for better facilitating the reintegration of enhanced data back into local specimen data management systems.
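One of the wrangling steps named above, deduplicating specimen records aggregated from iDigBio and GBIF, can be sketched as follows. This is a simplified illustration, not the project's actual pipeline: it keys records on the Darwin Core institutionCode/collectionCode/catalogNumber triplet and keeps the first record per key, whereas a real workflow would also merge fields and handle missing or inconsistent identifiers.

```python
# Simplified sketch: the same physical specimen often appears in both
# iDigBio and GBIF, so records are deduplicated on the Darwin Core
# triplet (institutionCode, collectionCode, catalogNumber).
def deduplicate(records):
    seen = {}
    for rec in records:
        key = (rec.get("institutionCode"),
               rec.get("collectionCode"),
               rec.get("catalogNumber"))
        # Keep the first record seen for each key; merging the more
        # complete of two duplicate records is omitted for brevity.
        if key not in seen:
            seen[key] = rec
    return list(seen.values())

records = [
    {"institutionCode": "AMNH", "collectionCode": "Mammals",
     "catalogNumber": "M-12345", "source": "iDigBio"},
    {"institutionCode": "AMNH", "collectionCode": "Mammals",
     "catalogNumber": "M-12345", "source": "GBIF"},
    {"institutionCode": "UF", "collectionCode": "Mammals",
     "catalogNumber": "67890", "source": "GBIF"},
]

deduped = deduplicate(records)
print(len(deduped))  # two distinct specimens remain
```

The example data are invented; real aggregator records carry many more Darwin Core fields, and catalog numbers are not always populated consistently across institutions.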


2020 ◽  
Author(s):  
Vaughn Shirey ◽  
Michael W. Belitz ◽  
Vijay Barve ◽  
Robert Guralnick

Aggregate biodiversity data from museum specimens and community observations have promise for macroscale ecological analyses. Despite this, many groups are under-sampled, and sampling is not homogeneous across space. Here we used butterflies, the best documented group of insects, to examine inventory completeness across North America. We separated digitally accessible butterfly records into those from natural history collections and burgeoning community science observations to determine if these data sources have differential spatio-taxonomic biases. When we combined all data, we found startling under-sampling in regions with the most dramatic trajectories of climate change and across biomes. We also found support for the hypothesis that community science observations are filling more gaps in sampling but are more biased towards areas with the highest human footprint. Finally, we found that both types of occurrences have familial-level taxonomic completeness biases, in contrast to the hypothesis of less taxonomic bias in natural history collections data. These results suggest that higher inventory completeness, driven by rapid growth of community science observations, is partially offset by higher spatio-taxonomic biases. We use the findings here to provide recommendations on how to alleviate some of these gaps in the context of prioritizing global change research.


2018 ◽  
Vol 374 (1763) ◽  
pp. 20170391 ◽  
Author(s):  
Gil Nelson ◽  
Shari Ellis

The first two decades of the twenty-first century have seen a rapid rise in the mobilization of digital biodiversity data. This has thrust natural history museums into the forefront of biodiversity research, underscoring their central role in the modern scientific enterprise. The advent of mobilization initiatives such as the United States National Science Foundation's Advancing Digitization of Biodiversity Collections (ADBC), Australia's Atlas of Living Australia (ALA), Mexico's National Commission for the Knowledge and Use of Biodiversity (CONABIO), Brazil's Centro de Referência em Informação (CRIA) and China's National Specimen Information Infrastructure (NSII) has led to a rapid rise in data aggregators and an exponential increase in digital data for scientific research, arguably providing the best evidence of where species live. The international Global Biodiversity Information Facility (GBIF) now serves about 131 million museum specimen records, and Integrated Digitized Biocollections (iDigBio) in the USA has amassed more than 115 million. These resources expose collections to a wider audience of researchers, provide the best biodiversity data in the modern era outside of nature itself and ensure the primacy of specimen-based research. Here, we provide a brief history of worldwide data mobilization efforts, their impact on biodiversity research, challenges for ensuring data quality, their contribution to scientific publications and evidence of the rising profiles of natural history collections. This article is part of the theme issue ‘Biological collections for understanding biodiversity in the Anthropocene’.


2021 ◽  
Vol 9 ◽  
Author(s):  
Domingos Sandramo ◽  
Enrico Nicosia ◽  
Silvio Cianciullo ◽  
Bernardo Muatinte ◽  
Almeida Guissamulo

The collections of the Natural History Museum of Maputo have a crucial role in the safeguarding of Mozambique's biodiversity, representing an important repository of data and materials regarding the natural heritage of the country. In this paper, a dataset is described, based on the Museum's Entomological Collection, recording 409 species belonging to seven orders and 48 families. Each specimen's available data, such as geographical coordinates and taxonomic information, have been digitised to build the dataset. The specimens included in the dataset were obtained between 1914–2018 by collectors and researchers from the Natural History Museum of Maputo (once known as "Museu Álvaro de Castro") in all the country's provinces, with the exception of Cabo Delgado Province. This paper adds data to the Biodiversity Network of Mozambique and the Global Biodiversity Information Facility, within the objectives of the SECOSUD II Project and the Biodiversity Information for Development Programme. The aforementioned insect dataset is available on the GBIF data portal (https://doi.org/10.15468/j8ikhb). Data were also shared on BioNoMo (https://bionomo.openscidata.org), the Mozambican national portal of biodiversity data developed by the SECOSUD II Project.


2020 ◽  
Vol 15 (4) ◽  
pp. 411-437 ◽  
Author(s):  
Marcos Zárate ◽  
Germán Braun ◽  
Pablo Fillottrani ◽  
Claudio Delrieux ◽  
Mirtha Lewis

Great progress has recently been made in digitizing the world's available biodiversity and biogeography data, but managing data from many different providers and research domains still remains a challenge. A review of the current landscape of metadata standards and ontologies in the biodiversity sciences suggests that existing standards, such as the Darwin Core terminology, are inadequate for describing biodiversity data in a semantically meaningful and computationally useful way. As a contribution to fill this gap, we present an ontology-based system, called BiGe-Onto, designed to jointly manage data from Biodiversity and Biogeography. As data sources, we use two internationally recognized repositories: the Global Biodiversity Information Facility (GBIF) and the Ocean Biogeographic Information System (OBIS). The BiGe-Onto system is composed of (i) the BiGe-Onto architecture, (ii) a conceptual model called BiGe-Onto specified in OntoUML, (iii) an operational version of BiGe-Onto encoded in OWL 2, and (iv) an integrated dataset for its exploitation through a SPARQL endpoint. We show use cases that allow researchers to answer questions that manage information from both domains.
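The integration idea behind point (iv) can be illustrated with a toy example: occurrence records from both repositories are loaded into one graph of (subject, predicate, object) triples and queried uniformly, the way a SPARQL basic graph pattern would. The property names and record identifiers below are invented for illustration and do not come from the BiGe-Onto ontology, which is encoded in OWL 2 and served through a real SPARQL endpoint.

```python
# Toy stand-in for an integrated triple store: GBIF and OBIS occurrence
# records in one graph, queried with a wildcard pattern matcher.
# Predicate names are invented; BiGe-Onto defines its own OWL 2 terms.
triples = {
    ("occ1", "recordedTaxon", "Delphinus delphis"),
    ("occ1", "fromRepository", "GBIF"),
    ("occ2", "recordedTaxon", "Delphinus delphis"),
    ("occ2", "fromRepository", "OBIS"),
    ("occ3", "recordedTaxon", "Mirounga leonina"),
    ("occ3", "fromRepository", "OBIS"),
}

def match(s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard,
    much like a variable in a SPARQL basic graph pattern."""
    return [(ts, tp, to) for (ts, tp, to) in triples
            if s in (None, ts) and p in (None, tp) and o in (None, to)]

# Which repositories hold occurrences of Delphinus delphis?
occs = {s for s, _, _ in match(p="recordedTaxon", o="Delphinus delphis")}
repos = sorted(o for s, p, o in match(p="fromRepository") if s in occs)
print(repos)  # ['GBIF', 'OBIS']
```

The point of the sketch is only that once heterogeneous sources share one graph model and vocabulary, a single query spans both; everything else (reasoning, provenance, the OntoUML-to-OWL mapping) is what the full system adds.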


Author(s):  
Maxim Shashkov ◽  
Natalya Ivanova

Russia is a huge gap on the open-access global biodiversity map of the Global Biodiversity Information Facility (GBIF). National biodiversity data are stored in various sources, including museums, herbaria, scientific literature and reports, as well as in private collections and local databases. The best known and largest of the Russian herbarium collections are those stored in the Komarov Botanical Institute of the Russian Academy of Sciences (>6 M sheets) and Moscow University (>1 M sheets). The largest zoological collection is located in the Zoological Institute of the Russian Academy of Sciences, with >60 M specimens. But most of the national biodiversity data are not yet digitized. A national biodiversity portal, as well as a list of Russian biodiversity data sources, is still absent. Despite this, projects and other activities are being implemented to mobilize national data using international biodiversity data standards. Currently Russia is not a GBIF member, but in the last 5 years more than 1.6 M occurrences were published by Russian publishers through GBIF.org (69 datasets at the end of March 2019). The largest GBIF data provider in Russia is Lomonosov Moscow State University: the Digital Moscow University Herbarium includes 971,732 specimens collected from Russia and many other countries. The Russian GBIF community is steadily expanding (Fig. 1); this is reflected in an increase in the number of publishers and published datasets. The current GBIF network infrastructure in Russia includes 5 IPT (Integrated Publishing Toolkit) installations in Saint Petersburg (two), Pushchino (Moscow region), Moscow, and Syktyvkar (Komi Republic). Russian-language biodiversity informatics materials are collected and presented on an informal website, http://gbif.ru/, with three main sections: data publishing through GBIF, Russian GBIF activities, and Russian biodiversity data sources. 
Additional sections are dedicated to the iNaturalist citizen science system and the Russian Specify Software Project community. We provide technical helpdesk support not only for Russian publishers, but also for Russian speakers from the former USSR. The national mailing list (via Google Groups) aims to provide a platform for news sharing and now includes >240 subscribers. Since the end of 2014, biodiversity informatics events have been held regularly in Russia. Last year, two data training courses, funded by GBIF (project ID Russia-02, "GBIF.ru data mobilization activities") and ForBIO (Research school in biosystematics), were organized in Moscow and the Irkutsk region with the participation of 29 Russian researchers. National biodiversity informatics conferences were held in Apatity (2017) and Irkutsk (2018). We believe Russia already has a well-established community that can become the basis for further development when Russia becomes a GBIF member.


2020 ◽  
Vol 19 (10) ◽  
pp. 1602-1618 ◽  
Author(s):  
Thibault Robin ◽  
Julien Mariethoz ◽  
Frédérique Lisacek

A key point in achieving accurate intact glycopeptide identification is the definition of the glycan composition file that is used to match experimental with theoretical masses by a glycoproteomics search engine. At present, these files are mainly built from searching the literature and/or querying data sources focused on post-translational modifications. Most glycoproteomics search engines include a default composition file that is readily used when processing MS data. We introduce here a glycan composition visualization and comparison tool associated with the GlyConnect database and called GlyConnect Compozitor. It offers a web interface through which the database can be queried to bring out contextual information relative to a set of glycan compositions. The tool takes advantage of compositions being related to one another through shared monosaccharide counts and outputs interactive graphs summarizing information searched in the database. These results provide a guide for selecting or deselecting compositions in a file in order to reflect the context of a study as closely as possible. They also confirm the consistency of a set of compositions based on the content of the GlyConnect database. As part of the tool collection of the Glycomics@ExPASy initiative, Compozitor is hosted at https://glyconnect.expasy.org/compozitor/ where it can be run as a web application. It is also directly accessible from the GlyConnect database.
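The idea of relating compositions "through shared monosaccharide counts" can be sketched in a few lines. In this hedged reading, each composition is a count vector over monosaccharide types and two compositions are linked when they differ by exactly one monosaccharide; the compositions below are common N-glycan examples, and the actual graph construction in Compozitor may differ from this simplification.

```python
from itertools import combinations

# Each glycan composition is a count vector over monosaccharide types
# (Hex, HexNAc, dHex, NeuAc, ...). Example compositions are typical
# N-glycans chosen for illustration only.
compositions = {
    "Hex5HexNAc2":       {"Hex": 5, "HexNAc": 2},
    "Hex5HexNAc2dHex1":  {"Hex": 5, "HexNAc": 2, "dHex": 1},
    "Hex6HexNAc2":       {"Hex": 6, "HexNAc": 2},
    "Hex5HexNAc3NeuAc1": {"Hex": 5, "HexNAc": 3, "NeuAc": 1},
}

def distance(a, b):
    """Total monosaccharide count difference between two compositions."""
    keys = set(a) | set(b)
    return sum(abs(a.get(k, 0) - b.get(k, 0)) for k in keys)

# Link two compositions when they differ by a single monosaccharide,
# one plausible way of building a composition graph.
edges = sorted((x, y) for x, y in combinations(sorted(compositions), 2)
               if distance(compositions[x], compositions[y]) == 1)
print(edges)
```

A graph like this makes gaps visible: an isolated composition in a candidate file is either genuinely unusual for the study context or a hint that intermediate compositions are missing, which is the kind of guidance the tool provides interactively.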


Author(s):  
Mohanbir Sawhney

Jacob Matthews, chief strategy officer for Career Central Corp. (CEC), was faced with the challenge of growing the client base for CEC's database of job seekers. While CEC had gained traction in signing up potential recruits, the number of employers using the site was still low, and if the trend continued, the recruits might soon start leaving the site. To grow dramatically, Matthews was exploring the possibility of partnering with executive recruiters, search firms, and other online search firms. But how could he structure such partnerships without compromising the confidentiality of his candidates? How could he minimize the risk involved in trusting a third party with the company's valuable database of employees? What was the value proposition that CEC offered its clients who currently used its competitors both online and offline? Refining the marketing message, structuring strategic partnerships, and consistently delivering on its promise were the issues that CEC had to address to grow its business.


Author(s):  
Emery R. Boose ◽  
Barbara S. Lerner

The metadata that describe how scientific data are created and analyzed are typically limited to a general description of data sources, software used, and statistical tests applied and are presented in narrative form in the methods section of a scientific paper or a data set description. Recognizing that such narratives are usually inadequate to support reproduction of the analysis of the original work, a growing number of journals now require that authors also publish their data. However, finer-scale metadata that describe exactly how individual items of data were created and transformed and the processes by which this was done are rarely provided, even though such metadata have great potential to improve data set reliability. This chapter focuses on the detailed process metadata, called “data provenance,” required to ensure reproducibility of analyses and reliable re-use of the data.
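The fine-grained process metadata the chapter describes can be made concrete with a minimal sketch: every transformation of a data item is recorded together with its inputs, the operation applied, and a timestamp, so the derivation of each value can be reconstructed afterwards. This is only an illustration of the concept; real provenance systems capture a full provenance graph, including the execution environment and the code itself.

```python
import datetime

# Minimal sketch of recording data provenance: each transformation
# appends a step (operation, inputs, output, timestamp) to a log.
provenance = []

def record(operation, inputs, output):
    provenance.append({
        "operation": operation,
        "inputs": inputs,
        "output": output,
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    })
    return output

# Example: a two-step derivation of a mean from raw sensor readings.
raw = [12.1, 11.8, None, 12.4]
cleaned = record("drop_missing", raw, [v for v in raw if v is not None])
mean = record("mean", cleaned, sum(cleaned) / len(cleaned))

print(round(mean, 2))                              # 12.1
print([step["operation"] for step in provenance])  # the derivation chain
```

With such a log, a reader can see not just that a mean of 12.1 was reported, but that one missing value was dropped first, which is exactly the kind of detail a narrative methods section tends to omit.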


Author(s):  
Katharine Barker ◽  
Jonas Astrin ◽  
Gabriele Droege ◽  
Jonathan Coddington ◽  
Ole Seberg

Most successful research programs depend on easily accessible and standardized research infrastructures. Until recently, access to tissue or DNA samples with standardized metadata and of sufficiently high quality has been a major bottleneck for genomic research. The Global Genome Biodiversity Network (GGBN) fills this critical gap by offering standardized, legal access to samples. Presently, GGBN's core activity is enabling access to searchable DNA and tissue collections across natural history museums and botanic gardens. Activities are gradually being expanded to encompass all kinds of biodiversity biobanks, such as culture collections, zoological gardens, aquaria, arboreta, and environmental biobanks. Broadly speaking, these collections all provide long-term storage and standardized public access to samples useful for molecular research. GGBN facilitates sample search and discovery for its distributed member collections through a single entry point. It stores standardized information on mostly geo-referenced, vouchered samples, their physical location, availability, quality, and the necessary legal information on over 50,000 species of Earth's biodiversity, from unicellular to multicellular organisms. The GGBN Data Portal and the GGBN Data Standard are complementary to existing infrastructures such as the Global Biodiversity Information Facility (GBIF) and the International Nucleotide Sequence Database Collaboration (INSDC). Today, many well-known open-source collection management databases, such as Arctos, Specify, and Symbiota, are implementing the GGBN Data Standard. GGBN continues to increase its collections strategically, based on the needs of the research community, adding over 1.3 million online records in 2018 alone; today, two million sample records are available through GGBN. 
Together with the Consortium of European Taxonomic Facilities (CETAF), the Society for the Preservation of Natural History Collections (SPNHC), Biodiversity Information Standards (TDWG), and the Synthesis of Systematic Resources (SYNTHESYS+), GGBN provides best practices for biorepositories on meeting the requirements of the Nagoya Protocol on Access and Benefit Sharing (ABS). Through collaboration with the Biodiversity Heritage Library (BHL), GGBN is exploring options for tagging publications that reference GGBN collections and associated specimens, made searchable through GGBN's document library. Through its collaborative efforts, standards, and best practices, GGBN aims to facilitate trust and transparency in the use of genetic resources.

