Aligning Standards Communities: Sustainable Darwin Core MIxS Interoperability

Author(s):  
Raïssa Meyer ◽  
Pier Buttigieg ◽  
John Wieczorek ◽  
Thomas Jeppesen ◽  
William Duncan ◽  
...  

Biodiversity is increasingly being assessed using omic technologies (e.g. metagenomics or metatranscriptomics); however, the metadata generated by omic investigations is not fully harmonised with that of the broader biodiversity community. There are two major communities developing metadata standards specifications relevant to omic biodiversity data: TDWG, through its Darwin Core (DwC) standard, and the Genomic Standards Consortium (GSC), through its Minimum Information about any (x) Sequence (MIxS) checklists. To prevent these specifications leading to silos between the communities using them (e.g. INSDC: an internationally mandated database collaboration for nucleotide sequencing data [from health, biodiversity, microbiology, etc.] using the MIxS checklists; OBIS and GBIF: global biodiversity data networks using the DwC standard), there is a need to harmonise them at the level of the standards organisations themselves. To this end, we have brought together representatives from these standardisation bodies, along with representatives from established biodiversity data infrastructures, domain experts, data generators, and publishers to develop sustainable interoperability between the two specifications. Together, we have generated a semantic mapping between the terminology used in each specification, and a syntactic mapping of their associated values, following the Simple Standard for Sharing Ontology Mappings (SSSOM), and created an example MIxS-DwC extension showing the incorporation of unmapped MIxS terms into a DwC-Archive. To sustain these mechanisms of interoperability, we have proposed a Memorandum of Understanding between the GSC and TDWG. 
During our work, we also noted a number of key challenges that currently preclude interoperation between these two specifications. In this talk, we will outline the major steps we took to get here, as well as the future activities we recommend based on our outputs.
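
A minimal sketch of what an SSSOM-style mapping table between the two specifications could look like. The four column names are core SSSOM fields; the term identifiers and the rows themselves are hypothetical placeholders chosen for illustration and are not taken from the published MIxS-DwC mapping.

```python
import csv
import io

# Core SSSOM columns. The rows below are illustrative assumptions only,
# not entries from the actual MIxS-DwC mapping.
SSSOM_COLUMNS = ["subject_id", "predicate_id", "object_id", "mapping_justification"]

example_rows = [
    {
        "subject_id": "mixs:collection_date",   # hypothetical CURIE
        "predicate_id": "skos:closeMatch",
        "object_id": "dwc:eventDate",
        "mapping_justification": "semapv:ManualMappingCuration",
    },
    {
        "subject_id": "mixs:geo_loc_name",      # hypothetical CURIE
        "predicate_id": "skos:relatedMatch",
        "object_id": "dwc:locality",
        "mapping_justification": "semapv:ManualMappingCuration",
    },
]


def to_sssom_tsv(rows):
    """Serialise mapping rows as tab-separated text with an SSSOM-style header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=SSSOM_COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def unmapped_subjects(all_subject_ids, rows):
    """Return subject terms with no mapping row, i.e. candidates for an extension."""
    mapped = {row["subject_id"] for row in rows}
    return [term for term in all_subject_ids if term not in mapped]
```

Terms returned by `unmapped_subjects` correspond to the unmapped MIxS terms that the abstract proposes to carry in a MIxS-DwC extension.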

2020 ◽  
Vol 15 (4) ◽  
pp. 411-437 ◽  
Author(s):  
Marcos Zárate ◽  
Germán Braun ◽  
Pablo Fillottrani ◽  
Claudio Delrieux ◽  
Mirtha Lewis

Great progress has been made recently in digitizing the world’s available Biodiversity and Biogeography data, but managing data from many different providers and research domains remains a challenge. A review of the current landscape of metadata standards and ontologies in Biodiversity sciences suggests that existing standards, such as the Darwin Core terminology, are inadequate for describing Biodiversity data in a semantically meaningful and computationally useful way. As a contribution to fill this gap, we present an ontology-based system, called BiGe-Onto, designed to jointly manage data from Biodiversity and Biogeography. As data sources, we use two internationally recognized repositories: the Global Biodiversity Information Facility (GBIF) and the Ocean Biogeographic Information System (OBIS). The BiGe-Onto system is composed of (i) the BiGe-Onto Architecture, (ii) a conceptual model called BiGe-Onto specified in OntoUML, (iii) an operational version of BiGe-Onto encoded in OWL 2, and (iv) an integrated dataset for its exploitation through a SPARQL endpoint. We will show use cases that allow researchers to answer questions that draw on information from both domains.


Author(s):  
Lauren Weatherdon

Ensuring that we have the data and information necessary to make informed decisions is a core requirement in an era of increasing complexity and anthropogenic impact. With cumulative challenges such as the decline in biodiversity and accelerating climate change, the need for spatially-explicit and methodologically-consistent data that can be compiled to produce useful and reliable indicators of biological change and ecosystem health is growing. Technological advances—including satellite imagery—are beginning to make this a reality, yet uptake of biodiversity information standards and scaling of data to ensure its applicability at multiple levels of decision-making are still in progress. The complementary Essential Biodiversity Variables (EBVs) and Essential Ocean Variables (EOVs), combined with Darwin Core and other data and metadata standards, provide the underpinnings necessary to produce data that can inform indicators. However, perhaps the largest challenge in developing global, biological change indicators is achieving consistent and holistic coverage over time, with recognition of biodiversity data as global assets that are critical to tracking progress toward the UN Sustainable Development Goals and Targets set by the international community (see Jensen and Campbell (2019) for discussion). Through this talk, I will describe some of the efforts towards producing and collating effective biodiversity indicators, such as those based on authoritative datasets like the World Database on Protected Areas (https://www.protectedplanet.net/), and work achieved through the Biodiversity Indicators Partnership (https://www.bipindicators.net/). I will also highlight some of the characteristics of effective indicators, and global biodiversity reporting and communication needs as we approach 2020 and beyond.


Author(s):  
Filipi Soares ◽  
Benildes Maculan ◽  
Debora Drucker

Agricultural Biodiversity has been defined by the Convention on Biological Diversity as the set of elements of biodiversity that are relevant to agriculture and food production. These elements are arranged into an agro-ecosystem that encompasses "the variability among living organisms from all sources including terrestrial, marine and other aquatic ecosystems and the ecological complexes of which they are part: this includes diversity within species, between species and of ecosystems" (UNEP 1992). As with any other field in Biology, Agricultural Biodiversity work produces data. To publish data in a way that it can be efficiently retrieved on the web, one must describe it with proper metadata. A metadata element set is a group of statements made about something. These statements have three elements, named subject (thing represented), predicate (space filled up with data) and object (data itself). This representation is called a triple. For example, the title is a metadata element. A book is the subject; title is the predicate; and The Chronicles of Narnia is the object. Some metadata standards have been developed to describe biodiversity data, such as the ABCD Data Schema, Darwin Core (DwC) and Ecological Metadata Language (EML). The DwC is said to be the most used metadata standard to publish data about species occurrence worldwide (Global Biodiversity Information Facility 2019). "Darwin Core is a standard maintained by the Darwin Core maintenance group. It includes a glossary of terms (in other contexts these might be called properties, elements, fields, columns, attributes, or concepts) intended to facilitate the sharing of information about biological diversity by providing identifiers, labels, and definitions. Darwin Core is primarily based on taxa, their occurrence in nature as documented by observations, specimens, samples, and related information" (Biodiversity Information Standards (TDWG) 2014). 
Within this thematic context, a master's research project is in progress at the Federal University of Minas Gerais in partnership with the Brazilian Agricultural Research Corporation (EMBRAPA). It aims to apply DwC to Brazil’s Agricultural Biodiversity data. A pragmatic analysis of DwC and DwC Extensions demonstrated that important concepts and relations from Agricultural Biodiversity are not represented in DwC elements. For example, DwC does not have significant metadata to describe biological interactions, which convey important information about relations between organisms from an ecological perspective. Pollination is one of the biological interactions relevant to Agricultural Biodiversity, for which we need enhanced metadata. Given these problems, the principles of metadata construction of DwC will be followed in order to develop a metadata extension able to represent data about Agricultural Biodiversity. These principles come from the Dublin Core Abstract Model, which presents propositions for creating the triples (subject-predicate-object). The standard format of DwC Extensions (see Darwin Core Archive Validator) will be followed to shape the metadata extension. At the end of the research, we expect to present a model DwC metadata record to publish data about Agricultural Biodiversity in Brazil, including metadata already existing in Simple DwC and the new metadata of Brazil’s Agricultural Biodiversity Metadata Extension. The resulting extension will be useful for representing Agricultural Diversity worldwide.
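
The subject-predicate-object structure described in this abstract can be sketched in a few lines. This is only an illustration of the triple idea, using the book example from the text; the `Triple` type, the `describe` helper, and the pollination statement are hypothetical and do not represent terms from DwC or from the planned extension.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One metadata statement: subject, predicate, object."""
    subject: str
    predicate: str
    obj: str


# The example from the text: a book whose title is The Chronicles of Narnia.
book_title = Triple("a book", "title", "The Chronicles of Narnia")

# Hypothetical interaction statement; "pollinates" is NOT an existing DwC term.
pollination = Triple("bee occurrence 1", "pollinates", "plant occurrence 2")


def describe(triples, subject):
    """Collect every predicate/object pair stated about one subject."""
    return {t.predicate: t.obj for t in triples if t.subject == subject}
```

Calling `describe([book_title, pollination], "a book")` gathers the statements made about the book into a simple record, which is roughly how a flat metadata record relates to its underlying triples.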


Author(s):  
Edward Gilbert ◽  
Corinna Gries ◽  
Nico Franz ◽  
Leslie R. Landrum ◽  
Thomas H. Nash III

The SEINet Portal Network has a complex social and development history spanning nearly two decades. Initially established as a basic online search engine for a select handful of biological collections curated within the southwestern United States, SEINet has since matured into a biodiversity data network incorporating more than 330 institutions and 1,900 individual data contributors. Participating institutions manage and publish over 14 million specimen records, 215,000 observations, and 8 million images. Approximately 70% of the collections make use of the data portal as their primary "live" specimen management platform. The SEINet interface now supports 13 regional data portals distributed across the United States and northern Mexico (http://symbiota.org/docs/seinet/). Through many collaborative efforts, it has matured into a tool for biodiversity data exploration, which includes species inventories, interactive identification keys, specimen and field images, taxonomic information, species distribution maps, and taxonomic descriptions. SEINet’s initial developmental goals were to construct a read-only interface that integrated specimen records harvested from a handful of distributed natural history databases. Intermittent network connectivity and inconsistent data exchange protocols frequently restricted data persistence. National funding opportunities supported a complete redesign towards the development of a centralized data cache model with periodic "snapshot" updates from original data sources. A service-based management infrastructure was integrated into the interface to mobilize small- to medium-sized collections (<1 million specimen records) that commonly lack consistent infrastructure and technical expertise to maintain a standards-compliant specimen database. These developments were the precursors to the Symbiota software project (Gries et al. 2014). 
Through further development of Symbiota, SEINet transformed into a robust specimen management system specifically geared toward specimen digitization with features including data entry from label images, harvesting data from specimen duplicates, batch georeferencing, data validation and cleaning, generating progress reports, and additional tools to improve the efficiency of the digitization process. The central developmental paradigm focused on data mobilization through the production of: a versatile import module capable of ingesting a diverse range of data structures, a robust toolkit to assist in digitizing and managing specimen data and images, and a Darwin Core Archive (DwC-A) compliant data publishing and export toolkit to facilitate data distribution to global aggregators such as Global Biodiversity Information Facility (GBIF) and iDigBio. User interfaces consist of a decentralized network of regional data portals, all connecting to a centralized shared data source. Each of the 13 data portals is configured to present a regional perspective specifically tailored to represent the needs of the local research community. This infrastructure has supported the formation of regional consortia, which provide network support to aid local institutions in digitizing and publishing their collections within the network. The community-based infrastructure creates a sense of ownership – perhaps even good-natured competition – by the data providers and provides extra incentive to improve data quality and expand the network. Certain areas of development remain challenging in spite of the project's overall success. 
For instance, data managers continuously struggle to maintain a current local taxonomic thesaurus used for name validation, data cleaning, and to resolve taxonomic discrepancies commonly encountered when integrating collection datasets. We will discuss the successes and challenges associated with the long-term sustainability model and explore potential future paths for SEINet that support the long-term goal of maintaining a data provider that is in full compliance with the FAIR use principles of making the datasets findable, accessible, interoperable, and reusable (Wilkinson et al. 2016).


2017 ◽  
Vol 83 (17) ◽  
Author(s):  
Laura M. Carroll ◽  
Jasna Kovac ◽  
Rachel A. Miller ◽  
Martin Wiedmann

ABSTRACT The Bacillus cereus group comprises nine species, several of which are pathogenic. Differentiating between isolates that may cause disease and those that do not is a matter of public health and economic importance, but it can be particularly challenging due to the high genomic similarity within the group. To this end, we have developed BTyper, a computational tool that employs a combination of (i) virulence gene-based typing, (ii) multilocus sequence typing (MLST), (iii) panC clade typing, and (iv) rpoB allelic typing to rapidly classify B. cereus group isolates using nucleotide sequencing data. BTyper was applied to a set of 662 B. cereus group genome assemblies to (i) identify anthrax-associated genes in non-B. anthracis members of the B. cereus group, and (ii) identify assemblies from B. cereus group strains with emetic potential. With BTyper, the anthrax toxin genes cya, lef, and pagA were detected in 8 genomes classified by the NCBI as B. cereus that clustered into two distinct groups using k-medoids clustering, while either the B. anthracis poly-γ-d-glutamate capsule biosynthesis genes capABCDE or the hyaluronic acid capsule hasA gene was detected in an additional 16 assemblies classified as either B. cereus or Bacillus thuringiensis isolated from clinical, environmental, and food sources. The emetic toxin genes cesABCD were detected in 24 assemblies belonging to panC clades III and VI that had been isolated from food, clinical, and environmental settings. The command line version of BTyper is available at https://github.com/lmc297/BTyper . In addition, BMiner, a companion application for analyzing multiple BTyper output files in aggregate, can be found at https://github.com/lmc297/BMiner .
IMPORTANCE Bacillus cereus is a foodborne pathogen that is estimated to cause tens of thousands of illnesses each year in the United States alone. Even with molecular methods, it can be difficult to distinguish nonpathogenic B. cereus group isolates from their pathogenic counterparts, including the human pathogen Bacillus anthracis, which is responsible for anthrax, as well as the insect pathogen B. thuringiensis. By using the variety of typing schemes employed by BTyper, users can rapidly classify, characterize, and assess the virulence potential of any isolate using its nucleotide sequencing data.


2021 ◽  
Vol 14 (1) ◽  
Author(s):  
Haeseung Lee ◽  
Min-Goo Seo ◽  
Seung-Hun Lee ◽  
Jae-Ku Oem ◽  
Seon-Hee Kim ◽  
...  

Background: Bats are hosts for many ectoparasites and act as reservoirs for several infectious agents, some of which exhibit zoonotic potential. Here, species of bats and bat flies were identified and screened for microorganisms that could be mediated by bat flies. Methods: Bat species were identified on the basis of their morphological characteristics. Bat flies associated with bat species were initially morphologically identified and further identified at the genus level by analyzing the cytochrome c oxidase subunit I gene. Different vector-borne pathogens and endosymbionts were screened using PCR to assess all possible relationships among bats, parasitic bat flies, and their associated organisms. Results: Seventy-four bat flies were collected from 198 bats; 66 of these belonged to Nycteribiidae and eight to Streblidae families. All Streblidae bat flies were hosted by Rhinolophus ferrumequinum, known as the most common Korean bat. Among the 74 tested bat flies, PCR and nucleotide sequencing data showed that 35 (47.3%) and 20 (27.0%) carried Wolbachia and Bartonella bacteria, respectively, whereas tests for Anaplasma, Borrelia, Hepatozoon, Babesia, Theileria, and Coxiella were negative. Phylogenetic analysis revealed that Wolbachia endosymbionts belonged to two different supergroups, A and F. One sequence of Bartonella was identical to that of Bartonella isolated from Taiwanese bats. Conclusions: The vectorial role of bat flies should be checked by testing the same pathogen and bacterial organisms by collecting blood from host bats. This study is of great interest in the fields of disease ecology and public health owing to the bats’ potential to transmit pathogens to humans and/or livestock.


ZooKeys ◽  
2018 ◽  
Vol 751 ◽  
pp. 129-146 ◽  
Author(s):  
Robert Mesibov

A total of ca 800,000 occurrence records from the Australian Museum (AM), Museums Victoria (MV) and the New Zealand Arthropod Collection (NZAC) were audited for changes in selected Darwin Core fields after processing by the Atlas of Living Australia (ALA; for AM and MV records) and the Global Biodiversity Information Facility (GBIF; for AM, MV and NZAC records). Formal taxon names in the genus- and species-groups were changed in 13–21% of AM and MV records, depending on dataset and aggregator. There was little agreement between the two aggregators on processed names, with names changed in two to three times as many records by one aggregator alone compared to records with names changed by both aggregators. The type status of specimen records did not change with name changes, resulting in confusion as to the name with which a type was associated. Data losses of up to 100% were found after processing in some fields, apparently due to programming errors. The taxonomic usefulness of occurrence records could be improved if aggregators included both original and the processed taxonomic data items for each record. It is recommended that end-users check original and processed records for data loss and name replacements after processing by aggregators.
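
The recommended end-user check, comparing original and processed records for name replacements and data loss, might be sketched as follows. This is an assumed approach for illustration, not the audit procedure used in the paper; the record identifiers, the placeholder names, and both helper functions are hypothetical, while `scientificName` and `typeStatus` are existing Darwin Core terms.

```python
def changed_fraction(original, processed, field):
    """Fraction of shared records whose `field` value differs after processing."""
    shared = original.keys() & processed.keys()
    if not shared:
        return 0.0
    changed = sum(
        1 for rid in shared if original[rid].get(field) != processed[rid].get(field)
    )
    return changed / len(shared)


def lost_fraction(original, processed, field):
    """Fraction of records populated in the original but empty after processing."""
    populated = [
        rid for rid in original.keys() & processed.keys() if original[rid].get(field)
    ]
    if not populated:
        return 0.0
    lost = sum(1 for rid in populated if not processed[rid].get(field))
    return lost / len(populated)


# Hypothetical records keyed by a record identifier; names are placeholders.
original = {
    "r1": {"scientificName": "Aus bus", "typeStatus": "holotype"},
    "r2": {"scientificName": "Aus cus", "typeStatus": "paratype"},
    "r3": {"scientificName": "Aus dus", "typeStatus": ""},
}
processed = {
    "r1": {"scientificName": "Xus bus", "typeStatus": "holotype"},
    "r2": {"scientificName": "Aus cus", "typeStatus": ""},
    "r3": {"scientificName": "Aus dus", "typeStatus": ""},
}
```

With the sample data, one of three names was replaced and one of the two populated `typeStatus` values was lost, which is the kind of per-field summary the audit reports.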


Author(s):  
Beckett Sterner ◽  
Nathan Upham ◽  
Prashant Gupta ◽  
Caleb Powell ◽  
Nico Franz

Making the most of biodiversity data requires linking observations of biological species from multiple sources both efficiently and accurately (Bisby 2000, Franz et al. 2016). Aggregating occurrence records using taxonomic names and synonyms is computationally efficient but known to experience significant limitations on accuracy when the assumption of one-to-one relationships between names and biological entities breaks down (Remsen 2016, Franz and Sterner 2018). Taxonomic treatments and checklists provide authoritative information about the correct usage of names for species, including operational representations of the meanings of those names in the form of range maps, reference genetic sequences, or diagnostic traits. They increasingly provide taxonomic intelligence in the form of precise description of the semantic relationships between different published names in the literature. Making this authoritative information Findable, Accessible, Interoperable, and Reusable (FAIR; Wilkinson et al. 2016) would be a transformative advance for biodiversity data sharing and help drive adoption and novel extensions of existing standards such as the Taxonomic Concept Schema and the OpenBiodiv Ontology (Kennedy et al. 2006, Senderov et al. 2018). We call for the greater, global Biodiversity Information Standards (TDWG) and taxonomy community to commit to extending and expanding on how FAIR applies to biodiversity data and include practical targets and criteria for the publication and digitization of taxonomic concept representations and alignments in taxonomic treatments, checklists, and backbones. As a motivating case, consider the abundantly sampled North American deer mouse—Peromyscus maniculatus (Wagner 1845)—which was recently split from one continental species into five more narrowly defined forms, so that the name P. maniculatus is now only applied east of the Mississippi River (Bradley et al. 2019, Greenbaum et al. 2019). 
That single change instantly rendered ambiguous ~7% of North American mammal records in the Global Biodiversity Information Facility (n=242,663, downloaded 2021-06-04; GBIF.org 2021) and ⅓ of all National Ecological Observatory Network (NEON) small mammal samples (n=10,256, downloaded 2021-06-27). While this type of ambiguity is common in name-based databases when species are split, the example of P. maniculatus is particularly striking for its impact upon biological questions ranging from hantavirus surveillance in North America to studies of climate change impacts upon rodent life-history traits. Of special relevance to NEON sampling is recent evidence suggesting deer mice potentially transmit SARS-CoV-2 (Griffin et al. 2021). Automating the updating of occurrence records in such cases and others will require operational representations of taxonomic concepts (e.g., range maps, reference sequences, and diagnostic traits) that are FAIR in addition to taxonomic concept alignment information (Franz and Peet 2009). Despite steady progress, it remains difficult to find, access, and reuse authoritative information about how to apply taxonomic names even when it is already digitized. It can also be difficult to tell without manual inspection whether similar types of concept representations derived from multiple sources, such as range maps or reference sequences selected from different research articles or checklists, are in fact interoperable for a particular application. The issue is therefore different from important ongoing efforts to digitize trait information in species circumscriptions, for example, and focuses on how already digitized knowledge can best be packaged to inform human experts and artificial intelligence applications (Sterner and Franz 2017). 
We therefore propose developing community guidelines and criteria for FAIR taxonomic concept representations as "semantic artefacts" of general relevance to linked open data and life sciences research (Le Franc et al. 2020).


Author(s):  
Yanina Sica ◽  
Paula Zermoglio

Biodiversity inventories, i.e., recording multiple species at a specific place and time, are routinely performed and offer high-quality data for characterizing biodiversity and its change. Digitization, sharing and reuse of incidental point records (i.e., records that are not readily associated with systematic sampling or monitoring, typically museum specimens and many observations from citizen science projects) has been the focus for many years in the biodiversity data community. Only more recently, attention has been directed towards mobilizing data from both new and longstanding inventories and monitoring efforts. These kinds of studies provide very rich data that can enable inferences about species absence, but their reliability depends on the methodology implemented, the survey effort and completeness. The information about these elements has often been regarded as metadata and captured in an unstructured manner, thus making their full use very challenging. Unlocking and integrating inventory data requires data standards that can facilitate capture and sharing of data with the appropriate depth. The Darwin Core standard (Wieczorek et al. 2012) currently enables reporting some of the information contained in inventories, particularly using Darwin Core Event terms such as samplingProtocol, sampleSizeValue, sampleSizeUnit, samplingEffort. However, it is limited in its ability to accommodate spatial, temporal, and taxonomic scopes, and other key aspects of the inventory sampling process, such as direct or inferred measures of sampling effort and completeness. The lack of a standardized way to share inventory data has hindered their mobilization, integration, and broad reuse. In an effort to overcome these limitations, a framework was developed to standardize inventory data reporting: Humboldt Core (Guralnick et al. 2018). 
Humboldt Core identified three types of inventories (single, elementary, and summary inventories) and proposed a series of terms to report their content. These terms were organized in six categories: dataset and identification; geospatial and habitat scope; temporal scope; taxonomic scope; methodology description; and completeness and effort. While originally planned as a new TDWG standard and being currently implemented in Map of Life (https://mol.org/humboldtcore/), ratification was not pursued at the time, thus limiting broader community adoption. In 2021 the TDWG Humboldt Core Task Group was established to review how to best integrate the terms proposed in the original publication with existing standards and implementation schemas. The first goal of the task group was to determine whether a new, separate standard was needed or if an extension to Darwin Core could accommodate the terms necessary to describe the relevant information elements. Since the different types of inventories can be thought of as Events with different nesting levels (events within events, e.g., plots within sites), and after an initial mapping to existing Darwin Core terms, it was deemed appropriate to start from a Darwin Core Event Core and build an extension to include Humboldt Core terms. The task group members are currently revising all original Humboldt Core terms, reformulating definitions, comments, and examples, and discarding or adding new terms where needed. We are also gathering real datasets to test the use of the extension once an initial list of revised terms is ready, before undergoing a public review period as established by the TDWG process. Through the ratification of Humboldt Core as a TDWG extension, we expect to provide the community with a solution to share and use inventory data, which improves biodiversity data discoverability, interoperability and reuse while lowering the reporting burden at different levels (data collection, integration and sharing).
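
The idea of inventories as nested Events (e.g., plots within sites) can be sketched with the Darwin Core Event terms named in this abstract plus `eventID` and `parentEventID`. All values below are hypothetical, and the `nesting_level` and `children_of` helpers are illustrative only; they are not part of Darwin Core or of the Humboldt Core extension.

```python
# Hypothetical Event records; the keys are Darwin Core Event terms.
events = [
    {
        "eventID": "site-1",
        "parentEventID": None,
        "samplingProtocol": "site survey",
        "sampleSizeValue": 4,
        "sampleSizeUnit": "plots",
        "samplingEffort": "2 observer-days",
    },
    {
        "eventID": "plot-1a",
        "parentEventID": "site-1",
        "samplingProtocol": "quadrat count",
        "sampleSizeValue": 1,
        "sampleSizeUnit": "square metre",
        "samplingEffort": "1 observer-hour",
    },
]

events_by_id = {event["eventID"]: event for event in events}


def nesting_level(event_id, by_id):
    """Count how many parent Events sit above the given Event."""
    level = 0
    parent = by_id[event_id]["parentEventID"]
    while parent is not None:
        level += 1
        parent = by_id[parent]["parentEventID"]
    return level


def children_of(event_id, by_id):
    """List the eventIDs whose direct parent is the given Event."""
    return [eid for eid, ev in by_id.items() if ev["parentEventID"] == event_id]
```

Here the site sits at level 0 and the plot at level 1, so different inventory types can be told apart by their position in the Event hierarchy rather than by a separate record structure.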


Author(s):  
Dmitry Schigel ◽  
Anders Andersson ◽  
Andrew Bissett ◽  
Anders Finstad ◽  
Frode Fossøy ◽  
...  

Most users will foresee the use of genetic sequences in the context of molecular ecology or phylogenetic research; however, a sequence with coordinates and a timestamp is a valuable biodiversity occurrence that is useful in a much broader context than its original purpose. To uncover this potential, sequence-derived data need to become findable, accessible, interoperable, and reusable through generalist biodiversity data platforms. Stimulated by the Biodiversity_Next discussions in 2019, we have worked for about 10 months to put together practical data mapping and data publishing experiences in Norway, Australia, Sweden, and Denmark, as well as in the UNITE and the GBIF (Global Biodiversity Information Facility) networks. The resulting guide was put together to provide practical instruction for mapping sequence-derived data. Biodiversity data communities remain dominated by the macroscopic, easily detectable, morphologically identifiable species. This is not only true for citizen science and other forms of biodiversity popularization, but is also visible in the university and museum department structures, financial resource allocations, biodiversity legislation, and policy design. Recent decades of molecular advances have increased the power of genetic methods for detecting, describing, and documenting global biodiversity. We have yet to see the wide shift of data generating efforts from the traditional taxonomic foci of biodiversity assessments to the more balanced and inclusive systems focusing on all functionally important taxa and environments. These include soil, limnic and marine environments, decomposing plants and deadwood, and all life therein. Environmental DNA data enable recording of present and past presence of micro- and macroscopic organisms with minimal effort and by non-invasive methods. The apparent ease of these methods requires a cautious approach to the resulting data and their interpretation. 
It remains important to define and agree on the organism recording and reporting routines for genetic data. DNA data represent a major addition to the many ways in which GBIF and other biodiversity data platforms index the living world. Our guide rests on the shoulders of those who have been developing and improving MIxS (Minimum Information about any (x) Sequence), GGBN (Global Genome Biodiversity Network) and other data standards. The added value of publishing sequence-derived data through non-genetic biodiversity discovery platforms relates to spatio-temporal occurrences and sequence-based names. Reporting sequence-derived occurrences in an open and reproducible way has a wide range of benefits: notably, it increases citability, highlights the taxa concerned in the context of biological conservation, and contributes to taxonomic and ecological knowledge.

