Slinging With Four Giants on a Quest to Credit Natural Historians for our Museums and Collections

Author(s):  
David Shorthouse

Bionomia, https://bionomia.net, previously called Bloodhound Tracker, was launched in August 2018 with the aim of illustrating the breadth and depth of expertise required to collect and identify natural history specimens represented in the Global Biodiversity Information Facility (GBIF). This required that specimens and people be uniquely identified and that a granular expression of actions (e.g. "collected", "identified") be adopted. The Darwin Core standard presently combines agents and their actions into the conflated terms recordedBy and identifiedBy, whose values are typically unresolved and unlinked text strings. Bionomia consists of tools, web services, and a responsive website, all of which are used to efficiently guide users to resolve and unequivocally link people to specimens via the first-class actions collected or identified. It also shields users from the complexity of stitching together and seamlessly integrating the services of four giant initiatives: ORCID, Wikidata, GBIF, and Zenodo. All of these initiatives are financially sustainable and well used by many stakeholders well outside this narrow use case. As a result, the links between person and specimen made by users of Bionomia are given every opportunity to persist, to represent credit for effort, and to flow into collection management systems as meaningful new entries. To date, 13M links between people and specimens have been made, including 2M negative associations, on 12.5M specimen records. These links were made either by the collectors themselves or by 84 people who have attributed specimen records to their peers, mentors, and others they revere.

Integration With ORCID and Wikidata

People are identified in Bionomia through synchronization with ORCID and Wikidata, by reusing their unique identifiers and drawing in their metadata. ORCID identifiers are used by living researchers to link their identities to their research outputs. ORCID services include OAuth2 pass-through authentication for use by developers and web services for programmatic access to its store of public profiles. These contain elements of metadata such as full name, aliases, keywords, countries, education, employment history, affiliations, and links to publications. Bionomia seeds its search directory of people by periodically querying ORCID for specific user-assigned keywords as well as directly through account creation via OAuth2 authentication.

Deceased people are uniquely identified in Bionomia through integration with Wikidata by caching unique 'Q' numbers (identifiers), full names and aliases, countries, occupations, as well as birth and death dates. Profiles are seeded from Wikidata through daily queries for properties that are likely to be assigned to collectors of natural history specimens, such as "Entomologists of the World ID" (= P5370) or "Harvard Index of Botanists ID" (= P6264). Because Wikidata items may be merged, Bionomia captures these merge events, re-associates previously made links to specimen records, and mirrors Wikidata's redirect behaviour. A Wikidata property called "Bionomia ID" (= P6944), whose values are either ORCID identifiers or Wikidata 'Q' numbers, helps facilitate additional integration and reuse.
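To make the seeding step concrete, the following is a minimal sketch, not Bionomia's actual code, of a daily Wikidata query for one such property (P5370, "Entomologists of the World ID") against the public Wikidata SPARQL endpoint; the result handling and field selection are assumptions for illustration.

```python
# Sketch: seed people profiles from Wikidata by querying for items carrying a
# collector-associated property (here P5370). Bionomia's real queries may
# differ in shape and scope.
import requests

WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"

QUERY = """
SELECT ?person ?personLabel ?birth ?death WHERE {
  ?person wdt:P5370 ?entomologist_id .          # has an Entomologists of the World ID
  OPTIONAL { ?person wdt:P569 ?birth . }        # date of birth
  OPTIONAL { ?person wdt:P570 ?death . }        # date of death
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 100
"""

def seed_people():
    response = requests.get(
        WIKIDATA_SPARQL,
        params={"query": QUERY, "format": "json"},
        headers={"User-Agent": "profile-seeder-sketch/0.1"},
        timeout=60,
    )
    response.raise_for_status()
    for row in response.json()["results"]["bindings"]:
        q_number = row["person"]["value"].rsplit("/", 1)[-1]  # e.g. "Q123456"
        yield {
            "qid": q_number,
            "name": row["personLabel"]["value"],
            "birth": row.get("birth", {}).get("value"),
            "death": row.get("death", {}).get("value"),
        }

if __name__ == "__main__":
    for person in seed_people():
        print(person["qid"], person["name"])
```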
Integration with GBIF

Specimen data are downloaded wholesale as Darwin Core Archives from GBIF every two weeks. The purpose of this schedule is to maintain reasonable synchrony with source data, balancing computation time against the expectations of users who desire the most up-to-date view of their specimen records. Collectors with ORCID accounts who have elected to receive notice are informed via email when the authors of newly published papers have made use of their specimen records downloaded from GBIF.

Integration with Zenodo

Finally, users of Bionomia may integrate their ORCID OAuth2 authentication with Zenodo, an industry-recognized archive for research data that enjoys support from the Conseil Européen pour la Recherche Nucléaire (CERN). At the user's request, their specimen data, represented as CSV (comma-separated values) and JSON-LD (JavaScript Object Notation for Linked Data) documents, are pushed into Zenodo, a DataCite DOI is assigned, and a formatted citation appears on their Bionomia profile. New versions of these files are pushed to Zenodo on the user's behalf when new specimen records are linked to them. If users have configured their ORCID account to listen for new entries in DataCite, a new work entry will also be made in their ORCID profile, thus sealing a perpetual, semi-automated loop between GBIF and ORCID that tidily showcases their efforts at collecting and identifying natural history specimens.

Technologies Used

Bionomia uses Apache Spark via scripts written in Scala, a human name parser written in Ruby called dwc_agent, queues of jobs executed through Sidekiq, scores of pairwise similarity in the structure of human names stored in Neo4j, data persistence in MySQL, and a search layer in Elasticsearch. Here, I expand on lessons learned in the construction and maintenance of Bionomia, emphasize the criticality of recognizing the early efforts made by a fledgling community of enthusiasts, and describe useful tools and services that may be integrated into collection management systems to help churn strings of unresolved, unlinked collector and determiner names into actionable identifiers that are gateways to rich sources of information.
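As an illustration of the Zenodo integration described above, here is a hedged sketch of a deposit workflow against Zenodo's public REST API (https://developers.zenodo.org); the token, file, creator, and metadata values are placeholders, and Bionomia's own implementation may differ in its details.

```python
# Sketch: push a specimen CSV into Zenodo and publish it to mint a DataCite
# DOI, following the documented deposit API. Illustrative only.
import os
import requests

ZENODO = "https://zenodo.org/api"
TOKEN = "..."  # personal access token with deposit scope (placeholder)

def deposit_specimen_csv(path: str, title: str) -> str:
    """Create a deposition, upload one CSV, publish, and return the DOI."""
    params = {"access_token": TOKEN}

    # 1. Create an empty deposition and note its id and file bucket.
    dep = requests.post(f"{ZENODO}/deposit/depositions", params=params, json={})
    dep.raise_for_status()
    dep_id, bucket = dep.json()["id"], dep.json()["links"]["bucket"]

    # 2. Upload the CSV into the deposition's file bucket.
    with open(path, "rb") as fh:
        requests.put(f"{bucket}/{os.path.basename(path)}",
                     data=fh, params=params).raise_for_status()

    # 3. Attach minimal metadata, then publish.
    metadata = {"metadata": {
        "title": title,
        "upload_type": "dataset",
        "description": "Specimen records claimed or attributed (sketch).",
        "creators": [{"name": "Doe, Jane"}],  # placeholder creator
    }}
    requests.put(f"{ZENODO}/deposit/depositions/{dep_id}",
                 params=params, json=metadata).raise_for_status()
    pub = requests.post(f"{ZENODO}/deposit/depositions/{dep_id}/actions/publish",
                        params=params)
    pub.raise_for_status()
    return pub.json()["doi"]
```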

Author(s):  
Erica Krimmel
Austin Mast
Deborah Paul
Robert Bruhn
Nelson Rios
...  

Genomic evidence suggests that the causative virus of COVID-19 (SARS-CoV-2) was introduced to humans from horseshoe bats (family Rhinolophidae) (Andersen et al. 2020) and that species in this family, as well as in the closely related families Hipposideridae and Rhinonycteridae, are reservoirs of several SARS-like coronaviruses (Gouilh et al. 2011). Specimens collected over the past 400 years and curated by natural history collections around the world provide an essential reference as we work to understand the distributions, life histories, and evolutionary relationships of these bats and their viruses. While the importance of biodiversity specimens to emerging infectious disease research is clear, empowering disease researchers with specimen data is a relatively new goal for the collections community (DiEuliis et al. 2016). Recognizing this, a team from Florida State University is collaborating with partners at GEOLocate, Bionomia, University of Florida, the American Museum of Natural History, and Arizona State University to produce a deduplicated, georeferenced, vetted, and versioned data product of the world's specimens of horseshoe bats and relatives for researchers studying COVID-19. The project will serve as a model for future rapid deployments of data products about biodiversity specimens. It underscores the value of the biodiversity data aggregators iDigBio and the Global Biodiversity Information Facility (GBIF), which as of July 2020 serve 58,617 and 79,862 records, respectively, of horseshoe bat and relative specimens held by over one hundred natural history collections. Although much of the specimen-based biodiversity data served by iDigBio and GBIF is of high quality, it can be considered raw data and therefore often requires additional wrangling, standardizing, and enhancement to be fit for specific applications. The project will create efficiencies for the coronavirus research community by producing an enhanced, research-ready data product, which will be versioned and published through Zenodo, an open-access repository (see doi.org/10.5281/zenodo.3974999). In this talk, we highlight lessons learned from the initial phases of the project, including deduplicating specimen records, standardizing country information, and enhancing taxonomic information. We also report on our progress to date in enhancing information about agents (e.g., collectors or determiners) associated with these specimens and in georeferencing specimen localities. We further explore the extent to which the added agent information (i.e., ORCID iDs and Wikidata Q identifiers) can inform our georeferencing efforts and support crediting those who collected and identified the specimens. The project will georeference approximately one third of our specimen records, based on those lacking geospatial coordinates but containing textual locality descriptions. We furthermore provide an overview of our holistic approach to enhancing specimen records, which we hope will maximize the value of the bat specimens at the center of what has recently been termed the "extended specimen network" (Lendemer et al. 2020). The centrality of the physical specimen in the network reinforces the importance of archived materials for reproducible research. Recognizing this, we view the collections providing data to iDigBio and GBIF as essential partners, as we expect that they will be responsible for the long-term management of enhanced data associated with the physical specimens they curate.
We hope that this project can provide a model for better facilitating the reintegration of enhanced data back into local specimen data management systems.
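As a minimal sketch of the deduplication step mentioned above, the following Python groups aggregated records by a normalized Darwin Core "triplet" key (institutionCode, collectionCode, catalogNumber) and keeps the most complete record per group; the project's real matching is necessarily fuzzier (e.g., also comparing collector, date, and taxon), and all field values below are invented.

```python
from collections import defaultdict

def norm(value):
    """Lower-case and strip a Darwin Core field, treating None as empty."""
    return (value or "").strip().lower()

def dedupe(records):
    """Group records sharing a normalized triplet key and keep the most
    complete record from each group."""
    groups = defaultdict(list)
    for rec in records:
        key = (norm(rec.get("institutionCode")),
               norm(rec.get("collectionCode")),
               norm(rec.get("catalogNumber")))
        groups[key].append(rec)
    for group in groups.values():
        # Prefer the record with the most populated fields.
        yield max(group, key=lambda r: sum(1 for v in r.values() if v))

records = [
    {"institutionCode": "AMNH", "collectionCode": "Mammals",
     "catalogNumber": "M-12345", "scientificName": "Rhinolophus affinis"},
    {"institutionCode": "amnh", "collectionCode": "mammals",
     "catalogNumber": "M-12345", "scientificName": "Rhinolophus affinis",
     "country": "Viet Nam"},
]
print(list(dedupe(records)))  # one record survives: the more complete one
```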


Author(s):  
David Shorthouse
Roderic Page

Through the Bloodhound proof-of-concept, https://bloodhound-tracker.net, an international audience of collectors and determiners of natural history specimens is engaged in the emotive act of claiming their specimens and attributing other specimens to living and deceased mentors and colleagues. Behind the scenes, these claims build links between Open Researcher and Contributor Identifiers (ORCID, https://orcid.org) or Wikidata identifiers for people and Global Biodiversity Information Facility (GBIF) specimen identifiers, predicated on the Darwin Core terms recordedBy (collected) and identifiedBy (determined). Here we additionally describe the socio-technical challenge of unequivocally resolving people names in legacy specimen data and propose lightweight and reusable solutions. The unique identifiers for the affiliations of active researchers are obtained from ORCID, whereas the unique identifiers for the institutions where specimens are actively curated are resolved through Wikidata. By constructing closed loops of links between person, specimen, and institution, an interesting suite of potential metrics emerges, all due to the activities of employees and their network of professional relationships. This approach balances the desire of individuals to receive formal recognition for their efforts in natural history collections with an institutional-level need to adjust budgets in response to easily obtained numeric trends in national and international reach. If handled in a coordinated fashion, this reporting technique may be a significant new driver for specimen digitization efforts on par with Altmetric, https://www.altmetric.com, an important new tool that tracks the impact of publications and delights administrators and authors alike.
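As a toy illustration of the closed loops described above, the sketch below tallies one possible metric, specimens per curating institution and per person, from person-specimen-institution links; all identifiers are invented placeholders shaped like ORCID iDs, GBIF record identifiers, and Wikidata Q numbers.

```python
from collections import Counter

# (person identifier, specimen identifier, curating institution identifier)
# All values are placeholders for illustration only.
links = [
    ("0000-0002-0000-0001", "gbif:occ:111", "Q11111111"),
    ("0000-0002-0000-0001", "gbif:occ:112", "Q11111111"),
    ("Q22222222",           "gbif:occ:113", "Q33333333"),
]

specimens_per_institution = Counter(inst for _, _, inst in links)
specimens_per_person = Counter(person for person, _, _ in links)

print(specimens_per_institution.most_common())  # institutional-level reach
print(specimens_per_person.most_common())       # individual-level credit
```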


2018
Vol 2
pp. e26473
Author(s):  
Molly Phillips
Anne Basham
Marc Cubeta
Kari Harris
Jonathan Hendricks
...  

Natural history collections around the world are currently being digitized, with the resulting data and associated media now shared online through aggregators such as the Global Biodiversity Information Facility (GBIF) and Integrated Digitized Biocollections (iDigBio). These collections and their resources are accessible and discoverable through online portals, not only to researchers and collections professionals but also to educators, students, and other potential downstream users. Primary and secondary education (K-12) in the United States is going through its own revolution, with many states adopting the Next Generation Science Standards (NGSS, https://www.nextgenscience.org/). The new standards emphasize science practices for analyzing and interpreting data and connect to cross-cutting concepts such as cause and effect and patterns. NGSS and natural history collections data portals seem to complement each other. Nevertheless, many educators and students are unaware of the digital resources available or are overwhelmed by working in aggregated databases created by scientists. To better address this challenge, participants within the National Science Foundation's Advancing Digitization of Biodiversity Collections (ADBC) program have been working to increase awareness of, and scaffold learning for, digitized collections among K-12 educators and learners. They are accomplishing this through individual programs at institutions across the country as part of the Thematic Collections Networks and collaboratively through the iDigBio Education and Outreach Working Group. ADBC partners have focused on incorporating digital data and resources into K-12 classrooms through training workshops and webinars for both educators and collections professionals, as well as through creating educational resources, websites, and applications that use digital collections data. This presentation includes lessons learned from engaging K-12 audiences with digital data, summarizes available resources for both educators and collections professionals, shares how to become involved, and provides ways to facilitate transfer of educational resources to the K-12 community.


Author(s):  
Mikko Heikkinen
Anniina Kuusijärvi
Ville-Matti Riihikoski
Leif Schulman

Many natural history museums share a common problem: a multitude of legacy collection management systems (CMS) and the difficulty of finding a new system to replace them. Kotka is a CMS developed since 2011 at the Finnish Museum of Natural History (Luomus) and the Finnish Biodiversity Information Facility (FinBIF) (Heikkinen et al. 2019, Schulman et al. 2019) to solve this problem. It has grown into a national system used by all natural history museums in Finland and currently contains over two million specimens from several domains (zoological, botanical, paleontological, microbial, tissue sample, and botanic garden collections). Kotka is a web application where data can be entered, edited, searched, and exported through a browser-based user interface. It supports designing and printing specimen labels, handling collection metadata and specimen transactions, and helps support Nagoya Protocol compliance.

Creating a shared system for multiple institutions and collection types is difficult due to differences in their current processes, data formats, future needs, and opinions. The more independent actors are involved, the more complicated the development becomes, and successful development requires some trade-offs. Kotka has chosen features and development principles that emphasize fast development for a multitude of different purposes. Kotka was developed using agile methods, with a single person (a product owner) making development decisions based on, e.g., strategic objectives, customer value, and user feedback. Technical design emphasizes efficient development and usage over completeness and formal structure of the data. It applies simple and pragmatic approaches and improves collection management by providing practical tools for the users. In these regards, Kotka differs in many ways from a traditional CMS.

Kotka stores data in a mostly denormalized free-text format and uses a simple hierarchical data model. This allows greater flexibility and makes it easy to add new data fields and structures based on user feedback. Data harmonization and quality assurance is a continuous process, instead of being done before data enter the system. For example, specimen data with a taxon name can be entered into Kotka before the taxon name has been entered into the accompanying FinBIF taxonomy database. As an example, here are simplified data about two specimens in Kotka that have not yet been fully harmonized:

Specimen 1
Taxon: Corvus corone cornix
Country: FI
Collector: Doe, John
Coordinates: 668, 338
Coordinate system: Finnish uniform coordinate system

Specimen 2
Taxon: Corvus cornix
Country: Finland
Collector: Doe, J.
Coordinates: 60.2442, 25.7201
Coordinate system: WGS84

Kotka's data model does not follow standards, but has grown organically to reflect the practical needs of its users. This is particularly true of data collected in research projects, which are often unique and complicated (e.g. complex relationships between species), requiring new data fields and/or storing data as free text. The majority of the data can be converted into simplified standard formats (e.g. Darwin Core) for sharing. The main challenge with this has been the vague definitions of many data sharing formats (e.g. Darwin Core, the CETAF Specimen Preview Profile (CETAF 2020)), which allow different interpretations.
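As a rough sketch of the kind of conversion described above, the following maps a Kotka-style denormalized record to simplified Darwin Core terms; the Kotka-side field names are illustrative, and only the Darwin Core terms are standard.

```python
# Sketch: flatten a denormalized record into simplified Darwin Core terms for
# sharing. Harmonization (e.g., country codes to names) happens on export.
KOTKA_TO_DWC = {
    "Taxon": "scientificName",
    "Country": "country",
    "Collector": "recordedBy",
}

COUNTRY_CODES = {"FI": "Finland"}  # harmonize ISO codes to names on export

def to_darwin_core(record: dict) -> dict:
    dwc = {KOTKA_TO_DWC[k]: v for k, v in record.items() if k in KOTKA_TO_DWC}
    if "country" in dwc:
        dwc["country"] = COUNTRY_CODES.get(dwc["country"], dwc["country"])
    # Coordinates in a national system (e.g. the Finnish uniform grid) would
    # need a CRS transform to WGS84 before filling decimalLatitude/Longitude.
    return dwc

print(to_darwin_core({"Taxon": "Corvus cornix", "Country": "FI",
                      "Collector": "Doe, John"}))
```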
Kotka trusts its users: it places very few limitations on what users can do and has very simple user role management. Kotka stores the full history of all data, which allows any errors to be fixed and prevents data loss. Kotka is open-source software, but is tightly coupled with the infrastructure of the Finnish Biodiversity Information Facility (FinBIF). Currently, it is only offered as an online service (Software as a Service) hosted by FinBIF. However, it could be developed into a more modular system that could, for example, utilize multiple different database backends and taxonomy data sources.
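The full-history principle can be sketched as an append-only store of revisions, as below; this only illustrates the idea that no write destroys data, and Kotka's actual implementation differs.

```python
import datetime

class VersionedStore:
    def __init__(self):
        self._versions = {}  # document id -> list of (timestamp, user, data)

    def save(self, doc_id: str, data: dict, user: str):
        """Append a new revision instead of overwriting the previous one."""
        stamp = datetime.datetime.now(datetime.timezone.utc)
        self._versions.setdefault(doc_id, []).append((stamp, user, dict(data)))

    def latest(self, doc_id: str) -> dict:
        return self._versions[doc_id][-1][2]

    def history(self, doc_id: str):
        return list(self._versions[doc_id])

store = VersionedStore()
store.save("specimen-1", {"Taxon": "Corvus corone cornix"}, user="jdoe")
store.save("specimen-1", {"Taxon": "Corvus cornix"}, user="curator")  # later fix
print(store.latest("specimen-1"))        # current view
print(len(store.history("specimen-1")))  # both revisions retained
```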


2018
Vol 2
pp. e25635
Author(s):  
Mikko Heikkinen
Falko Glöckler
Markus Englund

The DINA Symposium (“DIgital information system for NAtural history data”, https://dina-project.net) ends with a plenary session involving the audience to discuss the interplay of collection management and software tools. The discussion will touch on several areas and issues:

1. Collection management using modern technology: How should and could collections be managed using current technology? What is the ultimate objective of using a new collection management system? How should traditional management processes be changed?

2. Development and community: Why are there so many collection management systems? Why is it so difficult to create one system that fits everyone’s requirements? How could a community of developers and collection staff be built around the DINA project in the future?

3. Features and tools: How can needs that are common to all collections be identified? What are the new tools and technologies that could facilitate collection management? How could those tools be implemented as DINA-compliant services?

4. Data: What data must be captured about collections and specimens? What criteria need to be applied in order to distinguish essential from “nice-to-have” information? How should established data standards (e.g. Darwin Core and ABCD (Access to Biological Collection Data)) be used to share data from rich and diverse data models?

In addition to the plenary discussion around these questions, we will agree on a streamlined format for continuing the discussion in order to write a white paper on these questions. The results and outcome of the session will constitute the basis of the paper and will be subsequently refined.


Author(s):  
Nelson Rios
Sharif Islam
James Macklin
Andrew Bentley

Technological innovations over the past two decades have given rise to the online availability of more than 150 million specimen and species-lot records from biological collections around the world through large-scale biodiversity data-aggregator networks. In the present landscape of biodiversity informatics, collections data are captured and managed locally in a wide variety of databases and collection management systems and then shared online as point-in-time Darwin Core Archive snapshots. Data providers may publish periodic revisions to these data files, which are retrieved, processed, and re-indexed by data aggregators. This workflow has resulted in data latencies and lags of months to years for some data providers. The Darwin Core Standard (Wieczorek et al. 2012) provides guidelines for representing biodiversity information digitally, yet varying institutional practices and a lack of interoperability between collection management systems continue to limit semantic uniformity, particularly with regard to the actual content of data within each field. Although some initiatives have begun to link data elements, our ability to comprehensively link all of the extended data associated with a specimen, or related specimens, is still limited due to the low uptake and usage of persistent identifiers. The concept now under consideration is to create a Digital Extended Specimen (DES) that adheres to the Findable, Accessible, Interoperable, and Reusable (FAIR) principles of data management and stewardship and is the cumulative digital representation of all data, derivatives, and products associated with a physical specimen, individually distinguished and linked by persistent identifiers on the Internet to create a web of knowledge.

Biodiversity data aggregators that mobilize data across multiple institutions routinely perform data transformations in an attempt to provide a clean and consistent interpretation of the data. These aggregators are typically unable to interact directly with institutional data repositories, thereby limiting potentially fruitful opportunities for annotation, versioning, and repatriation. The ability to track such data transactions and satisfy the accompanying legal implications (e.g. the Nagoya Protocol) is becoming a necessary component of data publication, which existing standards do not adequately address. Furthermore, no mechanisms exist to assess the “trustworthiness” of data, critical to scientific integrity and reproducibility, or to provide attribution metrics for collections to advocate for their contribution to or effectiveness in supporting such research. Since the introduction of Darwin Core Archives (Wieczorek et al. 2012), little has changed in the underlying mechanisms for publishing natural science collections data, and we are now at a point where new innovations are required to meet current demand for continued digitization, access, research, and management. One solution may involve changing the biodiversity data publication paradigm to one based on the atomized transactions relevant to each individual data record. These transactions, when summed over time, allow us to realize the most recently accepted revision as well as historical and alternative perspectives.
In order to realize the Digital Extended Specimen ideals and the linking of data elements, this transactional model combined with open and FAIR data protocols, application programming interfaces (APIs), repositories, and workflow engines can provide the building blocks for the next generation of natural science collections and biodiversity data infrastructures and services. These and other related topics have been the focus of phase 2 of the global consultation on converging Digital Specimens and Extended Specimens. Based on these discussions, this presentation will explore a conceptual solution leveraging elements from distributed version control, cryptographic ledgers and shared redundant storage to overcome many of the shortcomings of contemporary approaches.
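As a conceptual sketch of the transactional, ledger-based direction explored here, the following keeps an append-only, hash-chained log of atomized changes to one specimen record and replays it to obtain the currently accepted revision; the structure and identifiers are assumptions for illustration, not a proposed standard.

```python
import hashlib
import json

def tx_hash(prev_hash: str, payload: dict) -> str:
    """Chain each transaction to its predecessor with a SHA-256 digest."""
    body = json.dumps({"prev": prev_hash, "payload": payload}, sort_keys=True)
    return hashlib.sha256(body.encode()).hexdigest()

class SpecimenLedger:
    def __init__(self, specimen_id: str):
        self.specimen_id = specimen_id
        self.transactions = []  # list of (hash, payload), append-only

    def append(self, field: str, value, agent: str):
        prev = self.transactions[-1][0] if self.transactions else "genesis"
        payload = {"field": field, "value": value, "agent": agent}
        self.transactions.append((tx_hash(prev, payload), payload))

    def current_state(self) -> dict:
        """Sum the transactions over time: the last write per field wins."""
        state = {}
        for _, payload in self.transactions:
            state[payload["field"]] = payload["value"]
        return state

ledger = SpecimenLedger("urn:example:specimen:42")  # placeholder identifier
ledger.append("scientificName", "Rhinolophus affinis", agent="collector")
ledger.append("scientificName", "Rhinolophus stheno", agent="reviser")
print(ledger.current_state())    # most recently accepted revision
print(len(ledger.transactions))  # full history retained for alternative views
```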


Author(s):  
Falko Glöckler

Digital specimens (Hardisty 2018, Hardisty 2020) are the cyberspace equivalent of objects in a physical, often museum-based collection. They consist of references to data and metadata related to the collection object. Through the ongoing process of digitizing legacy data, gaining knowledge from new field collections or research, and annotating and linking to related resources, a digital specimen can evolve independently from the original physical object. In particular, provenance records cannot always be assigned to the physical object when the knowledge was gained solely from the digital representation. A physical specimen can also be understood as a physical preparation (or a set of multiple preparations, e.g. DNA samples taken from a preserved organism) accompanied by related digital and non-digital data sources (e.g. images, descriptions in fieldbooks, research data), rather than just a single object. This concept of an extended specimen has been described by Webster (2017) and is used in the Extended Specimen Network initiative (Lendemer et al. 2019) to enhance the access to and research potential of specimens. Digital specimens need to reflect both the potential complexity of the physical object (the extended specimen) and the knowledge gained from and linked to the digital object itself. In order to provide, track, and make use of digital specimens, the community of collection-holding institutions might need to think of them as standalone virtual collections that emanate from physical collections. Additionally, new versions of a digital specimen continuously derive from changes to the physical specimen, as the (meta)data are updated in collection management systems to document the state and treatment of the physical objects. Consequently, the challenge is to enable the management of both linked digital specimens on the World Wide Web and the local data of physical specimens in the databases of collection-holding institutions and other tools and services. This panel discussion addresses central questions about the requirements, obstacles, and opportunities of implementing the concepts of digital specimens and extended specimens in software tools such as collection management systems. The aim is to identify the major tasks and priorities regarding the transformation of tools and services from multiple perspectives: local collection data management, international data infrastructures such as the Distributed System of Scientific Collections (DiSSCo) and the Global Biodiversity Information Facility (GBIF), and data usage outside of domain-specific subject areas.
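As a purely illustrative sketch, not a DiSSCo or GBIF schema, a digital specimen can be modeled as a standalone object that references its physical counterpart and linked resources by persistent identifiers and versions independently of the physical object:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class LinkedResource:
    relation: str  # e.g. "image", "fieldbook page", "DNA sample"
    pid: str       # persistent identifier (DOI, ARK, URL, ...)

@dataclass
class DigitalSpecimen:
    pid: str                   # identifier of the digital specimen itself
    physical_specimen_id: str  # identifier of the museum object
    version: int = 1
    links: List[LinkedResource] = field(default_factory=list)

    def annotate(self, resource: LinkedResource) -> "DigitalSpecimen":
        """Knowledge gained digitally yields a new version of the digital
        specimen without touching the physical object."""
        return DigitalSpecimen(
            pid=self.pid,
            physical_specimen_id=self.physical_specimen_id,
            version=self.version + 1,
            links=self.links + [resource],
        )

ds = DigitalSpecimen("https://example.org/ds/1", "urn:example:object:7")
ds2 = ds.annotate(LinkedResource("image", "https://example.org/media/9"))
print(ds.version, ds2.version)  # 1 2
```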


