Towards Linked Open Molecular Data: Recommendations for researchers, collections, infrastructures and publishers

Author(s):  
Gabriele Droege ◽  
Ilene Karsch-Mizrachi ◽  
Katharine Barker ◽  
Jonathan Coddington ◽  
Ole Seberg

The variety of molecular methods used to analyze biosamples is continuously increasing, as is the need for the standardized deposition, documentation and citation of both the samples and the methods applied to them. Global initiatives such as the International Nucleotide Sequence Database Collaboration (INSDC, http://www.insdc.org), the Barcode of Life Data System (BOLD, http://www.boldsystems.org), the Global Biodiversity Information Facility (GBIF, http://www.gbif.org) and the Global Genome Biodiversity Network (GGBN, http://www.ggbn.org), in addition to many others, have been working towards standardized access to biological data for many years. Collectively, these biodiversity data management platforms provide a considerable and indispensable infrastructure to the research community. However, cross-linking the massive amounts of protein and DNA sequence data submitted to these databases every year with standardized records of the underlying biological material remains challenging. Best practices for standardized data submissions and data citations are urgently needed. In the long run, two goals should be achieved above all else: all sequence data should be linked to natural history collections, and biological material that was used for molecular research, especially DNA sequencing, should be deposited and, thus, made accessible in public, well curated collections. Here we provide recommendations for both researchers and collections on how to cite underlying biological material at INSDC and in publications in a standardized way towards Linked Open Data. We also address how the global infrastructures and publishers can improve their interoperability.
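One concrete mechanism for the linkage described above is the INSDC `specimen_voucher` qualifier, which conventionally encodes an institution code, an optional collection code and a catalog number as a colon-separated triplet, letting a sequence record point back to the physical voucher. The sketch below formats and parses such triplets; the example codes are illustrative, not real accession data.

```python
# Minimal sketch of building and reading specimen_voucher-style
# triplets ("institution:collection:catalogNumber"). Values used in
# examples are invented for illustration.

def format_voucher(institution, catalog_number, collection=None):
    """Build a specimen_voucher-style triplet string."""
    parts = [institution] + ([collection] if collection else []) + [catalog_number]
    return ":".join(parts)

def parse_voucher(value):
    """Split a voucher triplet back into its components."""
    parts = value.split(":")
    if len(parts) == 3:
        return {"institution": parts[0], "collection": parts[1],
                "catalogNumber": parts[2]}
    if len(parts) == 2:
        return {"institution": parts[0], "catalogNumber": parts[1]}
    return {"catalogNumber": value}
```

A parser like this is what an aggregator would need in order to resolve a sequence record's voucher string back to a curated collection object.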

Author(s):  
Erica Krimmel ◽  
Austin Mast ◽  
Deborah Paul ◽  
Robert Bruhn ◽  
Nelson Rios ◽  
...  

Genomic evidence suggests that the causative virus of COVID-19 (SARS-CoV-2) was introduced to humans from horseshoe bats (family Rhinolophidae) (Andersen et al. 2020) and that species in this family as well as in the closely related Hipposideridae and Rhinonycteridae families are reservoirs of several SARS-like coronaviruses (Gouilh et al. 2011). Specimens collected over the past 400 years and curated by natural history collections around the world provide an essential reference as we work to understand the distributions, life histories, and evolutionary relationships of these bats and their viruses. While the importance of biodiversity specimens to emerging infectious disease research is clear, empowering disease researchers with specimen data is a relatively new goal for the collections community (DiEuliis et al. 2016). Recognizing this, a team from Florida State University is collaborating with partners at GEOLocate, Bionomia, University of Florida, the American Museum of Natural History, and Arizona State University to produce a deduplicated, georeferenced, vetted, and versioned data product of the world's specimens of horseshoe bats and relatives for researchers studying COVID-19. The project will serve as a model for future rapid data product deployments about biodiversity specimens. The project underscores the value of biodiversity data aggregators iDigBio and the Global Biodiversity Information Facility (GBIF), which are sources for 58,617 and 79,862 records, respectively, as of July 2020, of horseshoe bat and relative specimens held by over one hundred natural history collections. Although much of the specimen-based biodiversity data served by iDigBio and GBIF is high quality, it can be considered raw data and therefore often requires additional wrangling, standardizing, and enhancement to be fit for specific applications. 
The project will create efficiencies for the coronavirus research community by producing an enhanced, research-ready data product, which will be versioned and published through Zenodo, an open-access repository (see doi.org/10.5281/zenodo.3974999). In this talk, we highlight lessons learned from the initial phases of the project, including deduplicating specimen records, standardizing country information, and enhancing taxonomic information. We also report on our progress to date in enhancing information about agents (e.g., collectors or determiners) associated with these specimens, and in georeferencing specimen localities. We further explore how far the added agent information (i.e., ORCID iDs and Wikidata Q identifiers) can inform our georeferencing efforts and support crediting those who collected and identified specimens. The project will georeference approximately one third of our specimen records, based on those lacking geospatial coordinates but containing textual locality descriptions. We furthermore provide an overview of our holistic approach to enhancing specimen records, which we hope will maximize the value of the bat specimens at the center of what has recently been termed the "extended specimen network" (Lendemer et al. 2020). The centrality of the physical specimen in the network reinforces the importance of archived materials for reproducible research. Recognizing this, we view the collections providing data to iDigBio and GBIF as essential partners, as we expect that they will be responsible for the long-term management of enhanced data associated with the physical specimens they curate. We hope that this project can provide a model for better facilitating the reintegration of enhanced data back into local specimen data management systems.
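The deduplication and country-standardization steps mentioned above can be sketched roughly as follows. This is a hypothetical illustration, not the project's actual pipeline: the field names follow Darwin Core, but the synonym table and the "most populated record wins" rule are invented assumptions.

```python
# Hypothetical sketch of deduplicating aggregated specimen records.
# Records sharing institutionCode, collectionCode and catalogNumber are
# treated as duplicates of one physical specimen; the copy with the
# most populated fields is kept. The country synonym map is illustrative.

COUNTRY_SYNONYMS = {
    "viet nam": "Vietnam",
    "vietnam": "Vietnam",
    "lao pdr": "Laos",
}

def standardize_country(raw):
    """Map a verbatim country string to a preferred name."""
    return COUNTRY_SYNONYMS.get(raw.strip().lower(), raw.strip())

def dedup_key(record):
    """Case-insensitive Darwin Core triplet identifying one specimen."""
    return (record.get("institutionCode", "").lower(),
            record.get("collectionCode", "").lower(),
            record.get("catalogNumber", "").lower())

def deduplicate(records):
    seen = {}
    for rec in records:
        rec["country"] = standardize_country(rec.get("country", ""))
        key = dedup_key(rec)
        # Keep whichever duplicate carries the most non-empty fields.
        if key not in seen or sum(bool(v) for v in rec.values()) > \
                sum(bool(v) for v in seen[key].values()):
            seen[key] = rec
    return list(seen.values())
```

In practice a pipeline like this would also need to merge complementary fields across duplicates rather than discard one copy outright, but the keying idea is the same.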


Author(s):  
David Shorthouse ◽  
Roderic Page

Through the Bloodhound proof-of-concept, https://bloodhound-tracker.net, an international audience of collectors and determiners of natural history specimens is engaged in the emotive act of claiming their specimens and attributing other specimens to living and deceased mentors and colleagues. Behind the scenes, these claims build links between Open Researcher and Contributor Identifiers (ORCID, https://orcid.org) or Wikidata identifiers for people and Global Biodiversity Information Facility (GBIF) specimen identifiers, predicated on the Darwin Core terms recordedBy (collected) and identifiedBy (determined). Here we additionally describe the socio-technical challenge of unequivocally resolving people's names in legacy specimen data and propose lightweight and reusable solutions. The unique identifiers for the affiliations of active researchers are obtained from ORCID, whereas the unique identifiers for institutions where specimens are actively curated are resolved through Wikidata. By constructing closed loops of links between person, specimen, and institution, an interesting suite of potential metrics emerges, all due to the activities of employees and their network of professional relationships. This approach balances a desire for individuals to receive formal recognition for their efforts in natural history collections with an institutional-level need to adjust budgets in response to easily obtained numeric trends in national and international reach. If handled in a coordinated fashion, this reporting technique may be a significant new driver for specimen digitization efforts, on par with Altmetric, https://www.altmetric.com, an important new tool that tracks the impact of publications and delights administrators and authors alike.
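The closed loop described above can be illustrated with a small sketch: a claim attaches a person identifier (ORCID or Wikidata) to a specimen identifier via a Darwin Core role, and specimens link onward to their curating institutions, from which simple reach metrics fall out. All identifiers and the metric itself are invented for the example.

```python
# Illustrative sketch of person -> specimen -> institution loops and a
# simple per-institution metric. Identifiers are placeholders.

from collections import Counter

def make_claim(person_id, occurrence_id, role):
    """role is one of the Darwin Core agent terms used here:
    'recordedBy' (collected) or 'identifiedBy' (determined)."""
    if role not in ("recordedBy", "identifiedBy"):
        raise ValueError("unknown Darwin Core role: %s" % role)
    return {"person": person_id, "occurrence": occurrence_id, "role": role}

def institution_reach(claims, occurrence_to_institution):
    """Count claimed specimens per curating institution -- the kind of
    easily obtained numeric trend an administrator might track."""
    return Counter(occurrence_to_institution[c["occurrence"]]
                   for c in claims)
```

The point of the sketch is that once both ends of the link are stable identifiers rather than name strings, aggregation becomes a one-liner.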


2018 ◽  
Vol 2 ◽  
pp. e26473
Author(s):  
Molly Phillips ◽  
Anne Basham ◽  
Marc Cubeta ◽  
Kari Harris ◽  
Jonathan Hendricks ◽  
...  

Natural history collections around the world are currently being digitized, with the resulting data and associated media now shared online through aggregators such as the Global Biodiversity Information Facility and Integrated Digitized Biocollections (iDigBio). These collections and their resources are accessible and discoverable through online portals, not only to researchers and collections professionals, but also to educators, students, and other potential downstream users. Primary and secondary education (K-12) in the United States is going through its own revolution, with many states adopting the Next Generation Science Standards (NGSS, https://www.nextgenscience.org/). The new standards emphasize science practices for analyzing and interpreting data and connect to cross-cutting concepts such as cause and effect and patterns. NGSS and natural history collections data portals seem to complement each other. Nevertheless, many educators and students are unaware of the digital resources available or are overwhelmed when working in aggregated databases created by scientists. To better address this challenge, participants within the National Science Foundation's Advancing Digitization of Biodiversity Collections (ADBC) program have been working to increase awareness of, and scaffold learning for, digitized collections among K-12 educators and learners. They are accomplishing this through individual programs at institutions across the country as part of the Thematic Collections Networks and collaboratively through the iDigBio Education and Outreach Working Group. ADBC partners have focused on incorporating digital data and resources into K-12 classrooms through training workshops and webinars for both educators and collections professionals, as well as through creating educational resources, websites, and applications that use digital collections data.
This presentation includes lessons learned from engaging K-12 audiences with digital data, summarizes available resources for both educators and collections professionals, shares how to become involved, and provides ways to facilitate transfer of educational resources to the K-12 community.


Author(s):  
Falko Glöckler ◽  
James Macklin ◽  
David Shorthouse ◽  
Christian Bölling ◽  
Satpal Bilkhu ◽  
...  

The DINA Consortium (DINA = “DIgital information system for NAtural history data”, https://dina-project.net) is a framework for like-minded practitioners of natural history collections to collaborate on the development of distributed, open source software that empowers and sustains collections management. Target collections include zoology, botany, mycology, geology, paleontology, and living collections. The DINA software will also permit the compilation of biodiversity inventories and will robustly support both observation and molecular data. The DINA Consortium focuses on an open source software philosophy and on community-driven open development. Contributors share their development resources and expertise for the benefit of all participants. The DINA System is explicitly designed as a loosely coupled set of web-enabled modules. At its core, this modular ecosystem includes strict guidelines for the structure of Web application programming interfaces (APIs), which guarantees the interoperability of all components (https://github.com/DINA-Web). Important to the DINA philosophy is that users (e.g., collection managers, curators) be actively engaged in an agile development process. This ensures that the product is pleasing for everyday use, includes efficient yet flexible workflows, and implements best practices in specimen data capture and management. There are three options for developing a DINA module: create a new module compliant with the specifications (Fig. 1), modify an existing code-base to attain compliance (Fig. 2), or wrap a compliant API around existing code that cannot be or may not be modified (e.g., infeasible, dependencies on other systems, closed code) (Fig. 3).
All three of these scenarios have been applied in the modules recently developed: a module for molecular data (SeqDB), modules for multimedia, documents and agents data, and a service module for printing labels and reports. The SeqDB collection management and molecular tracking system (Bilkhu et al. 2017) has evolved through two of these scenarios. Originally, the required architectural changes were going to be added into the codebase, but after some time, the development team recognised that the technical debt inherent in the project wasn't worth the effort of modification and refactoring. Instead, a new codebase was created, bringing forward the best parts of the system oriented around the molecular data model for Sanger Sequencing and Next Generation Sequencing (NGS) workflows. In the case of the Multimedia and Document Store module and the Agents module, a brand new codebase was established whose technology choices were aligned with the DINA vision. These two modules have been created from fundamental use cases for collection management and digitization workflows and will continue to evolve as more modules come online and broaden their scope. The DINA Labels & Reporting module is a generic service for transforming data into arbitrary printable layouts based on customizable templates. In order to use the module in combination with data managed in the collection management software Specify (http://specifysoftware.org) for printing labels of collection objects, we wrapped the Specify 7 API with a DINA-compliant API layer called the “DINA Specify Broker”. This allows for using the easy-to-use web-based template engine within the DINA Labels & Reports module without changing Specify's codebase. In our presentation we will explain the DINA development philosophy and will outline benefits for different stakeholders who directly or indirectly use collections data and related research data in their daily workflows.
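The broker pattern described above, wrapping a compliant API around code that cannot be modified, can be sketched as a thin translation layer. This is a hypothetical illustration, not the actual DINA Specify Broker: the record shape and field names are invented, though DINA modules do exchange JSON:API-style documents.

```python
# Hypothetical sketch of a "broker": translating a legacy system's
# record into the JSON:API-style document a compliant module expects,
# leaving the legacy codebase untouched. Field names are invented.

def to_dina_document(legacy_record):
    """Translate a legacy collection-object record into a
    DINA-style JSON:API document."""
    return {
        "data": {
            "type": "collection-object",
            "id": str(legacy_record["id"]),
            "attributes": {
                # Map the legacy field names onto the module's vocabulary.
                "catalogNumber": legacy_record.get("catalognumber"),
                "determination": legacy_record.get("taxon_name"),
            },
        }
    }
```

Because only the document shape is standardized, the wrapped system remains free to evolve internally, which is precisely what makes the third module-development option attractive when the underlying code is closed or entangled with other systems.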
We will also highlight opportunities for joining the DINA Consortium and how to best engage with members of DINA who share their expertise in natural science, biodiversity informatics and geoinformatics.


Author(s):  
Marcus De Almeida ◽  
Ângelo Pinto ◽  
Alcimar Carvalho

Natural history collections (NHC) are guardians of biodiversity (Lane 1996) and essential to understanding the natural world and its evolutionary processes. They hold samples of the morphological and genetic heritage of living and extinct biotas, helping to reconstruct the timeline of life over the centuries (Gardner 2014). Primary data from specimens in NHC are crucial elements for research in many areas of the biological sciences, considered the “bricks” of systematics and therefore one of the pillars of evolutionary studies (Troudet 2018). For this reason, studies carried out in NHC are essential for the development of scientific knowledge and are pivotal for the scientific-technological progress of a nation (Camargo 2015). The digitization and availability of primary data on biodiversity from NHC represent an inexpensive, practical and secure means of exchanging information, allowing collaboration between institutions and researchers. In this sense, initiatives such as the Sistema de Informação sobre a Biodiversidade Brasileira (SiBBr), a country-level branch of the Global Biodiversity Information Facility (GBIF) platform, aim to encourage and establish ways for the informatization of biological collections and their type specimens. Known for housing one of the largest and oldest collections of insects in the world focused on Neotropical fauna, the Entomological Collection of the Museu Nacional of the Federal University of Rio de Janeiro (MNRJ) held more than 3,000 primary types and approximately 12,005,000 specimens, of which about 96% were lost in the tragic fire that occurred at the institution on September 2, 2018. The SiBBr project was active in that collection from 2016 to 2019 and enabled the digitization and preservation of data from the type material of many insect orders, including the charismatic dragonflies (order Odonata).
Due to the end of the agreement between SiBBr and the Museu Nacional, most of the obtained primary data are pending full curation and, therefore, are not yet available to the public and researchers. The MNRJ housed the biggest and most important collection of dragonflies among all Central and South American institutions. It assembled most of the physical records of the neotropical dragonfly fauna gathered over the last 80 years, many of which are of undescribed taxa. Unfortunately, almost all material was permanently lost. This study aims to gather, analyze and publicize primary data on the type material of dragonflies housed in the MNRJ, ensuring the preservation of its history, as well as providing data on the taxonomy and diversity of this marvelous group of insects. A total of 11 families, 50 genera and 131 species were recorded, belonging to the suborders Anisoptera and Zygoptera, with distributional records widespread in South America. The MNRJ housed 105 holotypes of dragonfly nomina, representing 11.7% of the richness of the Brazilian Odonata fauna (901 spp.), the highest number of species of any country in the biosphere. The impact of the loss of this collection on studies of these insects is unprecedented: some enigmatic and monotypic genera such as Brasiliogomphus, Fluminagrion and Roppaneura lost 100% of their type series, while more diverse genera such as Lauromacromia, Oxyagrion and Neocordulia lost 50%, 35% and 31% of their holotypes, respectively. Therefore, through the registration and preservation of primary biodiversity data, this work reiterates the importance of curating and digitizing biological scientific collections. Furthermore, it is highly relevant to preserving information on existing biodiversity permanently and providing support for future research. Digitizing and interconnecting extended specimen data prove to be among the main and most effective ways to protect NHC heritage and their primary data against catastrophic events.


2018 ◽  
Vol 2 ◽  
pp. e25839
Author(s):  
Lise Stork ◽  
Andreas Weber ◽  
Eulàlia Miracle ◽  
Katherine Wolstencroft

Geographical and taxonomical referencing of specimens and documented species observations from within and across natural history collections is vital for ongoing species research. However, much of the historical data, such as field books, diaries and specimens, are challenging to work with. They are computationally inaccessible, refer to historical place names and taxonomies, and are written in a variety of languages. In order to address these challenges and elucidate historical species observation data, we developed a workflow to (i) crowd-source semantic annotations from handwritten species observations, (ii) transform them into RDF (Resource Description Framework) and (iii) store and link them in a knowledge base. Instead of full transcription, we directly annotate digital field book scans with key concepts that are based on Darwin Core standards. Our workflow stresses the importance of verbatim annotation. The interpretation of the historical content, such as resolving a historical taxon to a current one, can be done by individual researchers after the content is published as linked open data. Through the storage of annotation provenance, i.e., who created the annotation and when, we allow multiple interpretations of the content to exist in parallel, stimulating scientific discourse. The semantic annotation process is supported by a web application, the Semantic Field Book (SFB)-Annotator, driven by an application ontology. The ontology formally describes the content and metadata required to semantically annotate species observations. It is based on the Darwin Core standard (DwC), Uberon and the Geonames ontology. The provenance of annotations is stored using the Web Annotation Data Model. Adhering to the principles of FAIR (Findable, Accessible, Interoperable & Reusable) and Linked Open Data, the content of the specimen collections can be interpreted homogeneously and aggregated across datasets. This work is part of the Making Sense project: makingsenseproject.org.
The project aims to disclose the content of a natural history collection: a 17,000-page account of the exploration of the Indonesian Archipelago between 1820 and 1850 (Natuurkundige Commissie voor Nederlands-Indie). With a knowledge base, researchers are given easy access to the primary sources of natural history collections. For their research, they can aggregate species observations, construct rich queries to browse through the data and add their own interpretations regarding the meaning of the historical content.
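The verbatim-annotation-plus-provenance idea above can be sketched without any RDF library as plain subject-predicate-object triples. The term URIs follow Darwin Core and Web Annotation conventions, but the function, identifiers and example values are invented for illustration; they are not the SFB-Annotator's actual data model.

```python
# Library-free sketch of one verbatim annotation stored as triples with
# provenance, so multiple later interpretations can coexist.
# Identifiers and example values are invented.

DWC = "http://rs.tdwg.org/dwc/terms/"
OA = "http://www.w3.org/ns/oa#"

def annotate(annotation_id, target_scan, verbatim_text, creator, created):
    """Return triples for one verbatim annotation on a field-book scan."""
    a = annotation_id
    return [
        (a, OA + "hasTarget", target_scan),
        # Verbatim text only -- resolving the historical taxon name to a
        # current one is deferred to individual researchers.
        (a, DWC + "verbatimIdentification", verbatim_text),
        # Provenance: who created the annotation, and when.
        (a, "creator", creator),
        (a, "created", created),
    ]
```

Keeping interpretation out of the stored triples is what allows two researchers to attach conflicting taxon resolutions to the same verbatim text without either overwriting the other.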


2018 ◽  
Vol 2 ◽  
pp. e26060
Author(s):  
Pamela Soltis

Digitized natural history data are enabling a broad range of innovative studies of biodiversity. Large-scale data aggregators such as the Global Biodiversity Information Facility (GBIF) and Integrated Digitized Biocollections (iDigBio) provide easy, global access to millions of specimen records contributed by thousands of collections. A developing community of eager users of specimen data – whether locality, image, trait, etc. – is perhaps unaware of the effort and resources required to curate specimens, digitize information, capture images, mobilize records, serve the data, and maintain the infrastructure (human and cyber) to support all of these activities. Tracking of specimen information throughout the research process is needed to provide appropriate attribution to the institutions and staff that have supplied and served the records. Such tracking may also allow for annotation and comment on particular records or collections by the global community. Detailed data tracking is also required for open, reproducible science. Despite growing recognition of the value of and need for thorough data tracking, both technical and sociological challenges continue to impede progress. In this talk, I will present a brief vision of how applying a DOI to each iteration of a data set in a typical research project could provide attribution to the provider, opportunity for comment and annotation of records, and the foundation for reproducible science based on natural history specimen records. Sociological change – such as journal requirements for data deposition of all iterations of a data set – can be accomplished using community meetings and workshops, along with editorial efforts, as were applied to DNA sequence data two decades ago.


2018 ◽  
Vol 2 ◽  
pp. e25882
Author(s):  
Maarten Schermer ◽  
Daphne Duin

The value of the data present in natural history collections for research and collection management cannot be overstated. Naturalis Biodiversity Center, home to one of the largest natural history collections in the world, completed a large-scale digitisation project resulting in the registration of more than 38 million objects, many of them annotated with descriptive metadata, such as geographic coordinates and multimedia content. While digitisation is ongoing, we are now also looking for ways to leverage our digital collection, both for the benefit of collection management and that of networking with other natural history collections. To this end, we developed the Netherlands Biodiversity Data Services, providing centralized access to our collection data via state-of-the-art, open access interfaces. Full, centralized access to the digital collection allows us to combine the data with other sources, such as collection scans focusing on the physical condition and accessibility of the collection, as well as with data from external sources, such as the collection information of sister institutions, allowing for combining and comparing data, and exploring areas where collections can reinforce each other. Focusing on availability and accessibility, the services were deliberately designed as a versatile, low-level API to allow the use of our data with a broad variety of applications and services. These applications range from scientific research and remote mobile access to collection information, to “mash-ups” with other data sources, apps and applications in our own museum.
We will demonstrate this range of applications through several examples, including the embedding of data in websites (example, Dutch Caribbean Species Register: http://www.dutchcaribbeanspecies.org/linnaeus_ng/app/views/species/nsr_taxon.php?id=177968&cat=165), use in the development of deep learning models, thematic portals (example, Naturalis meteorite collection: http://bioportal.naturalis.nl/result?theme=meteorites&language=en) and the development of Java and R clients. This presentation ties in with Max Caspers' presentation “Advancing collections management with the Netherlands Biodiversity Data Services”, in which he will demonstrate the potential of the services described here specifically for the area of collections management.
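One reason a low-level API suits such varied clients (website widgets, R and Java clients, mobile apps) is that every consumer ultimately just assembles query URLs. The sketch below illustrates that idea with an invented endpoint and parameter names; it is not the actual Netherlands Biodiversity Data Services interface.

```python
# Hypothetical sketch of a client assembling queries against a
# low-level specimen API. The base URL and parameters are placeholders
# standing in for the kind of versatile interface described above.

from urllib.parse import urlencode

BASE = "https://api.example.org/specimen/query"  # placeholder endpoint

def build_query(collection=None, genus=None, size=25):
    """Assemble a query URL that a website widget, an R client or a
    Java client could construct in exactly the same way."""
    params = {"size": size}
    if collection:
        params["collection"] = collection
    if genus:
        params["genus"] = genus
    return BASE + "?" + urlencode(sorted(params.items()))
```

Because the contract is just URL-plus-JSON, adding a new consumer (a thematic portal, a deep-learning data loader) requires no changes on the service side.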


Zootaxa ◽  
2011 ◽  
Vol 3138 (1) ◽  
pp. 1 ◽  
Author(s):  
JOSEPH R. MENDELSON III ◽  
DANIEL G. MULCAHY ◽  
TYLER S. WILLIAMS ◽  
JACK W. SITES JR.

We combine mitochondrial and nuclear DNA sequence data with non-molecular (morphological and natural history) data to conduct phylogenetic analyses and generate an evolutionary hypothesis for the relationships among nearly every species of Mesoamerican bufonid in the genus Incilius. We collected a total of 5,898 aligned base-pairs (bp) of sequence data from mitochondrial (mtDNA: 12S–16S, cyt b, ND2–CO1, including tRNAs TRP–TYR and the origin of light strand replication; total 4,317 bp) and nuclear (CXCR4 and RAG1; total 1,581 bp) loci from 52 individual toads representing 37 species. For the non-molecular data, we collected 44 characters from 29 species. We also include Crepidophryne, a genus that has not previously been included in molecular analyses. We present results of parsimony and Bayesian analyses for these data separately and combined. Relationships based on the non-molecular data were poorly supported and did not resolve a monophyletic Incilius (Rhinella marina was nested within). Our molecular data provide significant support to most of the relationships. Our combined analyses demonstrate that inclusion of a considerably smaller dataset (44 vs. 5,898 characters) of non-molecular characters can provide significant support where the molecular relationships were lacking support. Our combined results indicate that Crepidophryne is nested within Incilius; therefore, we place the former in the synonymy of the latter taxon. Our study provides the most comprehensive evolutionary framework for Mesoamerican bufonids (Incilius), which we use as a starting point to invoke discussion on the evolution of their unique natural history traits.


Author(s):  
Tim Robertson ◽  
Marcos Gonzalez ◽  
Morten Høfft ◽  
Marie Grosjean

The Global Biodiversity Information Facility (GBIF) was established by governments in 2001, largely through the initiative and leadership of the natural history collections community, following the 1999 recommendation by a working group under the Megascience Forum (predecessor of the Global Science Forum) of the Organization for Economic Cooperation and Development (OECD). Over 20 years, GBIF has helped develop standards and convened a global community of data-publishing institutions, aggregating over one billion species occurrence records freely and openly available for use in research and policy making. Of these, more than 150 million records originate from specimens preserved by the collections community. The recent adoption of the Global Registry of Scientific Collections by GBIF (https://www.gbif.org/news/5kyAslpqTVxYqZTwYn1cub) is the first step by GBIF to better enable a picture of the natural history collections of the world, along with the associated science that they have enabled and continue to enable. Recognising that other collection metadata initiatives exist, GBIF aims to discuss with the community and progress topics such as:

- Synchronising with existing metadata catalogues to ensure accurate, up-to-date information is available without unnecessary burden for authors
- Defining, testing and formalizing the Collection Descriptions standard (https://github.com/tdwg/cd)
- Providing clear guidelines of citation practice for collections, potentially building on the success of the Digital Object Identifier (DOI) approach used for datasets mediated through GBIF.org
- Tracking citations of use through both data downloads and through references in literature, such as materials examined in a taxonomic publication
- Improving the linkages and discoverability of specimen records derived from the same collecting event but preserved in multiple institutions
- Improving the linkages between the people involved in collecting, preserving, and identifying specimen records through the use of Open Researcher and Contributor IDs (ORCID)
- Lowering the technical threshold to deploy tools such as “data dashboards” and specimen search/download on collection-related websites

The progress made to date will be summarised and a roadmap for the future will be introduced.

