The Standards behind the Scenes: Explaining data from the Plazi workflow

Author(s):  
Donat Agosti ◽  
Marcus Guidoti ◽  
Terry Catapano ◽  
Alexandros Ioannidis-Pantopikos ◽  
Guido Sautter

As part of the CETAF COVID-19 task force, Plazi liberated taxonomic treatments, figures, observation records, biotic interactions, taxonomic names, and collection and specimen codes involving bats and viruses from scholarly publications, with the intention of creating open access, findable, accessible, interoperable and reusable (FAIR) data. The data are accessible via TreatmentBank and the Biodiversity Literature Repository (BLR) and are continually harvested and reused by the Global Biodiversity Information Facility (GBIF) and Global Biotic Interactions (GloBI). The data were processed, enhanced and liberated by the Plazi workflow, which relies on a dedicated infrastructure including a desktop application (GoldenGATE Imagine) that converts portable document format (PDF) files to a dedicated open compressed file format, the Image Markup File (IMF), in which the data enhancement takes place. To enhance the data contained in the publications, including the biotic interactions, a series of standards and vocabularies are used. With the exception of TaxPub, a taxonomy-specific extension of the U.S. National Center for Biotechnology Information's (NCBI) Journal Article Tag Suite (JATS), all vocabularies used were previously proposed elsewhere, in line with Plazi's mission to reuse existing standards wherever they are available. The following standards and vocabularies are used: the Metadata Object Description Schema (MODS) to model article metadata in Plazi's XMLs; Darwin Core for taxonomic ranks and materials-citation data; and the Open Biological and Biomedical Ontology (OBO) Relations Ontology for biotic interactions between organisms; the latter is also used in the custom metadata of the Biodiversity Literature Repository at Zenodo. In this presentation we will provide an overview of the different types of data, followed by the standards or vocabularies applied to each of them and their parts. The goal is to provide context on how the data liberated by Plazi are described, as these data are extensively reused by third-party applications such as GBIF and GloBI. The use of these standards allows fully automated, daily data ingests by GBIF.
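
A hedged illustration may help: the short Python sketch below (illustrative only, not Plazi's production code; the XML is a simplified stand-in for a TreatmentBank-style export) shows how taxonomicName attributes in a treatment could be mapped onto Darwin Core terms.

import xml.etree.ElementTree as ET

# Simplified stand-in for a TreatmentBank-style treatment XML.
SAMPLE = """
<document>
  <treatment>
    <taxonomicName kingdom="Animalia" order="Chiroptera"
                   family="Vespertilionidae" genus="Myotis"
                   species="myotis" rank="species"/>
  </treatment>
</document>
"""

# Assumed attribute-to-term mapping; real exports carry many more fields.
DWC = {"kingdom": "dwc:kingdom", "order": "dwc:order",
       "family": "dwc:family", "genus": "dwc:genus",
       "species": "dwc:specificEpithet", "rank": "dwc:taxonRank"}

def darwin_core_taxa(xml_text):
    """Yield one Darwin Core record per taxonomicName element."""
    root = ET.fromstring(xml_text)
    for name in root.iter("taxonomicName"):
        yield {DWC[k]: v for k, v in name.attrib.items() if k in DWC}

for record in darwin_core_taxa(SAMPLE):
    print(record)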

Author(s):  
Donat Agosti ◽  
Marcus Guidoti ◽  
Guido Sautter

The growing corpus of hundreds of millions of pages of taxonomic literature reporting research results based on specimens is very rich in facts. To make these facts reusable, Plazi, Pensoft and Zenodo are building and maintaining the Biodiversity Literature Repository, which includes a workflow to discover, describe and store them in order to make them open access, findable, accessible, interoperable and reusable (FAIR). Currently, 43,000 articles provide 406,000 material citations, and around 50% of the species newly described each year are made accessible and immediately reused by the Global Biodiversity Information Facility (GBIF). All the images, as well as the taxonomic treatments, are deposited at the Biodiversity Literature Repository (BLR). For each of these deposits enriched metadata is added and a Digital Object Identifier (DOI) is minted. Through this process, Plazi is the single largest dataset provider to GBIF and continues to provide ca. 45,000 unique taxonomic names to GBIF. The workflow is optimized for born-digital portable document format (PDF) publications, but other formats can also be ingested, including TaxPub, a taxonomy-specific extension of the Journal Article Tag Suite (JATS) XML. After ingestion, the PDF is converted to a dedicated open compressed format called the Image Markup File (IMF), which holds the enhanced information contained in the PDF, with figures and tables properly extracted. The IMFs are then housed at TreatmentBank, together with associated export files, including Darwin Core Archives (DwC-A) for each parent article and its taxonomic treatments, XMLs of treatments, and GBIF datasets of the parent articles. Taxonomic treatments, in addition to figures and the original PDFs, are also deposited on Zenodo, where a DOI is minted if none is already available. These Zenodo deposits include in their metadata links back to the different data and file formats, including the treatment XMLs, keeping the system connected and up to date. Third-party players like GBIF, Global Biotic Interactions (GloBI), Ocellus, OpenBiodiv and Synospecies are constantly fed by system hookups, which guarantees data consistency after further edits. The PDF-to-IMF conversion and data enhancement are possible thanks to Plazi's open-source software GoldenGATE Imagine. Ingested XMLs, which are validated against the TaxPub schema, follow a similar path into the system and on to the many third-party applications. This operation is supported by the Arcadia Fund as well as by service contracts from publishers to disseminate their data. In addition, as part of the CETAF COVID-19 task force, the workflow has been contributing treatments and images from numerous publications relevant to understanding the virus spillover. In this lecture, this workflow is described and explained, including the associated infrastructure, its ongoing changes and upcoming steps of development.
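
The DOI-minting step on Zenodo can be sketched against Zenodo's public REST API (a sketch under assumptions: this is not Plazi's production code, ACCESS_TOKEN and the creator are placeholders, and the metadata is reduced to a minimum; BLR deposits use richer custom metadata).

import requests

API = "https://zenodo.org/api/deposit/depositions"
ACCESS_TOKEN = "..."  # placeholder; create a token in your Zenodo account

def deposit(filepath, title, description):
    params = {"access_token": ACCESS_TOKEN}
    # 1. Create an empty deposition.
    dep = requests.post(API, params=params, json={}).json()
    # 2. Attach the file (e.g. a treatment XML or an extracted figure).
    with open(filepath, "rb") as fh:
        requests.post(f"{API}/{dep['id']}/files", params=params,
                      data={"name": filepath}, files={"file": fh})
    # 3. Add minimal metadata; BLR deposits carry much richer fields.
    meta = {"metadata": {"title": title, "upload_type": "publication",
                         "publication_type": "article",
                         "description": description,
                         "creators": [{"name": "Doe, Jane"}]}}  # placeholder
    requests.put(f"{API}/{dep['id']}", params=params, json=meta)
    # 4. Publish; Zenodo mints the DOI at this point if none is supplied.
    pub = requests.post(f"{API}/{dep['id']}/actions/publish", params=params)
    return pub.json().get("doi")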


Author(s):  
Donat Agosti

Biodiversity sciences, including taxonomy, are empirical sciences where all results are published in scholarly publications as part of the research life cycle. This creates a corpus of an estimated 500 million printed pages (Kalfatovic 2010) including billions of facts such as traits, biotic interactions and observations characterizing all of the estimated 1.9 million known species (Costello et al. 2013). This library is continually reused, cited and extended, for example with an estimated 15,000–20,000 new species annually (Polaszek 2005). All of these figures are estimates, because we neither know how many species have been discovered, nor how many are being discovered every day, let alone what we know about them. Following standard scientific practice, previous publications, specimens, gene sequences, or taxonomic treatments (Catapano 2019) are cited more or less explicitly. In the pre-digital age, these links were meant to be understood by the human reader. For example, "L. 1758" is an established reference that links to both Carolus Linnaeus and Linnaeus 1758; it is understandable at least to an expert human and, in the digital age, provides access to the respective digital representations. These data within the hundreds of millions of printed and now increasingly digitally published pages form a seamless, albeit implicit, knowledge graph. Unfortunately, most of these publications exist only in print (the Biodiversity Heritage Library has digitized about 50 million pages; Kalfatovic 2010) or are closed access, and thus this knowledge is not readily accessible in the digital age. Each of these implicit links is therefore an expensive stumbling block to accessing and reusing the referenced data, their parent publications and the references cited therein. Inadequate formats, language and access to taxonomic information were already recognized in 1992 at the Rio Summit (the Taxonomic Impediment). The consequences of these impediments are only now becoming obvious, with the realization of the daunting amount of human resources needed to digitally catalogue and index this unknown (not discoverable and inaccessible) known knowledge, let alone to make the data itself findable, accessible, interoperable and reusable (FAIR). This is a formidable and complex scientific challenge. Plazi is taking on this challenge. Its vision is to promote and enable the discovery and liberation of data, to transform the unknown known data into digitally accessible knowledge, i.e., to build a digital knowledge base aimed at discovering all the species (and other taxa) we know, and what we know about them. Taxonomic publications, with their highly standardized taxonomic names, taxonomic treatments, treatment citations, material citations and illustrations, are well suited to machine extraction. Together they include the entire catalogue of life with all the discovered species and their synonyms, often tens to hundreds of treatments, and figures that depict the myriad forms that comprise the world's biodiversity. Once these data are FAIR, bidirectional linking becomes possible, for example from taxonomic names to the referenced taxonomic treatments, or to other digital resources such as gene sequences or digital specimens. At the same time, each datum is an entry point to a wealth of information that can be followed by the human user by clicking the links, but more importantly, analysed by machines.
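
As a toy illustration of that machine readability (far simpler than the extraction tools Plazi actually uses, and with an invented example sentence), even a regular expression captures the standardized binomial-plus-authority pattern behind references such as "L. 1758":

import re

# Genus species, optionally followed by an authority and year ("L. 1758").
BINOMIAL = re.compile(
    r"\b([A-Z][a-z]+) ([a-z]+)"                  # Genus species
    r"(?: \(?([A-Z][a-zA-Z.]*),? (\d{4})\)?)?")  # optional authority, year

text = "The type species Formica rufa L. 1758 was fixed by later designation."
for genus, species, author, year in BINOMIAL.findall(text):
    # Note the false positive "The type": naive patterns overmatch, which is
    # exactly the disambiguation problem real extraction pipelines must solve.
    print(genus, species, author or "-", year or "-")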
Here, digitally accessible knowledge will be defined in the context of discovering known biodiversity, including strategies for approaching the challenge, which will then be detailed in subsequent talks in this symposium. This symposium is based on Plazi's ongoing data liberation and discovery, supported by European Union (e.g. the Biodiversity Community Integrated Knowledge Library, BiCIKL), United States (e.g. NIH) and Swiss research funding (e.g. e-BioDiv and the Arcadia Fund), by collaboration with publishers (e.g. Pensoft, Muséum national d'Histoire naturelle, Consortium of European Taxonomic Facilities Publications, the Zenodo repository, the Biodiversity Heritage Library), and by data reusers such as the Global Biodiversity Information Facility, Ocellus, Synospecies and openBiodiv. Currently, over 500,000 taxonomic treatments and 300,000 illustrations have been liberated and are accessible through TreatmentBank and the Biodiversity Literature Repository.


1983 ◽  
Vol 37 ◽  
pp. 2-2
Author(s):  
Charles W. Dunn

Why is authorship of a textbook generally considered less of a scholarly contribution than authorship of a “scholarly” publication, such as a journal article or a university press book? Certainly both are needed, but is it right for a political science department to reward faculty who author “scholarly” publications more than those who author textbooks? Whether stipulated in the criteria for departmental evaluation of faculty performance or in other less overt ways, the bias is prevalent throughout our discipline. This essay states five reasons why the bias should not exist: 1) ignorance of impact, 2) ignorance of values, 3) ignorance of the review process, 4) ignorance of purpose, and 5) ignorance of time and scope.


2020 ◽  
Vol 10 (1) ◽  
pp. 59-69
Author(s):  
Edmund C. Levin

Background: Screening adolescents for depression has recently been advocated by two major national organizations. However, this practice is not without controversy. Objective: To review diagnostic, clinical, and conflict-of-interest issues associated with the calls for routine depression screening in adolescents. Method: The evaluation of depression screening by the US Preventive Services Task Force is compared and contrasted with the evaluations of comparable agencies in the UK and Canada, and articles arguing for and against screening are reviewed. Internal pharmaceutical industry documents declassified through litigation are examined for conflicts of interest. A case is presented that illustrates the substantial diagnostic limitations of self-administered mental health screening tools. Discussion: The value of screening adolescents for psychiatric illness is questionable, as is the validity of the screening tools that have been developed for this purpose. Furthermore, many of those advocating depression screening are key opinion leaders, who are in effect acting as third-party advocates for the pharmaceutical industry. The evidence suggests that a commitment to marketing rather than to science is behind their recommendations, although their conflicts of interest are hidden in what seem to be impartial third-party recommendations.


2019 ◽  
Author(s):  
Rachel S. Meyer ◽  
Teia M. Schweizer ◽  
Wai-Yin Kwan ◽  
Emily Curd ◽  
Adam Wall ◽  
...  

Abstract: Environmental DNA (eDNA) metabarcoding is emerging as a biomonitoring tool available to the citizen science community that promises to augment or replace photographic observation. However, eDNA results and photographic observations have rarely been compared to document their individual or combined power. Here, we use eDNA multilocus metabarcoding, a method deployed by the CALeDNA Program, to inventory and evaluate biodiversity variation along the Pillar Point headland near Half Moon Bay, California. We describe variation in the presence of 13,000 taxa spanning 82 phyla, analyze spatiotemporal patterns of beta diversity, and identify metacommunities. Inventory and measures of turnover across space and time from eDNA analysis are compared to the same measures from Global Biodiversity Information Facility (GBIF) data, which largely comprise iNaturalist photographic observations. We find that eDNA depicts local signals with high seasonal turnover, especially in prokaryotes. We find a diverse community dense with pathogens and parasites in the embayment, and a State Marine Conservation Area (SMCA) with lower species richness than the rest of the beach peninsula but with beta diversity signals resembling those of adjacent unprotected tidepools. The SMCA differs in observation density, with a higher density of protozoans and of animals in Ascidiacea, Echinoidea, and Polycladida. Local contributions to beta diversity are elevated in a section of east-facing beach. GBIF observations are mostly from outside the SMCA, limiting some spatial comparisons. However, our findings suggest eDNA samples can link the SMCA sites to sites with better GBIF inventory, which may be useful for imputing species at one site given observations from another. Results additionally support >3,800 largely novel biological interactions. This research and the accompanying interactive website support eDNA as a gap-filling tool for measuring biodiversity that is available to community and citizen scientists.
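
For readers unfamiliar with the turnover measure, the minimal sketch below (with invented site inventories; the study itself works from far richer multilocus data) shows how beta diversity between two sites can be computed identically for eDNA-derived and GBIF-derived taxon lists.

# Jaccard dissimilarity: 1 minus shared taxa over total taxa observed.
def jaccard_dissimilarity(site_a, site_b):
    a, b = set(site_a), set(site_b)
    return 1 - len(a & b) / len(a | b)

# Invented inventories standing in for SMCA and adjacent tidepool samples.
smca = {"Ascidiacea sp.", "Echinoidea sp.", "Polycladida sp."}
tidepool = {"Echinoidea sp.", "Polycladida sp.", "Mytilus californianus"}
print(jaccard_dissimilarity(smca, tidepool))  # 0.5: moderate turnover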


EP Europace ◽  
2020 ◽  
Vol 22 (11) ◽  
pp. 1742-1758 ◽  
Author(s):  
Jens Cosedis Nielsen ◽  
Josef Kautzner ◽  
Ruben Casado-Arroyo ◽  
Haran Burri ◽  
Stefaan Callens ◽  
...  

Abstract The European Union (EU) General Data Protection Regulation (GDPR) imposes legal responsibilities concerning the collection and processing of personal information from individuals who live in the EU. It has particular implications for the remote monitoring of cardiac implantable electronic devices (CIEDs). This report from a joint Task Force of the European Heart Rhythm Association and the Regulatory Affairs Committee of the European Society of Cardiology (ESC) recommends a common legal interpretation of the GDPR. Manufacturers and hospitals should be designated as joint controllers of the data collected by remote monitoring (depending upon the system architecture) and they should have a mutual contract in place that defines their respective roles; a generic template is proposed. Alternatively, they may be two independent controllers. Self-employed cardiologists also are data controllers. Third-party providers of monitoring platforms may act as data processors. Manufacturers should always collect and process the minimum amount of identifiable data necessary, and wherever feasible have access only to pseudonymized data. Cybersecurity vulnerabilities have been reported concerning the security of transmission of data between a patient’s device and the transceiver, so manufacturers should use secure communication protocols. Patients need to be informed how their remotely monitored data will be handled and used, and their informed consent should be sought before their device is implanted. Review of consent forms in current use revealed great variability in length and content, and sometimes very technical language; therefore, a standard information sheet and generic consent form are proposed. Cardiologists who care for patients with CIEDs that are remotely monitored should be aware of these issues.
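
As one concrete reading of the pseudonymization recommendation (a minimal sketch only, not a vetted or endorsed design; the key and identifier format are invented), keyed hashing gives the monitoring platform a stable pseudonym while only the key holder can link it back to the patient:

import hashlib
import hmac

# Held by the data controller (e.g. the hospital), never by the platform.
SECRET_KEY = b"example-key-held-by-controller"

def pseudonymize(device_serial: str) -> str:
    """Replace a device serial with a stable, non-reversible pseudonym."""
    return hmac.new(SECRET_KEY, device_serial.encode(),
                    hashlib.sha256).hexdigest()

print(pseudonymize("CIED-123456"))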


2015 ◽  
Vol 47 (3) ◽  
pp. 724-732 ◽  
Author(s):  
Hans Pasterkamp ◽  
Paul L.P. Brand ◽  
Mark Everard ◽  
Luis Garcia-Marcos ◽  
Hasse Melbye ◽  
...  

Auscultation of the lung remains an essential part of physical examination even though its limitations, particularly with regard to communicating subjective findings, are well recognised. The European Respiratory Society (ERS) Task Force on Respiratory Sounds was established to build a reference collection of audiovisual recordings of lung sounds that should aid in the standardisation of nomenclature. Five centres contributed recordings from paediatric and adult subjects. Based on pre-defined quality criteria, 20 of these recordings were selected to form the initial reference collection. All recordings were assessed by six observers and their agreement on classification, using currently recommended nomenclature, was noted for each case. Acoustical analysis was added as supplementary information. The audiovisual recordings and related data can be accessed online in the ERS e-learning resources. The Task Force also investigated the current nomenclature to describe lung sounds in 29 languages in 33 European countries. Recommendations for terminology in this report take into account the results from this survey.


1970 ◽  
Vol 15 (1) ◽  
pp. 13
Author(s):  
Susan Borda

In 2018, the Deep Blue Repositories and Research Data Services (DBRRDS) team at the University of Michigan Library began working with the University of Michigan Museum of Zoology (UMMZ) to provide a persistent and sustainable (i.e., non-grant-funded, institutionally supported) solution for their part of the National Science Foundation's (NSF) openVertebrate (oVert) initiative. The objective of oVert is to digitize scientific collections of thousands of vertebrate specimens stored in jars on museum shelves and make the data freely accessible to researchers, students, classrooms, and the general public anywhere in the world. The University of Michigan (U-M) is one of five scanning centers working on oVert and will contribute scans of more than 3,500 specimens from the UMMZ collections (Erickson 2017). In addition to ingesting scans, the project involved developing methods to work around several significant system constraints: Deep Blue Data's file structure (flat files only, no folders) and the closed use of Specify, UMMZ's specimen database, for specimen metadata. DBRRDS had to create a completely new workflow for handling batch deposits at regular intervals, develop scripts to reorganize the data (according to a third-party data model), and augment the metadata using a third-party resource, the Global Biodiversity Information Facility (GBIF). This paper describes the following aspects of the UMMZ CT Scanning Project partnership in greater detail: data generation, metadata requirements, workflows, code development, lessons learned, and next steps.
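
The GBIF augmentation step can be sketched as follows (this is not the project's actual script; the species-match endpoint is GBIF's public API, while the function name and example species are ours):

import requests

def gbif_backbone(scientific_name):
    """Fetch backbone taxonomy ranks via GBIF's species match API."""
    resp = requests.get("https://api.gbif.org/v1/species/match",
                        params={"name": scientific_name})
    resp.raise_for_status()
    match = resp.json()
    # Keep only the ranks to be folded into the deposit metadata.
    return {rank: match.get(rank)
            for rank in ("kingdom", "phylum", "class", "order",
                         "family", "genus", "species")}

print(gbif_backbone("Ambystoma maculatum"))  # an example vertebrate name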


2021 ◽  
Vol 2 (1) ◽  
pp. 1-19
Author(s):  
Harshdeep Singh ◽  
Robert West ◽  
Giovanni Colavizza

Abstract Wikipedia’s content is based on reliable and published sources. To date, relatively little is known about which sources Wikipedia relies on, in part because extracting citations and identifying cited sources is challenging. To close this gap, we release Wikipedia Citations, a comprehensive data set of citations extracted from Wikipedia. We extracted 29.3 million citations from 6.1 million English Wikipedia articles as of May 2020 and classified them as books, journal articles, or Web content. We were thus able to extract 4.0 million citations to scholarly publications with known identifiers (DOI, PMC, PMID, and ISBN) and to equip a further 261 thousand citations with DOIs from Crossref. As a result, we find that 6.7% of Wikipedia articles cite at least one journal article with an associated DOI, and that Wikipedia cites just 2% of all articles with a DOI currently indexed in the Web of Science. We release our code to allow the community to extend upon our work and update the data set in the future.
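
The DOI-equipping step can be approximated with Crossref's public works endpoint (a hedged sketch: the paper's pipeline applies matching scores and thresholds not reproduced here, and the example query string is invented):

import requests

def find_doi(citation_text):
    """Return the DOI of Crossref's top bibliographic match, if any."""
    resp = requests.get("https://api.crossref.org/works",
                        params={"query.bibliographic": citation_text,
                                "rows": 1})
    resp.raise_for_status()
    items = resp.json()["message"]["items"]
    return items[0]["DOI"] if items else None

print(find_doi("Kalfatovic 2010 Biodiversity Heritage Library"))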

