FAIR Data and Services in Biodiversity Science and                     Geoscience

Larry Lannom; Dimitris Koureas; Alex R. Hardisty

doi:10.1162/dint_a_00034

FAIR Data and Services in Biodiversity Science and Geoscience

Data Intelligence ◽

10.1162/dint_a_00034 ◽

2020 ◽

Vol 2 (1-2) ◽

pp. 122-130 ◽

Cited By ~ 14

Author(s):

Larry Lannom ◽

Dimitris Koureas ◽

Alex R. Hardisty

Keyword(s):

Integral Characteristic ◽

Heterogeneous Data ◽

Digital Object ◽

Biodiversity Science ◽

Technical Vision ◽

Challenges And Opportunities ◽

Fair Principles ◽

Additional Level ◽

Scientific Collections ◽

Data Typing

We examine the intersection of the FAIR principles (Findable, Accessible, Interoperable and Reusable), the challenges and opportunities presented by the aggregation of widely distributed and heterogeneous data about biological and geological specimens, and the use of the Digital Object Architecture (DOA) data model and components as an approach to solving those challenges that offers adherence to the FAIR principles as an integral characteristic. This approach will be prototyped in the Distributed System of Scientific Collections (DiSSCo) project, the pan-European Research Infrastructure which aims to unify over 110 natural science collections across 21 countries. We take each of the FAIR principles, discuss them as requirements in the creation of a seamless virtual collection of bio/geo specimen data, and map those requirements to Digital Object components and facilities such as persistent identification, extended data typing, and the use of an additional level of abstraction to normalize existing heterogeneous data structures. The FAIR principles inform and motivate the work and the DO Architecture provides the technical vision to create the seamless virtual collection vitally needed to address scientific questions of societal importance.

Download Full-text

The Power of Matrix Factorization: Methods for Deconvoluting Genetic Heterogeneous Data at Expression Level

Current Bioinformatics ◽

10.2174/1574893615666200120110205 ◽

2021 ◽

Vol 15 (8) ◽

pp. 841-853

Author(s):

Yuan Liu ◽

Zhining Wen ◽

Menglong Li

Keyword(s):

Artificial Intelligence ◽

Matrix Factorization ◽

In Silico ◽

Genetic Data ◽

Heterogeneous Data ◽

Expression Level ◽

Mathematical Methods ◽

Challenges And Opportunities ◽

Factorization Methods ◽

Genetic Profiles

Background:: The utilization of genetic data to investigate biological problems has recently become a vital approach. However, it is undeniable that the heterogeneity of original samples at the biological level is usually ignored when utilizing genetic data. Different cell-constitutions of a sample could differentiate the expression profile, and set considerable biases for downstream research. Matrix factorization (MF) which originated as a set of mathematical methods, has contributed massively to deconvoluting genetic profiles in silico, especially at the expression level. Objective: With the development of artificial intelligence algorithms and machine learning, the number of computational methods for solving heterogeneous problems is also rapidly abundant. However, a structural view from the angle of using MF to deconvolute genetic data is quite limited. This study was conducted to review the usages of MF methods on heterogeneous problems of genetic data on expression level. Methods: MF methods involved in deconvolution were reviewed according to their individual strengths. The demonstration is presented separately into three sections: application scenarios, method categories and summarization for tools. Specifically, application scenarios defined deconvoluting problem with applying scenarios. Method categories summarized MF algorithms contributed to different scenarios. Summarization for tools listed functions and developed web-servers over the latest decade. Additionally, challenges and opportunities of relative fields are discussed. Results and Conclusion: Based on the investigation, this study aims to present a relatively global picture to assist researchers to achieve a quicker access of deconvoluting genetic data in silico, further to help researchers in selecting suitable MF methods based on the different scenarios.

Download Full-text

FAIR Principles for Digital Repositories: Essence and Applications for Heritage Objects

Cultural and Historical Heritage: Preservation, Representation, Digitalization ◽

10.26615/issn.2367-8038.2020_2_006 ◽

2020 ◽

Vol 6 (1) ◽

pp. 64-76

Author(s):

Kalina Sotirova-Valkova ◽

Keyword(s):

Scientific Data ◽

Digital Object ◽

Industry Funding ◽

Scientific Publications ◽

Digital Repositories ◽

Good Management ◽

Funding Agencies ◽

International Image ◽

Fair Principles ◽

Disparate Data

The emergence of the FAIR initiative in 2016 is based on the need for good management of disparate data, and improving the functionality of digital repositories and e-infrastructures. The aim is to promote the re-use of (scientific) data, a need recognized by academia, industry, funding agencies and memory institutions. This paper discusses the nature of the FAIR principles, its` technologies, the concept of FAIR digital object, FAIR ecosystem and persistent identifiers, a possible solution to the images-publication in scientific publications and in museum digital repositories through the International Image Interoperability Framework (IIIF), and all these through the focus of possible digital vision of the Bulgarian memory institutions. Keywords: FAIR principles, heritage, Persistent Identifiers, LOD

Download Full-text

Machine Learning for Clinical Decision-Making: Challenges and Opportunities

10.20944/preprints201911.0278.v1 ◽

2019 ◽

Author(s):

Sergio Sanchez-Martinez ◽

Oscar Camara ◽

Gemma Piella ◽

Maja Cikes ◽

Miguel Angel Gonzalez Ballester ◽

...

Keyword(s):

Machine Learning ◽

Decision Making ◽

Clinical Decision Making ◽

Clinical Status ◽

Clinical Decision ◽

Heterogeneous Data ◽

Response To Treatment ◽

Classical Pathway ◽

Challenges And Opportunities ◽

Four Levels

The use of machine learning (ML) approaches to target clinical problems is called to revolutionize clinical decision-making. The success of these tools is subjected to the understanding of the intrinsic processes being used during the classical pathway by which clinicians make decisions. In a parallelism with this pathway, ML can have an impact at four levels: for data acquisition, predominantly by extracting standardized, high-quality information with the smallest possible learning curve; for feature extraction, by discharging healthcare practitioners from performing tedious measurements on raw data; for interpretation, by digesting complex, heterogeneous data in order to augment the understanding of the patient status; and for decision support, by leveraging the previous step to predict clinical outcomes, response to treatment or to recommend a specific intervention. This paper discusses the state-of-the-art, as well as the current clinical status and challenges associated with each of these tasks, together with the challenges related to the learning process, the auditability/traceability, the system infrastructure and the integration within clinical processes.

Download Full-text

Causal Discovery from Nonstationary/Heterogeneous Data: Skeleton Estimation and Orientation Determination

Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2017/187 ◽

2017 ◽

Cited By ~ 8

Author(s):

Kun Zhang ◽

Biwei Huang ◽

Jiji Zhang ◽

Clark Glymour ◽

Bernhard Schölkopf

Keyword(s):

Causal Model ◽

Causal Structure ◽

Heterogeneous Data ◽

Causal Discovery ◽

Data Sets ◽

Real World Data ◽

Experimental Conditions ◽

Challenges And Opportunities ◽

Changes Over Time ◽

Orientation Determination

It is commonplace to encounter nonstationary or heterogeneous data, of which the underlying generating process changes over time or across data sets (the data sets may have different experimental conditions or data collection conditions). Such a distribution shift feature presents both challenges and opportunities for causal discovery. In this paper we develop a principled framework for causal discovery from such data, called Constraint-based causal Discovery from Nonstationary/heterogeneous Data (CD-NOD), which addresses two important questions. First, we propose an enhanced constraint-based procedure to detect variables whose local mechanisms change and recover the skeleton of the causal structure over observed variables. Second, we present a way to determine causal orientations by making use of independence changes in the data distribution implied by the underlying causal model, benefiting from information carried by changing distributions. Experimental results on various synthetic and real-world data sets are presented to demonstrate the efficacy of our methods.

Download Full-text

Digital Object Cloud for linking natural science collections information; The case of DiSSCo

Biodiversity Information Science and Standards ◽

10.3897/biss.2.25474 ◽

2018 ◽

Vol 2 ◽

pp. e25474

Author(s):

Dimitrios Koureas ◽

Wouter Addink ◽

Alex Hardisty

Keyword(s):

Natural Science ◽

Large Scale ◽

Digital Object ◽

Science Museums ◽

Large Scale Data ◽

Governance Model ◽

Different Types ◽

Types Of Information ◽

Scientific Collections ◽

Fragmented Data

DiSSCo(The Distributed System of Scientific Collections) is a Research Infrastructure (RI) aiming at providing unified physical (transnational), remote (loans) and virtual (digital) access to the approximately 1.5 billion biological and geological specimens in collections across Europe. DiSSCo represents the largest ever formal agreement between natural science museums (114 organisations across 21 European countries). With political and financial support across 14 European governments and a robust governance model DiSSCo will deliver, by 2025, a series of innovative end-user discovery, access, interpretation and analysis services for natural science collections data. As part of DiSSCo's developing data model, we evaluate the application of Digital Objects (DOs), which can act as the centrepiece of its architecture. DOs have bit-sequences representing some content, are identified by globally unique persistent identifiers (PIDs) and are associated with different types of metadata. The PIDs can be used to refer to different types of information such as locations, checksums, types and other metadata to enable immediate operations. In the world of natural science collections, currently fragmented data classes (inter alia genes, traits, occurrences) that have derived from the study of physical specimens, can be re-united as parts in a virtual container (i.e., as components of a Digital Object). These typed DOs, when combined with software agents that scan the data offered by repositories, can act as complete digital surrogates of the physical specimens. In this paper we: investigate the architectural and technological applicability of DOs for large scale data RIs for bio- and geo-diversity, identify benefits and challenges of a DO approach for the DiSSCo RI and describe key specifications (incl. metadata profiles) for a specimen-based new DO type.

Download Full-text

Making Heterogeneous Specimen Data ‘FAIR’: Implementing a digital specimen repository

Biodiversity Information Science and Standards ◽

10.3897/biss.3.37163 ◽

2019 ◽

Vol 3 ◽

Cited By ~ 1

Author(s):

Abraham Nieva de la Hidalga ◽

Alex Hardisty

Keyword(s):

Dna Sequences ◽

Heterogeneous Data ◽

Primary Data ◽

Object Model ◽

Digital Object ◽

Data Types ◽

Resource Manager ◽

Digital Objects ◽

Definition Of ◽

Rapid Deployment

The definition of a digital specimen is proposed to encompass the digital representation(s) of physical specimens from natural science collections. The digital specimen concept is intended to define a representation (digital object) that brings together an array of heterogeneous data types, which are themselves alternative physical specimen representations. In this case, the digital specimen (DS) holds references to specimen data from a collection management system, images, 3D models, research articles, DNA sequences, collector information, among many other data types. The proposal is to create persistent relationships between the DS and other categories of digital objects (e.g. resource types mentioned above, collections, storage platforms, organisations, databases, and provenance data). Complying with FAIR data principles (findability, accessibility, interoperability, and reuse), i.e., achieving data ‘FAIRness’, eases data integration, which is needed for cross-disciplinary linking and combination of data from different domains, making the DS as a comprehensive package of information about a specimen. Implementation and access to a digital specimen repository (DSR) as a Digital Object Architecture (Sharp 2016) component demonstrates the alignment of the DS concept and FAIR data principles (Wilkinson et al. 2016, Kahn and Wilensky 2006). The DSR fulfills four roles: data producer, resource manager, data publisher, and collaboration space. As data producer, the DSR allows acquisition and curation (indexing, storage) of DSs linking primary data, models, analyses, and other digital object types. As resource manager, the DSR manages access to distributed platforms, ranging from acquisition networks (digitisation stations, museums, herbariums) to processing services, advanced computational resources, data asset storage systems, and specialised servers. As data publisher, the DSR provides access to data assets from national and transnational data archives. As collaboration space, the DSR supports users’ accessing, sharing and (re)using data assets, and derived data products and services. Adopting the collaboration space and data publisher roles, the DSR implements interfaces that expose the DSs to the research community, fulfilling the FAIR findability, accessibility, and reuse principles. Adopting the data producer and resource manager roles, the DSR creates meaningful and persistent relationships required to link DSs and other types of digital objects, fulfilling the FAIR interoperability principle. A prototype DSR based on the Cordra digital object repository has been deployed (Corporation for National Research Initiatives (CNRI) 2018, Reilly and Tupelo-Schneck 2010). The advantages of Cordra are: rapid deployment, customisable object model, creation of relations between digital objects, and application program interfaces for programmatic access. Rapid deployment of the DSR provides a tangible target for discussing the implementation of the DS concept. The customisable object model enables the refinement and enhancing of the definition of DS in response to feedback from colleagues who have accessed the DSR and used its contents. Creating relations between digital objects enables flexible linking to digital objects stored in different repositories. Accessing the DSR programmatically through APIs enables extending the use of the repository in different platforms (e.g. mobile devices) as well as integration with other repositories and services. As well as supporting a HTTP-oriented API, Cordra implements Digital Object Interface Protocol (DONA Foundation 2018), allowing the definition of operations to act directly on selected DSs in the repository. The DSR prototype has been demonstrated by providing access to the repository administrative interface and with a custom interface designed to facilitate access by different user groups, such as collection curators, researchers, teachers, and students. The client interface has been designed to demonstrate a subset of the functionalities derived from user stories, which describe software features from the end-user perspective. Demonstrating the DSR capabilities as proposed, will inform the refinement of the design of the DS model and provide early feedback about the needed software features.

Download Full-text

Towards a COST MOBILISE Guideline for Long Term Preservation and Archiving of Data Constructs from Scientific Collections Facilities

Biodiversity Information Science and Standards ◽

10.3897/biss.5.73901 ◽

2021 ◽

Vol 5 ◽

Author(s):

Dagmar Triebel ◽

Dragan Ivanovic ◽

Gila Kahila Bar-Gal ◽

Sven Bingert ◽

Tanja Weibulat

Keyword(s):

Data Storage ◽

Reference Model ◽

Supplementary Information ◽

Cost Models ◽

Digital Object ◽

Open Archival Information System ◽

Data Products ◽

Scientific Collections ◽

Long Term Preservation

COST (European Cooperation in Science and Technology) is a funding organisation for research and innovation networks. One of the objectives of the COSTAction called “Mobilising Data, Policies and Experts in Scientific Collections“ (MOBILISE) is to work on documents for expert training with broad involvement of professionals from the participating European countries. The guideline presented here in its general concept will address principles, strategies and standards for long term preservation and archiving of data constructs (data packages, data products) as addressed by and under control of the scientific collections community. The document is being developed as part of the MOBILISE Action targeted towards primarily scientific staff at natural scientific collection facilities, as well as management bodies of collections like museums, herbaria and information technology personnel less familiar with data archiving principles and routines. The challenges of big data storage and (distributed, cloud-based) storage solutions as well as that of data mirroring, backing up, synchronisation and publication in productive data environments are well addressed by documents, guidelines and online platforms, e.g., in the DISSCo knowledge base (see Hardisty et al. (2020)) and as part of concepts of the European Open Science Cloud (EOSC). Archival processes and the resulting data constructs, however, are often left outside of the considerations. This is a large gap because archival issues are not only simple technical ones as addressed by the term “bit preservation” but also envisage a number of logical, functional, normative, administrative and semantic issues as addressed by the term “functional long-term archiving”. The main target digital object types addressed by this COST MOBILISE Guideline are data constructs called Digital or Digital Extended Specimens and data products with the persistent identifier assignment lying under the authority of scientific collections facilities. Such digital objects are specified according to the Digital Object Architecture (DOA , see Wittenburg et al. 2018) and similar abstract models introduced by Harjes et al. (2020) and Lannom et al. (2020). The scientific collection-specific types are defined following evolving concepts in the context of the Consortium of European Taxonomic Facilities (CETAF), the research infrastructure DiSSCo (Distributed System of Scientific Collections), and the Biodiversity Information Standards (TDWG). Archival processes are described following the OAIS (Open Archival Information System) reference model. The archived objects should be reusable in the sense of the FAIR (Findable, Accessible, Interoperable, and Reusable) guiding principles. Organisations like national (digital) archives, computing or professional (domain-specific) data centers as well as libraries might offer specific archiving services and act as partner organisations of scientific collections facilities. The guideline consists of key messages that have been defined. They address the collection community, especially the staff and leadership of taxonomic facilities. Aspects of several groups of stakeholders are discussed as well as cost models. The guideline does not recommend specific solutions for archiving software and workflows. Supplementary information is delivered via a wiki-based platform for the COST MOBILISE Archiving Working Group WG4.

Download Full-text

Documenting Natural History Collections in GBIF

Biodiversity Information Science and Standards ◽

10.3897/biss.3.37216 ◽

2019 ◽

Vol 3 ◽

Author(s):

Tim Robertson ◽

Marcos Gonzalez ◽

Morten Høfft ◽

Marie Grosjean

Keyword(s):

Natural History ◽

Data Publishing ◽

Digital Object Identifier ◽

Digital Object ◽

Global Biodiversity Information Facility ◽

Global Science ◽

Natural History Collections ◽

The People ◽

Citation Practice ◽

Scientific Collections

The Global Biodiversity Information Facility (GBIF) was established by governments in 2001, largely through the initiative and leadership of the natural history collections community, following the 1999 recommendation by a working group under the Megascience Forum (predecessor of the Global Science Forum) of the Organization for Economic Cooperation and Development (OECD). Over 20 years, GBIF has helped develop standards and convened a global community of data-publishing institutions, aggregating over one billion species occurrence records freely and openly available for use in research and policy making. Of these more than 150 million records originate from specimens preserved by the collections community. The recent adoption of the Global Registry of Scientific Collections by GBIF (https://www.gbif.org/news/5kyAslpqTVxYqZTwYn1cub) is the first step by GBIF to better enable a picture of the natural history collections of the world along with the associated science that they have and continue to enable. Recognising that other collection metadata initiatives exists, GBIF aims to discuss with the community and progress topics such as: Synchronising with existing metadata catalogues to ensure accurate, up-to-date information is available without unnecessary burden for authors Defining, testing and formalizing the Collection Descriptions standard (https://github.com/tdwg/cd) Providing clear guidelines of citation practice for collections, potentially building on the success of the Digital Object Identifier (DOI) approach used for datasets mediated through GBIF.org. Tracking citations of use through both data downloads and through references in literature, such as materials examined in a taxonomic publication Improving the linkages and discoverability of specimen records derived from the same collecting event but preserved in multiple institutions Improving the linkages between the people involved in collecting, preserving, and identifying specimen records through the use of Open Researcher and Contributor IDs (ORCID) Lowering the technical threshold to deploy tools such as “data dashboards” and specimen search/download on collection related websites Synchronising with existing metadata catalogues to ensure accurate, up-to-date information is available without unnecessary burden for authors Defining, testing and formalizing the Collection Descriptions standard (https://github.com/tdwg/cd) Providing clear guidelines of citation practice for collections, potentially building on the success of the Digital Object Identifier (DOI) approach used for datasets mediated through GBIF.org. Tracking citations of use through both data downloads and through references in literature, such as materials examined in a taxonomic publication Improving the linkages and discoverability of specimen records derived from the same collecting event but preserved in multiple institutions Improving the linkages between the people involved in collecting, preserving, and identifying specimen records through the use of Open Researcher and Contributor IDs (ORCID) Lowering the technical threshold to deploy tools such as “data dashboards” and specimen search/download on collection related websites The progress made to date will be summarised and a roadmap for the future will be introduced.

Download Full-text

Electronic Health Records Aggregators (EHRagg )

Methods of Information in Medicine ◽

10.1055/s-0040-1714395 ◽

2020 ◽

Vol 59 (02/03) ◽

pp. 096-103

Author(s):

Belén Prados-Suárez ◽

Carlos Molina Fernández ◽

Carmen Peña Yañez

Keyword(s):

Heterogeneous Data ◽

Health Data ◽

Data Sources ◽

Data Systems ◽

Related Data ◽

Heterogeneous Data Sources ◽

Health Related ◽

Long Time ◽

Unified View ◽

Fair Principles

Abstract Background Integration of health data systems is an open problem. Most of the active initiatives are based on the use of standards. However, achieving a widely and generalized compliment of such standards still seems a costly task that will take a long time to be completed. Even more, most of the standards are proposed for a specific use, without integrating other needs. Objectives We propose an alternative to get a unified view of health-related data, valid for several uses, that unites heterogeneous data sources. Methods Our proposal integrates developments made so far to automatically learn how to extract and convert data from different health-related systems. It enables the creation of a single multipurpose point of access. Results We present the EhRagg notion and its related concepts. EHRagg is defined as a middleware that, following the FAIR principles, integrates health data sources offering a unified view over them.

Download Full-text

Continuous hydrogenolysis of acetal-stabilized lignin in flow

Green Chemistry ◽

10.1039/d0gc02928a ◽

2021 ◽

Author(s):

Wu Lan ◽

Yuan Peng Du ◽

Songlan Sun ◽

Jean Behaghel de Bueren ◽

Florent Héroguel ◽

...

Keyword(s):

Steady State ◽

Challenges And Opportunities

We performed a steady state high-yielding depolymerization of soluble acetal-stabilized lignin in flow, which offered a window into challenges and opportunities that will be faced when continuously processing this feedstock.

Download Full-text