Implementing and Managing a Data Curation Workflow in the Cloud

2021 · Vol 10 (3)
Author(s): Fernando Rios, Chun Ly

Objective: To increase data quality and ensure compliance with appropriate policies, many institutional data repositories curate data that is deposited into their systems. Here, we present our experience as an academic library implementing and managing a semi-automated, cloud-based data curation workflow for a recently launched institutional data repository. Based on our experiences, we then present management observations intended for data repository managers and technical staff looking to move some or all of their curation services to the cloud. Methods: We implemented tooling for our curation workflow in a service-oriented manner, making significant use of our data repository platform’s application programming interface (API). With an eye towards sustainability, a guiding development philosophy has been to automate processes following industry best practices while avoiding solutions with high resource needs (e.g., maintenance) and minimizing the risk of becoming locked in to specific tooling. Results: The initial barrier to implementing a data curation workflow in the cloud was high in comparison to on-premises curation, mainly due to the need to develop in-house cloud expertise. However, compared to the cost of on-premises servers and storage, infrastructure costs have been substantially lower. Furthermore, in our particular case, once the foundation had been established, the cloud approach resulted in increased agility, allowing us to quickly automate our workflow as needed. Conclusions: Workflow automation has put us on a path toward scaling the service, and a cloud-based approach has helped reduce initial costs. However, because cloud-based workflows and automation come with a maintenance overhead, it is important to build tooling that follows software development best practices and can be decoupled from curation workflows to avoid lock-in.
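The service-oriented, API-driven workflow the abstract describes might be sketched as follows. This is a minimal illustration, not the authors' actual platform: the endpoint path, field names, and the two automated checks are all hypothetical.

```python
"""Hypothetical sketch of one automated curation step driven by a
repository REST API. Endpoint and field names are illustrative only."""
import json
import urllib.request

API_BASE = "https://repository.example.edu/api"  # placeholder, not a real service


def fetch_pending_deposits(token):
    """Retrieve deposits awaiting curation review from the repository API."""
    req = urllib.request.Request(
        f"{API_BASE}/deposits?status=pending",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def curation_checks(deposit):
    """Run automated checks on one deposit record; return problems found."""
    problems = []
    if not deposit.get("readme_present"):
        problems.append("missing README")
    if not deposit.get("license"):
        problems.append("no license selected")
    return problems
```

Keeping checks like these behind plain functions, decoupled from any one vendor's SDK, is one way to realize the lock-in-avoidance goal the authors mention.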

Data
2021 · Vol 6 (2) · pp. 15
Author(s): Ana Trisovic, Katherine Mika, Ceilyn Boyd, Sebastian Feger, Mercè Crosas

Sharing data and code for reuse has become increasingly important in scientific work over the past decade. However, in practice, shared data and code may be unusable, or published results obtained from them may be irreproducible. Data repository features and services contribute significantly to the quality, longevity, and reusability of datasets. This paper presents a combination of original and secondary data analysis studies focusing on computational reproducibility, data curation, and gamified design elements that can be employed to indicate and improve the quality of shared data and code. The findings of these studies are sorted into three approaches that can be valuable to data repositories, archives, and other research dissemination platforms.


10.29173/iq11
2017 · Vol 41 (1-4) · pp. 15
Author(s): Yue Li, Nicole Kong, Stanislav Pejša

Widely used across disciplines such as natural resources, social sciences, public health, humanities, and economics, spatial data is an important component of many studies and has promoted interdisciplinary research. Though an institutional data repository provides a great solution for data curation, preservation, and sharing, it usually lacks spatial visualization capability, which limits the use of spatial data to professionals. To increase the impact of research-generated spatial data and truly turn it into digital maps for a broader user base, we have designed and developed the workflow and cyberinfrastructure to extend the current capability of our institutional data repository by visualizing the spatial data on the web. In this project, we added a GIS server to the original institutional data repository cyberinfrastructure, which enables web map services. Then, through a web mapping API, we visualized the spatial data as an interactive web map and embedded it in the data repository web page. From the user’s perspective, researchers can still identify, cite, and reuse the dataset by downloading the data and metadata and by using the DOI offered by the data repository. General users can also browse the web maps to find location-based information. In addition, these data were ingested into the spatial data portal to increase their discoverability for spatial information users. Initial usage statistics suggest that this cyberinfrastructure has greatly increased spatial data usage and extended the institutional data repository to facilitate spatial data sharing.
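One common way a GIS server exposes layers for embedding, as described above, is the OGC Web Map Service (WMS) standard. The sketch below composes a WMS 1.3.0 GetMap URL; the server address and layer name are placeholders, not the authors' infrastructure.

```python
"""Illustrative sketch: composing an OGC WMS 1.3.0 GetMap URL so a
repository page can embed a rendered map image. Server and layer
names are invented placeholders."""
from urllib.parse import urlencode


def wms_getmap_url(base, layer, bbox, size=(600, 400)):
    """Build a WMS 1.3.0 GetMap URL for one layer and bounding box."""
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.3.0",
        "REQUEST": "GetMap",
        "LAYERS": layer,
        "CRS": "EPSG:4326",
        # WMS 1.3.0 with EPSG:4326 uses lat,lon axis order
        "BBOX": ",".join(str(v) for v in bbox),
        "WIDTH": size[0],
        "HEIGHT": size[1],
        "FORMAT": "image/png",
    }
    return f"{base}?{urlencode(params)}"


url = wms_getmap_url(
    "https://gis.example.edu/wms",   # placeholder GIS server endpoint
    "research:landcover",            # placeholder published layer name
    (40.0, -87.5, 40.6, -86.8),
)
```

A URL like this can back an `<img>` tag or feed a web mapping client, which is roughly how a repository landing page could show an interactive preview without hosting the rendering itself.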


2020 · Vol 33 · pp. 01003
Author(s): Wouter Haak, Alberto Zigoni, Helen Kardinaal-de Mooij, Elena Zudilova-Seinstra

Institutions, funding bodies, and national research organizations are pushing for more data sharing and FAIR data. Institutions typically implement data policies, frequently supported by an institutional data repository. Funders typically mandate data sharing. So where does this leave the researcher? How can researchers benefit from doing the additional work to share their data? In order to make sure that researchers and institutions get credit for sharing their data, the data needs to be tracked and attributed first. In this paper we investigated where the research data ended up for 11 research institutions, and how this data is currently tracked and attributed. Furthermore, we also analysed the gap between the research data that is currently in institutional repositories and where their researchers truly share their data. We found that 10 out of 11 institutions have most of their public research data hosted outside of their own institution. Combined, they have 12% of their institutional research data published in the institutional data repositories. According to our data, the typical institution had 5% of its research data (median) published in the institutional repository, but there were 4 universities for which it was 10% or higher. By combining existing data-to-article graphs with existing article-to-researcher and article-to-institution graphs, it becomes possible to increase the tracking of public research data, and therefore the visibility of researchers sharing their data, typically by a factor of 17. The tracking algorithm that was used to perform the analysis and report on potential improvements has subsequently been implemented as a standard method in the Mendeley Data Monitor product. The improvement is most likely an underestimate because, while the recall for datasets in institutional repositories is 100%, that is not the case for datasets published outside the institutions, so there are even more datasets still to be discovered.


2021 · Vol 10 (3)
Author(s): Alexandra Cooper, Michael Steeleworthy, Ève Paquette-Bigras, Erin Clary, Erin MacPherson, ...

Purpose: This paper introduces the Portage Network’s Dataverse Curation Guide and the new bilingual curation framework developed to support it. Brief Description: Canadian academic institutions and national organizations have been building infrastructure, staffing, and programming to support research data management. Amidst this work, a notable gap emerged between requirements for data curation in general repositories like Dataverse and the requisite workflows and guidance materials needed by curators to meet them. In response, Portage, a national network of data experts, organized a working group to develop a Dataverse curation guide built upon the Data Curation Network’s CURATED workflow. To create a bilingual resource, the original CURATE(D) acronym was modified to CURATION—which has the same meaning in both French and English—and steps were augmented with Dataverse-specific guidance and mapped to three conceptualized levels of curation to assist curators in prioritizing curation actions. Methods: An environmental scan of relevant deposit and curation guidance materials from Canadian and international institutions identified the need for a comprehensive Dataverse Curation Guide, as most existing resources were either depositor-focused or contained only partial workflows. The resulting Guide synthesized these guidance materials into the CURATION steps and mapped actions to various theoretical levels of data repository services and levels of curation. Resources: The following documents are supplemental to the Dataverse Curation Guide: the Portage Dataverse North Metadata Best Practices Guide, the Scholars Portal Dataverse Guide, and the Data Curation Network CURATED Workflow and Data Curation Primers.


2017 · Vol 35 (4) · pp. 626-649
Author(s): Wei Jeng, Daqing He, Yu Chi

Purpose Owing to the recent surge of interest in the age of the data deluge, the importance of researching data infrastructures is increasing. The open archival information system (OAIS) model has been widely adopted as a framework for creating and maintaining digital repositories. Considering that OAIS is a reference model that requires customization for actual practice, this paper aims to examine how the current practices in a data repository map to the OAIS environment and functional components. Design/methodology/approach The authors conducted two focus-group sessions and one individual interview with eight employees at the world’s largest social science data repository, the Interuniversity Consortium for Political and Social Research (ICPSR). By examining their current actions (activities regarding their work responsibilities) and IT practices, they studied the barriers and challenges of archiving and curating qualitative data at ICPSR. Findings The authors observed that the OAIS model is robust and reliable in actual service processes for data curation and data archives. In addition, a data repository’s workflow resembles digital archives or even digital libraries. On the other hand, they find that the cost of preventing disclosure risk and a lack of agreement on the standards of text data files are the most apparent obstacles for data curation professionals to handle qualitative data; the maturation of data metrics seems to be a promising solution to several challenges in social science data sharing. Originality/value The authors evaluated the gap between a research data repository’s current practices and the adoption of the OAIS model. They also identified answers to questions such as how current technological infrastructure in a leading data repository such as ICPSR supports their daily operations, what the ideal technologies in those data repositories would be and the associated challenges that accompany these ideal technologies. 
Most importantly, they helped to prioritize challenges and barriers from the data curator’s perspective and to contribute implications of data sharing and reuse in social sciences.


2009 · Vol 4 (2) · pp. 12-27
Author(s): Karen S. Baker, Lynn Yarmey

Scientific researchers today frequently package measurements and associated metadata as digital datasets in anticipation of storage in data repositories. Through the lens of environmental data stewardship, we consider the data repository as an organizational element central to data curation. One aspect of non-commercial repositories, their distance-from-origin of the data, is explored in terms of near and remote categories. Three idealized repository types are distinguished (local, center, and archive), paralleling research, resource, and reference collection categories, respectively. Repository type characteristics such as scope, structure, and goals are discussed. Repository similarities in terms of roles, activities, and responsibilities are also examined. Data stewardship is related to care of research data and responsible scientific communication supported by an infrastructure that coordinates curation activities; data curation is defined as a set of repeated and repeatable activities focusing on tending data and creating data products within a particular arena. The concept of “sphere-of-context” is introduced as an aid to distinguishing repository types. Conceptualizing a “web-of-repositories” accommodates a variety of repository types and represents an ecologically inclusive approach to data curation.


2016 · Vol 49 (02) · pp. 268-272
Author(s): Ellen M. Key

Data access and research transparency (DA-RT) is a growing concern for the discipline. Technological advances have greatly reduced the cost of sharing data, enabling full replication archives consisting of data and code to be shared on individual websites, as well as in journal archives and institutional data repositories. But how do we ensure that scholars take advantage of these resources to share their replication archives? Moreover, are the costs of research transparency borne by individuals or by journals? This article assesses the impact of journal replication policies on data availability and finds that articles published in journals with mandatory provision policies are 24 times more likely to have replication materials available than articles in journals with no requirements.
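A "24 times more likely" finding rests on comparing availability rates between the two groups of journals. A minimal sketch of that comparison, with invented counts chosen only to show the arithmetic (the paper's actual model and numbers differ):

```python
"""Sketch of the rate comparison behind a 'k times more likely' claim.
Counts are invented; the article's actual estimates come from its own
data and model, not from these numbers."""


def availability_ratio(with_policy, without_policy):
    """Each argument is (articles_with_materials, total_articles)."""
    rate_with = with_policy[0] / with_policy[1]
    rate_without = without_policy[0] / without_policy[1]
    return rate_with / rate_without


# Hypothetical: 96/100 articles compliant under a mandate vs. 4/100 without
ratio = availability_ratio((96, 100), (4, 100))
```

Note this is a simple ratio of rates (relative risk); a published estimate of this kind may instead be an odds ratio from a regression, which is a related but distinct quantity.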


2020
Author(s): Kerstin Lehnert, Lucia Profeta, Annika Johansson, Lulin Song

Modern scientific research requires open and efficient access to well-documented data to ensure transparency and reproducibility, and to build on existing resources to solve scientific questions of the future. Open access to the results of scientific research - publications, data, samples, code - is now broadly advocated and implemented in policies of funding agencies and publishers because it helps build trust in science, galvanizes the scientific enterprise, and accelerates the pace of discovery and creation of new knowledge. Domain-specific data facilities offer specialized services for data curation that are tailored to the needs of scientists in a given domain, ensuring rich, relevant, and consistent metadata for meaningful discovery and reuse of data, as well as data formats and encodings that facilitate data access, data integration, and data analysis for disciplinary and interdisciplinary applications. Domain-specific data facilities are uniquely poised to implement best practices that ensure not only the Findability and Accessibility of data under their stewardship, but also their Interoperability and Reusability, which requires detailed, data-type-specific documentation of methods, including data acquisition and processing steps, uncertainties, and other data quality measures.

The dilemma for domain repositories is that the rigorous implementation of such best practices requires substantial effort and expertise, which becomes a challenge when usage of the repository outgrows its resources. Rigorous implementation of best practices can also frustrate users, who are asked to revise and improve their data submissions, and may drive them to deposit their data in other, often general, repositories that do not perform such rigorous review and therefore minimize the burden of data deposition.

We will report on the recent experiences of EarthChem, a domain-specific data facility for the geochemical and petrological science community. EarthChem is recommended by publishers as a trusted repository for the preservation and open sharing of geochemical data. With the implementation of the FAIR data principles at multiple journals that publish geochemical and petrological research over the past year, the number, volume, and diversity of data submitted to the EarthChem Library has grown dramatically and is challenging existing procedures and resources that do not scale to the new level of usage. Curators are challenged to meet users’ expectations of immediate data publication and DOI assignment, and to process submissions that include new data types, are poorly documented, or contain code, images, and other digital content that is outside the scope of the repository. We will discuss possible solutions ranging from tiered data curation support to collaboration with other data repositories and engagement with publishers and editors to enhance guidance and education of authors.


2021
Author(s): Shelley Stall, Helen Glaves, Brooks Hanson, Kerstin Lehnert, Erin Robinson, ...

The Earth, space, and environmental sciences have made significant progress in awareness and implementation of policy and practice around the sharing of data, software, and samples. In particular, the Coalition for Publishing Data in the Earth and Space Sciences (https://copdess.org/) brings together data repositories and journals to discuss and address common challenges in support of more transparent and discoverable research and its supporting data. Since the inception of COPDESS in 2014 and the completion of the Enabling FAIR Data Project in 2019, work has continued on the improvement of availability statements for data and software, as well as the corresponding citations.

As the broad research community continues to make progress around data and software management and sharing, COPDESS is focused on several key efforts. These include 1) supporting authors in identifying the most appropriate data repository for preservation, 2) validating that all manuscripts have data and software availability statements, 3) ensuring data and software citations are properly included and linked to the publication to support credit, and 4) encouraging adoption of best practices.

We will review the status of these current efforts around data and software sharing, the important role that repositories and researchers have in ensuring that automated credit and attribution elements are in place, and the recent publications on software citation guidance from the FORCE11 Software Citation Implementation Working Group.
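One of the efforts listed above, validating that manuscripts carry availability statements, lends itself to a simple automated screen. The sketch below is a hypothetical first-pass check; the phrase list is invented for illustration and is not an official COPDESS or journal rule set.

```python
"""Hedged sketch of a first-pass editorial screen for data/software
availability statements. The phrase patterns are illustrative only;
a real screen would be curated by editors and followed by human review."""
import re

AVAILABILITY_PATTERNS = [
    r"data availability statement",
    r"data (are|is) available (at|from|in)",
    r"code (is|are) available (at|from|in)",
]


def has_availability_statement(text):
    """Return True if any availability phrase appears (case-insensitive)."""
    return any(re.search(p, text, re.IGNORECASE) for p in AVAILABILITY_PATTERNS)
```

Manuscripts failing such a screen would be flagged for editors rather than rejected outright, since availability statements take many phrasings a fixed pattern list will miss.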


2021 · Vol ahead-of-print (ahead-of-print)
Author(s): SiZhe Xiao, Tsz Yan Ng, Tao T. Yang

Purpose The purpose of this paper is to look at the journey and experience of the University of Hong Kong (HKU) Research Data Management (RDM) practice in responding to the needs of researchers in an academic library. Design/methodology/approach The research data services (RDS) practice is based on the FAIR data principles, and the authors designed an RDM stewardship framework to implement the RDS step by step. Findings The HKU Libraries developed and implemented a set of RDS under a research data stewardship framework in response to recent evolving research needs for RDM amongst the academic communities. The services cover policy and procedure settings for research data planning, research data infrastructure establishment, data curation services, and the provision of online resources and instructional guidelines. Originality/value This study provides an example of how an academic library can start RDS, including data policy, a data repository, data librarianship, and data curation.

