Repository Approaches to Improving the Quality of Shared Data and Code

Data ◽  
2021 ◽  
Vol 6 (2) ◽  
pp. 15
Author(s):  
Ana Trisovic ◽  
Katherine Mika ◽  
Ceilyn Boyd ◽  
Sebastian Feger ◽  
Mercè Crosas

Sharing data and code for reuse has become increasingly important in scientific work over the past decade. However, in practice, shared data and code may be unusable, or published results obtained from them may be irreproducible. Data repository features and services contribute significantly to the quality, longevity, and reusability of datasets. This paper presents a combination of original and secondary data analysis studies focusing on computational reproducibility, data curation, and gamified design elements that can be employed to indicate and improve the quality of shared data and code. The findings of these studies are sorted into three approaches that can be valuable to data repositories, archives, and other research dissemination platforms.

2018 ◽  
Author(s):  
Hamid Bagheri ◽  
Usha Muppirala ◽  
Andrew J Severin ◽  
Hridesh Rajan

Background: Creating a computational infrastructure that scales well to analyze the wealth of information contained in data repositories is difficult due to significant barriers in organizing, extracting and analyzing relevant data. Shared Data Science Infrastructures like Boa can be used to more efficiently process and parse data contained in large data repositories. The main features of Boa are inspired by existing languages for data-intensive computing, and it can easily integrate data from biological data repositories. Results: Here, we present an implementation of Boa for genomic research (BoaG) on a relatively small data repository: RefSeq's 97,716 annotation (GFF) and assembly (FASTA) files and metadata. We used BoaG to query the entire RefSeq dataset, gain insight into the RefSeq genome assemblies and gene model annotations, and show that assembly quality using the same assembler varies depending on species. Conclusions: In order to keep pace with our ability to produce biological data, innovative methods are required. The Shared Data Science Infrastructure, BoaG, can give researchers greater access to efficiently explore data in ways previously possible only for the most well-funded research groups. We demonstrate the efficiency of BoaG in exploring the RefSeq database of genome assemblies and annotations to identify interesting features of gene annotation, as a proof of concept for much larger datasets.
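BoaG's query syntax is not shown in the abstract; as a rough, plain-Python illustration of the kind of aggregation such a query performs over GFF annotation files (the sample records and function below are assumptions for illustration, not BoaG code):

```python
from collections import Counter
from io import StringIO

# A tiny inline sample standing in for a RefSeq GFF3 annotation file.
SAMPLE_GFF = """\
##gff-version 3
chr1\tRefSeq\tgene\t100\t900\t.\t+\t.\tID=gene0
chr1\tRefSeq\tmRNA\t100\t900\t.\t+\t.\tID=rna0;Parent=gene0
chr1\tRefSeq\texon\t100\t300\t.\t+\t.\tParent=rna0
chr1\tRefSeq\texon\t500\t900\t.\t+\t.\tParent=rna0
"""

def count_feature_types(gff_lines):
    """Tally feature types (GFF column 3), skipping pragma/comment lines."""
    counts = Counter()
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3:
            counts[fields[2]] += 1
    return counts

counts = count_feature_types(StringIO(SAMPLE_GFF))
print(dict(counts))  # {'gene': 1, 'mRNA': 1, 'exon': 2}
```

A BoaG query would express the same tally declaratively and distribute it across the full 97,716-file dataset rather than a single stream.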


2015 ◽  
Vol 3 (1) ◽  
pp. 1
Author(s):  
Tikkyrino Kurniawan ◽  
Ahmad Azizi

Title: The Impact of Import Policy and Institutions on National Salt Industry Performance. Discrepancies in official salt production data and the low quality of domestically produced salt have led to a large volume of imports, to the detriment of smallholder salt producers, whose income from the salt sector must support their households for an entire year. The aim of this research is to analyse salt import performance and the institutions of the salt industry in relation to price stability and salt farmers' welfare. The research was conducted from April to May 2012 using a combination of primary and secondary data, analysed descriptively. The results indicate that the gap between the production figures of the Ministry for Marine Affairs and Fisheries (MMAF) and the national data affects the accuracy of salt imports. The industrialization of production and imports affects farmers' incentive to produce and their welfare. The situation is worsened by the poor implementation of profit-sharing institutions given the condition of the smallholder salt industry. The calculation of salt import data needs to be improved, whether through collaboration among public institutions or through a dedicated agency acting as data collector, so that salt production data become more valid.


2021 ◽  
Vol 10 (3) ◽  
Author(s):  
Fernando Rios ◽  
Chun Ly

Objective: To increase data quality and ensure compliance with appropriate policies, many institutional data repositories curate data that is deposited into their systems. Here, we present our experience as an academic library implementing and managing a semi-automated, cloud-based data curation workflow for a recently launched institutional data repository. Based on our experiences, we then present management observations intended for data repository managers and technical staff looking to move some or all of their curation services to the cloud. Methods: We implemented tooling for our curation workflow in a service-oriented manner, making significant use of our data repository platform's application programming interface (API). With an eye towards sustainability, a guiding development philosophy has been to automate processes following industry best practices while avoiding solutions with high resource needs (e.g., maintenance), and minimizing the risk of becoming locked in to specific tooling. Results: The initial barrier to implementing a data curation workflow in the cloud was high in comparison to on-premises curation, mainly due to the need to develop in-house cloud expertise. However, compared to the cost of on-premises servers and storage, infrastructure costs have been substantially lower. Furthermore, in our particular case, once the foundation had been established, a cloud approach resulted in increased agility, allowing us to quickly automate our workflow as needed. Conclusions: Workflow automation has put us on a path toward scaling the service, and a cloud-based approach has helped reduce initial costs. However, because cloud-based workflows and automation come with a maintenance overhead, it is important to build tooling that follows software development best practices and can be decoupled from curation workflows to avoid lock-in.
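The decoupling the authors recommend can be sketched as follows; the client class, endpoint, field names, and checks are hypothetical illustrations, not the authors' actual tooling or their repository platform's API:

```python
class RepositoryClient:
    """Thin wrapper around a hypothetical data-repository REST API.
    Keeping all platform-specific calls here means the curation logic
    below never depends on a particular vendor's API (less lock-in)."""

    def __init__(self, transport):
        # transport: callable (method, path) -> parsed JSON
        self.transport = transport

    def pending_deposits(self):
        return self.transport("GET", "/api/deposits?status=pending")


def curation_checks(deposit):
    """Platform-agnostic checks a curator might automate:
    flag deposits missing a description or a license."""
    problems = []
    if not deposit.get("description"):
        problems.append("missing description")
    if not deposit.get("license"):
        problems.append("missing license")
    return problems


# Fake transport standing in for the real HTTP layer, so the sketch runs offline.
def fake_transport(method, path):
    return [{"id": 1, "description": "soil data", "license": None},
            {"id": 2, "description": "", "license": "CC-BY-4.0"}]

client = RepositoryClient(fake_transport)
report = {d["id"]: curation_checks(d) for d in client.pending_deposits()}
print(report)  # {1: ['missing license'], 2: ['missing description']}
```

Because `curation_checks` knows nothing about the API, swapping repository platforms only requires replacing the transport layer, which is the design property the conclusions argue for.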


2019 ◽  
Author(s):  
Retta Ida Lumongga

In the past, the village on the bank of the Code River looked shabby; it was later renovated into a colorful village. The problem is whether colorful villages have become ideal settlements or remain slums. The novelty of this study is determining the slum level of the colorful village in relation to new policies for improving the quality of slum settlements. The scope of the material is the assessment of the physical environment in a colorful village. The study aimed to assess the physical environment of the village and to find the causes of its slum conditions. The research method is slum-level analysis using weighting analysis tools. The data collection methods used are observation, interviews, and the processing of secondary data. The scope of the area is the Sayidan village area on the bank of the Code River, Yogyakarta. As a result, even though the colorful village looks aesthetically beautiful, its physical environment is in fact still classified as a high-level slum. In conclusion, the high slum level is due to building conditions and infrastructure conditions that still do not fully meet the applicable requirements.


2017 ◽  
Vol 35 (4) ◽  
pp. 626-649 ◽  
Author(s):  
Wei Jeng ◽  
Daqing He ◽  
Yu Chi

Purpose Owing to the recent surge of interest in the age of the data deluge, the importance of researching data infrastructures is increasing. The open archival information system (OAIS) model has been widely adopted as a framework for creating and maintaining digital repositories. Considering that OAIS is a reference model that requires customization for actual practice, this paper aims to examine how the current practices in a data repository map to the OAIS environment and functional components. Design/methodology/approach The authors conducted two focus-group sessions and one individual interview with eight employees at the world’s largest social science data repository, the Interuniversity Consortium for Political and Social Research (ICPSR). By examining their current actions (activities regarding their work responsibilities) and IT practices, they studied the barriers and challenges of archiving and curating qualitative data at ICPSR. Findings The authors observed that the OAIS model is robust and reliable in actual service processes for data curation and data archives. In addition, a data repository’s workflow resembles digital archives or even digital libraries. On the other hand, they find that the cost of preventing disclosure risk and a lack of agreement on the standards of text data files are the most apparent obstacles for data curation professionals to handle qualitative data; the maturation of data metrics seems to be a promising solution to several challenges in social science data sharing. Originality/value The authors evaluated the gap between a research data repository’s current practices and the adoption of the OAIS model. They also identified answers to questions such as how current technological infrastructure in a leading data repository such as ICPSR supports their daily operations, what the ideal technologies in those data repositories would be and the associated challenges that accompany these ideal technologies. 
Most importantly, they helped to prioritize challenges and barriers from the data curator’s perspective and to contribute implications of data sharing and reuse in social sciences.


2009 ◽  
Vol 4 (2) ◽  
pp. 12-27 ◽  
Author(s):  
Karen S. Baker ◽  
Lynn Yarmey

Scientific researchers today frequently package measurements and associated metadata as digital datasets in anticipation of storage in data repositories. Through the lens of environmental data stewardship, we consider the data repository as an organizational element central to data curation. One aspect of non-commercial repositories, their distance-from-origin of the data, is explored in terms of near and remote categories. Three idealized repository types are distinguished (local, center, and archive), paralleling research, resource, and reference collection categories, respectively. Repository type characteristics such as scope, structure, and goals are discussed. Repository similarities in terms of roles, activities and responsibilities are also examined. Data stewardship is related to care of research data and responsible scientific communication supported by an infrastructure that coordinates curation activities; data curation is defined as a set of repeated and repeatable activities focusing on tending data and creating data products within a particular arena. The concept of "sphere-of-context" is introduced as an aid to distinguishing repository types. Conceptualizing a "web-of-repositories" accommodates a variety of repository types and represents an ecologically inclusive approach to data curation.


2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Nushrat Khan ◽  
Mike Thelwall ◽  
Kayvan Kousha

Purpose: The purpose of this study is to explore the current practices, challenges and technological needs of different data repositories. Design/methodology/approach: An online survey was designed for data repository managers, and contact information was collected from re3data, a data repository registry, to disseminate the survey. Findings: In total, 189 responses were received, including 47% discipline-specific and 34% institutional data repositories. A total of 71% of the repositories reporting their software used bespoke technical frameworks, with DSpace, EPrints and Dataverse being commonly used by institutional repositories. Of repository managers, 32% reported tracking secondary data reuse, while 50% would like to. Among data reuse metrics, citation counts were considered extremely important by the majority, followed by links to the data from other websites and download counts. Despite their perceived usefulness, repository managers struggle to track dataset citations. Most repository managers support dataset and metadata quality checks via librarians, subject specialists or information professionals. A lack of engagement from users and a lack of human resources are the top two challenges, and outreach is the most common motivator mentioned by repositories across all groups. Ensuring findable, accessible, interoperable and reusable (FAIR) data (49%), providing user support for research (36%) and developing best practices (29%) are the top three priorities for repository managers. The main recommendations for future repository systems are as follows: integration and interoperability between data and systems (30%), better research data management (RDM) tools (19%), tools that allow computation without downloading datasets (16%) and automated systems (16%). Originality/value: This study identifies the current challenges and needs for improving data repository functionalities and user experiences. Peer review: The peer review history for this article is available at: https://publons.com/publon/10.1108/OIR-04-2021-0204


Author(s):  
Matthew Osmond ◽  
Taila Hartley ◽  
Brittney Johnstone ◽  
Sasha Andijc ◽  
Marta Girdea ◽  
...  

A major challenge in validating genetic causes for patients with rare diseases (RDs) is the difficulty of identifying other RD patients with overlapping phenotypes and variants in the same candidate gene. This process, known as matchmaking, requires robust data sharing solutions in order to be effective. In 2014 we launched PhenomeCentral, an RD data repository capable of collecting computer-readable genotypic and phenotypic data for the purposes of RD matchmaking. Over the past 7 years PhenomeCentral's features have been expanded and its dataset has consistently grown. There are currently 1,615 users registered on PhenomeCentral, who have contributed over 12,000 patient cases. Most of these cases contain detailed phenotypic terms, with a significant portion also providing genomic sequence data or other forms of clinical information. Matchmaking within PhenomeCentral, and with connections to other data repositories in the Matchmaker Exchange, has collectively resulted in over 60,000 matches, which have facilitated multiple gene discoveries. The collection of deep phenotypic and genotypic data has also positioned PhenomeCentral well to support the next generation of matchmaking initiatives that utilize genome sequencing data, ensuring that PhenomeCentral will remain a useful tool in solving undiagnosed RD cases in the years to come.
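PhenomeCentral's actual matching algorithm is not described in the abstract; as a purely illustrative sketch of the matchmaking idea it implements (a shared candidate gene combined with phenotype overlap), with made-up cases and real HPO term identifiers used only as sample data:

```python
def match_score(case_a, case_b):
    """Naive matchmaking heuristic (illustrative only): require a shared
    candidate gene, then score phenotype overlap with the Jaccard index."""
    if not (case_a["genes"] & case_b["genes"]):
        return 0.0
    pheno_a, pheno_b = case_a["phenotypes"], case_b["phenotypes"]
    return len(pheno_a & pheno_b) / len(pheno_a | pheno_b)

# Hypothetical cases; the phenotype sets use HPO identifiers
# (seizure, global developmental delay, microcephaly).
case1 = {"genes": {"ABCB7"},
         "phenotypes": {"HP:0001250", "HP:0001263", "HP:0000252"}}
case2 = {"genes": {"ABCB7"},
         "phenotypes": {"HP:0001250", "HP:0001263"}}
case3 = {"genes": {"BRCA2"},
         "phenotypes": {"HP:0001250"}}

print(round(match_score(case1, case2), 3))  # 0.667 (same gene, 2 of 3 terms shared)
print(match_score(case1, case3))            # 0.0   (no shared candidate gene)
```

Real matchmaking services use semantic similarity over the phenotype ontology rather than exact term overlap, but the input data (genes plus coded phenotypes) is the same.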


2019 ◽  
Vol 6 (01) ◽  
Author(s):  
Yohana Pala Juni Damanik

ABSTRACT This research tests and analyses the ability of net income and other comprehensive income to explain future returns, with earnings quality as a moderating variable, for commercial banks listed on the Indonesia Stock Exchange. The study uses secondary data; the sample comprises 22 commercial banks listed on the Indonesia Stock Exchange over the period 2013 to 2016. The results show that past net income has a significant effect on returns; past other comprehensive income has a significant effect on returns; 2015 net income has no significant effect on 2016 returns; 2015 other comprehensive income has a significant effect on 2016 returns; the interaction of past net income and past earnings quality has a significant, quasi-moderating effect on returns; the interaction of past other comprehensive income and past earnings quality has no significant effect on returns; the interaction of 2015 net income and 2015 earnings quality has a significant, quasi-moderating effect on returns; and the interaction of 2015 other comprehensive income and 2015 earnings quality has a significant, quasi-moderating effect on returns.


Author(s):  
João Rocha da Silva ◽  
Cristina Ribeiro ◽  
João Correia Lopes

This chapter presents a solution for the management of research data at a higher education and research institution. The chapter is based on a small-scale data audit study, which included contacts with researchers and yielded some preliminary requirements and use cases. These requirements led to the design of a data curation workflow involving the researcher, the curator, and a data repository. The authors describe the features of the data repository prototype, which is an extension to the widely used DSpace repository platform and introduces a set of features mentioned by the majority of the interviewed researchers as relevant for a data repository. The data repository platform contributes to the curation workflow at the university, with XML technology at its core: data is stored as XML documents, which can be systematically processed and queried, unlike their original-format counterparts. This system is capable of indexing, querying, and retrieving, in whole or in part, datasets represented in tabular form. There is also the possibility of using elements from domain-specific XML schemas for the cataloguing process, improving the interoperability and quality of the deposited data.
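The chapter's approach of storing tabular data as systematically queryable XML can be illustrated with a minimal sketch; the element names and dataset below are assumptions for illustration, not the authors' actual schema:

```python
import xml.etree.ElementTree as ET

# A tabular dataset represented as an XML document, one <row> per record.
DATASET_XML = """
<dataset name="water-quality">
  <row><site>A</site><ph>6.8</ph></row>
  <row><site>B</site><ph>7.4</ph></row>
  <row><site>C</site><ph>8.1</ph></row>
</dataset>
"""

root = ET.fromstring(DATASET_XML)

# Retrieving part of the dataset: sites whose pH reading exceeds 7.
alkaline = [row.findtext("site")
            for row in root.iter("row")
            if float(row.findtext("ph")) > 7.0]
print(alkaline)  # ['B', 'C']
```

This kind of partial retrieval is what a plain original-format file (e.g., an opaque spreadsheet blob) does not support, which is the chapter's motivation for the XML representation.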

