Guidelines for a Standardized Filesystem Layout for Scientific Data

Florian Spreckelsen; Baltasar Rüchardt; Jan Lebert; Stefan Luther; Ulrich Parlitz; Alexander Schlemmer

doi:10.3390/data5020043

Guidelines for a Standardized Filesystem Layout for Scientific Data

Data ◽

10.3390/data5020043 ◽

2020 ◽

Vol 5 (2) ◽

pp. 43

Author(s):

Florian Spreckelsen ◽

Baltasar Rüchardt ◽

Jan Lebert ◽

Stefan Luther ◽

Ulrich Parlitz ◽

...

Keyword(s):

Graphical User Interface ◽

Software Tool ◽

Original Data ◽

Scientific Data ◽

Command Line ◽

Scientific Data Management ◽

File Structure ◽

Widespread Acceptance ◽

Broad Variety ◽

Existing Data

Storing scientific data on the filesystem in a meaningful and transparent way is no trivial task. In particular, when the data have to be accessed after their originator has left the lab, the importance of a standardized filesystem layout cannot be overestimated. It is desirable to have a structure that allows for the unique categorization of all kinds of data, from experimental results to publications. This structure has to be accessible to a broad variety of workflows, e.g., via a graphical user interface as well as via the command line, in order to find widespread acceptance. Furthermore, the inclusion of already existing data has to be as simple as possible. We propose a three-level layout to organize and store scientific data that incorporates the full chain of scientific data management, from data acquisition to analysis to publications. Metadata are saved in a standardized way and connect original data to analyses and publications as well as to their originators. A simple software tool to check a file structure for compliance with the proposed structure is presented.

Guidelines for a Standardized File Structure for Scientific Data

10.20944/preprints202004.0035.v1 ◽

2020 ◽

Author(s):

Florian Spreckelsen ◽

Baltasar Rüchardt ◽

Jan Lebert ◽

Stefan Luther ◽

Ulrich Parlitz ◽

...

Keyword(s):

Level Structure ◽

Software Tool ◽

Original Data ◽

Scientific Data ◽

Command Line ◽

Scientific Data Management ◽

File Structure ◽

Widespread Acceptance ◽

Broad Variety ◽

Existing Data

Storing scientific data on the file system in a meaningful and transparent way is no trivial task. In particular, when the data have to be accessed after their originator has left the lab, the importance of a standardized file structure cannot be overestimated. It is desirable to have a structure that allows for the unique categorization of all kinds of data, from experimental results to publications. It has to be accessible to a broad variety of workflows, e.g., via a graphical user interface as well as via the command line, in order to find widespread acceptance. Furthermore, the inclusion of already existing data has to be as simple as possible. We propose a three-level structure to organize and store scientific data that incorporates the full chain of scientific data management, from data acquisition to analysis to publications. Metadata are saved in a standardized way and connect original data to analyses and publications as well as to their originators. A simple software tool to check a file structure for compliance with the proposed structure is presented.

Alview: Portable Software for Viewing Sequence Reads in BAM Formatted Files

Cancer Informatics ◽

10.4137/cin.s26470 ◽

2015 ◽

Vol 14 ◽

pp. CIN.S26470 ◽

Cited By ~ 2

Author(s):

Richard P. Finney ◽

Qing-Rong Chen ◽

Cu V. Nguyen ◽

Chih Hao Hsu ◽

Chunhua Yan ◽

...

Keyword(s):

Graphical User Interface ◽

Reference Genome ◽

Source Code ◽

Software Tool ◽

Command Line ◽

Sequencing Data ◽

Genome Data ◽

Command Line Tool ◽

Portable Software ◽

Microsoft Windows

The name Alview is a contraction of the term Alignment Viewer. Alview is a software tool, compiled to the native architecture, for visualizing the alignment of sequencing data. Inputs are files of short-read sequences aligned to a reference genome in the SAM/BAM format and files containing reference genome data. Outputs are visualizations of these aligned short reads. Alview is written in portable C, with optional graphical user interface (GUI) code written in C, C++, and Objective-C. The application can run in three different ways: as a web server, as a command line tool, or as a native GUI program. Alview is compatible with Microsoft Windows, Linux, and Apple OS X. It is available as a web demo at https://cgwb.nci.nih.gov/cgi-bin/alview . The source code and Windows/Mac/Linux executables are available via https://github.com/NCIP/alview .

A Política de Governança de Dados, Informação e Conhecimento da Embrapa como mecanismo para a gestão de dados de pesquisa agropecuários | Embrapa’s Data, Information and Knowledge Governance Policy as a framework for agricultural research data management

Liinc em Revista ◽

10.18617/liinc.v15i2.4798 ◽

2019 ◽

Vol 15 (2) ◽

Author(s):

Patrícia Rocha Bello Bertin ◽

Juliana Meireles Fortaleza ◽

Adriana Cristina Da Silva ◽

Massayuki Franco Okawachi ◽

Márcia De Oliveira Cardoso

Keyword(s):

Big Data ◽

Data Management ◽

Agricultural Research ◽

Original Data ◽

Scientific Data ◽

Research Data ◽

Data Governance ◽

Data Intensive ◽

Scientific Data Management ◽

Knowledge Policy

ABSTRACT The Big Data phenomenon and the fourth paradigm of science, e-Science, demand from science and technology institutions the proper management and preservation of research data, so as to enable access to, use of, and sharing of original data and thus achieve sustainability and competitiveness in the modern scientific and technological system. This paper comments on and analyzes Embrapa’s Data, Information and Knowledge Governance Policy, focusing on issues related to research data management. It is hoped that this Policy can be instrumental to other organizations in the national S&T system in developing their own standards. Keywords: Scientific Data; Data Intensive Science; Access; Sharing; Preservation; Management.

rawDiag - an R package supporting rational LC-MS method optimization for bottom-up proteomics

10.1101/304485 ◽

2018 ◽

Author(s):

Christian Trachsel ◽

Christian Panse ◽

Tobias Kockmann ◽

Witold E. Wolski ◽

Jonas Grossmann ◽

...

Keyword(s):

Mass Spectrometry ◽

Liquid Chromatography ◽

User Interface ◽

Graphical User Interface ◽

Real World ◽

Software Tool ◽

R Package ◽

Command Line ◽

Diagnostic Plots ◽

Method Optimization

Optimizing methods for liquid chromatography coupled to mass spectrometry (LC-MS) is a non-trivial task. Here we present rawDiag, a software tool supporting rational method optimization by providing MS operator-tailored diagnostic plots of scan-level metadata. rawDiag is implemented as an R package and can be executed on the command line or, for less experienced users, through a graphical user interface (GUI). The code is platform independent and, as our benchmark shows, can process a hundred raw files in less than three minutes on current consumer hardware. To demonstrate the functionality of our package, we include a real-world example taken from our daily core facility business.

Best Paper Selection

Yearbook of Medical Informatics ◽

10.1055/s-0037-1606505 ◽

2017 ◽

Vol 26 (01) ◽

pp. 212-213

Keyword(s):

Logistic Regression ◽

Private Information ◽

Statistical Models ◽

Heart Sound ◽

Information Leakage ◽

Scientific Data ◽

Training Data ◽

Genotype Data ◽

Scientific Data Management ◽

Heart Sound Segmentation

Agarwal V, Podchiyska T, Banda JM, Goel V, Leung TI, Minty EP, Sweeney TE, Gyang E, Shah NH. Learning statistical models of phenotypes using noisy labeled training data. J Am Med Inform Assoc 2016;23(6):1166-73 https://academic.oup.com/jamia/article-lookup/doi/10.1093/jamia/ocw028

Harmanci A, Gerstein M. Quantification of private information leakage from phenotype-genotype data: linking attacks. Nat Methods 2016;13(3):251-6 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4834871/

Pfiffner PB, Pinyol I, Natter MD, Mandl KD. C3-PRO: Connecting ResearchKit to the Health System Using i2b2 and FHIR. PloS One 2016;11(3):e0152722 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4816293/

Wilkinson MD, Dumontier M, Aalbersberg IJJ, Appleton G, Axton M, Baak A, Blomberg N, Boiten JW, da Silva Santos LB, Bourne PE, Bouwman J, Brookes AJ, Clark T, Crosas M, Dillo I, Dumon O, Edmunds S, Evelo CT, Finkers R, Gonzalez-Beltran A, Gray AJ, Groth P, Goble C, Grethe JS, Heringa J, ’t Hoen PA, Hooft R, Kuhn T, Kok R, Kok J, Lusher SJ, Martone ME, Mons A, Packer AL, Persson B, Rocca-Serra P, Roos M, van Schaik R, Sansone SA, Schultes E, Sengstag T, Slater T, Strawn G, Swertz MA, Thompson M, van der Lei J, van Mulligen E, Velterop J, Waagmeester A, Wittenburg P, Wolstencroft K, Zhao J, Mons B. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data 2016;3:160018 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4792175/

Springer DB, Tarassenko L, Clifford GD. Logistic regression-HSMM-based heart sound segmentation. IEEE Trans Biomed Eng 2016 Apr;63(4):822-32

Data Types in Scientific Data Management

Encyclopedia of Database Systems ◽

10.1007/978-1-4614-8265-9_1277 ◽

2018 ◽

pp. 860-863

Author(s):

Amarnath Gupta

Keyword(s):

Data Management ◽

Scientific Data ◽

Data Types ◽

Scientific Data Management

EventDB: A Large-Scale Semi-structured Scientific Data Management System

Big Scientific Data Management - Lecture Notes in Computer Science ◽

10.1007/978-3-030-28061-1_12 ◽

2019 ◽

pp. 105-115

Author(s):

Wenjia Zhao ◽

Yong Qi ◽

Di Hou ◽

Peijian Wang ◽

Xin Gao ◽

...

Keyword(s):

Data Management ◽

Management System ◽

Large Scale ◽

Scientific Data ◽

Data Management System ◽

Scientific Data Management

Scientific Data Management and Application in High Energy Physics

Big Scientific Data Management - Lecture Notes in Computer Science ◽

10.1007/978-3-030-28061-1_11 ◽

2019 ◽

pp. 92-104

Author(s):

Gang Chen ◽

Yaodong Cheng

Keyword(s):

Data Management ◽

High Energy Physics ◽

High Energy ◽

Scientific Data ◽

Scientific Data Management ◽

Energy Physics

Integrating Parallel File I/O and Database Support for High-Performance Scientific Data Management

ACM/IEEE SC 2000 Conference (SC'00) ◽

10.1109/sc.2000.10048 ◽

2000 ◽

Author(s):

Jaechun No ◽

R. Thakur ◽

A. Choudhary

Keyword(s):

Data Management ◽

High Performance ◽

Scientific Data ◽

Scientific Data Management ◽

Database Support ◽

Parallel File

Towards Archaeo-informatics: Scientific Data Management for Archaeobiology

Lecture Notes in Computer Science - Scientific and Statistical Database Management ◽

10.1007/978-3-642-13818-8_14 ◽

2010 ◽

pp. 169-177 ◽

Cited By ~ 2

Author(s):

Hans-Peter Kriegel ◽

Peer Kröger ◽

Christiaan Hendrikus van der Meijden ◽

Henriette Obermaier ◽

Joris Peters ◽

...

Keyword(s):

Data Management ◽

Scientific Data ◽

Scientific Data Management
