A Reproducible Data Analysis Workflow

2019 ◽  
Author(s):  
Aaron Peikert ◽  
Andreas M. Brandmaier

In this tutorial, we describe a workflow to ensure long-term reproducibility of R-based data analyses. The workflow leverages established tools and practices from software engineering. It combines the benefits of various open-source software tools, including R Markdown, Git, Make, and Docker, whose interplay ensures seamless integration of version management, dynamic report generation conforming to various journal styles, and full cross-platform and long-term computational reproducibility. The workflow ensures that three primary goals are met: (1) the reporting of statistical results is consistent with the actual statistical results (dynamic report generation), (2) the analysis can be exactly reproduced at a later point in time even if the computing platform or software has changed (computational reproducibility), and (3) changes at any time (during development and post-publication) are tracked, tagged, and documented while earlier versions of both data and code remain accessible. While the research community increasingly recognizes dynamic document generation and version management as tools to ensure reproducibility, we demonstrate with practical examples that these alone are not sufficient to ensure long-term computational reproducibility. By combining containerization, dependency management, version management, and dynamic document generation, the proposed workflow increases scientific productivity by facilitating later reproducibility and reuse of code and data.
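
To make the first goal concrete, here is a minimal sketch of dynamic report generation (our illustration, not code from the tutorial; the file name manuscript.Rmd is a hypothetical placeholder):

```r
# Minimal sketch of dynamic report generation (illustrative only).
# In an R Markdown manuscript, results are computed in code chunks and
# referenced inline, e.g.
#   The groups differed, t(`r t_res$parameter`) = `r round(t_res$statistic, 2)`.
# so reported numbers are regenerated on every render and cannot drift
# from the analysis.
t_res <- t.test(extra ~ group, data = sleep)  # example analysis on a built-in dataset

# Rendering re-runs all chunks; in the full workflow this call would
# itself be invoked by Make inside a Docker container.
rmarkdown::render("manuscript.Rmd")
```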


2019 ◽  
Vol 20 (S24) ◽  
Author(s):  
Venkat Sundar Gadepalli ◽  
Hatice Gulcin Ozer ◽  
Ayse Selen Yilmaz ◽  
Maciej Pietrzak ◽  
Amy Webb

Abstract Background RNA sequencing has become an increasingly affordable way to profile gene expression patterns. Here we introduce a workflow implementing several open-source software tools that can be run in a high-performance computing environment. Results The pipeline was developed as a tool by the Bioinformatics Shared Resource Group (BISR) at The Ohio State University; we applied it to several publicly available RNAseq datasets downloaded from GEO to demonstrate the feasibility of this workflow. Source code is available here: workflow: https://code.bmi.osumc.edu/gadepalli.3/BISR-RNAseq-ICIBM2019 and shiny: https://code.bmi.osumc.edu/gadepalli.3/BISR_RNASeq_ICIBM19. An example dataset is demonstrated here: https://dataportal.bmi.osumc.edu/RNA_Seq/. Conclusion The workflow allows for the analysis (alignment, QC, generation of gene-wise counts) of raw RNAseq data and seamless integration of quality analysis and differential expression results into a configurable R Shiny web application.
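
As an illustration of the final step, here is a minimal sketch of a configurable Shiny view over differential expression output. This is not the BISR application itself; de_results.csv and its columns (gene, log2FC, padj) are hypothetical stand-ins for the pipeline's results:

```r
# Minimal Shiny app to browse differential expression results with a
# configurable significance cutoff (illustrative sketch).
library(shiny)

de <- read.csv("de_results.csv")  # assumed columns: gene, log2FC, padj

ui <- fluidPage(
  titlePanel("Differential expression results"),
  sliderInput("padj", "Adjusted p-value cutoff",
              min = 0, max = 0.2, value = 0.05, step = 0.01),
  tableOutput("hits")
)

server <- function(input, output) {
  # Show only genes passing the user-selected cutoff.
  output$hits <- renderTable(de[de$padj <= input$padj, ])
}

shinyApp(ui, server)
```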


Author(s):  
Adolfo RAMÍREZ-ROMÁN ◽  
Ángel SUÁREZ-ÁLVAREZ ◽  
Jacqueline CHABAT-URANGA ◽  
Francisco ORTIZ-MARTÍNEZ

Analyzing coffee grain roasting operations in the Veracruz Region will inform the layout of equipment and machinery in the Industrial Engineering workshop of the Educational Program. The study aims to improve the roasting process through work study (January–June 2019) and to lay the foundation for proposals on quality management, safety, and environmental care systems in the following phases of the project (2019-2020), with effects on continuous improvement. The coffee industry in Mexico has medium- and long-term opportunities to grow and consolidate. As with most agricultural products from smallholders, the prices paid for the raw input ("cherry coffee") are far below what processed coffee obtains in soluble and ground-powder presentations for coffee makers, so improvements in quality and efficiency of roasting activities are necessary. The method analyzed case studies of the operations of companies and producers, research related to the coffee industry, and the interpretation of the statistical results.


2021 ◽  
Vol 13 (18) ◽  
pp. 3672
Author(s):  
Johannes H. Uhl ◽  
Stefan Leyk ◽  
Zekun Li ◽  
Weiwei Duan ◽  
Basel Shbita ◽  
...  

Spatially explicit, fine-grained datasets describing historical urban extents are rarely available prior to the era of operational remote sensing. However, such data are necessary to better understand long-term urbanization and land development processes and for the assessment of coupled nature–human systems (e.g., the dynamics of the wildland–urban interface). Herein, we propose a framework that jointly uses remote-sensing-derived human settlement data (i.e., the Global Human Settlement Layer, GHSL) and scanned, georeferenced historical maps to automatically generate historical urban extents for the early 20th century. By applying unsupervised color space segmentation to the historical maps, spatially constrained to the urban extents derived from the GHSL, our approach generates historical settlement extents for seamless integration with the multi-temporal GHSL. We apply our method to study areas in countries across four continents, and evaluate our approach against historical building density estimates from the Historical Settlement Data Compilation for the US (HISDAC-US), and against urban area estimates from the History Database of the Global Environment (HYDE). Our results achieve Area-under-the-Curve values >0.9 when comparing to HISDAC-US and are largely in agreement with model-based urban areas from the HYDE database, demonstrating that the integration of remote-sensing-derived observations and historical cartographic data sources opens up new, promising avenues for assessing urbanization and long-term land cover change in countries where historical maps are available.
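
As a rough illustration of the segmentation idea (our sketch, not the authors' implementation), unsupervised k-means clustering in color space can be applied to map pixels that fall inside a settlement mask; the data here are simulated:

```r
# Illustrative sketch: unsupervised color-space segmentation of scanned
# map pixels, spatially constrained to a settlement mask, via k-means.
set.seed(42)

# Hypothetical input: an n x 3 matrix of RGB values (0-1) for map pixels
# inside the GHSL-derived urban mask. Simulated here for the example.
n <- 5000
map_rgb <- rbind(
  matrix(runif(3 * n, 0.7, 1.0), ncol = 3),  # light background-like pixels
  matrix(runif(3 * n, 0.4, 0.6), ncol = 3)   # darker built-up-like symbology
)

# Cluster pixel colors; the number of clusters reflects the number of
# distinct map symbologies and is a tuning choice.
km <- kmeans(map_rgb, centers = 4, nstart = 10)

# Label the cluster with the darkest centroid as "built-up", a stand-in
# for matching clusters to the map's urban symbology.
builtup_cluster <- which.min(rowSums(km$centers))
is_builtup <- km$cluster == builtup_cluster
table(is_builtup)
```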


2014 ◽  
Vol 2014 (1) ◽  
pp. 000262-000267
Author(s):  
Daniel J. Duffy ◽  
Lin Xin ◽  
Jean Liu ◽  
Bruno Tolla

One-step chip attach (OSCA) materials are dispensable polymeric materials for flip-chip assembly, designed to flux metallic interconnections and subsequently turn into an underfill upon curing. OSCA materials enable a drastic simplification of the assembly process by combining the reflow (fluxing/soldering), defluxing, and capillary underfilling steps used in traditional processing into a single step. One key challenge for the design of OSCA materials is timing the cure kinetics with fluxing activity and solder reflow during processing. A second key challenge is to factor a process-friendly rheological design into the formulation. The OSCA material rheology must allow for high filler loading levels, seamless integration with standard dispensing equipment, flow control during and after dispense (to avoid keep-out zones), flow during die placement (elimination of voids), after placement (fillet formation), and during reflow. The final key requirements for a functional device are defect-free interconnections combined with optimal thermo-mechanical and water-resistant properties of the final underfill to guarantee the long-term reliability of the assembly under various environmental conditions. This paper presents the properties of materials designed by Kester for use in mass reflow processing (OSCA-R). The rheological design principles behind seamless integration into customer-friendly processes will be presented. In addition, results illustrating the timing of cure kinetics with fluxing and soldering events during processing will be discussed. Preliminary device reliability results will also be presented for several types of test vehicles, including Si-Si and Si-FR4.


2019 ◽  
Vol 5 (Supplement_1) ◽  
Author(s):  
D Schmitz ◽  
S Nooij ◽  
T Janssens ◽  
J Cremer ◽  
H Vennema ◽  
...  

Abstract As research next-generation sequencing (NGS) metagenomic pipelines transition to clinical diagnostics, the user base changes from bioinformaticians to biologists, medical doctors, and lab technicians. Besides the obvious need for benchmarking and assessment of diagnostic outcomes of the pipelines and tools, other focus points remain: reproducibility, data immutability, user-friendliness, portability/scalability, privacy, and a clear audit trail. We have a research metagenomics pipeline that takes raw fastq files and produces annotated contigs, but it is too complicated for non-bioinformaticians. Here, we present preliminary findings in adapting this pipeline for clinical diagnostics. We used information available on relevant fora (www.bioinfo-core.org) and experiences and publications from colleague bioinformaticians at other institutes (COMPARE, UBC, and LUMC). From this information, a robust and user-friendly storage and analysis workflow was designed for non-bioinformaticians in a clinical setting. Via Conda [https://conda.io] and Docker containers [http://www.docker.com], we made our disparate pipeline processes self-contained and reproducible. Furthermore, we moved all pipeline settings into a separate JSON file. After every analysis, the pipeline settings and virtual-environment recipes are archived (immutably) under a persistent unique identifier. This allows long-term, precise reproducibility. Likewise, after every run the raw data and final products are automatically archived, complying with data retention laws and guidelines. All the disparate processes in the pipeline are parallelized and automated via Snakemake (i.e. end-users need no coding skills). In addition, interactive web reports such as MultiQC [http://multiqc.info] and Krona are generated automatically. By combining Snakemake, Conda, and containers, our pipeline is highly portable and easily scaled up for outbreak situations, or scaled down to reduce costs. Since patient privacy is a concern, our pipeline automatically removes human genetic data. Moreover, all source code is stored on an internal GitLab server, which, combined with the archived data, ensures a clear audit trail. Nevertheless, challenges remain: (1) reproducible reference databases, e.g. being able to revert to an older version to reproduce old analyses; (2) a user-friendly GUI; (3) connecting the pipeline and NGS data to the in-house LIMS; and (4) efficient long-term storage, e.g. lossless compression algorithms. Still, this work represents a step forward in making user-friendly clinical diagnostic workflows.
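
A minimal sketch of the settings-archiving idea (our illustration, not the institute's pipeline), assuming the jsonlite and digest R packages: pipeline settings live in a separate JSON structure and are archived under a content-derived identifier, so identical settings always resolve to the same archive entry:

```r
# Illustrative sketch: archive pipeline settings immutably under a
# persistent, content-derived identifier. Settings shown are made up.
library(jsonlite)
library(digest)

settings <- list(
  trimmer      = list(tool = "fastp", min_quality = 20),
  assembler    = list(tool = "spades", mode = "meta"),
  reference_db = "viral_refseq_2019-03"  # hypothetical database version tag
)

# A SHA-256 over the canonical JSON serves as the persistent unique ID:
# the same settings always hash to the same archive entry.
json_text <- toJSON(settings, auto_unbox = TRUE, pretty = TRUE)
run_id <- digest(json_text, algo = "sha256")

archive_dir <- file.path("archive", run_id)
dir.create(archive_dir, recursive = TRUE, showWarnings = FALSE)
writeLines(json_text, file.path(archive_dir, "settings.json"))

message("Settings archived under ID: ", run_id)
```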


2016 ◽  
Vol 40 (5) ◽  
pp. 835-845
Author(s):  
Ming-Sen Hu ◽  
Chia-Rei Tao

The capacity of a ship's oil tanks is usually recorded in tabular form so that oil volumes can be obtained from measured ullage heights. However, the tank walls easily deform or distort under long-term heavy loading. This deformation may cause serious errors, such that the actual carrying capacity of the oil tanker no longer matches the values in the table. In this paper, we describe an oil tank volume calibration project that aims to develop tank volume calculation and report generation software with trim and list corrections. The current internal specification of each tank is measured first, and then all measured specification data are input into the software system to calculate each tank's volume. The calculated results are verified by actual delivery volume tests. The software system has been applied to the Der-Yun Oil Tanker of CPC Corp. The result shows that the overall error of calibrated volume for all tanks is under 0.1%, demonstrating that the system greatly improves the accuracy of the vessel's carrying capacity.
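
As a simplified illustration of the lookup such software performs (not the paper's actual system), the sketch below interpolates linearly in a hypothetical ullage-to-volume calibration table and applies a crude, made-up trim correction; real calibration software uses tank-specific trim and list correction tables:

```r
# Illustrative sketch: volume lookup from an ullage sounding table via
# linear interpolation, with a simplified trim correction. All numbers
# are hypothetical.
ullage_m  <- c(0.5, 1.0, 1.5, 2.0, 2.5, 3.0)   # ullage height (m)
volume_m3 <- c(980, 850, 710, 560, 400, 230)   # corresponding oil volume (m^3)

tank_volume <- function(measured_ullage, trim_m = 0,
                        gauge_dist_from_center = 10) {
  # Simplified trim correction: shift the measured ullage by the level
  # change at the gauge point for a given trim (illustrative geometry,
  # not a real correction table).
  ullage_corrected <- measured_ullage + trim_m * gauge_dist_from_center / 100
  approx(ullage_m, volume_m3, xout = ullage_corrected)$y
}

tank_volume(1.75)              # even keel
tank_volume(1.75, trim_m = 1)  # 1 m trim by the stern (hypothetical)
```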


Proceedings ◽  
2020 ◽  
Vol 36 (1) ◽  
pp. 164
Author(s):  
Virginie Perlo ◽  
Agnelo Furtado ◽  
Frikkie Botha ◽  
Robert Henry

Sugarcane has a high potential to support second-generation ethanol production and environmentally friendly by-products for use in the chemical, pharmaceutical, medical, cosmetic, and food industries. A crucial challenge for long-term economic viability is to optimise the crop for production of a biomass composition that will ensure maximum economic benefit. Transcriptome data analysis provides a relevant explanation of phenotypic variance and gives a more accurate prediction of phenotypes than genomic information alone. A multi-omic approach, with integrated transcriptomics and metabolomics analysis, may reveal details of biological mechanisms and pathways. A global view of transcriptional regulation and the identification of differentially expressed genes (DEGs) and metabolites may improve the feasibility of engineering targeted biosynthetic pathways to improve the production of these bio-products from sugarcane. We propose a profiling analysis workflow (pipeline) to generate empirical correlations between gene expression, metabolites, proteins, and phenotypic traits, together with pathway analysis, with a particular focus on data visualisation. This study of genetic variation in gene expression and its correlations with metabolic and protein phenotypes relies on high-throughput methodology: measurement and analysis of 360 samples, comprising 24 commercial sugarcane cultivars with different phenotypic characteristics at five developmental stages, with three replicates.
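
A minimal sketch of the kind of empirical correlation such a pipeline computes at scale (our illustration with simulated data, not the proposed workflow itself):

```r
# Illustrative sketch: per-gene correlation of expression with a
# phenotypic trait across samples, with multiple-testing correction.
set.seed(1)
n_samples <- 360; n_genes <- 200

expr <- matrix(rnorm(n_samples * n_genes), nrow = n_samples,
               dimnames = list(NULL, paste0("gene", seq_len(n_genes))))
trait <- expr[, "gene1"] * 0.5 + rnorm(n_samples)  # trait linked to gene1

# Pearson correlation of each gene with the trait.
res <- apply(expr, 2, function(g) {
  ct <- cor.test(g, trait)
  c(r = unname(ct$estimate), p = ct$p.value)
})

# Benjamini-Hochberg correction across all genes; gene1 should rank first.
padj <- p.adjust(res["p", ], method = "BH")
head(sort(padj))
```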


2013 ◽  
Vol 278-280 ◽  
pp. 1962-1965
Author(s):  
Song Fei ◽  
Xiao Jing Wang ◽  
Zhe Cui

We propose a new trust model based on P2P technology for the cloud computing environment. The model takes into account more than one cloud computing platform; that is, it considers different cloud service providers jointly providing a service across cloud platforms. Such a cross-platform cloud (Cross Cloud) can be called a composite cloud computing platform or an associated cloud computing platform. Nodes in the cloud computing environment are divided into two categories: customers and providers. According to the different roles of these two kinds of nodes, we designed distinct trust mechanisms: trust domains are delimited by independent single clouds, and node independence and the manageability of each domain are taken into account during trust selection and trust updating. We also propose a new kind of cloud computing service, a trust recommendation service.
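
The abstract leaves the trust update rule unspecified. As a generic illustration only, and not the authors' mechanism, many reputation-based trust systems update a node's trust value by exponential smoothing over interaction outcomes:

$$T_{t+1} = \alpha \, s_t + (1 - \alpha) \, T_t, \qquad 0 < \alpha < 1,$$

where $T_t$ is the current trust value of a provider node, $s_t \in [0, 1]$ is the satisfaction score of the latest interaction, and $\alpha$ controls how strongly recent behaviour outweighs history.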


2018 ◽  
Author(s):  
Louis Gioia ◽  
Sunil Kurian ◽  
Tony S. Mondala ◽  
Laia Bassaganyas ◽  
Pui-Yan Kwok ◽  
...  

Abstract Long-term renal allograft rejection is the most common outcome in kidney transplantation. Continuing the effort to extend allograft function after the first year post-transplantation, we attempted to identify genetic factors that might contribute to long-term allograft outcomes by sequencing the exomes of patients diagnosed with chronic allograft nephropathy/interstitial fibrosis and tubular atrophy. A variety of association analyses were employed, but these analyses failed to identify statistically significant associations. The study was underpowered to detect the association of rare genomic variants with small effect sizes. However, it confirmed previous reports of the absence of large effects from common variants. We have made both the study data and analysis workflow available for public use, and we hope that these resources will help to power future meta-analyses that may detect smaller effects.
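
For readers unfamiliar with variant association testing, here is a minimal single-variant case-control sketch (our illustration with made-up counts; the study itself employed a variety of association analyses it does not enumerate here):

```r
# Illustrative sketch: Fisher's exact test on allele counts for one
# variant in a case-control design. Counts are fabricated for the example.
allele_counts <- matrix(
  c(30, 170,   # cases:    alt, ref alleles
    18, 182),  # controls: alt, ref alleles
  nrow = 2, byrow = TRUE,
  dimnames = list(c("cases", "controls"), c("alt", "ref"))
)

ft <- fisher.test(allele_counts)
ft$p.value   # association p-value for this one variant
ft$estimate  # odds ratio

# Genome-wide, such per-variant p-values must be corrected for multiple
# testing; rare variants with small effects need very large samples,
# which is the power limitation the abstract describes.
```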

