RASflow: An RNA-Seq Analysis Workflow with Snakemake

Mapping Intimacies ◽

10.1101/839191 ◽

2019 ◽

Author(s):

Xiaokang Zhang ◽

Inge Jonassen

Keyword(s):

Gene Expression ◽

Management System ◽

Workflow Management ◽

Model Organisms ◽

Gene Transcript ◽

Rna Seq ◽

Public Data ◽

Wide Range ◽

Analysis Workflow ◽

Programming Skills

Abstract Background: With the cost of DNA sequencing decreasing, increasing amounts of RNA-Seq data are being generated, giving novel insight into gene expression and regulation. Prior to analysis of gene expression, the RNA-Seq data has to be processed through a number of steps resulting in a quantification of expression of each gene / transcript in each of the analyzed samples. A number of workflows are available to help researchers perform these steps on their own data, or on public data to take advantage of novel software or reference data in data re-analysis. However, many of the existing workflows are limited to specific types of studies. We therefore aimed to develop a maximally general workflow, applicable to a wide range of data and analysis approaches, that at the same time supports research on both model and non-model organisms. Furthermore, we aimed to make the workflow usable also for users with limited programming skills. Results: Utilizing the workflow management system Snakemake and the package management system Conda, we have developed a modular, flexible and user-friendly RNA-Seq analysis pipeline: RNA-Seq Analysis Snakemake Workflow (RASflow). Utilizing Snakemake and Conda alleviates challenges with library dependencies and version conflicts and also supports reproducibility. To be applicable to a wide variety of applications, RASflow supports mapping of reads to both genomic and transcriptomic assemblies. RASflow has a broad range of potential users: it can be applied by researchers interested in any organism, and since it requires no programming skills, it can be used by researchers with different backgrounds. RASflow is an open source tool, and source code as well as documentation, tutorials and example data sets can be found on GitHub: https://github.com/zhxiaokang/RASflow. Conclusions: RASflow is a simple and reliable RNA-Seq analysis workflow that covers the full RNA-Seq analysis process.

Zea mays RNA-seq estimated transcript abundances are strongly affected by read mapping bias

BMC Genomics ◽

10.1186/s12864-021-07577-3 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Shuhua Zhan ◽

Cortland Griswold ◽

Lewis Lukens

Keyword(s):

Gene Expression ◽

Zea Mays ◽

Reference Genome ◽

Transcript Abundance ◽

Gene Transcript ◽

Rna Seq ◽

Individual Genome ◽

Abundance Estimates ◽

Mapping Bias ◽

Quantify Gene Expression

Abstract Background: Genetic variation for gene expression is a source of phenotypic variation for natural and agricultural species. The common approach to map and to quantify gene expression from genetically distinct individuals is to assign their RNA-seq reads to a single reference genome. However, RNA-seq reads from alleles dissimilar to this reference genome may fail to map correctly, causing transcript levels to be underestimated. Presently, the extent of this mapping problem is not clear, particularly in highly diverse species. We investigated whether mapping bias occurred and whether chromosomal features were associated with mapping bias. Zea mays presents a model species to assess these questions, given it has genotypically distinct and well-studied genetic lines. Results: In Zea mays, the inbred B73 genome is the standard reference genome and template for RNA-seq read assignments. In the absence of mapping bias, B73 and a second inbred line, Mo17, would each have an approximately equal number of regulatory alleles that increase gene expression. Remarkably, Mo17 had 2–4 times fewer such positively acting alleles than did B73 when RNA-seq reads were aligned to the B73 reference genome. Reciprocally, over one-half of the B73 alleles that increased gene expression were not detected when reads were aligned to the Mo17 genome template. Genes at dissimilar chromosomal ends were strongly affected by mapping bias, and genes at more similar pericentromeric regions were less affected. Biased transcript estimates were higher in untranslated regions and lower in splice junctions. Bias occurred across software and alignment parameters. Conclusions: Mapping bias very strongly affects gene transcript abundance estimates in maize, and bias varies across chromosomal features. Individual genome or transcriptome templates are likely necessary for accurate transcript estimation across genetically variable individuals in maize and other species.

RNA-Seq Data-Mining Allows the Discovery of Two Long Non-Coding RNA Biomarkers of Viral Infection in Humans

International Journal of Molecular Sciences ◽

10.3390/ijms21082748 ◽

2020 ◽

Vol 21 (8) ◽

pp. 2748 ◽

Cited By ~ 1

Author(s):

Ruth Barral-Arca ◽

Alberto Gómez-Carballa ◽

Miriam Cebey-López ◽

María José Currás-Tuala ◽

Sara Pischedda ◽

...

Keyword(s):

Gene Expression ◽

Viral Infections ◽

Umbilical Vein ◽

Cell Types ◽

Dermal Fibroblasts ◽

Learning Approaches ◽

Rna Seq ◽

Wide Range ◽

Healthy Control ◽

Umbilical Vein Endothelial Cells

There is a growing interest in unraveling gene expression mechanisms leading to viral host invasion and infection progression. Current findings reveal that long non-coding RNAs (lncRNAs) are implicated in the regulation of the immune system by influencing gene expression through a wide range of mechanisms. By mining whole-transcriptome shotgun sequencing (RNA-seq) data using machine learning approaches, we detected two lncRNAs (ENSG00000254680 and ENSG00000273149) that are downregulated in a wide range of viral infections and different cell types, including blood mononuclear cells, umbilical vein endothelial cells, and dermal fibroblasts. The efficiency of these two lncRNAs was positively validated in different viral phenotypic scenarios. These two lncRNAs showed a strong downregulation in virus-infected patients when compared to healthy control transcriptomes, indicating that these biomarkers are promising targets for infection diagnosis. To the best of our knowledge, this is the very first study using host lncRNA biomarkers for the diagnosis of human viral infections.

Cumate-Inducible Gene Expression System for Sphingomonads and Other Alphaproteobacteria

Applied and Environmental Microbiology ◽

10.1128/aem.02296-13 ◽

2013 ◽

Vol 79 (21) ◽

pp. 6795-6802 ◽

Cited By ~ 40

Author(s):

Andreas Kaczmarczyk ◽

Julia A. Vorholt ◽

Anne Francez-Charlot

Keyword(s):

Gene Expression ◽

Caulobacter Crescentus ◽

Paracoccus Denitrificans ◽

Expression System ◽

Model Organisms ◽

Inducible Gene Expression ◽

Gene Expression System ◽

Content Type ◽

Wide Range ◽

Inducible Gene

ABSTRACT Tunable promoters represent a pivotal genetic tool for a wide range of applications. Here we present such a system for sphingomonads, a phylogenetically diverse group of bacteria that have gained much interest for their potential in bioremediation and their use in industry and for which no dedicated inducible gene expression system has been described so far. A strong, constitutive synthetic promoter was first identified through a genetic screen and subsequently combined with the repressor and the operator sites of the Pseudomonas putida F1 cym/cmt system. The resulting promoter, termed PQ5, responds rapidly to the inducer cumate and shows a maximal induction ratio of 2 to 3 orders of magnitude in the different sphingomonads tested. Moreover, it was also functional in other Alphaproteobacteria, such as the model organisms Caulobacter crescentus, Paracoccus denitrificans, and Methylobacterium extorquens. In the noninduced state, expression from PQ5 is low enough to allow gene depletion analysis, as demonstrated with the essential gene phyP of Sphingomonas sp. strain Fr1. A set of PQ5-based plasmids has been constructed allowing fusions to affinity tags or fluorescent proteins.

The role of KMT2D and KDM6A in cardiac development: A cross-species analysis in humans, mice, and zebrafish

10.1101/2020.04.03.024646 ◽

2020 ◽

Author(s):

Rwik Sen ◽

Ezra Lencer ◽

Elizabeth A. Geiger ◽

Kenneth L. Jones ◽

Tamim H. Shaikh ◽

...

Keyword(s):

Gene Expression ◽

Heart Development ◽

Target Genes ◽

Cardiac Development ◽

Heart Defects ◽

Neural Crest Cell ◽

Kabuki Syndrome ◽

Rna Seq ◽

Wide Range

Abstract Congenital Heart Defects (CHDs) are the most common form of birth defects, observed in 4-10/1000 live births. CHDs result in a wide range of structural and functional abnormalities of the heart which significantly affect quality of life and mortality. CHDs are often seen in patients with mutations in epigenetic regulators of gene expression, like the genes implicated in Kabuki syndrome – KMT2D and KDM6A, which play important roles in normal heart development and function. Here, we examined the role of two epigenetic histone modifying enzymes, KMT2D and KDM6A, in the expression of genes associated with early heart and neural crest cell (NCC) development. Using CRISPR/Cas9 mediated mutagenesis of kmt2d, kdm6a and kdm6al in zebrafish, we show that cardiac and NCC gene expression is reduced, which corresponds to affected cardiac morphology and reduced heart rates. To translate our results to a human pathophysiological context and compare transcriptomic targets of KMT2D and KDM6A across species, we performed RNA sequencing (seq) of lymphoblastoid cells from Kabuki Syndrome patients carrying mutations in KMT2D and KDM6A. We compared the human RNA-seq datasets with RNA-seq datasets obtained from mouse and zebrafish. Our comparative interspecies analysis revealed common targets of KMT2D and KDM6A, which are shared between species, and these target genes are reduced in expression in the zebrafish mutants. Taken together, our results show that KMT2D and KDM6A regulate common and unique genes across humans, mice, and zebrafish for early cardiac and overall development that can contribute to the understanding of epigenetic dysregulation in CHDs.

SPARTA: Simple Program for Automated reference-based bacterial RNA-seq Transcriptome Analysis

10.1101/021915 ◽

2015 ◽

Author(s):

Benjamin K Johnson ◽

Matthew B Scholz ◽

Tracy K Teal ◽

Robert B Abramovitch

Keyword(s):

Gene Expression ◽

Differential Gene Expression ◽

Quality Analysis ◽

Rna Seq ◽

Sequencing Data ◽

Data Set ◽

Bacterial Rna ◽

Analysis Workflow ◽

Differential Gene ◽

Reference Counting

Summary: SPARTA is a reference-based bacterial RNA-seq analysis workflow application for single-end Illumina reads. SPARTA is turnkey software that simplifies the process of analyzing RNA-seq data sets, making bacterial RNA-seq analysis a routine process that can be undertaken on a personal computer or in the classroom. The easy-to-install, complete workflow processes whole transcriptome shotgun sequencing data files by trimming reads and removing adapters, mapping reads to a reference, counting gene features, calculating differential gene expression, and, importantly, checking for potential batch effects within the data set. SPARTA outputs quality analysis reports, gene feature counts and differential gene expression tables and scatterplots. The workflow is implemented in Python for file management and sequential execution of each analysis step and is available for Mac OS X, Microsoft Windows, and Linux. To promote the use of SPARTA as a teaching platform, a web-based tutorial is available explaining how RNA-seq data are processed and analyzed by the software. Availability and Implementation: Tutorial and workflow can be found at sparta.readthedocs.org. Teaching materials are located at sparta-teaching.readthedocs.org. Source code can be downloaded at www.github.com/abramovitchMSU/, implemented in Python and supported on Mac OS X, Linux, and MS Windows. Contact: Robert B. Abramovitch ([email protected]) Supplemental Information: Supplementary data are available online
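The sequential execution described in the SPARTA summary can be pictured with a minimal Python sketch. This is not SPARTA's code: the step functions, the `run_pipeline` name and the shared `state` dictionary are all hypothetical, and each step only records placeholder output names instead of calling a real trimming, mapping or counting tool.

```python
from typing import Any, Callable, Dict, List

State = Dict[str, Any]

# Hypothetical placeholder steps: each receives the shared state, adds its
# output names, and returns the state. A real step would invoke an external tool.
def trim_reads(state: State) -> State:
    state["trimmed"] = [f"{name}.trimmed" for name in state["samples"]]
    return state

def map_reads(state: State) -> State:
    state["mapped"] = [f"{name}.mapped" for name in state["trimmed"]]
    return state

def count_features(state: State) -> State:
    # Placeholder counts; a real step would tally reads per gene feature.
    state["counts"] = {name: 0 for name in state["mapped"]}
    return state

PIPELINE: List[Callable[[State], State]] = [trim_reads, map_reads, count_features]

def run_pipeline(samples: List[str],
                 steps: List[Callable[[State], State]] = PIPELINE) -> State:
    """Run each step in order, passing the output of one step to the next."""
    state: State = {"samples": list(samples), "completed": []}
    for step in steps:
        state = step(state)
        state["completed"].append(step.__name__)
    return state

if __name__ == "__main__":
    print(run_pipeline(["sample1", "sample2"])["completed"])
```

Keeping the steps in an ordered list makes the execution order explicit and lets a later step read whatever an earlier step stored in the state.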

SPsimSeq: semi-parametric simulation of bulk and single cell RNA sequencing data

10.1101/677740 ◽

2019 ◽

Cited By ~ 1

Author(s):

Alemu Takele Assefa ◽

Jo Vandesompele ◽

Olivier Thas

Keyword(s):

Gene Expression ◽

Single Cell ◽

Rna Sequencing ◽

Empirical Distribution ◽

Supplementary Information ◽

Rna Seq ◽

Sequencing Data ◽

Actual Distribution ◽

Wide Range ◽

Single Cell Rna Sequencing

Summary: SPsimSeq is a semi-parametric simulation method for bulk and single cell RNA sequencing data. It simulates data from a good estimate of the actual distribution of a given real RNA-seq dataset. In contrast to existing approaches that assume a particular data distribution, our method constructs an empirical distribution of gene expression data from a given source RNA-seq experiment to faithfully capture the data characteristics of real data. Importantly, our method can be used to simulate a wide range of scenarios, such as single or multiple biological groups, systematic variations (e.g. confounding batch effects), and different sample sizes. It can also be used to simulate different gene expression units resulting from different library preparation protocols, such as read counts or UMI counts. Availability and implementation: The R package and associated documentation are available from https://github.com/CenterForStatistics-UGent/SPsimSeq. Supplementary information: Supplementary data are available at bioRχiv online.
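To make the idea of simulating from an empirical distribution concrete, here is a deliberately simplified Python sketch. It is not the SPsimSeq method: it swaps in plain per-gene resampling with replacement (a bootstrap), with no density estimation, and it ignores correlation between genes. The function name `simulate_from_empirical` and the input layout are assumptions made for illustration.

```python
import random
from typing import Dict, List

def simulate_from_empirical(source: Dict[str, List[float]],
                            n_samples: int,
                            seed: int = 0) -> Dict[str, List[float]]:
    """Draw n_samples values per gene from that gene's observed values.

    Simplification: each gene is resampled independently and with replacement,
    so simulated values are always values seen in the source data.
    """
    rng = random.Random(seed)  # local generator keeps the run reproducible
    simulated: Dict[str, List[float]] = {}
    for gene, observed in source.items():
        simulated[gene] = [rng.choice(observed) for _ in range(n_samples)]
    return simulated

if __name__ == "__main__":
    # Hypothetical expression values for two genes across three source samples.
    source = {"geneA": [0.0, 5.0, 9.0], "geneB": [2.0, 2.0, 7.0]}
    print(simulate_from_empirical(source, n_samples=5))
```

Because no parametric family is assumed, the simulated values inherit the shape of the source data, which is the property the summary above emphasises; the price of this simplified version is that it can never produce a value outside the observed set.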

Selection of suitable reference genes for qRT-PCR studies during SE initial dedifferentiation in cotton of different SE capability

10.7287/peerj.preprints.27138 ◽

2018 ◽

Author(s):

Cao Ai Ping ◽

Shao Dong Nan ◽

Cui Bai Ming ◽

Zheng Yin Ying ◽

Sun Jie

Keyword(s):

Gene Expression ◽

Reference Genes ◽

Gene Expression Level ◽

Expression Level ◽

Rna Seq ◽

Qrt Pcr ◽

Cotton Species ◽

Wide Range ◽

The Stability ◽

Suitable Reference

Analysis of gene expression level by RNA sequencing (RNA-seq) has a wide range of biological purposes in various species. Real-time fluorescent quantitative PCR (qRT-PCR) is used to evaluate gene expression levels and to validate transcriptomic data, and it depends on stably expressed reference genes for normalization of the gene expression level under specific situations. In this study, 15 candidate genes were selected from transcriptome datasets during somatic embryogenesis (SE) initial dedifferentiation in Gossypium hirsutum L. of different SE capability. To evaluate the stability of those genes, geNorm, NormFinder and BestKeeper were used. The results revealed that ENDO4 and 18srRNA could serve as appropriate reference genes under all conditions. The stability and reliability of the reference genes were further tested through comparison of qRT-PCR results and RNA-seq data, as well as evaluation of the expression profiles of auxin-responsive protein (AUX22) and ethylene-responsive transcription factor (ERF17). In summary, the results of our study indicate the most suitable reference genes for qRT-PCR during three induction stages in four cotton species.
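As a rough illustration of how a stability ranking such as the one above can be computed, the sketch below implements the geNorm stability measure M as it is commonly described: for each candidate gene, the average over all other candidates of the standard deviation of the log2 expression ratios across samples. It is not the geNorm software, and the gene names and expression values are hypothetical. Inputs are assumed to be positive relative quantities on a linear scale, with at least two genes and two samples.

```python
import math
from statistics import stdev
from typing import Dict, List

def genorm_m(expression: Dict[str, List[float]]) -> Dict[str, float]:
    """Return the stability measure M per gene (lower M = more stable).

    expression maps gene name -> relative quantities, one per sample,
    with the same sample order for every gene.
    """
    genes = list(expression)
    m_values: Dict[str, float] = {}
    for gene in genes:
        pairwise_sds = []
        for other in genes:
            if other == gene:
                continue
            # Log2 ratio of the two genes in each sample.
            log_ratios = [math.log2(a / b)
                          for a, b in zip(expression[gene], expression[other])]
            # Variation of that ratio across samples.
            pairwise_sds.append(stdev(log_ratios))
        m_values[gene] = sum(pairwise_sds) / len(pairwise_sds)
    return m_values

if __name__ == "__main__":
    # Hypothetical data: geneA and geneB keep a constant ratio, geneC does not.
    data = {"geneA": [1.0, 2.0, 4.0],
            "geneB": [2.0, 4.0, 8.0],
            "geneC": [1.0, 4.0, 1.0]}
    for gene, m in sorted(genorm_m(data).items(), key=lambda item: item[1]):
        print(gene, round(m, 3))
```

Two genes whose ratio is constant across samples contribute zero to each other's M, so a candidate that varies independently of the rest ends up with the highest M and is ranked least stable.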

Gene expression estimates: Influence of sequencing library construction, fish sampling methods, and tissue harvesting time

10.22541/au.161142130.04599849/v1 ◽

2021 ◽

Author(s):

Nickolas Moreno ◽

Leif Howard ◽

Scott Relyea ◽

James Dunnigan ◽

Matthew Boyer ◽

...

Keyword(s):

Gene Expression ◽

Sampling Methods ◽

Cutthroat Trout ◽

Rna Degradation ◽

Model Organisms ◽

Rna Seq ◽

Gene Expression Variation ◽

Oncorhynchus Clarkii ◽

Expression Variation ◽

Study Gene Expression

RNA sequencing (RNA-Seq) is becoming a popular method for measuring gene expression in non-model organisms, including wild populations sampled in the field. While RNA-Seq can be used to measure gene expression variation among wild-caught individuals and can yield important biological insights into organismal function, technical variables may also influence gene expression estimates. We examined the influence of multiple technical variables on estimated gene expression in a non-model fish species, the westslope cutthroat trout (Oncorhynchus clarkii lewisi), using two RNA-Seq methods: 3’ RNA-Seq and whole mRNA-Seq. We evaluated the effects of dip netting versus electrofishing, and of harvesting tissue immediately versus 5 minutes after euthanasia on estimated gene expression in blood, gill, muscle, and liver. We found higher RNA degradation in the liver compared to the other tissues. There were fewer expressed genes in blood compared to gill and muscle. We found no difference in gene expression among sampling methods or due to a delay in tissue collection. However, we detected fewer genes with 3’ RNA-Seq than with whole mRNA-Seq and found statistically significant differences in gene expression between 3’ RNA-Seq and whole mRNA-Seq. The magnitude and direction of these differences do not appear to be dependent on gene type or length. Our findings indicate that RNA-Seq is robust to the technical variables related to the field sampling techniques tested here but varies based on the tissue sampled and the RNA-Seq library used. This study advances understanding of the usefulness of RNA-Seq to study gene expression variation in evolution, ecology, and conservation.

FungiExp: A Comprehensive Platform For Exploring Fungal Gene Expression and Alternative Splicing Based On 35,821 RNA-Seq Experiments From 220 Fungi

10.21203/rs.3.rs-618004/v1 ◽

2021 ◽

Author(s):

Jinding Liu ◽

Fei Yin ◽

Kun Lang ◽

Wencai Jie ◽

Suxu Tan ◽

...

Keyword(s):

Gene Expression ◽

Alternative Splicing ◽

Sequence Similarity ◽

Expression Regulation ◽

Rna Seq ◽

Specific Expression ◽

Data Accessibility ◽

Wide Range ◽

Fungal Gene Expression ◽

Fungal Gene

Abstract Background: RNA-seq has become a standard tool in biology and has produced large and diverse transcriptomic datasets for users to explore fungal expression regulation. Fungal alternative splicing, which is attracting increasing attention because of evolutionary adaptations to changing external conditions, has not been thoroughly investigated in previous studies, unlike that of animals and plants. However, the analyses of RNA-seq datasets are made difficult by the heterogeneity of study design and complex bioinformatics approaches. Comprehensive analyses of these published datasets should contribute new insights into fungal expression regulation. Results: We have developed a web-based platform called FungiExp hosting fungal gene expression levels and alternative splicing profiles in 35,821 curated RNA-seq experiments from 220 species. It allows users to perform retrieval via diverse terms and sequence similarity. Moreover, users can customize experimental groups to perform differential and specific expression analyses. The wide range of data visualization is an additional important feature that should help users intuitively understand retrieval and analysis results. Conclusions: With its uniform data processing, easy data accessibility, convenient retrieval, and analysis functions, FungiExp is a valuable resource and tool that allows users to (re)use published RNA-seq datasets. It is accessible at http://bioinfo.njau.edu.cn/fungiExp.

Glutton: large-scale integration of non-model organism transcriptome data for comparative analysis

10.1101/077511 ◽

2016 ◽

Cited By ~ 2

Author(s):

Alan Medlar ◽

Laura Laakso ◽

Andreia Miraldo ◽

Ari Löytynoja

Keyword(s):

Comparative Analysis ◽

Large Scale ◽

De Novo ◽

Sequence Data ◽

Model Organism ◽

Model Organisms ◽

Rna Seq ◽

Reference Species ◽

Wide Range ◽

The Impact

Abstract High-throughput RNA-seq data has become ubiquitous in the study of non-model organisms, but its use in comparative analysis remains a challenge. Without a reference genome for mapping, sequence data has to be de novo assembled, producing large numbers of short, highly redundant contigs. Preparing these assemblies for comparative analyses requires the removal of redundant isoforms, assignment of orthologs and converting fragmented transcripts into gene alignments. In this article we present Glutton, a novel tool to process transcriptome assemblies for downstream evolutionary analyses. Glutton takes as input a set of fragmented, possibly erroneous transcriptome assemblies. Utilising phylogeny-aware alignment and reference data from a closely related species, it reconstructs one transcript per gene, finds orthologous sequences and produces accurate multiple alignments of coding sequences. We present a comprehensive analysis of Glutton’s performance across a wide range of divergence times between study and reference species. We demonstrate the impact choice of assembler has on both the number of alignments and the correctness of ortholog assignment and show substantial improvements over heuristic methods, without sacrificing correctness. Finally, using inference of Darwinian selection as an example of downstream analysis, we show that Glutton-processed RNA-seq data give results comparable to those obtained from full length gene sequences even with distantly related reference species. Glutton is available from http://wasabiapp.org/software/glutton/ and is licensed under the GPLv3.
