GEO2RNAseq: An easy-to-use R pipeline for complete pre-processing of RNA-seq data

Mapping Intimacies ◽

10.1101/771063 ◽

2019 ◽

Cited By ~ 2

Author(s):

Bastian Seelbinder ◽

Thomas Wolf ◽

Steffen Priebe ◽

Sylvie McNamara ◽

Silvia Gerber ◽

...

Keyword(s):

Gene Expression ◽

Single Species ◽

Gene Expression Omnibus ◽

Rna Seq ◽

Sequencing Data ◽

Interacting Species ◽

Link Type ◽

Fastq Format ◽

Standard Tool ◽

Processing Steps

Abstract: In transcriptomics, the study of the total set of RNAs transcribed by the cell, RNA sequencing (RNA-seq) has become the standard tool for analysing gene expression. The primary goal is the detection of genes whose expression changes significantly between two or more conditions, either for a single species or for two or more interacting species at the same time (dual RNA-seq, triple RNA-seq and so forth). The analysis of RNA-seq data can be simplified because many pre-processing steps can be standardised in a pipeline. In this publication we present the “GEO2RNAseq” pipeline for complete, quick and concurrent pre-processing of single, dual, and triple RNA-seq data. It covers all pre-processing steps starting from raw sequencing data to the analysis of differentially expressed genes, including various tables and figures to report intermediate and final results. Raw data may be provided in FASTQ format or can be downloaded automatically from the Gene Expression Omnibus repository. GEO2RNAseq strongly incorporates experimental as well as computational metadata. GEO2RNAseq is implemented in R, is lightweight, easy to install via Conda and easy to use, yet remains flexible through modular programming and many extensions and alternative workflows. GEO2RNAseq is publicly available at https://anaconda.org/xentrics/r-geo2rnaseq and https://bitbucket.org/thomas_wolf/geo2rnaseq/overview, including source code, installation instructions, and comprehensive package documentation.

GREIN: An Interactive Web Platform for Reanalyzing GEO RNA-seq Data

10.1101/326223 ◽

2018 ◽

Cited By ~ 1

Author(s):

Naim Al Mahi ◽

Mehdi Fazel Najafabadi ◽

Marcin Pilarczyk ◽

Michal Kouril ◽

Mario Medvedovic

Keyword(s):

Gene Expression ◽

User Interfaces ◽

Web Application ◽

Statistical Power ◽

Functional Characterization ◽

Gene Expression Omnibus ◽

Rna Seq ◽

Link Type ◽

Front End ◽

User Friendly

Abstract: The vast amount of RNA-seq data deposited in Gene Expression Omnibus (GEO) and Sequence Read Archive (SRA) is still a grossly underutilized resource for biomedical research. To remove technical roadblocks for reusing these data, we have developed a web application, GREIN (GEO RNA-seq Experiments Interactive Navigator), which provides user-friendly interfaces to manipulate and analyze GEO RNA-seq data. GREIN is powered by a back-end computational pipeline for uniform processing of RNA-seq data and a large number (>6,500) of already processed datasets. The front-end user interfaces provide a wealth of user-analytics options including sub-setting and downloading processed data, interactive visualization, statistical power analyses, construction of differential gene expression signatures and their comprehensive functional characterization, and connectivity analysis with LINCS L1000 data. The combination of the massive amount of back-end data and front-end analytics options driven by user-friendly interfaces makes GREIN a unique open-source resource for re-using GEO RNA-seq data. GREIN is accessible at https://shiny.ilincs.org/grein, the source code at https://github.com/uc-bd2k/grein, and the Docker container at https://hub.docker.com/r/ucbd2k/grein.

XenoCP: Cloud-based BAM cleansing tool for RNA and DNA from Xenograft

10.1101/843250 ◽

2019 ◽

Cited By ~ 2

Author(s):

Michael Rusch ◽

Liang Ding ◽

Sasi Arunachalam ◽

Andrew Thrasher ◽

Hongjian Jin ◽

...

Keyword(s):

Gene Expression ◽

Tumor Heterogeneity ◽

Next Generation Sequencing Data ◽

Rna Seq ◽

Sequencing Data ◽

Link Type ◽

Gene Expression Quantification ◽

Expression Quantification ◽

Generation Sequencing ◽

Rna And Dna

Abstract: Summary: Xenografts are important models for cancer research, and the presence of mouse reads in xenograft next generation sequencing data can potentially confound interpretation of experimental results. We present an efficient, cloud-based BAM-to-BAM cleaning tool called XenoCP to remove mouse reads from xenograft BAM files. We show application of XenoCP in obtaining accurate gene expression quantification in RNA-seq and tumor heterogeneity in WGS of xenografts derived from brain and solid tumors. Availability and Implementation: St. Jude Cloud (https://pecan.stjude.cloud/permalink/xenocp) and St. Jude GitHub (https://github.com/stjude/XenoCP).

Snaptron: querying and visualizing splicing across tens of thousands of RNA-seq samples

10.1101/097881 ◽

2017 ◽

Cited By ~ 2

Author(s):

Christopher Wilks ◽

Phani Gaddipati ◽

Abhinav Nellore ◽

Ben Langmead

Keyword(s):

Tissue Specificity ◽

Rna Seq ◽

Sequencing Data ◽

Transcription Start ◽

Link Type ◽

Alternative Transcription ◽

Web App ◽

Inverted Indexing ◽

Splice Junctions ◽

Splicing Patterns

Abstract: As more and larger genomics studies appear, there is a growing need for comprehensive and queryable cross-study summaries. Snaptron is a search engine for summarized RNA sequencing data with a query planner that leverages R-tree, B-tree and inverted indexing strategies to rapidly execute queries over 146 million exon-exon splice junctions from over 70,000 human RNA-seq samples. Queries can be tailored by constraining which junctions and samples to consider. Snaptron can also rank and score junctions according to tissue specificity or other criteria. Further, Snaptron can rank and score samples according to the relative frequency of different splicing patterns. We outline biological questions that can be explored with Snaptron queries, including a study of novel exons in annotated genes, of exonization of repetitive element loci, and of a recently discovered alternative transcription start site for the ALK gene. Web app and documentation are at http://snaptron.cs.jhu.edu. Source code is at https://github.com/ChristopherWilks/snaptron under the MIT license.
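
The combination of a range index over junction coordinates and an inverted index over sample metadata can be illustrated with a minimal Python sketch. This is not Snaptron's implementation: a sorted list with binary search stands in for the R-tree/B-tree, and the junctions, sample IDs, tissues and function names are all hypothetical toy values.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict

# Hypothetical toy data: (chromosome, start, end, {sample_id: read_count}).
JUNCTIONS = [
    ("chr2", 100, 250, {"S1": 12, "S2": 3}),
    ("chr2", 300, 480, {"S2": 7}),
    ("chr2", 900, 1200, {"S1": 1, "S3": 5}),
]
# Hypothetical sample metadata.
SAMPLE_TISSUE = {"S1": "brain", "S2": "liver", "S3": "brain"}


def build_indexes(junctions, sample_tissue):
    """Return (junctions per chromosome sorted by start, inverted index tissue -> samples)."""
    by_chrom = defaultdict(list)
    for chrom, start, end, counts in junctions:
        by_chrom[chrom].append((start, end, counts))
    for rows in by_chrom.values():
        rows.sort(key=lambda row: row[0])

    tissue_to_samples = defaultdict(set)
    for sample, tissue in sample_tissue.items():
        tissue_to_samples[tissue].add(sample)
    return by_chrom, tissue_to_samples


def query(by_chrom, tissue_to_samples, chrom, lo, hi, tissue=None):
    """Junctions fully contained in [lo, hi], optionally limited to one tissue's samples."""
    rows = by_chrom.get(chrom, [])
    starts = [row[0] for row in rows]
    # Binary search narrows the scan to junctions whose start lies in [lo, hi].
    first, last = bisect_left(starts, lo), bisect_right(starts, hi)
    allowed = None if tissue is None else tissue_to_samples.get(tissue, set())

    hits = []
    for start, end, counts in rows[first:last]:
        if end > hi:
            continue
        kept = {s: c for s, c in counts.items() if allowed is None or s in allowed}
        if kept:  # drop junctions with no supporting sample after filtering
            hits.append((start, end, kept))
    return hits


if __name__ == "__main__":
    by_chrom, tissue_to_samples = build_indexes(JUNCTIONS, SAMPLE_TISSUE)
    print(query(by_chrom, tissue_to_samples, "chr2", 50, 500, tissue="liver"))
```

The design point is that the coordinate constraint and the sample constraint are answered by different index structures and then intersected, which is the general idea the abstract describes.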

GXD’s RNA-Seq and Microarray Experiment Search: using curated metadata to reliably find mouse expression studies of interest

Database ◽

10.1093/database/baaa002 ◽

2020 ◽

Vol 2020 ◽

Cited By ~ 1

Author(s):

Constance M Smith ◽

James A Kadin ◽

Richard M Baldarelli ◽

Jonathan S Beal ◽

Olin Blodgett ◽

...

Keyword(s):

Gene Expression ◽

Microarray Experiment ◽

Gene Expression Omnibus ◽

Easy Access ◽

Free Text ◽

Rna Seq ◽

Endogenous Gene ◽

Expression Studies ◽

Text Searching ◽

Study Type

Abstract: The Gene Expression Database (GXD), an extensive community resource of curated expression information for the mouse, has developed an RNA-Seq and Microarray Experiment Search (http://www.informatics.jax.org/gxd/htexp_index). This tool allows users to quickly and reliably find specific experiments in ArrayExpress and the Gene Expression Omnibus (GEO) that study endogenous gene expression in wild-type and mutant mice. Standardized metadata annotations, curated by GXD, allow users to specify the anatomical structure, developmental stage, mutated gene, strain and sex of samples of interest, as well as the study type and key parameters of the experiment. These searches, powered by controlled vocabularies and ontologies, can be combined with free text searching of experiment titles and descriptions. Search result summaries include link-outs to ArrayExpress and GEO, providing easy access to the expression data itself. Links to the PubMed entries for accompanying publications are also included. More information about this tool and GXD can be found at the GXD home page (http://www.informatics.jax.org/expression.shtml). Database URL: http://www.informatics.jax.org/expression.shtml

Impact of therapy on gene expression in high-risk prostate cancer (PCA) treated with neoadjuvant docetaxel and androgen deprivation therapy.

Journal of Clinical Oncology ◽

10.1200/jco.2016.34.2_suppl.8 ◽

2016 ◽

Vol 34 (2_suppl) ◽

pp. 8-8

Author(s):

Himisha Beltran ◽

Alexander Wyatt ◽

Edmund Chedgy ◽

Ladan Fazli ◽

Andrea Sboner ◽

...

Keyword(s):

Gene Expression ◽

High Risk ◽

Hormone Receptors ◽

Dna Analysis ◽

Rna Extraction ◽

Post Treatment ◽

Rna Seq ◽

Sequencing Data ◽

Quantitative Expression ◽

High Risk Prostate Cancer

Abstract 8. Background: Molecular analyses of neoadjuvant post-treatment radical prostatectomy (RP) specimens have been challenging because often only microscopic foci remain at the time of RP, precluding RNA-seq. DNA analysis alone in the absence of expression may be suboptimal in elucidating complex mechanisms of resistance and/or prognostic risk stratification. We therefore set out to develop an assay that could quantify mRNA expression in treated and untreated PCA using formalin fixed paraffin embedded (FFPE) tissues. Methods: We evaluated 40 untreated and post-treatment FFPE specimens as well as patient-matched pre-treated needle biopsies and baseline clinical data from patients enrolled on CALGB 90203: a randomized phase 3 trial comparing neoadjuvant docetaxel and ADT followed by RP vs RP alone for men with high risk localized PCA. High-density tumor areas were selected for RNA extraction (min 50 ng RNA). We used NanoString nCounter to quantify gene expression of a custom panel of 75 genes including AR and androgen regulated, neural/neuroendocrine (NE), EMT, cell cycle, hormone receptors, TMPRSS2-ERG, ARv7 splice variant, and housekeeper genes. mRNA data were integrated with matched whole exome sequencing data. Frozen specimens and RNA-Seq (n = 7) were used for QC and comparative analysis. Results: Quantitative expression using NanoString showed high correlation with RNA-seq of patient-matched frozen tissue (Spearman coefficient 0.9). There was significant upregulation of AR and ARv7 expression following treatment, as well as of a subset of NE and EMT genes; three high chromogranin A outlier cases were identified in the treatment arm. There was an overall higher AR score in treated cases (based on expression of 30 AR signaling genes) compared to untreated, along the spectrum of CRPC. Conclusions: These data support the feasibility of quantifying gene expression in neoadjuvant-treated PCA cases with limited FFPE tissue requirement. Extensive characterization of AR status and NE/EMT genes identifies molecular outliers that can arise post-treatment and provides new insight into the heterogeneity of treatment response and potential early markers of resistance. Clinical trial information: NCT00430183.
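
The platform comparison in the Results is reported as a Spearman coefficient, which is the Pearson correlation of the ranks of the two measurement vectors. A minimal standard-library sketch follows; the expression values are invented for illustration, and the average-rank treatment of ties is an assumption, since the abstract does not describe how the coefficient was computed.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation computed on ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Hypothetical per-gene expression from two platforms for one sample.
    nanostring = [5.1, 7.4, 2.2, 9.8, 6.0]
    rnaseq = [48.0, 95.0, 10.0, 310.0, 40.0]
    print(round(spearman(nanostring, rnaseq), 3))
```

Because only ranks enter the calculation, the coefficient is insensitive to the different measurement scales of the two platforms, which is why it suits this kind of cross-platform check.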

RNA sequencing analysis for profiling activation of cancer-associated molecular pathways.

Journal of Clinical Oncology ◽

10.1200/jco.2019.37.15_suppl.e13032 ◽

2019 ◽

Vol 37 (15_suppl) ◽

pp. e13032-e13032 ◽

Cited By ~ 2

Author(s):

Anton Buzdin ◽

Andrew Garazha ◽

Maxim Sorokin ◽

Alex Glusker ◽

Alexey Aleshin ◽

...

Keyword(s):

Gene Expression ◽

Original Data ◽

Tissue Expression ◽

Molecular Pathways ◽

Sequencing Analysis ◽

Rna Seq ◽

Sequencing Data ◽

Healthy Human ◽

Tissue Samples ◽

Normal Tissues

Abstract e13032. Background: Intracellular molecular pathways (IMPs) control all major events in the living cell. They are considered hotspots in contemporary oncology because knowledge of IMP activation is essential for understanding mechanisms of molecular pathogenesis. Profiling IMPs requires RNA-seq data for tumors and for a collection of reference normal tissues. However, there is currently a shortage of such profiles for normal tissues from healthy human donors, uniformly profiled in a single series of experiments. The largest dataset of normal profiles, GTEx, is only partly accessible through dbGaP. In the TCGA database, normal samples are taken adjacent to surgically removed tumors and may be affected by tumor-linked growth factors, inflammation and altered vascularization. ENCODE datasets derive from autopsies of normal tissues, but they cannot form statistically significant reference groups. Methods: Tissue samples representing 20 organs were taken from healthy human donors killed in road accidents, no later than 36 hours after death; blood samples were taken from healthy volunteers. Gene expression was profiled in RNA-seq experiments using the same reagents, equipment and protocols. Bioinformatic algorithms for IMP analysis were developed and validated using experimental and public gene expression datasets. Results: From original sequencing data we constructed the biggest fully open reference expression database of normal human tissues, including 465 profiles, termed Oncobox Atlas of Normal Tissue Expression (ANTE, original data: GSE120795). We next developed a method termed Oncobox for interrogating activation of IMPs in human cancers. It includes modules for expression data harmonization and comparison and an algorithm for automatic annotation of molecular pathways. The Oncobox system enables accurate scoring of thousands of molecular pathways using RNA-seq data. Oncobox pathway analysis is also applicable to quantitative proteomics and microRNA data in oncology. Conclusions: The Oncobox system can be used for a plethora of applications in cancer research, including finding differentially regulated genes and IMPs, and for discovery of new pathway-related diagnostic and prognostic biomarkers.

GEDI: an R package for integration of transcriptomic data from multiple high-throughput platforms

10.1101/2021.11.11.468093 ◽

2021 ◽

Author(s):

Mathias N Stokholm ◽

Maria B Rabaglino ◽

Haja N Kadarmideen

Keyword(s):

Gene Expression ◽

Data Integration ◽

Principal Component ◽

R Package ◽

Gene Expression Omnibus ◽

Batch Effect ◽

Sequencing Data ◽

Transcriptomic Data ◽

Ncbi Gene Expression Omnibus ◽

Forward Stepwise

Transcriptomic data are often expensive and difficult to generate in large cohorts compared with genomic data. It is therefore often important to integrate multiple transcriptomic datasets, from both microarray and next generation sequencing (NGS) platforms, across similar experiments or clinical trials to improve analytical power and the discovery of novel transcripts and genes. However, transcriptomic data integration presents a few challenges, including re-annotation and batch effect removal. We developed the Gene Expression Data Integration (GEDI) R package to enable transcriptomic data integration by combining already existing R packages. With just four functions, the GEDI R package makes constructing a transcriptomic data integration pipeline straightforward. Together, the functions overcome the complications in transcriptomic data integration by automatically re-annotating the data and removing the batch effect. The removal of the batch effect is verified with Principal Component Analysis, and the data integration is verified using a logistic regression model with forward stepwise feature selection. To demonstrate the functionalities of the GEDI package, we integrated five bovine endometrial transcriptomic datasets from the NCBI Gene Expression Omnibus. The datasets included Affymetrix, Agilent and RNA-sequencing data. Furthermore, we compared the GEDI package to already existing tools and found that GEDI is the only tool that provides a full transcriptomic data integration pipeline including verification of both batch effect removal and data integration.
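
To make "batch effect removal" concrete, the sketch below applies the simplest possible correction, per-batch mean centering of each gene. This is a deliberately simplified stand-in and not the method used by GEDI, whose abstract does not name its correction algorithm; the sample names, batch labels and values are hypothetical, and the example is written in Python rather than R for brevity.

```python
def center_batches(expression, batches):
    """Subtract each gene's per-batch mean from every sample in that batch.

    expression: {sample_id: [value for gene 0, gene 1, ...]}
    batches:    {sample_id: batch_label}
    """
    samples_by_batch = {}
    for sample, batch in batches.items():
        samples_by_batch.setdefault(batch, []).append(sample)

    corrected = {}
    for samples in samples_by_batch.values():
        n_genes = len(expression[samples[0]])
        means = [
            sum(expression[s][g] for s in samples) / len(samples)
            for g in range(n_genes)
        ]
        for s in samples:
            corrected[s] = [expression[s][g] - means[g] for g in range(n_genes)]
    return corrected


if __name__ == "__main__":
    # Hypothetical log-expression values: batch "b" is shifted by +10 on every gene.
    expression = {
        "a1": [1.0, 2.0], "a2": [3.0, 4.0],
        "b1": [11.0, 12.0], "b2": [13.0, 14.0],
    }
    batches = {"a1": "a", "a2": "a", "b1": "b", "b2": "b"}
    print(center_batches(expression, batches))
```

Mean centering also removes any genuine biological difference that is confounded with batch, which is one reason a separate verification step, such as the PCA and classification checks the abstract mentions, matters after correction.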

Impact of Gene Annotation Choice on the Quantification of RNA-Seq Data

10.21203/rs.3.rs-421080/v1 ◽

2021 ◽

Author(s):

David Chisanga ◽

Yang Liao ◽

Wei Shi

Keyword(s):

Gene Expression ◽

Gene Annotation ◽

Expression Data ◽

Refseq Gene ◽

Rna Seq ◽

Sequencing Data ◽

Microarray Expression Data ◽

Sequencing Quality ◽

Gene Expression Quantification ◽

Expression Quantification

Abstract: Background: RNA sequencing is currently the method of choice for genome-wide profiling of gene expression. A popular approach to quantify expression levels of genes from RNA-seq data is to map reads to a reference genome and then count mapped reads to each gene. Gene annotation data, which include chromosomal coordinates of exons for tens of thousands of genes, are required for this quantification process. There are several major sources of gene annotations that can be used for quantification, such as Ensembl and RefSeq databases. However, there is very little understanding of the effect that the choice of annotation has on the accuracy of gene expression quantification in an RNA-seq analysis. Results: In this paper, we present results from our comparison of Ensembl and RefSeq human annotations on their impact on gene expression quantification using a benchmark RNA-seq dataset generated by the SEquencing Quality Control (SEQC) consortium. We show that the use of RefSeq gene annotation models led to better quantification accuracy, based on the correlation with ground truths including expression data from >800 real-time PCR validated genes, known titration ratios of gene expression and microarray expression data. We also found that the recent expansion of the RefSeq annotation has led to a decrease in its annotation accuracy. Finally, we demonstrated that the RNA-seq quantification differences observed between different annotations were not affected by the use of different normalization methods. Conclusion: In conclusion, our study found that the use of the conservative RefSeq gene annotation yields better RNA-seq quantification results than the more comprehensive Ensembl annotation. We also found that, surprisingly, the recent expansion of the RefSeq database, which was primarily driven by the incorporation of sequencing data into the gene annotation process, resulted in a reduction in the accuracy of RNA-seq quantification.
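
The dependence of read counts on the annotation can be seen in a toy Python sketch of the "map, then count reads per gene" step. The read positions, gene names and exon coordinates are invented, and the two annotations are hypothetical stand-ins rather than real RefSeq or Ensembl records. Production counters also handle strand, multi-mapping and overlapping features, all of which this sketch ignores.

```python
def count_reads(read_positions, annotation):
    """Count reads whose mapped position falls inside any exon of each gene.

    read_positions: iterable of integer positions on a single chromosome
    annotation:     {gene_name: [(exon_start, exon_end), ...]}, inclusive coordinates
    """
    counts = {gene: 0 for gene in annotation}
    for pos in read_positions:
        for gene, exons in annotation.items():
            if any(start <= pos <= end for start, end in exons):
                counts[gene] += 1
    return counts


# Hypothetical mapped read positions.
READS = [105, 150, 220, 410, 450, 700]

# Hypothetical "conservative" annotation: fewer exons and genes.
CONSERVATIVE = {"geneA": [(100, 200)], "geneB": [(400, 500)]}

# Hypothetical "comprehensive" annotation: an extra exon and an extra gene.
COMPREHENSIVE = {
    "geneA": [(100, 200), (210, 260)],
    "geneB": [(400, 500)],
    "geneC": [(650, 750)],
}

if __name__ == "__main__":
    print(count_reads(READS, CONSERVATIVE))
    print(count_reads(READS, COMPREHENSIVE))
```

The same reads yield a different count for geneA and an additional quantified gene under the second annotation, which is the kind of discrepancy the study evaluates against external ground truths.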

SPARTA: Simple Program for Automated reference-based bacterial RNA-seq Transcriptome Analysis

10.1101/021915 ◽

2015 ◽

Author(s):

Benjamin K Johnson ◽

Matthew B Scholz ◽

Tracy K Teal ◽

Robert B Abramovitch

Keyword(s):

Gene Expression ◽

Differential Gene Expression ◽

Quality Analysis ◽

Rna Seq ◽

Sequencing Data ◽

Data Set ◽

Bacterial Rna ◽

Analysis Workflow ◽

Differential Gene ◽

Reference Counting

Summary: SPARTA is a reference-based bacterial RNA-seq analysis workflow application for single-end Illumina reads. SPARTA is turnkey software that simplifies the process of analyzing RNA-seq data sets, making bacterial RNA-seq analysis a routine process that can be undertaken on a personal computer or in the classroom. The easy-to-install, complete workflow processes whole transcriptome shotgun sequencing data files by trimming reads and removing adapters, mapping reads to a reference, counting gene features, calculating differential gene expression, and, importantly, checking for potential batch effects within the data set. SPARTA outputs quality analysis reports, gene feature counts and differential gene expression tables and scatterplots. The workflow is implemented in Python for file management and sequential execution of each analysis step and is available for Mac OS X, Microsoft Windows, and Linux. To promote the use of SPARTA as a teaching platform, a web-based tutorial is available explaining how RNA-seq data are processed and analyzed by the software. Availability and Implementation: Tutorial and workflow can be found at sparta.readthedocs.org. Teaching materials are located at sparta-teaching.readthedocs.org. Source code can be downloaded at www.github.com/abramovitchMSU/, implemented in Python and supported on Mac OS X, Linux, and MS Windows. Contact: Robert B. Abramovitch ([email protected]) Supplemental Information: Supplementary data are available online

RNA sequencing data: hitchhiker's guide to expression analysis

10.7287/peerj.preprints.27283 ◽

2018 ◽

Author(s):

Koen Van Den Berge ◽

Katharina Hembach ◽

Charlotte Soneson ◽

Simone Tiberi ◽

Lieven Clement ◽

...

Keyword(s):

Gene Expression ◽

Rna Sequencing ◽

Large Scale ◽

Science Studies ◽

Rna Seq ◽

Sequencing Data ◽

Data Types ◽

The Past ◽

Long Read ◽

Statistical Approaches

Gene expression is the fundamental level at which the results of various genetic and regulatory programs are observable. The measurement of transcriptome-wide gene expression has convincingly switched from microarrays to sequencing in a matter of years. RNA sequencing (RNA-seq) provides a quantitative and open system for profiling transcriptional outcomes on a large scale and therefore facilitates a large diversity of applications, including basic science studies as well as agricultural and clinical settings. In the past 10 years or so, much has been learned about the characteristics of RNA-seq datasets as well as the performance of the myriad methods developed. In this review, we give an overall view of the developments in RNA-seq data analysis, including experimental design, with an explicit focus on quantification of gene expression and statistical approaches for differential expression. We also highlight emerging data types, such as single-cell RNA-seq and gene expression profiling using long-read technologies.
