GEO2RNAseq: An easy-to-use R pipeline for complete pre-processing of RNA-seq data

2019 ◽  
Author(s):  
Bastian Seelbinder ◽  
Thomas Wolf ◽  
Steffen Priebe ◽  
Sylvie McNamara ◽  
Silvia Gerber ◽  
...  

ABSTRACT In transcriptomics, the study of the total set of RNAs transcribed by the cell, RNA sequencing (RNA-seq) has become the standard tool for analysing gene expression. The primary goal is the detection of genes whose expression changes significantly between two or more conditions, either for a single species or for two or more interacting species at the same time (dual RNA-seq, triple RNA-seq and so forth). The analysis of RNA-seq can be simplified as many steps of the data pre-processing can be standardised in a pipeline. In this publication we present the “GEO2RNAseq” pipeline for complete, quick and concurrent pre-processing of single, dual, and triple RNA-seq data. It covers all pre-processing steps starting from raw sequencing data to the analysis of differentially expressed genes, including various tables and figures to report intermediate and final results. Raw data may be provided in FASTQ format or can be downloaded automatically from the Gene Expression Omnibus repository. GEO2RNAseq strongly incorporates experimental as well as computational metadata. GEO2RNAseq is implemented in R, is lightweight, easy to install via Conda and easy to use, yet remains flexible through modular programming and many extensions and alternative workflows. GEO2RNAseq is publicly available at https://anaconda.org/xentrics/r-geo2rnaseq and https://bitbucket.org/thomas_wolf/geo2rnaseq/overview, including source code, installation instructions, and comprehensive package documentation.

2018 ◽  
Author(s):  
Naim Al Mahi ◽  
Mehdi Fazel Najafabadi ◽  
Marcin Pilarczyk ◽  
Michal Kouril ◽  
Mario Medvedovic

ABSTRACT The vast amount of RNA-seq data deposited in Gene Expression Omnibus (GEO) and Sequence Read Archive (SRA) is still a grossly underutilized resource for biomedical research. To remove technical roadblocks for reusing these data, we have developed a web-application GREIN (GEO RNA-seq Experiments Interactive Navigator) which provides user-friendly interfaces to manipulate and analyze GEO RNA-seq data. GREIN is powered by the back-end computational pipeline for uniform processing of RNA-seq data and the large number (>6,500) of already processed datasets. The front-end user interfaces provide a wealth of user-analytics options including sub-setting and downloading processed data, interactive visualization, statistical power analyses, construction of differential gene expression signatures and their comprehensive functional characterization, and connectivity analysis with LINCS L1000 data. The combination of the massive amount of back-end data and front-end analytics options driven by user-friendly interfaces makes GREIN a unique open-source resource for re-using GEO RNA-seq data. GREIN is accessible at: https://shiny.ilincs.org/grein, the source code at: https://github.com/uc-bd2k/grein, and the Docker container at: https://hub.docker.com/r/ucbd2k/grein.


2019 ◽  
Author(s):  
Michael Rusch ◽  
Liang Ding ◽  
Sasi Arunachalam ◽  
Andrew Thrasher ◽  
Hongjian Jin ◽  
...  

ABSTRACT Summary: Xenografts are important models for cancer research and the presence of mouse reads in xenograft next generation sequencing data can potentially confound interpretation of experimental results. We present an efficient, cloud-based BAM-to-BAM cleaning tool called XenoCP to remove mouse reads from xenograft BAM files. We show application of XenoCP in obtaining accurate gene expression quantification in RNA-seq and tumor heterogeneity in WGS of xenografts derived from brain and solid tumors. Availability and Implementation: St. Jude Cloud (https://pecan.stjude.cloud/permalink/xenocp) and St. Jude Github (https://github.com/stjude/XenoCP)


2017 ◽  
Author(s):  
Christopher Wilks ◽  
Phani Gaddipati ◽  
Abhinav Nellore ◽  
Ben Langmead

Abstract As more and larger genomics studies appear, there is a growing need for comprehensive and queryable cross-study summaries. Snaptron is a search engine for summarized RNA sequencing data with a query planner that leverages R-tree, B-tree and inverted indexing strategies to rapidly execute queries over 146 million exon-exon splice junctions from over 70,000 human RNA-seq samples. Queries can be tailored by constraining which junctions and samples to consider. Snaptron can also rank and score junctions according to tissue specificity or other criteria. Further, Snaptron can rank and score samples according to the relative frequency of different splicing patterns. We outline biological questions that can be explored with Snaptron queries, including a study of novel exons in annotated genes, of exonization of repetitive element loci, and of a recently discovered alternative transcription start site for the ALK gene. Web app and documentation are at http://snaptron.cs.jhu.edu. Source code is at https://github.com/ChristopherWilks/snaptron under the MIT license.


Database ◽  
2020 ◽  
Vol 2020 ◽  
Author(s):  
Constance M Smith ◽  
James A Kadin ◽  
Richard M Baldarelli ◽  
Jonathan S Beal ◽  
Olin Blodgett ◽  
...  

Abstract The Gene Expression Database (GXD), an extensive community resource of curated expression information for the mouse, has developed an RNA-Seq and Microarray Experiment Search (http://www.informatics.jax.org/gxd/htexp_index). This tool allows users to quickly and reliably find specific experiments in ArrayExpress and the Gene Expression Omnibus (GEO) that study endogenous gene expression in wild-type and mutant mice. Standardized metadata annotations, curated by GXD, allow users to specify the anatomical structure, developmental stage, mutated gene, strain and sex of samples of interest, as well as the study type and key parameters of the experiment. These searches, powered by controlled vocabularies and ontologies, can be combined with free text searching of experiment titles and descriptions. Search result summaries include link-outs to ArrayExpress and GEO, providing easy access to the expression data itself. Links to the PubMed entries for accompanying publications are also included. More information about this tool and GXD can be found at the GXD home page (http://www.informatics.jax.org/expression.shtml). Database URL: http://www.informatics.jax.org/expression.shtml


2016 ◽  
Vol 34 (2_suppl) ◽  
pp. 8-8
Author(s):  
Himisha Beltran ◽  
Alexander Wyatt ◽  
Edmund Chedgy ◽  
Ladan Fazli ◽  
Andrea Sboner ◽  
...  

8 Background: Molecular analysis of neoadjuvant post-treatment radical prostatectomy (RP) specimens has been challenging, as often only microscopic foci remain at the time of RP, precluding RNA-seq. DNA analysis alone in the absence of expression may be suboptimal in elucidating complex mechanisms of resistance and/or prognostic risk stratification. We therefore set out to develop an assay that could quantify mRNA expression in treated and untreated PCA using formalin fixed paraffin embedded (FFPE) tissues. Methods: We evaluated 40 untreated and post-treatment FFPE specimens as well as patient-matched pre-treated needle biopsies and baseline clinical data from patients enrolled on CALGB 90203: a randomized phase 3 trial comparing neoadjuvant docetaxel and ADT followed by RP vs RP alone for men with high risk localized PCA. High-density tumor areas were selected for RNA extraction (min 50ng RNA). We used NanoString nCounter to quantify gene expression of a custom panel of 75 genes including AR and androgen regulated, neural/neuroendocrine (NE), EMT, cell cycle, hormone receptors, TMPRSS-ERG, ARv7 splice variant, and housekeeper genes. mRNA data were integrated with matched whole exome sequencing data. Frozen specimens and RNA-Seq (n = 7) were used for QC and comparative analysis. Results: Quantitative expression using NanoString showed high correlation with RNA-seq of patient-matched frozen tissue (Spearman coefficient 0.9). There was significant upregulation of AR and ARv7 expression following treatment, as well as of a subset of NE and EMT genes; three high chromogranin A outlier cases were identified in the treatment arm. There was an overall higher AR score in treated cases (based on expression of 30 AR signaling genes) compared to untreated, along the spectrum of CRPC. Conclusions: These data support the feasibility of quantifying gene expression in neoadjuvant-treated PCA cases with limited FFPE tissue requirement. Extensive characterization of AR status and NE/EMT genes identifies molecular outliers that can arise post-treatment and provides new insight into the heterogeneity of treatment response and potential early markers of resistance. Clinical trial information: NCT00430183.
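
As an illustration of the platform comparison reported above (a Spearman coefficient between NanoString and RNA-seq measurements), the following is a minimal sketch of how a Spearman rank correlation can be computed. It is not the authors' code; the function names and the toy expression values are hypothetical and chosen only for illustration.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-gene expression values from two platforms (toy data).
platform_a = [12.0, 45.0, 3.0, 88.0, 20.0]
platform_b = [1.1, 4.0, 0.2, 9.5, 2.3]
rho = spearman(platform_a, platform_b)
```

Because Spearman correlation depends only on ranks, it tolerates the different measurement scales of two platforms, which is one plausible reason to prefer it for this kind of comparison.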


2019 ◽  
Vol 37 (15_suppl) ◽  
pp. e13032-e13032 ◽  
Author(s):  
Anton Buzdin ◽  
Andrew Garazha ◽  
Maxim Sorokin ◽  
Alex Glusker ◽  
Alexey Aleshin ◽  
...  

e13032 Background: Intracellular molecular pathways (IMPs) control all major events in the living cell. They are considered hotspots in contemporary oncology because knowledge of IMP activation is essential for understanding mechanisms of molecular pathogenesis in oncology. Profiling IMPs requires RNA-seq data for tumors and for a collection of reference normal tissues. However, there is currently a shortage of such profiles for normal tissues from healthy human donors, uniformly profiled in a single series of experiments. The largest dataset of normal profiles, GTEx, is only partly accessible through dbGaP. In the TCGA database, normal samples are tissues adjacent to surgically removed tumors and may be affected by tumor-linked growth factors, inflammation and altered vascularization. ENCODE datasets come from autopsies of normal tissues, but they cannot form statistically significant reference groups. Methods: Tissue samples representing 20 organs were taken from healthy human donors killed in road accidents, no later than 36 hours after death; blood samples were taken from healthy volunteers. Gene expression was profiled in RNA-seq experiments using the same reagents, equipment and protocols. Bioinformatic algorithms for IMP analysis were developed and validated using experimental and public gene expression datasets. Results: From original sequencing data we constructed the biggest fully open reference expression database of normal human tissues, including 465 profiles, termed Oncobox Atlas of Normal Tissue Expression (ANTE, original data: GSE120795). We next developed a method termed Oncobox for interrogating activation of IMPs in human cancers. It includes modules of expression data harmonization and comparison and an algorithm for automatic annotation of molecular pathways. The Oncobox system enables accurate scoring of thousands of molecular pathways using RNA-seq data. Oncobox pathway analysis is also applicable to quantitative proteomics and microRNA data in oncology. Conclusions: The Oncobox system can be used for a plethora of applications in cancer research including finding differentially regulated genes and IMPs, and for discovery of new pathway-related diagnostic and prognostic biomarkers.


2021 ◽  
Author(s):  
Mathias N Stokholm ◽  
Maria B Rabaglino ◽  
Haja N Kadarmideen

Transcriptomic data is often expensive and difficult to generate in large cohorts in comparison to genomic data. It is therefore often important to integrate multiple transcriptomic datasets, from both microarray and next generation sequencing (NGS) based experiments, across similar experiments or clinical trials to improve analytical power and the discovery of novel transcripts and genes. However, transcriptomic data integration presents a few challenges including re-annotation and batch effect removal. We developed the Gene Expression Data Integration (GEDI) R package to enable transcriptomic data integration by combining already existing R packages. With just four functions, the GEDI R package makes constructing a transcriptomic data integration pipeline straightforward. Together, the functions overcome the complications in transcriptomic data integration by automatically re-annotating the data and removing the batch effect. The removal of the batch effect is verified with Principal Component Analysis and the data integration is verified using a logistic regression model with forward stepwise feature selection. To demonstrate the functionalities of the GEDI package, we integrated five bovine endometrial transcriptomic datasets from the NCBI Gene Expression Omnibus. The datasets included Affymetrix, Agilent and RNA-sequencing data. Furthermore, we compared the GEDI package to already existing tools and found that GEDI is the only tool that provides a full transcriptomic data integration pipeline including verification of both batch effect removal and data integration.


2021 ◽  
Author(s):  
David Chisanga ◽  
Yang Liao ◽  
Wei Shi

Abstract Background: RNA sequencing is currently the method of choice for genome-wide profiling of gene expression. A popular approach to quantify expression levels of genes from RNA-seq data is to map reads to a reference genome and then count mapped reads to each gene. Gene annotation data, which include chromosomal coordinates of exons for tens of thousands of genes, are required for this quantification process. There are several major sources of gene annotations that can be used for quantification, such as Ensembl and RefSeq databases. However, there is very little understanding of the effect that the choice of annotation has on the accuracy of gene expression quantification in an RNA-seq analysis. Results: In this paper, we present results from our comparison of Ensembl and RefSeq human annotations on their impact on gene expression quantification using a benchmark RNA-seq dataset generated by the SEquencing Quality Control (SEQC) consortium. We show that the use of RefSeq gene annotation models led to better quantification accuracy, based on the correlation with ground truths including expression data from >800 real-time PCR validated genes, known titration ratios of gene expression and microarray expression data. We also found that the recent expansion of the RefSeq annotation has led to a decrease in its annotation accuracy. Finally, we demonstrated that the RNA-seq quantification differences observed between different annotations were not affected by the use of different normalization methods. Conclusion: In conclusion, our study found that the use of the conservative RefSeq gene annotation yields better RNA-seq quantification results than the more comprehensive Ensembl annotation. We also found that, surprisingly, the recent expansion of the RefSeq database, which was primarily driven by the incorporation of sequencing data into the gene annotation process, resulted in a reduction in the accuracy of RNA-seq quantification.
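
The count-based quantification described above (map reads, then count mapped reads per gene using exon coordinates) can be sketched as follows. This is a simplified illustration under stated assumptions, not the method or software used in the study: the annotation, read coordinates, and the rule that a read counts only when it overlaps exons of exactly one gene are all hypothetical choices made for this example.

```python
# Hypothetical toy annotation: gene -> exon intervals as (chrom, start, end),
# with 1-based inclusive coordinates.
ANNOTATION = {
    "geneA": [("chr1", 100, 200), ("chr1", 300, 400)],
    "geneB": [("chr1", 500, 600)],
}


def count_reads(reads, annotation):
    """Count mapped reads per gene.

    A read (chrom, start, end) is assigned to a gene when it overlaps at least
    one of that gene's exons. Reads overlapping no gene, or more than one gene,
    are left uncounted (an assumption made for this sketch).
    """
    counts = {gene: 0 for gene in annotation}
    for chrom, start, end in reads:
        hits = {
            gene
            for gene, exons in annotation.items()
            if any(
                c == chrom and start <= exon_end and end >= exon_start
                for c, exon_start, exon_end in exons
            )
        }
        if len(hits) == 1:
            counts[hits.pop()] += 1
    return counts


# Hypothetical mapped reads.
READS = [
    ("chr1", 150, 199),  # inside the first exon of geneA
    ("chr1", 390, 420),  # overlaps the end of the second exon of geneA
    ("chr1", 550, 580),  # inside geneB
    ("chr1", 700, 750),  # intergenic, not counted
    ("chr2", 150, 199),  # other chromosome, not counted
]
counts = count_reads(READS, ANNOTATION)
```

The sketch makes the dependence on annotation visible: changing the exon coordinates in `ANNOTATION` changes which reads are counted, which is the kind of effect the comparison of annotation sources examines.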


2015 ◽  
Author(s):  
Benjamin K Johnson ◽  
Matthew B Scholz ◽  
Tracy K Teal ◽  
Robert B Abramovitch

Summary: SPARTA is a reference-based bacterial RNA-seq analysis workflow application for single-end Illumina reads. SPARTA is turnkey software that simplifies the process of analyzing RNA-seq data sets, making bacterial RNA-seq analysis a routine process that can be undertaken on a personal computer or in the classroom. The easy-to-install, complete workflow processes whole transcriptome shotgun sequencing data files by trimming reads and removing adapters, mapping reads to a reference, counting gene features, calculating differential gene expression, and, importantly, checking for potential batch effects within the data set. SPARTA outputs quality analysis reports, gene feature counts and differential gene expression tables and scatterplots. The workflow is implemented in Python for file management and sequential execution of each analysis step and is available for Mac OS X, Microsoft Windows, and Linux. To promote the use of SPARTA as a teaching platform, a web-based tutorial is available explaining how RNA-seq data are processed and analyzed by the software. Availability and Implementation: Tutorial and workflow can be found at sparta.readthedocs.org. Teaching materials are located at sparta-teaching.readthedocs.org. Source code can be downloaded at www.github.com/abramovitchMSU/, implemented in Python and supported on Mac OS X, Linux, and MS Windows. Contact: Robert B. Abramovitch ([email protected]) Supplemental Information: Supplementary data are available online


2018 ◽  
Author(s):  
Koen Van Den Berge ◽  
Katharina Hembach ◽  
Charlotte Soneson ◽  
Simone Tiberi ◽  
Lieven Clement ◽  
...  

Gene expression is the fundamental level at which the results of various genetic and regulatory programs are observable. The measurement of transcriptome-wide gene expression has convincingly switched from microarrays to sequencing in a matter of years. RNA sequencing (RNA-seq) provides a quantitative and open system for profiling transcriptional outcomes on a large scale and therefore facilitates a large diversity of applications, including basic science studies as well as agricultural and clinical settings. In the past 10 years or so, much has been learned about the characteristics of RNA-seq datasets as well as the performance of the myriad of methods developed. In this review, we give an overall view of the developments in RNA-seq data analysis, including experimental design, with an explicit focus on quantification of gene expression and statistical approaches for differential expression. We also highlight emerging data types, such as single-cell RNA-seq and gene expression profiling using long-read technologies.

