recount: A large-scale resource of analysis-ready RNA-seq expression data

Mapping Intimacies ◽

10.1101/068478 ◽

2016 ◽

Cited By ~ 5

Author(s):

Leonardo Collado-Torres ◽

Abhinav Nellore ◽

Kai Kammers ◽

Shannon E. Ellis ◽

Margaret A. Taub ◽

...

Keyword(s):

Large Scale ◽

Meta Analysis ◽

Differential Expression Analysis ◽

Expression Data ◽

Rna Seq ◽

Base Level ◽

Genomic Annotation ◽

Public Data ◽

Using Data ◽

Splice Junctions

Abstract: recount is a resource of processed and summarized expression data spanning nearly 60,000 human RNA-seq samples from the Sequence Read Archive (SRA). The associated recount Bioconductor package provides a convenient API for querying, downloading, and analyzing the data. Each processed study consists of meta/phenotype data, the expression levels of genes and their underlying exons and splice junctions, and corresponding genomic annotation. We also provide data summarization types for quantifying novel transcribed sequence including base-resolution coverage and potentially unannotated splice junctions. We present workflows illustrating how to use recount to perform differential expression analysis including meta-analysis, annotation-free base-level analysis, and replication of smaller studies using data from larger studies. recount provides a valuable and user-friendly resource of processed RNA-seq datasets to draw additional biological insights from existing public data. The resource is available at https://jhubiostatistics.shinyapps.io/recount/.


Meta-analysis of RNA-seq expression data across species, tissues and studies

Genome Biology ◽

10.1186/s13059-015-0853-4 ◽

2015 ◽

Vol 16 (1) ◽

Cited By ~ 64

Author(s):

Peter H. Sudmant ◽

Maria S. Alexis ◽

Christopher B. Burge

Keyword(s):

Meta Analysis ◽

Expression Data ◽

Rna Seq


Leveraging high-powered RNA-Seq datasets to improve inference of regulatory activity in single-cell RNA-Seq data

10.1101/553040 ◽

2019 ◽

Cited By ~ 1

Author(s):

Ning Wang ◽

Andrew E. Teschendorff

Keyword(s):

Transcription Factors ◽

Single Cell ◽

Cell Fate ◽

Regulatory Networks ◽

Large Scale ◽

Single Cells ◽

Differential Expression Analysis ◽

Dropout Rate ◽

Rna Seq ◽

Regulatory Activity

Abstract: Inferring the activity of transcription factors in single cells is a key task to improve our understanding of development and complex genetic diseases. This task is, however, challenging due to the relatively large dropout rate and noisy nature of single-cell RNA-Seq data. Here we present a novel statistical inference framework called SCIRA (Single Cell Inference of Regulatory Activity), which leverages the power of large-scale bulk RNA-Seq datasets to infer high-quality tissue-specific regulatory networks, from which regulatory activity estimates in single cells can be subsequently obtained. We show that SCIRA can correctly infer regulatory activity of transcription factors affected by high technical dropouts. In particular, SCIRA can improve sensitivity by as much as 70% compared to differential expression analysis and current state-of-the-art methods. Importantly, SCIRA can reveal novel regulators of cell-fate in tissue-development, even for cell-types that only make up 5% of the tissue, and can identify key novel tumor suppressor genes in cancer at single cell resolution. In summary, SCIRA will be an invaluable tool for single-cell studies aiming to accurately map activity patterns of key transcription factors during development, and how these are altered in disease.


Genetic Variant rs755622 Regulates Expression of the Multiple Sclerosis Severity Modifier D-Dopachrome Tautomerase in a Sex-Specific Way

BioMed Research International ◽

10.1155/2018/8285653 ◽

2018 ◽

Vol 2018 ◽

pp. 1-7 ◽

Cited By ~ 6

Author(s):

Zhijie Han ◽

Jiaojiao Qu ◽

Jiehong Zhao ◽

Xiao Zou

Keyword(s):

Multiple Sclerosis ◽

Macrophage Migration Inhibitory Factor ◽

Large Scale ◽

Genetic Variant ◽

Differential Expression Analysis ◽

Promoter Sequence ◽

Minor Allele ◽

Rna Seq ◽

Dopachrome Tautomerase ◽

Allele Variant

Multiple sclerosis (MS) is a sex-specific autoimmune disease involving the central nervous system. Previous studies determined that macrophage migration inhibitory factor (MIF) and its homologue D-dopachrome tautomerase (DDT) sex-specifically affect MS progression. Moreover, other studies reported that the rs755622 polymorphism in the promoter region of the MIF gene is associated with risk of MS and affects promoter activity to regulate MIF expression in a sex-specific way. Given that MIF and DDT share a part of their promoter sequence, we surmise that rs755622 can also regulate DDT expression in a sex-specific way. However, this has not yet been studied. Here, we used five large-scale expression quantitative trait loci (eQTLs) and two RNA-seq datasets from brain and blood to assess the potential influence of the rs755622 variant on expression of DDT in different genders by linear regression and differential expression analysis. The results show that the minor allele frequency of rs755622 and expression of DDT are significantly increased in males for MS subjects and that this minor allele variant can significantly upregulate DDT expression for males but not females, which suggests that the regulation of DDT expression level by rs755622 can affect MS progression in males. These findings further support and expand conclusions of previous studies and may help to better understand the mechanisms of MS.


ExAtlas: An interactive online tool for meta-analysis of gene expression data

Journal of Bioinformatics and Computational Biology ◽

10.1142/s0219720015500195 ◽

2015 ◽

Vol 13 (06) ◽

pp. 1550019 ◽

Cited By ~ 37

Author(s):

Alexei A. Sharov ◽

David Schlessinger ◽

Minoru S. H. Ko

Keyword(s):

Gene Expression ◽

Gene Ontology ◽

Gene Expression Data ◽

Fixed Effects ◽

Expression Profiles ◽

Meta Analysis ◽

Data Sets ◽

Expression Data ◽

Gene Set ◽

Public Data

We have developed ExAtlas, an on-line software tool for meta-analysis and visualization of gene expression data. In contrast to existing software tools, ExAtlas compares multi-component data sets and generates results for all combinations (e.g. all gene expression profiles versus all Gene Ontology annotations). ExAtlas handles both users’ own data and data extracted semi-automatically from the public repository (GEO/NCBI database). ExAtlas provides a variety of tools for meta-analyses: (1) standard meta-analysis (fixed effects, random effects, z-score, and Fisher’s methods); (2) analyses of global correlations between gene expression data sets; (3) gene set enrichment; (4) gene set overlap; (5) gene association by expression profile; (6) gene specificity; and (7) statistical analysis (ANOVA, pairwise comparison, and PCA). ExAtlas produces graphical outputs, including heatmaps, scatter-plots, bar-charts, and three-dimensional images. Some of the most widely used public data sets (e.g. GNF/BioGPS, Gene Ontology, KEGG, GAD phenotypes, BrainScan, ENCODE ChIP-seq, and protein–protein interaction) are pre-loaded and can be used for functional annotations.
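
Two of the standard meta-analysis methods named above, the fixed-effects model and Fisher's method, can be illustrated with a minimal Python sketch. This is not ExAtlas code: the function names and the input numbers are hypothetical, and the sketch assumes independent studies with p-values in (0, 1].

```python
import math

def fixed_effects(effects, std_errors):
    """Inverse-variance weighted (fixed-effects) pooled estimate and its standard error."""
    weights = [1.0 / se ** 2 for se in std_errors]
    total = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, effects)) / total
    return pooled, math.sqrt(1.0 / total)

def fisher_combined_p(p_values):
    """Fisher's method: X = -2 * sum(ln p) is chi-square with 2k df under the null."""
    k = len(p_values)
    half_x = -sum(math.log(p) for p in p_values)  # equals X / 2
    # Closed-form chi-square survival function, valid for even degrees of freedom (2k):
    # P(X > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term, tail = 1.0, 0.0
    for i in range(k):
        tail += term
        term *= half_x / (i + 1)
    return math.exp(-half_x) * tail

# Hypothetical per-study log fold changes and standard errors for one gene.
pooled, pooled_se = fixed_effects([1.0, 3.0], [1.0, 1.0])
# Hypothetical per-study p-values for the same gene.
combined_p = fisher_combined_p([0.05, 0.05])
```

The fixed-effects estimate weights each study by the inverse of its variance, so more precise studies contribute more. Fisher's method ignores effect direction and combines only the p-values.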


Improved gene co-expression network quality through expression dataset down-sampling and network aggregation

Scientific Reports ◽

10.1038/s41598-019-50885-8 ◽

2019 ◽

Vol 9 (1) ◽

Cited By ~ 3

Author(s):

Franziska Liesecke ◽

Johan-Owen De Craene ◽

Sébastien Besseau ◽

Vincent Courdavault ◽

Marc Clastre ◽

...

Keyword(s):

Large Scale ◽

Expression Profiles ◽

Expression Data ◽

Rna Seq ◽

Network Construction ◽

The Past ◽

Wide Range ◽

Gene Associations ◽

New Gene ◽

Network Quality

Abstract Large-scale gene co-expression networks are an effective methodology to analyze sets of co-expressed genes and discover new gene functions or associations. Distances between genes are estimated according to their expression profiles and are visualized in networks that may be further partitioned to reveal communities of co-expressed genes. Creating expression profiles is now eased by the large amounts of publicly available expression data (microarrays and RNA-seq). Although many distance calculation methods have been intensively compared and reviewed in the past, it is unclear how to proceed when many samples reflecting a wide range of different conditions are available. Should as many samples as possible be integrated into network construction or be partitioned into smaller sets of more related samples? Previous studies have indicated a saturation in network performances to capture known associations once a certain number of samples is included in distance calculations. Here, we examined the influence of sample size on co-expression network construction using microarray and RNA-seq expression data from three plant species. We tested different down-sampling methods and compared network performances in recovering known gene associations to networks obtained from full datasets. We further examined how aggregating networks may help increase this performance by testing six aggregation methods.


recount-brain: a curated repository of human brain RNA-seq datasets metadata

10.1101/618025 ◽

2019 ◽

Author(s):

Ashkaun Razmara ◽

Shannon E. Ellis ◽

Dustin J. Sokolowski ◽

Sean Davis ◽

Michael D. Wilson ◽

...

Keyword(s):

Gene Expression ◽

Human Brain ◽

Gene Expression Data ◽

Differential Expression Analysis ◽

Controlled Vocabulary ◽

The Cancer Genome Atlas ◽

Tissue Type ◽

Expression Data ◽

Rna Seq ◽

Human Brain Tissue

Abstract: The usability of publicly-available gene expression data is often limited by the availability of high-quality, standardized biological phenotype and experimental condition information (“metadata”). We released the recount2 project, which involved re-processing ∼70,000 samples in the Sequence Read Archive (SRA), Genotype-Tissue Expression (GTEx), and The Cancer Genome Atlas (TCGA) projects. While samples from the latter two projects are well-characterized with extensive metadata, the ∼50,000 RNA-seq samples from SRA in recount2 are inconsistently annotated with metadata. Tissue type, sex, and library type can be estimated from the RNA sequencing (RNA-seq) data itself. However, more detailed and harder to predict metadata, like age and diagnosis, must ideally be provided by labs that deposit the data. To facilitate more analyses within human brain tissue data, we have complemented phenotype predictions by manually constructing a uniformly-curated database of public RNA-seq samples present in SRA and recount2. We describe the reproducible curation process for constructing recount-brain that involves systematic review of the primary manuscript, which can serve as a guide to annotate other studies and tissues. We further expanded recount-brain by merging it with GTEx and TCGA brain samples as well as linking to controlled vocabulary terms for tissue, Brodmann area and disease. Furthermore, we illustrate how to integrate the sample metadata in recount-brain with the gene expression data in recount2 to perform differential expression analysis. We then provide three analysis examples involving modeling postmortem interval, glioblastoma, and meta-analyses across GTEx and TCGA. Overall, recount-brain facilitates expression analyses and improves their reproducibility as individual researchers do not have to manually curate the sample metadata.
recount-brain is available via the add_metadata() function from the recount Bioconductor package at bioconductor.org/packages/recount.


Genes highly overexpressed in salt-stressed Young oil palm (Elaeis guineensis) plants

Revista Brasileira de Engenharia Agrícola e Ambiental ◽

10.1590/1807-1929/agriambi.v25n12p813-818 ◽

2021 ◽

Vol 25 (12) ◽

pp. 813-818

Author(s):

Thalita M. M. Ferreira ◽

André P. Leão ◽

Carlos A. F. de Sousa ◽

Manoel T. Souza Júnior

Keyword(s):

Salt Stress ◽

Oil Palm ◽

Large Scale ◽

Elaeis Guineensis ◽

Differential Expression Analysis ◽

Gene Promoter ◽

Rna Seq ◽

Over Expression ◽

Sequencing Platforms

ABSTRACT RNA-seq is a technique based on the large-scale sequencing of transcript-derived cDNAs using next-generation sequencing platforms mostly used today to characterize an organism’s transcriptome. The analysis of RNA-seq data allows for identifying genes differentially expressed in a given condition, such as salt stress. This study aimed to search and characterize genes from the African oil palm (Elaeis guineensis Jacq.) highly up-regulated during salt stress, with a long-term goal of gene promoter prospection and validation. The apical leaves from the control (electrical conductivity of ~2 dS m-1) and salt-stressed (~40 dS m-1) young oil palm plants, collected at 5 and 12 days after the beginning of the stress, were subjected to extraction of total RNA, with three plants (replicates) per treatment. The complete genome of E. guineensis, available at the National Center for Biotechnology Information, was used as the reference genome - BioProject PRJNA192219. The differential expression analysis led to the selection for further characterization of seven genes, which had increased expressions of 37-84 times under salt stress. The strategy used in this study enabled the selection of seven salt-responsive genes highly up-regulated during salt stress, and some of them coded for proteins already reported as responsible for salinity tolerance in other plant species through over-expression or knockout.


A comprehensive RNA-Seq pipeline includes meta-analysis, interactivity and automatic reporting

10.7287/peerj.preprints.27317v1 ◽

2018 ◽

Author(s):

Giulio Spinozzi ◽

Valentina Tini ◽

Laura Mincarelli ◽

Brunangelo Falini ◽

Maria Paola Martelli

Keyword(s):

Gene Ontology ◽

Acute Myeloid Leukemia ◽

Myeloid Leukemia ◽

Meta Analysis ◽

Differential Expression Analysis ◽

Differential Analysis ◽

Rna Seq ◽

Shiny App ◽

Automated Pipeline ◽

Acute Myeloid

There are many methods available for each phase of RNA-Seq analysis, and each of them uses different algorithms. It is therefore useful to identify a pipeline that combines the best tools in terms of time and results. For this purpose, we compared five different pipelines, obtained by combining the most used tools in RNA-Seq analysis. Using RNA-Seq data on samples of different Acute Myeloid Leukemia (AML) cell lines, we compared five pipelines from the alignment to the differential expression analysis (DEA). For each one we evaluated the peak RAM usage and run time and then compared the differentially expressed genes identified by each pipeline. It emerged that the pipeline with the shortest times, lowest RAM consumption and most reliable results is the one that uses HISAT2 for alignment, featureCounts for quantification and edgeR for differential analysis. Finally, we developed an automated pipeline that uses this pipeline by default but also allows choosing between different tools. In addition, the pipeline performs a final meta-analysis that includes a Gene Ontology and Pathway analysis. The results can be viewed in an interactive Shiny App and exported in a report (pdf, word or html formats).


Addressing the looming identity crisis in single cell RNA-seq

10.1101/150524 ◽

2017 ◽

Cited By ~ 3

Author(s):

Megan Crow ◽

Anirban Paul ◽

Sara Ballouz ◽

Z. Josh Huang ◽

Jesse Gillis

Keyword(s):

Single Cell ◽

Large Scale ◽

Ad Hoc ◽

Meta Analysis ◽

High Specificity ◽

Cell Types ◽

Marker Genes ◽

Rna Seq ◽

Cortical Interneuron ◽

Large Sets

Abstract: Single cell RNA-sequencing technology (scRNA-seq) provides a new avenue to discover and characterize cell types, but the experiment-specific technical biases and analytic variability inherent to current pipelines may undermine the replicability of these studies. Meta-analysis of rapidly accumulating data is further hampered by the use of ad hoc naming conventions. Here we demonstrate our replication framework, MetaNeighbor, that allows researchers to quantify the degree to which cell types replicate across datasets, and to rapidly identify clusters with high similarity for further testing. We first measure the replicability of neuronal identity by comparing more than 13 thousand individual scRNA-seq transcriptomes, sampling with high specificity from within the data to define a range of robust practices. We then assess cross-dataset evidence for novel cortical interneuron subtypes identified by scRNA-seq and find that 24/45 cortical interneuron subtypes have evidence of replication in at least one other study. Identifying these putative replicates allows us to re-analyze the data for differential expression and provide lists of robust candidate marker genes. Across tasks we find that large sets of variably expressed genes can identify replicable cell types and subtypes with high accuracy, suggesting a general route forward for large-scale evaluation of scRNA-seq data.


Comparing alternative pipelines for cross-platform microarray gene expression data integration with RNA-seq data in breast cancer

10.1101/059600 ◽

2016 ◽

Cited By ~ 2

Author(s):

Alina Frolova ◽

Vladyslav Bondarenko ◽

Maria Obolenska

Keyword(s):

Breast Cancer ◽

Gene Expression ◽

Data Integration ◽

Gene Expression Data ◽

Statistical Power ◽

Meta Analysis ◽

Expression Data ◽

Rna Seq ◽

Microarray Gene Expression ◽

Cross Platform

Abstract: Background: According to statistics from major public repositories, an overwhelming majority of the existing and newly uploaded data originates from microarray experiments. Unfortunately, the potential of this data to bring new insights is limited by the effects of individual study-specific biases due to the small number of biological samples. Increasing sample size by direct microarray data integration increases the statistical power to obtain a more precise estimate of gene expression in a population of individuals, resulting in lower false discovery rates. However, despite numerous recommendations for gene expression data integration, there is a lack of a systematic comparison of different processing approaches aimed to assess microarray platform diversity and ambiguous probeset-to-gene correspondence, leading to a low number of studies applying integration. Results: Here, we investigated five different approaches to microarray data processing in comparison with RNA-seq data on breast cancer samples. We aimed to evaluate different probeset annotations as well as different procedures for choosing between probesets mapped to the same gene. We show that pipeline rankings are mostly preserved across Affymetrix and Illumina platforms. The BrainArray approach, based on updated annotation and a redesigned probeset definition, and choosing the probeset with the maximum average signal across the samples have the best correlation with RNA-seq, while averaging probeset signals as well as scoring the quality of probe sequences mapping to the transcripts of the targeted gene have worse correlation. Finally, randomly selecting a probeset among probesets mapped to the same gene significantly decreases the correlation with RNA-seq. Conclusion: We show that methods which rely on actual probeset signal intensities are advantageous compared to methods considering only biological characteristics of the probe sequences, and that cross-platform integration of datasets improves correlation with the RNA-seq data. We consider the results obtained in this paper contributive to the integrative analysis as a worthwhile alternative to the classical meta-analysis of multiple gene expression datasets.
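
Two of the probeset-selection procedures compared above, keeping the probeset with the maximum average signal and averaging all probesets of a gene, can be sketched in Python as follows. This is an illustrative sketch only, not the authors' pipeline: the function names, probeset identifiers, gene labels and signal values are all hypothetical.

```python
from statistics import mean

def collapse_max_mean(signals, probeset_to_gene):
    """For each gene, keep the probeset whose mean signal across samples is highest."""
    best = {}  # gene -> (probeset, mean signal)
    for probeset, values in signals.items():
        gene = probeset_to_gene[probeset]
        avg = mean(values)
        if gene not in best or avg > best[gene][1]:
            best[gene] = (probeset, avg)
    return {gene: signals[probeset] for gene, (probeset, _) in best.items()}

def collapse_average(signals, probeset_to_gene):
    """Alternative: average all probesets mapped to a gene, sample by sample."""
    grouped = {}
    for probeset, values in signals.items():
        grouped.setdefault(probeset_to_gene[probeset], []).append(values)
    return {gene: [mean(col) for col in zip(*rows)] for gene, rows in grouped.items()}

# Hypothetical toy data: three probesets measured in three samples.
signals = {
    "ps1": [5.0, 6.0, 7.0],
    "ps2": [8.0, 9.0, 10.0],
    "ps3": [3.0, 3.0, 3.0],
}
probeset_to_gene = {"ps1": "GENE_A", "ps2": "GENE_A", "ps3": "GENE_B"}

by_max = collapse_max_mean(signals, probeset_to_gene)   # GENE_A takes the ps2 signals
by_avg = collapse_average(signals, probeset_to_gene)    # GENE_A is the ps1/ps2 average
```

Either gene-level table could then be correlated with matched RNA-seq expression values to compare the two procedures, which is the kind of comparison the abstract describes.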
