recount: A large-scale resource of analysis-ready RNA-seq expression data

2016 ◽  
Author(s):  
Leonardo Collado-Torres ◽  
Abhinav Nellore ◽  
Kai Kammers ◽  
Shannon E. Ellis ◽  
Margaret A. Taub ◽  
...  

Abstract recount is a resource of processed and summarized expression data spanning nearly 60,000 human RNA-seq samples from the Sequence Read Archive (SRA). The associated recount Bioconductor package provides a convenient API for querying, downloading, and analyzing the data. Each processed study consists of meta/phenotype data, the expression levels of genes and their underlying exons and splice junctions, and corresponding genomic annotation. We also provide data summarization types for quantifying novel transcribed sequence including base-resolution coverage and potentially unannotated splice junctions. We present workflows illustrating how to use recount to perform differential expression analysis including meta-analysis, annotation-free base-level analysis, and replication of smaller studies using data from larger studies. recount provides a valuable and user-friendly resource of processed RNA-seq datasets to draw additional biological insights from existing public data. The resource is available at https://jhubiostatistics.shinyapps.io/recount/.

2015 ◽  
Vol 16 (1) ◽  
Author(s):  
Peter H. Sudmant ◽  
Maria S. Alexis ◽  
Christopher B. Burge

2019 ◽  
Author(s):  
Ning Wang ◽  
Andrew E. Teschendorff

Abstract Inferring the activity of transcription factors in single cells is a key task to improve our understanding of development and complex genetic diseases. This task is, however, challenging due to the relatively large dropout rate and noisy nature of single-cell RNA-Seq data. Here we present a novel statistical inference framework called SCIRA (Single Cell Inference of Regulatory Activity), which leverages the power of large-scale bulk RNA-Seq datasets to infer high-quality tissue-specific regulatory networks, from which regulatory activity estimates in single cells can be subsequently obtained. We show that SCIRA can correctly infer regulatory activity of transcription factors affected by high technical dropouts. In particular, SCIRA can improve sensitivity by as much as 70% compared to differential expression analysis and current state-of-the-art methods. Importantly, SCIRA can reveal novel regulators of cell-fate in tissue-development, even for cell-types that only make up 5% of the tissue, and can identify key novel tumor suppressor genes in cancer at single cell resolution. In summary, SCIRA will be an invaluable tool for single-cell studies aiming to accurately map activity patterns of key transcription factors during development, and how these are altered in disease.


2018 ◽  
Vol 2018 ◽  
pp. 1-7 ◽  
Author(s):  
Zhijie Han ◽  
Jiaojiao Qu ◽  
Jiehong Zhao ◽  
Xiao Zou

Multiple sclerosis (MS) is a sex-specific autoimmune disease involving the central nervous system. Previous studies determined that macrophage migration inhibitory factor (MIF) and its homologue D-dopachrome tautomerase (DDT) sex-specifically affect MS progression. Moreover, other studies reported that the rs755622 polymorphism in the promoter region of the MIF gene is associated with the risk of MS and affects promoter activity to regulate MIF expression in a sex-specific way. Given that MIF and DDT share part of their promoter sequence, we surmise that rs755622 can also regulate DDT expression in a sex-specific way. However, this has not yet been studied. Here, we used five large-scale expression quantitative trait locus (eQTL) datasets and two RNA-seq datasets from brain and blood to assess the potential influence of the rs755622 variant on DDT expression in each sex, using linear regression and differential expression analysis. The results show that the minor allele frequency of rs755622 and the expression of DDT are significantly increased in male MS subjects, and that the minor allele significantly upregulates DDT expression in males but not in females, which suggests that the regulation of DDT expression level by rs755622 can affect MS progression in males. These findings further support and expand the conclusions of previous studies and may help to better understand the mechanisms of MS.


2015 ◽  
Vol 13 (06) ◽  
pp. 1550019 ◽  
Author(s):  
Alexei A. Sharov ◽  
David Schlessinger ◽  
Minoru S. H. Ko

We have developed ExAtlas, an on-line software tool for meta-analysis and visualization of gene expression data. In contrast to existing software tools, ExAtlas compares multi-component data sets and generates results for all combinations (e.g. all gene expression profiles versus all Gene Ontology annotations). ExAtlas handles both users’ own data and data extracted semi-automatically from the public repository (GEO/NCBI database). ExAtlas provides a variety of tools for meta-analyses: (1) standard meta-analysis (fixed effects, random effects, z-score, and Fisher’s methods); (2) analyses of global correlations between gene expression data sets; (3) gene set enrichment; (4) gene set overlap; (5) gene association by expression profile; (6) gene specificity; and (7) statistical analysis (ANOVA, pairwise comparison, and PCA). ExAtlas produces graphical outputs, including heatmaps, scatter-plots, bar-charts, and three-dimensional images. Some of the most widely used public data sets (e.g. GNF/BioGPS, Gene Ontology, KEGG, GAD phenotypes, BrainScan, ENCODE ChIP-seq, and protein–protein interaction) are pre-loaded and can be used for functional annotations.
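Two of the standard meta-analysis methods named above, Fisher's method and the z-score method, can be illustrated with a short sketch. This is not ExAtlas code: the function names are hypothetical, the p-values are assumed to be independent and one-sided, and the z-score variant shown is the usual Stouffer formulation with optional weights, which the abstract does not specify.

```python
import math
from statistics import NormalDist


def fisher_combined_p(p_values):
    """Combine independent p-values with Fisher's method.

    Under the null hypothesis, X = -2 * sum(ln p_i) follows a chi-square
    distribution with 2k degrees of freedom (k = number of p-values).
    """
    k = len(p_values)
    if k == 0:
        raise ValueError("need at least one p-value")
    if any(not 0.0 < p <= 1.0 for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    half_x = -sum(math.log(p) for p in p_values)  # this is X / 2
    # Closed-form chi-square survival function for even degrees of freedom:
    # P(X > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_x / i
        total += term
    return math.exp(-half_x) * total


def stouffer_combined_p(p_values, weights=None):
    """Combine independent one-sided p-values with the z-score method."""
    if not p_values:
        raise ValueError("need at least one p-value")
    if any(not 0.0 < p < 1.0 for p in p_values):
        raise ValueError("p-values must lie strictly between 0 and 1")
    if weights is None:
        weights = [1.0] * len(p_values)  # unweighted by default
    norm = NormalDist()
    # Convert each p-value to a z-score, then take the weighted sum,
    # scaled so the combined statistic is standard normal under the null.
    z_scores = [norm.inv_cdf(1.0 - p) for p in p_values]
    combined = sum(w * z for w, z in zip(weights, z_scores))
    combined /= math.sqrt(sum(w * w for w in weights))
    return 1.0 - norm.cdf(combined)
```

Calling either function with a single p-value returns that p-value unchanged, which is a quick sanity check; with several concordant small p-values, both return a smaller combined value.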


2019 ◽  
Vol 9 (1) ◽  
Author(s):  
Franziska Liesecke ◽  
Johan-Owen De Craene ◽  
Sébastien Besseau ◽  
Vincent Courdavault ◽  
Marc Clastre ◽  
...  

Abstract Large-scale gene co-expression networks are an effective methodology to analyze sets of co-expressed genes and discover new gene functions or associations. Distances between genes are estimated according to their expression profiles and are visualized in networks that may be further partitioned to reveal communities of co-expressed genes. Creating expression profiles is now eased by the large amounts of publicly available expression data (microarrays and RNA-seq). Although many distance calculation methods have been intensively compared and reviewed in the past, it is unclear how to proceed when many samples reflecting a wide range of different conditions are available. Should as many samples as possible be integrated into network construction or be partitioned into smaller sets of more related samples? Previous studies have indicated a saturation in network performances to capture known associations once a certain number of samples is included in distance calculations. Here, we examined the influence of sample size on co-expression network construction using microarray and RNA-seq expression data from three plant species. We tested different down-sampling methods and compared network performances in recovering known gene associations to networks obtained from full datasets. We further examined how aggregating networks may help increase this performance by testing six aggregation methods.
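The two ingredients described above, a gene-to-gene distance computed from expression profiles and down-sampling of the available samples, can be sketched as follows. The abstract does not say which distance measure or down-sampling scheme was used, so the Pearson correlation distance and uniform random sampling here are assumptions, and all function names are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant profile: treated as uncorrelated (a choice)
    return cov / math.sqrt(var_x * var_y)


def distance_matrix(profiles):
    """Return {(gene_a, gene_b): 1 - r} for every unordered gene pair.

    `profiles` maps gene name -> list of expression values, with all
    lists in the same sample order.
    """
    genes = sorted(profiles)
    return {
        (g, h): 1.0 - pearson(profiles[g], profiles[h])
        for i, g in enumerate(genes)
        for h in genes[i + 1:]
    }


def downsample(profiles, n_samples, seed=0):
    """Keep a random subset of sample columns, the same for every gene."""
    n_total = len(next(iter(profiles.values())))
    rng = random.Random(seed)
    keep = sorted(rng.sample(range(n_total), n_samples))
    return {gene: [values[i] for i in keep] for gene, values in profiles.items()}
```

Computing `distance_matrix(downsample(profiles, n))` for increasing `n` and comparing the result against the full-data matrix is one simple way to look for the saturation effect mentioned in the abstract.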


2019 ◽  
Author(s):  
Ashkaun Razmara ◽  
Shannon E. Ellis ◽  
Dustin J. Sokolowski ◽  
Sean Davis ◽  
Michael D. Wilson ◽  
...  

Abstract The usability of publicly-available gene expression data is often limited by the availability of high-quality, standardized biological phenotype and experimental condition information (“metadata”). We released the recount2 project, which involved re-processing ∼70,000 samples in the Sequence Read Archive (SRA), Genotype-Tissue Expression (GTEx), and The Cancer Genome Atlas (TCGA) projects. While samples from the latter two projects are well-characterized with extensive metadata, the ∼50,000 RNA-seq samples from SRA in recount2 are inconsistently annotated with metadata. Tissue type, sex, and library type can be estimated from the RNA sequencing (RNA-seq) data itself. However, more detailed and harder to predict metadata, like age and diagnosis, must ideally be provided by labs that deposit the data. To facilitate more analyses within human brain tissue data, we have complemented phenotype predictions by manually constructing a uniformly-curated database of public RNA-seq samples present in SRA and recount2. We describe the reproducible curation process for constructing recount-brain that involves systematic review of the primary manuscript, which can serve as a guide to annotate other studies and tissues. We further expanded recount-brain by merging it with GTEx and TCGA brain samples as well as linking to controlled vocabulary terms for tissue, Brodmann area and disease. Furthermore, we illustrate how to integrate the sample metadata in recount-brain with the gene expression data in recount2 to perform differential expression analysis. We then provide three analysis examples involving modeling postmortem interval, glioblastoma, and meta-analyses across GTEx and TCGA. Overall, recount-brain facilitates expression analyses and improves their reproducibility as individual researchers do not have to manually curate the sample metadata.
recount-brain is available via the add_metadata() function from the recount Bioconductor package at bioconductor.org/packages/recount.


Author(s):  
Thalita M. M. Ferreira ◽  
André P. Leão ◽  
Carlos A. F. de Sousa ◽  
Manoel T. Souza Júnior

ABSTRACT RNA-seq is a technique based on the large-scale sequencing of transcript-derived cDNAs using next-generation sequencing platforms, mostly used today to characterize an organism’s transcriptome. The analysis of RNA-seq data allows for identifying genes differentially expressed in a given condition, such as salt stress. This study aimed to search for and characterize genes from the African oil palm (Elaeis guineensis Jacq.) that are highly up-regulated during salt stress, with a long-term goal of gene promoter prospection and validation. The apical leaves from control (electrical conductivity of ~2 dS m⁻¹) and salt-stressed (~40 dS m⁻¹) young oil palm plants, collected at 5 and 12 days after the beginning of the stress, were subjected to extraction of total RNA, with three plants (replicates) per treatment. The complete genome of E. guineensis, available at the National Center for Biotechnology Information (BioProject PRJNA192219), was used as the reference genome. The differential expression analysis led to the selection, for further characterization, of seven genes whose expression increased 37- to 84-fold under salt stress. The strategy used in this study enabled the selection of seven salt-responsive genes highly up-regulated during salt stress, some of which code for proteins already reported as responsible for salinity tolerance in other plant species through over-expression or knockout.


2018 ◽  
Author(s):  
Giulio Spinozzi ◽  
Valentina Tini ◽  
Laura Mincarelli ◽  
Brunangelo Falini ◽  
Maria Paola Martelli

There are many methods available for each phase of RNA-Seq analysis, and each uses different algorithms. It is therefore useful to identify a pipeline that combines the best tools in terms of time and results. For this purpose, we compared five different pipelines, obtained by combining the most widely used tools in RNA-Seq analysis. Using RNA-Seq data from samples of different Acute Myeloid Leukemia (AML) cell lines, we compared the five pipelines from alignment through differential expression analysis (DEA). For each one we measured peak RAM usage and running time, and then compared the differentially expressed genes identified by each pipeline. The pipeline with the shortest running time, lowest RAM consumption, and most reliable results uses HISAT2 for alignment, featureCounts for quantification, and edgeR for differential analysis. Finally, we developed an automated pipeline that defaults to this combination but also allows users to choose between different tools. In addition, the pipeline performs a final meta-analysis that includes Gene Ontology and pathway analysis. The results can be viewed in an interactive Shiny app and exported as a report (PDF, Word, or HTML formats).


2017 ◽  
Author(s):  
Megan Crow ◽  
Anirban Paul ◽  
Sara Ballouz ◽  
Z. Josh Huang ◽  
Jesse Gillis

Abstract Single cell RNA-sequencing technology (scRNA-seq) provides a new avenue to discover and characterize cell types, but the experiment-specific technical biases and analytic variability inherent to current pipelines may undermine the replicability of these studies. Meta-analysis of rapidly accumulating data is further hampered by the use of ad hoc naming conventions. Here we demonstrate our replication framework, MetaNeighbor, that allows researchers to quantify the degree to which cell types replicate across datasets, and to rapidly identify clusters with high similarity for further testing. We first measure the replicability of neuronal identity by comparing more than 13 thousand individual scRNA-seq transcriptomes, sampling with high specificity from within the data to define a range of robust practices. We then assess cross-dataset evidence for novel cortical interneuron subtypes identified by scRNA-seq and find that 24/45 cortical interneuron subtypes have evidence of replication in at least one other study. Identifying these putative replicates allows us to re-analyze the data for differential expression and provide lists of robust candidate marker genes. Across tasks we find that large sets of variably expressed genes can identify replicable cell types and subtypes with high accuracy, suggesting a general route forward for large-scale evaluation of scRNA-seq data.


2016 ◽  
Author(s):  
Alina Frolova ◽  
Vladyslav Bondarenko ◽  
Maria Obolenska

Abstract Background: According to statistics from major public repositories, an overwhelming majority of existing and newly uploaded data originates from microarray experiments. Unfortunately, the potential of this data to bring new insights is limited by the effects of individual study-specific biases due to the small number of biological samples. Increasing sample size by direct microarray data integration increases the statistical power to obtain a more precise estimate of gene expression in a population of individuals, resulting in lower false discovery rates. However, despite numerous recommendations for gene expression data integration, there is a lack of systematic comparison of different processing approaches aimed at assessing the diversity of microarray platforms and the ambiguous correspondence of probesets to genes, leading to a low number of studies applying integration.
Results: Here, we investigated five different approaches to microarray data processing in comparison with RNA-seq data on breast cancer samples. We aimed to evaluate different probeset annotations as well as different procedures for choosing between probesets mapped to the same gene. We show that pipeline rankings are mostly preserved across Affymetrix and Illumina platforms. The BrainArray approach, based on updated annotation and redesigned probeset definitions, and choosing the probeset with the maximum average signal across the samples have the best correlation with RNA-seq, while averaging probeset signals as well as scoring the quality of probe sequence mapping to the transcripts of the targeted gene have worse correlation. Finally, randomly selecting a probeset among probesets mapped to the same gene significantly decreases the correlation with RNA-seq.
Conclusion: We show that methods that rely on actual probeset signal intensities are advantageous over methods that consider only biological characteristics of the probe sequences, and that cross-platform integration of datasets improves correlation with the RNA-seq data. We consider the results obtained in this paper a contribution to integrative analysis as a worthwhile alternative to the classical meta-analysis of multiple gene expression datasets.

