tmod: an R package for general and multivariate enrichment analysis

Author(s):  
January Weiner 3rd ◽  
Teresa Domaszewska

“Omics” studies generate long lists of genes, proteins, metabolites or other features which can be difficult to decipher. Feature set enrichment analysis utilizing annotated groups/classes of features (such as pathways, gene ontology terms or gene/metabolic modules) can provide a powerful gateway to associate data with phenotypes such as disease process or treatment progression. At the same time, the increasing use of technologies to generate multidimensional omics data sets based on specific cell types or responses to stimuli increases the number and breadth of annotated feature sets available for enrichment analysis, making it easier to draw biologically relevant conclusions. However, existing tools and applications for enrichment analysis are adapted specifically to gene set enrichment and lack functionality for analyzing the rapidly growing amounts of metabolomics and other data. Moreover, such tools often provide only a limited range of statistical methods, rely on permutation tests, lack suitable visualization tools to facilitate result interpretation in complex experimental setups, and lack standalone versions usable in semi-automated workflows. Here, we present tmod, an R package which implements powerful statistical methods for enrichment analysis. tmod includes definitions of widely used feature sets for transcriptomic and metabolomic profiling and also allows use of custom user-provided feature sets. Moreover, it provides novel and intuitive visualization methods which facilitate interpretation of complex data sets. The implemented statistical tests allow the significance of enrichment within sorted feature lists to be calculated without randomization tests and thus are suitable for combining functional analysis with multivariate techniques.
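
The idea of testing enrichment in a sorted feature list without randomization can be illustrated with a minimal sketch. The Python code below is not tmod's API (tmod is an R package), and the function names are hypothetical. It assumes a rank-combination statistic, -2 Σ ln(r_i / N) over the ranks r_i of the set members in a list of N features, which is compared against a chi-square distribution with 2k degrees of freedom for a set with k members in the list. Whether this matches the package's exact computation is an assumption.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square distribution with even df.

    For df = 2k the tail has the closed form
    exp(-x/2) * sum_{j=0}^{k-1} (x/2)^j / j!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    term = 1.0   # (x/2)^j / j! for j = 0
    total = 1.0
    for j in range(1, df // 2):
        term *= half / j
        total += term
    return math.exp(-half) * total


def rank_enrichment(ranked_features, feature_set):
    """Analytic enrichment test on a sorted list (illustrative sketch).

    ranked_features: feature names ordered from most to least interesting.
    feature_set: the annotated set (e.g. a pathway) to test.
    Returns (statistic, p_value), or None if no set member is in the list.
    """
    n = len(ranked_features)
    ranks = [i + 1 for i, f in enumerate(ranked_features) if f in feature_set]
    if not ranks:
        return None
    # Under the null, r/n is roughly uniform, so -2*ln(r/n) is roughly
    # chi-square with 2 degrees of freedom; the sum has 2k degrees of freedom.
    stat = -2.0 * sum(math.log(r / n) for r in ranks)
    return stat, chi2_sf_even_df(stat, 2 * len(ranks))


if __name__ == "__main__":
    ranked = [f"g{i}" for i in range(1, 101)]
    print(rank_enrichment(ranked, {"g1", "g2", "g3"}))     # set at the top
    print(rank_enrichment(ranked, {"g98", "g99", "g100"}))  # set at the bottom
```

Because the p-value comes from a closed-form distribution rather than from shuffling labels, the same function can be applied to any ordering of the features, for example one derived from the loadings of a principal component.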


Metabolites ◽  
2020 ◽  
Vol 10 (12) ◽  
pp. 479
Author(s):  
Gayatri R. Iyer ◽  
Janis Wigginton ◽  
William Duren ◽  
Jennifer L. LaBarre ◽  
Marci Brandenburg ◽  
...  

Modern analytical methods allow for the simultaneous detection of hundreds of metabolites, generating increasingly large and complex data sets. The analysis of metabolomics data is a multi-step process that involves data processing and normalization, followed by statistical analysis. One of the biggest challenges in metabolomics is linking alterations in metabolite levels to specific biological processes that are disrupted, contributing to the development of disease or reflecting the disease state. A common approach to accomplishing this goal involves pathway mapping and enrichment analysis, which assesses the relative importance of predefined metabolic pathways or other biological categories. However, traditional knowledge-based enrichment analysis has limitations when it comes to the analysis of metabolomics and lipidomics data. We present a Java-based, user-friendly bioinformatics tool named Filigree that provides a primarily data-driven alternative to the existing knowledge-based enrichment analysis methods. Filigree is based on our previously published differential network enrichment analysis (DNEA) methodology. To demonstrate the utility of the tool, we applied it to previously published studies analyzing the metabolome in the context of metabolic disorders (type 1 and 2 diabetes) and the maternal and infant lipidome during pregnancy.


2009 ◽  
Vol 4 (3) ◽  
pp. 308-309
Author(s):  
Nicole A. Lazar

In their article, Vul, Harris, Winkielman, and Pashler (2009, this issue) raise the issue of nonindependent analysis in behavioral neuroimaging, whereby correlations are artificially inflated as a result of spurious statistical procedures. In this comment, I note that the phenomenon in question is a type of selection bias and hence is neither new nor unique to fMRI. The use of massive, complex data sets (common in modern applications) to answer increasingly intricate scientific questions presents many potential pitfalls to valid statistical analysis. Strong collaboration between statisticians and scientists and the development of statistical methods specific to the types of data encountered in practice can help researchers avoid these pitfalls.


2019 ◽  
Author(s):  
Derek Beaton ◽  
Gilbert Saporta ◽  
Hervé Abdi

Current large-scale studies of brain and behavior typically involve multiple populations and diverse types of data (e.g., genetics, brain structure, behavior, demographics, or “multi-omics” and “deep-phenotyping”) measured on various scales of measurement. To analyze these heterogeneous data sets we need simple but flexible methods able to integrate the inherent properties of these complex data sets. Here we introduce partial least squares-correspondence analysis-regression (PLS-CA-R), a method designed to address these constraints. PLS-CA-R generalizes PLS regression to most data types (e.g., continuous, ordinal, categorical, non-negative values). We also show that PLS-CA-R generalizes many “two-table” multivariate techniques and their respective algorithms, such as various PLS approaches, canonical correlation analysis, and redundancy analysis (a.k.a. reduced rank regression).


2016 ◽  
Author(s):  
David J. Arenillas ◽  
Alistair R.R. Forrest ◽  
Hideya Kawaji ◽  
Timo Lassman ◽  
Wyeth W. Wasserman ◽  
...  

Summary: With the emergence of large-scale Cap Analysis of Gene Expression (CAGE) data sets from individual labs and the FANTOM consortium, one can now analyze the cis-regulatory regions associated with gene transcription at an unprecedented level of refinement. By coupling transcription factor binding site (TFBS) enrichment analysis with CAGE-derived genomic regions, CAGEd-oPOSSUM can identify TFs that act as key regulators of genes involved in specific mammalian cell and tissue types. The webtool allows for the analysis of CAGE-derived transcription start sites (TSSs) either provided by the user or selected from ~1,300 mammalian samples from the FANTOM5 project with pre-computed TFBS predicted with JASPAR TF binding profiles. The tool helps power insights into the regulation of genes through the study of the specific usage of TSSs within specific cell types and/or under specific conditions. Availability and implementation: The CAGEd-oPOSSUM web tool is implemented in Perl, MySQL, and Apache and is available at http://cagedop.cmmt.ubc.ca/CAGEd_oPOSSUM. Supporting Information: Supplementary Text, Figures, and Data are available online at bioRxiv.


2016 ◽  
Author(s):  
Joseph N. Paulson ◽  
Cho-Yi Chen ◽  
Camila M. Lopes-Ramos ◽  
Marieke L Kuijjer ◽  
John Platig ◽  
...  

Although ultrahigh-throughput RNA-Sequencing has become the dominant technology for genome-wide transcriptional profiling, the vast majority of RNA-Seq studies profile only tens of samples, and most analytical pipelines are optimized for these smaller studies. However, projects are generating ever-larger data sets comprising RNA-Seq data from hundreds or thousands of samples, often collected at multiple centers and from diverse tissues. These complex data sets present significant analytical challenges due to batch and tissue effects, but provide the opportunity to revisit the assumptions and methods that we use to preprocess, normalize, and filter RNA-Seq data, which are critical first steps for any subsequent analysis. We find that analysis of large RNA-Seq data sets requires both careful quality control and accounting for sparsity due to the heterogeneity intrinsic in multi-group studies. An R package instantiating our method for large-scale RNA-Seq normalization and preprocessing, YARN, is available at bioconductor.org/packages/yarn. Highlights: Overview of assumptions used in preprocessing and normalization; pipeline for preprocessing, quality control, and normalization of large heterogeneous data; a Bioconductor package for the YARN pipeline and easy manipulation of count data; preprocessed GTEx data set using the YARN pipeline available as a resource.


2020 ◽  
Vol 23 (8) ◽  
pp. 805-813
Author(s):  
Ai Jiang ◽  
Peng Xu ◽  
Zhenda Zhao ◽  
Qizhao Tan ◽  
Shang Sun ◽  
...  

Background: Osteoarthritis (OA) is a joint disease that leads to a high disability rate and a low quality of life. With the development of modern molecular biology techniques, some key genes and diagnostic markers have been reported. However, the etiology and pathogenesis of OA are still unknown. Objective: To develop a gene signature in OA. Method: In this study, five microarray data sets were integrated to conduct a comprehensive network and pathway analysis of the biological functions of OA-related genes, which can provide valuable information for further exploring the etiology and pathogenesis of OA. Results and Discussion: Differential expression analysis identified 180 genes with significantly differential expression in OA. Functional enrichment analysis showed that the up-regulated genes were associated with rheumatoid arthritis (p < 0.01). Down-regulated genes were associated with negative regulation of kinase activity and with signaling pathways such as the MAPK signaling pathway (p < 0.001) and the IL-17 signaling pathway (p < 0.001). In addition, an OA-specific protein-protein interaction (PPI) network was constructed based on the differentially expressed genes. The analysis of network topological attributes showed that the differentially up-regulated genes VEGFA, MYC, ATF3 and JUN were hub genes of the network, which may influence the occurrence and development of OA through regulating cell cycle or apoptosis, and were potential biomarkers of OA. Finally, the support vector machine (SVM) method was used to establish a diagnosis model of OA, which had excellent predictive power in internal and external data sets (AUC > 0.9), high predictive performance across different chip platforms (AUC > 0.9), and good performance in blood samples (AUC > 0.8). Conclusion: The 4-gene diagnostic model may be of great help to the early diagnosis and prediction of OA.
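
The AUC values quoted for the diagnostic model summarize how well classifier scores separate cases from controls. As a minimal sketch of that metric only (the SVM model itself is not reproduced, and the function name and toy data are hypothetical), the AUC can be computed in Python as the fraction of case/control pairs in which the case receives the higher score, with ties counted as half:

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for cases (e.g. OA samples), 0 for controls.
    scores: classifier outputs, higher meaning more case-like.
    """
    pos = [s for lab, s in zip(labels, scores) if lab == 1]
    neg = [s for lab, s in zip(labels, scores) if lab == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # case ranked above control
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy data: two cases and two controls with made-up scores.
    print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is the scale on which the thresholds of 0.8 and 0.9 in the abstract should be read.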


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Gulden Olgun ◽  
Afshan Nabi ◽  
Oznur Tastan

Background: While some non-coding RNAs (ncRNAs) are assigned critical regulatory roles, most remain functionally uncharacterized. This presents a challenge whenever an interesting set of ncRNAs needs to be analyzed in a functional context. Transcripts located close-by on the genome are often regulated together. This genomic proximity on the sequence can hint at a functional association. Results: We present a tool, NoRCE, that performs cis enrichment analysis for a given set of ncRNAs. Enrichment is carried out using the functional annotations of the coding genes located proximal to the input ncRNAs. Other biologically relevant information such as topologically associating domain (TAD) boundaries, co-expression patterns, and miRNA target prediction information can be incorporated to conduct a richer enrichment analysis. To this end, NoRCE includes several relevant datasets as part of its data repository, including cell-line specific TAD boundaries, functional gene sets, and expression data for coding & ncRNAs specific to cancer. Additionally, the users can utilize custom data files in their investigation. Enrichment results can be retrieved in a tabular format or visualized in several different ways. NoRCE is currently available for the following species: human, mouse, rat, zebrafish, fruit fly, worm, and yeast. Conclusions: NoRCE is a platform-independent, user-friendly, comprehensive R package that can be used to gain insight into the functional importance of a list of ncRNAs of any type. The tool offers flexibility to conduct the users’ preferred set of analyses by designing their own pipeline of analysis. NoRCE is available in Bioconductor and https://github.com/guldenolgun/NoRCE.
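
The cis enrichment idea described here (collect the coding genes near the input ncRNAs, then test their annotations for over-representation) can be sketched as follows. This Python code is not NoRCE's API (NoRCE is an R package); the function names, the fixed distance window, and the choice of a hypergeometric over-representation test are assumptions made for illustration.

```python
from math import comb


def neighbouring_genes(ncrnas, genes, window):
    """Return names of coding genes lying within `window` bp of any ncRNA.

    ncrnas: list of (chrom, start, end) tuples.
    genes: dict mapping gene name -> (chrom, start, end).
    """
    hits = set()
    for chrom, start, end in ncrnas:
        for name, (g_chrom, g_start, g_end) in genes.items():
            # Interval overlap after extending the ncRNA by `window` on both sides.
            if g_chrom == chrom and g_start <= end + window and g_end >= start - window:
                hits.add(name)
    return hits


def hypergeom_sf(k, population, successes, draws):
    """P(X >= k) for X ~ Hypergeometric(population, successes, draws)."""
    upper = min(successes, draws)
    favourable = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return favourable / comb(population, draws)


def term_enrichment(neighbours, annotated, background):
    """Over-representation p-value of one annotation term among the neighbours."""
    neighbours = neighbours & background
    overlap = len(neighbours & annotated)
    return hypergeom_sf(overlap, len(background), len(annotated & background), len(neighbours))


if __name__ == "__main__":
    genes = {
        "A": ("chr1", 1000, 2000),
        "B": ("chr1", 50000, 51000),
        "C": ("chr2", 1000, 2000),
    }
    near = neighbouring_genes([("chr1", 2500, 3000)], genes, window=1000)
    print(near)
    print(term_enrichment(near, annotated={"A"}, background=set(genes)))
```

The additional evidence types mentioned in the abstract, such as TAD boundaries or co-expression, would act as further filters on the neighbour set before the enrichment step.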


2021 ◽  
Vol 22 (3) ◽  
pp. 1399
Author(s):  
Salim Ghannoum ◽  
Waldir Leoncio Netto ◽  
Damiano Fantini ◽  
Benjamin Ragan-Kelley ◽  
Amirabbas Parizadeh ◽  
...  

The growing attention toward the benefits of single-cell RNA sequencing (scRNA-seq) is leading to a myriad of computational packages for the analysis of different aspects of scRNA-seq data. For researchers without advanced programming skills, it is very challenging to combine several packages in order to perform the desired analysis in a simple and reproducible way. Here we present DIscBIO, an open-source, multi-algorithmic pipeline for easy, efficient and reproducible analysis of cellular sub-populations at the transcriptomic level. The pipeline integrates multiple scRNA-seq packages and allows biomarker discovery with decision trees and gene enrichment analysis in a network context using single-cell sequencing read counts through clustering and differential analysis. DIscBIO is freely available as an R package. It can be run either in command-line mode or through a user-friendly computational pipeline using Jupyter notebooks. We showcase all pipeline features using two scRNA-seq datasets. The first dataset consists of circulating tumor cells from patients with breast cancer. The second one is a cell cycle regulation dataset in myxoid liposarcoma. All analyses are available as notebooks that integrate R code with explanatory text, output data, and images in a sequential narrative. R users can use the notebooks to understand the different steps of the pipeline, and the notebooks will guide them in exploring their own scRNA-seq data. We also provide a cloud version using Binder that allows the execution of the pipeline without the need to download R, Jupyter or any of the packages used by the pipeline. The cloud version can serve as a tutorial for training purposes, especially for those who are not R users or have limited programming skills. However, in order to do meaningful scRNA-seq analyses, all users will need to understand the implemented methods and their possible options and limitations.

