MOVIE: Multi-Omics Visualization of Estimated contributions

Mapping Intimacies ◽

10.1101/379115 ◽

2018 ◽

Author(s):

Sean D. McCabe ◽

Dan-Yu Lin ◽

Michael I. Love

Keyword(s):

Correlation Analysis ◽

Canonical Correlation Analysis ◽

Canonical Correlation ◽

High Specificity ◽

R Package ◽

Data Type ◽

The Cancer Genome Atlas ◽

Supplementary Information ◽

Data Types ◽

Cancer Data

Abstract Summary The growth of multi-omics datasets has given rise to many methods for identifying sources of common variation across data types. The unsupervised nature of these methods makes it difficult to evaluate their performance. We present MOVIE, Multi-Omics Visualization of Estimated contributions, as a framework for evaluating the degree of overfitting and the stability of unsupervised multi-omics methods. MOVIE plots the contributions of one data type against another to produce contribution plots, where contributions are calculated for each subject and each data type from the results of each multi-omics method. The usefulness of MOVIE is demonstrated by applying existing multi-omics methods to permuted null data and breast cancer data from The Cancer Genome Atlas. Contribution plots indicated that principal components-based Canonical Correlation Analysis overfit null data, while Sparse multiple Canonical Correlation Analysis and Multi-Omics Factor Analysis provided stable results with high specificity for both the real and permuted null datasets. Availability MOVIE is available as an R package at https://github.com/mccabes292/[email protected] Supplementary information Supplementary data are available at Bioinformatics online.


Conditional canonical correlation estimation based on covariates with random forests

Bioinformatics ◽

10.1093/bioinformatics/btab158 ◽

2021 ◽

Author(s):

Cansu Alakuş ◽

Denis Larocque ◽

Sébastien Jacquemont ◽

Fanny Barlaam ◽

Charles-Olivier Martin ◽

...

Keyword(s):

Correlation Analysis ◽

Canonical Correlation Analysis ◽

Canonical Correlation ◽

R Package ◽

Significance Test ◽

Supplementary Information ◽

Canonical Correlations ◽

Correlation Estimation ◽

The Individual ◽

Individual Trees

Abstract Motivation Investigating the relationships between two sets of variables helps to understand their interactions and can be done with canonical correlation analysis (CCA). However, the correlation between the two sets can sometimes depend on a third set of covariates, often subject-related ones such as age, gender or other clinical measures. In this case, applying CCA to the whole population is not optimal and methods to estimate conditional CCA, given the covariates, can be useful. Results We propose a new method called Random Forest with Canonical Correlation Analysis (RFCCA) to estimate the conditional canonical correlations between two sets of variables given subject-related covariates. The individual trees in the forest are built with a splitting rule specifically designed to partition the data to maximize the canonical correlation heterogeneity between child nodes. We also propose a significance test to detect the global effect of the covariates on the relationship between two sets of variables. The performance of the proposed method and the global significance test is evaluated through simulation studies that show it provides accurate canonical correlation estimations and well-controlled Type-1 error. We also show an application of the proposed method with EEG data. Availability and implementation RFCCA is implemented in a freely available R package on CRAN (https://CRAN.R-project.org/package=RFCCA). Supplementary information Supplementary data are available at Bioinformatics online.
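The splitting idea described in this abstract can be illustrated with a deliberately simplified sketch. It assumes each variable set contains a single variable, in which case the canonical correlation reduces to the absolute Pearson correlation, and it scores a candidate split by the absolute difference between the child-node correlations. Both the scoring formula and the names `abs_pearson`, `best_split` and `min_node` are assumptions made for illustration; the abstract does not state the exact criterion, and this is not the RFCCA package API.

```python
import math

def abs_pearson(x, y):
    """Absolute Pearson correlation; equals the canonical correlation
    when each set holds exactly one variable (simplifying assumption)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / math.sqrt(sxx * syy)

def best_split(x, y, z, min_node=3):
    """Scan thresholds on covariate z and keep the one whose child nodes
    differ most in correlation (hypothetical heterogeneity score)."""
    best_threshold, best_score = None, -1.0
    for c in sorted(set(z))[:-1]:
        left = [i for i, zi in enumerate(z) if zi <= c]
        right = [i for i, zi in enumerate(z) if zi > c]
        if len(left) < min_node or len(right) < min_node:
            continue
        rho_left = abs_pearson([x[i] for i in left], [y[i] for i in left])
        rho_right = abs_pearson([x[i] for i in right], [y[i] for i in right])
        score = abs(rho_left - rho_right)
        if score > best_score:
            best_threshold, best_score = c, score
    return best_threshold, best_score

# Toy data: x and y are perfectly correlated for z <= 35 and uncorrelated above.
z = [20, 25, 30, 35, 60, 65, 70, 75]
x = [1, 2, 3, 4, 1, 2, 3, 4]
y = [2, 4, 6, 8, 1, -1, -1, 1]
threshold, score = best_split(x, y, z)
print(threshold, score)
```

On this toy data the scan selects the threshold z = 35, which separates the subjects whose x and y move together from those whose x and y are unrelated. A forest would repeat such splits on bootstrap samples and random covariate subsets.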


Sparse semiparametric canonical correlation analysis for data of mixed types

Biometrika ◽

10.1093/biomet/asaa007 ◽

2020 ◽

Vol 107 (3) ◽

pp. 609-625 ◽

Cited By ~ 1

Author(s):

Grace Yoon ◽

Raymond J Carroll ◽

Irina Gaynanova

Keyword(s):

Correlation Analysis ◽

Canonical Correlation Analysis ◽

Canonical Correlation ◽

Mixed Data ◽

Gaussian Copula ◽

Breast Cancer Patients ◽

Data Types ◽

Semiparametric Approach ◽

Linear Relationships ◽

Correlation Analysis Method

Summary Canonical correlation analysis investigates linear relationships between two sets of variables, but it often works poorly on modern datasets because of high dimensionality and mixed data types such as continuous, binary and zero-inflated. To overcome these challenges, we propose a semiparametric approach to sparse canonical correlation analysis based on the Gaussian copula. The main result of this paper is a truncated latent Gaussian copula model for data with excess zeros, which allows us to derive a rank-based estimator of the latent correlation matrix for mixed variable types without estimation of marginal transformation functions. The resulting canonical correlation analysis method works well in high-dimensional settings, as demonstrated via numerical studies, and when applied to the analysis of association between gene expression and microRNA data from breast cancer patients.
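A rank-based estimate of a latent correlation can be sketched for the simplest case, two continuous variables, where the Gaussian copula gives the well-known relation r = sin(π/2 · τ) between the latent correlation r and Kendall's τ. The sketch below covers only that continuous case; the bridge functions for binary and zero-inflated (truncated) variables described in the abstract are different and are not shown. The function names are illustrative and are not taken from the authors' software.

```python
import math

def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    n = len(x)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            if prod > 0:
                s += 1
            elif prod < 0:
                s -= 1
    return s / (n * (n - 1) / 2)

def latent_correlation(x, y):
    """Sine bridge for two continuous variables under a Gaussian copula."""
    return math.sin(math.pi / 2 * kendall_tau(x, y))

x = [0.1, 0.4, 0.35, 0.8, 1.2, 0.9]
y = [1.0, 1.9, 2.5, 3.1, 4.0, 3.5]

# A monotone transformation of x leaves the ranks, and so the estimate, unchanged.
x_transformed = [math.exp(3 * v) for v in x]
print(latent_correlation(x, y))
print(latent_correlation(x_transformed, y))
```

Because the estimate depends on the data only through ranks, the two printed values agree, which is why the marginal transformation functions never need to be estimated.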


PAcluster: Clustering polyadenylation site data using canonical correlation analysis

Journal of Bioinformatics and Computational Biology ◽

10.1142/s0219720017500184 ◽

2017 ◽

Vol 15 (05) ◽

pp. 1750018 ◽

Cited By ~ 1

Author(s):

Guoli Ji ◽

Qianmin Lin ◽

Yuqi Long ◽

Congting Ye ◽

Wenbin Ye ◽

...

Keyword(s):

Correlation Analysis ◽

Canonical Correlation Analysis ◽

Canonical Correlation ◽

Alternative Polyadenylation ◽

Gene Clusters ◽

R Package ◽

Distance Measures ◽

Specific Gene ◽

Performance Indexes ◽

Biological Dataset

Alternative polyadenylation (APA) is a pervasive mechanism that contributes to gene regulation. The growing number of sequenced poly(A) sites is placing new demands on the development of computational methods to investigate APA regulation. Cluster analysis is important to identify groups of co-expressed genes. However, clustering of poly(A) sites has not been extensively studied in APA, and most APA studies have failed to consider the distribution, abundance, and variation of APA sites in each gene. Here we constructed a two-layer model based on canonical correlation analysis (CCA) to explore the underlying biological mechanisms in APA regulation. The first layer quantifies the general correlation of APA sites across various conditions between each gene and the second layer identifies genes with statistically significant correlation on their APA patterns to infer APA-specific gene clusters. Using hierarchical clustering, we comprehensively compared our method with four other widely used distance measures based on three performance indexes. Results showed that our method significantly enhanced the clustering performance for both synthetic and real poly(A) site data and could generate clusters with more biological meaning. We have implemented the CCA-based method as a publicly available R package called PAcluster, which provides an efficient solution to the clustering of large APA-specific biological datasets.


Simultaneous Analysis of Multiple Data Types in Pharmacogenomic Studies Using Weighted Sparse Canonical Correlation Analysis

OMICS A Journal of Integrative Biology ◽

10.1089/omi.2011.0126 ◽

2012 ◽

Vol 16 (7-8) ◽

pp. 363-373 ◽

Cited By ~ 12

Author(s):

Prabhakar Chalise ◽

Anthony Batzler ◽

Ryan Abo ◽

Liewei Wang ◽

Brooke L. Fridley

Keyword(s):

Correlation Analysis ◽

Canonical Correlation Analysis ◽

Canonical Correlation ◽

Simultaneous Analysis ◽

Data Types ◽

Multiple Data ◽

Sparse Canonical Correlation Analysis ◽

Pharmacogenomic Studies


Multi-Omics Data Fusion for Cancer Molecular Subtyping Using Sparse Canonical Correlation Analysis

Frontiers in Genetics ◽

10.3389/fgene.2021.607817 ◽

2021 ◽

Vol 12 ◽

Author(s):

Lin Qi ◽

Wei Wang ◽

Tan Wu ◽

Lina Zhu ◽

Lingli He ◽

...

Keyword(s):

Ovarian Cancer ◽

Correlation Analysis ◽

Data Fusion ◽

Case Studies ◽

Canonical Correlation Analysis ◽

Canonical Correlation ◽

Omics Data ◽

Data Types ◽

Molecular Subtyping ◽

Sparse Canonical Correlation Analysis

It is now clear that major malignancies are heterogeneous diseases associated with diverse molecular properties and clinical outcomes, posing a great challenge for more individualized therapy. In the last decade, cancer molecular subtyping studies were mostly based on transcriptomic profiles, ignoring heterogeneity at other (epi-)genetic levels of gene regulation. Integrating multiple types of (epi)genomic data generates a more comprehensive landscape of biological processes, providing an opportunity to better dissect cancer heterogeneity. Here, we propose sparse canonical correlation analysis for cancer classification (SCCA-CC), which projects each type of single-omics data onto a unified space for data fusion, followed by clustering and classification analysis. Without loss of generality, as case studies, we integrated two types of omics data, mRNA and miRNA profiles, for molecular classification of ovarian cancer (n = 462) and breast cancer (n = 451). The two types of omics data were projected onto a unified space using SCCA, followed by data fusion to identify cancer subtypes. The subtypes we identified recapitulated subtypes previously recognized by other groups (all P-values < 0.001), but displayed more significant clinical associations. In ovarian cancer in particular, the four subtypes we identified were significantly associated with overall survival, while the taxonomy previously established by TCGA was not (P-values: 0.039 vs. 0.12). The multi-omics classifiers we established can not only classify individual types of data but also demonstrated higher accuracies on the fused data. Compared with iCluster, SCCA-CC demonstrated its superiority by identifying subtypes of higher coherence and clinical relevance, and with greater time efficiency. In conclusion, we developed an integrated bioinformatic framework, SCCA-CC, for cancer molecular subtyping. Using two case studies in breast and ovarian cancer, we demonstrated its effectiveness in identifying biologically meaningful and clinically relevant subtypes. SCCA-CC presented a unique advantage in its ability to classify both single-omics data and multi-omics data, which significantly extends its applicability to various data types and makes more efficient use of published omics resources.


SEEDCCA: An Integrated R-Package for Canonical Correlation Analysis and Partial Least Squares

The R Journal ◽

10.32614/rj-2021-026 ◽

2021 ◽

Vol 13 (1) ◽

pp. 7

Author(s):

Bo-Young Kim ◽

Yunju Im ◽

Jae Keun Yoo

Keyword(s):

Correlation Analysis ◽

Least Squares ◽

Partial Least Squares ◽

Canonical Correlation Analysis ◽

Canonical Correlation ◽

R Package


FlashPCA: fast sparse canonical correlation analysis of genomic data

10.1101/047217 ◽

2016 ◽

Author(s):

Gad Abraham ◽

Michael Inouye

Keyword(s):

Correlation Analysis ◽

Canonical Correlation Analysis ◽

Canonical Correlation ◽

Genomic Data ◽

R Package ◽

Rapid Analysis ◽

Nucleotide Polymorphisms ◽

Single Nucleotide ◽

Sparse Canonical Correlation Analysis ◽

Gene Expression Levels

Summary Sparse canonical correlation analysis (SCCA) is a useful approach for correlating one set of measurements, such as single nucleotide polymorphisms (SNPs), with another set of measurements, such as gene expression levels. We present a fast implementation of SCCA, enabling rapid analysis of hundreds of thousands of SNPs together with thousands of phenotypes. Our approach is implemented both as an R package flashpcaR and within the standalone command-line tool flashpca. Availability and implementation https://github.com/gabraham/[email protected]


NetBoxR: Automated Discovery of Biological Process Modules by Network Analysis in R

10.1101/2020.06.02.129387 ◽

2020 ◽

Author(s):

Eric Minwei Liu ◽

Augustin Luna ◽

Guanlan Dong ◽

Chris Sander

Keyword(s):

Cell Biology ◽

Large Scale ◽

Clustering Algorithm ◽

High Throughput Sequencing ◽

R Package ◽

Cancer Genome ◽

The Cancer Genome Atlas ◽

Supplementary Information ◽

Network Clustering ◽

Data Types

Abstract Summary Large-scale sequencing projects, such as The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC), have accumulated a variety of high-throughput sequencing and molecular profiling data, but it is still challenging to identify potentially causal genetic mutations in cancer as well as in other diseases in an automated fashion. We developed the NetBoxR package, written in the R programming language, which makes use of the NetBox algorithm to identify candidate cancer-related processes. The algorithm uses a network-based approach that combines prior knowledge with a network clustering algorithm, obviating the need for and the limitation of functionally curated gene sets. A key aspect of this approach is its ability to combine multiple data types, such as mutations and copy number alterations, leading to more reliable identification of functional modules. We make the tool available in the Bioconductor R ecosystem for applications in cancer research and cell biology. Availability and implementation The NetBoxR package is free and open source under the GNU GPL-3 license and is available at https://www.bioconductor.org/packages/release/bioc/html/[email protected] Supplementary information None


Integrating multi-OMICS data through sparse canonical correlation analysis for the prediction of complex traits: a comparison study

Bioinformatics ◽

10.1093/bioinformatics/btaa530 ◽

2020 ◽

Vol 36 (17) ◽

pp. 4616-4625 ◽

Cited By ~ 1

Author(s):

Theodoulos Rodosthenous ◽

Vahid Shahrezaei ◽

Marina Evangelou

Keyword(s):

Correlation Analysis ◽

Canonical Correlation Analysis ◽

Latent Variables ◽

Canonical Correlation ◽

Complex Traits ◽

Predictive Accuracy ◽

Matrix Decomposition ◽

Complex Trait ◽

Supplementary Information ◽

Canonical Variables

Abstract Motivation Recent developments in technology have enabled researchers to collect multiple OMICS datasets for the same individuals. The conventional approach for understanding the relationships between the collected datasets and the complex trait of interest would be through the analysis of each OMIC dataset separately from the rest, or to test for associations between the OMICS datasets. In this work we show that integrating multiple OMICS datasets together, instead of analysing them separately, improves our understanding of their in-between relationships as well as the predictive accuracy for the tested trait. Several approaches have been proposed for the integration of heterogeneous and high-dimensional (p≫n) data, such as OMICS. The sparse variant of canonical correlation analysis (CCA) is a promising approach that penalizes the canonical variables to produce sparse latent variables while achieving maximal correlation between the datasets. In recent years, a number of approaches for implementing sparse CCA (sCCA) have been proposed, which differ in their objective functions, in the iterative algorithms used to obtain the sparse latent variables and in their assumptions about the original datasets. Results Through a comparative study we have explored the performance of the conventional CCA proposed by Parkhomenko et al., penalized matrix decomposition CCA proposed by Witten and Tibshirani and its extension proposed by Suo et al. The aforementioned methods were modified to allow for different penalty functions. Although sCCA is an unsupervised learning approach for understanding the in-between relationships, we have recast the problem as a supervised learning one and investigated how the computed latent variables can be used for predicting complex traits. The approaches were extended to allow for multiple (more than two) datasets where the trait was included as one of the input datasets. Both ways have shown improvement over conventional predictive models that include one or multiple datasets. Availability and implementation https://github.com/theorod93/sCCA. Supplementary information Supplementary data are available at Bioinformatics online.
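The alternating, penalized updates that sparse CCA methods share can be sketched as follows. This is a minimal illustration in the spirit of penalized matrix decomposition, under two stated simplifications: the within-dataset covariance matrices are treated as identity, and the soft-threshold levels `lam_u` and `lam_v` are fixed by hand rather than tuned to satisfy an L1 constraint. All function names are hypothetical and do not come from the repository linked in the abstract.

```python
import math

def soft_threshold(values, lam):
    """Shrink each entry toward zero by lam; entries within lam of zero become 0."""
    return [math.copysign(max(abs(v) - lam, 0.0), v) for v in values]

def normalize(values):
    """Scale a vector to unit L2 norm (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for v in values))
    return [v / norm for v in values] if norm > 0 else values

def cross_product(X, Y):
    """Return C = X^T Y for column-centred data stored as lists of rows."""
    p, q = len(X[0]), len(Y[0])
    return [[sum(X[i][a] * Y[i][b] for i in range(len(X))) for b in range(q)]
            for a in range(p)]

def sparse_cca(X, Y, lam_u, lam_v, n_iter=50):
    """Alternate soft-thresholded updates of the two canonical weight vectors."""
    C = cross_product(X, Y)
    p, q = len(C), len(C[0])
    u = [0.0] * p
    v = normalize([1.0] * q)
    for _ in range(n_iter):
        u = normalize(soft_threshold(
            [sum(C[a][b] * v[b] for b in range(q)) for a in range(p)], lam_u))
        v = normalize(soft_threshold(
            [sum(C[a][b] * u[a] for a in range(p)) for b in range(q)], lam_v))
    return u, v

# Toy centred data: the first column of X and of Y share a signal,
# the second columns are unrelated noise.
t = [-2, -1, 0, 1, 2, 0]
n1 = [1, -1, 1, -1, 0.5, -0.5]
n2 = [0.5, 0.5, -1, 0.5, -0.5, 0]
X = [[ti, a] for ti, a in zip(t, n1)]
Y = [[ti, b] for ti, b in zip(t, n2)]
u, v = sparse_cca(X, Y, lam_u=3.0, lam_v=3.0)
print(u, v)
```

With these thresholds the weights on the noise columns are shrunk exactly to zero, so each latent variable depends only on the shared-signal column. The methods compared in the abstract differ in how they choose the penalty, the objective and the update order.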


Identifying diagnosis-specific genotype–phenotype associations via joint multitask sparse canonical correlation analysis and classification

Bioinformatics ◽

10.1093/bioinformatics/btaa434 ◽

2020 ◽

Vol 36 (Supplement_1) ◽

pp. i371-i379 ◽

Cited By ~ 1

Author(s):

Lei Du ◽

Fang Liu ◽

Kefei Liu ◽

Xiaohui Yao ◽

Shannon L Risacher ◽

...

Keyword(s):

Correlation Analysis ◽

Canonical Correlation Analysis ◽

Canonical Correlation ◽

Correlation Coefficients ◽

Imaging Genetics ◽

Supplementary Information ◽

Local Optimum ◽

Nucleotide Polymorphisms ◽

Single Nucleotide ◽

Sparse Canonical Correlation Analysis

Abstract Motivation Brain imaging genetics studies the complex associations between genotypic data such as single nucleotide polymorphisms (SNPs) and imaging quantitative traits (QTs). Neurodegenerative disorders are usually diverse and heterogeneous, so different diagnostic groups might carry distinct imaging QTs, SNPs and interactions between them. Sparse canonical correlation analysis (SCCA) is widely used to identify bi-multivariate genotype–phenotype associations. However, most existing SCCA methods are unsupervised, leading to an inability to identify diagnosis-specific genotype–phenotype associations. Results In this article, we propose a new joint multitask learning method, named MT–SCCALR, which absorbs the merits of both SCCA and logistic regression. MT–SCCALR learns genotype–phenotype associations of multiple tasks jointly, with each task focusing on identifying one diagnosis-specific genotype–phenotype pattern. Meanwhile, MT–SCCALR not only selects relevant SNPs and imaging QTs for each diagnostic group alone, but also allows the selection of those shared by multiple diagnostic groups. We derive an efficient optimization algorithm whose convergence to a local optimum is guaranteed. Compared with two state-of-the-art methods, MT–SCCALR yields better or similar canonical correlation coefficients and classification performances. In addition, it yields canonical weight patterns that are much more discriminative than those of competitors. This demonstrates the power and capability of MT–SCCALR in identifying diagnostically heterogeneous genotype–phenotype patterns, which would be helpful to understand the pathophysiology of brain disorders. Availability and implementation The software is publicly available at https://github.com/dulei323/MTSCCALR. Supplementary information Supplementary data are available at Bioinformatics online.
