M3C: Monte Carlo reference-based consensus clustering

Mapping Intimacies ◽

10.1101/377002 ◽

2018 ◽

Cited By ~ 4

Author(s):

Christopher R. John ◽

David Watson ◽

Dominic Russ ◽

Katriona Goldmann ◽

Michael Ehrenstein ◽

...

Keyword(s):

Monte Carlo ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Real Data ◽

The Cancer Genome Atlas ◽

Consensus Clustering ◽

Null Distributions ◽

Genome Wide ◽

Genome Wide Data ◽

Cancer Genome Atlas

Abstract: Genome-wide data are used to stratify patients into classes for precision medicine using clustering algorithms. A common problem in this area is selecting the number of clusters (K). The Monti consensus clustering algorithm is a widely used method that uses stability selection to estimate K. However, the method is biased towards higher values of K and yields high numbers of false positives. As a solution, we developed Monte Carlo reference-based consensus clustering (M3C), which is based on this algorithm. M3C simulates null distributions of stability scores for a range of K values, thus enabling a comparison with real data to remove bias and statistically test for the presence of structure. M3C corrects the inherent bias of consensus clustering, as demonstrated on simulated and real expression data from The Cancer Genome Atlas (TCGA). For testing M3C, we developed clusterlab, a new method for simulating multivariate Gaussian clusters.
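
The idea of comparing a stability score on real data against scores from simulated null datasets can be sketched in a few lines of Python. This is a minimal illustration, not the M3C package: the stability score used here is the proportion of ambiguous clustering (PAC) from a k-means consensus matrix, the null data are drawn from a single Gaussian with independent features matched to the real means and standard deviations, and every function name and parameter (`pac_score`, `null_dataset`, `monte_carlo_test`, `n_null`, the 0.1 and 0.9 cutoffs) is an assumption made for the example.

```python
import random
import statistics

def kmeans(points, k, rng, iters=20):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster empties out
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels

def pac_score(points, k, rng, resamples=30, frac=0.8, lo=0.1, hi=0.9):
    """Share of sample pairs whose consensus index lies strictly between lo and hi."""
    n = len(points)
    together = [[0] * n for _ in range(n)]
    sampled = [[0] * n for _ in range(n)]
    for _ in range(resamples):
        idx = sorted(rng.sample(range(n), int(frac * n)))
        labels = kmeans([points[i] for i in idx], k, rng)
        for a in range(len(idx)):
            for b in range(a + 1, len(idx)):
                i, j = idx[a], idx[b]
                sampled[i][j] += 1
                together[i][j] += labels[a] == labels[b]
    ambiguous = total = 0
    for i in range(n):
        for j in range(i + 1, n):
            if sampled[i][j]:
                total += 1
                ambiguous += lo < together[i][j] / sampled[i][j] < hi
    return ambiguous / total

def null_dataset(points, rng):
    """Assumed null model: one Gaussian with the real per-feature mean and sd."""
    cols = list(zip(*points))
    mus = [statistics.mean(c) for c in cols]
    sds = [statistics.pstdev(c) for c in cols]
    return [tuple(rng.gauss(m, s) for m, s in zip(mus, sds)) for _ in points]

def monte_carlo_test(points, k, n_null=19, seed=0):
    """Empirical p-value: how often a null dataset is at least as stable as the real one."""
    rng = random.Random(seed)
    real = pac_score(points, k, rng)
    null = [pac_score(null_dataset(points, rng), k, rng) for _ in range(n_null)]
    p_value = (1 + sum(s <= real for s in null)) / (1 + n_null)
    return real, null, p_value

# Toy data: two well-separated 2-D Gaussian clusters of 15 points each.
data_rng = random.Random(1)
data = [(data_rng.gauss(0, 0.5), data_rng.gauss(0, 0.5)) for _ in range(15)]
data += [(data_rng.gauss(10, 0.5), data_rng.gauss(10, 0.5)) for _ in range(15)]
real_pac, null_pacs, p = monte_carlo_test(data, k=2)
print(real_pac, p)
```

A lower PAC means a more stable clustering, so the p-value counts null datasets whose PAC is no larger than the real one. Repeating the test over a range of K values would give one reference distribution per K, which is the comparison the abstract describes.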

A new unsupervised clustering algorithm applied to genome-wide profiles of breast cancers in The Cancer Genome Atlas proper subsets triple-negative samples.

Journal of Clinical Oncology ◽

10.1200/jco.2017.35.15_suppl.e23195 ◽

2017 ◽

Vol 35 (15_suppl) ◽

pp. e23195-e23195

Author(s):

Jason Mezey ◽

Steven Schwager ◽

Sushila Shenoy ◽

Jef Benbanaste ◽

Michael Elashoff ◽

...

Keyword(s):

Clustering Algorithm ◽

Genomic Data ◽

Cancer Genome ◽

Proper Subset ◽

The Cancer Genome Atlas ◽

Driver Mutations ◽

Genome Wide ◽

A Genome ◽

Cancer Genome Atlas ◽

Genome Atlas

e23195 Background: Clustering algorithms have identified subtypes of major cancers from analysis of genome-wide gene expression (GE) and somatic mutation (SM) profiles. These algorithms almost never discover a proper subset cluster, a recovered cluster that includes all the samples of a specific subtype. For breast cancer (BC), clustering of genome-wide profiles has been unable to proper subset triple negatives (TNs), TN subtypes, or other major subtypes. Methods: To search for a proper subset cluster for TNs, we applied a new clustering algorithm to the public domain GE and SM data of BC samples in The Cancer Genome Atlas (TCGA). A module of Medidata’s Clinical Trial Genomics (CTG) platform for automated clinical and genomic data integration and analysis, it uses a hierarchical component with tree learned cut points applied to a principal component dimension reduced similarity matrix calculated from a genome-wide data profile. Results: Our analysis of 540 TCGA BC samples run without human supervision produced a proper subset cluster that included all 55 TN samples and only 74 non-TN samples. GE data have previously indicated TN status, but this is the first demonstration that these TCGA BC data contain enough information to proper subset TNs, implying that this broad BC subtype has a strong, quantifiable impact on GE. We show that the genome-wide SMs of TCGA BC samples can be used to proper subset 4 novel subtypes distinguished as classes “TP53 mutated”, “PIK3CA mutated”, “both TP53 and PIK3CA mutated”, and “neither mutated”, signifying an important role for these known driver mutations in producing the subtypes’ genome-wide mutation profiles. We find that most ( > 80%) TN BCs are in “TP53 mutated” but only 1 TN sample ( < 2%) is in “PIK3CA mutated”, indicating distinct biology for these TNs with potential implications for TN therapy. Conclusions: CTG clustering achieves proper subset cancer subtype clustering of TCGA BC samples. 
These results illustrate the therapeutic discovery potential of high-quality genomic data such as those in TCGA when combined with detailed clinical data through the Medidata CTG integration and annotation platform.
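
The abstract's notion of a proper subset cluster (a recovered cluster that includes all the samples of a specific subtype) can be checked with plain set operations. The sketch below uses hypothetical sample identifiers and only the counts reported above (55 TN and 74 non-TN samples in the cluster); the function names `covers_subtype` and `purity` are illustrative and not part of the CTG platform.

```python
def covers_subtype(cluster, subtype):
    """True when every sample of the subtype falls inside the cluster."""
    return set(subtype) <= set(cluster)

def purity(cluster, subtype):
    """Fraction of the cluster's samples that belong to the subtype."""
    cluster = set(cluster)
    return len(cluster & set(subtype)) / len(cluster)

# Hypothetical sample IDs; only the counts come from the abstract.
tn_samples = {f"TN{i}" for i in range(55)}
non_tn_in_cluster = {f"S{i}" for i in range(74)}
cluster = tn_samples | non_tn_in_cluster

print(covers_subtype(cluster, tn_samples))    # True
print(round(purity(cluster, tn_samples), 3))  # 0.426
```

Under this reading, the reported cluster captures every TN sample while TNs make up about 43% of its 129 members.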

Promoter Methylation of PRKCB, ADAMTS12, and NAALAD2 Is Specific to Prostate Cancer and Predicts Biochemical Disease Recurrence

International Journal of Molecular Sciences ◽

10.3390/ijms22116091 ◽

2021 ◽

Vol 22 (11) ◽

pp. 6091

Author(s):

Kristina Daniunaite ◽

Arnas Bakavicius ◽

Kristina Zukauskaite ◽

Ieva Rauluseviciute ◽

Juozas Rimantas Lazutka ◽

...

Keyword(s):

Prostate Cancer ◽

Clinical Practice ◽

Promoter Methylation ◽

Disease Recurrence ◽

The Cancer Genome Atlas ◽

Cancer Dataset ◽

Protein Coding ◽

Diagnosis And Prognosis ◽

Genome Wide ◽

Cancer Genome Atlas

The molecular diversity of prostate cancer (PCa) has been demonstrated by recent genome-wide studies, proposing a significant number of different molecular markers. However, only a few of them have been transferred into clinical practice so far. The present study aimed to identify and validate novel DNA methylation biomarkers for PCa diagnosis and prognosis. Microarray-based methylome data of well-characterized cancerous and noncancerous prostate tissue (NPT) pairs was used for the initial screening. Ten protein-coding genes were selected for validation in a set of 151 PCa, 51 NPT, as well as 17 benign prostatic hyperplasia samples. The Prostate Cancer Dataset (PRAD) of The Cancer Genome Atlas (TCGA) was utilized for independent validation of our findings. Methylation frequencies of ADAMTS12, CCDC181, FILIP1L, NAALAD2, PRKCB, and ZMIZ1 were up to 91% in our study. PCa specific methylation of ADAMTS12, CCDC181, NAALAD2, and PRKCB was demonstrated by qualitative and quantitative means (all p < 0.05). In agreement with PRAD, promoter methylation of these four genes was associated with the transcript down-regulation in the Lithuanian cohort (all p < 0.05). Methylation of ADAMTS12, NAALAD2, and PRKCB was independently predictive for biochemical disease recurrence, while NAALAD2 and PRKCB increased the prognostic power of multivariate models (all p < 0.01). The present study identified methylation of ADAMTS12, NAALAD2, and PRKCB as novel diagnostic and prognostic PCa biomarkers that might guide treatment decisions in clinical practice.

Identification of supervised and sparse functional genomic pathways

Statistical Applications in Genetics and Molecular Biology ◽

10.1515/sagmb-2018-0026 ◽

2020 ◽

Vol 19 (1) ◽

Author(s):

Fan Zhang ◽

Jeffrey C. Miecznikowski ◽

David L. Tritchler

Keyword(s):

Real Data ◽

The Cancer Genome Atlas ◽

Functional Networks ◽

Functional Genomic ◽

Least Squares Regression ◽

Comprehensive Understanding ◽

Omics Technologies ◽

Cancer Genome Atlas ◽

Functional Pathways ◽

Genome Atlas

Abstract: Functional pathways involve a series of biological alterations that may result in the occurrence of many diseases, including cancer. With the availability of various “omics” technologies, it becomes feasible to integrate information from a hierarchy of biological layers to provide a more comprehensive understanding of the disease. In many diseases, it is believed that only a small number of networks, each relatively small in size, drive the disease. Our goal in this study is to develop methods to discover these functional networks across biological layers correlated with the phenotype. We derive a novel Network Summary Matrix (NSM) that highlights potential pathways conforming to least squares regression relationships. We propose an algorithm called Decomposition of Network Summary Matrix via Instability (DNSMI), which decomposes the NSM using instability regularization. Simulations and real data analysis from The Cancer Genome Atlas (TCGA) program are shown to demonstrate the performance of the algorithm.

Comprehensive Analysis of The Significance of Hypoxia-Related Genes in Ovarian Cancer

10.21203/rs.3.rs-457769/v1 ◽

2021 ◽

Author(s):

Wancheng Zhao ◽

Lili Yin

Keyword(s):

Ovarian Cancer ◽

Epithelial Mesenchymal Transition ◽

Principal Component ◽

Vital Role ◽

The Cancer Genome Atlas ◽

Consensus Clustering ◽

Mesenchymal Transition ◽

Cancer Genome Atlas ◽

Pca Algorithm

Abstract Background: Hypoxia-related genes have been reported to play important roles in a variety of cancers. However, their roles in ovarian cancer (OC) have remained unknown. The aim of our research was to explore the significance of hypoxia-related genes in OC patients. Methods: In this study, 15 hypoxia-related genes were screened from The Cancer Genome Atlas (TCGA) database to group the ovarian cancer patients using the consensus clustering method. Principal component analysis (PCA) was performed to calculate the hypoxia score for each patient to quantify the hypoxic status. Results: The OC patients from the TCGA-OV dataset were divided into two distinct hypoxia statuses (cluster.A and cluster.B) based on the expression level of the 15 hypoxia-related genes. Most hypoxia-related genes were expressed more highly in the cluster.A group than in the cluster.B group. We also found that patients in the cluster.A group exhibited higher expression of immune checkpoint-related genes, epithelial-mesenchymal transition-related genes, and immune activation-related genes, as well as elevated immune infiltrates. The PCA algorithm indicated that patients in the cluster.A group had higher hypoxia scores than those in the cluster.B group. Conclusions: In summary, our research elucidated the vital role of hypoxia-related genes in immune infiltrates of OC. Our investigation of hypoxic status may be able to improve the efficacy of immunotherapy for OC.
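
The abstract does not give the formula behind the PCA-derived hypoxia score. One plausible reading, sketched below as an assumption, is to project each patient's centered expression profile onto the first principal component. The expression values are invented, the function name `first_pc_scores` is hypothetical, and the component is found by power iteration so that the example needs only the standard library.

```python
import math

def first_pc_scores(matrix, iters=200):
    """Project each row (patient) onto the first principal component.

    matrix: list of rows, one row per patient, one column per gene.
    Assumes the data have non-zero variance and that the all-ones start
    vector is not orthogonal to the leading component.
    """
    n, p = len(matrix), len(matrix[0])
    means = [sum(row[j] for row in matrix) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in matrix]
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(p)]
           for a in range(p)]
    v = [1.0] * p
    for _ in range(iters):  # power iteration converges to the top eigenvector
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    if sum(v) < 0:  # fix the arbitrary sign so higher expression gives a higher score
        v = [-x for x in v]
    return [sum(r[j] * v[j] for j in range(p)) for r in centered]

# Invented expression values for three genes in six patients:
# the first three rows mimic a high-expression group, the last three a low one.
expression = [
    [5.0, 6.0, 5.5], [6.0, 5.0, 6.0], [5.5, 5.5, 5.0],
    [1.0, 2.0, 1.5], [2.0, 1.0, 1.0], [1.5, 1.5, 2.0],
]
scores = first_pc_scores(expression)
print([round(s, 2) for s in scores])
```

With these toy values the first three patients receive positive scores and the last three negative ones, mirroring the reported ordering of cluster.A above cluster.B.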

A new stochastic gradient descent possibilistic clustering algorithm

AI Communications ◽

10.3233/aic-210125 ◽

2021 ◽

pp. 1-18

Author(s):

Angeliki Koutsimpela ◽

Konstantinos D. Koutroumbas

Keyword(s):

Cost Function ◽

Gradient Descent ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Real Data ◽

Stochastic Gradient ◽

Stochastic Gradient Descent ◽

Data Sets ◽

Convergence Results ◽

Possibilistic Clustering

Several well-known clustering algorithms have their own online counterparts, in order to deal effectively with the big data issue, as well as with the case where the data become available in a streaming fashion. However, very few of them follow the stochastic gradient descent philosophy, despite the fact that the latter enjoys certain practical advantages (such as the possibility of (a) running faster than their batch processing counterparts and (b) escaping from local minima of the associated cost function), while, in addition, strong theoretical convergence results have been established for it. In this paper, a novel stochastic gradient descent possibilistic clustering algorithm, called O-PCM2, is introduced. The algorithm is presented in detail and it is rigorously proved that the gradient of the associated cost function tends to zero in the L2 sense, based on general convergence results established for the family of the stochastic gradient descent algorithms. Furthermore, an additional discussion is provided on the nature of the points where the algorithm may converge. Finally, the performance of the proposed algorithm is tested against other related algorithms, on the basis of both synthetic and real data sets.
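
To make the stochastic gradient descent idea concrete, the sketch below updates cluster representatives one streamed point at a time. It is not the paper's algorithm: it assumes a possibilistic cost with exponential memberships, u = exp(-d²/γ), for which the per-sample gradient with respect to a representative θ is -2u(x - θ). It also fixes γ, uses one-dimensional data and adopts a simple decaying step size. All names and parameter values are illustrative.

```python
import math
import random

def sgd_possibilistic(stream, centers, gamma=4.0, lr0=0.25, decay=0.01):
    """One pass of SGD over a data stream; returns the final representatives.

    Assumed per-sample cost for each representative theta:
        u * d^2 + gamma * (u * ln(u) - u),  with d = x - theta,
    minimised over u by u = exp(-d^2 / gamma).
    """
    centers = list(centers)
    for t, x in enumerate(stream):
        lr = lr0 / (1.0 + decay * t)  # decaying step size
        for j, theta in enumerate(centers):
            u = math.exp(-((x - theta) ** 2) / gamma)  # degree of compatibility
            # Gradient step: theta <- theta - lr * (-2 * u * (x - theta))
            centers[j] = theta + 2.0 * lr * u * (x - theta)
    return centers

# Synthetic stream alternating between two 1-D clusters around 0 and 10.
rng = random.Random(0)
stream = [rng.gauss(0.0 if t % 2 == 0 else 10.0, 0.5) for t in range(600)]
final = sgd_possibilistic(stream, centers=[2.0, 8.0])
print([round(c, 2) for c in final])
```

Because a point far from a representative has a membership close to zero, each representative is effectively updated only by nearby points, which is the behaviour that distinguishes possibilistic clustering from hard or fuzzy assignment.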

Transposable element expression in tumors is associated with immune infiltration and increased antigenicity

Nature Communications ◽

10.1038/s41467-019-13035-2 ◽

2019 ◽

Vol 10 (1) ◽

Cited By ~ 16

Author(s):

Yu Kong ◽

Christopher M. Rose ◽

Ashley A. Cass ◽

Alexander G. Williams ◽

Martine Darwish ◽

...

Keyword(s):

Dna Methylation ◽

De Novo ◽

Computational Method ◽

The Cancer Genome Atlas ◽

Potential Consequence ◽

Sequencing Data ◽

Antiviral Responses ◽

Genome Wide ◽

Cancer Genome Atlas ◽

Demethylation Agent

Abstract: Profound global loss of DNA methylation is a hallmark of many cancers. One potential consequence of this is the reactivation of transposable elements (TEs), which could stimulate the immune system via cell-intrinsic antiviral responses. Here, we develop REdiscoverTE, a computational method for quantifying genome-wide TE expression in RNA sequencing data. Using The Cancer Genome Atlas database, we observe increased expression of over 400 TE subfamilies, of which 262 appear to result from a proximal loss of DNA methylation. The most recurrent TEs are among the evolutionarily youngest in the genome, predominantly expressed from intergenic loci, and associated with antiviral or DNA damage responses. Treatment of glioblastoma cells with a demethylation agent results in both increased TE expression and de novo presentation of TE-derived peptides on MHC class I molecules. Therapeutic reactivation of tumor-specific TEs may synergize with immunotherapy by inducing inflammation and the display of potentially immunogenic neoantigens.

Heterogeneity of MSI-H gastric cancer identifies a subtype with worse survival

Journal of Medical Genetics ◽

10.1136/jmedgenet-2019-106609 ◽

2020 ◽

Vol 58 (1) ◽

pp. 12-19 ◽

Cited By ~ 1

Author(s):

Yanmei Yang ◽

Zhong Shi ◽

Rui Bai ◽

Wangxiong Hu

Keyword(s):

Immune Checkpoint Blockade ◽

The Cancer Genome Atlas ◽

Consensus Clustering ◽

Synonymous Mutations ◽

The Poor ◽

Symptom Alleviation ◽

Cancer Genome Atlas ◽

Stomach Adenocarcinoma ◽

Asian Cohort ◽

Matrix Factorisation

Background: Microsatellite instability-high (MSI-H) tumour patients generally have a better prognosis than microsatellite-stable (MSS) ones due to the large number of non-synonymous mutations. However, an increasing number of studies have revealed that less than half of MSI-H patients gain survival benefits or symptom alleviation from immune checkpoint-blockade treatment. Thus, an in-depth inspection of heterogeneous MSI-H tumours is urgently required. Methods: Here, we used non-negative matrix factorisation (NMF)-based consensus clustering to define stomach adenocarcinoma (STAD) MSI-H subtypes in samples from The Cancer Genome Atlas and an Asian cohort, GSE62254. Results: MSI-H STAD samples are essentially clustered into two subgroups (MSI-H1 and MSI-H2). Further examination of the immune landscape showed that immune suppression factors were enriched in the MSI-H1 subgroup, which may be associated with the poor prognosis in this subgroup. Conclusions: Our results illustrate the genetic heterogeneity within MSI-H STADs, with important implications for cancer patient risk stratification, prognosis and treatment.

Deciphering N6-Methyladenosine-Related Genes Signature to Predict Survival in Lung Adenocarcinoma

BioMed Research International ◽

10.1155/2020/2514230 ◽

2020 ◽

Vol 2020 ◽

pp. 1-13

Author(s):

Jie Zhu ◽

Min Wang ◽

Daixing Hu

Keyword(s):

Lung Adenocarcinoma ◽

Gene Expression Omnibus ◽

The Cancer Genome Atlas ◽

Consensus Clustering ◽

Lasso Regression ◽

Gene Expressions ◽

Tumor Tissues ◽

Cancer Genome Atlas ◽

Selection Operator ◽

Cox Analysis

Lung cancer is the most commonly diagnosed cancer and the leading cause of cancer-related death. Lung adenocarcinoma (LUAD) accounts for most cases. With the improvement of precision medicine based on molecular characterization, the treatment of LUAD has undergone significant changes, and the prognosis of LUAD has become more diverse. N6-methyladenosine (m6A) is the most predominant modification in mRNAs and has been a research hotspot in the field of oncology. Nevertheless, few studies have examined the correlations between m6A-related genes and prognosis in LUAD. Thus, we conducted a comprehensive analysis of m6A-related gene expression in LUAD patients based on The Cancer Genome Atlas (TCGA) database and its relationship with prognosis. Differential expression of the m6A-related genes between tumor and non-tumor tissues was confirmed. Furthermore, their relationship with prognosis was studied via Consensus Clustering Analysis, Principal Components Analysis (PCA), and Least Absolute Shrinkage and Selection Operator (LASSO) Regression. Based on the above analyses, an m6A-based signature to predict overall survival (OS) in LUAD was successfully established. Among the 479 cases, we found that most of the m6A-related genes were differentially expressed between tumor and non-tumor tissues. Six genes, HNRNPC, METTL3, YTHDC2, KIAA1429, ALKBH5, and YTHDF1, were screened to build a risk scoring signature, which is strongly related to clinical features: pathological stage (p < 0.05), M stage (p < 0.05), T stage (p < 0.05), gender (p = 0.04), and survival outcome (p = 0.02). Multivariate Cox analysis indicated that the risk value could be used as an independent prognostic factor, revealing that the m6A-related gene signature has great predictive value. Its efficacy was also validated by data from the Gene Expression Omnibus (GEO) database.
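
A risk scoring signature of this kind is commonly computed as a weighted sum of the selected genes' expression values, with weights taken from the fitted regression. The sketch below illustrates that calculation under stated assumptions: the six gene names come from the abstract, but the coefficients, the patient expression values, and the median split into high- and low-risk groups are all hypothetical and are not reported in the source.

```python
import statistics

# Hypothetical regression coefficients; the abstract does not report them.
COEFFICIENTS = {
    "HNRNPC": 0.30, "METTL3": -0.20, "YTHDC2": -0.10,
    "KIAA1429": 0.25, "ALKBH5": 0.15, "YTHDF1": -0.05,
}

def risk_score(expression, coefficients=COEFFICIENTS):
    """Weighted sum of gene expression values for one patient."""
    return sum(weight * expression[gene] for gene, weight in coefficients.items())

def stratify(scores):
    """Assumed rule: scores above the cohort median are labelled high risk."""
    cutoff = statistics.median(scores.values())
    return {pid: ("high" if s > cutoff else "low") for pid, s in scores.items()}

# Invented patients whose six genes all share one expression level.
levels = {"P1": 1.0, "P2": 2.0, "P3": 0.0, "P4": 3.0}
patients = {pid: {gene: level for gene in COEFFICIENTS} for pid, level in levels.items()}

scores = {pid: risk_score(expr) for pid, expr in patients.items()}
groups = stratify(scores)
print(groups)
```

The resulting groups could then be compared on survival, for example in a Cox model, which is the role the abstract assigns to the risk value.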

Scalable Nonparametric Prescreening Method for Searching Higher-Order Genetic Interactions Underlying Quantitative Traits

Genetics ◽

10.1534/genetics.119.302658 ◽

2019 ◽

Vol 213 (4) ◽

pp. 1209-1224 ◽

Cited By ~ 2

Author(s):

Juho A. J. Kontio ◽

Mikko J. Sillanpää

Keyword(s):

High Throughput ◽

High Throughput Sequencing ◽

Real Data ◽

Higher Order ◽

The Cancer Genome Atlas ◽

Sequencing Data ◽

High Throughput Sequencing Data ◽

Automatic Relevance Determination ◽

Cancer Genome Atlas ◽

Low Dimensional

Gaussian process (GP)-based automatic relevance determination (ARD) is known to be an efficient technique for identifying determinants of gene-by-gene interactions important to trait variation. However, the estimation of GP models is feasible only for low-dimensional datasets (∼200 variables), which severely limits application of the GP-based ARD method to high-throughput sequencing data. In this paper, we provide a nonparametric prescreening method that preserves virtually all the major benefits of the GP-based ARD method and extends its scalability to the typical high-dimensional datasets used in practice. In several simulated test scenarios, the proposed method compared favorably with existing nonparametric dimension reduction/prescreening methods suitable for higher-order interaction searches. As a real-data example, the proposed method was applied to a high-throughput dataset downloaded from The Cancer Genome Atlas (TCGA) with measured expression levels of 16,976 genes (after preprocessing) from patients diagnosed with acute myeloid leukemia.

IsoformSwitchAnalyzeR: analysis of changes in genome-wide patterns of alternative splicing and its functional consequences

Bioinformatics ◽

10.1093/bioinformatics/btz247 ◽

2019 ◽

Vol 35 (21) ◽

pp. 4469-4471 ◽

Cited By ~ 21

Author(s):

Kristoffer Vitting-Seerup ◽

Albin Sandelin

Keyword(s):

Alternative Splicing ◽

The Cancer Genome Atlas ◽

Supplementary Information ◽

Rna Seq ◽

Genome Wide ◽

Functional Consequences ◽

Cancer Genome Atlas ◽

Health And Disease ◽

Splicing Patterns

Abstract Summary: Alternative splicing is an important mechanism involved in health and disease. Recent work highlights the importance of investigating genome-wide changes in splicing patterns and the subsequent functional consequences. Current computational methods only support such analysis on a gene-by-gene basis. Therefore, we extended the IsoformSwitchAnalyzeR R library to enable analysis of genome-wide changes in specific types of alternative splicing and predicted functional consequences of the resulting isoform switches. As a case study, we analyzed RNA-seq data from The Cancer Genome Atlas and found systematic changes in alternative splicing and the consequences of the associated isoform switches. Availability and implementation: Windows, Linux and Mac OS: http://bioconductor.org/packages/IsoformSwitchAnalyzeR. Supplementary information: Supplementary data are available at Bioinformatics online.
