Regression on imperfect class labels derived by unsupervised clustering

Briefings in Bioinformatics ◽

10.1093/bib/bbaa014 ◽

2020 ◽

Author(s):

Rasmus Froberg Brøndum ◽

Thomas Yssing Michaelsen ◽

Martin Bøgsted

Keyword(s):

Gene Expression ◽

Multiple Myeloma ◽

Learning Algorithm ◽

Gaussian Mixture Models ◽

Simulated Data ◽

Gaussian Mixture ◽

Unsupervised Clustering ◽

Expression Data ◽

Method Performance ◽

Class Labels

Abstract Regressing an outcome on class labels identified by unsupervised clustering is customary in many applications. However, it is common to ignore the misclassification of class labels caused by the learning algorithm, which can seriously bias the estimated effect parameters. Because they are general, we suggest addressing the problem with regression calibration or the misclassification simulation and extrapolation method. Performance is illustrated with simulated data from Gaussian mixture models, documenting reduced bias and improved coverage of confidence intervals when adjusting for misclassification with either method. Finally, we apply our method to data from a previous study, which regressed overall survival on class labels derived from unsupervised clustering of gene expression data from bone marrow samples of multiple myeloma patients.
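
The bias described in this abstract can be illustrated with a small simulation. The sketch below is not the authors' method: it does not implement regression calibration or the misclassification simulation and extrapolation method, and it replaces the clustering step with a simple midpoint threshold on a single Gaussian feature. The function name, the sample size, the effect size and the component separation are all assumptions chosen for illustration. It only shows that regressing on error-prone labels pulls the estimated effect toward zero.

```python
import numpy as np


def simulate_label_noise(n=20000, effect=2.0, separation=2.0, seed=0):
    """Compare regression on true labels with regression on noisy labels."""
    rng = np.random.default_rng(seed)
    z = rng.integers(0, 2, size=n)                 # true (latent) class labels
    x = rng.normal(loc=separation * z, scale=1.0)  # feature used to recover labels
    y = effect * z + rng.normal(size=n)            # outcome depends on the true label only

    # Stand-in for a clustering step: split the feature at the midpoint
    # between the two component means (an assumption of this sketch).
    z_hat = (x > separation / 2).astype(int)

    def slope(labels):
        # With one binary regressor, the OLS slope is the difference in group means.
        return y[labels == 1].mean() - y[labels == 0].mean()

    return {
        "oracle": slope(z),        # uses the true labels, unavailable in practice
        "naive": slope(z_hat),     # ignores misclassification
        "error_rate": float(np.mean(z != z_hat)),
    }


result = simulate_label_noise()
print(result)
```

In this setup the naive estimate is attenuated relative to the estimate based on the true labels, and the gap grows as the two components overlap more (smaller `separation`).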


A new clustering method of gene expression data based on multivariate Gaussian mixture models

Signal Image and Video Processing ◽

10.1007/s11760-015-0749-5 ◽

2015 ◽

Vol 10 (2) ◽

pp. 359-368 ◽

Cited By ~ 8

Author(s):

Zhe Liu ◽

Yu-qing Song ◽

Cong-hua Xie ◽

Zheng Tang

Keyword(s):

Gene Expression ◽

Mixture Models ◽

Gene Expression Data ◽

Gaussian Mixture Models ◽

Gaussian Mixture ◽

Expression Data ◽

Clustering Method ◽

Multivariate Gaussian


A Unified Formulation of k-Means, Fuzzy c-Means and Gaussian Mixture Model by the Kolmogorov–Nagumo Average

Entropy ◽

10.3390/e23050518 ◽

2021 ◽

Vol 23 (5) ◽

pp. 518

Author(s):

Osamu Komori ◽

Shinto Eguchi

Keyword(s):

Pareto Distribution ◽

Statistical Data ◽

Learning Algorithm ◽

Survival Function ◽

Gaussian Mixture Models ◽

Gaussian Mixture ◽

Simulation Studies ◽

Probabilistic Framework ◽

Underlying Distribution ◽

Fuzzy C Means

Clustering is a major unsupervised learning algorithm and is widely applied in data mining and statistical data analyses. Typical examples include k-means, fuzzy c-means, and Gaussian mixture models, which are categorized into hard, soft, and model-based clusterings, respectively. We propose a new clustering method, called Pareto clustering, based on the Kolmogorov–Nagumo average, which is defined by a survival function of the Pareto distribution. The proposed algorithm incorporates all the aforementioned clusterings plus maximum-entropy clustering. We introduce a probabilistic framework for the proposed method, in which we discuss the underlying distribution that gives consistency. We build the minorize-maximization algorithm to estimate the parameters in Pareto clustering. We compare the performance with existing methods in simulation studies and in benchmark dataset analyses to demonstrate its practical utility.
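
The distinction between hard, soft, and model-based clustering can be made concrete by comparing how each family assigns points to fixed cluster centers. The sketch below is not Pareto clustering and does not use the Kolmogorov–Nagumo average. It shows only the standard assignment rules, under stated assumptions: the centers are given rather than estimated, the fuzzifier `m` is 2, and the Gaussian mixture has equal mixing weights and a shared spherical covariance `sigma**2 * I`. All function names are hypothetical.

```python
import numpy as np


def squared_distances(points, centers):
    # Pairwise squared Euclidean distances, shape (n_points, n_centers).
    diff = points[:, None, :] - centers[None, :, :]
    return np.sum(diff ** 2, axis=2)


def hard_assignments(points, centers):
    # k-means style: each point belongs to its nearest center only.
    return np.argmin(squared_distances(points, centers), axis=1)


def fuzzy_memberships(points, centers, m=2.0):
    # Fuzzy c-means style: memberships proportional to d^(-2 / (m - 1)).
    d2 = np.maximum(squared_distances(points, centers), 1e-12)  # avoid division by zero
    inv = d2 ** (-1.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)


def gmm_responsibilities(points, centers, sigma=1.0):
    # Gaussian mixture posterior probabilities, assuming equal weights and a
    # shared spherical covariance, so normalizing constants cancel.
    logits = -squared_distances(points, centers) / (2.0 * sigma ** 2)
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)


points = np.array([[0.0, 0.5], [2.0, 2.0], [4.0, 3.5]])
centers = np.array([[0.0, 0.0], [4.0, 4.0]])

hard = hard_assignments(points, centers)
fuzzy = fuzzy_memberships(points, centers)
resp = gmm_responsibilities(points, centers)
print(hard, fuzzy, resp, sep="\n")
```

The middle point is equidistant from both centers, so the two soft rules split it evenly while the hard rule must pick one cluster.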


From correlation to causation networks: a simple approximate learning algorithm and its application to high-dimensional plant gene expression data

BMC Systems Biology ◽

10.1186/1752-0509-1-37 ◽

2007 ◽

Vol 1 (1) ◽

Cited By ~ 212

Author(s):

Rainer Opgen-Rhein ◽

Korbinian Strimmer

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Learning Algorithm ◽

High Dimensional ◽

Expression Data ◽

Plant Gene Expression ◽

Plant Gene


MGMM: An R Package for fitting Gaussian Mixture Models on Incomplete Data

10.1101/2019.12.20.884551 ◽

2019 ◽

Cited By ~ 1

Author(s):

Zachary R. McCaw ◽

Hanna Julienne ◽

Hugues Aschard

Keyword(s):

Missing Data ◽

Mixture Models ◽

Gaussian Mixture Models ◽

Model Fitting ◽

Simulated Data ◽

R Package ◽

Gaussian Mixture ◽

Parameter Estimates ◽

Cluster Assignment ◽

Underlying Distribution

Abstract Although missing data are prevalent in applications, existing implementations of Gaussian mixture models (GMMs) require complete data. Standard practice is to perform complete case analysis or imputation prior to model fitting. Both approaches have serious drawbacks, potentially resulting in biased and unstable parameter estimates. Here we present MGMM, an R package for fitting GMMs in the presence of missing data. Using three case studies on real and simulated data sets, we demonstrate that, when the underlying distribution is close to a GMM, MGMM is more effective at recovering the true cluster assignments than state-of-the-art imputation followed by standard GMM. Moreover, MGMM provides an accurate assessment of cluster assignment uncertainty even when the generative distribution is not a GMM. This assessment may be used to identify unassignable observations. MGMM is available as an R package on CRAN: https://CRAN.R-project.org/package=MGMM.


Predicting drug polypharmacology from cell morphology readouts using variational autoencoder latent space arithmetic

10.1101/2021.09.02.458673 ◽

2021 ◽

Author(s):

Yuen Ler Chow ◽

Shantanu Singh ◽

Anne E Carpenter ◽

Gregory P. Way

Keyword(s):

Gene Expression ◽

Cell Morphology ◽

Learning Algorithm ◽

Simulated Data ◽

Biomedical Data ◽

Data Types ◽

Generative Capacity ◽

Latent Space ◽

Variational Autoencoder ◽

Target Effects

A variational autoencoder (VAE) is a machine learning algorithm, useful for generating a compressed and interpretable latent space. These representations have been generated from various biomedical data types and can be used to produce realistic-looking simulated data. However, vanilla VAEs suffer from entangled and uninformative latent spaces, which can be mitigated using other types of VAEs such as β-VAE and MMD-VAE. In this project, we evaluated the ability of VAEs to learn cell morphology characteristics derived from cell images. We trained and evaluated these three VAE variants (vanilla VAE, β-VAE, and MMD-VAE) on cell morphology readouts and explored the generative capacity of each model to predict compound polypharmacology (the interactions of a drug with more than one target) using an approach called latent space arithmetic (LSA). To test the generalizability of the strategy, we also trained these VAEs using gene expression data of the same compound perturbations and found that gene expression provides complementary information. We found that the β-VAE and MMD-VAE disentangle morphology signals and reveal a more interpretable latent space. We reliably simulated morphology and gene expression readouts from certain compounds, thereby predicting cell states perturbed with compounds of known polypharmacology. Inferring cell state for specific drug mechanisms could aid researchers in developing and identifying targeted therapeutics and categorizing off-target effects in the future.


Shrinkage of dispersion parameters in the binomial family, with application to differential exon skipping

10.1101/012823 ◽

2014 ◽

Author(s):

Sean Ruddy ◽

Marla Johnson ◽

Elizabeth Purdom

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Empirical Bayes ◽

Simulated Data ◽

Exon Skipping ◽

Expression Data ◽

Weighted Likelihood ◽

Sequencing Data ◽

Dispersion Parameters ◽

Per Gene

The prevalence of sequencing experiments in genomics has led to an increased use of count-data methods for analyzing high-throughput genomic data. The importance of shrinkage methods in improving the performance of statistical methods remains. A common example is that of gene expression data, where the counts per gene are often modeled as some form of an over-dispersed Poisson. In this case, shrinkage estimates of the per-gene dispersion parameter have led to improved estimation of dispersion when the number of samples is small. We address a different count setting introduced by the use of sequencing data: comparing differential proportional usage via an over-dispersed binomial model. This is motivated by our interest in testing for differential exon skipping in mRNA-Seq experiments. We introduce a novel method that is developed by modeling the dispersion based on the double binomial distribution proposed by Efron (1986). Our method (WEB-Seq) is an empirical Bayes strategy that produces a shrunken estimate of dispersion, effectively detects differential proportional usage, and has close ties to the weighted-likelihood strategy of edgeR developed for gene expression data (Robinson and Smyth, 2007; Robinson et al., 2010). We analyze its behavior on simulated data sets as well as real data and show that our method is fast, powerful and gives accurate control of the FDR compared to alternative approaches. We provide an implementation of our methods in the R package DoubleExpSeq, available on CRAN.


CrossICC: iterative consensus clustering of cross-platform gene expression data without adjusting batch effect

Briefings in Bioinformatics ◽

10.1093/bib/bbz116 ◽

2019 ◽

Vol 21 (5) ◽

pp. 1818-1824 ◽

Cited By ~ 1

Author(s):

Qi Zhao ◽

Yu Sun ◽

Zekun Liu ◽

Hongwan Zhang ◽

Xingyang Li ◽

...

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Gene Signature ◽

Unsupervised Clustering ◽

Batch Effect ◽

Consensus Clustering ◽

Expression Data ◽

Personalized Care ◽

Cancer Subtypes ◽

Multiple Datasets

Abstract Unsupervised clustering of high-throughput gene expression data is widely adopted for cancer subtyping. However, cancer subtypes derived from a single dataset are usually not applicable across multiple datasets from different platforms. Merging different datasets is necessary to determine accurate and applicable cancer subtypes but remains difficult because of the batch effect. CrossICC is an R package designed for the unsupervised clustering of gene expression data from multiple datasets/platforms without the requirement of batch effect adjustment. CrossICC utilizes an iterative strategy to derive the optimal gene signature and cluster numbers from a consensus similarity matrix generated by consensus clustering. This package also provides abundant functions to visualize the identified subtypes and evaluate subtyping performance. We expect that CrossICC could be used to discover robust cancer subtypes with significant translational implications in personalized care for cancer patients. Availability and Implementation: The package is implemented in R and available at GitHub (https://github.com/bioinformatist/CrossICC) and Bioconductor (http://bioconductor.org/packages/release/bioc/html/CrossICC.html) under the GPL v3 License.


Integrating Gene Ontology Based Grouping and Ranking into the Machine Learning Algorithm for Gene Expression Data Analysis

10.1007/978-3-030-87101-7_20 ◽

2021 ◽

pp. 205-214

Author(s):

Malik Yousef ◽

Ahmet Sayıcı ◽

Burcu Bakir-Gungor

Keyword(s):

Gene Expression ◽

Machine Learning ◽

Gene Ontology ◽

Data Analysis ◽

Gene Expression Data ◽

Learning Algorithm ◽

Machine Learning Algorithm ◽

Expression Data ◽

Gene Expression Data Analysis


A Fast Globally Supervised Learning Algorithm for Gaussian Mixture Models

Web-Age Information Management - Lecture Notes in Computer Science ◽

10.1007/3-540-45151-x_42 ◽

2000 ◽

pp. 449-454 ◽

Cited By ~ 1

Author(s):

Jiyong Ma ◽

Wen Gao

Keyword(s):

Supervised Learning ◽

Mixture Models ◽

Learning Algorithm ◽

Gaussian Mixture Models ◽

Gaussian Mixture


Unsupervised clustering of time series gene expression data based on spectrum processing and autoregressive modeling

Computational Methods with Applications in Bioinformatics Analysis ◽

10.1142/9789813207981_0001 ◽

2017 ◽

pp. 1-21

Author(s):

Chien-Yuan Li ◽

Rong-Ming Chen ◽

Been-Chian Chien ◽

Rouh-Mei Hu ◽

Jeffrey J. P. Tsai

Keyword(s):

Gene Expression ◽

Time Series ◽

Gene Expression Data ◽

Unsupervised Clustering ◽

Expression Data ◽

Autoregressive Modeling ◽

Time Series Gene Expression
