Regression on imperfect class labels derived by unsupervised clustering

Author(s):  
Rasmus Froberg Brøndum ◽  
Thomas Yssing Michaelsen ◽  
Martin Bøgsted

Abstract Regressing an outcome on class labels identified by unsupervised clustering is customary in many applications. However, it is common to ignore the misclassification of class labels caused by the learning algorithm, which can lead to serious bias in the estimated effect parameters. Because of their generality, we suggest addressing the problem with regression calibration or the misclassification simulation and extrapolation method. Performance is illustrated with simulated data from Gaussian mixture models, which document reduced bias and improved coverage of confidence intervals when adjusting for misclassification with either method. Finally, we apply our method to data from a previous study, which regressed overall survival on class labels derived from unsupervised clustering of gene expression data from bone marrow samples of multiple myeloma patients.
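The bias described in this abstract can be reproduced in a toy setting. The sketch below is not the authors' simulation: it assumes two equally likely classes, a one-dimensional feature from a two-component Gaussian mixture, labels assigned by a midpoint rule, and a misclassification rate that is known from the design (in practice it would have to be estimated). All names and parameter values are illustrative. It compares the naive slope with a simple regression-calibration slope, which regresses the outcome on E[Z | W] rather than on the observed label W.

```python
import random
from statistics import NormalDist, fmean


def ols_slope(x, y):
    """Slope of the least-squares line of y on a single regressor x."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def simulate(n, beta0, beta1, sep, seed):
    """Draw observed labels w and outcomes y from a two-class toy model."""
    rng = random.Random(seed)
    w, y = [], []
    for _ in range(n):
        z = rng.randint(0, 1)                # true class, equal priors
        x = rng.gauss((z - 0.5) * sep, 1.0)  # feature from a 2-component mixture
        w.append(1 if x > 0 else 0)          # imperfect label from the midpoint rule
        y.append(beta0 + beta1 * z + rng.gauss(0.0, 1.0))
    return w, y


BETA1, SEP = 2.0, 2.0                        # illustrative values, not from the paper
p = NormalDist().cdf(-SEP / 2)               # misclassification rate (assumed known)
w, y = simulate(n=100_000, beta0=1.0, beta1=BETA1, sep=SEP, seed=0)

# Naive analysis: treat the imperfect labels as if they were the true classes.
naive = ols_slope(w, y)

# Regression calibration: replace W by E[Z | W], which equals 1 - p for W = 1
# and p for W = 0 under equal priors and symmetric misclassification.
w_cal = [1.0 - p if wi == 1 else p for wi in w]
calibrated = ols_slope(w_cal, y)

print(f"naive={naive:.3f}  calibrated={calibrated:.3f}  truth={BETA1}")
```

Under these assumptions the naive slope is attenuated by a factor of roughly 1 - 2p, while the calibrated slope should land close to the true effect.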

Entropy ◽  
2021 ◽  
Vol 23 (5) ◽  
pp. 518
Author(s):  
Osamu Komori ◽  
Shinto Eguchi

Clustering is a major unsupervised learning algorithm and is widely applied in data mining and statistical data analyses. Typical examples include k-means, fuzzy c-means, and Gaussian mixture models, which are categorized into hard, soft, and model-based clusterings, respectively. We propose a new clustering, called Pareto clustering, based on the Kolmogorov–Nagumo average, which is defined by a survival function of the Pareto distribution. The proposed algorithm incorporates all the aforementioned clusterings plus maximum-entropy clustering. We introduce a probabilistic framework for the proposed method, in which we discuss the underlying distribution that gives consistency. We build the minorize-maximization algorithm to estimate the parameters in Pareto clustering. We compare its performance with existing methods in simulation studies and in benchmark dataset analyses to demonstrate its practical utility.
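For readers unfamiliar with the Kolmogorov–Nagumo average, it is the quasi-arithmetic mean obtained by transforming the values with a monotone generator, averaging, and transforming back. The sketch below illustrates only that general definition with familiar generators; it does not reproduce the Pareto-based generator or the clustering objective of the paper, and the function name `kn_average` is hypothetical.

```python
import math


def kn_average(values, phi, phi_inv):
    """Quasi-arithmetic mean: phi_inv of the arithmetic mean of phi(values)."""
    return phi_inv(sum(phi(v) for v in values) / len(values))


# Different generators recover familiar means.
arithmetic = kn_average([1.0, 4.0], lambda v: v, lambda v: v)
geometric = kn_average([1.0, 4.0], math.log, math.exp)
harmonic = kn_average([1.0, 3.0], lambda v: 1.0 / v, lambda v: 1.0 / v)

print(arithmetic, geometric, harmonic)
```

Choosing the generator is what selects the member of the family, which is presumably how a single formulation can cover several clustering methods.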


Author(s):  
Zachary R. McCaw ◽  
Hanna Julienne ◽  
Hugues Aschard

Abstract Although missing data are prevalent in applications, existing implementations of Gaussian mixture models (GMMs) require complete data. Standard practice is to perform complete case analysis or imputation prior to model fitting. Both approaches have serious drawbacks, potentially resulting in biased and unstable parameter estimates. Here we present MGMM, an R package for fitting GMMs in the presence of missing data. Using three case studies on real and simulated data sets, we demonstrate that, when the underlying distribution is close to a GMM, MGMM is more effective at recovering the true cluster assignments than state-of-the-art imputation followed by standard GMM. Moreover, MGMM provides an accurate assessment of cluster assignment uncertainty even when the generative distribution is not a GMM. This assessment may be used to identify unassignable observations. MGMM is available as an R package on CRAN: https://CRAN.R-project.org/package=MGMM.
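One way to see how a GMM can handle incomplete observations without imputation is to compute cluster responsibilities from the observed coordinates only. The sketch below is not the MGMM implementation and does not use its API: it assumes diagonal covariances (so the marginal density over observed coordinates is a product of univariate normals), marks missing values with `None`, and uses hypothetical function names.

```python
import math


def log_normal_pdf(x, mean, var):
    """Log density of a univariate normal distribution."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def responsibilities(obs, weights, means, variances):
    """Posterior cluster probabilities using only the observed coordinates.

    Missing coordinates (None) are marginalized out, which for a
    diagonal-covariance Gaussian means simply skipping them.
    """
    log_post = []
    for pi_k, mu_k, var_k in zip(weights, means, variances):
        lp = math.log(pi_k)
        for x, m, v in zip(obs, mu_k, var_k):
            if x is not None:
                lp += log_normal_pdf(x, m, v)
        log_post.append(lp)
    top = max(log_post)                      # subtract the max for stability
    unnorm = [math.exp(lp - top) for lp in log_post]
    total = sum(unnorm)
    return [u / total for u in unnorm]


weights = [0.3, 0.7]
means = [[0.0, 0.0], [0.0, 5.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]

r_informative = responsibilities([None, 5.0], weights, means, variances)
r_uninformative = responsibilities([1.0, None], weights, means, variances)
print(r_informative, r_uninformative)
```

In this toy setup the second observation carries no information about cluster membership, so its responsibilities fall back to the mixing weights. That is the kind of assignment uncertainty that could flag an observation as unassignable.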


2021 ◽  
Author(s):  
Yuen Ler Chow ◽  
Shantanu Singh ◽  
Anne E Carpenter ◽  
Gregory P. Way

A variational autoencoder (VAE) is a machine learning algorithm useful for generating a compressed and interpretable latent space. These representations have been generated from various biomedical data types and can be used to produce realistic-looking simulated data. However, vanilla VAEs suffer from entangled and uninformative latent spaces, which can be mitigated using other types of VAEs such as β-VAE and MMD-VAE. In this project, we evaluated the ability of VAEs to learn cell morphology characteristics derived from cell images. We trained and evaluated three VAE variants (vanilla VAE, β-VAE, and MMD-VAE) on cell morphology readouts and explored the generative capacity of each model to predict compound polypharmacology (the interactions of a drug with more than one target) using an approach called latent space arithmetic (LSA). To test the generalizability of the strategy, we also trained these VAEs using gene expression data of the same compound perturbations and found that gene expression provides complementary information. We found that the β-VAE and MMD-VAE disentangle morphology signals and reveal a more interpretable latent space. We reliably simulated morphology and gene expression readouts from certain compounds, thereby predicting cell states perturbed with compounds of known polypharmacology. Inferring cell state for specific drug mechanisms could aid researchers in developing and identifying targeted therapeutics and categorizing off-target effects in the future.
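The abstract does not spell out the arithmetic behind LSA, so the following is only a sketch of one common form, assumed here for illustration: average the latent vectors of cells treated with compound A, compound B, and a control, then add the two compound means and subtract the control mean. The function names and toy values are hypothetical, and the decoding step that would turn the predicted latent vector back into a morphology or expression profile is omitted.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length latent vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def lsa_predict(latent_a, latent_b, latent_control):
    """Predict the latent state of a combined perturbation as A + B - control."""
    a, b, c = (mean_vector(m) for m in (latent_a, latent_b, latent_control))
    return [ai + bi - ci for ai, bi, ci in zip(a, b, c)]


# Toy two-dimensional latent vectors, two samples per condition.
control = [[0.0, 1.0], [0.0, 1.0]]
compound_a = [[1.0, 1.0], [1.0, 1.0]]   # shifts the first latent dimension
compound_b = [[0.0, 3.0], [0.0, 3.0]]   # shifts the second latent dimension

predicted = lsa_predict(compound_a, compound_b, control)
print(predicted)
```

This kind of arithmetic is only meaningful when perturbation effects are roughly additive in the latent space, which is one reason a disentangled latent space matters for the approach.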


2014 ◽  
Author(s):  
Sean Ruddy ◽  
Marla Johnson ◽  
Elizabeth Purdom

The prevalence of sequencing experiments in genomics has led to an increased use of methods for count data in analyzing high-throughput genomic data. Shrinkage methods remain important for improving the performance of these statistical methods. A common example is that of gene expression data, where the counts per gene are often modeled as some form of an over-dispersed Poisson. In this case, shrinkage estimates of the per-gene dispersion parameter have led to improved estimation of dispersion when the number of samples is small. We address a different count setting introduced by the use of sequencing data: comparing differential proportional usage via an over-dispersed binomial model. This is motivated by our interest in testing for differential exon skipping in mRNA-Seq experiments. We introduce a novel method that models the dispersion based on the double binomial distribution proposed by Efron (1986). Our method (WEB-Seq) is an empirical Bayes strategy that produces a shrunken estimate of dispersion, effectively detects differential proportional usage, and has close ties to the weighted-likelihood strategy of edgeR developed for gene expression data (Robinson and Smyth, 2007; Robinson et al., 2010). We analyze its behavior on simulated data sets as well as real data and show that our method is fast, powerful and gives accurate control of the FDR compared to alternative approaches. We provide an implementation of our methods in the R package DoubleExpSeq, available on CRAN.


2019 ◽  
Vol 21 (5) ◽  
pp. 1818-1824 ◽  
Author(s):  
Qi Zhao ◽  
Yu Sun ◽  
Zekun Liu ◽  
Hongwan Zhang ◽  
Xingyang Li ◽  
...  

Abstract Unsupervised clustering of high-throughput gene expression data is widely adopted for cancer subtyping. However, cancer subtypes derived from a single dataset are usually not applicable across multiple datasets from different platforms. Merging different datasets is necessary to determine accurate and applicable cancer subtypes but remains difficult because of the batch effect. CrossICC is an R package designed for the unsupervised clustering of gene expression data from multiple datasets/platforms without the requirement of batch effect adjustment. CrossICC utilizes an iterative strategy to derive the optimal gene signature and cluster numbers from a consensus similarity matrix generated by consensus clustering. This package also provides abundant functions to visualize the identified subtypes and evaluate subtyping performance. We expect that CrossICC could be used to discover robust cancer subtypes with significant translational implications in personalized care for cancer patients. Availability and Implementation The package is implemented in R and available at GitHub (https://github.com/bioinformatist/CrossICC) and Bioconductor (http://bioconductor.org/packages/release/bioc/html/CrossICC.html) under the GPL v3 License.
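The consensus similarity matrix mentioned in this abstract can be illustrated with a small sketch. This is not CrossICC code and makes no claim about its internals: it assumes each clustering run labels every sample, and it records for each pair of samples the fraction of runs in which they share a cluster. The function name and the example runs are hypothetical.

```python
def consensus_matrix(runs):
    """Fraction of clustering runs in which each pair of samples co-clusters.

    `runs` is a list of label lists, one per clustering run, all of the
    same length. Label values need not be comparable across runs.
    """
    n = len(runs[0])
    counts = [[0.0] * n for _ in range(n)]
    for labels in runs:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1.0
    return [[c / len(runs) for c in row] for row in counts]


# Three runs over four samples; cluster IDs are arbitrary within each run.
runs = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 1, 1, 1],
]
matrix = consensus_matrix(runs)
for row in matrix:
    print([round(v, 2) for v in row])
```

Because only co-membership is counted, the matrix is unaffected by label switching between runs. Entries near 0 or 1 indicate stable pairings, and intermediate values indicate samples whose assignment depends on the run.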

