Parallel clustering of single cell transcriptomic data with split-merge sampling on Dirichlet process mixtures

Tiehang Duan; José P Pinto; Xiaohui Xie

doi:10.1093/bioinformatics/bty702

Bioinformatics ◽

10.1093/bioinformatics/bty702 ◽

2018 ◽

Vol 35 (6) ◽

pp. 953-961 ◽

Cited By ~ 3

Author(s):

Tiehang Duan ◽

José P Pinto ◽

Xiaohui Xie

Keyword(s):

Single Cell ◽

Dirichlet Process ◽

High Performance ◽

Supplementary Information ◽

Clustering Methods ◽

Dirichlet Process Mixture ◽

Computational Speed ◽

Clustering Quality ◽

Single Data ◽

Cell Transcriptome

Abstract. Motivation: With the development of droplet-based systems, massive single-cell transcriptome data have become available, enabling analysis of cellular and molecular processes at single-cell resolution, which is instrumental to understanding many biological processes. While state-of-the-art clustering methods have been applied to these data, they face challenges in the following aspects: (i) the clustering quality still needs to be improved; (ii) most models need prior knowledge of the number of clusters, which is not always available; (iii) there is a demand for faster computational speed. Results: We propose to tackle these challenges with Parallelized Split Merge Sampling on Dirichlet Process Mixture Model (the Para-DPMM model). Unlike classic DPMM methods that perform sampling on each single data point, the split merge mechanism samples on the cluster level, which significantly improves convergence and optimality of the result. The model is highly parallelized and can utilize the computing power of high performance computing (HPC) clusters, enabling massive inference on huge datasets. Experimental results show the model outperforms current widely used models in both clustering quality and computational speed. Availability and implementation: Source code is publicly available at https://github.com/tiehangd/Para_DPMM/tree/master/Para_DPMM_package. Supplementary information: Supplementary data are available at Bioinformatics online.
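To illustrate why a Dirichlet process mixture does not need the number of clusters fixed in advance, the following minimal Python sketch draws a partition from the Chinese restaurant process, a standard representation of the Dirichlet process prior. It is not the Para-DPMM implementation: it contains no likelihood term, no split-merge move and no parallelism, and the function name `sample_crp_partition` and the parameters `alpha` and `seed` are illustrative assumptions.

```python
import random


def sample_crp_partition(n_points, alpha, seed=0):
    """Draw cluster labels for n_points items from a Chinese restaurant process.

    alpha > 0 is the concentration parameter: larger values make new
    clusters more likely. (Hypothetical helper, prior only, no data model.)
    """
    rng = random.Random(seed)
    labels = []         # cluster index assigned to each item
    cluster_sizes = []  # current number of items in each cluster

    for _ in range(n_points):
        # An existing cluster k is chosen with weight equal to its size;
        # a brand-new cluster is opened with weight alpha.
        weights = cluster_sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(cluster_sizes):
            cluster_sizes.append(1)   # open a new cluster
        else:
            cluster_sizes[k] += 1     # join an existing cluster
        labels.append(k)

    return labels, cluster_sizes


labels, sizes = sample_crp_partition(n_points=1000, alpha=2.0)
print(len(sizes), "clusters; sizes:", sorted(sizes, reverse=True))
```

The number of clusters here is an outcome of sampling rather than an input. A full sampler would combine these prior weights with a likelihood for each cell's expression profile, and a split-merge scheme would additionally propose moves on whole clusters instead of reassigning one point at a time.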

Parallelized Inference for Single Cell Transcriptomic Clustering with Split Merge Sampling on DPMM Model

10.1101/271163 ◽

2018 ◽

Author(s):

Tiehang Duan ◽

José P. Pinto ◽

Xiaohui Xie

Keyword(s):

Single Cell ◽

High Performance ◽

Clustering Methods ◽

Single Data Point ◽

Computational Speed ◽

Clustering Quality ◽

Single Data ◽

Cell Transcriptome ◽

Single Cell Transcriptome ◽

Mean Time

Motivation: With the development of droplet-based systems, massive single-cell transcriptome data have become available, enabling analysis of cellular and molecular processes at single-cell resolution, which is instrumental to understanding many biological processes. While state-of-the-art clustering methods have been applied to these data, they face challenges in the following aspects: (1) the clustering quality still needs to be improved; (2) most models need prior knowledge of the number of clusters, which is not always available; (3) there is a demand for faster computational speed. Results: We propose to tackle these challenges with Parallelized Split Merge Sampling on Dirichlet Process Mixture Model (the Para-DPMM model). Unlike classic DPMM methods that perform sampling on each single data point, the split merge mechanism samples on the cluster level, which significantly improves convergence and optimality of the result. The model is highly parallelized and can utilize the computing power of high performance computing (HPC) clusters, enabling massive inference on huge datasets. Experimental results show the model achieves about 7% improvement in clustering accuracy for small datasets and more than 20% improvement for large challenging datasets compared with current widely used models. At the same time, the model's computing speed is significantly faster. Availability: Source code is publicly available at https://github.com/tiehangd/Para_DPMM/tree/master/Para_DPMM_package

SinNLRR: a robust subspace clustering method for cell type detection by non-negative and low-rank representation

Bioinformatics ◽

10.1093/bioinformatics/btz139 ◽

2019 ◽

Vol 35 (19) ◽

pp. 3642-3650 ◽

Cited By ~ 13

Author(s):

Ruiqing Zheng ◽

Min Li ◽

Zhenlan Liang ◽

Fang-Xiang Wu ◽

Yi Pan ◽

...

Keyword(s):

Single Cell ◽

Low Rank ◽

Supplementary Information ◽

Similarity Matrix ◽

Similarity Learning ◽

Clustering Methods ◽

Cell Type ◽

Gene Markers ◽

Adaptive Penalty ◽

New Perspective

Abstract. Motivation: The development of single-cell RNA-sequencing (scRNA-seq) provides a new perspective for studying biological problems at the single-cell level. One of the key issues in scRNA-seq analysis is to resolve the heterogeneity and diversity of cells, that is, to cluster the cells into several groups. However, many existing clustering methods are designed to analyze bulk RNA-seq data, so new scRNA-seq clustering methods are urgently needed. Moreover, the high noise in scRNA-seq data also poses many challenges for computational methods. Results: In this study, we propose a novel scRNA-seq cell type detection method based on similarity learning, called SinNLRR. The method is motivated by the self-expression of cells within the same group. Specifically, we impose a non-negative and low-rank structure on the similarity matrix. We apply the alternating direction method of multipliers to solve the optimization problem and propose an adaptive penalty selection method to avoid sensitivity to the parameters. The learned similarity matrix can be combined with spectral clustering, t-distributed stochastic neighbor embedding for visualization and the Laplacian score for prioritizing gene markers. In contrast to other scRNA-seq clustering methods, our method achieves more robust and accurate results on different datasets. Availability and implementation: Our MATLAB implementation of SinNLRR is available at https://github.com/zrq0123/SinNLRR. Supplementary information: Supplementary data are available at Bioinformatics online.

GPseudoClust: deconvolution of shared pseudo-profiles at single-cell resolution

10.1101/567115 ◽

2019 ◽

Author(s):

Magdalena E Strauss ◽

Paul DW Kirk ◽

John E Reid ◽

Lorenz Wernisch

Keyword(s):

Single Cell ◽

Time Course ◽

Gene Clusters ◽

Supplementary Information ◽

Clustering Methods ◽

Link Type ◽

Novel Approach ◽

Broad Array ◽

Recent Method ◽

Cell Data

Abstract. Motivation: Many methods have been developed to cluster genes on the basis of their changes in mRNA expression over time, using bulk RNA-seq or microarray data. However, single-cell data may present a particular challenge for these algorithms, since the temporal ordering of cells is not directly observed. One way to address this is to first use pseudotime methods to order the cells, and then apply clustering techniques for time course data. However, pseudotime estimates are subject to high levels of uncertainty, and failing to account for this uncertainty is liable to lead to erroneous and/or over-confident gene clusters. Results: The proposed method, GPseudoClust, is a novel approach that jointly infers pseudotemporal ordering and gene clusters, and quantifies the uncertainty in both. GPseudoClust combines a recent method for pseudotime inference with nonparametric Bayesian clustering methods, efficient MCMC sampling, and novel subsampling strategies which aid computation. We consider a broad array of simulated and experimental datasets to demonstrate the effectiveness of GPseudoClust in a range of settings. Availability: An implementation is available on GitHub: https://github.com/magStra/nonparametricSummaryPSM and https://github.com/magStra/[email protected] Supplementary information: Supplementary materials are available.

A Bayesian Nonparametric Model for Inferring Subclonal Populations from Structured DNA Sequencing Data

10.1101/2020.11.10.330183 ◽

2020 ◽

Author(s):

Shai He ◽

Aaron Schein ◽

Vishal Sarsani ◽

Patrick Flaherty

Keyword(s):

Dna Sequencing ◽

Single Cell ◽

Dirichlet Process ◽

Lymphoblastic Leukemia ◽

Nonparametric Model ◽

Dirichlet Process Mixture ◽

Sequencing Data ◽

Hierarchical Dirichlet Process ◽

Dirichlet Process Prior

There are distinguishing features or “hallmarks” of cancer that are found across tumors, individuals, and types of cancer, and these hallmarks can be driven by specific genetic mutations. Yet, within a single tumor there is often extensive genetic heterogeneity, as evidenced by single-cell and bulk DNA sequencing data. The goal of this work is to jointly infer the underlying genotypes of tumor subpopulations and the distribution of those subpopulations in individual tumors by integrating single-cell and bulk sequencing data. Understanding the genetic composition of the tumor at the time of treatment is important for the personalized design of targeted therapeutic combinations and for monitoring possible recurrence after treatment. We propose a hierarchical Dirichlet process mixture model that incorporates the correlation structure induced by a structured sampling arrangement, and we show that this model improves the quality of inference. We develop a representation of the hierarchical Dirichlet process prior as a Gamma-Poisson hierarchy and use this representation to derive a fast Gibbs sampling inference algorithm using the augment-and-marginalize method. Experiments with simulated data show that our model outperforms standard numerical and statistical methods for decomposing admixed count data. Analysis of a real acute lymphoblastic leukemia sequencing dataset shows that our model improves upon state-of-the-art bioinformatic methods. An interpretation of the results of our model on this real dataset reveals co-mutated loci across samples.

BAMM-SC: A Bayesian mixture model for clustering droplet-based single cell transcriptomic data from population studies

10.1101/392662 ◽

2018 ◽

Author(s):

Zhe Sun ◽

Li Chen ◽

Hongyi Xin ◽

Qianhui Huang ◽

Anthony R Cillo ◽

...

Keyword(s):

Single Cell ◽

Single Cells ◽

R Package ◽

Clustering Methods ◽

Model Framework ◽

Bayesian Hierarchical ◽

Bayesian Mixture ◽

Population Scale ◽

Cell Transcriptome ◽

Single Cell Transcriptome

Abstract. The recently developed droplet-based single cell transcriptome sequencing (scRNA-seq) technology makes it feasible to perform a population-scale scRNA-seq study, in which the transcriptome is measured for tens of thousands of single cells from multiple individuals. Despite the advances of many clustering methods, there are few tailored methods for population-scale scRNA-seq studies. Here, we have developed a BAyesian Mixture Model for Single Cell sequencing (BAMM-SC) method to cluster scRNA-seq data from multiple individuals simultaneously. Specifically, BAMM-SC takes raw data as input and can account for data heterogeneity and batch effects among multiple individuals in a unified Bayesian hierarchical model framework. Results from extensive simulations and application of BAMM-SC to in-house scRNA-seq datasets using blood, lung and skin cells from humans or mice demonstrated that BAMM-SC outperformed existing clustering methods, with improved clustering accuracy and reduced impact from batch effects. BAMM-SC has been implemented in a user-friendly R package with a detailed tutorial available on www.pitt.edu/~Cwec47/singlecell.html.

scRNABatchQC: multi-samples quality control for single cell RNA-seq data

Bioinformatics ◽

10.1093/bioinformatics/btz601 ◽

2019 ◽

Vol 35 (24) ◽

pp. 5306-5308

Author(s):

Qi Liu ◽

Quanhu Sheng ◽

Jie Ping ◽

Marisol Adelina Ramirez ◽

Ken S Lau ◽

...

Keyword(s):

Single Cell ◽

R Package ◽

Supplementary Information ◽

Rna Seq ◽

Technical Artifact ◽

Multiple Sample ◽

Systematic Biases ◽

Cell Transcriptome ◽

Single Cell Transcriptome ◽

Spurious Results

Abstract. Summary: Single cell RNA sequencing is a revolutionary technique for characterizing inter-cellular transcriptomic heterogeneity. However, the data are noise-prone because gene expression is often driven by both technical artifacts and genuine biological variation. Proper disentanglement of these two effects is critical to prevent spurious results. While several tools exist to detect and remove low-quality cells in a single cell RNA-seq dataset, there is a lack of approaches for examining consistency between sample sets and detecting systematic biases, batch effects and outliers. We present scRNABatchQC, an R package that compares multiple sample sets simultaneously over numerous technical and biological features, which gives valuable hints for distinguishing technical artifacts from biological variation. scRNABatchQC helps identify and systematically characterize sources of variability in single cell transcriptome data. The examination of consistency across datasets allows visual detection of biases and outliers. Availability and implementation: scRNABatchQC is freely available at https://github.com/liuqivandy/scRNABatchQC as an R package. Supplementary information: Supplementary data are available at Bioinformatics online.

A statistical approach for tracking clonal dynamics in cancer using longitudinal next-generation sequencing data

Bioinformatics ◽

10.1093/bioinformatics/btaa672 ◽

2020 ◽

Author(s):

Dimitrios V Vavoulis ◽

Anthony Cutts ◽

Jenny C Taylor ◽

Anna Schuh

Keyword(s):

Dirichlet Process ◽

Model Performance ◽

Supplementary Information ◽

Next Generation Sequencing Data ◽

Model Parameters ◽

Dirichlet Process Mixture ◽

Sequencing Data ◽

Sample Collection ◽

Liquid Biopsies ◽

Cross Sectional

Abstract. Motivation: Tumours are composed of distinct cancer cell populations (clones), which continuously adapt to their local micro-environment. Standard methods for clonal deconvolution seek to identify groups of mutations and estimate the prevalence of each group in the tumour, while considering its purity and copy number profile. These methods have been applied on cross-sectional data and on longitudinal data after discarding information on the timing of sample collection. Two key questions are how can we incorporate such information in our analyses and is there any benefit in doing so? Results: We developed a clonal deconvolution method, which incorporates explicitly the temporal spacing of longitudinally sampled tumours. By merging a Dirichlet Process Mixture Model with Gaussian Process priors and using as input a sequence of several sparsely collected samples, our method can reconstruct the temporal profile of the abundance of any mutation cluster supported by the data as a continuous function of time. We benchmarked our method on whole genome, whole exome and targeted sequencing data from patients with chronic lymphocytic leukaemia, on liquid biopsy data from a patient with melanoma and on synthetic data and we found that incorporating information on the timing of tissue collection improves model performance, as long as data of sufficient volume and complexity are available for estimating free model parameters. Thus, our approach is particularly useful when collecting a relatively long sequence of tumour samples is feasible, as in liquid cancers (e.g. leukaemia) and liquid biopsies. Availability and implementation: The statistical methodology presented in this paper is freely available at github.com/dvav/clonosGP. Supplementary information: Supplementary data are available at Bioinformatics online.

Artificial-cell-type aware cell-type classification in CITE-seq

Bioinformatics ◽

10.1093/bioinformatics/btaa467 ◽

2020 ◽

Vol 36 (Supplement_1) ◽

pp. i542-i550 ◽

Cited By ~ 1

Author(s):

Qiuyu Lian ◽

Hongyi Xin ◽

Jianzhu Ma ◽

Liza Konnikova ◽

Wei Chen ◽

...

Keyword(s):

Cell Surface ◽

Single Cell ◽

Domain Knowledge ◽

Cell Types ◽

Surface Marker ◽

Supplementary Information ◽

Clustering Methods ◽

Cell Type ◽

Artificial Cell ◽

Marker Proteins

Abstract. Motivation: Cellular Indexing of Transcriptomes and Epitopes by sequencing (CITE-seq) couples the measurement of surface marker proteins with simultaneous sequencing of mRNA at the single-cell level, which brings accurate cell surface phenotyping to single-cell transcriptomics. Unfortunately, multiplets in CITE-seq datasets create artificial cell types (ACT) and complicate the automation of cell surface phenotyping. Results: We propose CITE-sort, an artificial-cell-type aware surface marker clustering method for CITE-seq. CITE-sort is aware of and robust to multiplet-induced ACT. We benchmarked CITE-sort with real and simulated CITE-seq datasets and compared CITE-sort against canonical clustering methods. We show that CITE-sort produces the best clustering performance across the board. CITE-sort not only accurately identifies real biological cell types (BCT) but also consistently and reliably separates multiplet-induced artificial-cell-type droplet clusters from real BCT droplet clusters. In addition, CITE-sort organizes its clustering process with a binary tree, which facilitates easy interpretation and verification of its clustering result and simplifies cell-type annotation with domain knowledge in CITE-seq. Availability and implementation: http://github.com/QiuyuLian/CITE-sort. Supplementary information: Supplementary data are available at Bioinformatics online.

Non-Parametric Bayesian Subspace Models for Acoustic Unit Discovery

10.36227/techrxiv.16618135 ◽

2021 ◽

Author(s):

Lucas Ondel

Keyword(s):

Dirichlet Process ◽

Experimental Results ◽

Parametric Models ◽

Dirichlet Process Mixture ◽

Clustering Quality ◽

Segmentation Accuracy ◽

Target Data ◽

Non Parametric

This work investigates subspace non-parametric models for the task of learning a set of acoustic units from unlabeled speech recordings. We constrain the base measure of a Dirichlet process mixture with a phonetic subspace, estimated from other source languages, to build an "educated prior", thereby forcing the learned acoustic units to resemble phones of known source languages. Two types of models are proposed: (i) the Subspace HMM (SHMM), which assumes that the phonetic subspace is the same for every language, and (ii) the Hierarchical-Subspace HMM (H-SHMM), which relaxes this assumption and allows a language-specific subspace to be estimated on the unlabeled target data. These models are applied to three languages (English, Yoruba and Mboshi) and compared with various competitive acoustic unit discovery baselines. Experimental results show that both subspace models outperform other systems in terms of clustering quality and segmentation accuracy. Moreover, we observe that the H-SHMM provides results superior to the SHMM, supporting the idea that language-specific priors are preferable to language-agnostic priors for acoustic unit discovery.

Bayesian non-parametric clustering of single-cell mutation profiles

10.1101/2020.01.15.907345 ◽

2020 ◽

Cited By ~ 1

Author(s):

Nico Borgsmüller ◽

Jose Bonet ◽

Francesco Marass ◽

Abel Gonzalez-Perez ◽

Nuria Lopez-Bigas ◽

...

Keyword(s):

Single Cell ◽

Dirichlet Process ◽

Tumor Heterogeneity ◽

Missing Values ◽

Parametric Method ◽

Simulated Data ◽

Error Rates ◽

Data Sets ◽

Dirichlet Process Mixture ◽

Non Parametric

Abstract. The high resolution of single-cell DNA sequencing (scDNA-seq) offers great potential to resolve intra-tumor heterogeneity by distinguishing clonal populations based on their mutation profiles. However, the increasing size of scDNA-seq data sets and technical limitations, such as high error rates and a large proportion of missing values, complicate this task and limit the applicability of existing methods. Here we introduce BnpC, a novel non-parametric method to cluster individual cells into clones and infer their genotypes based on their noisy mutation profiles. BnpC employs a Dirichlet process mixture model coupled with a Markov chain Monte Carlo sampling scheme, including a modified split-merge move and a novel posterior estimator to predict clones and genotypes. We benchmarked our method comprehensively against state-of-the-art methods on simulated data using various data sizes, and applied it to three cancer scDNA-seq data sets. On simulated data, BnpC compared favorably against current methods in terms of accuracy, runtime, and scalability. Its inferred genotypes were the most accurate, and it was the only method able to run and produce results on data sets with 10,000 cells. On tumor scDNA-seq data, BnpC was able to identify clonal populations missed by the original cluster analysis but supported by supplementary experimental data. With ever growing scDNA-seq data sets, scalable and accurate methods such as BnpC will become increasingly relevant, not only to resolve intra-tumor heterogeneity but also as a pre-processing step to reduce data size. BnpC is freely available under MIT license at https://github.com/cbg-ethz/BnpC.
