Fast Hierarchical Bayesian Analysis of Population Structure

Mapping Intimacies ◽

10.1101/454355 ◽

2018 ◽

Cited By ~ 1

Author(s):

Gerry Tonkin-Hill ◽

John A. Lees ◽

Stephen D. Bentley ◽

Simon D.W. Frost ◽

Jukka Corander

Keyword(s):

Dirichlet Process ◽

Phylogenetic Trees ◽

Marginal Likelihood ◽

Simulated Data ◽

R Package ◽

Multilocus Genotype ◽

Dirichlet Process Mixture ◽

Model Based Clustering ◽

Hierarchical Bayesian Analysis ◽

Model Based

We present fastbaps, a fast solution to the genetic clustering problem. Fastbaps rapidly identifies an approximate fit to a Dirichlet Process Mixture model (DPM) for clustering multilocus genotype data. Our efficient model-based clustering approach is able to cluster datasets 10-100 times larger than the existing model-based methods, which we demonstrate by analysing an alignment of over 110,000 sequences of HIV-1 pol genes. We also provide a method for rapidly partitioning an existing hierarchy in order to maximise the DPM model marginal likelihood, allowing us to split phylogenetic trees into clades and subclades using a population genomic model. Extensive tests on simulated data as well as a diverse set of real bacterial and viral datasets show that fastbaps provides comparable or improved solutions to previous model-based methods, while generally being significantly faster. The method is made freely available under an open source MIT licence as an easy to use R package at https://github.com/gtonkinhill/fastbaps.
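As a rough illustration of what "maximising the DPM model marginal likelihood" over partitions can mean, the sketch below scores a candidate clustering of aligned sequences. It is not the fastbaps implementation. It assumes independent loci, a symmetric Dirichlet prior with a hypothetical concentration `alpha`, and a four-letter alphabet, and it leaves out the Dirichlet process prior over partitions. The function names are invented for this example.

```python
from collections import Counter
from math import lgamma

ALPHABET = "ACGT"  # assumption: nucleotide data with no gaps or ambiguity codes


def cluster_log_ml(seqs, alpha=1.0):
    """Log marginal likelihood of one cluster of equal-length sequences.

    Each alignment column is treated as an independent categorical variable
    with a symmetric Dirichlet(alpha) prior, integrated out analytically.
    """
    k = len(ALPHABET)
    n = len(seqs)
    total = 0.0
    for column in zip(*seqs):
        counts = Counter(column)
        total += lgamma(k * alpha) - lgamma(k * alpha + n)
        total += sum(lgamma(alpha + counts[a]) - lgamma(alpha) for a in ALPHABET)
    return total


def partition_log_ml(seqs, labels, alpha=1.0):
    """Sum the per-cluster log marginal likelihoods for a labelled partition."""
    clusters = {}
    for seq, label in zip(seqs, labels):
        clusters.setdefault(label, []).append(seq)
    return sum(cluster_log_ml(members, alpha) for members in clusters.values())


if __name__ == "__main__":
    seqs = ["AAAA", "AAAA", "CCCC", "CCCC"]
    # A partition that groups identical sequences should score higher
    # than one that mixes them.
    print(partition_log_ml(seqs, [0, 0, 1, 1]) > partition_log_ml(seqs, [0, 1, 0, 1]))
```

A search procedure, whether greedy merging or cutting an existing tree, would compare scores like these across candidate partitions and keep the highest-scoring one.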


Model-Based Clustering for Expression Data via a Dirichlet Process Mixture Model

Bayesian Inference for Gene Expression and Proteomics ◽

10.1017/cbo9780511584589.011 ◽

2009 ◽

pp. 201-218 ◽

Cited By ~ 70

Author(s):

David B. Dahl ◽

Marina Vannucci

Keyword(s):

Mixture Model ◽

Dirichlet Process ◽

Expression Data ◽

Dirichlet Process Mixture ◽

Dirichlet Process Mixture Model ◽

Model Based Clustering ◽

Model Based


An Approach to Detect the Internet Water Army via Dirichlet Process Mixture Model Based GSP Algorithm

Applications and Techniques in Information Security - Communications in Computer and Information Science ◽

10.1007/978-3-662-45670-5_9 ◽

2014 ◽

pp. 82-95 ◽

Cited By ~ 1

Author(s):

Dan Li ◽

Qian Li ◽

Yue Hu ◽

Wenjia Niu ◽

Jianlong Tan ◽

...

Keyword(s):

Mixture Model ◽

Dirichlet Process ◽

The Internet ◽

Dirichlet Process Mixture ◽

Dirichlet Process Mixture Model ◽

Model Based


Fast approximate inference for variable selection in Dirichlet process mixtures, with an application to pan-cancer proteomics

Statistical Applications in Genetics and Molecular Biology ◽

10.1515/sagmb-2018-0065 ◽

2019 ◽

Vol 18 (6) ◽

Author(s):

Oliver M. Crook ◽

Laurent Gatto ◽

Paul D. W. Kirk

Keyword(s):

Variable Selection ◽

Dirichlet Process ◽

Bayesian Model ◽

Bayesian Model Averaging ◽

Model Averaging ◽

R Package ◽

The Cancer Genome Atlas ◽

Fast Method ◽

Model Based Clustering ◽

Pan Cancer

The Dirichlet Process (DP) mixture model has become a popular choice for model-based clustering, largely because it allows the number of clusters to be inferred. The sequential updating and greedy search (SUGS) algorithm (Wang & Dunson, 2011) was proposed as a fast method for performing approximate Bayesian inference in DP mixture models, by posing clustering as a Bayesian model selection (BMS) problem and avoiding the use of computationally costly Markov chain Monte Carlo methods. Here we consider how this approach may be extended to permit variable selection for clustering, and also demonstrate the benefits of Bayesian model averaging (BMA) in place of BMS. Through an array of simulation examples and well-studied examples from cancer transcriptomics, we show that our method performs competitively with the current state-of-the-art, while also offering computational benefits. We apply our approach to reverse-phase protein array (RPPA) data from The Cancer Genome Atlas (TCGA) in order to perform a pan-cancer proteomic characterisation of 5157 tumour samples. We have implemented our approach, together with the original SUGS algorithm, in an open-source R package named sugsvarsel, which accelerates analysis by performing intensive computations in C++ and provides automated parallel processing. The R package is freely available from: https://github.com/ococrook/sugsvarsel


BoskR – Testing Adequacy of Diversification Models Using Tree Shape

10.1101/2020.12.21.423829 ◽

2020 ◽

Cited By ~ 1

Author(s):

Orlando Schwery ◽

Brian C. O’Meara

Keyword(s):

Phylogenetic Trees ◽

Graph Laplacian ◽

R Package ◽

Laplacian Spectrum ◽

Tree Shape ◽

Model Adequacy ◽

Model Based ◽

Absolute Sense ◽

Best Fit

The study of diversification largely relies on model-based approaches, estimating rates of speciation and extinction from phylogenetic trees. While a plethora of different models exist – all with different features, strengths and weaknesses – there is increasing concern about the reliability of the inference we gain from them. Apart from simply finding the model with the best fit for the data, we should find ways to assess a model’s suitability to describe the data in an absolute sense. The R package BoskR implements a simple way of judging a model’s adequacy for a given phylogeny using metrics for tree shape, assuming that a model is inadequate for a phylogeny if it produces trees that are consistently dissimilar in shape from the tree that should be analyzed. Tree shape is assessed via metrics derived from the tree’s modified graph Laplacian spectrum, as provided by RPANDA. We exemplify the use of the method using simulated and empirical example phylogenies. BoskR was mostly able to correctly distinguish trees simulated under clearly different models and revealed that not all models are adequate for the empirical example trees. We believe the metrics of tree shape to be an intuitive and relevant means of assessing diversification model adequacy. Furthermore, by implementing the approach in an openly available R package, we enable and encourage researchers to adopt adequacy testing into their workflow.


MatTransMix: an R Package for Matrix Model-Based Clustering and Parsimonious Mixture Modeling

Journal of Classification ◽

10.1007/s00357-021-09401-9 ◽

2021 ◽

Author(s):

Xuwen Zhu ◽

Shuchismita Sarkar ◽

Volodymyr Melnykov

Keyword(s):

Matrix Model ◽

R Package ◽

Mixture Modeling ◽

Model Based Clustering ◽

Model Based


Model-based clustering for populations of networks

Statistical Modelling ◽

10.1177/1471082x19871128 ◽

2019 ◽

Vol 20 (1) ◽

pp. 9-29 ◽

Cited By ~ 2

Author(s):

Mirko Signorelli ◽

Ernst C. Wit

Keyword(s):

Mixed Models ◽

Large Scale ◽

Simulated Data ◽

Likelihood Estimation ◽

Automatic Monitoring ◽

Model Based Clustering ◽

Model Based ◽

Proposed Model ◽

The Em Algorithm ◽

Monitoring Devices

Until recently, data on populations of networks were rarely available. However, with the advancement of automatic monitoring devices and the growing social and scientific interest in networks, such data has become more widely available. From sociological experiments involving cognitive social structures to fMRI scans revealing large-scale brain networks of groups of patients, there is a growing awareness that we urgently need tools to analyse populations of networks and particularly to model the variation between networks due to covariates. We propose a model-based clustering method based on mixtures of generalized linear (mixed) models that can be employed to describe the joint distribution of a population of networks in a parsimonious manner and to identify subpopulations of networks that share certain topological properties of interest (degree distribution, community structure, effect of covariates on the presence of an edge, etc.). Maximum likelihood estimation for the proposed model can be efficiently carried out with an implementation of the EM algorithm. We assess the performance of this method on simulated data and conclude with an example application on advice networks in a small business.


Variable selection in model-based clustering using multilocus genotype data

Advances in Data Analysis and Classification ◽

10.1007/s11634-009-0043-x ◽

2009 ◽

Vol 3 (2) ◽

pp. 109-134 ◽

Cited By ~ 6

Author(s):

Wilson Toussile ◽

Elisabeth Gassiat

Keyword(s):

Variable Selection ◽

Multilocus Genotype ◽

Genotype Data ◽

Model Based Clustering ◽

Model Based


Model-based clustering with mclust R package: Multivariate assessment of mathematics performance of students in Qatar

10.36334/modsim.2021.a1.alzahrani ◽

2021 ◽

Keyword(s):

R Package ◽

Mathematics Performance ◽

Model Based Clustering ◽

Model Based


BayesBinMix: an R Package for Model Based Clustering of Multivariate Binary Data

The R Journal ◽

10.32614/rj-2017-022 ◽

2017 ◽

Vol 9 (1) ◽

pp. 403 ◽

Cited By ~ 5

Author(s):

Panagiotis Papastamoulis ◽

Magnus Rattray

Keyword(s):

Binary Data ◽

R Package ◽

Model Based Clustering ◽

Model Based ◽

Multivariate Binary Data


Bayesian non-parametric clustering of single-cell mutation profiles

10.1101/2020.01.15.907345 ◽

2020 ◽

Cited By ~ 1

Author(s):

Nico Borgsmüller ◽

Jose Bonet ◽

Francesco Marass ◽

Abel Gonzalez-Perez ◽

Nuria Lopez-Bigas ◽

...

Keyword(s):

Single Cell ◽

Dirichlet Process ◽

Tumor Heterogeneity ◽

Missing Values ◽

Parametric Method ◽

Simulated Data ◽

Error Rates ◽

Data Sets ◽

Dirichlet Process Mixture ◽

Non Parametric

The high resolution of single-cell DNA sequencing (scDNA-seq) offers great potential to resolve intra-tumor heterogeneity by distinguishing clonal populations based on their mutation profiles. However, the increasing size of scDNA-seq data sets and technical limitations, such as high error rates and a large proportion of missing values, complicate this task and limit the applicability of existing methods. Here we introduce BnpC, a novel non-parametric method to cluster individual cells into clones and infer their genotypes based on their noisy mutation profiles. BnpC employs a Dirichlet process mixture model coupled with a Markov chain Monte Carlo sampling scheme, including a modified split-merge move and a novel posterior estimator to predict clones and genotypes. We benchmarked our method comprehensively against state-of-the-art methods on simulated data using various data sizes, and applied it to three cancer scDNA-seq data sets. On simulated data, BnpC compared favorably against current methods in terms of accuracy, runtime, and scalability. Its inferred genotypes were the most accurate, and it was the only method able to run and produce results on data sets with 10,000 cells. On tumor scDNA-seq data, BnpC was able to identify clonal populations missed by the original cluster analysis but supported by supplementary experimental data. With ever growing scDNA-seq data sets, scalable and accurate methods such as BnpC will become increasingly relevant, not only to resolve intra-tumor heterogeneity but also as a pre-processing step to reduce data size. BnpC is freely available under MIT license at https://github.com/cbg-ethz/BnpC.
