BinSanity: Unsupervised Clustering of Environmental Microbial Assemblies Using Coverage and Affinity Propagation

Mapping Intimacies ◽

10.1101/069567 ◽

2016 ◽

Cited By ~ 1

Author(s):

Elaina Graham ◽

John Heidelberg ◽

Benjamin Tully

Keyword(s):

Neural Networks ◽

Microbial Community ◽

Clustering Algorithm ◽

Gaussian Mixture Models ◽

Gaussian Mixture ◽

Affinity Propagation ◽

Adjusted Rand Index ◽

Two Samples ◽

Low Coverage ◽

Gut Metagenome

Abstract Metagenomics has become an integral part of defining microbial diversity in various environments. Many ecosystems have characteristically low biomass and few cultured representatives. Linking potential metabolisms to phylogeny in environmental microorganisms is important for interpreting microbial community functions and the impacts these communities have on geochemical cycles. However, with metagenomic studies there is the computational hurdle of ‘binning’ contigs into phylogenetically related units or putative genomes. Binning methods have been implemented with varying approaches such as k-means clustering, Gaussian mixture models, hierarchical clustering, neural networks, and two-way clustering; however, many of these suffer from biases against low coverage/abundance organisms and closely related taxa/strains. We are introducing a new binning method, BinSanity, that utilizes the clustering algorithm affinity propagation (AP) to cluster assemblies using coverage alone, which removes potential composition-based biases in clustering contigs but requires a minimum of two samples. To increase fidelity, a refinement script was developed that uses composition data (tetranucleotide frequency and %G+C content) to refine bins containing multiple source organisms. This separation of composition- and coverage-based signatures reduces clustering bias for closely related taxa. BinSanity was developed and tested on artificial metagenomes varying in size and complexity. Results indicate that this implementation of AP led to a higher precision, recall, and Adjusted Rand Index than five commonly implemented methods. When tested on a previously published infant gut metagenome, BinSanity generated high completion and low redundancy bins corresponding with the published metagenome-assembled genomes.
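
The coverage-only clustering described above can be illustrated by the pairwise similarity matrix that affinity propagation takes as input. The sketch below is a minimal illustration and not BinSanity's own code: the log transform, the negative squared Euclidean similarity, the median "preference", and the contig names and coverage values are all assumptions chosen for the example. It also shows why at least two samples matter, since each contig is compared by its coverage profile across samples.

```python
import math
from statistics import median

def log_transform(profile):
    """Log-scale one contig's coverage values; the +1 keeps zero coverage finite."""
    return [math.log10(c + 1.0) for c in profile]

def similarity_matrix(profiles):
    """Negative squared Euclidean distance between every pair of coverage profiles."""
    n = len(profiles)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sim[i][j] = -sum((a - b) ** 2 for a, b in zip(profiles[i], profiles[j]))
    return sim

def median_preference(sim):
    """Median off-diagonal similarity, a common default for the AP 'preference'."""
    n = len(sim)
    return median(sim[i][j] for i in range(n) for j in range(n) if i != j)

# Hypothetical coverage of four contigs across two samples (made-up numbers).
coverage = {
    "contig_1": [120.0, 15.0],
    "contig_2": [118.0, 14.0],
    "contig_3": [3.0, 40.0],
    "contig_4": [2.5, 42.0],
}
names = list(coverage)
profiles = [log_transform(coverage[name]) for name in names]
sim = similarity_matrix(profiles)
pref = median_preference(sim)

# Contigs with similar coverage profiles score closer to zero (more similar).
print(sim[0][1] > sim[0][2])
```

Any affinity propagation implementation could then be run on `sim` with `pref` on the diagonal; the choice of preference influences how many bins are produced.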


BinSanity: unsupervised clustering of environmental microbial assemblies using coverage and affinity propagation

PeerJ ◽

10.7717/peerj.3035 ◽

2017 ◽

Vol 5 ◽

pp. e3035 ◽

Cited By ~ 58

Author(s):

Elaina D. Graham ◽

John F. Heidelberg ◽

Benjamin J. Tully

Keyword(s):

Neural Networks ◽

Microbial Community ◽

Mixture Models ◽

Clustering Algorithm ◽

Gaussian Mixture Models ◽

Gc Content ◽

Gaussian Mixture ◽

Affinity Propagation ◽

Adjusted Rand Index ◽

Low Coverage

Metagenomics has become an integral part of defining microbial diversity in various environments. Many ecosystems have characteristically low biomass and few cultured representatives. Linking potential metabolisms to phylogeny in environmental microorganisms is important for interpreting microbial community functions and the impacts these communities have on geochemical cycles. However, with metagenomic studies there is the computational hurdle of ‘binning’ contigs into phylogenetically related units or putative genomes. Binning methods have been implemented with varying approaches such as k-means clustering, Gaussian mixture models, hierarchical clustering, neural networks, and two-way clustering; however, many of these suffer from biases against low coverage/abundance organisms and closely related taxa/strains. We are introducing a new binning method, BinSanity, that utilizes the clustering algorithm affinity propagation (AP) to cluster assemblies using coverage, with composition-based refinement (tetranucleotide frequency and percent GC content) to optimize bins containing multiple source organisms. This separation of composition- and coverage-based clustering reduces bias for closely related taxa. BinSanity was developed and tested on artificial metagenomes varying in size and complexity. Results indicate that BinSanity has a higher precision, recall, and Adjusted Rand Index compared to five commonly implemented methods. When tested on a previously published environmental metagenome, BinSanity generated high completion and low redundancy bins corresponding with the published metagenome-assembled genomes.
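
The Adjusted Rand Index used in this evaluation compares a predicted binning against the known source genome of each contig, corrected for chance agreement. The following is a minimal sketch of the standard pair-counting formula; the function name and the example labels are illustrative assumptions, not taken from the paper.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Pair-counting Adjusted Rand Index between two labelings of the same items."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label lists must have the same length")
    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: the labelings trivially agree

    # Contingency counts: how many items share each (true, predicted) label pair.
    pair_counts = Counter(zip(labels_true, labels_pred))
    sum_pairs = sum(comb(c, 2) for c in pair_counts.values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = sum_true * sum_pred / total_pairs  # agreement expected by chance
    maximum = (sum_true + sum_pred) / 2.0
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both labelings put everything in one group
    return (sum_pairs - expected) / (maximum - expected)

# Hypothetical example: six contigs from genomes A and B, assigned to two bins.
true_genomes = ["A", "A", "A", "B", "B", "B"]
predicted_bins = ["bin1", "bin1", "bin2", "bin2", "bin2", "bin2"]
print(adjusted_rand_index(true_genomes, predicted_bins))
```

A value of 1 means the bins reproduce the source genomes exactly (bin names do not matter), while values near 0 indicate agreement no better than random assignment.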


Comparison of dimensionality reduction and clustering methods for SARS-CoV-2 genome

Bulletin of Electrical Engineering and Informatics ◽

10.11591/eei.v10i4.2803 ◽

2021 ◽

Vol 10 (4) ◽

pp. 2170-2180

Author(s):

Untari N. Wisesty ◽

Tati Rajab Mengko

Keyword(s):

Dimensionality Reduction ◽

Dimensional Reduction ◽

Clustering Algorithm ◽

Sequence Data ◽

Clustering Algorithms ◽

Gaussian Mixture Models ◽

Reduction Process ◽

Principal Component ◽

Gaussian Mixture ◽

Clustering Methods

This paper analyzes SARS-CoV-2 genome variation by comparing the results of genome clustering using several clustering algorithms and the distribution of sequences in each cluster. The clustering algorithms used are K-means, Gaussian mixture models, agglomerative hierarchical clustering, mean-shift clustering, and DBSCAN. However, clustering algorithms have difficulty grouping very high-dimensional data such as genome data, so a dimensionality reduction step is needed. In this research, dimensionality reduction was carried out using principal component analysis (PCA) and an autoencoder method, with three models that produce 2, 10, and 50 features. The main contributions are the dimensionality reduction and clustering scheme for SARS-CoV-2 sequence data and the performance analysis of each experiment on each scheme and the hyperparameters of each method. Based on the experimental results, the combination of PCA and DBSCAN achieves the highest silhouette score of 0.8770, with three clusters, when using two features. However, dimensionality reduction using the autoencoder needs more iterations to converge. In testing with Indonesian sequence data, more than half of the sequences fall into one cluster and the rest are distributed between the other two clusters.
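
The silhouette score used to rank these schemes can be sketched as follows. This is a minimal pure-Python illustration with Euclidean distance; the function names and the toy points are assumptions, and the sketch does not treat DBSCAN noise labels specially.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)  # convention: a singleton cluster scores 0
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(euclidean(points[i], points[j]) for j in own) / len(own)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append(0.0 if denom == 0 else (b - a) / denom)
    return sum(scores) / len(scores)

# Toy two-feature embedding with two well-separated groups (made-up values).
points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
labels = [0, 0, 1, 1]
score = silhouette_score(points, labels)
print(round(score, 4))
```

Scores close to 1 indicate compact, well-separated clusters, which is the sense in which a value such as 0.8770 signals a clean partition.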


flowEMMi: an automated model-based clustering tool for microbial cytometric data

BMC Bioinformatics ◽

10.1186/s12859-019-3152-3 ◽

2019 ◽

Vol 20 (1) ◽

Cited By ~ 2

Author(s):

Joachim Ludwig ◽

Christian Höner zu Siederdissen ◽

Zishu Liu ◽

Peter F. Stadler ◽

Susann Müller

Keyword(s):

Microbial Community ◽

Gaussian Mixture Models ◽

Gaussian Mixture ◽

Medical Diagnostics ◽

Main Concern ◽

Accurate Identification ◽

Technical Noise ◽

Running Time ◽

Model Based Clustering ◽

Model Based

Abstract Background Flow cytometry (FCM) is a powerful single-cell based measurement method to ascertain multidimensional optical properties of millions of cells. FCM is widely used in medical diagnostics and health research. There is also a broad range of applications in the analysis of complex microbial communities. The main concern in microbial community analyses is to track the dynamics of microbial subcommunities. So far, this can be achieved with the help of time-consuming manual clustering procedures that require extensive user-dependent input. In addition, several tools have recently been developed by using different approaches which, however, focus mainly on the clustering of medical FCM data or of microbial samples with a well-known background, while much less work has been done on high-throughput, online algorithms for two-channel FCM. Results We bridge this gap with flowEMMi, a model-based clustering tool based on multivariate Gaussian mixture models with subsampling and foreground/background separation. These extensions provide a fast and accurate identification of cell clusters in FCM data, in particular for microbial community FCM data that are often affected by irrelevant information like technical noise, beads or cell debris. flowEMMi outperforms other available tools with regard to running time and information content of the clustering results and provides near-online results and optional heuristics to reduce the running-time further. Conclusions flowEMMi is a useful tool for the automated cluster analysis of microbial FCM data. It overcomes the user-dependent and time-consuming manual clustering procedure and provides consistent results with ancillary information and statistical proof.
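
Model-based clustering with Gaussian mixture models is usually fitted by expectation-maximization (EM). The sketch below shows the idea in one dimension only; it is a deliberately simplified assumption for illustration, not the tool's multivariate implementation, and it omits the subsampling and foreground/background separation mentioned in the abstract. The function names and synthetic data are hypothetical.

```python
import math
import random

def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def fit_gmm_1d(data, n_components=2, n_iter=100):
    """Fit a 1-D Gaussian mixture with plain EM; returns (weights, means, variances)."""
    lo, hi = min(data), max(data)
    # Spread the initial means evenly across the data range.
    means = [lo + (hi - lo) * (k + 0.5) / n_components for k in range(n_components)]
    overall_mean = sum(data) / len(data)
    overall_var = sum((x - overall_mean) ** 2 for x in data) / len(data)
    variances = [overall_var] * n_components
    weights = [1.0 / n_components] * n_components

    for _ in range(n_iter):
        # E-step: responsibility of each component for each observation.
        resp = []
        for x in data:
            dens = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(dens) or 1e-300  # guard against numerical underflow
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean and variance of each component.
        for k in range(n_components):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue  # leave an empty component unchanged
            weights[k] = nk / len(data)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var_k = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var_k, 1e-6)  # floor avoids a collapsing variance
    return weights, means, variances

# Synthetic one-channel signal with two cell populations (made-up parameters).
rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(200)] + [rng.gauss(10.0, 1.0) for _ in range(200)]
weights, means, variances = fit_gmm_1d(data)
print(sorted(round(m, 1) for m in means))
```

Each observation can then be assigned to the component with the highest responsibility, which is what turns the fitted mixture into a clustering.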


Tropospheric Planetary Wave Dynamics and Mixture Modeling: Two Preferred Regimes and a Regime Shift

Journal of the Atmospheric Sciences ◽

10.1175/jas4045.1 ◽

2007 ◽

Vol 64 (10) ◽

pp. 3521-3541 ◽

Cited By ~ 28

Author(s):

A. Hannachi

Keyword(s):

General Circulation ◽

Regime Shift ◽

Planetary Wave ◽

Clustering Algorithm ◽

Spatial Clustering ◽

Gaussian Mixture Models ◽

Circulation Model ◽

Gaussian Mixture ◽

Stochastic Dynamical Systems ◽

Wave Dynamics

Abstract Investigation of preferred structures of planetary wave dynamics is addressed using multivariate Gaussian mixture models. The number of components in the mixture is obtained using order statistics of the mixing proportions, hence avoiding previous difficulties related to sample sizes and independence issues. The method is first applied to a few low-order stochastic dynamical systems and data from a general circulation model. The method is next applied to winter daily 500-hPa heights from 1949 to 2003 over the Northern Hemisphere. A spatial clustering algorithm is first applied to the leading two principal components (PCs) and shows significant clustering. The clustering is particularly robust for the first half of the record and less so for the second half. The mixture model is then used to identify the clusters. Two highly significant extratropical planetary-scale preferred structures are obtained within the first two to four EOF state space. The first pattern shows a Pacific–North American (PNA) pattern and a negative North Atlantic Oscillation (NAO), and the second pattern is nearly opposite to the first one. It is also observed that some subspaces show multivariate Gaussianity, compatible with linearity, whereas others show multivariate non-Gaussianity. The same analysis is also applied to two subperiods, before and after 1978, and shows a similar regime behavior, with slightly stronger support for the first subperiod. In addition, a significant regime shift is also observed between the two periods as well as a change in the shape of the distribution. The patterns associated with the regime shifts reflect essentially a PNA pattern and an NAO pattern consistent with the observed global warming effect on climate and the observed shift in sea surface temperature around the mid-1970s.


Spectral Clustering Algorithm of Uncertain Objects

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.300-301.1058 ◽

2013 ◽

Vol 300-301 ◽

pp. 1058-1061

Author(s):

Tong He

Keyword(s):

Mixture Models ◽

Experimental Evaluation ◽

Spectral Clustering ◽

Clustering Algorithm ◽

Gaussian Mixture Models ◽

Gaussian Mixture ◽

Bayesian Probability ◽

Leibler Divergence ◽

Effectiveness And Efficiency ◽

Spectral Clustering Algorithm

By extending the classical spectral clustering algorithm, this paper proposes a new clustering algorithm for uncertain objects. In the algorithm, each uncertain object is represented as a Gaussian mixture model, and Kullback-Leibler divergence and Bayesian probability are each used as a similarity measure between Gaussian mixture models. In an extensive experimental evaluation, we show the effectiveness and efficiency of the new algorithm and compare it with the CLARANS algorithm for uncertain objects.
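
To make the Kullback-Leibler similarity concrete, the sketch below computes the divergence in the one case where it has a simple closed form: two univariate Gaussians. This is an illustrative assumption rather than the paper's formulation, since the divergence between full Gaussian mixtures has no closed form and has to be approximated. The symmetrized variant is included because spectral clustering expects a symmetric similarity; the function names are hypothetical.

```python
import math

def kl_gaussian(mu_p, var_p, mu_q, var_q):
    """KL(P || Q) for univariate Gaussians P = N(mu_p, var_p) and Q = N(mu_q, var_q)."""
    return 0.5 * (math.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0)

def symmetric_kl(mu_p, var_p, mu_q, var_q):
    """Symmetrized divergence: KL(P || Q) + KL(Q || P)."""
    return kl_gaussian(mu_p, var_p, mu_q, var_q) + kl_gaussian(mu_q, var_q, mu_p, var_p)

def similarity(mu_p, var_p, mu_q, var_q):
    """Map the divergence to a similarity in (0, 1]; identical objects score 1."""
    return math.exp(-symmetric_kl(mu_p, var_p, mu_q, var_q))

print(kl_gaussian(0.0, 1.0, 1.0, 1.0))  # 0.5
print(similarity(0.0, 1.0, 0.0, 1.0))   # 1.0
```

The exponential mapping is only one possible way to turn a divergence into an affinity matrix entry.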


PhenoGMM: Gaussian Mixture Modeling of Cytometry Data Quantifies Changes in Microbial Community Structure

mSphere ◽

10.1128/msphere.00530-20 ◽

2021 ◽

Vol 6 (1) ◽

Author(s):

Peter Rubbens ◽

Ruben Props ◽

Frederiek-Maarten Kerckhof ◽

Nico Boon ◽

Willem Waegeman

Keyword(s):

Flow Cytometry ◽

Community Structure ◽

Microbial Community ◽

Microbial Community Structure ◽

Gaussian Mixture Models ◽

Gaussian Mixture ◽

Community Diversity ◽

Data Sets ◽

Natural Ecosystems ◽

Flow Cytometry Data

ABSTRACT Microbial flow cytometry can rapidly characterize the status of microbial communities. Upon measurement, large amounts of quantitative single-cell data are generated, which need to be analyzed appropriately. Cytometric fingerprinting approaches are often used for this purpose. Traditional approaches either require a manual annotation of regions of interest, do not fully consider the multivariate characteristics of the data, or result in many community-describing variables. To address these shortcomings, we propose an automated model-based fingerprinting approach based on Gaussian mixture models, which we call PhenoGMM. The method successfully quantifies changes in microbial community structure based on flow cytometry data, which can be expressed in terms of cytometric diversity. We evaluate the performance of PhenoGMM using data sets from both synthetic and natural ecosystems and compare the method with a generic binning fingerprinting approach. PhenoGMM supports the rapid and quantitative screening of microbial community structure and dynamics. IMPORTANCE Microorganisms are vital components in various ecosystems on Earth. In order to investigate the microbial diversity, researchers have largely relied on the analysis of 16S rRNA gene sequences from DNA. Flow cytometry has been proposed as an alternative technology to characterize microbial community diversity and dynamics. The technology enables a fast measurement of optical properties of individual cells. So-called fingerprinting techniques are needed in order to describe microbial community diversity and dynamics based on flow cytometry data. In this work, we propose a more advanced fingerprinting strategy based on Gaussian mixture models. We evaluated our workflow on data sets from both synthetic and natural ecosystems, illustrating its general applicability for the analysis of microbial flow cytometry data. PhenoGMM supports a rapid and quantitative analysis of microbial community structure using flow cytometry.


Semi-supervised Gaussian Mixture Models Clustering Algorithm Based on Immune Clonal Selection

Proceedings of the 2015 4th National Conference on Electrical, Electronics and Computer Engineering ◽

10.2991/nceece-15.2016.214 ◽

2016 ◽

Author(s):

Wenlong Huang ◽

Xiaodan Wang

Keyword(s):

Mixture Models ◽

Clustering Algorithm ◽

Clonal Selection ◽

Gaussian Mixture Models ◽

Gaussian Mixture


Heart Sounds Human Identification and Verification Approaches using Vector Quantization and Gaussian Mixture Models

International Journal of Systems Biology and Biomedical Technologies ◽

10.4018/ijsbbt.2012100106 ◽

2012 ◽

Vol 1 (4) ◽

pp. 74-87

Author(s):

Neveen I. Ghali ◽

Rasha Wahid ◽

Aboul Ella Hassanien

Keyword(s):

Feature Extraction ◽

Gaussian Mixture Models ◽

Gaussian Mixture ◽

Mixture Modeling ◽

Heart Sounds ◽

Experimental Results ◽

Gaussian Mixture Modeling ◽

Two Samples ◽

Robust Feature Extraction ◽

The Difference

This paper investigates the possibility of using human heart sounds as a human print. To evaluate the performance and the uniqueness of the proposed approach, tests using a high-resolution digital auscultation stethoscope were run on nearly 80 heart sound samples. The verification approach consists of robust feature extraction with a specified configuration in conjunction with Gaussian mixture modeling. The similarity of two samples is estimated by measuring the difference between the negative log-likelihoods of their features. The experimental results show that the overall accuracy of the employed Gaussian mixture modeling reaches up to 85%. The identification approach consists of robust feature extraction with a specified configuration in conjunction with LBG-VQ. The experimental results show that the overall accuracy of the employed LBG-VQ reaches up to 88.7%.


Probability Estimation of Direct Hydrocarbon Indicators Using Gaussian Mixture Models

10.21528/cbic2021-131 ◽

2021 ◽

Author(s):

John B. Lemos ◽

Matheus R. S. Barbosa ◽

Edric B. Troccoli ◽

Alexsandro G. Cerqueira

Keyword(s):

Cluster Analysis ◽

Mixture Models ◽

Clustering Algorithm ◽

Gaussian Mixture Models ◽

Principal Component ◽

Gaussian Mixture ◽

Optimal Number ◽

Original Dataset ◽

Pca Algorithm ◽

Optimal Number Of Clusters

This work aims to delimit the Direct Hydrocarbon Indicators (DHI) zones using the Gaussian Mixture Models (GMM) algorithm, an unsupervised machine learning method, over the FS8 seismic horizon in the seismic data of the Dutch F3 Field. The dataset used to perform the cluster analysis was extracted from the 3D seismic dataset. It comprises the following seismic attributes: Sweetness, Spectral Decomposition, Acoustic Impedance, Coherence, and Instantaneous Amplitude. The Principal Component Analysis (PCA) algorithm was applied to the original dataset for dimensionality reduction and noise filtering, and we chose the first three principal components as the input of the clustering algorithm. The cluster analysis using the Gaussian Mixture Models was performed by varying the number of groups from 2 to 20. The Elbow Method suggested a smaller number of groups than needed to isolate the DHI zones. Therefore, we observed that four is the optimal number of clusters to highlight this seismic feature. Furthermore, it was possible to interpret other clusters related to the lithology through geophysical well log data.


A COMPARATIVE STUDY ON KERNEL-BASED PROBABILISTIC NEURAL NETWORKS FOR SPEAKER VERIFICATION

International Journal of Neural Systems ◽

10.1142/s0129065702001278 ◽

2002 ◽

Vol 12 (05) ◽

pp. 381-397 ◽

Cited By ~ 4

Author(s):

K. K. YIU ◽

M. W. MAK ◽

S. Y. KUNG

Keyword(s):

Neural Networks ◽

Ad Hoc ◽

Speaker Verification ◽

Gaussian Mixture Models ◽

Gaussian Mixture ◽

Error Rates ◽

Training Algorithm ◽

Probabilistic Neural Networks ◽

Equal Error Rate ◽

Acceptance Rates

This paper compares kernel-based probabilistic neural networks for speaker verification based on 138 speakers of the YOHO corpus. Experimental evaluations using probabilistic decision-based neural networks (PDBNNs), Gaussian mixture models (GMMs) and elliptical basis function networks (EBFNs) as speaker models were conducted. The original training algorithm of PDBNNs was also modified to make PDBNNs appropriate for speaker verification. Results show that the equal error rate obtained by PDBNNs and GMMs is less than that of EBFNs (0.33% vs. 0.48%), suggesting that GMM- and PDBNN-based speaker models outperform the EBFN ones. This work also finds that the globally supervised learning of PDBNNs is able to find decision thresholds that not only keep the false acceptance rates at a low level but also reduce their variation, whereas the ad-hoc threshold-determination approach used by the EBFNs and GMMs causes a large variation in the error rates. This property makes the performance of PDBNN-based systems more predictable.
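
The equal error rate reported here is the operating point at which the false acceptance rate and the false rejection rate coincide. A minimal sketch of how it can be estimated from verification scores follows; the function names, the threshold sweep over observed scores, and the example scores are assumptions for illustration and are not taken from the paper.

```python
def error_rates(genuine, impostor, threshold):
    """Return (false acceptance rate, false rejection rate) at a decision threshold.

    A trial is accepted when its score is greater than or equal to the threshold.
    """
    far = sum(score >= threshold for score in impostor) / len(impostor)
    frr = sum(score < threshold for score in genuine) / len(genuine)
    return far, frr

def equal_error_rate(genuine, impostor):
    """Sweep observed scores as thresholds and return (EER estimate, threshold)."""
    best = None
    for threshold in sorted(set(genuine) | set(impostor)):
        far, frr = error_rates(genuine, impostor, threshold)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            # Keep the threshold where the two error rates are closest.
            best = (gap, (far + frr) / 2.0, threshold)
    return best[1], best[2]

# Made-up similarity scores: higher means "more likely the claimed speaker".
genuine_scores = [0.9, 0.8, 0.4, 0.7]
impostor_scores = [0.1, 0.5, 0.2, 0.3]
eer, threshold = equal_error_rate(genuine_scores, impostor_scores)
print(eer, threshold)  # 0.25 0.5
```

With few trials the two rates rarely match exactly, so the sketch reports their average at the closest threshold; larger score sets give a finer estimate.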
