BinSanity: unsupervised clustering of environmental microbial assemblies using coverage and affinity propagation

PeerJ ◽  
2017 ◽  
Vol 5 ◽  
pp. e3035 ◽  
Author(s):  
Elaina D. Graham ◽  
John F. Heidelberg ◽  
Benjamin J. Tully

Metagenomics has become an integral part of defining microbial diversity in various environments. Many ecosystems have characteristically low biomass and few cultured representatives. Linking potential metabolisms to phylogeny in environmental microorganisms is important for interpreting microbial community functions and the impacts these communities have on geochemical cycles. However, metagenomic studies face the computational hurdle of ‘binning’ contigs into phylogenetically related units or putative genomes. Binning methods have been implemented with varying approaches such as k-means clustering, Gaussian mixture models, hierarchical clustering, neural networks, and two-way clustering; however, many of these suffer from biases against low-coverage/abundance organisms and closely related taxa/strains. We introduce a new binning method, BinSanity, that utilizes the clustering algorithm affinity propagation (AP) to cluster assemblies using coverage, with composition-based refinement (tetranucleotide frequency and percent GC content) to optimize bins containing multiple source organisms. This separation of composition- and coverage-based clustering reduces bias for closely related taxa. BinSanity was developed and tested on artificial metagenomes varying in size and complexity. Results indicate that BinSanity has higher precision, recall, and Adjusted Rand Index than five commonly implemented methods. When tested on a previously published environmental metagenome, BinSanity generated high-completion, low-redundancy bins corresponding with the published metagenome-assembled genomes.
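A minimal sketch of the coverage-based clustering step described above, using scikit-learn's AffinityPropagation on a synthetic contig-coverage matrix. The coverage profiles, the `preference` value, and the cluster count are illustrative assumptions, not BinSanity's actual data or defaults.

```python
# Toy illustration: cluster contigs by their coverage profiles across
# samples with affinity propagation, as the coverage-based step does.
import numpy as np
from sklearn.cluster import AffinityPropagation

rng = np.random.default_rng(0)

# 60 contigs from 3 source genomes; each genome has a distinct coverage
# profile across 4 samples (rows = contigs, columns = samples).
profiles = np.array([[50.0, 5.0, 1.0, 2.0],
                     [2.0, 40.0, 30.0, 1.0],
                     [5.0, 5.0, 5.0, 60.0]])
coverage = np.vstack([p + rng.normal(0, 1.0, size=(20, 4)) for p in profiles])

# Affinity propagation picks exemplars automatically; the "preference"
# parameter steers how many clusters emerge (lower -> fewer clusters).
ap = AffinityPropagation(preference=-200, random_state=0).fit(coverage)
labels = ap.labels_
print(len(set(labels)))  # number of recovered bins
```

In practice the preference would be tuned to the coverage scale of the assembly; affinity propagation's appeal here is that the number of bins is not fixed in advance.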

2016 ◽  
Author(s):  
Elaina Graham ◽  
John Heidelberg ◽  
Benjamin Tully

Metagenomics has become an integral part of defining microbial diversity in various environments. Many ecosystems have characteristically low biomass and few cultured representatives. Linking potential metabolisms to phylogeny in environmental microorganisms is important for interpreting microbial community functions and the impacts these communities have on geochemical cycles. However, metagenomic studies face the computational hurdle of ‘binning’ contigs into phylogenetically related units or putative genomes. Binning methods have been implemented with varying approaches such as k-means clustering, Gaussian mixture models, hierarchical clustering, neural networks, and two-way clustering; however, many of these suffer from biases against low-coverage/abundance organisms and closely related taxa/strains. We introduce a new binning method, BinSanity, that utilizes the clustering algorithm affinity propagation (AP) to cluster assemblies using coverage alone, removing potential composition-based biases in clustering contigs, though it requires a minimum of two samples. To increase fidelity, a refinement script was developed that uses composition data (tetranucleotide frequency and %G+C content) to refine bins containing multiple source organisms. This separation of composition- and coverage-based signatures reduces clustering bias for closely related taxa. BinSanity was developed and tested on artificial metagenomes varying in size and complexity. Results indicate that this implementation of AP led to higher precision, recall, and Adjusted Rand Index than five commonly implemented methods. When tested on a previously published infant gut metagenome, BinSanity generated high-completion, low-redundancy bins corresponding with the published metagenome-assembled genomes.


2013 ◽  
Vol 300-301 ◽  
pp. 1058-1061
Author(s):  
Tong He

By extending the classical spectral clustering algorithm, a new clustering algorithm for uncertain objects is proposed in this paper. In the algorithm, each uncertain object is represented as a Gaussian mixture model, and Kullback-Leibler divergence and Bayesian probability are used as similarity measures between Gaussian mixture models. In an extensive experimental evaluation, we not only show the effectiveness and efficiency of the new algorithm but also compare it with the CLARANS algorithm for uncertain objects.
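A hedged sketch of the similarity measure described above: the KL divergence between Gaussians, symmetrized and converted into an affinity for spectral clustering. KL divergence between full mixtures has no closed form and is usually approximated, so this sketch models each uncertain object as a single Gaussian; the data and the affinity kernel are illustrative assumptions, not the paper's setup.

```python
# Closed-form KL divergence between multivariate Gaussians, used to build
# a precomputed affinity matrix for spectral clustering of "uncertain
# objects" represented as distributions.
import numpy as np
from sklearn.cluster import SpectralClustering

def kl_gaussian(m0, S0, m1, S1):
    """Closed-form KL( N(m0,S0) || N(m1,S1) ) for d-dimensional Gaussians."""
    d = len(m0)
    S1_inv = np.linalg.inv(S1)
    diff = m1 - m0
    return 0.5 * (np.trace(S1_inv @ S0) + diff @ S1_inv @ diff - d
                  + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

rng = np.random.default_rng(1)
# Toy uncertain objects: Gaussians centred near (0,0) or (10,10).
means = np.vstack([rng.normal(0, 0.5, size=(10, 2)),
                   rng.normal(10, 0.5, size=(10, 2))])
covs = [np.eye(2) for _ in means]

n = len(means)
affinity = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        # Symmetrize KL (it is not symmetric) before building the affinity.
        skl = (kl_gaussian(means[i], covs[i], means[j], covs[j])
               + kl_gaussian(means[j], covs[j], means[i], covs[i]))
        affinity[i, j] = np.exp(-0.5 * skl)

labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                            random_state=0).fit_predict(affinity)
```

The exponential kernel over symmetrized KL is one common way to turn a divergence into the non-negative affinity that spectral clustering requires.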


2021 ◽  
Author(s):  
John B. Lemos ◽  
Matheus R. S. Barbosa ◽  
Edric B. Troccoli ◽  
Alexsandro G. Cerqueira

This work aims to delimit the Direct Hydrocarbon Indicator (DHI) zones using the Gaussian Mixture Models (GMM) algorithm, an unsupervised machine learning method, over the FS8 seismic horizon in the seismic data of the Dutch F3 Field. The dataset used to perform the cluster analysis was extracted from the 3D seismic dataset. It comprises the following seismic attributes: Sweetness, Spectral Decomposition, Acoustic Impedance, Coherence, and Instantaneous Amplitude. The Principal Component Analysis (PCA) algorithm was applied to the original dataset for dimensionality reduction and noise filtering, and we chose the first three principal components as the input of the clustering algorithm. The cluster analysis using the Gaussian Mixture Models was performed by varying the number of groups from 2 to 20. The Elbow Method suggested a smaller number of groups than needed to isolate the DHI zones; instead, we observed that four is the optimal number of clusters to highlight this seismic feature. Furthermore, it was possible to interpret other clusters related to the lithology through geophysical well log data.
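A minimal sketch of the workflow described above: PCA for dimensionality reduction, followed by Gaussian mixtures fitted over a range of cluster counts. The synthetic attribute matrix stands in for the real seismic attributes, and BIC is used here as one common alternative to the elbow heuristic for choosing the cluster count.

```python
# PCA -> GMM pipeline on a toy stand-in for seismic attribute data:
# 4 underlying facies, 5 correlated attributes per horizon sample.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
centers = rng.normal(0, 5, size=(4, 5))
X = np.vstack([c + rng.normal(0, 0.4, size=(200, 5)) for c in centers])

# Keep the first three principal components, as in the study.
X3 = PCA(n_components=3).fit_transform(X)

# Fit GMMs for k = 2..8 and pick the k minimizing BIC.
bics = {k: GaussianMixture(n_components=k, random_state=0).fit(X3).bic(X3)
        for k in range(2, 9)}
best_k = min(bics, key=bics.get)
```

On real attribute data the BIC curve is rarely this clean, which is consistent with the authors' observation that the elbow criterion alone under-segmented the DHI zones.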


2018 ◽  
Vol 5 (2) ◽  
pp. 83-100
Author(s):  
María Dolores Luquín-García ◽  
Edith Cecilia Macedo Ruíz ◽  
Omar Rojas-Altamirano ◽  
Carlos López-Hernández

The aim of this article is to determine the socioeconomic level (SEL) at the disaggregation of the Basic Statistical Area (BSA) in the Mexican Republic. The methodology used is the one established by the Mexican Association of Market Research Agencies (AMAI) along with the National Institute of Statistics and Geography (INEGI). The clustering of the BSAs was carried out according to variables contained in the Population and Housing Census of 2010, through Gaussian mixture models and learning neural networks, and finally by defining the labels corresponding to each SEL. We found a representative SEL for each BSA. In addition, the definition of each socioeconomic level shows good results, with an average of 90.86% correctly labeled elements.
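A hedged sketch of the clustering-then-labeling step described above: fit a Gaussian mixture to census-style variables, assign each cluster the majority reference label, and measure the share of correctly labeled units. The data, the three-level scheme, and the variables are illustrative assumptions, not the AMAI/INEGI variables or levels.

```python
# Cluster toy BSA feature vectors with a GMM, then map clusters to
# socioeconomic labels by majority vote and score the labeling.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
# Toy BSAs: 3 socioeconomic levels, 4 census-style variables each.
levels = np.array([[0.2, 0.1, 0.3, 0.2],
                   [0.5, 0.5, 0.5, 0.5],
                   [0.9, 0.8, 0.7, 0.9]])
X = np.vstack([lv + rng.normal(0, 0.05, size=(100, 4)) for lv in levels])
true = np.repeat([0, 1, 2], 100)

clusters = GaussianMixture(n_components=3, random_state=0).fit_predict(X)

# Majority-vote mapping from cluster id to SEL label.
mapping = {c: np.bincount(true[clusters == c]).argmax()
           for c in np.unique(clusters)}
pred = np.array([mapping[c] for c in clusters])
accuracy = (pred == true).mean()
```

Because GMM cluster ids are arbitrary, some mapping from clusters to reference labels (majority vote here) is needed before a "percent correctly labeled" figure like the 90.86% above can be computed.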


2012 ◽  
Vol 45 (11) ◽  
pp. 3950-3961 ◽  
Author(s):  
Miin-Shen Yang ◽  
Chien-Yo Lai ◽  
Chih-Ying Lin
