Clustering on Human Microbiome Sequencing Data: A Distance-Based Unsupervised Learning Model

2020 ◽  
Vol 8 (10) ◽  
pp. 1612
Author(s):  
Dongyang Yang ◽  
Wei Xu

Modeling and analyzing the human microbiome allows assessment of the microbial community and its impact on human health. Microbiome composition can be quantified from 16S rRNA sequencing data, which are usually skewed and heavy-tailed with excess zeros. Clustering methods are useful in personalized medicine for identifying subgroups for patient stratification. However, there is currently no standardized clustering method for complex microbiome sequencing data. We propose a clustering algorithm with a specific beta diversity measure that addresses the presence-absence bias encountered in sparse count data and effectively measures sample distances for stratification. Our clustering distance is derived from a parametric mixture model that produces sample-specific distributions conditional on the observed operational taxonomic unit (OTU) counts and estimated mixture weights. The method provides accurate estimates of the true zero proportions and thus constructs a precise beta diversity measure. Extensive simulation studies suggest that the proposed method achieves substantial clustering improvement over several widely used distance measures when a large proportion of zeros is present. The proposed algorithm was applied to a human gut microbiome study on Parkinson's disease to identify distinct microbiome states with biological interpretations.
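The abstract does not spell out the mixture model, so the sketch below is only a rough illustration of the idea (all function names are hypothetical): a fixed-point estimate of the structural-zero proportion under a zero-inflated Poisson, next to the classic Bray-Curtis dissimilarity that such a zero-adjusted measure would aim to improve on for sparse counts.

```python
from math import exp

def bray_curtis(x, y):
    """Classic Bray-Curtis dissimilarity between two count vectors."""
    num = sum(abs(a - b) for a, b in zip(x, y))
    den = sum(a + b for a, b in zip(x, y))
    return num / den if den else 0.0

def zip_structural_zeros(counts, n_iter=50):
    """Fixed-point estimate of the structural-zero proportion pi under a
    zero-inflated Poisson: observed zero fraction z0 = pi + (1 - pi) * exp(-lam),
    with ZIP mean m = (1 - pi) * lam. A toy stand-in for the paper's
    sample-specific mixture model."""
    n = len(counts)
    z0 = sum(1 for c in counts if c == 0) / n
    m = sum(counts) / n
    if m == 0:
        return z0  # all zeros: everything looks structural
    pi = 0.0
    for _ in range(n_iter):
        lam = m / (1.0 - pi)                      # Poisson rate of the count component
        pi = max(0.0, (z0 - exp(-lam)) / (1.0 - exp(-lam)))
    return pi
```

Separating structural from sampling zeros in this way is what lets a beta diversity measure avoid treating every shared zero as evidence of similarity.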

Author(s):  
Carlotta Domeniconi

In an effort to achieve improved classifier accuracy, extensive research has been conducted in classifier ensembles. Very recently, cluster ensembles have emerged. It is well known that off-the-shelf clustering methods may discover different structures in a given set of data. This is because each clustering algorithm has its own bias resulting from the optimization of different criteria. Furthermore, there is no ground truth against which the clustering result can be validated. Thus, no cross-validation technique can be carried out to tune input parameters involved in the clustering process. As a consequence, the user is not equipped with any guidelines for choosing the proper clustering method for a given dataset. Cluster ensembles offer a solution to challenges inherent to clustering arising from its ill-posed nature. Cluster ensembles can provide more robust and stable solutions by leveraging the consensus across multiple clustering results, while averaging out emergent spurious structures that arise due to the various biases to which each participating algorithm is tuned. In this chapter, we discuss the problem of combining multiple weighted clusters, discovered by a locally adaptive algorithm (Domeniconi, Papadopoulos, Gunopulos, & Ma, 2004) which detects clusters in different subspaces of the input space. We believe that our approach is the first attempt to design a cluster ensemble for subspace clustering (Al-Razgan & Domeniconi, 2006). Recently, several subspace clustering methods have been proposed (Parsons, Haque, & Liu, 2004). They all attempt to dodge the curse of dimensionality which affects any algorithm in high dimensional spaces. In high dimensional spaces, it is highly likely that, for any given pair of points within the same cluster, there exist at least a few dimensions on which the points are far apart from each other. As a consequence, distance functions that equally use all input features may not be effective. 
Furthermore, several clusters may exist in different subspaces comprised of different combinations of features. In many real-world problems, some points are correlated with respect to a given set of dimensions, while others are correlated with respect to different dimensions. Each dimension could be relevant to at least one of the clusters. Global dimensionality reduction techniques are unable to capture local correlations of data. Thus, a proper feature selection procedure should operate locally in input space. Local feature selection allows one to embed different distance measures in different regions of the input space; such distance metrics reflect local correlations of data. In (Domeniconi, Papadopoulos, Gunopulos, & Ma, 2004) we proposed a soft feature selection procedure (called LAC) that assigns weights to features according to the local correlations of data along each dimension. Dimensions along which data are loosely correlated receive a small weight, which has the effect of elongating distances along that dimension. Features along which data are strongly correlated receive a large weight, which has the effect of constricting distances along that dimension. Thus the learned weights perform a directional local reshaping of distances which allows a better separation of clusters, and therefore the discovery of different patterns in different subspaces of the original input space.
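As a toy sketch of this local weighting idea (not the LAC algorithm itself; the exponential weighting form and normalization are assumptions, and the names are hypothetical), each dimension's weight decays with the within-cluster dispersion along it, so tightly correlated dimensions dominate the distance:

```python
from math import exp, sqrt

def local_weights(cluster_points, h=1.0):
    """Per-dimension weights in the spirit of LAC: exponential of the
    negative within-cluster dispersion along each dimension, normalized,
    so dimensions where the cluster is tight receive large weights."""
    d = len(cluster_points[0])
    n = len(cluster_points)
    centroid = [sum(p[i] for p in cluster_points) / n for i in range(d)]
    disp = [sum((p[i] - centroid[i]) ** 2 for p in cluster_points) / n
            for i in range(d)]
    raw = [exp(-x / h) for x in disp]
    norm = sqrt(sum(w * w for w in raw))
    return [w / norm for w in raw]

def weighted_dist(x, y, w):
    """Weighted Euclidean distance used to assign points to clusters."""
    return sqrt(sum(wi * (a - b) ** 2 for wi, a, b in zip(w, x, y)))
```

With one weight vector per cluster, each cluster effectively lives in its own reshaped metric, which is the subspace behavior described above.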


2018 ◽  
Vol 17 (2) ◽  
pp. 695-732 ◽  
Author(s):  
Sandro Vieira Soares ◽  
Victor Pereira Silva ◽  
Silvia Pereira de Castro Casa Nova ◽  
Alan Diógenes Góis

Graduate programs in Accounting: similarities and differences of bibliographic production
Abstract: During the 2010-2012 triennium there were seventeen academic stricto sensu graduate programs in Accounting and Controllership in Brazil, spread across four regions of the country and four Capes concept levels. Together, the faculty and students of these programs published more than two thousand papers in journals. The question guiding this research was: how are the graduate programs in Accounting grouped according to the characteristics of their journal publications in 2010-2012? The aim was to simulate several clusterings of these programs using cluster analysis. Simulations were run using five clustering methods with three distance measures on the five variables proposed by Soares and Múrcia (2016). We conclude that the cluster layout of the programs changes depending on the distance measure used. In addition, we show that the nearest-neighbors method generates clusters that are harder to observe. Finally, we detected recurrent associations between some programs, such as UFPR with Unisinos, and FURB with UFSC; thus, the programs in the south of Brazil have similar publication characteristics. USP stood out by isolating itself from the other programs. Soares and Casa Nova (2015) have already presented evidence consistent with these findings. This paper advances the discussion of using bibliographic-production indicators to evaluate the quality of graduate programs. The authors argue that the system follows a rationality focused on product indicators, which can drive faculty and students away from concentrating their efforts on the research process. Keywords: Graduate programs. Accounting. Publications.
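The abstract does not enumerate the five clustering methods or three distance measures, but the effect it reports is easy to reproduce in miniature: a naive agglomerative sketch (hypothetical names; real studies would use a statistics package) where both the linkage rule and the distance function are swappable, showing how those choices drive the merge decisions.

```python
def euclid(p, q):
    return sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5

def manhattan(p, q):
    return sum(abs(x - y) for x, y in zip(p, q))

def agglomerative(points, k, dist, linkage="single"):
    """Naive agglomerative clustering: repeatedly merge the pair of
    clusters with the smallest linkage distance until k remain.
    'single' linkage uses the closest pair of members, 'complete'
    the farthest pair."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                ds = [dist(points[i], points[j])
                      for i in clusters[a] for j in clusters[b]]
                d = min(ds) if linkage == "single" else max(ds)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])
        del clusters[b]
    return [sorted(c) for c in clusters]
```

On well-separated data every combination agrees; on overlapping data, as the paper observes for the graduate programs, different distance measures yield different cluster layouts.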


Symmetry ◽  
2019 ◽  
Vol 11 (6) ◽  
pp. 753
Author(s):  
Wenyuan Zhang ◽  
Xijuan Guo ◽  
Tianyu Huang ◽  
Jiale Liu ◽  
Jun Chen

Spatially constrained Fuzzy C-means (FCM) clustering is an effective algorithm for image segmentation: its use of background information improves insensitivity to noise to some extent. However, membership degrees based on Euclidean distance are not suitable for revealing non-Euclidean structure in the input data, and they still lack robustness to noise and outliers. To overcome these problems, this paper proposes a new kernel-based algorithm built on a kernel-induced distance measure, which we call the Kernel-based Robust Bias-correction Fuzzy Weighted C-ordered-means clustering algorithm (KBFWCM). In constructing the objective function, KBFWCM accounts for the fact that spatially constrained FCM clustering remains sensitive to image noise and involves highly intensive computation. To address the noise sensitivity and image detail handling of spatially constrained FCM, KBFWCM combines fuzzy local similarity measures (spatial and grayscale) with the typicality of data attributes. To address the original algorithm's poor robustness to noise and outliers and its computational cost, a kernel-based clustering method that includes a class of robust non-Euclidean distance measures is proposed. The experimental results show that the KBFWCM algorithm has a stronger denoising effect and is more robust on noisy images.
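The core of any kernel-induced distance is the standard feature-space identity ||phi(x) - phi(y)||^2 = K(x,x) + K(y,y) - 2K(x,y), which for a Gaussian kernel reduces to 2(1 - K(x,y)). A minimal sketch of that measure (not the full KBFWCM objective):

```python
from math import exp

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel; K(x, x) = 1 for any x."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return exp(-sq / (2 * sigma ** 2))

def kernel_distance_sq(x, y, sigma=1.0):
    """Kernel-induced squared distance in feature space:
    ||phi(x) - phi(y)||^2 = K(x,x) + K(y,y) - 2K(x,y) = 2(1 - K(x,y)).
    The value is bounded by 2, so distant outliers saturate instead of
    dominating the clustering objective -- the robustness property the
    abstract refers to."""
    return 2.0 * (1.0 - gaussian_kernel(x, y, sigma))
```

Replacing the squared Euclidean term in the FCM objective with this bounded quantity is what makes the memberships far less sensitive to outliers.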


mSystems ◽  
2016 ◽  
Vol 1 (1) ◽  
Author(s):  
Evguenia Kopylova ◽  
Jose A. Navas-Molina ◽  
Céline Mercier ◽  
Zhenjiang Zech Xu ◽  
Frédéric Mahé ◽  
...  

ABSTRACT Massive collections of next-generation sequencing data call for fast, accurate, and easily accessible bioinformatics algorithms to perform sequence clustering. A comprehensive benchmark is presented, including open-source tools and the popular USEARCH suite. Simulated, mock, and environmental communities were used to analyze sensitivity, selectivity, species diversity (alpha and beta), and taxonomic composition. The results demonstrate that recent clustering algorithms can significantly improve accuracy and preserve estimated diversity without the application of aggressive filtering. Moreover, these tools are all open source, apply multiple levels of multithreading, and scale to the demands of modern next-generation sequencing data, which is essential for the analysis of massive multidisciplinary studies such as the Earth Microbiome Project (EMP) (J. A. Gilbert, J. K. Jansson, and R. Knight, BMC Biol 12:69, 2014, http://dx.doi.org/10.1186/s12915-014-0069-1 ). Sequence clustering is a common early step in amplicon-based microbial community analysis, when raw sequencing reads are clustered into operational taxonomic units (OTUs) to reduce the run time of subsequent analysis steps. Here, we evaluated the performance of recently released state-of-the-art open-source clustering software products, namely, OTUCLUST, Swarm, SUMACLUST, and SortMeRNA, against current principal options (UCLUST and USEARCH) in QIIME, hierarchical clustering methods in mothur, and USEARCH’s most recent clustering algorithm, UPARSE. All the latest open-source tools showed promising results, reporting up to 60% fewer spurious OTUs than UCLUST, indicating that the underlying clustering algorithm can vastly reduce the number of these derived OTUs. Furthermore, we observed that stringent quality filtering, such as is done in UPARSE, can cause a significant underestimation of species abundance and diversity, leading to incorrect biological results. 
Swarm, SUMACLUST, and SortMeRNA have been included in the QIIME 1.9.0 release.
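The tools compared here all refine the same basic scheme. A toy sketch of de novo greedy OTU clustering in the spirit of UCLUST (the real tools use optimized alignment and heuristics; the identity function below is a naive stand-in):

```python
def identity(a, b):
    """Fraction of matching positions between two reads -- a toy
    stand-in for a real pairwise aligner."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / max(len(a), len(b))

def greedy_otu_cluster(reads, threshold=0.97):
    """De novo greedy clustering: each read joins the first existing
    centroid it matches at >= threshold identity; otherwise it seeds
    a new OTU. Input order therefore affects the result, one reason
    different tools report different OTU counts."""
    centroids, otus = [], []
    for r in reads:
        for i, c in enumerate(centroids):
            if identity(r, c) >= threshold:
                otus[i].append(r)
                break
        else:
            centroids.append(r)
            otus.append([r])
    return otus
```

The benchmark's "spurious OTU" metric counts clusters like the singletons this greedy pass produces from sequencing errors that fall just below the threshold.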


2020 ◽  
Author(s):  
Andrea Vázquez ◽  
Narciso López-López ◽  
Josselin Houenou ◽  
Cyril Poupon ◽  
Jean-François Mangin ◽  
...  

Abstract Background: Diffusion MRI is the preferred non-invasive in vivo modality for the study of brain white matter connections. Tractography datasets contain 3D streamlines that can be analyzed to study the main brain white matter tracts. Fiber clustering methods have been used to automatically regroup similar fibers into clusters. However, due to inter-subject variability and artifacts, the resulting clusters are difficult to process for finding common connections across subjects, especially for superficial white matter. Methods: We present an automatic method for labeling short association bundles across a group of subjects. The method is based on an intra-subject fiber clustering that generates compact fiber clusters. The clusters are then labeled based on the cortical connectivity of their fibers, taking the Desikan-Killiany atlas as reference, and named according to their relative position along one axis. Finally, two strategies for inter-subject bundle labeling were applied and compared: matching with the Hungarian algorithm, and a well-known fiber clustering algorithm, QuickBundles. Results: Individual labeling was executed on four subjects, with an execution time of 3.6 minutes. Inspection of the individual labeling based on a distance measure showed good correspondence among the four tested subjects. Both inter-subject labeling strategies were successfully implemented, applied to 20 subjects, and compared using a set of distance thresholds ranging from a conservative 10 mm to a moderate 21 mm. The Hungarian algorithm led to high correspondence but low reproducibility for all thresholds, with an execution time of 96 seconds. QuickBundles led to better correspondence and reproducibility and a short execution time of 9 seconds. Hence, the whole inter-subject labeling over 20 subjects takes 1.17 hours.
Conclusion: We implemented a method for the automatic labeling of short bundles in individuals, based on intra-subject clustering and the connectivity of the clusters with the cortex. The labels provide useful information for the visualization and analysis of individual connections, which is very difficult without additional information. Furthermore, we provide two fast inter-subject bundle labeling methods. The obtained clusters could be used for manual or automatic connectivity analysis in individuals or across subjects. Keywords: fiber labeling; clustering; fiber bundle; tractography; superficial white matter
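The Hungarian-matching strategy amounts to an optimal one-to-one assignment of bundles between subjects by centroid distance. The sketch below (hypothetical names; the distance here is a crude centroid distance, not a fiber-bundle metric) brute-forces that assignment over permutations, which yields the same result the Hungarian algorithm would find in polynomial time:

```python
from itertools import permutations

def centroid(bundle):
    """Mean point of a bundle given as a list of coordinate tuples."""
    n = len(bundle)
    return tuple(sum(p[i] for p in bundle) / n for i in range(len(bundle[0])))

def match_bundles(bundles_a, bundles_b):
    """Optimal one-to-one matching of bundles between two subjects,
    minimizing total centroid distance. Brute force over permutations
    stands in for the Hungarian algorithm (fine only for small k)."""
    ca = [centroid(b) for b in bundles_a]
    cb = [centroid(b) for b in bundles_b]
    dist = lambda p, q: sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5
    best, best_cost = None, float("inf")
    for perm in permutations(range(len(cb))):
        cost = sum(dist(ca[i], cb[j]) for i, j in enumerate(perm))
        if cost < best_cost:
            best, best_cost = perm, cost
    return list(best)  # best[i] = index in subject B matched to bundle i of A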


The distance measure is the core idea of data mining techniques such as classification, clustering, and statistical analysis. All clustering taxonomies (partitional, hierarchical, density-based, grid-based, model-based, fuzzy, and graph-based) use distance measures to categorize data points into different clusters and for cluster construction and validation. Big data mining extends data mining to the dimensions of big data. When a traditional clustering algorithm is used for big data mining, the distance measure must scale to huge datasets, heterogeneous data and sources, and the velocity characteristic of big data. From theoretical, practical, and existing-research perspectives, this paper focuses on the volume, variety, and velocity criteria of big data for identifying a distance measure suitable for big data mining, and examines how distance measures work within the clustering taxonomy. The study also analyzes the accuracy of the distance measures through clustering, with the help of a confusion matrix.
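Evaluating a distance measure's clustering accuracy via a confusion matrix has one subtlety: cluster ids are arbitrary, so the matrix must be scored under the best relabeling. A minimal sketch (hypothetical names; the permutation search is fine for small k, and the Hungarian algorithm generalizes it):

```python
from itertools import permutations

def confusion_matrix(true, pred, k):
    """k x k matrix: m[t][p] counts points with true label t
    assigned to cluster p."""
    m = [[0] * k for _ in range(k)]
    for t, p in zip(true, pred):
        m[t][p] += 1
    return m

def clustering_accuracy(true, pred, k):
    """Best-case accuracy over all relabelings of the predicted
    clusters, read off the confusion matrix diagonal."""
    m = confusion_matrix(true, pred, k)
    best = max(sum(m[i][perm[i]] for i in range(k))
               for perm in permutations(range(k)))
    return best / len(true)
```

With this score in hand, running the same clustering algorithm under different distance measures gives a direct accuracy comparison of the measures themselves.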


2019 ◽  
Vol 2019 ◽  
pp. 1-21 ◽  
Author(s):  
Cong Liu ◽  
Qianqian Chen ◽  
Yingxia Chen ◽  
Jie Liu

Most existing clustering algorithms are based on the Euclidean distance measure. However, Euclidean distance alone may not be sufficient to partition a dataset containing different structures, so it is necessary to combine multiple distance measures in clustering. The weights for the different measures are hard to set, however, and it therefore appears natural to keep the distance measures separate and optimize them simultaneously with a multiobjective optimization technique. Recently, a clustering algorithm called 'multiobjective evolutionary clustering based on combining multiple distance measures' (MOECDM) was proposed to integrate the Euclidean and Path distance measures for partitioning datasets with different structures. However, it is time-consuming due to its large gene encoding. This paper proposes a fast multiobjective fuzzy clustering algorithm for partitioning datasets with different structures. A real encoding scheme is adopted to represent each individual, and two fuzzy clustering objective functions, based on the Euclidean and Path distance measures respectively, evaluate the goodness of each individual. An improved evolutionary operator is also introduced to increase the convergence speed and the diversity of the population. The final generation yields a set of nondominated solutions, from which the best solution and the best distance measure are selected by a semisupervised method. An extended algorithm is also designed to detect the optimal cluster number automatically. The proposed algorithms were applied to many datasets with different structures; results on eight artificial and six real-life datasets are reported. The experiments show that the proposed algorithms can not only successfully partition datasets with different structures but also reduce the computational cost.
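The Path distance that complements Euclidean distance here is commonly defined as a minimax measure: the smallest possible value of the largest hop along any chain of points connecting i and j, so elongated or curved clusters whose endpoints are far apart in Euclidean terms still come out close. A short sketch of that commonly used definition (assuming this is the variant intended), via a Floyd-Warshall-style update:

```python
def path_distance(points):
    """All-pairs minimax 'path' distance: start from Euclidean
    distances, then relax d[i][j] to the minimum over intermediates k
    of max(d[i][k], d[k][j]) -- the largest edge on the best chain."""
    n = len(points)
    d = [[sum((a - b) ** 2 for a, b in zip(points[i], points[j])) ** 0.5
          for j in range(n)] for i in range(n)]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                d[i][j] = min(d[i][j], max(d[i][k], d[k][j]))
    return d
```

Optimizing the Euclidean and Path objectives jointly lets the evolutionary search cover both compact spherical clusters and chain-like structures in one run.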


2014 ◽  
Vol 23 (4) ◽  
pp. 379-389 ◽  
Author(s):  
Jun Ye

Abstract: Clustering plays an important role in data mining, pattern recognition, and machine learning. Single-valued neutrosophic sets (SVNSs) are useful means to describe and handle indeterminate and inconsistent information that fuzzy sets and intuitionistic fuzzy sets cannot describe and deal with. To cluster the data represented by single-valued neutrosophic information, this article proposes single-valued neutrosophic clustering methods based on similarity measures between SVNSs. First, we define a generalized distance measure between SVNSs and propose two distance-based similarity measures of SVNSs. Then, we present a clustering algorithm based on the similarity measures of SVNSs to cluster single-valued neutrosophic data. Finally, an illustrative example is given to demonstrate the application and effectiveness of the developed clustering methods.
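A generalized distance between SVNSs is typically a Minkowski-style aggregate over the three membership functions (truth, indeterminacy, falsity); p = 1 gives a Hamming-type and p = 2 a Euclidean-type measure, and a similarity measure can be taken as 1 - d. A minimal sketch of that form:

```python
def svns_distance(A, B, p=2):
    """Generalized distance between two single-valued neutrosophic
    sets, each a list of (truth, indeterminacy, falsity) triples over
    the same universe:
        d = [ (1 / 3n) * sum_j (|tA-tB|^p + |iA-iB|^p + |fA-fB|^p) ]^(1/p)
    The normalization by 3n keeps d in [0, 1]."""
    n = len(A)
    total = sum(abs(a[0] - b[0]) ** p
                + abs(a[1] - b[1]) ** p
                + abs(a[2] - b[2]) ** p
                for a, b in zip(A, B))
    return (total / (3 * n)) ** (1 / p)
```

A distance-based clustering then assigns each neutrosophic data point to the cluster whose prototype maximizes the similarity 1 - d.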


2000 ◽  
Vol 09 (04) ◽  
pp. 509-526 ◽  
Author(s):  
OLFA NASRAOUI ◽  
HICHEM FRIGUI ◽  
RAGHU KRISHNAPURAM ◽  
ANUPAM JOSHI

The proliferation of information on the World Wide Web has made the personalization of this information space a necessity. An important component of Web personalization is to mine typical user profiles from the vast amount of historical data stored in access logs. In the absence of any a priori knowledge, unsupervised classification or clustering methods seem to be ideally suited to analyze the semi-structured log data of user accesses. In this paper, we define the notion of a "user session" as being a temporally compact sequence of Web accesses by a user. We also define a new distance measure between two Web sessions that captures the organization of a Web site. The Competitive Agglomeration clustering algorithm which can automatically cluster data into the optimal number of components is extended so that it can work on relational data. The resulting Competitive Agglomeration for Relational Data (CARD) algorithm can deal with complex, non-Euclidean, distance/similarity measures. This algorithm was used to analyze Web server access logs successfully and obtain typical session profiles of users.
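A session distance that "captures the organization of a Web site" can be built from the site's URL tree: pages sharing a longer path prefix are more similar. The sketch below is a hypothetical illustration of that idea, not the paper's exact measure (the symmetrized best-match aggregation is an assumption):

```python
def url_similarity(u, v):
    """Syntactic similarity of two page URLs reflecting the site tree:
    shared leading path components over the longer path length."""
    pu, pv = u.strip("/").split("/"), v.strip("/").split("/")
    shared = 0
    for a, b in zip(pu, pv):
        if a != b:
            break
        shared += 1
    return shared / max(len(pu), len(pv))

def session_similarity(s1, s2):
    """Similarity of two sessions (lists of accessed URLs): each URL's
    best match in the other session, averaged and symmetrized."""
    def one_way(a, b):
        return sum(max(url_similarity(u, v) for v in b) for u in a) / len(a)
    return (one_way(s1, s2) + one_way(s2, s1)) / 2
```

Because such a measure is defined only pairwise between sessions, the data is inherently relational, which is exactly why the Competitive Agglomeration algorithm had to be extended to relational form (CARD).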


Symmetry ◽  
2021 ◽  
Vol 13 (3) ◽  
pp. 436
Author(s):  
Ruirui Zhao ◽  
Minxia Luo ◽  
Shenggang Li

Picture fuzzy sets, which are the extension of intuitionistic fuzzy sets, can deal with inconsistent information better in practical applications. A distance measure is an important mathematical tool to calculate the difference degree between picture fuzzy sets. Although some distance measures of picture fuzzy sets have been constructed, there are some unreasonable and counterintuitive cases. The main reason is that the existing distance measures do not or seldom consider the refusal degree of picture fuzzy sets. In order to solve these unreasonable and counterintuitive cases, in this paper, we propose a dynamic distance measure of picture fuzzy sets based on a picture fuzzy point operator. Through a numerical comparison and multi-criteria decision-making problems, we show that the proposed distance measure is reasonable and effective.
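The refusal degree of a picture fuzzy element (mu, eta, nu) is r = 1 - mu - eta - nu, and the counterintuitive cases arise when a distance ignores it: two sets differing only in how much is "refused" come out at distance zero. A minimal Hamming-type sketch that includes the refusal term (an illustration of the principle, not the paper's point-operator-based measure):

```python
def refusal(e):
    """Refusal degree of a picture fuzzy element (mu, eta, nu)."""
    return 1.0 - e[0] - e[1] - e[2]

def pfs_distance(A, B):
    """Normalized Hamming-type distance between two picture fuzzy sets
    (lists of (mu, eta, nu) triples) that also compares refusal
    degrees, so sets differing only in refusal are no longer at
    distance zero. Normalizing by 4n keeps the value in [0, 1]."""
    n = len(A)
    return sum(abs(a[0] - b[0]) + abs(a[1] - b[1]) + abs(a[2] - b[2])
               + abs(refusal(a) - refusal(b))
               for a, b in zip(A, B)) / (4 * n)
```

Dropping the refusal term from this sum reproduces the three-component distances whose counterintuitive behavior motivates the paper.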

