A Hierarchical Clustering Algorithm Based on Silhouette Index for Cancer Subtype Discovery from Omics Data
Abstract
Cancer subtype discovery from omics data requires techniques to estimate the number of natural clusters in the data. Automatically estimating the number of clusters has been a challenging problem in machine learning. Using clustering algorithms together with internal cluster validity indexes has been a popular method of estimating the number of clusters in biomolecular data. We propose a Hierarchical Agglomerative Clustering algorithm, named SilHAC, which can automatically estimate the number of natural clusters and find the associated clustering solution. SilHAC is parameterless. We also present two hybrids of SilHAC that use Spectral Clustering and K-Means, respectively, as components. SilHAC and the hybrids could find reasonable estimates for the number of clusters and the associated clustering solution when applied to a collection of cancer gene expression datasets. The proposed methods are better alternatives to the 'clustering algorithm - internal cluster validity index' pipelines for estimating the number of natural clusters.
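For orientation, the conventional 'clustering algorithm - internal cluster validity index' pipeline that the abstract contrasts with can be sketched as follows. This is not SilHAC, whose details the abstract does not give. It is a minimal illustration that assumes Euclidean distance, average linkage, and the mean silhouette width as the validity index. The function names (`agglomerate`, `silhouette`, `estimate_k`) and the `k_max` bound are hypothetical and chosen only for this example.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length coordinate tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def agglomerate(dist):
    """Yield average-linkage partitions, from n singleton clusters down to 1."""
    clusters = [[i] for i in range(len(dist))]
    yield [c[:] for c in clusters]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                ci, cj = clusters[i], clusters[j]
                # Average linkage: mean pairwise distance between the two clusters.
                d = sum(dist[a][b] for a in ci for b in cj) / (len(ci) * len(cj))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
        yield [c[:] for c in clusters]


def silhouette(dist, clusters):
    """Mean silhouette width of a partition with at least two clusters."""
    total, n = 0.0, len(dist)
    for idx, own in enumerate(clusters):
        for p in own:
            if len(own) == 1:
                continue  # a singleton contributes a silhouette of 0
            # a: mean distance to the other members of the point's own cluster.
            a = sum(dist[p][q] for q in own if q != p) / (len(own) - 1)
            # b: smallest mean distance to the members of any other cluster.
            b = min(
                sum(dist[p][q] for q in other) / len(other)
                for k, other in enumerate(clusters)
                if k != idx
            )
            if max(a, b) > 0:
                total += (b - a) / max(a, b)
    return total / n


def estimate_k(points, k_max=6):
    """Return (k, labels, score) for the best-scoring cut with 2 <= k <= k_max.

    Assumes at least two points. k_max is a user-supplied bound that this
    baseline pipeline needs.
    """
    dist = [[euclidean(p, q) for q in points] for p in points]
    best_score, best_clusters = float("-inf"), None
    for clusters in agglomerate(dist):
        if 2 <= len(clusters) <= k_max:
            score = silhouette(dist, clusters)
            if score > best_score:
                best_score, best_clusters = score, clusters
    labels = [0] * len(points)
    for label, members in enumerate(best_clusters):
        for p in members:
            labels[p] = label
    return len(best_clusters), labels, best_score


# Toy data (made up for illustration): three tight groups of five 2-D points.
centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
offsets = [(0.0, 0.0), (0.5, 0.0), (-0.5, 0.0), (0.0, 0.5), (0.0, -0.5)]
points = [(cx + dx, cy + dy) for cx, cy in centers for dx, dy in offsets]
best_k, labels, score = estimate_k(points)
print(best_k, round(score, 3))
```

Note that this baseline scores every candidate partition after the fact and relies on a user-chosen upper bound `k_max`, which is the kind of external setting a parameterless method avoids.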