Clustering Algorithms: An Exploratory Review

2021 ◽  
Author(s):  
R.S.M. Lakshmi Patibandla ◽  
Veeranjaneyulu N

Data clustering is the process of grouping similar data items into groups. A data set is partitioned into groups based on the resemblance of items within each group, using various algorithms. The key idea of partition-based algorithms is to split the data points into partitions, each of which represents one cluster, and the quality of a partition depends on an objective function. Evolutionary algorithms, modelled on the evolution of social behaviour, are used to provide near-optimal solutions to large optimization problems. In this paper, a survey of various partitioning and evolutionary algorithms, such as Leader, ISODATA, SGO, and PSO, applied to a benchmark dataset is presented, and validation criteria such as Root-Mean-Square Standard Deviation, R-square and SSD, etc., are proposed for comparing them.
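As a rough illustration of this kind of evaluation, the sketch below runs k-means (one representative partition-based algorithm; the survey also covers Leader, ISODATA, SGO, and PSO) on a standard benchmark data set and computes RMSSTD and R-square with their textbook definitions; the exact variants used in the survey may differ.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Partition a benchmark data set with k-means (a partition-based algorithm).
X = load_iris().data
k = 3
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)

d = X.shape[1]
# Total sum of squares around the global mean and within-cluster sum of squares.
sst = ((X - X.mean(axis=0)) ** 2).sum()
ssw = sum(((X[labels == c] - X[labels == c].mean(axis=0)) ** 2).sum()
          for c in range(k))

# Root-Mean-Square Standard Deviation: pooled within-cluster dispersion.
rmsstd = np.sqrt(ssw / (d * sum((labels == c).sum() - 1 for c in range(k))))
# R-square: fraction of the total variance explained by the partition.
r_square = (sst - ssw) / sst
print(f"RMSSTD = {rmsstd:.3f}, R-square = {r_square:.3f}")
```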

2017 ◽  
Vol 48 (1) ◽  
Author(s):  
Josana Andreia Langner ◽  
Nereu Augusto Streck ◽  
Angelica Durigon ◽  
Stefanía Dalmolin da Silva ◽  
Isabel Lago ◽  
...  

ABSTRACT: The objective of this study was to compare the simulations of leaf appearance of landrace and improved maize cultivars using the CSM-CERES-Maize (linear) and the Wang and Engel (nonlinear) models. The coefficients of the models were calibrated using a data set of total leaf number collected in the 11/04/2013 sowing date for the landrace varieties ‘Cinquentinha’ and ‘Bico de Ouro’ and the simple hybrid ‘AS 1573PRO’. For the ‘BRS Planalto’ variety, model coefficients were estimated with data from the 12/13/2014 sowing date. The models were evaluated with independent data sets collected during the growing seasons of 2013/2014 (Experiment 1) and 2014/2015 (Experiment 2) in Santa Maria, RS, Brazil. Total number of leaves for both landrace and improved maize varieties was better estimated with the Wang and Engel model, with a root mean square error of 1.0 leaf, while estimations with the CSM-CERES-Maize model had a root mean square error of 1.5 leaves.
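The nonlinearity of the Wang and Engel approach comes from its beta-shaped temperature response. A minimal sketch of that response function follows; the cardinal temperatures and the maximum leaf appearance rate are illustrative placeholders, not the coefficients calibrated in this study.

```python
import numpy as np

def wang_engel_response(temp, t_min=8.0, t_opt=31.0, t_max=41.0):
    """Wang and Engel beta temperature-response function, scaled 0-1.

    Cardinal temperatures are illustrative placeholders for maize,
    not the coefficients calibrated in the study.
    """
    alpha = np.log(2.0) / np.log((t_max - t_min) / (t_opt - t_min))
    t = np.asarray(temp, dtype=float)
    f = np.zeros_like(t)
    inside = (t >= t_min) & (t <= t_max)
    ti = t[inside] - t_min
    to = t_opt - t_min
    f[inside] = (2.0 * ti**alpha * to**alpha - ti**(2.0 * alpha)) / to**(2.0 * alpha)
    return f

# Daily leaf appearance: the nonlinear response times a maximum rate
# (leaves/day), accumulated to give total leaf number over the season.
daily_mean_temp = np.array([18.0, 22.0, 26.0, 30.0, 33.0])
lar_max = 0.5  # hypothetical maximum leaf appearance rate
leaves_per_day = lar_max * wang_engel_response(daily_mean_temp)
print(leaves_per_day.cumsum())
```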


2018 ◽  
Vol 10 (4) ◽  
pp. 55 ◽  
Author(s):  
Chuki Sangalugeme ◽  
Philbert Luhunga ◽  
Agness Kijazi ◽  
Hamza Kabelwa

The WAVEWATCH III model is a third-generation wave model and is commonly used for wave forecasting over different oceans. In this study, the performance of WAVEWATCH III in simulating ocean wave characteristics (wavelengths and wave heights (amplitudes)) over the western Indian Ocean along the coast of East African countries was validated against satellite observation data. Simulated significant wave heights (SWH) and wavelengths over the South West Indian Ocean domain during June 2014 were compared with satellite observations. Statistical measures of model performance that include bias, Mean Error (ME), Root Mean Square Error (RMSE), Standard Deviation of error (SDE) and Correlation Coefficient (r) are used. It is found that in June 2014, when forced by wind data from the Global Forecast System (GFS), the WAVEWATCH III model simulated the wave heights over the coast of East African countries with bias, Mean Error (ME), Root Mean Square Error (RMSE), Correlation Coefficient (r) and Standard Deviation of error (SDE) in the ranges of -0.25 to -0.39 m, 0.71 to 3.38 m, 0.84 to 1.84 m, 0.55 to 0.76 and 0.38 to 0.44, respectively. When the model was forced by wind data from the European Centre for Medium-Range Weather Forecasts (ECMWF), it simulated wave heights with bias, Mean Error (ME), Root Mean Square Error (RMSE), Correlation Coefficient (r) and Standard Deviation of error (SDE) in the ranges of -0.034 to 0.008 m, 0.0006 to 0.049 m, 0.026 to 0.22 m, 0.76 to 0.89 and 0.31 to 0.41, respectively. This implies that the WAVEWATCH III model performs better in simulating wave characteristics over the South West Indian Ocean when forced by boundary conditions from ECMWF than from GFS.
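A minimal sketch of how such verification statistics can be computed from paired simulated and observed wave heights; the definitions below are common textbook ones, and the study's exact conventions (for instance, how bias and mean error are distinguished) may differ. The wave-height values are hypothetical.

```python
import numpy as np

def verification_stats(sim, obs):
    """Standard verification measures between simulated and observed values."""
    sim, obs = np.asarray(sim, float), np.asarray(obs, float)
    err = sim - obs
    bias = err.mean()                 # mean of (simulated - observed)
    mae = np.abs(err).mean()          # mean absolute error
    rmse = np.sqrt((err ** 2).mean()) # root mean square error
    sde = err.std(ddof=1)             # standard deviation of error
    r = np.corrcoef(sim, obs)[0, 1]   # correlation coefficient
    return bias, mae, rmse, sde, r

# Hypothetical significant wave heights (metres).
obs = [1.8, 2.1, 2.4, 2.0, 1.6, 2.7]
sim = [1.6, 2.0, 2.1, 1.9, 1.5, 2.3]
print(verification_stats(sim, obs))
```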


1979 ◽  
Vol 25 (3) ◽  
pp. 432-438 ◽  
Author(s):  
P J Cornbleet ◽  
N Gochman

Abstract The least-squares method is frequently used to calculate the slope and intercept of the best line through a set of data points. However, least-squares regression slopes and intercepts may be incorrect if the underlying assumptions of the least-squares model are not met. Two factors in particular that may result in incorrect least-squares regression coefficients are: (a) imprecision in the measurement of the independent (x-axis) variable and (b) inclusion of outliers in the data analysis. We compared the methods of Deming, Mandel, and Bartlett in estimating the known slope of a regression line when the independent variable is measured with imprecision, and found the method of Deming to be the most useful. Significant error in the least-squares slope estimation occurs when the ratio of the standard deviation of measurement of a single x value to the standard deviation of the x-data set exceeds 0.2. Errors in the least-squares coefficients attributable to outliers can be avoided by eliminating data points whose vertical distance from the regression line exceeds four times the standard error of the estimate.
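A sketch of Deming regression using the standard closed-form slope estimate; delta is the assumed ratio of the y-error variance to the x-error variance, and the data are hypothetical method-comparison measurements, not those analysed in the paper.

```python
import numpy as np

def deming_regression(x, y, delta=1.0):
    """Deming regression slope and intercept.

    delta is the ratio of y-error variance to x-error variance
    (delta = 1 gives orthogonal regression). Textbook formula, offered
    as a sketch of the approach, not the paper's exact procedure.
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    sxx = ((x - x.mean()) ** 2).sum()
    syy = ((y - y.mean()) ** 2).sum()
    sxy = ((x - x.mean()) * (y - y.mean())).sum()
    slope = (syy - delta * sxx + np.sqrt((syy - delta * sxx) ** 2
             + 4 * delta * sxy ** 2)) / (2 * sxy)
    intercept = y.mean() - slope * x.mean()
    return slope, intercept

# Hypothetical data in which both axes carry measurement error.
x = [1.0, 2.1, 2.9, 4.2, 5.1, 5.9]
y = [1.2, 2.0, 3.1, 4.0, 5.3, 6.1]
print(deming_regression(x, y))
```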


Author(s):  
UREERAT WATTANACHON ◽  
CHIDCHANOK LURSINSAP

Existing clustering algorithms, such as single-link clustering, k-means, CURE, and CSM, are designed to find clusters based on predefined parameters specified by users. These algorithms may be unsuccessful if the choice of parameters is inappropriate with respect to the data set being clustered. Most of these algorithms work very well only for compact and hyper-spherical clusters. In this paper, a new hybrid clustering algorithm called Self-Partition and Self-Merging (SPSM) is proposed. The SPSM algorithm partitions the input data set into several subclusters in the first phase and then removes the noisy data in the second phase. In the third phase, the normal subclusters are continuously merged to form larger clusters based on inter-cluster and intra-cluster distance criteria. The experimental results show that the SPSM algorithm handles noisy data efficiently and clusters data sets of arbitrary shape and varying density. Several examples on color images show the versatility of the proposed method, and the results are compared with those reported in the literature for the same images. The computational complexity of the SPSM algorithm is O(N2), where N is the number of data points.
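The merging phase rests on comparing inter-cluster and intra-cluster distances. The sketch below only illustrates that trade-off with a simple centroid-based rule; it is not the actual SPSM merging criterion, and the threshold factor is an arbitrary placeholder.

```python
import numpy as np

def intra_distance(cluster):
    """Mean distance of a subcluster's points to its centroid."""
    centroid = cluster.mean(axis=0)
    return np.linalg.norm(cluster - centroid, axis=1).mean()

def should_merge(a, b, factor=1.5):
    """Merge two subclusters when their centroid separation is small
    relative to their internal spread. Illustrative rule only, not the
    SPSM criterion; factor is an arbitrary placeholder."""
    inter = np.linalg.norm(a.mean(axis=0) - b.mean(axis=0))
    return inter <= factor * (intra_distance(a) + intra_distance(b)) / 2

rng = np.random.default_rng(0)
a = rng.normal(0.0, 0.3, size=(50, 2))
b = a + [0.4, 0.0]   # nearby subcluster, likely merged
c = a + [5.0, 5.0]   # distant subcluster, kept separate
print(should_merge(a, b), should_merge(a, c))
```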


Author(s):  
Shapol M. Mohammed ◽  
Karwan Jacksi ◽  
Subhi R. M. Zeebaree

Semantic similarity is the process of identifying relevant data semantically. The traditional way of identifying document similarity is by using synonymous keywords and syntactic matching. In contrast, semantic similarity finds similar data using the meaning of words and semantics. Clustering groups objects that have the same features and properties into a cluster and separates them from objects that have different features and properties. In semantic document clustering, documents are clustered using semantic similarity techniques with similarity measurements. One of the common techniques to cluster documents is the family of density-based clustering algorithms, which use the density of data points as the main strategy for measuring the similarity between them. In this paper, a state-of-the-art survey is presented to analyze density-based algorithms for clustering documents. Furthermore, the similarity and evaluation measures are investigated with the selected algorithms to identify the most common ones. The delivered review revealed that the most used density-based algorithms in document clustering are DBSCAN and DPC. The most effective similarity measure used with density-based algorithms, specifically DBSCAN and DPC, is cosine similarity, with the F-measure used for performance and accuracy evaluation.
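A minimal sketch of the combination the review identifies as most common: clustering TF-IDF document vectors with DBSCAN under the cosine metric (scikit-learn). The toy corpus and the eps/min_samples settings are illustrative only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import DBSCAN

# Hypothetical toy corpus; real document clustering would use a larger set.
docs = [
    "density based clustering of documents",
    "document clustering with density estimation",
    "ocean wave height forecasting model",
    "wave model validation against satellite data",
]

# Represent documents as TF-IDF vectors and cluster them with DBSCAN
# using cosine distance between the vectors.
tfidf = TfidfVectorizer().fit_transform(docs)
labels = DBSCAN(eps=0.85, min_samples=2, metric="cosine").fit_predict(tfidf)
print(labels)
```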


2021 ◽  
Vol 12 ◽  
Author(s):  
Mohsen Shahhosseini ◽  
Guiping Hu ◽  
Saeed Khaki ◽  
Sotirios V. Archontoulis

We investigate the predictive performance of two novel CNN-DNN machine learning ensemble models in predicting county-level corn yields across the US Corn Belt (12 states). The developed data set is a combination of management, environment, and historical corn yields from 1980 to 2019. Two scenarios for ensemble creation are considered: homogeneous and heterogeneous ensembles. In homogeneous ensembles, the base CNN-DNN models are all the same, but they are generated with a bagging procedure to ensure they exhibit a certain level of diversity. Heterogeneous ensembles are created from different base CNN-DNN models which share the same architecture but have different hyperparameters. Three types of ensemble creation methods were used to create several ensembles for either of the scenarios: Basic Ensemble Method (BEM), Generalized Ensemble Method (GEM), and stacked generalized ensembles. Results indicated that both designed ensemble types (heterogeneous and homogeneous) outperform the ensembles created from five individual ML models (linear regression, LASSO, random forest, XGBoost, and LightGBM). Furthermore, by introducing improvements over the heterogeneous ensembles, the homogeneous ensembles provide the most accurate yield predictions across US Corn Belt states. This model could make 2019 yield predictions with a root mean square error of 866 kg/ha, equivalent to 8.5% relative root mean square error, and could successfully explain about 77% of the spatio-temporal variation in the corn grain yields. The significant predictive power of this model can be leveraged for designing a reliable tool for corn yield prediction, which will in turn assist agronomic decision makers.
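A rough sketch of the two simpler ensemble creation methods: BEM as a plain average of base-model predictions, and GEM with weights derived from the validation-error covariance matrix (in the style of Perrone and Cooper). The base-model predictions are hypothetical and the paper's own optimisation may differ.

```python
import numpy as np

def bem(preds):
    """Basic Ensemble Method: simple average of base-model predictions."""
    return np.mean(preds, axis=0)

def gem_weights(preds, y_val):
    """Generalized Ensemble Method weights from the validation-error
    covariance matrix; a sketch of the idea, not necessarily the exact
    optimisation used in the paper."""
    errors = preds - y_val                    # (n_models, n_samples) residuals
    cov = errors @ errors.T / errors.shape[1]
    cov_inv = np.linalg.pinv(cov)
    ones = np.ones(preds.shape[0])
    return cov_inv @ ones / (ones @ cov_inv @ ones)

# Hypothetical validation predictions (kg/ha) from three base yield models.
y_val = np.array([9000.0, 10500.0, 11200.0, 9800.0])
preds = np.array([
    [9100.0, 10300.0, 11050.0, 9900.0],
    [8800.0, 10700.0, 11400.0, 9600.0],
    [9250.0, 10450.0, 11150.0, 9850.0],
])

w = gem_weights(preds, y_val)
print("BEM:", bem(preds))
print("GEM weights:", w, "GEM:", w @ preds)
```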


Nowadays, clustering plays a vital role in big data, where it is very difficult to analyze and cluster large volumes of data. Clustering is a procedure for grouping similar data objects of a data set, aiming for high intra-cluster similarity within each cluster and low inter-cluster similarity between clusters. Clustering is used in statistical analysis, geographical maps, biological cell analysis, and in Google Maps. The various approaches to clustering include grid-based clustering, density-based clustering, hierarchical methods, and partitioning approaches. In this survey paper we focus on these algorithms for large datasets such as big data and report a comparison among them; the main metric used to differentiate the algorithms is time complexity.


1983 ◽  
Vol 73 (2) ◽  
pp. 615-632
Author(s):  
Martin W. McCann ◽  
David M. Boore

Abstract Data from the 1971 San Fernando, California, earthquake provided the opportunity to study the variation of ground motions on a local scale. The uncertainty in ground motion was analyzed by studying the residuals about a regression with distance and by utilizing the network of strong-motion instruments in three local geographic regions in the Los Angeles area. Our objectives were to compare the uncertainty in the peak ground acceleration (PGA) and root mean square acceleration (RMSa) about regressions on distance, and to isolate components of the variance. We find that the RMSa has only a slightly lower logarithmic standard deviation than the PGA and conclude that the RMSa does not provide a more stable measure of ground motion than does the PGA (as is commonly assumed). By conducting an analysis of the residuals, we have estimated contributions to the scatter in high-frequency ground motion due to phenomena local to the recording station, building effects defined by the depth of instrument embedment, and propagation-path effects. We observe a systematic decrease in both PGA and RMSa with increasing embedment depth. After removing this effect, we still find a significant variation (a standard deviation equivalent to a factor of up to 1.3) in the ground motions within small regions (circles of 0.5 km radius). We conclude that detailed studies which account for local site effects, including building effects, could reduce the uncertainty in ground motion predictions (as much as a factor of 1.3) attributable to these components. However, an irreducible component of the scatter in attenuation remains due to the randomness of stress release along faults during earthquakes. In a recent paper, Joyner and Boore (1981) estimate that the standard deviation associated with intra-earthquake variability corresponds to a factor of 1.35.
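For reference, PGA and RMSa can be computed from an accelerogram as sketched below. In practice the averaging window for RMSa is often restricted to a significant-duration interval, a refinement omitted here, and the record is synthetic.

```python
import numpy as np

def pga(accel):
    """Peak ground acceleration: maximum absolute value of the record."""
    return np.max(np.abs(accel))

def rms_acceleration(accel):
    """Root mean square acceleration over the full record (no duration
    windowing applied in this sketch)."""
    return np.sqrt(np.mean(np.asarray(accel, dtype=float) ** 2))

# Synthetic accelerogram: exponentially decaying modulated noise, in g.
rng = np.random.default_rng(1)
t = np.arange(0.0, 20.0, 0.01)
accel = np.exp(-0.2 * t) * rng.normal(0.0, 0.1, t.size)
print(f"PGA = {pga(accel):.4f} g, RMSa = {rms_acceleration(accel):.4f} g")
```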


2019 ◽  
Vol 31 (3) ◽  
pp. 596-612 ◽  
Author(s):  
DJ Strouse ◽  
David J. Schwab

The information bottleneck (IB) approach to clustering takes a joint distribution p(X, Y) and maps the data X to cluster labels T, which retain maximal information about Y (Tishby, Pereira, & Bialek, 1999). This objective results in an algorithm that clusters data points based on the similarity of their conditional distributions p(Y | X). This is in contrast to classic geometric clustering algorithms such as k-means and gaussian mixture models (GMMs), which take a set of observed data points x and cluster them based on their geometric (typically Euclidean) distance from one another. Here, we show how to use the deterministic information bottleneck (DIB) (Strouse & Schwab, 2017), a variant of IB, to perform geometric clustering by choosing cluster labels that preserve information about data point location on a smoothed data set. We also introduce a novel intuitive method to choose the number of clusters via kinks in the information curve. We apply this approach to a variety of simple clustering problems, showing that DIB with our model selection procedure recovers the generative cluster labels. We also show that, in particular limits of our model parameters, clustering with DIB and IB is equivalent to k-means and EM fitting of a GMM with hard and soft assignments, respectively. Thus, clustering with (D)IB generalizes and provides an information-theoretic perspective on these classic algorithms.
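The hard-versus-soft distinction drawn in those limits can be illustrated with standard tools, as in the sketch below: k-means gives hard assignments while EM fitting of a GMM gives soft posterior responsibilities. This does not implement the (D)IB objective itself, and the data are synthetic.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# Two well-separated gaussian blobs; purely illustrative data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (100, 2)), rng.normal(2, 0.5, (100, 2))])

# Hard assignments (the k-means-like limit) versus soft posterior
# responsibilities (the EM/GMM-like limit).
hard = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
soft = gmm.predict_proba(X)
print(hard[:5], soft[:5].round(3))
```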


2002 ◽  
Vol 6 (4) ◽  
pp. 685-694 ◽  
Author(s):  
M. J. Hall ◽  
A. W. Minns ◽  
A. K. M. Ashrafuzzaman

Abstract. Flood quantile estimation for ungauged catchment areas continues to be a routine problem faced by the practising Engineering Hydrologist, yet the hydrometric networks in many countries are reducing rather than expanding. The result is an increasing reliance on methods for regionalising hydrological variables. Among the most widely applied techniques is the Method of Residuals, an iterative method of classifying catchment areas by their geographical proximity based upon the application of Multiple Linear Regression Analysis (MLRA). Alternative classification techniques, such as cluster analysis, have also been applied but not on a routine basis. However, hydrological regionalisation can also be regarded as a problem in data mining — a search for useful knowledge and models embedded within large data sets. In particular, Artificial Neural Networks (ANNs) can be applied both to classify catchments according to their geomorphological and climatic characteristics and to relate flow quantiles to those characteristics. This approach has been applied to three data sets from the south-west of England and Wales; to England, Wales and Scotland (EWS); and to the islands of Java and Sumatra in Indonesia. The results demonstrated that hydrologically plausible clusters can be obtained under contrasting conditions of climate. The four classes of catchment found in the EWS data set were found to be compatible with the three classes identified in the earlier study of a smaller data set from south-west England and Wales. Relationships for the parameters of the at-site distribution of annual floods can be developed that are superior to those based upon MLRA in terms of root mean square errors of validation data sets. Indeed, the results from Java and Sumatra demonstrate a clear advantage in reduced root mean square error of the dependent flow variable through recognising the presence of three classes of catchment. Wider evaluation of this methodology is recommended. Keywords: regionalisation, floods, catchment characteristics, data mining, artificial neural networks
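A rough sketch of the two-step regionalisation idea described above: classify catchments by their characteristics, then relate a flow quantile to those characteristics within each class with a small neural network, reporting the validation RMSE used for comparison against an MLRA baseline. The catchment descriptors and quantiles here are synthetic placeholders, not the study's data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

# Hypothetical catchment descriptors (area, rainfall, slope, ...) and a
# synthetic flood quantile to regionalise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(120, 4))
y = 50 * X[:, 0] + 30 * X[:, 1] ** 2 + rng.normal(0, 2, 120)

# Step 1: classify catchments by their characteristics.
classes = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Step 2: relate the flow quantile to the characteristics within each class
# and report the validation RMSE.
X_train, X_val = X[:90], X[90:]
y_train, y_val = y[:90], y[90:]
c_train, c_val = classes[:90], classes[90:]
models = {c: MLPRegressor(hidden_layer_sizes=(8,), max_iter=5000,
                          random_state=0).fit(X_train[c_train == c],
                                              y_train[c_train == c])
          for c in range(3)}
pred = np.array([models[c].predict(x.reshape(1, -1))[0]
                 for x, c in zip(X_val, c_val)])
print(f"validation RMSE: {np.sqrt(mean_squared_error(y_val, pred)):.2f}")
```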

