Atmospheric Circulation Regimes: Can Cluster Analysis Provide the Number?

2007, Vol 20 (10), pp. 2229-2250
Author(s): Bo Christiansen

Abstract: The existence of multiple regimes in the extratropical tropospheric circulation is a hypothesis of theoretical importance with potential practical consequences. It is also a controversial hypothesis, and an abundance of conflicting results regarding both the existence and the number of regimes can be found in the literature. Studies of atmospheric regime behavior are often based on clustering methods such as k-means and mixture models. In the basic implementation of these methods the number of clusters has to be specified a priori, and “How many clusters?” is a highly nontrivial question. For the mixture model, a procedure to assess the number of clusters by cross validation has recently been introduced. For the k-means model, a Monte Carlo test is introduced that compares the clustering of the original data with the clustering of Gaussian distributed surrogate data. The robustness of these methods and their ability to produce the right number of clusters are critically assessed. The study is based on both idealized data and atmospheric data. It is shown that applying the clustering methods to the Northern Hemisphere winter tropospheric geopotential heights gives conflicting and fragile results. In particular, the number of clusters depends both on the clustering algorithm and on the period considered. Furthermore, the clustering methods find multiple clusters when applied to data similar to the atmospheric data but drawn from a unimodal, skewed distribution. It is also shown that both clustering methods report multiple clusters for idealized data drawn from distributions that are skewed or platykurtic but otherwise smooth and without bumps or shoulders. In these cases the number of clusters found depends on the sample size. In particular, for the mixture model the number of clusters increases without bound with increasing sample size. It is concluded that in the atmospheric dataset studied the clustering methods provide only weak evidence for multiple regimes, although the data are non-Gaussian with high statistical significance. It is also concluded that statistical models with basically unknown properties should be approached with utmost care or avoided completely.
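
The Monte Carlo test described above can be made concrete with a small sketch: cluster the data for a given k, then compare the within-cluster sum of squares with that obtained from Gaussian surrogate data sharing the sample mean and covariance. This is only an illustrative reconstruction under assumed choices (k-means inertia as the test statistic, a simple surrogate count as the p-value), not Christiansen's exact procedure.

```python
# Minimal sketch of a Monte Carlo surrogate test for the number of k-means
# clusters. Test statistic and "p-value" rule are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans

def within_ss(data, k, seed=0):
    """Within-cluster sum of squares (k-means inertia) for k clusters."""
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit(data).inertia_

def surrogate_test(data, k, n_surrogates=100, seed=0):
    """Fraction of Gaussian surrogates that cluster at least as tightly as the data.

    Surrogates share the data's sample mean and covariance; a small fraction
    suggests the data cluster better than a single multivariate Gaussian would.
    """
    rng = np.random.default_rng(seed)
    mean, cov = data.mean(axis=0), np.cov(data, rowvar=False)
    observed = within_ss(data, k)
    surrogate = [within_ss(rng.multivariate_normal(mean, cov, size=len(data)), k)
                 for _ in range(n_surrogates)]
    return float(np.mean(np.array(surrogate) <= observed))

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Two well-separated 2D Gaussian blobs: k = 2 should look "significant".
    data = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(5, 1, (200, 2))])
    for k in (2, 3, 4):
        print(k, surrogate_test(data, k, n_surrogates=50))
```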

2021, Vol 2021, pp. 1-11
Author(s): Ji Feng, Bokai Zhang, Ruisheng Ran, Wanli Zhang, Degang Yang

Traditional clustering methods often require the user to select neighborhood parameters and the number of clusters, and the optimal choice of these parameters varies with the shape of the data, which requires prior knowledge. To address this parameter selection problem, we propose an effective clustering algorithm based on adaptive neighborhoods, which can obtain satisfactory clustering results without setting the neighborhood parameters or the number of clusters. The core idea of the algorithm is to first iterate adaptively to a logarithmic stable state and obtain neighborhood information according to the distribution characteristics of the dataset, then mark and peel off the boundary points according to this neighborhood information, and finally form clusters around the core points. We have conducted extensive comparative experiments on datasets of different sizes and distributions and achieved satisfactory results.
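
A rough sketch of the boundary-peeling idea follows: a k-nearest-neighbour density marks low-density boundary points, the remaining core points are linked into clusters, and the peeled points are attached to the nearest core cluster. The fixed k and peeling threshold here are illustrative stand-ins; the proposed algorithm derives its neighborhood information adaptively, without such user-set parameters.

```python
# Illustrative sketch of boundary peeling with a k-NN density estimate. The
# fixed k and peel_quantile are placeholders for the adaptive neighborhood
# information the algorithm derives from the data itself.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from sklearn.neighbors import NearestNeighbors

def peel_and_cluster(data, k=10, peel_quantile=0.2):
    # Local density proxy: inverse mean distance to the k nearest neighbours.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(data)
    dist, idx = nn.kneighbors(data)                 # column 0 is the point itself
    density = 1.0 / (dist[:, 1:].mean(axis=1) + 1e-12)

    # Peel the lowest-density points off as boundary points.
    core = density >= np.quantile(density, peel_quantile)
    core_idx = np.where(core)[0]
    pos = -np.ones(len(data), dtype=int)
    pos[core_idx] = np.arange(len(core_idx))

    # Link each core point to the core points among its k nearest neighbours.
    rows, cols = [], []
    for i in core_idx:
        for j in idx[i, 1:]:
            if core[j]:
                rows.append(pos[i])
                cols.append(pos[j])
    graph = csr_matrix((np.ones(len(rows)), (rows, cols)),
                       shape=(len(core_idx), len(core_idx)))
    n_clusters, core_labels = connected_components(graph, directed=False)

    # Attach every point (including peeled ones) to its nearest core point.
    nearest = NearestNeighbors(n_neighbors=1).fit(data[core_idx]) \
        .kneighbors(data, return_distance=False)[:, 0]
    return n_clusters, core_labels[nearest]
```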


1999, Vol 11 (2), pp. 521-540
Author(s): Isaac David Guedalia, Mickey London, Michael Werman

An on-line agglomerative clustering algorithm for nonstationary data is described. Three issues are addressed. The first regards the temporal aspects of the data. The clustering of stationary data by the proposed algorithm is comparable to that of the other popular algorithms tested (batch and on-line). The second issue addressed is the number of clusters required to represent the data. The algorithm provides an efficient framework for determining the natural number of clusters given the scale of the problem. Finally, the proposed algorithm implicitly minimizes the local distortion, a measure that takes into account clusters with relatively small mass. In contrast, most existing on-line clustering methods assume stationarity of the data. When used to cluster nonstationary data, these methods fail to generate a good representation. Moreover, most current algorithms are computationally intensive when determining the correct number of clusters. These algorithms tend to neglect clusters of small mass due to their minimization of the global distortion (energy).
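
To make the on-line setting concrete, the toy sketch below maintains a fixed budget of centroids that are updated as points arrive and merged (mass-weighted) when the budget is exceeded. It is only an illustration of on-line agglomeration, not the authors' algorithm, which additionally targets the local distortion and adapts to nonstationary data.

```python
# Toy on-line agglomerative clustering: absorb each incoming point and merge
# the two closest centroids (weighted by mass) when the budget is exceeded.
import numpy as np

class OnlineAgglomerative:
    def __init__(self, max_clusters=20):
        self.max_clusters = max_clusters
        self.centroids = []   # list of np.ndarray centroids
        self.masses = []      # number of points absorbed by each centroid

    def update(self, x):
        self.centroids.append(np.asarray(x, dtype=float).copy())
        self.masses.append(1.0)
        if len(self.centroids) > self.max_clusters:
            self._merge_closest()

    def _merge_closest(self):
        c = np.array(self.centroids)
        # Pairwise squared distances; ignore the diagonal.
        d2 = ((c[:, None, :] - c[None, :, :]) ** 2).sum(-1)
        np.fill_diagonal(d2, np.inf)
        i, j = np.unravel_index(np.argmin(d2), d2.shape)
        mi, mj = self.masses[i], self.masses[j]
        merged = (mi * c[i] + mj * c[j]) / (mi + mj)   # mass-weighted merge
        keep = [k for k in range(len(self.centroids)) if k not in (i, j)]
        self.centroids = [self.centroids[k] for k in keep] + [merged]
        self.masses = [self.masses[k] for k in keep] + [mi + mj]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    model = OnlineAgglomerative(max_clusters=5)
    stream = np.vstack([rng.normal(m, 0.3, (300, 2)) for m in (0.0, 3.0, 6.0)])
    for x in stream:
        model.update(x)
    print(len(model.centroids), "centroids retained")
```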


2014, Vol 40 (3), pp. 671-685
Author(s): Linlin Li, Ivan Titov, Caroline Sporleder

Information-theoretic measures are among the most standard techniques for the evaluation of clustering methods, including word sense induction (WSI) systems. Such measures rely on sample-based estimates of the entropy. However, the standard maximum likelihood estimates of the entropy are heavily biased, with the bias dependent on, among other things, the number of clusters and the sample size. This makes the measures unreliable and unfair when the number of clusters produced by different systems varies and the sample size is not exceedingly large. This corresponds exactly to the setting of WSI evaluation, where a ground-truth number of sense clusters arguably does not exist and the standard evaluation scenarios use a small number of instances of each word to compute the score. We describe more accurate entropy estimators and analyze their performance both in simulations and in the evaluation of WSI systems.
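
The bias in question can be stated concretely: the plug-in (maximum likelihood) entropy estimate is low by approximately (K − 1)/(2N) nats, where K is the number of occupied clusters and N the sample size, which is exactly the dependence on cluster number and sample size noted above. The sketch below contrasts the plug-in estimate with the classical Miller–Madow correction; the estimators studied in the paper itself are more accurate than this first-order fix.

```python
# Plug-in entropy estimate versus the first-order Miller-Madow correction.
import numpy as np

def entropy_ml(labels):
    """Plug-in (maximum-likelihood) entropy estimate in nats."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def entropy_miller_madow(labels):
    """Plug-in estimate plus the (K - 1) / (2N) first-order bias correction."""
    _, counts = np.unique(labels, return_counts=True)
    n, k = counts.sum(), (counts > 0).sum()
    return entropy_ml(labels) + (k - 1) / (2 * n)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    true_h = np.log(8)                       # uniform distribution over 8 clusters
    sample = rng.integers(0, 8, size=50)     # small sample, as in WSI scoring
    print("ML:", entropy_ml(sample),
          "Miller-Madow:", entropy_miller_madow(sample),
          "true:", true_h)
```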


2021, Vol 11 (1)
Author(s): Baicheng Lyu, Wenhua Wu, Zhiqiang Hu

Abstract: With the widespread application of cluster analysis, the number of clusters encountered is gradually increasing, as is the difficulty of selecting indicators for judging the number of clusters. Also, small clusters are crucial to discovering the extreme characteristics of data samples, but current clustering algorithms focus mainly on analyzing large clusters. In this paper, a bidirectional clustering algorithm based on local density (BCALoD) is proposed. BCALoD establishes connections between data points based on local density, can automatically determine the number of clusters, is more sensitive to small clusters, and reduces the number of parameters to be tuned to a minimum. Building on the robustness of the cluster number to noise, a denoising method suitable for BCALoD is proposed. A different cutoff distance and cutoff density are assigned to each data cluster, which improves clustering performance. The clustering ability of BCALoD is verified on randomly generated datasets and on city-light satellite images.
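
BCALoD itself is not reproduced here, but the local-density machinery it builds on is close in spirit to density-peaks clustering (Rodriguez and Laio, 2014): each point receives a local density and a distance to the nearest denser point, and points scoring high on both are candidate cluster centers. The sketch below implements only that generic building block with a single global cutoff distance; BCALoD's per-cluster cutoffs, bidirectional linking, and denoising are not reproduced.

```python
# Density-peaks building block: rho is a local density within a cutoff
# distance dc, delta is the distance to the nearest denser point, and cluster
# centres are points with both large rho and large delta.
import numpy as np
from scipy.spatial.distance import cdist

def rho_delta(data, dc):
    d = cdist(data, data)
    rho = (d < dc).sum(axis=1) - 1              # neighbours within dc (excluding self)
    order = np.argsort(-rho)                    # indices from densest to sparsest
    delta = np.full(len(data), d.max())
    nearest_denser = np.full(len(data), -1)
    for rank in range(1, len(order)):
        i = order[rank]
        denser = order[:rank]                   # points at least as dense as i
        j = denser[np.argmin(d[i, denser])]
        delta[i], nearest_denser[i] = d[i, j], j
    return rho, delta, nearest_denser

def assign(rho, delta, nearest_denser, n_centers):
    score = rho.astype(float) * delta
    score[np.argmax(rho)] = np.inf              # the densest point is always a centre
    centers = np.argsort(-score)[:n_centers]
    labels = np.full(len(rho), -1)
    labels[centers] = np.arange(n_centers)
    for i in np.argsort(-rho):                  # descending density
        if labels[i] == -1:
            labels[i] = labels[nearest_denser[i]]
    return labels

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = np.vstack([rng.normal(c, 0.2, (100, 2)) for c in (0.0, 2.0, 4.0)])
    rho, delta, nd = rho_delta(data, dc=0.3)
    print(np.bincount(assign(rho, delta, nd, n_centers=3)))
```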


Mathematics, 2021, Vol 9 (6), pp. 603
Author(s): Leonid Hanin

I uncover previously underappreciated systematic sources of false and irreproducible results in the natural, biomedical and social sciences that are rooted in statistical methodology. They include the inevitably occurring deviations from basic assumptions behind statistical analyses and the use of various approximations. I show through a number of examples that (a) arbitrarily small deviations from distributional homogeneity can lead to arbitrarily large deviations in the outcomes of statistical analyses; (b) samples of random size may violate the Law of Large Numbers and are therefore generally unsuitable for conventional statistical inference; (c) the same is true, in particular, when the random sample size and the observations are stochastically dependent; and (d) the use of the Gaussian approximation based on the Central Limit Theorem has dramatic implications for p-values and statistical significance, essentially making the pursuit of small significance levels and p-values for a fixed sample size meaningless. The latter is proven rigorously for the one-sided Z test. This article could serve as cautionary guidance to scientists and practitioners employing statistical methods in their work.
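
Point (c) can be illustrated with a toy simulation: if the sample size is determined by an optional-stopping rule that depends on the observations, the resulting sample mean is systematically biased even though every observation has mean zero. The construction below is an illustration of the general phenomenon, not one of the paper's formal examples.

```python
# Toy illustration: observation-dependent (random) sample size biases the
# sample mean, even for i.i.d. N(0, 1) data.
import numpy as np

def stopped_mean(rng, max_n=1000, z_stop=2.0):
    """Sample N(0,1) values, stopping early if the running z-score exceeds z_stop."""
    x = []
    for n in range(1, max_n + 1):
        x.append(rng.normal())
        if n >= 10 and np.mean(x) * np.sqrt(n) > z_stop:
            break
    return np.mean(x), len(x)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    results = [stopped_mean(rng) for _ in range(2000)]
    means = np.array([m for m, _ in results])
    stopped_early = np.mean([n < 1000 for _, n in results])
    print(f"average sample mean: {means.mean():.3f} (true mean is 0)")
    print(f"fraction of runs stopped early: {stopped_early:.2%}")
```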


2016, Vol 11 (4), pp. 551-554
Author(s): Martin Buchheit

The first sport-science-oriented and comprehensive paper on magnitude-based inferences (MBI) was published 10 y ago in the first issue of this journal. While debate continues, MBI is today well established in sport science and in other fields, particularly clinical medicine, where practical/clinical significance often takes priority over statistical significance. In this commentary, some reasons why both academics and sport scientists should abandon null-hypothesis significance testing and embrace MBI are reviewed. Apparent limitations and future areas of research are also discussed. The following arguments are presented: P values and, in turn, study conclusions are sample-size dependent, irrespective of the size of the effect; significance does not inform on the magnitude of effects, yet magnitude is what matters most; MBI allows authors to be honest with their sample size and to better acknowledge trivial effects; the examination of magnitudes per se helps provide better research questions; MBI can be applied to assess changes in individuals; MBI improves data visualization; and MBI is supported by spreadsheets freely available on the Internet. Finally, recommendations to define the smallest important effect and to improve the presentation of standardized effects are presented.
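
The style of reasoning being advocated can be sketched in a few lines: rather than testing against a zero effect, the observed effect and its standard error are converted into probabilities that the true effect is beneficial, trivial, or harmful relative to a pre-specified smallest important effect. The thresholds and example numbers below are illustrative assumptions; the published MBI spreadsheets add further conventions (qualitative probability labels, clinical versus mechanistic inference) that are not reproduced here.

```python
# Magnitude-based style of reasoning: probabilities of beneficial / trivial /
# harmful true effects, assuming a normal sampling distribution centred on the
# observed effect.
from scipy.stats import norm

def mbi_probabilities(effect, se, smallest_important):
    """Return P(true effect > +SWC), P(|true effect| <= SWC), P(true effect < -SWC)."""
    p_beneficial = 1 - norm.cdf(smallest_important, loc=effect, scale=se)
    p_harmful = norm.cdf(-smallest_important, loc=effect, scale=se)
    p_trivial = 1 - p_beneficial - p_harmful
    return p_beneficial, p_trivial, p_harmful

if __name__ == "__main__":
    # Hypothetical example: 1.2% improvement, SE 0.8%, smallest important effect 0.5%.
    b, t, h = mbi_probabilities(effect=1.2, se=0.8, smallest_important=0.5)
    print(f"beneficial {b:.0%}, trivial {t:.0%}, harmful {h:.0%}")
```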


2007, Vol 16 (06), pp. 919-934
Author(s): Yongguo Liu, Xiaorong Pu, Yidong Shen, Zhang Yi, Xiaofeng Liao

In this article, a new genetic clustering algorithm called the Improved Hybrid Genetic Clustering Algorithm (IHGCA) is proposed to deal with the clustering problem under the minimum-sum-of-squares clustering criterion. In IHGCA, an improvement operation comprising five local iteration methods is developed to tune individuals and accelerate the convergence of the clustering algorithm, and a partition-absorption mutation operation is designed to reassign objects among different clusters. Experimental simulations demonstrate its superiority over several known genetic clustering methods.
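
A bare-bones genetic algorithm for the minimum-sum-of-squares criterion is sketched below to make the optimization setting concrete: individuals encode cluster assignments, fitness is the within-cluster sum of squares, and mutation randomly reassigns a few objects. IHGCA's five local iteration methods and its partition-absorption mutation are not reproduced; this is a generic baseline, not the authors' algorithm.

```python
# Generic genetic algorithm for minimum sum-of-squares clustering.
import numpy as np

def sse(data, labels, k):
    """Within-cluster sum of squares for a given assignment."""
    total = 0.0
    for c in range(k):
        pts = data[labels == c]
        if len(pts):
            total += ((pts - pts.mean(axis=0)) ** 2).sum()
    return total

def genetic_clustering(data, k=3, pop_size=30, generations=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(data)
    pop = [rng.integers(0, k, n) for _ in range(pop_size)]
    for _ in range(generations):
        fitness = np.array([sse(data, ind, k) for ind in pop])
        order = np.argsort(fitness)                 # lower SSE is better
        parents = [pop[i] for i in order[: pop_size // 2]]
        children = []
        while len(children) < pop_size - len(parents):
            a, b = rng.choice(len(parents), 2, replace=False)
            cut = rng.integers(1, n)                # one-point crossover
            child = np.concatenate([parents[a][:cut], parents[b][cut:]])
            flips = rng.random(n) < 0.02            # mutation: reassign ~2% of objects
            child[flips] = rng.integers(0, k, flips.sum())
            children.append(child)
        pop = parents + children
    best = min(pop, key=lambda ind: sse(data, ind, k))
    return best, sse(data, best, k)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    data = np.vstack([rng.normal(c, 0.3, (50, 2)) for c in (0.0, 3.0, 6.0)])
    labels, score = genetic_clustering(data, k=3)
    print("best within-cluster SSE:", round(score, 2))
```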


2011, Vol 6 (2), pp. 252-277
Author(s): Stephen T. Ziliak

Abstract: Student's exacting theory of errors, both random and real, marked a significant advance over the ambiguous reports of plant life and fermentation asserted by chemists from Priestley and Lavoisier down to Pasteur and Johannsen, working at the Carlsberg Laboratory. One reason seems to be that William Sealy Gosset (1876–1937), aka “Student” – he of Student's t-table and test of statistical significance – rejected artificial rules about sample size, experimental design, and the level of significance, and took instead an economic approach to the logic of decisions made under uncertainty. In his job as Apprentice Brewer, Head Experimental Brewer, and finally Head Brewer of Guinness, Student produced small samples of experimental barley, malt, and hops, seeking guidance for industrial quality control and maximum expected profit at the large-scale brewery. In the process Student invented or inspired half of modern statistics. This article draws on original archival evidence, shedding light on several core yet neglected aspects of Student's methods, that is, Guinnessometrics, not discussed by Ronald A. Fisher (1890–1962). The focus is on Student's small-sample, economic approach to real error minimization, particularly in the field and laboratory experiments he conducted on barley and malt from 1904 to 1937. Balanced designs of experiments, he found, are more efficient than random designs and have higher power to detect large and real treatment differences in a series of repeated and independent experiments. Student's world-class achievement poses a challenge to every science. Should statistical methods – such as the choice of sample size, experimental design, and level of significance – follow the purpose of the experiment, rather than the other way around? (JEL classification codes: C10, C90, C93, L66)

