Selection of the number of clusters via the bootstrap method

2012 ◽  
Vol 56 (3) ◽  
pp. 468-477 ◽  
Author(s):  
Yixin Fang ◽  
Junhui Wang


2021 ◽  
Vol 53 (2) ◽  
pp. 1-10
Author(s):  
Aparecido De Moraes ◽  
Matheus Henrique Silveira Mendes ◽  
Mauro Sérgio de Oliveira Leite ◽  
Regis De Castro Carvalho ◽  
Flávia Maria Avelar Gonçalves

The purpose of this study was to identify the ideal sample size to represent a family's potential, to identify superior families, and, in parallel, to determine which spatial arrangement gives better accuracy in the selection of new sugarcane varieties. For this purpose, five full-sib families were evaluated, each with 360 individuals, in a randomized block design with three replications, at three different spacings between plants within the row (50 cm, 75 cm, and 100 cm) and 150 cm between rows. The bootstrap method was adopted to determine the ideal sample size as well as the best spacing for evaluation. The 100-cm spacing provided the best averages for stalk number, stalk diameter, and estimated weight of stalks in the stool. The 75-cm spacing between plants gave the best power of discrimination among families for all characters evaluated. At this 75-cm spacing it was also possible to identify superior families with a sample of 30 plants per plot and three replications in the trial. Highlights: The bootstrap method was efficient for determining the ideal sample size as well as the best spacing for evaluation. The 75-cm spacing had the highest power of discrimination among families, indicating that it is the most efficient spacing for evaluating sugarcane families for selection purposes. Across all results, and taking selective accuracy as the guiding parameter for decision making, the highest values for number of stalks and weight of stalks in the stools were found at the 75-cm spacing.
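The sample-size search described above can be sketched with a simple bootstrap loop. The data here are synthetic stand-ins for the study's stalk-weight measurements, and the 5% coefficient-of-variation threshold is an illustrative stopping rule, not the paper's selective-accuracy criterion:

```python
import random
import statistics

random.seed(42)

# Hypothetical stalk weights (kg) for one 360-plant family; a synthetic
# stand-in for the real full-sib sugarcane data used in the study.
family = [random.gauss(1.8, 0.5) for _ in range(360)]

def bootstrap_mean_cv(data, n, reps=1000):
    """Coefficient of variation of the sample mean across bootstrap
    samples of size n drawn with replacement from the data."""
    means = [statistics.mean(random.choices(data, k=n)) for _ in range(reps)]
    return statistics.stdev(means) / statistics.mean(means)

# Smallest plot sample size whose bootstrap CV of the mean drops below 5%
for n in range(10, 361, 10):
    if bootstrap_mean_cv(family, n) < 0.05:
        print(f"ideal sample size ~ {n} plants per plot")
        break
```

Larger bootstrap samples give steadier means, so the CV shrinks as n grows; the ideal size is the point where further plants buy little extra precision.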


1990 ◽  
Vol 29 (03) ◽  
pp. 200-204 ◽  
Author(s):  
J. A. Koziol

Abstract A basic problem of cluster analysis is the determination or selection of the number of clusters evinced in any set of data. We address this issue with multinomial data using Akaike’s information criterion and demonstrate its utility in identifying an appropriate number of clusters of tumor types with similar profiles of cell surface antigens.
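The criterion can be illustrated on toy multinomial count profiles. The data, the candidate partitions, and the helper names below are hypothetical; AIC is computed as -2 log L plus twice the number of free multinomial parameters per cluster:

```python
import math

def multinomial_loglik(counts_list):
    """Maximized log-likelihood of count vectors sharing one multinomial
    profile (constant multinomial coefficients omitted: they are the same
    for every candidate partition and cancel in comparisons)."""
    totals = [sum(col) for col in zip(*counts_list)]
    grand = sum(totals)
    ll = 0.0
    for counts in counts_list:
        for j, c in enumerate(counts):
            if c:
                ll += c * math.log(totals[j] / grand)
    return ll

def aic(partition, n_categories):
    """AIC = -2 log L + 2 * (free parameters); each cluster profile
    contributes n_categories - 1 free probabilities."""
    ll = sum(multinomial_loglik(cluster) for cluster in partition)
    return -2 * ll + 2 * len(partition) * (n_categories - 1)

# Hypothetical antigen-count profiles for six tumour samples (3 markers each)
profiles = [[8, 1, 1], [9, 2, 1], [7, 1, 2],   # marker-1 dominant
            [1, 1, 8], [2, 1, 9], [1, 2, 7]]   # marker-3 dominant

candidates = {
    1: [profiles],                          # everything in one cluster
    2: [profiles[:3], profiles[3:]],        # the two visible groups
    3: [profiles[:2], profiles[2:4], profiles[4:]],  # an over-split
}
best_k = min(candidates, key=lambda k: aic(candidates[k], 3))
print(best_k)  # → 2
```

The two-cluster partition fits far better than pooling everything, while the three-cluster split pays a parameter penalty without improving fit, so AIC picks k = 2.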


Universe ◽  
2021 ◽  
Vol 7 (1) ◽  
pp. 8
Author(s):  
Alessandro Montoli ◽  
Marco Antonelli ◽  
Brynmor Haskell ◽  
Pierre Pizzochero

A common way to calculate the glitch activity of a pulsar is an ordinary linear regression of the observed cumulative glitch history. This method, however, is likely to underestimate the errors on the activity, as it implicitly assumes a (long-term) linear dependence between glitch sizes and waiting times, as well as equal variance, i.e., homoscedasticity, in the fit residuals; neither assumption is well justified by pulsar data. In this paper, we review the extrapolation of the glitch activity parameter and explore two alternatives: relaxing the homoscedasticity hypothesis in the linear fit, and using the bootstrap technique. We find a larger uncertainty in the activity than that obtained by ordinary linear regression, especially for those objects in which the activity can be significantly affected by a single glitch. We discuss how this affects the theoretical upper bound on the moment of inertia associated with the region of a neutron star containing the superfluid reservoir of angular momentum released in a stationary sequence of glitches. We find that this upper bound is less tight if one considers the uncertainty on the activity estimated with the bootstrap method, and that it allows for models in which the superfluid reservoir is entirely in the crust.
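A case-resampling bootstrap of the activity (the slope of cumulative glitch size versus time) can be sketched as follows. The glitch record is invented for illustration, with one large glitch dominating the history, and the paper's actual data sets are not reproduced:

```python
import random
import statistics

random.seed(0)

def ols_slope(x, y):
    """Ordinary least-squares slope of y on x."""
    mx, my = statistics.mean(x), statistics.mean(y)
    return (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
            / sum((xi - mx) ** 2 for xi in x))

# Hypothetical glitch record: epochs (yr) and cumulative glitch size
# (arbitrary units), dominated by a single large glitch near t = 5.
t = [0.0, 1.2, 2.5, 4.1, 5.0, 6.8, 8.3]
s = [0.0, 0.1, 0.3, 0.4, 2.9, 3.0, 3.2]

activity = ols_slope(t, s)

# Case-resampling bootstrap: resample (t, s) pairs with replacement and
# refit; the spread of the refitted slopes estimates the uncertainty on
# the activity without assuming homoscedastic residuals.
pairs = list(zip(t, s))
slopes = []
for _ in range(2000):
    sample = [random.choice(pairs) for _ in pairs]
    xs, ys = zip(*sample)
    if len(set(xs)) > 1:          # guard against degenerate resamples
        slopes.append(ols_slope(xs, ys))

print(f"activity = {activity:.3f}, bootstrap sd = {statistics.stdev(slopes):.3f}")
```

Resamples that omit or duplicate the single large glitch change the slope noticeably, which is exactly how the bootstrap exposes the extra uncertainty the abstract describes.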


1998 ◽  
Vol 217 (1) ◽  
Author(s):  
Hans Schneeberger

Summary With Efron’s law-school example, the bootstrap method is compared with an alternative method called doubling. It is shown that the mean deviation of the estimator is always smaller for the doubling method.


1992 ◽  
Vol 82 (1) ◽  
pp. 104-119
Author(s):  
Michéle Lamarre ◽  
Brent Townshend ◽  
Haresh C. Shah

Abstract This paper describes a methodology for assessing the uncertainty in seismic hazard estimates at particular sites. A variant of the bootstrap statistical method is used to combine the uncertainty due to earthquake catalog incompleteness, earthquake magnitude, and the recurrence and attenuation models used. The uncertainty measure is provided in the form of a confidence interval. Comparisons of this method, applied to various sites in California, with previous studies confirm the validity of the method.
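A percentile bootstrap interval is one common way to turn bootstrap replicates into the kind of confidence interval the paper describes; a minimal sketch, with hypothetical inter-event times standing in for a real catalog:

```python
import random
import statistics

random.seed(1)

def percentile_ci(data, stat, reps=5000, alpha=0.10):
    """Percentile bootstrap confidence interval for stat(data):
    sort the resampled statistics and read off the alpha/2 and
    1 - alpha/2 quantiles."""
    boot = sorted(stat(random.choices(data, k=len(data)))
                  for _ in range(reps))
    lo = boot[int(alpha / 2 * reps)]
    hi = boot[int((1 - alpha / 2) * reps) - 1]
    return lo, hi

# Hypothetical inter-event times (yr) between damaging earthquakes at a site
waits = [12, 31, 7, 45, 22, 18, 60, 9, 27, 35]
lo, hi = percentile_ci(waits, statistics.mean)
print(f"90% CI for mean recurrence: [{lo:.1f}, {hi:.1f}] yr")
```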


Complexity ◽  
2018 ◽  
Vol 2018 ◽  
pp. 1-16 ◽  
Author(s):  
Yiwen Zhang ◽  
Yuanyuan Zhou ◽  
Xing Guo ◽  
Jintao Wu ◽  
Qiang He ◽  
...  

The K-means algorithm is one of the ten classic algorithms in the area of data mining and has long been studied by researchers in numerous fields. However, the value of the cluster number k in the K-means algorithm is not always easy to determine, and the selection of the initial centers is vulnerable to outliers. This paper proposes an improved K-means clustering algorithm called the covering K-means algorithm (C-K-means). The C-K-means algorithm not only acquires efficient and accurate clustering results but also self-adaptively provides a reasonable number of clusters based on the data features. It includes two phases: the initialization of the covering algorithm (CA) and the Lloyd iteration of K-means. The first phase executes the CA, which self-organizes and recognizes the number of clusters k based on the similarities in the data; it requires neither the number of clusters to be prespecified nor the initial centers to be manually selected. It therefore has a “blind” feature, that is, k is not preselected. The second phase performs the Lloyd iteration based on the results of the first phase. The C-K-means algorithm combines the advantages of CA and K-means. Experiments carried out on the Spark platform verify the good scalability of the C-K-means algorithm, which can effectively solve the problem of large-scale data clustering. Extensive experiments on real data sets show that the accuracy and efficiency of the C-K-means algorithm outperform those of existing algorithms under both sequential and parallel conditions.
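The second (Lloyd) phase can be sketched in a few lines. The covering-algorithm initialization is specific to the paper and is not reproduced here, so the initial centers below simply stand in for its output:

```python
import math

def lloyd(points, centers, iters=50):
    """Lloyd iteration of K-means: assign each point to its nearest
    center, move each center to its cluster mean, repeat until stable."""
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            j = min(range(len(centers)),
                    key=lambda j: math.dist(p, centers[j]))
            clusters[j].append(p)
        new = [tuple(sum(c) / len(pts) for c in zip(*pts)) if pts else ctr
               for pts, ctr in zip(clusters, centers)]
        if new == centers:      # assignments stable: converged
            break
        centers = new
    return centers

# Two well-separated 2-D groups; the initial centers stand in for the
# k and starting points the covering-algorithm phase would supply.
points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
print(lloyd(points, [(0.5, 0.5), (9.0, 9.0)]))
```

With these inputs the centers converge to the two group means after a single reassignment, which is the behavior the second phase relies on once the first phase has fixed k and the starting centers.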


2008 ◽  
Vol 33 (3) ◽  
pp. 257-278 ◽  
Author(s):  
Yuming Liu ◽  
E. Matthew Schulz ◽  
Lei Yu

A Markov chain Monte Carlo (MCMC) method and a bootstrap method were compared in the estimation of standard errors of item response theory (IRT) true score equating. Three test form relationships were examined: parallel, tau-equivalent, and congeneric. Data were simulated based on Reading Comprehension and Vocabulary tests of the Iowa Tests of Basic Skills®. For parallel and congeneric test forms within valid IRT true score ranges, the pattern and magnitude of standard errors of IRT true score equating estimated by the MCMC method were very close to those estimated by the bootstrap method. For tau-equivalent test forms, the pattern of standard errors estimated by the two methods was also similar. Bias and mean square errors of equating produced by the MCMC method were smaller than those produced by the bootstrap method; however, standard errors were larger. In educational testing, the MCMC method may be used as an additional or alternative procedure to the bootstrap method when evaluating the precision of equating results.
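The bootstrap side of the comparison, estimating a standard error as the standard deviation of a statistic over resamples, can be sketched as follows. The data are hypothetical equated-score differences; the full IRT calibration that the study bootstraps is not reproduced:

```python
import random
import statistics

random.seed(7)

def bootstrap_se(data, stat, reps=2000):
    """Bootstrap standard error: the standard deviation of the
    statistic recomputed on resamples drawn with replacement."""
    return statistics.stdev(stat(random.choices(data, k=len(data)))
                            for _ in range(reps))

# Hypothetical equated true-score differences for a sample of examinees
diffs = [0.8, 1.1, 0.5, 1.4, 0.9, 1.2, 0.7, 1.0, 1.3, 0.6]
print(round(bootstrap_se(diffs, statistics.mean), 3))
```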

