A Fast Method for Estimating the Number of Clusters Based on Score and the Minimum Distance of the Center Point

Zhenzhen He; Zongpu Jia; Xiaohong Zhang

doi:10.3390/info11010016

A Fast Method for Estimating the Number of Clusters Based on Score and the Minimum Distance of the Center Point

Information ◽

10.3390/info11010016 ◽

2019 ◽

Vol 11 (1) ◽

pp. 16

Author(s):

Zhenzhen He ◽

Zongpu Jia ◽

Xiaohong Zhang

Keyword(s):

Minimum Distance ◽

Learning Algorithm ◽

Kernel Density ◽

Gaussian Kernel ◽

Data Sets ◽

Fast Method ◽

Number Of Clusters ◽

Center Point ◽

Estimation Function ◽

Candidate Set

Clustering is widely used as an unsupervised learning algorithm. However, it is often necessary to manually enter the number of clusters, and the number of clusters has a great impact on the clustering effect. At present, researchers propose some algorithms to determine the number of clusters, but the results are not very good for determining the number of clusters of data sets with complex and scattered shapes. To solve these problems, this paper proposes using the Gaussian Kernel density estimation function to determine the maximum number of clusters, use the change of center point score to get the candidate set of center points, and further use the change of the minimum distance between center points to get the number of clusters. The experiment shows the validity and practicability of the proposed algorithm.

Download Full-text

A Fast Method for Defogging of Outdoor Visual Images

Recent Patents on Computer Science ◽

10.2174/2213275912666190819105422 ◽

2019 ◽

Vol 12 ◽

Author(s):

Tannistha Pal

Keyword(s):

Intelligent Vehicles ◽

Visual Surveillance ◽

Qualitative Assessment ◽

Computational Time ◽

Data Sets ◽

Fast Method ◽

Time Data ◽

Image Defogging ◽

Clear Vision ◽

Dark Channel

Images captured in severe atmospheric catastrophe especially in fog critically degrade the quality of an image and thereby reduces the visibility of an image which in turn affects several computer vision applications like visual surveillance detection, intelligent vehicles, remote sensing, etc. Thus acquiring clear vision is the prime requirement of any image. In the last few years, many approaches have been made towards solving this problem. In this article, a comparative analysis has been made on different existing image defogging algorithms and then a technique has been proposed for image defogging based on dark channel prior strategy. Experimental results show that the proposed method shows efficient results by significantly improving the visual effects of images in foggy weather. Also computational time of the existing techniques are much higher which has been overcame in this paper by using the proposed method. Qualitative assessment evaluation is performed on both benchmark and real time data sets for determining theefficacy of the technique used. Finally, the whole work is concluded with its relative advantages and shortcomings.

Download Full-text

Spectral Convolution Feature-Based SPD Matrix Representation for Signal Detection Using a Deep Neural Network

Entropy ◽

10.3390/e22090949 ◽

2020 ◽

Vol 22 (9) ◽

pp. 949

Author(s):

Jiangyi Wang ◽

Min Liu ◽

Xinwu Zeng ◽

Xiaoqiang Hua

Keyword(s):

Neural Network ◽

Signal Detection ◽

Convolutional Neural Network ◽

Deep Neural Network ◽

Detection Method ◽

Learning Algorithm ◽

Simulated Data ◽

Data Sets ◽

Feature Maps ◽

Simulated Data Sets

Convolutional neural networks have powerful performances in many visual tasks because of their hierarchical structures and powerful feature extraction capabilities. SPD (symmetric positive definition) matrix is paid attention to in visual classification, because it has excellent ability to learn proper statistical representation and distinguish samples with different information. In this paper, a deep neural network signal detection method based on spectral convolution features is proposed. In this method, local features extracted from convolutional neural network are used to construct the SPD matrix, and a deep learning algorithm for the SPD matrix is used to detect target signals. Feature maps extracted by two kinds of convolutional neural network models are applied in this study. Based on this method, signal detection has become a binary classification problem of signals in samples. In order to prove the availability and superiority of this method, simulated and semi-physical simulated data sets are used. The results show that, under low SCR (signal-to-clutter ratio), compared with the spectral signal detection method based on the deep neural network, this method can obtain a gain of 0.5–2 dB on simulated data sets and semi-physical simulated data sets.

Download Full-text

Measuring the Effectiveness of Adaptive Random Forest for Handling Concept Drift in Big Data Streams

Entropy ◽

10.3390/e23070859 ◽

2021 ◽

Vol 23 (7) ◽

pp. 859

Author(s):

Abdulaziz O. AlQabbany ◽

Aqil M. Azmi

Keyword(s):

Big Data ◽

Random Forest ◽

Real Time ◽

Data Streams ◽

Learning Algorithm ◽

Concept Drift ◽

The United States ◽

Careful Consideration ◽

Data Sets ◽

Stream Data

We are living in the age of big data, a majority of which is stream data. The real-time processing of this data requires careful consideration from different perspectives. Concept drift is a change in the data’s underlying distribution, a significant issue, especially when learning from data streams. It requires learners to be adaptive to dynamic changes. Random forest is an ensemble approach that is widely used in classical non-streaming settings of machine learning applications. At the same time, the Adaptive Random Forest (ARF) is a stream learning algorithm that showed promising results in terms of its accuracy and ability to deal with various types of drift. The incoming instances’ continuity allows for their binomial distribution to be approximated to a Poisson(1) distribution. In this study, we propose a mechanism to increase such streaming algorithms’ efficiency by focusing on resampling. Our measure, resampling effectiveness (ρ), fuses the two most essential aspects in online learning; accuracy and execution time. We use six different synthetic data sets, each having a different type of drift, to empirically select the parameter λ of the Poisson distribution that yields the best value for ρ. By comparing the standard ARF with its tuned variations, we show that ARF performance can be enhanced by tackling this important aspect. Finally, we present three case studies from different contexts to test our proposed enhancement method and demonstrate its effectiveness in processing large data sets: (a) Amazon customer reviews (written in English), (b) hotel reviews (in Arabic), and (c) real-time aspect-based sentiment analysis of COVID-19-related tweets in the United States during April 2020. Results indicate that our proposed method of enhancement exhibited considerable improvement in most of the situations.

Download Full-text

Ultrafast calculation of diffuse scattering from atomistic models

Acta Crystallographica Section A Foundations and Advances ◽

10.1107/s2053273318015632 ◽

2019 ◽

Vol 75 (1) ◽

pp. 14-24 ◽

Cited By ~ 6

Author(s):

Joseph A. M. Paddison

Keyword(s):

Diffuse Scattering ◽

Large Data ◽

Large Data Sets ◽

Scattering Data ◽

Frequency Noise ◽

Data Sets ◽

Fast Method ◽

Crystalline Materials ◽

Atomistic Models ◽

Dynamics Simulations

Diffuse scattering is a rich source of information about disorder in crystalline materials, which can be modelled using atomistic techniques such as Monte Carlo and molecular dynamics simulations. Modern X-ray and neutron scattering instruments can rapidly measure large volumes of diffuse-scattering data. Unfortunately, current algorithms for atomistic diffuse-scattering calculations are too slow to model large data sets completely, because the fast Fourier transform (FFT) algorithm has long been considered unsuitable for such calculations [Butler & Welberry (1992). J. Appl. Cryst. 25, 391–399]. Here, a new approach is presented for ultrafast calculation of atomistic diffuse-scattering patterns. It is shown that the FFT can actually be used to perform such calculations rapidly, and that a fast method based on sampling theory can be used to reduce high-frequency noise in the calculations. These algorithms are benchmarked using realistic examples of compositional, magnetic and displacive disorder. They accelerate the calculations by a factor of at least 102, making refinement of atomistic models to large diffuse-scattering volumes practical.

Download Full-text

Self-Adaptive K-Means Based on a Covering Algorithm

Complexity ◽

10.1155/2018/7698274 ◽

2018 ◽

Vol 2018 ◽

pp. 1-16 ◽

Cited By ~ 1

Author(s):

Yiwen Zhang ◽

Yuanyuan Zhou ◽

Xing Guo ◽

Jintao Wu ◽

Qiang He ◽

...

Keyword(s):

Large Scale ◽

Clustering Algorithm ◽

Real Data ◽

Second Phase ◽

Data Sets ◽

Number Of Clusters ◽

Large Scale Data ◽

Long Time ◽

Two Phases ◽

Selection Of

The K-means algorithm is one of the ten classic algorithms in the area of data mining and has been studied by researchers in numerous fields for a long time. However, the value of the clustering number k in the K-means algorithm is not always easy to be determined, and the selection of the initial centers is vulnerable to outliers. This paper proposes an improved K-means clustering algorithm called the covering K-means algorithm (C-K-means). The C-K-means algorithm can not only acquire efficient and accurate clustering results but also self-adaptively provide a reasonable numbers of clusters based on the data features. It includes two phases: the initialization of the covering algorithm (CA) and the Lloyd iteration of the K-means. The first phase executes the CA. CA self-organizes and recognizes the number of clusters k based on the similarities in the data, and it requires neither the number of clusters to be prespecified nor the initial centers to be manually selected. Therefore, it has a “blind” feature, that is, k is not preselected. The second phase performs the Lloyd iteration based on the results of the first phase. The C-K-means algorithm combines the advantages of CA and K-means. Experiments are carried out on the Spark platform, and the results verify the good scalability of the C-K-means algorithm. This algorithm can effectively solve the problem of large-scale data clustering. Extensive experiments on real data sets show that the accuracy and efficiency of the C-K-means algorithm outperforms the existing algorithms under both sequential and parallel conditions.

Download Full-text

Performance of a machine-learning algorithm for fully automatic LGE scar quantification in the large multi-national derivate registry

European Heart Journal - Cardiovascular Imaging ◽

10.1093/ehjci/jeab090.023 ◽

2021 ◽

Vol 22 (Supplement_2) ◽

Author(s):

F Ghanbari ◽

T Joyce ◽

S Kozerke ◽

AI Guaricci ◽

PG Masci ◽

...

Keyword(s):

Machine Learning ◽

Test Data ◽

Learning Algorithm ◽

Test Time ◽

Breath Hold ◽

Human Observer ◽

Data Sets ◽

Observer Variability ◽

General Electric ◽

Total N

Abstract Funding Acknowledgements Type of funding sources: Other. Main funding source(s): J. Schwitter receives research support by “ Bayer Schweiz AG “. C.N.C. received grant by Siemens. Gianluca Pontone received institutional fees by General Electric, Bracco, Heartflow, Medtronic, and Bayer. U.J.S received grand by Astellas, Bayer, General Electric. This work was supported by Italian Ministry of Health, Rome, Italy (RC 2017 R659/17-CCM698). This work was supported by Gyrotools, Zurich, Switzerland. Background Late Gadolinium enhancement (LGE) scar quantification is generally recognized as an accurate and reproducible technique, but it is observer-dependent and time consuming. Machine learning (ML) potentially offers to solve this problem. Purpose to develop and validate a ML-algorithm to allow for scar quantification thereby fully avoiding observer variability, and to apply this algorithm to the prospective international multicentre Derivate cohort. Method The Derivate Registry collected heart failure patients with LV ejection fraction <50% in 20 European and US centres. In the post-myocardial infarction patients (n = 689) quality of the LGE short-axis breath-hold images was determined (good, acceptable, sufficient, borderline, poor, excluded) and ground truth (GT) was produced (endo-epicardial contours, 2 remote reference regions, artefact elimination) to determine mass of non-infarcted myocardium and of dense (≥5SD above mean-remote) and non-dense scar (>2SD to <5SD above mean-remote). Data were divided into the learning (total n = 573; training: n = 289; testing: n = 284) and validation set (n = 116). A Ternaus-network (loss function = average of dice and binary-cross-entropy) produced 4 outputs (initial prediction, test time augmentation (TTA), threshold-based prediction (TB), and TTA + TB) representing normal myocardium, non-dense, and dense scar (Figure 1).Outputs were evaluated by dice metrics, Bland-Altman, and correlations. Results In the validation and test data sets, both not used for training, the dense scar GT was 20.8 ± 9.6% and 21.9 ± 13.3% of LV mass, respectively. The TTA-network yielded the best results with small biases vs GT (-2.2 ± 6.1%, p < 0.02; -1.7 ± 6.0%, p < 0.003, respectively) and 95%CI vs GT in the range of inter-human comparisons, i.e. TTA yielded SD of the differences vs GT in the validation and test data of 6.1 and 6.0 percentage points (%p), respectively (Fig 2), which was comparable to the 7.7%p for the inter-observer comparison (n = 40). For non-dense scar, TTA performance was similar with small biases (-1.9 ± 8.6%, p < 0.0005, -1.4 ± 8.2%, p < 0.0001, in the validation and test sets, respectively, GT 39.2 ± 13.8% and 42.1 ± 14.2%) and acceptable 95%CI with SD of the differences of 8.6 and 8.2%p for TTA vs GT, respectively, and 9.3%p for inter-observer. Conclusions In the large Derivate cohort from 20 centres, performance of the presented ML-algorithm to quantify dense and non-dense scar fully automatically is comparable to that of experienced humans with small bias and acceptable 95%-CI. Such a tool could facilitate scar quantification in clinical routine as it eliminates human observer variability and can handle large data sets.

Download Full-text

Enhanced Dark Block Extraction Method Performed Automatically to Determine the Number of Clusters in Unlabeled Data Sets

International Journal of Computers Communications & Control ◽

10.15837/ijccc.2013.2.308 ◽

2013 ◽

Vol 8 (2) ◽

pp. 275

Author(s):

Puniethaa Prabhu ◽

K. Duraiswamy

Keyword(s):

Extraction Method ◽

Unlabeled Data ◽

Data Sets ◽

Number Of Clusters

Download Full-text

Models for Internal Clustering Validation Indexes Based on Hadoop-MapReduce

International Journal of Distributed Systems and Technologies ◽

10.4018/ijdst.2020070103 ◽

2020 ◽

Vol 11 (3) ◽

pp. 42-67

Author(s):

Soumeya Zerabi ◽

Souham Meshoul ◽

Samia Chikhi Boucherkha

Keyword(s):

Clustering Algorithms ◽

Large Data ◽

Optimal Number ◽

Data Sets ◽

Data Set ◽

Number Of Clusters ◽

Distributed Models ◽

Hadoop Mapreduce ◽

Distributed Solutions ◽

Clustering Validation

Cluster validation aims to both evaluate the results of clustering algorithms and predict the number of clusters. It is usually achieved using several indexes. Traditional internal clustering validation indexes (CVIs) are mainly based in computing pairwise distances which results in a quadratic complexity of the related algorithms. The existing CVIs cannot handle large data sets properly and need to be revisited to take account of the ever-increasing data set volume. Therefore, design of parallel and distributed solutions to implement these indexes is required. To cope with this issue, the authors propose two parallel and distributed models for internal CVIs namely for Silhouette and Dunn indexes using MapReduce framework under Hadoop. The proposed models termed as MR_Silhouette and MR_Dunn have been tested to solve both the issue of evaluating the clustering results and identifying the optimal number of clusters. The results of experimental study are very promising and show that the proposed parallel and distributed models achieve the expected tasks successfully.

Download Full-text

An evaluation of different measures of dynamically recrystallized grain size for paleopiezometry or paleowattometry studies

Solid Earth ◽

10.5194/se-6-475-2015 ◽

2015 ◽

Vol 6 (2) ◽

pp. 475-495 ◽

Cited By ~ 20

Author(s):

M. A. Lopez-Sanchez ◽

S. Llana-Fúnez

Keyword(s):

Grain Size ◽

Kernel Density ◽

Kernel Density Estimator ◽

Gaussian Kernel ◽

Robust Estimator ◽

Density Estimator ◽

Lithospheric Deformation ◽

Single Measure ◽

Frequency Peak ◽

Recrystallized Grain Size

Abstract. Paleopiezometry and paleowattometry studies are essential to validate models of lithospheric deformation and therefore increasingly common in structural geology. These studies require a single measure of dynamically recrystallized grain size in natural mylonites to estimate the magnitude of differential paleostress (or the rate of mechanical work). This contribution tests the various measures of grain size used in the literature and proposes the frequency peak of a grain size distribution as the most robust estimator for paleopiezometry or paleowattometry studies. The novelty of the approach resides in the use of the Gaussian kernel density estimator as an alternative to the classical histograms, which improves reproducibility. A free, open-source, easy-to-handle script named GrainSizeTools ( http://www.TEOS-10.org) was developed with the aim of facilitating the adoption of this measure of grain size in paleopiezometry or paleowattometry studies. The major advantage of the script over other programs is that by using the Gaussian kernel density estimator and by avoiding manual steps in the estimation of the frequency peak, the reproducibility of results is improved.

Download Full-text

Estimation of the Number of Clusters in Multipath Radio Channel Data Sets

IEEE Transactions on Antennas and Propagation ◽

10.1109/tap.2013.2242823 ◽

2013 ◽

Vol 61 (5) ◽

pp. 2879-2883 ◽

Cited By ~ 14

Author(s):

Susana Mota ◽

Fernando Perez-Fontan ◽

Armando Rocha

Keyword(s):

Radio Channel ◽

Data Sets ◽

Number Of Clusters

Download Full-text