Improving Scalable K-Means++

Joonas Hämäläinen; Tommi Kärkkäinen; Tuomo Rossi

doi:10.3390/a14010006

Improving Scalable K-Means++

Algorithms ◽

10.3390/a14010006 ◽

2020 ◽

Vol 14 (1) ◽

pp. 6

Author(s):

Joonas Hämäläinen ◽

Tommi Kärkkäinen ◽

Tuomo Rossi

Keyword(s):

Large Scale ◽

State Of The Art ◽

Random Projection ◽

Divide And Conquer ◽

High Dimensional ◽

Data Generation ◽

Large Scale Problems ◽

Clustering Data ◽

Very High ◽

High Dimensional Clustering

Two new initialization methods for K-means clustering are proposed. Both proposals are based on applying a divide-and-conquer approach for the K-means‖ type of an initialization strategy. The second proposal also uses multiple lower-dimensional subspaces produced by the random projection method for the initialization. The proposed methods are scalable and can be run in parallel, which make them suitable for initializing large-scale problems. In the experiments, comparison of the proposed methods to the K-means++ and K-means‖ methods is conducted using an extensive set of reference and synthetic large-scale datasets. Concerning the latter, a novel high-dimensional clustering data generation algorithm is given. The experiments show that the proposed methods compare favorably to the state-of-the-art by improving clustering accuracy and the speed of convergence. We also observe that the currently most popular K-means++ initialization behaves like the random one in the very high-dimensional cases.

Download Full-text

Investigation of Improved Cooperative Coevolution for Large-Scale Global Optimization Problems

Algorithms ◽

10.3390/a14050146 ◽

2021 ◽

Vol 14 (5) ◽

pp. 146

Author(s):

Aleksei Vakhnin ◽

Evgenii Sopov

Keyword(s):

Global Optimization ◽

Evolutionary Algorithms ◽

Numerical Experiments ◽

Large Scale ◽

Optimization Problems ◽

State Of The Art ◽

Fixed Number ◽

High Dimensional ◽

Cooperative Coevolution ◽

Large Scale Problems

Modern real-valued optimization problems are complex and high-dimensional, and they are known as “large-scale global optimization (LSGO)” problems. Classic evolutionary algorithms (EAs) perform poorly on this class of problems because of the curse of dimensionality. Cooperative Coevolution (CC) is a high-performed framework for performing the decomposition of large-scale problems into smaller and easier subproblems by grouping objective variables. The efficiency of CC strongly depends on the size of groups and the grouping approach. In this study, an improved CC (iCC) approach for solving LSGO problems has been proposed and investigated. iCC changes the number of variables in subcomponents dynamically during the optimization process. The SHADE algorithm is used as a subcomponent optimizer. We have investigated the performance of iCC-SHADE and CC-SHADE on fifteen problems from the LSGO CEC’13 benchmark set provided by the IEEE Congress of Evolutionary Computation. The results of numerical experiments have shown that iCC-SHADE outperforms, on average, CC-SHADE with a fixed number of subcomponents. Also, we have compared iCC-SHADE with some state-of-the-art LSGO metaheuristics. The experimental results have shown that the proposed algorithm is competitive with other efficient metaheuristics.

Download Full-text

In-House Developed Kr Estimator: Implemented to Analyze the Sensitivity of Relative Permeability Data to Variations in Wetting Phase Saturation

Energy Exploration & Exploitation ◽

10.1260/0144-5987.29.6.817 ◽

2011 ◽

Vol 29 (6) ◽

pp. 817-825 ◽

Cited By ~ 3

Author(s):

Muhammad Khurram Zahoor

Keyword(s):

Porous Medium ◽

Pore Size Distribution ◽

Pore Size ◽

Relative Permeability ◽

State Of The Art ◽

Data Generation ◽

Phase Saturation ◽

Complex Models ◽

Relative Permeability Curves ◽

Very High

Reservoir surveillance always requires fast, unproblematic access and solution to different relative permeability models which have been developed from time to time. In addition, complex models sometimes require in-depth knowledge of mathematics for solution prior to use them for data generation. For this purpose, in-house software has been designed to generate rigorous relative permeability curves, with a provision to include users own relative permeability models, a part from built-in various relative permeability correlations. The developed software with state-of-the-art algorithms has been used to analyze the effect of variations in residual and maximum wetting phase saturation on relative permeability curves for a porous medium having very high non-uniformity in pore size distribution. To further increase the spectrum of the study, two relative permeability models, i.e., Pirson's correlation and Brooks and Corey model has been used and the obtained results show that the later model is more sensitive to such variations.

Download Full-text

Classifying Very High-Dimensional and Large-Scale Multi-class Image Datasets with Latent-lSVM

2016 Intl IEEE Conferences on Ubiquitous Intelligence & Computing, Advanced and Trusted Computing, Scalable Computing and Communications, Cloud and Big Data Computing, Internet of People, and Smart World Congress (UIC/ATC/ScalCom/CBDCom/IoP/SmartWorld) ◽

10.1109/uic-atc-scalcom-cbdcom-iop-smartworld.2016.0116 ◽

2016 ◽

Cited By ~ 4

Author(s):

Thanh-Nghi Do ◽

Francois Poulet

Keyword(s):

Large Scale ◽

High Dimensional ◽

Very High ◽

Image Datasets

Download Full-text

A Dynamic Ensemble Framework for Mining Textual Streams with Class Imbalance

The Scientific World JOURNAL ◽

10.1155/2014/497354 ◽

2014 ◽

Vol 2014 ◽

pp. 1-11 ◽

Cited By ~ 2

Author(s):

Ge Song ◽

Yunming Ye

Keyword(s):

Large Scale ◽

State Of The Art ◽

Concept Drift ◽

Real Life ◽

Class Imbalance ◽

High Dimensional ◽

Adaptive Selection ◽

Stream Classification ◽

Rare Class

Textual stream classification has become a realistic and challenging issue since large-scale, high-dimensional, and non-stationary streams with class imbalance have been widely used in various real-life applications. According to the characters of textual streams, it is technically difficult to deal with the classification of textual stream, especially in imbalanced environment. In this paper, we propose a new ensemble framework, clustering forest, for learning from the textual imbalanced stream with concept drift (CFIM). The CFIM is based on ensemble learning by integrating a set of clustering trees (CTs). An adaptive selection method, which flexibly chooses the useful CTs by the property of the stream, is presented in CFIM. In particular, to deal with the problem of class imbalance, we collect and reuse both rare-class instances and misclassified instances from the historical chunks. Compared to most existing approaches, it is worth pointing out that our approach assumes that both majority class and rareclass may suffer from concept drift. Thus the distribution of resampled instances is similar to the current concept. The effectiveness of CFIM is examined in five real-world textual streams under an imbalanced nonstationary environment. Experimental results demonstrate that CFIM achieves better performance than four state-of-the-art ensemble models.

Download Full-text

Knowledge management overview of feature selection problem in high-dimensional financial data: cooperative co-evolution and MapReduce perspectives

Problems and Perspectives in Management ◽

10.21511/ppm.17(4).2019.28 ◽

2019 ◽

Vol 17 (4) ◽

pp. 340-359 ◽

Cited By ~ 1

Author(s):

A N M Bazlur Rashid ◽

Tonmoy Choudhury

Keyword(s):

Feature Selection ◽

Big Data ◽

Knowledge Management ◽

Programming Model ◽

Financial Data ◽

Divide And Conquer ◽

High Dimensional ◽

Future Research ◽

Data Generation ◽

Time Data

The term “big data” characterizes the massive amounts of data generation by the advanced technologies in different domains using 4Vs – volume, velocity, variety, and veracity - to indicate the amount of data that can only be processed via computationally intensive analysis, the speed of their creation, the different types of data, and their accuracy. High-dimensional financial data, such as time-series and space-time data, contain a large number of features (variables) while having a small number of samples, which are used to measure various real-time business situations for financial organizations. Such datasets are normally noisy, and complex correlations may exist between their features, and many domains, including financial, lack the al analytic tools to mine the data for knowledge discovery because of the high-dimensionality. Feature selection is an optimization problem to find a minimal subset of relevant features that maximizes the classification accuracy and reduces the computations. Traditional statistical-based feature selection approaches are not adequate to deal with the curse of dimensionality associated with big data. Cooperative co-evolution, a meta-heuristic algorithm and a divide-and-conquer approach, decomposes high-dimensional problems into smaller sub-problems. Further, MapReduce, a programming model, offers a ready-to-use distributed, scalable, and fault-tolerant infrastructure for parallelizing the developed algorithm. This article presents a knowledge management overview of evolutionary feature selection approaches, state-of-the-art cooperative co-evolution and MapReduce-based feature selection techniques, and future research directions.

Download Full-text

Large-Scale Multi-modal Distance Metric Learning with Application to Content-Based Information Retrieval and Image Classification

International Journal of Pattern Recognition and Artificial Intelligence ◽

10.1142/s0218001420500342 ◽

2020 ◽

Vol 34 (13) ◽

pp. 2050034

Author(s):

Ali Salim Rasheed ◽

Davood Zabihzadeh ◽

Sumia Abdulhussien Razooqi Al-Obaidi

Keyword(s):

Information Retrieval ◽

Image Classification ◽

Large Scale ◽

Learning Algorithm ◽

State Of The Art ◽

Metric Learning ◽

Random Projection ◽

Linear Projection ◽

Modal Data ◽

Related Data

Metric learning algorithms aim to make the conceptually related data items closer and keep dissimilar ones at a distance. The most common approach for metric learning on the Mahalanobis method. Despite its success, this method is limited to find a linear projection and also suffer from scalability respecting both the dimensionality and the size of input data. To address these problems, this paper presents a new scalable metric learning algorithm for multi-modal data. Our method learns an optimal metric for any feature set of the multi-modal data in an online fashion. We also combine the learned metrics with a novel Passive/Aggressive (PA)-based algorithm which results in a higher convergence rate compared to the state-of-the-art methods. To address scalability with respect to dimensionality, Dual Random Projection (DRP) is adopted in this paper. The present method is evaluated on some challenging machine vision datasets for image classification and Content-Based Information Retrieval (CBIR) tasks. The experimental results confirm that the proposed method significantly surpasses other state-of-the-art metric learning methods in most of these datasets in terms of both accuracy and efficiency.

Download Full-text

Large-Scale High-Dimensional Clustering with Fast Sketching

2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) ◽

10.1109/icassp.2018.8461328 ◽

2018 ◽

Cited By ~ 2

Author(s):

Antoine Chatalic ◽

Remi Gribonval ◽

Nicolas Keriven

Keyword(s):

Large Scale ◽

High Dimensional ◽

High Dimensional Clustering

Download Full-text

Comparison of Classification Methods for Very High-Dimensional Data in Sparse Random Projection Representation

Proceedings in Adaptation, Learning and Optimization - Proceedings of ELM 2018 ◽

10.1007/978-3-030-23307-5_3 ◽

2019 ◽

pp. 17-26

Author(s):

Anton Akusok ◽

Emil Eirola

Keyword(s):

High Dimensional Data ◽

Random Projection ◽

High Dimensional ◽

Classification Methods ◽

Very High

Download Full-text

Toward Large-Scale Continuous EDA: A Random Matrix Theory Perspective

Evolutionary Computation ◽

10.1162/evco_a_00150 ◽

2016 ◽

Vol 24 (2) ◽

pp. 255-291 ◽

Cited By ~ 25

Author(s):

A. Kabán ◽

J. Bootkrajang ◽

R. J. Durrant

Keyword(s):

Random Matrix Theory ◽

Random Matrix ◽

Large Scale ◽

Model Building ◽

Matrix Theory ◽

Theoretical Computer Science ◽

Global Optimisation ◽

Divide And Conquer ◽

Random Projections ◽

Large Scale Problems

Estimations of distribution algorithms (EDAs) are a major branch of evolutionary algorithms (EA) with some unique advantages in principle. They are able to take advantage of correlation structure to drive the search more efficiently, and they are able to provide insights about the structure of the search space. However, model building in high dimensions is extremely challenging, and as a result existing EDAs may become less attractive in large-scale problems because of the associated large computational requirements. Large-scale continuous global optimisation is key to many modern-day real-world problems. Scaling up EAs to large-scale problems has become one of the biggest challenges of the field. This paper pins down some fundamental roots of the problem and makes a start at developing a new and generic framework to yield effective and efficient EDA-type algorithms for large-scale continuous global optimisation problems. Our concept is to introduce an ensemble of random projections to low dimensions of the set of fittest search points as a basis for developing a new and generic divide-and-conquer methodology. Our ideas are rooted in the theory of random projections developed in theoretical computer science, and in developing and analysing our framework we exploit some recent results in nonasymptotic random matrix theory.

Download Full-text

An efficient similarity join approach on large‐scale high‐dimensional data using random projection

Concurrency and Computation Practice and Experience ◽

10.1002/cpe.5303 ◽

2019 ◽

Vol 31 (20) ◽

Cited By ~ 1

Author(s):

Youzhong Ma ◽

Ruiling Zhang ◽

Shijie Jia ◽

Yongxin Zhang ◽

Xiaofeng Meng

Keyword(s):

Large Scale ◽

High Dimensional Data ◽

Random Projection ◽

High Dimensional ◽

Similarity Join

Download Full-text