Big data clustering techniques based on Spark: a literature review

Data clustering plays an important role in data mining, machine learning, and image processing. Because modern databases have inherent uncertainties, many uncertainty-based data clustering algorithms have been developed. These include fuzzy c-means, rough c-means, and intuitionistic fuzzy c-means, as well as algorithms based on hybrid models, such as rough fuzzy c-means and rough intuitionistic fuzzy c-means. Many variants improve these algorithms in different directions, such as their kernelised, possibilistic, and possibilistic kernelised versions. However, none of these algorithms is effective on big data, for various reasons. Researchers have therefore spent the past few years trying to improve them so that they can be applied to cluster big data. The resulting algorithms are still few compared with those for datasets of moderate size. This chapter aims to present the uncertainty-based clustering algorithms developed so far and to propose a few new algorithms that can be developed further.
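As a rough illustration of the simplest algorithm named above, the following is a minimal sketch of standard fuzzy c-means in plain Python. It is not taken from the chapter: the function names `memberships_for` and `fuzzy_c_means`, the fuzzifier default `m=2.0`, the fixed iteration count, and the distance clamp `eps` are all assumptions made for the example. Each point receives a degree of membership in every cluster rather than a single hard label, and the cluster centres are weighted means of the points.

```python
import math

def memberships_for(points, centers, m=2.0, eps=1e-9):
    """Return one row per point; row[j] is the degree of membership in cluster j."""
    exponent = 2.0 / (m - 1.0)
    rows = []
    for p in points:
        # Clamp distances so a point lying exactly on a centre does not divide by zero.
        dists = [max(math.dist(p, c), eps) for c in centers]
        # u_j = 1 / sum_k (d_j / d_k)^(2 / (m - 1)); each row sums to 1.
        rows.append([1.0 / sum((d_j / d_k) ** exponent for d_k in dists)
                     for d_j in dists])
    return rows

def fuzzy_c_means(points, initial_centers, m=2.0, iterations=50):
    """Alternate membership and centre updates for a fixed number of iterations."""
    centers = [tuple(c) for c in initial_centers]
    dims = len(points[0])
    for _ in range(iterations):
        u = memberships_for(points, centers, m)
        new_centers = []
        for j in range(len(centers)):
            # Each centre is the mean of all points, weighted by membership^m.
            weights = [row[j] ** m for row in u]
            total = sum(weights)
            new_centers.append(tuple(
                sum(w * p[d] for w, p in zip(weights, points)) / total
                for d in range(dims)
            ))
        centers = new_centers
    # Recompute memberships so they match the returned centres.
    return centers, memberships_for(points, centers, m)

# Hypothetical toy data: two well-separated groups on a line.
points = [(0.0,), (0.1,), (0.2,), (5.0,), (5.1,), (5.2,)]
centers, u = fuzzy_c_means(points, [(1.0,), (4.0,)])
```

Every iteration touches every point for every cluster, which hints at why batch algorithms of this kind become costly on very large datasets.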

Research and Application of Massive Data Processing in Oil Services

Advanced Engineering Forum ◽ 10.4028/www.scientific.net/aef.6-7.1036 ◽ 2012 ◽ Vol 6-7 ◽ pp. 1036-1040

Author(s): Bao An Li

Keyword(s): Cloud Computing ◽ Big Data ◽ Internet Of Things ◽ Data Processing ◽ Rapid Growth ◽ Massive Data ◽ Oil Drilling ◽ Data Problem ◽ Traditional Industry ◽ Massive Data Processing

The big data problem has attracted widespread concern in both industry and academia in recent years. As the amount of data produced by various industries and sectors grows rapidly, the demands on data processing and analysis capabilities increase, and the question of how to face the challenges of data and discover new opportunities has received wide attention. As a traditional industry, oil drilling and refining enterprises face large amounts of data produced by the operational status of their systems. This paper introduces an approach to massive data processing for oil enterprises based on cloud computing and the Internet of Things.

Performance Evaluation of a Big Data Application on Apache Spark

10.32920/ryerson.14651544 ◽ 2021

Author(s): Jeanne Alcantara

Keyword(s): Big Data ◽ Performance Evaluation ◽ Execution Time ◽ Apache Spark ◽ Massive Data ◽ Application Performance ◽ Data Application ◽ Size Number ◽ Big Data Application ◽ The Impact

Apache Spark enables a big data application, one that takes massive data as input and may produce massive data during its execution, to run in parallel on multiple nodes. Performance is therefore a vital issue for such an application. This project analyzes a WordCount application on Apache Spark and assesses the impact on execution time and average node utilization. To facilitate this assessment, the number of executor cores and the size of executor memory are varied across different sizes of input data and different numbers of nodes in the cluster on which the application runs. The project concludes that different pairs (data size, number of nodes in the cluster) require different numbers of executor cores and different sizes of executor memory to obtain optimum execution time and average node utilization.
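For readers unfamiliar with WordCount, the pattern can be sketched in plain Python. This is not Spark code and not the application evaluated in the project: the functions `count_partition` and `word_count` and the `num_partitions` parameter are hypothetical stand-ins, with a thread pool loosely playing the role of parallel executors. The sketch only shows the shape of the computation: split the input, count words in each partition independently, then merge the partial counts.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_partition(lines):
    """Count words in one partition of the input (the 'map' side)."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts

def word_count(lines, num_partitions=2):
    """Split lines into partitions, count each in parallel, and merge the results."""
    # Round-robin split; a stand-in for how a cluster would distribute input blocks.
    partitions = [lines[i::num_partitions] for i in range(num_partitions)]
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partials = list(pool.map(count_partition, partitions))
    # Merge the per-partition counts (the 'reduce' side).
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

# Hypothetical toy input.
lines = ["big data big", "spark data", "big spark"]
totals = word_count(lines, num_partitions=2)
```

The merged result does not depend on `num_partitions`; only the degree of parallelism changes. In the same way, the project varies the resources given to the job while keeping the job itself the same.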

Big Data in Finance

Review of Financial Studies ◽ 10.1093/rfs/hhab038 ◽ 2021

Author(s): Itay Goldstein ◽ Chester S Spatt ◽ Mao Ye

Keyword(s): Big Data ◽ Asset Pricing ◽ Complex Structure ◽ Future Research ◽ Special Issue ◽ Research Directions ◽ Finance Industry ◽ Large Size ◽ Future Research Directions ◽ New Research

Big data is revolutionizing the finance industry and has the potential to significantly shape future research in finance. This special issue contains papers following the 2019 NBER-RFS Conference on Big Data. In this introduction to the special issue, we define the “big data” phenomenon as a combination of three features: large size, high dimension, and complex structure. Using the papers in the special issue, we discuss how new research builds on these features to push the frontier on fundamental questions across areas in finance, including corporate finance, market microstructure, and asset pricing. Finally, we offer some thoughts for future research directions.

Upgrading a high performance computing environment for massive data processing

Journal of Internet Services and Applications ◽ 10.1186/s13174-019-0118-7 ◽ 2019 ◽ Vol 10 (1) ◽ Cited By ~ 1

Author(s): Lucas M. Ponce ◽ Walter dos Santos ◽ Wagner Meira ◽ Dorgival Guedes ◽ Daniele Lezzi ◽ ...

Keyword(s): Big Data ◽ Data Processing ◽ High Performance Computing ◽ High Performance ◽ Data Access ◽ Massive Data ◽ Analysis Tool ◽ Data Framework ◽ Performance Computing ◽ Massive Data Processing

High-performance computing (HPC) and massive data processing (Big Data) are two trends that are beginning to converge. In that process, aspects of hardware architectures, systems support and programming paradigms are being revisited from both perspectives. This paper presents our experience on this path of convergence and proposes a framework that addresses some of the programming issues arising from such integration. Our contribution is an integrated environment that combines (i) COMPSs, a programming framework for the development and execution of parallel applications for distributed infrastructures; (ii) Lemonade, a data mining and analysis tool; and (iii) HDFS, the most widely used distributed file system for Big Data systems. To validate our framework, we used Lemonade to create COMPSs applications that access data through HDFS, and compared them with equivalent applications built with Spark, a popular Big Data framework. The results show that the HDFS integration benefits COMPSs by simplifying data access and by rearranging data transfer, reducing execution time. The integration with Lemonade facilitates the use of COMPSs and may help popularize it in the Data Science community by providing efficient algorithm implementations for data-domain experts who want to develop applications at a higher level of abstraction.

ONLINE PROBABILISTIC FUZZY CLUSTERING METHOD BASED ON EVOLUTIONARY OPTIMIZATION OF CAT SWARM

Radio Electronics Computer Science Control ◽ 10.15588/1607-3274-2021-2-7 ◽ 2021 ◽ pp. 65-70

Author(s): Ye. V. Bodyanskiy ◽ A. Yu. Shafronenko ◽ I. N. Klymova

Keyword(s): Big Data ◽ Fuzzy Clustering ◽ Data Clustering ◽ Clustering Algorithm ◽ Evolutionary Optimization ◽ Clustering Methods ◽ Classification Problems ◽ Probabilistic Data ◽ Fuzzy Clustering Method ◽ Clustering And Classification

Context. Big data clustering is currently a highly relevant area of artificial intelligence. The task arises in many applications related to data mining, deep learning, etc. Traditional approaches and methods for solving it require the entire data sample to be submitted in batch form. Objective. The aim of the work is to propose a method of fuzzy probabilistic data clustering using evolutionary cat swarm optimization that is free of the drawbacks of traditional data clustering approaches. Method. A procedure of fuzzy probabilistic data clustering using evolutionary algorithms is proposed for faster determination of sample extrema, cluster centroids, and adaptive functions. It avoids spending machine resources on storing intermediate calculations and requires no additional time to solve the data clustering problem, regardless of the data dimension and the way the data are presented for processing. Results. The proposed data clustering algorithm based on evolutionary optimization is simple to implement numerically, is free of the drawbacks inherent in traditional fuzzy clustering methods, and can work with large volumes of input information processed online in real time. Conclusions. The experimental results make it possible to recommend the developed method for automatic clustering and classification of big data, since it finds the sample extrema as quickly as possible regardless of how the data are submitted for processing. The proposed method of online probabilistic fuzzy data clustering based on evolutionary cat swarm optimization is intended for use in hybrid computational intelligence systems, in neuro-fuzzy systems, in training artificial neural networks, and in clustering and classification problems.

The Modeling and Simulation of Data Clustering Algorithms in Data Mining with Big Data

Journal of Industrial Integration and Management ◽ 10.1142/s2424862218500173 ◽ 2019 ◽ Vol 04 (01) ◽ pp. 1850017 ◽ Cited By ~ 3

Author(s): Weiru Chen ◽ Jared Oliverio ◽ Jin Ho Kim ◽ Jiayue Shen

Keyword(s): Data Mining ◽ Big Data ◽ Data Reduction ◽ Data Clustering ◽ Clustering Algorithms ◽ High Volume ◽ Clustering Methods ◽ Data Set ◽ Processing Methods ◽ Integration Data

Big Data is a popular cutting-edge technology nowadays. Its techniques and algorithms are expanding into different areas, including engineering, biomedicine, and business. Because of the high volume and complexity of Big Data, data pre-processing is necessary before data mining. The pre-processing methods include data cleaning, data integration, data reduction, and data transformation. Data clustering is the most important step of data reduction. With data clustering, mining on the reduced data set should be more efficient yet still produce quality analytical results. This paper presents the different data clustering methods and related algorithms for data mining with Big Data. Data clustering can increase the efficiency and accuracy of data mining.

CLG clustering for dropout prediction using log-data clustering method

IAES International Journal of Artificial Intelligence (IJ-AI) ◽ 10.11591/ijai.v10.i3.pp764-770 ◽ 2021 ◽ Vol 10 (3) ◽ pp. 764

Author(s): Agung Triayudi ◽ Wahyu Oktri Widyarto ◽ Lia Kamelia ◽ Iksal Iksal ◽ Sumiati Sumiati

Keyword(s): Data Mining ◽ Data Clustering ◽ Statistical Data ◽ Source Code ◽ Educational Data Mining ◽ Mining Machine ◽ Clustering Methods ◽ Clustering Method ◽ Log Data ◽ Cluster Data

The application of data mining, machine learning, and statistics to data from educational settings is commonly known as educational data mining. Most school systems require a teacher to teach a number of students at one time. Exams are regularly used to measure students' achievement, but achievement is difficult to assess this way because examinations cannot be conducted easily. In programming classes, on the other hand, source code edits and UNIX commands can easily be detected and stored automatically as log data. Hence, rather than estimating students' performance from this log data, this study focuses on detecting students who experience difficulty or are unable to follow programming classes. We propose the CLG clustering method, which predicts the risk of dropping out of school using cluster data for outlier detection.

Aproximación al grado de conocimiento y aplicación de Big Data en las bibliotecas universitarias españolas

Anales de Documentación ◽ 10.6018/analesdoc.390931 ◽ 2020 ◽ Vol 23 (1)

Author(s): Ana Belén Ríos Hilario ◽ Alberto Fraile Sastre

Keyword(s): Big Data ◽ Data Processing ◽ Massive Data ◽ University Libraries ◽ Internal Sources ◽ Massive Data Processing ◽ Correct Implementation ◽ Big Data Technology

This study analyzes the degree of knowledge and implementation of Big Data technology and its main characteristics in the Spanish university libraries registered in REBIUN, with the objective of observing whether these institutions are equipped to use and benefit from the advantages of massive data processing. The data are obtained by means of a questionnaire answered from internal sources of the libraries. From these data, a series of conclusions is drawn, together with proposals for improvement and future lines of work that would allow the correct implementation, use, and exploitation of Big Data in the services and functions offered by Spanish university libraries.
