A Skyline-Based Decision Boundary Estimation Method for Binominal Classification in Big Data

Computation ◽  
2020 ◽  
Vol 8 (3) ◽  
pp. 80
Author(s):  
Christos Kalyvas ◽  
Manolis Maragoudakis

One of the most common tasks in big data environments today is classifying large amounts of data. There are numerous classification models designed to perform best in different environments and datasets, each with its advantages and disadvantages. When dealing with big data, however, their performance degrades significantly because they are not designed for, or even capable of, handling very large datasets. The approach presented here is based on a novel proposal: exploiting the dynamics of skyline queries to efficiently identify the decision boundary and classify big data. A comparison against the popular k-nearest neighbor (k-NN), support vector machine (SVM), and naïve Bayes classification algorithms shows that the proposed method is faster than k-NN and SVM. The novelty of this method lies in the fact that only a small number of computations are needed to make a prediction, while its full potential is revealed on very large datasets.
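The paper's own boundary-estimation algorithm is not reproduced in the abstract; as a minimal illustration of the skyline primitive it builds on, the sketch below computes the skyline (Pareto front) of a 2-D point set, i.e., the points not dominated by any other point, taking smaller coordinates as better. The point values are made up for illustration.

```python
def dominates(p, q):
    """True if p dominates q: p is <= q in every dimension
    and strictly < in at least one (smaller is better)."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def skyline(points):
    """Return the skyline: points not dominated by any other point."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

points = [(1, 4), (2, 2), (4, 1), (3, 3), (5, 5)]
print(skyline(points))  # [(1, 4), (2, 2), (4, 1)]; (3, 3) and (5, 5) are dominated
```

The naive check here is quadratic in the number of points; the appeal of skyline queries in a big data setting is that the skyline itself is typically a small subset of the dataset, so a decision rule built on it needs few computations per prediction.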

Author(s):  
Hoda Ahmed Abdelhafez

Mining big data is currently getting a lot of attention because businesses need more complex information in order to increase their revenue and gain competitive advantage. Therefore, mining huge amounts of data, as well as mining real-time data, needs to be done with new data mining techniques and approaches. This chapter will discuss big data volume, variety, and velocity; data mining techniques; and open source tools for handling very large datasets. Moreover, the chapter will focus on two industrial areas, telecommunications and healthcare, and the lessons learned from them.



2020 ◽  
Vol 29 (03n04) ◽  
pp. 2060011
Author(s):  
Emna Hachicha Belghith ◽  
François Rioult ◽  
Medjber Bouzidi

In recent years, big data has become the new emerging trend that is increasingly attracting the attention of the R&D community in several fields (e.g., image processing, database engineering, data mining, artificial intelligence). Marine data is one of the fields sharing this growth, hence the emergence of the marine big data paradigm, whose monitoring supports the assessment of human impact on the marine environment. Nonetheless, such environments lack support for classifying acoustic sounds while taking their diversity into account (i.e., sounds of living undersea species, sounds of human activities, and sounds of environmental effects). To overcome this issue, we propose in this paper an approach that efficiently classifies this acoustic diversity using machine learning techniques. The aim is to reach automated support for marine big data analysis. We conducted a set of experiments on a real marine dataset in order to validate our approach and show its effectiveness and efficiency. To do so, three machine learning techniques are employed: (i) classic machine learning models (i.e., k-nearest neighbor and support vector machine), (ii) deep learning based on convolutional neural networks, and (iii) transfer learning based on the reuse of pretrained models.
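The paper's pipelines (CNNs, transfer learning) are not detailed in the abstract; as a minimal sketch of the classic-model baseline it mentions, the following implements k-nearest-neighbor classification over feature vectors. The two-dimensional features and the class labels here are made-up stand-ins for real extracted audio descriptors (e.g., spectral features), which the paper does not specify.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest
    training examples under Euclidean distance.
    `train` is a list of (feature_vector, label) pairs."""
    neighbors = sorted(train, key=lambda ex: math.dist(ex[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical feature vectors for two of the three sound categories:
train = [((0.1, 0.2), "species"), ((0.2, 0.1), "species"),
         ((0.9, 0.8), "human"),   ((0.8, 0.9), "human")]
print(knn_predict(train, (0.15, 0.15)))  # species
```

In practice the same interface generalizes to the three-way classification the paper targets by adding a third labeled cluster of feature vectors.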


Author(s):  
Khyati Ahlawat ◽  
Anuradha Chug ◽  
Amit Prakash Singh

The uneven distribution of classes in a dataset biases any standard classifier toward the majority class. The instances of the significant class, being deficient in number, are generally ignored, and their correct classification, which is of paramount interest, is often overlooked when calculating overall accuracy. Conventional machine learning approaches are therefore rigorously refined to address this class imbalance problem. The challenge of imbalanced classes is even more prevalent in big data scenarios due to their high volume. This study presents a sampling solution based on cluster computing for handling class imbalance in big data. The newly proposed hybrid sampling algorithm (HSA) is assessed using three popular classification algorithms, namely support vector machine, decision tree, and k-nearest neighbor, based on balanced accuracy and elapsed time. The results obtained from the experiment are promising, with an efficiency gain of 42% in comparison to the traditional sampling solution, the synthetic minority oversampling technique (SMOTE). This work proves the effectiveness of the distribution and clustering principle in imbalanced big data scenarios.
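The abstract does not specify HSA's internals, so the sketch below is not the authors' algorithm; it only illustrates the generic hybrid-sampling idea that such methods combine: undersample the majority class while oversampling the minority class with SMOTE-style interpolation between minority examples, meeting at a balanced size. All sizes and data are made up.

```python
import random

def smote_like(minority, n_new, rng):
    """SMOTE-style oversampling: each synthetic point is interpolated
    between two randomly chosen minority examples."""
    new = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)
        t = rng.random()
        new.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return new

def hybrid_sample(majority, minority, rng):
    """Undersample the majority and oversample the minority so both
    classes meet at the midpoint of their original sizes."""
    target = (len(majority) + len(minority)) // 2
    maj = rng.sample(majority, target)
    mino = minority + smote_like(minority, target - len(minority), rng)
    return maj, mino

rng = random.Random(0)
majority = [(rng.random(), rng.random()) for _ in range(100)]
minority = [(rng.random() + 2, rng.random()) for _ in range(10)]
maj, mino = hybrid_sample(majority, minority, rng)
print(len(maj), len(mino))  # 55 55
```

A balanced training set like this is what allows balanced accuracy, the metric the study reports, to reflect minority-class performance instead of being dominated by the majority class.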


2021 ◽  
Vol 2021 ◽  
pp. 1-7
Author(s):  
Ali Labriji ◽  
Abdelkrim Bennar ◽  
Mostafa Rachik

The use of conditional probabilities has gained popularity in various fields such as medicine, finance, and image processing, especially with the availability of large datasets that allow us to extract the full potential of the available estimation algorithms. Nevertheless, such a large volume of data is often accompanied by a significant need for computational capacity and a correspondingly long computation time. In this article, we propose a low-cost estimation method: we first demonstrate analytically the convergence of our method to the desired probability, and then we perform a simulation to support our point.
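The article's own estimator is not given in the abstract; as a baseline illustration of what "convergence to the desired probability" means, the sketch below estimates a conditional probability empirically as P(A | B) ≈ #(A and B) / #B, which converges by the law of large numbers as the sample grows. The dice example is made up for illustration.

```python
import random

def estimate_conditional(samples, event_a, event_b):
    """Empirical estimate of P(A | B) = #(A and B) / #B."""
    in_b = [s for s in samples if event_b(s)]
    if not in_b:
        return None  # B never occurred; the estimate is undefined
    return sum(1 for s in in_b if event_a(s)) / len(in_b)

# Example: two dice; P(sum == 7 | first die == 3) has true value 1/6.
rng = random.Random(42)
samples = [(rng.randint(1, 6), rng.randint(1, 6)) for _ in range(100_000)]
p = estimate_conditional(samples,
                         event_a=lambda s: s[0] + s[1] == 7,
                         event_b=lambda s: s[0] == 3)
print(p)  # close to the true value 1/6 ≈ 0.1667
```

The cost concern the abstract raises is visible even here: only the samples where B occurs contribute to the estimate, so rare conditioning events require very large datasets, which is what motivates cheaper estimation schemes.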


2021 ◽  
Author(s):  
Li Guochao ◽  
Zhigang Liu ◽  
Jie Lu ◽  
Honggen Zhou ◽  
Li Sun

Abstract The groove is a key structure of high-performance integral cutting tools. It has to be manufactured on a 5-axis grinding machine due to its complex spatial geometry and hard materials. The crucial manufacturing parameters (CMP) are the grinding wheel positions and geometries. However, solving the CMP for a designed groove is a challenging problem. Traditional trial-and-error or analytical methods suffer from defects such as being time-consuming, having limited applicability, and low accuracy. In this study, the problem is translated into a multiple-output regression model of groove manufacture (MORGM) based on big data technology and AI algorithms. The inputs are 34 groove geometry features and the outputs are the 5 CMP. First, two groove machining big data sets with different ranges are established, each of which includes 46,656 records; they serve as the data resource for the MORGM. Second, 7 AI algorithms, including linear regression, k-nearest-neighbor regression, decision trees, random forest regression, support vector regression, and ANN algorithms, are discussed to build the model. Then, 28 experiments are carried out to test the big data sets and algorithms. Finally, the best MORGM is built using an ANN algorithm and the big data set with the larger range. The results show that the CMP can be calculated accurately and conveniently by the built MORGM.
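The study's best model is an ANN, which is not sketched here; to illustrate only the multiple-output regression framing (one feature vector in, several parameters out), the following uses the simplest of the candidate algorithms the abstract lists, k-nearest-neighbor regression, averaging the output vectors of the nearest training records. The tiny 2-feature, 2-output records below are hypothetical stand-ins for the real 34-feature, 5-CMP records.

```python
import math

def knn_multioutput(train, query, k=3):
    """Predict a vector of outputs as the per-component mean of the
    k nearest training examples' output vectors.
    `train` is a list of (features, outputs) pairs."""
    nearest = sorted(train, key=lambda ex: math.dist(ex[0], query))[:k]
    dim = len(nearest[0][1])
    return tuple(sum(out[i] for _, out in nearest) / k for i in range(dim))

# Hypothetical records: (groove geometry features) -> (manufacturing params)
train = [((1.0, 2.0), (10.0, 1.0)),
         ((1.1, 2.1), (11.0, 2.0)),
         ((5.0, 5.0), (50.0, 9.0)),
         ((1.2, 1.9), (12.0, 3.0))]
print(knn_multioutput(train, (1.05, 2.0), k=3))  # (11.0, 2.0)
```

Any of the listed algorithms can fill the same role as long as it maps the 34 geometry features to all 5 CMP jointly; the study's experiments compare them on exactly that interface.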

