Workload Optimization by Horizontal Aggregation in SQL for Data Mining Analysis

Author(s):  
Prasanna M. Rathod ◽  
Prof. Dr. Anjali B. Raut

Preparing a data set for analysis is generally the most time-consuming task in a data mining project, requiring many complex SQL queries, joining tables, and aggregating columns. Existing SQL aggregations have limitations for preparing data sets because they return one column per aggregated group. In general, significant manual effort is required to build data sets where a horizontal layout is needed. We propose simple yet powerful methods to generate SQL code that returns aggregated columns in a horizontal tabular layout, returning a set of numbers instead of one number per row. This new class of functions is called horizontal aggregations. Horizontal aggregations build data sets with a horizontal denormalized layout (e.g., point-dimension, observation-variable, instance-feature), which is the standard layout required by most data mining algorithms. We propose three fundamental methods to evaluate horizontal aggregations: CASE, exploiting the programming CASE construct; SPJ, based on standard relational algebra operators (SPJ queries); and PIVOT, using the PIVOT operator offered by some DBMSs. Experiments with large tables compare the proposed query evaluation methods. Our CASE method has speed similar to the PIVOT operator and is much faster than the SPJ method. In general, the CASE and PIVOT methods exhibit linear scalability, whereas the SPJ method does not. For query optimization, the distance computation and nearest-cluster assignment in k-means are expressed in SQL. Workload balancing is the assignment of work to processors in a way that maximizes application performance. The process of load balancing can be generalized into four basic steps: (1) monitoring processor load and state; (2) exchanging workload and state information between processors; (3) decision making; and (4) data migration. The decision phase is triggered when a load imbalance is detected, to calculate an optimal data redistribution. In the fourth and last phase, data migrates from overloaded processors to under-loaded ones.
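To make the CASE method concrete, the following is a minimal sketch of how SQL for a horizontal aggregation could be generated: one SUM(CASE ...) column per value of the transposing dimension, grouped by the key column. The table and column names (sales, store_id, quarter, amount) are hypothetical placeholders, not taken from the paper.

```python
# Minimal sketch of the CASE-based horizontal aggregation idea.
# All table and column names are hypothetical; this is not the authors' code generator.

def horizontal_sum_sql(fact_table, group_col, transpose_col, measure_col, transpose_values):
    """Build a SQL query returning one SUM() column per value of transpose_col."""
    case_columns = ",\n  ".join(
        f"SUM(CASE WHEN {transpose_col} = '{v}' THEN {measure_col} ELSE 0 END) AS {measure_col}_{v}"
        for v in transpose_values
    )
    return (
        f"SELECT {group_col},\n  {case_columns}\n"
        f"FROM {fact_table}\n"
        f"GROUP BY {group_col};"
    )

if __name__ == "__main__":
    # One row per store, one aggregated column per quarter: a horizontal, denormalized layout.
    print(horizontal_sum_sql("sales", "store_id", "quarter", "amount", ["Q1", "Q2", "Q3", "Q4"]))
```

Under the same assumptions, the SPJ variant would compute one aggregation per transposed value and join the partial results on the grouping key, while PIVOT delegates the transposition to the DBMS.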

2020 ◽  
Author(s):  
Isha Sood ◽  
Varsha Sharma

Essentially, data mining concerns computation over data and the identification of patterns and trends in the information so that decisions or judgments can be made. Data mining concepts have been in use for years, but with the emergence of big data they have become even more common. In particular, the scalable mining of such large data sets is a difficult problem that has attracted several recent research efforts. A few of these recent works use the MapReduce methodology to construct data mining models across the data set. In this article, we examine current approaches to large-scale data mining and compare their output to the MapReduce model. Based on our research, a system for data mining that combines MapReduce and sampling is implemented and discussed.
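As a rough illustration of the "MapReduce plus sampling" combination described above, the following single-machine sketch samples a fraction of the records, runs a map phase that emits key-value pairs, and reduces them to counts. It is illustrative only and does not reproduce the authors' system.

```python
# Single-machine sketch of MapReduce combined with sampling (illustrative only).
import random
from collections import defaultdict

def map_phase(record):
    # Emit (key, 1) pairs, e.g. counting item occurrences in a transaction.
    return [(item, 1) for item in record]

def reduce_phase(key, values):
    return key, sum(values)

def mapreduce_with_sampling(records, sample_fraction=0.1, seed=42):
    random.seed(seed)
    # Sampling step: mine only a random fraction of the data set.
    sample = [r for r in records if random.random() < sample_fraction]
    # Map step: group intermediate values by key.
    intermediate = defaultdict(list)
    for record in sample:
        for key, value in map_phase(record):
            intermediate[key].append(value)
    # Reduce step: aggregate each key's values.
    return dict(reduce_phase(k, v) for k, v in intermediate.items())

if __name__ == "__main__":
    transactions = [["milk", "bread"], ["bread", "beer"], ["milk", "beer", "bread"]] * 1000
    print(mapreduce_with_sampling(transactions, sample_fraction=0.05))
```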


Author(s):  
Barak Chizi ◽  
Lior Rokach ◽  
Oded Maimon

Dimensionality (i.e., the number of data set attributes or groups of attributes) constitutes a serious obstacle to the efficiency of most data mining algorithms (Maimon and Last, 2000). The main reason for this is that data mining algorithms are computationally intensive. This obstacle is sometimes known as the "curse of dimensionality" (Bellman, 1961). The objective of Feature Selection is to identify the important features in the data set and to discard every other feature as irrelevant or redundant information. Since Feature Selection reduces the dimensionality of the data, data mining algorithms can operate faster and more effectively. In some cases, feature selection can also improve the performance of the data mining method, mainly because it yields a more compact, easily interpreted representation of the target concept. There are three main approaches to feature selection: wrapper, filter, and embedded. The filter approach (Kohavi, 1995; Kohavi and John, 1996) operates independently of the data mining method employed subsequently: undesirable features are filtered out of the data before learning begins. These algorithms use heuristics based on general characteristics of the data to evaluate the merit of feature subsets. A sub-category of filter methods, referred to as rankers, employs some criterion to score each feature and provide a ranking; from this ordering, several feature subsets can be chosen by manually setting a cut-off point. The wrapper approach (Kohavi, 1995; Kohavi and John, 1996) uses an inducer as a black box along with a statistical re-sampling technique such as cross-validation to select the best feature subset according to some predictive measure. The embedded approach (see, for instance, Guyon and Elisseeff, 2003) is similar to the wrapper approach in the sense that the features are specifically selected for a certain inducer, but it selects the features in the process of learning.
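The following is a minimal sketch of the ranker idea described above: score each feature independently of the learner, sort by score, and keep a manually chosen top-k subset. The choice of mutual information as the scoring criterion and the iris data set are assumptions made for illustration, not details from the chapter.

```python
# Sketch of a filter-style "ranker": score features independently of the learner,
# rank them, and keep the top-k (the cut-off point is chosen manually).
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif

X, y = load_iris(return_X_y=True)
scores = mutual_info_classif(X, y, random_state=0)   # per-feature relevance scores

# Rank features by score (highest first) and keep a manually chosen subset.
ranking = np.argsort(scores)[::-1]
k = 2
selected = ranking[:k]
print("feature scores:", np.round(scores, 3))
print("selected feature indices:", selected)
X_reduced = X[:, selected]   # reduced data set passed on to the mining algorithm
```

A wrapper approach would instead evaluate each candidate subset by cross-validating the inducer itself, which is more expensive but tailored to the learner.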


2016 ◽  
Vol 2016 ◽  
pp. 1-11 ◽  
Author(s):  
Ivan Kholod ◽  
Ilya Petukhov ◽  
Andrey Shorov

This paper describes the construction of a Cloud for Distributed Data Analysis (CDDA) based on the actor model. The design uses an approach that maps data mining algorithms onto decomposed functional blocks, which are assigned to actors. Using actors allows the computation to move closer to the stored data. The process does not require loading data sets into the cloud and allows users to analyze confidential information locally. Experimental results show that the proposed approach outperforms established solutions in efficiency.
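The sketch below illustrates the underlying actor pattern rather than the CDDA implementation itself: each actor owns a local data partition and replies to messages with aggregates only, so raw (possibly confidential) records never leave the node. Threads and queues stand in for a real actor runtime; all names are hypothetical.

```python
# Minimal sketch of the actor pattern assumed above: computation moves to the data,
# and only aggregates are sent back. Not the CDDA implementation.
import threading
import queue

class DataActor(threading.Thread):
    def __init__(self, local_data):
        super().__init__(daemon=True)
        self.local_data = local_data      # data stays where it is stored
        self.mailbox = queue.Queue()      # incoming messages
        self.replies = queue.Queue()      # outgoing aggregates

    def run(self):
        while True:
            message = self.mailbox.get()
            if message == "stop":
                break
            if message == "partial_mean":
                # Send back only an aggregate, never the raw records.
                self.replies.put((sum(self.local_data), len(self.local_data)))

if __name__ == "__main__":
    partitions = [[1.0, 2.0, 3.0], [10.0, 20.0], [5.0]]
    actors = [DataActor(p) for p in partitions]
    for a in actors:
        a.start()
        a.mailbox.put("partial_mean")
    totals = [a.replies.get() for a in actors]
    for a in actors:
        a.mailbox.put("stop")
    global_mean = sum(s for s, _ in totals) / sum(n for _, n in totals)
    print("global mean computed from partial aggregates:", global_mean)
```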


Author(s):  
Balazs Feil ◽  
Janos Abonyi

This chapter aims to give a comprehensive view of the links between fuzzy logic and data mining. It is shown that knowledge extracted from simple data sets or huge databases can be represented by fuzzy rule-based expert systems. It is highlighted that both the performance and the interpretability of the mined fuzzy models are of major importance, and effort is required to keep the resulting rule bases small and comprehensible. Therefore, in recent years, soft-computing-based data mining algorithms have been developed for feature selection, feature extraction, model optimization, and model reduction (rule-base simplification). The application of these techniques is illustrated using the wine data classification problem. The results illustrate that fuzzy tools can be applied in a synergistic manner through the nine steps of knowledge discovery.
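To show what a small, interpretable fuzzy rule base looks like, here is a tiny sketch with triangular membership functions and two human-readable rules. The feature names, cut points, and rules are illustrative placeholders, not the rules mined from the wine data set.

```python
# Tiny sketch of a fuzzy rule-based classifier with two interpretable rules.
# Feature names and membership cut points are illustrative only.

def triangular(x, a, b, c):
    """Triangular membership function peaking at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def classify(alcohol, color_intensity):
    low_alcohol = triangular(alcohol, 10.0, 11.5, 13.0)
    high_alcohol = triangular(alcohol, 12.0, 14.0, 16.0)
    dark_color = triangular(color_intensity, 4.0, 8.0, 12.0)

    # Rule 1: IF alcohol is high AND color is dark THEN class A
    # Rule 2: IF alcohol is low THEN class B
    firing_a = min(high_alcohol, dark_color)
    firing_b = low_alcohol
    return "class A" if firing_a >= firing_b else "class B"

print(classify(alcohol=13.8, color_intensity=7.5))  # -> class A
print(classify(alcohol=11.0, color_intensity=3.0))  # -> class B
```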


2018 ◽  
Vol 7 (4.36) ◽  
pp. 845 ◽  
Author(s):  
K. Kavitha ◽  
K. Rohini ◽  
G. Suseendran

Data mining is the process by which knowledge is extracted from large amounts of data through the recognition of interesting patterns. It is one of the knowledge-exploration areas widely used in the field of computer science. Data mining is an interdisciplinary area with great impact on various other fields, such as data analytics in business organizations, medical forecasting and diagnosis, market analysis, statistical analysis and forecasting, and predictive analysis in other domains. Data mining takes multiple forms, such as text mining, web mining, visual mining, spatial mining, knowledge mining, and distributed mining. In general, the data mining process involves many tasks, beginning with pre-processing; the actual mining task starts after pre-processing is complete. This work deals with the analysis and comparison of various data mining algorithms, particularly Meta classifiers, based on performance and accuracy. The work is in the medical domain, using lung function test report data together with smoking data. This medical data set was created from raw data obtained from the hospital. In this paper, we analyze the performance of Meta classifiers for classifying the files. Initially, the performances of Meta and Rule classifiers were analyzed, and it was observed that the Meta classifiers are more efficient than the Rule classifiers in the Weka tool. The implementation then continued with a performance comparison between different types of classification algorithms, among which the Meta classifiers showed comparatively higher accuracy. The four Meta classifier algorithms widely explored using the Weka tool, namely Bagging, Attribute Selected Classifier, LogitBoost, and Classification via Regression, are used to classify this medical dataset, and the results obtained are evaluated and compared to identify the best classifier.
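The study itself runs Weka's meta classifiers on hospital lung-function data, which is not available here. As a hedged analogue of the comparison protocol, the sketch below cross-validates two roughly corresponding scikit-learn meta estimators (bagged decision trees and a boosting ensemble) on a placeholder data set; the estimators and data are assumptions, not the paper's setup.

```python
# Hedged analogue of the meta-classifier comparison (not the Weka experiment itself).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)   # placeholder for the hospital data set

meta_classifiers = {
    "Bagging (decision trees)": BaggingClassifier(DecisionTreeClassifier(), n_estimators=10, random_state=0),
    "Boosting (LogitBoost-like)": GradientBoostingClassifier(random_state=0),
}

for name, clf in meta_classifiers.items():
    scores = cross_val_score(clf, X, y, cv=10)    # 10-fold cross-validated accuracy
    print(f"{name}: mean accuracy = {scores.mean():.4f}")
```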


Passer ◽
2019 ◽  
Vol 3 (1) ◽  
pp. 174-179
Author(s):  
Noor Bahjat ◽  
Snwr Jamak

Cancer is a common disease that threatens the life of one in every three people. This dangerous disease urgently requires early detection and diagnosis. Recent progress in data mining methods, such as classification, has demonstrated the value of applying machine learning algorithms to large datasets. This paper mainly aims to utilise data mining techniques to classify cancer data sets into blood cancer and non-blood cancer, based on pre-defined information and post-defined information obtained after blood tests and CT scan tests. The research was conducted using the WEKA data mining tool with 10-fold cross-validation to evaluate and compare different classification algorithms, extract meaningful information from the dataset, and accurately identify the most suitable and predictive model. The paper shows that the classifier best able to predict the cancerous dataset is the Multilayer Perceptron, with an accuracy of 99.3967%.
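The sketch below reproduces only the evaluation protocol described above (a multilayer perceptron assessed with 10-fold cross-validation), using scikit-learn and a stand-in data set rather than WEKA and the paper's blood/non-blood cancer data, so the reported 99.3967% accuracy will not be reproduced.

```python
# Sketch of the evaluation protocol: MLP + 10-fold cross-validation
# (scikit-learn analogue of the WEKA setup, on a placeholder data set).
from sklearn.datasets import load_breast_cancer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(
    StandardScaler(),                                             # scale features for the MLP
    MLPClassifier(hidden_layer_sizes=(50,), max_iter=1000, random_state=0),
)
scores = cross_val_score(model, X, y, cv=10)
print(f"10-fold CV accuracy: {scores.mean():.4f} +/- {scores.std():.4f}")
```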


2017 ◽  
Vol 9 (1) ◽  
pp. 50-58
Author(s):  
Ali Bayır ◽  
Sebnem Ozdemir ◽  
Sevinç Gülseçen

Political elections can be defined as systems that contain political tendencies and voters' perceptions and preferences. The outputs of those systems are formed by specific attributes of individuals such as age, gender, occupation, educational status, socio-economic status, religious belief, etc. Those attributes can create a data set that contains hidden information and undiscovered patterns, which can be revealed by using data mining methods and techniques. The main purpose of this study is to define voting tendencies in politics by using some data mining methods. To that end, the survey results prepared and applied before the 2011 elections in Turkey by the KONDA Research and Consultancy Company were used as the raw data set. After preprocessing the data, models were generated with data mining algorithms such as Gini and C4.5 decision trees, Naive Bayes, and Random Forest. Owing to their increasing popularity and flexibility in the analysis process, the R language and the RStudio environment were used.
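The modelling step is done in R in the study; as a hedged sketch of the same comparison, the snippet below cross-validates roughly equivalent scikit-learn estimators (Gini and entropy decision trees, Naive Bayes, Random Forest) on synthetic survey-like placeholder data, since the KONDA survey data set is not available here.

```python
# Hedged sketch of the model comparison on placeholder survey-like data.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# Placeholder features: age band, gender, education level, socio-economic status.
X = rng.integers(0, 5, size=(500, 4))
y = rng.integers(0, 3, size=500)          # placeholder voting-tendency labels

models = {
    "Decision tree (Gini)": DecisionTreeClassifier(criterion="gini", random_state=0),
    "Decision tree (entropy, C4.5-like)": DecisionTreeClassifier(criterion="entropy", random_state=0),
    "Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(random_state=0),
}
for name, model in models.items():
    print(name, cross_val_score(model, X, y, cv=5).mean())
```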


Author(s):  
Ansar Abbas ◽  
Muhammad Aman Ullah ◽  
Abdul Waheed

This study was conducted to predict the body weight (BW) of Thalli sheep of southern Punjab from different body measurements. In the BW prediction, several body measurements, viz., withers height, body length, head length, head width, ear length, ear width, neck length, neck width, heart girth, rump length, rump width, tail length, barrel depth, and sacral pelvic width, are used as predictors. Data mining algorithms such as Chi-square Automatic Interaction Detector (CHAID), Exhaustive CHAID, Classification and Regression Tree (CART), and Artificial Neural Network (ANN) are used to predict the BW for a total of 85 female Thalli sheep. The data set is partitioned into training (80%) and test (20%) sets before the algorithms are applied. The minimum numbers of parent (4) and child (2) nodes are set in order to ensure their predictive ability. The R² (%) and RMSE values for the CHAID, Exhaustive CHAID, ANN, and CART algorithms are 67.38 (1.003), 64.37 (1.049), 61.45 (1.093), and 59.02 (1.125), respectively. The most significant predictor in the BW prediction of Thalli sheep is BL (body length). The heaviest average BW of 9.596 kg is obtained in the subgroup with BL > 25.000 inches. On the basis of several goodness-of-fit criteria, we conclude that the CHAID algorithm performs better in predicting the BW of Thalli sheep and yields a more easily interpreted decision tree diagram. The obtained CHAID results may also help determine the body measurements positively associated with BW for developing better selection strategies within the scope of indirect selection criteria.
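As a sketch of the CART step only (scikit-learn's decision tree is a CART implementation), the snippet below mirrors the stated protocol: an 80/20 train/test split, minimum parent (4) and child (2) node sizes, and R² and RMSE as goodness-of-fit measures. The measurements are synthetic placeholders, so the reported values are not reproduced.

```python
# Sketch of the CART step with the same protocol on synthetic placeholder data.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

rng = np.random.default_rng(0)
body_length = rng.uniform(20, 30, size=85)   # BL in inches (placeholder)
heart_girth = rng.uniform(20, 30, size=85)   # placeholder
X = np.column_stack([body_length, heart_girth])
y = 0.3 * body_length + 0.2 * heart_girth + rng.normal(0, 1, size=85)  # BW in kg (synthetic)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
cart = DecisionTreeRegressor(min_samples_split=4, min_samples_leaf=2, random_state=0)
cart.fit(X_train, y_train)
pred = cart.predict(X_test)
print("R2  :", r2_score(y_test, pred))
print("RMSE:", mean_squared_error(y_test, pred) ** 0.5)
```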


2010 ◽  
Vol 1 (1) ◽  
pp. 60-92 ◽  
Author(s):  
Joaquín Derrac ◽  
Salvador García ◽  
Francisco Herrera

The use of Evolutionary Algorithms to perform data reduction tasks has become an effective approach to improving the performance of data mining algorithms. Many proposals in the literature have shown that Evolutionary Algorithms obtain excellent results in their application as Instance Selection and Instance Generation procedures. The purpose of this paper is to present a survey on the application of Evolutionary Algorithms to the Instance Selection and Generation processes. It covers approaches applied to the enhancement of the nearest neighbor rule, as well as approaches focused on improving the models extracted by some well-known data mining algorithms. Furthermore, some proposals developed to tackle two emerging problems in data mining, Scaling Up and Imbalanced Data Sets, are also reviewed.
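The sketch below illustrates the basic evolutionary instance selection scheme the survey covers: a binary chromosome marks which training instances to keep, and fitness combines 1-NN accuracy with a reduction reward. The operators, weights, and data set are simplifying assumptions; the surveyed proposals use more refined designs.

```python
# Minimal sketch of evolutionary instance selection with a 1-NN-based fitness.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = load_iris(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

def fitness(mask):
    if mask.sum() < 3:
        return 0.0
    knn = KNeighborsClassifier(n_neighbors=1).fit(X_tr[mask], y_tr[mask])
    accuracy = knn.score(X_val, y_val)
    reduction = 1.0 - mask.mean()               # reward smaller instance subsets
    return 0.75 * accuracy + 0.25 * reduction

population = [rng.random(len(X_tr)) < 0.5 for _ in range(20)]
for _ in range(30):                             # generations
    parents = sorted(population, key=fitness, reverse=True)[:10]
    children = []
    for _ in range(10):
        a, b = rng.choice(10, size=2, replace=False)
        cut = rng.integers(1, len(X_tr))        # one-point crossover
        child = np.concatenate([parents[a][:cut], parents[b][cut:]])
        flip = rng.random(len(child)) < 0.01    # bit-flip mutation
        children.append(np.where(flip, ~child, child))
    population = parents + children

best = max(population, key=fitness)
print("kept", int(best.sum()), "of", len(X_tr), "instances; fitness =", round(fitness(best), 3))
```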


2014 ◽  
Vol 490-491 ◽  
pp. 1361-1367
Author(s):  
Xin Huang ◽  
Hui Juan Chen ◽  
Mao Gong Zheng ◽  
Ping Liu ◽  
Jing Qian

With the advent of location-based social media and location-acquisition technologies, trajectory data are becoming more and more ubiquitous in the real world. A lot of data mining algorithms have been successfully applied to trajectory data sets. Trajectory pattern mining has received a lot of attention in recent years. In this paper, we review the most influential methods as well as typical applications within the context of trajectory pattern mining.

