Significance of hierarchical and partitioning based clustering in grouping aware data placement for data intensive applications

In this data era, massive volumes of data are generated every second in a variety of domains such as geoscience, the social web, finance, e-commerce, health care, climate modelling, physics, astronomy, and government. Hadoop is well recognized as the de facto big data processing platform and is widely used across many application domains. Although it is considered an efficient solution for complex query processing, it has limitations when the data to be processed exhibit interest locality. The data required for query execution follows a grouping behavior in which only a part of the big data is accessed frequently. In such scenarios, the time taken to execute a query and return results increases exponentially as the amount of data increases, leading to long waiting times for the user. Because the Hadoop default data placement strategy (HDDPS) does not consider such grouping behavior, it performs inefficiently, with shortcomings such as fewer local map task executions and increased query execution time. We therefore propose an Optimal Data Placement Strategy (ODPS) based on grouping semantics. In this paper we examine the significance of two promising clustering techniques, Hierarchical Agglomerative Clustering (HAC) and Markov Clustering (MCL), in grouping-aware data placement for data-intensive applications with interest locality. First, the user access pattern is identified by dynamically analyzing the history log. Then both clustering techniques (HAC and MCL) are applied separately to the access pattern to obtain independent clusters. These clusters are interpreted and validated to extract the Optimal Data Groupings (ODG). Finally, the proposed strategy reorganizes the default data layouts in HDFS based on the ODG to achieve maximum parallel execution per group, subject to the Load Balancer and Rack Awareness.
The proposed strategy was tested on a 10-node multi-rack cluster deployed on a cloud platform, with Hadoop installed on every node. It reduces query execution time, significantly improves data locality, and proved more efficient for processing massive datasets in a heterogeneous distributed environment. MCL also shows marginally better performance than HAC for queries exhibiting interest locality.
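
The abstract gives no implementation details, so the following is only a minimal sketch of the first two steps (deriving an access pattern from a history log, then clustering it with MCL). It assumes the log is already parsed into one list of block names per query; the function names, the unit self-loop weight, the inflation value of 2, and the iteration count are all illustrative assumptions, not the authors' settings. HAC could be applied to the same co-access matrix in a similar way.

```python
from itertools import combinations


def co_access_matrix(log, blocks):
    """Count how often each pair of blocks is read by the same query."""
    index = {b: i for i, b in enumerate(blocks)}
    n = len(blocks)
    m = [[0.0] * n for _ in range(n)]
    for accessed in log:
        for a, b in combinations(sorted(set(accessed)), 2):
            m[index[a]][index[b]] += 1.0
            m[index[b]][index[a]] += 1.0
    return m


def normalize_columns(m):
    """Scale every column so that it sums to 1 (column-stochastic matrix)."""
    n = len(m)
    for j in range(n):
        s = sum(m[i][j] for i in range(n))
        if s > 0:
            for i in range(n):
                m[i][j] /= s
    return m


def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mcl(m, inflation=2.0, iterations=50, eps=1e-6):
    """Markov Clustering: alternate expansion and inflation, then read clusters."""
    n = len(m)
    # Assumption: add a self-loop of weight 1 to every node before normalizing.
    m = [[m[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    m = normalize_columns(m)
    for _ in range(iterations):
        m = matmul(m, m)                                   # expansion
        m = [[x ** inflation for x in row] for row in m]   # inflation
        m = normalize_columns(m)
    # Each row that still carries mass is an attractor; its nonzero columns
    # are the members of that attractor's cluster.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > eps)
        if members:
            clusters.add(members)
    return clusters


def group_blocks(log):
    """Return candidate data groupings as sorted lists of block names."""
    blocks = sorted({b for query in log for b in query})
    clusters = mcl(co_access_matrix(log, blocks))
    return sorted(sorted(blocks[j] for j in c) for c in clusters)
```

For a log in which blocks A, B, C are only ever read together and blocks D, E are only ever read together, `group_blocks` returns those two groups. Validating the groups and mapping them onto HDFS nodes under load-balancing and rack-awareness constraints is outside the scope of this sketch.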

DRAW: A New Data-gRouping-AWare Data Placement Scheme for Data Intensive Applications with Interest Locality

International Journal of Recent Trends in Engineering and Research ◽ 10.23883/ijrter.2018.4333.w8uli ◽ 2018 ◽ Vol 4 (6) ◽ pp. 172-181

Keyword(s): Data Placement ◽ Data Intensive ◽ Data Grouping ◽ Data Intensive Applications

Genetic Based Data Placement for Geo-Distributed Data-Intensive Applications in Cloud Computing

Lecture Notes in Computer Science - Advances in Services Computing ◽ 10.1007/978-3-319-49178-3_20 ◽ 2016 ◽ pp. 253-265 ◽ Cited By ~ 1

Author(s): Weifeng Fan ◽ Jun Peng ◽ Xiaoyong Zhang ◽ Zhiwu Huang

Keyword(s): Cloud Computing ◽ Data Placement ◽ Distributed Data ◽ Data Intensive ◽ Data Intensive Applications

BRPS: A Big Data Placement Strategy for Data Intensive Applications

2016 IEEE 16th International Conference on Data Mining Workshops (ICDMW) ◽ 10.1109/icdmw.2016.0120 ◽ 2016 ◽ Cited By ~ 4

Author(s): Lihui Liu ◽ Junping Song ◽ Haibo Wang ◽ Pin Lv

Keyword(s): Big Data ◽ Data Placement ◽ Data Intensive ◽ Data Intensive Applications

Heuristic Data Placement for Data-Intensive Applications in Heterogeneous Cloud

Journal of Electrical and Computer Engineering ◽ 10.1155/2016/3516358 ◽ 2016 ◽ Vol 2016 ◽ pp. 1-8 ◽ Cited By ~ 4

Author(s): Qing Zhao ◽ Congcong Xiong ◽ Peng Wang

Keyword(s): Clustering Algorithm ◽ Recursive Partitioning ◽ Data Placement ◽ Data Intensive ◽ High Bandwidth ◽ Tree Data ◽ Placement Algorithm ◽ Heterogeneous Cloud ◽ The Cost ◽ Data Intensive Applications

Data placement is an important issue that aims to reduce the cost of internode data transfers in the cloud, especially for data-intensive applications, in order to improve the performance of the entire cloud system. This paper proposes an improved data placement algorithm for heterogeneous cloud environments. In the initialization phase, a data clustering algorithm based on data dependency clustering and recursive partitioning is presented, which incorporates both data size and fixed position as factors. A heuristic tree-to-tree data placement strategy is then introduced so that frequent data movements occur on high-bandwidth channels. Simulation results show that, compared with two classical strategies, this strategy effectively reduces the amount of data transmitted and the time spent on transmission during execution.
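
To make the idea of dependency clustering with recursive partitioning concrete, here is a rough sketch under stated assumptions. It is not the paper's algorithm: dependency is assumed to be the number of tasks that use two datasets together, the split is a simple greedy bisection chosen for illustration, fixed-position datasets and the tree-to-tree placement step are ignored, and all names (`dependency`, `bisect`, `partition`, `capacity`) are hypothetical.

```python
from itertools import combinations


def dependency(tasks, a, b):
    """Assumed measure: number of tasks that read both datasets a and b."""
    return sum(1 for used in tasks if a in used and b in used)


def bisect(group, tasks):
    """Split a group in two around its least-dependent pair of datasets."""
    seed_a, seed_b = min(combinations(sorted(group), 2),
                         key=lambda pair: dependency(tasks, *pair))
    left, right = [seed_a], [seed_b]
    for d in sorted(group):
        if d in (seed_a, seed_b):
            continue
        # Attach each remaining dataset to the half it depends on more.
        to_left = sum(dependency(tasks, d, x) for x in left)
        to_right = sum(dependency(tasks, d, x) for x in right)
        (left if to_left >= to_right else right).append(d)
    return left, right


def partition(group, tasks, sizes, capacity):
    """Recursively bisect until every part fits within the size capacity."""
    if len(group) < 2 or sum(sizes[d] for d in group) <= capacity:
        return [sorted(group)]
    left, right = bisect(group, tasks)
    return (partition(left, tasks, sizes, capacity)
            + partition(right, tasks, sizes, capacity))
```

Taking data size into account here means only that recursion stops once a part fits the assumed per-node capacity; a single dataset larger than the capacity is returned as its own part.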

Efficient location-aware data placement for data-intensive applications in geo-distributed scientific data centers

Tsinghua Science & Technology ◽ 10.1109/tst.2016.7590316 ◽ 2016 ◽ Vol 21 (5) ◽ pp. 471-481 ◽ Cited By ~ 13

Author(s): Jinghui Zhang ◽ Jian Chen ◽ Junzhou Luo ◽ Aibo Song

Keyword(s): Data Centers ◽ Data Placement ◽ Scientific Data ◽ Data Intensive ◽ Location Aware ◽ Data Intensive Applications

A Data Placement Algorithm for Data Intensive Applications in Cloud

International Journal of Grid and Distributed Computing ◽ 10.14257/ijgdc.2016.9.2.13 ◽ 2016 ◽ Vol 9 (2) ◽ pp. 145-156 ◽ Cited By ~ 8

Author(s): Qing Zhao ◽ Congcong Xiong ◽ Kunyu Zhang ◽ Yang Yue ◽ Jucheng Yang

Keyword(s): Data Placement ◽ Data Intensive ◽ Placement Algorithm ◽ Data Intensive Applications
