Dynamic data placement and tool assignment for big-data orchestrated bioinformatics workflows

DynDL: Scheduling Data-Locality-Aware Tasks with Dynamic Data Transfer Cost for Multicore-Server-Based Big Data Clusters

Applied Sciences ◽

10.3390/app8112216 ◽

2018 ◽

Vol 8 (11) ◽

pp. 2216

Author(s):

Jiahui Jin ◽

Qi An ◽

Wei Zhou ◽

Jiakai Tang ◽

Runqun Xiong

Keyword(s):

Big Data ◽

Data Processing ◽

Processing Time ◽

Data Transfer ◽

Data Locality ◽

Free Time ◽

Time Data ◽

Dynamic Data ◽

Network Bandwidth ◽

Transfer Cost

Network bandwidth is a scarce resource in big data environments, so data locality is a fundamental problem for data-parallel frameworks such as Hadoop and Spark. This problem is exacerbated in multicore-server-based clusters, where multiple tasks running on the same server compete for that server’s network bandwidth. Existing approaches address this problem by scheduling computational tasks near their input data while considering each server’s free time, data placement, and data transfer costs. However, such approaches usually assign identical values to data transfer costs, even though a multicore server’s data transfer cost increases with the number of data-remote tasks. As a result, they minimize data-processing time ineffectively. As a solution, we propose DynDL (Dynamic Data Locality), a novel data-locality-aware task-scheduling model that handles dynamic data transfer costs for multicore servers. DynDL offers greater flexibility than existing approaches by using a set of non-decreasing functions to evaluate dynamic data transfer costs. We also propose online and offline algorithms (based on DynDL) that minimize data-processing time and adaptively adjust data locality. Although DynDL is NP-complete (nondeterministic polynomial-complete), we prove that the offline algorithm runs in quadratic time and generates optimal results for specific use cases of DynDL. Using a series of simulations and real-world executions, we show that our algorithms perform 30% better, in terms of data-processing time, than algorithms that do not consider dynamic data transfer costs. Moreover, they can adaptively adjust data locality based on each server’s free time, data placement, and network bandwidth, and can schedule tens of thousands of tasks within seconds or less.
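
The idea of a transfer cost that grows with the number of data-remote tasks on a server can be illustrated with a small sketch. This is not the DynDL online or offline algorithm from the paper; it is a simple greedy heuristic written under stated assumptions. The names `linear_cost` and `greedy_schedule`, the uniform processing time, and the linear cost shape are all hypothetical choices made for illustration.

```python
from typing import Callable, Dict, List, Set, Tuple


def linear_cost(base: float, step: float) -> Callable[[int], float]:
    """Build a non-decreasing transfer-cost function (assumed linear shape).

    The returned function maps the number of data-remote tasks on a server
    to the transfer cost paid by the newest remote task.
    """
    def cost(remote_tasks: int) -> float:
        return base + step * max(remote_tasks - 1, 0)
    return cost


def greedy_schedule(
    tasks: List[Tuple[str, Set[str]]],
    servers: List[str],
    proc_time: float,
    transfer_cost: Callable[[int], float],
) -> Tuple[Dict[str, str], float]:
    """Assign each task to the server where it would finish earliest.

    tasks: (task name, set of servers holding the task's input data).
    A task placed on a server without its data pays a transfer cost that
    depends on how many remote tasks that server already hosts.
    """
    free_time = {s: 0.0 for s in servers}      # when each server becomes idle
    remote_count = {s: 0 for s in servers}     # data-remote tasks per server
    assignment: Dict[str, str] = {}

    for task, replicas in tasks:
        best_server, best_finish = None, float("inf")
        for s in servers:
            penalty = 0.0 if s in replicas else transfer_cost(remote_count[s] + 1)
            finish = free_time[s] + proc_time + penalty
            if finish < best_finish:           # ties keep the earlier server
                best_server, best_finish = s, finish
        assignment[task] = best_server
        free_time[best_server] = best_finish
        if best_server not in replicas:
            remote_count[best_server] += 1

    return assignment, max(free_time.values())
```

With two servers and three tasks whose data all live on server `a`, the heuristic sends one task to `b` while the remote cost is low, then returns to `a` once a second remote task on `b` would cost more than waiting for the local server.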

C3PO - A Dynamic Data Placement Agent for ATLAS Distributed Data Management

Journal of Physics Conference Series ◽

10.1088/1742-6596/898/6/062012 ◽

2017 ◽

Vol 898 ◽

pp. 062012 ◽

Cited By ~ 1

Author(s):

T Beermann ◽

M Lassnig ◽

M Barisits ◽

C Serfon ◽

V Garonne ◽

...

Keyword(s):

Data Management ◽

Data Placement ◽

Distributed Data ◽

Dynamic Data ◽

Distributed Data Management

Cost effective dynamic data placement for efficient access of social networks

Journal of Parallel and Distributed Computing ◽

10.1016/j.jpdc.2020.03.013 ◽

2020 ◽

Vol 141 ◽

pp. 82-98

Author(s):

Hourieh Khalajzadeh ◽

Dong Yuan ◽

Bing Bing Zhou ◽

John Grundy ◽

Yun Yang

Keyword(s):

Social Networks ◽

Cost Effective ◽

Data Placement ◽

Dynamic Data ◽

Efficient Access ◽

Effective Dynamic

Efficient Data Placement and Replication for QoS-Aware Approximate Query Evaluation of Big Data Analytics

IEEE Transactions on Parallel and Distributed Systems ◽

10.1109/tpds.2019.2921337 ◽

2019 ◽

Vol 30 (12) ◽

pp. 2677-2691 ◽

Cited By ~ 4

Author(s):

Qiufen Xia ◽

Zichuan Xu ◽

Weifa Liang ◽

Shui Yu ◽

Song Guo ◽

...

Keyword(s):

Big Data ◽

Data Analytics ◽

Big Data Analytics ◽

Data Placement ◽

Query Evaluation ◽

Efficient Data ◽

Approximate Query

Identity-Based Dynamic Data Auditing for Big Data Storage

IEEE Transactions on Big Data ◽

10.1109/tbdata.2019.2941882 ◽

2019 ◽

pp. 1-1

Author(s):

Tao Shang ◽

Feng Zhang ◽

Xingyue Chen ◽

Jianwei Liu ◽

Xinxi Lu

Keyword(s):

Big Data ◽

Data Storage ◽

Dynamic Data ◽

Identity Based ◽

Data Auditing ◽

Big Data Storage

Significance of Hierarchical and Markov Clustering in Grouping Aware Data Placement for Data Intensive Applications Having Interest Locality

Scalable Computing Practice and Experience ◽

10.12694/scpe.v19i3.1375 ◽

2018 ◽

Vol 19 (3) ◽

pp. 245-258

Author(s):

Vengadeswaran Shanmugasundaram ◽

Balasundaram Sadhu Ramakrishnan

Keyword(s):

Big Data ◽

Data Placement ◽

Query Execution ◽

Access Pattern ◽

Clustering Techniques ◽

Data Intensive ◽

Markov Clustering ◽

Default Data ◽

Data Intensive Applications ◽

Grouping Behavior

In this data era, massive volumes of data are generated every second in a variety of domains such as geoscience, the social web, finance, e-commerce, health care, climate modelling, physics, astronomy, and government sectors. Hadoop is well recognized as the de facto big data processing platform and has been extensively adopted in many application domains that process big data. Even though it is considered an efficient solution for such complex query processing, it has limitations when the data to be processed exhibit interest locality. The data required for query execution follow a grouping behavior, wherein only a part of the big data is accessed frequently. In such scenarios, the time taken to execute a query and return results increases exponentially as the amount of data increases, leading to long waiting times for the user. Since the Hadoop default data placement strategy (HDDPS) does not consider such grouping behavior, it performs inefficiently, resulting in shortcomings such as decreased local map task execution and increased query execution time. Hence, an Optimal Data Placement Strategy (ODPS) based on grouping semantics is proposed. In this paper, we examine the significance of the two most promising clustering techniques, Hierarchical Agglomerative Clustering (HAC) and Markov Clustering (MCL), in grouping-aware data placement for data-intensive applications having interest locality. Initially, the user access pattern is identified by dynamically analyzing the history log. Then both clustering techniques (HAC and MCL) are applied separately to the access pattern to obtain independent clusters. These clusters are interpreted and validated to extract the Optimal Data Groupings (ODG). Finally, the proposed strategy reorganizes the default data layouts in HDFS based on the ODG to achieve maximum parallel execution per group, subject to Load Balancer and Rack Awareness.
The proposed strategy is tested on a 10-node cluster in a multi-rack setup, with Hadoop installed on every node, deployed on a cloud platform. It reduces query execution time, significantly improves data locality, and proves more efficient for processing massive datasets in a heterogeneous distributed environment. In addition, MCL shows marginally better performance than HAC for queries exhibiting interest locality.
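
The step of clustering an access pattern into data groupings can be sketched as follows. This is a minimal illustration, not the authors' ODPS implementation: the log format (one set of block identifiers per query), the Jaccard co-access similarity, the average-linkage rule, and the stopping threshold are all assumptions introduced here, and the function names are hypothetical.

```python
from itertools import combinations
from typing import List, Set


def co_access_similarity(log: List[Set[str]], a: str, b: str) -> float:
    """Jaccard similarity between the sets of queries that touch blocks a and b."""
    queries_a = {i for i, q in enumerate(log) if a in q}
    queries_b = {i for i, q in enumerate(log) if b in q}
    union = queries_a | queries_b
    return len(queries_a & queries_b) / len(union) if union else 0.0


def agglomerative_groups(log: List[Set[str]], threshold: float = 0.5) -> List[Set[str]]:
    """Group blocks bottom-up with average linkage over co-access similarity.

    Starts with one cluster per block and repeatedly merges the most similar
    pair, stopping when the best pair falls below the threshold.
    """
    blocks = sorted(set().union(*log))
    clusters: List[Set[str]] = [{b} for b in blocks]

    while len(clusters) > 1:
        best = None  # (similarity, index i, index j)
        for i, j in combinations(range(len(clusters)), 2):
            sims = [co_access_similarity(log, a, b)
                    for a in clusters[i] for b in clusters[j]]
            avg = sum(sims) / len(sims)
            if best is None or avg > best[0]:
                best = (avg, i, j)
        if best[0] < threshold:
            break
        _, i, j = best
        clusters[i] |= clusters[j]   # merge j into i (i < j, so i stays valid)
        del clusters[j]

    return clusters
```

A placement layer could then spread the blocks of each resulting group across different nodes so that a query touching the group runs its map tasks in parallel; that placement step is outside this sketch.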

A Survey on Data Placement Strategy in Big Data Heterogeneous Environments

2019 3rd International Conference on Trends in Electronics and Informatics (ICOEI) ◽

10.1109/icoei.2019.8862676 ◽

2019 ◽

Author(s):

Anilkumar Ambore ◽

Udaya Rani V.

Keyword(s):

Big Data ◽

Data Placement ◽

Heterogeneous Environments

Trend Early Warning Technology for Real-Time Computer Network Based on Dynamic Data Flow under the Background of Big Data

2020 12th International Conference on Measuring Technology and Mechatronics Automation (ICMTMA) ◽

10.1109/icmtma50254.2020.00211 ◽

2020 ◽

Author(s):

Rui Wang

Keyword(s):

Big Data ◽

Real Time ◽

Early Warning ◽

Data Flow ◽

Computer Network ◽

Dynamic Data

An Optimal Data Placement Strategy for Improving System Performance of Massive Data Applications Using Graph Clustering

International Journal of Ambient Computing and Intelligence ◽

10.4018/ijaci.2018070102 ◽

2018 ◽

Vol 9 (3) ◽

pp. 15-30 ◽

Cited By ~ 4

Author(s):

S. Vengadeswaran ◽

S. R. Balasundaram

Keyword(s):

Big Data ◽

Execution Time ◽

Clustering Algorithm ◽

Graph Clustering ◽

Data Placement ◽

Data Locality ◽

Query Execution ◽

Data Set ◽

Statistical Measures ◽

Default Data

This article describes how the time taken to execute a query and return the results increases exponentially as the data size increases, leading to longer waiting times for the user. Hadoop, with its distributed processing capability, is considered an efficient solution for processing such large data. Hadoop's Default Data Placement Strategy (HDDPS) allocates data blocks randomly across the cluster of nodes without considering any execution parameters. As a result, the blocks required for execution are often unavailable on the local machine, so the data has to be transferred across the network, leading to a data locality issue. It is also commonly observed that most data-intensive applications show grouping semantics, so only a part of the big data set is utilized during query execution. Since such execution parameters and grouping behavior are not considered, the default placement performs poorly, resulting in shortcomings such as decreased local map task execution, increased query execution time, and query latency. To overcome these issues, an Optimal Data Placement Strategy (ODPS) based on grouping semantics is proposed. Initially, the user history log is dynamically analyzed to identify the access pattern, which is depicted as a graph. Markov clustering, a graph clustering algorithm, is applied to identify groupings within the dataset. Then, an Optimal Data Placement Algorithm (ODPA) is proposed based on the statistical measures estimated from the clustered graph. This in turn reorganizes the default data layouts in HDFS to achieve improved performance for big data sets in a heterogeneous distributed environment. The proposed strategy is tested on a 15-node cluster placed in a single-rack topology. The results show it to be more efficient for massive datasets, reducing query execution time by 26% and improving data locality by 38% compared to HDDPS.
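
For readers unfamiliar with Markov clustering, the following is a minimal textbook-style sketch of its expansion and inflation loop in pure Python. It is not the article's ODPA or its implementation; the adjacency-matrix input (assumed here to encode which data blocks are accessed together), the parameter defaults, and the function names are assumptions made for illustration.

```python
from typing import List, Set

Matrix = List[List[float]]


def normalize_columns(m: Matrix) -> Matrix:
    """Scale each column to sum to 1 so the matrix is column-stochastic."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain square-matrix product."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def markov_clustering(adjacency: Matrix, inflation: float = 2.0,
                      iterations: int = 50, tol: float = 1e-9) -> List[Set[int]]:
    """Cluster graph nodes by alternating expansion and inflation."""
    n = len(adjacency)
    # Add self-loops, then turn edge weights into transition probabilities.
    m = normalize_columns([[float(adjacency[i][j]) + (1.0 if i == j else 0.0)
                            for j in range(n)] for i in range(n)])
    for _ in range(iterations):
        expanded = matmul(m, m)                                # expansion
        inflated = normalize_columns(
            [[v ** inflation for v in row] for row in expanded])  # inflation
        delta = max(abs(inflated[i][j] - m[i][j])
                    for i in range(n) for j in range(n))
        m = inflated
        if delta < tol:
            break
    # Each row with non-zero entries lists the nodes attracted to that node.
    clusters: List[frozenset] = []
    for row in m:
        members = frozenset(j for j, v in enumerate(row) if v > 1e-6)
        if members and members not in clusters:
            clusters.append(members)
    return [set(c) for c in clusters]
```

Expansion spreads random-walk probability along longer paths, while inflation strengthens strong transitions and weakens weak ones, so probability mass concentrates inside densely connected groups. A higher `inflation` value generally yields finer clusters.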

Implications of data placement strategy to Big Data technologies based on shared-nothing architecture for geosciences

2016 IEEE International Geoscience and Remote Sensing Symposium (IGARSS) ◽

10.1109/igarss.2016.7730983 ◽

2016 ◽

Cited By ~ 1

Author(s):

Kwo-Sen Kuo ◽

Amidu Oloso ◽

Khoa Doan ◽

Thomas L Clune ◽

Hongfeng Yu

Keyword(s):

Big Data ◽

Data Placement ◽

Big Data Technologies
