Collaborative Learning Based Straggler Prevention in Large-Scale Distributed Computing Framework

2021 ◽  
Vol 2021 ◽  
pp. 1-9
Author(s):  
Shyam Deshmukh ◽  
Komati Thirupathi Rao ◽  
Mohammad Shabaz

Modern big data applications tend to prefer a cluster computing approach, as they rely on a distributed computing framework that serves users' jobs on demand. The framework processes jobs rapidly by subdividing them into tasks that execute in parallel. Because of the complex environment and hardware and software issues, some tasks may run slowly and delay job completion; such slow-running tasks are known as stragglers. The performance of a distributed computing framework is bottlenecked by straggling nodes, which prolong job execution time due to factors such as shared resources, heavy system load, or hardware issues. Many state-of-the-art approaches use independent models per node and workload. As nodes and workloads increase, so does the number of models, and even with large numbers of nodes, not every node is able to capture stragglers, since sufficient training data on straggler patterns may not be available, yielding suboptimal straggler prediction. To alleviate these problems, we propose a novel collaborative learning-based approach for straggler prediction built on the alternating direction method of multipliers (ADMM), which is resource-efficient and learns how to mitigate stragglers without moving data to a centralized location. The proposed framework shares information among the various models, allowing us to use larger training data and to reduce training time by avoiding data transfer. We rigorously evaluate the proposed method on various datasets and achieve high accuracy.
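The collaborative idea in the abstract, where each node fits a local straggler-prediction model while only model parameters (never raw data) are exchanged with a coordinator, can be sketched with consensus ADMM. This is an illustrative toy, not the paper's implementation: the use of logistic regression as the local predictor, the gradient-step x-update, and all hyperparameters are assumptions.

```python
import numpy as np

def local_update(w_local, z, u, X, y, rho=1.0, lr=0.1, steps=20):
    """ADMM x-update: gradient steps on the local logistic loss
    plus the proximal term pulling toward the consensus z."""
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w_local))
        grad = X.T @ (p - y) / len(y) + rho * (w_local - z + u)
        w_local = w_local - lr * grad
    return w_local

def consensus_admm(node_data, d, rounds=30, rho=1.0):
    """Consensus ADMM: each node trains on its own shard; only weight
    vectors are shared, so no raw data leaves a node."""
    n = len(node_data)
    ws = [np.zeros(d) for _ in range(n)]
    us = [np.zeros(d) for _ in range(n)]
    z = np.zeros(d)
    for _ in range(rounds):
        ws = [local_update(w, z, u, X, y, rho)
              for w, u, (X, y) in zip(ws, us, node_data)]
        z = np.mean([w + u for w, u in zip(ws, us)], axis=0)  # z-update
        us = [u + w - z for w, u in zip(ws, us)]              # dual update
    return z
```

The consensus vector z plays the role of the shared straggler model: every node contributes its locally learned patterns, so nodes with few straggler examples still benefit from the others.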

2013 ◽  
Vol 765-767 ◽  
pp. 1087-1091
Author(s):  
Hong Lin ◽  
Shou Gang Chen ◽  
Bao Hui Wang

Recently, with the development of the Internet and the emergence of new application modes, data storage has taken on new characteristics and new requirements. In this paper, a Distributed Computing Framework Mass Small File storage System (Dnet FS for short), based on Windows Communication Foundation on the .NET platform, is presented. The system is lightweight, highly extensible, runs on inexpensive hardware, supports large-scale concurrent access, and provides a degree of fault tolerance. The framework of the system is analyzed, and its performance is tested and compared; the results show that the system meets these requirements.
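Dnet FS's internal design is not detailed in the abstract, but mass small-file storage systems commonly avoid per-file filesystem overhead by packing many small files into one container with an index of offsets. The sketch below (in Python rather than .NET, and purely illustrative; the class name and layout are assumptions, not Dnet FS's actual design) shows that packing idea.

```python
import io

class SmallFilePacker:
    """Illustrative small-file container: append each file's bytes to a
    single blob and keep an index of (offset, length) per name."""
    def __init__(self):
        self.blob = io.BytesIO()
        self.index = {}

    def put(self, name, data: bytes):
        off = self.blob.seek(0, io.SEEK_END)  # always append at the end
        self.blob.write(data)
        self.index[name] = (off, len(data))

    def get(self, name) -> bytes:
        off, length = self.index[name]
        self.blob.seek(off)
        return self.blob.read(length)
```

One large container file keeps metadata overhead low and makes large-scale concurrent reads cheap, since each lookup is a single seek-and-read against a known offset.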


2021 ◽  
Vol 15 (1) ◽  
pp. 127-140
Author(s):  
Muhammad Adnan ◽  
Yassaman Ebrahimzadeh Maboud ◽  
Divya Mahajan ◽  
Prashant J. Nair

Recommender models are commonly used to suggest relevant items to users in e-commerce and online-advertising applications. These models use massive embedding tables to store numerical representations of items' and users' categorical variables (memory intensive) and employ neural networks (compute intensive) to generate final recommendations. Training these large-scale recommendation models requires ever-increasing data and compute resources. The highly parallel neural-network portion of these models can benefit from GPU acceleration; however, large embedding tables often cannot fit in the limited-capacity GPU device memory. Hence, this paper takes a deep dive into the semantics of training data and obtains insights about the feature access, transfer, and usage patterns of these models. We observe that, due to the popularity of certain inputs, accesses to the embeddings are highly skewed, with a few embedding entries being accessed up to 10000X more often than others. This paper leverages this asymmetric access pattern to offer a framework, called FAE, which proposes a hot-embedding-aware data layout for training recommender models. The layout uses the scarce GPU memory to store the most frequently accessed embeddings, thus reducing data transfers from CPU to GPU. At the same time, FAE engages the GPU to accelerate execution on these hot embedding entries. Experiments on production-scale recommendation models with real datasets show that FAE reduces overall training time by 2.3X and 1.52X compared to XDL CPU-only and XDL CPU-GPU execution, respectively, while maintaining baseline accuracy.
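The hot-embedding layout can be illustrated with a small sketch: count accesses in a profiling log, pin the most popular embedding rows in the fast (GPU-resident) store, and fall back to the large slow (CPU-resident) store for the rest. The function names and dict-based stores below are illustrative assumptions, not FAE's actual implementation.

```python
import numpy as np
from collections import Counter

def split_hot_cold(access_log, table, gpu_capacity):
    """Place the most frequently accessed embedding rows in the scarce
    fast store; everything else stays in the large slow store."""
    freq = Counter(access_log)
    hot_ids = [i for i, _ in freq.most_common(gpu_capacity)]
    hot = {i: table[i] for i in hot_ids}
    cold = {i: table[i] for i in range(len(table)) if i not in hot}
    return hot, cold

def lookup(idx, hot, cold):
    # The hot path avoids a CPU-to-GPU transfer for popular entries.
    return hot[idx] if idx in hot else cold[idx]
```

Because accesses are highly skewed, even a small `gpu_capacity` captures most lookups, which is what lets FAE cut CPU-to-GPU traffic so sharply.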


2015 ◽  
Vol 2015 ◽  
pp. 1-13 ◽  
Author(s):  
Yang Liu ◽  
Jie Yang ◽  
Yuan Huang ◽  
Lixiong Xu ◽  
Siguang Li ◽  
...  

Artificial neural networks (ANNs) have been widely used in pattern recognition and classification applications. However, ANNs are notably slow to compute, especially when the data size is large. Big data has recently gained momentum in both industry and academia. To fulfill the potential of ANNs for big data applications, the computation process must be sped up. To this end, this paper parallelizes neural networks based on MapReduce, which has become a major computing model for data-intensive applications. Three data-intensive scenarios are considered in the parallelization process, in terms of the volume of classification data, the size of the training data, and the number of neurons in the neural network. The performance of the parallelized neural networks is evaluated on an experimental MapReduce computer cluster in terms of classification accuracy and computational efficiency.
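The map/reduce split for the training-data scenario can be sketched as: each mapper computes the network's gradient over its own data shard, and the reducer averages the partial gradients before the next update. The single logistic unit standing in for the network, and all names below, are simplifying assumptions for illustration, not the paper's architecture.

```python
import numpy as np

def map_gradient(w, shard):
    """Mapper: gradient of the logistic loss over one data shard,
    plus the shard size for weighting in the reducer."""
    X, y = shard
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return X.T @ (p - y), len(y)

def reduce_gradients(partials):
    """Reducer: sum per-shard gradients and counts, return the average."""
    g = sum(p[0] for p in partials)
    n = sum(p[1] for p in partials)
    return g / n

def train(shards, d, lr=0.5, epochs=50):
    """Driver: one map/reduce pass per epoch, then a gradient step."""
    w = np.zeros(d)
    for _ in range(epochs):
        w -= lr * reduce_gradients([map_gradient(w, s) for s in shards])
    return w
```

Since each mapper touches only its shard, the per-epoch cost shrinks roughly linearly with the number of workers, which is the speedup the paper targets.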


2014 ◽  
Vol 513-517 ◽  
pp. 1989-1993
Author(s):  
Ya Ping Zhang ◽  
Jun Qin ◽  
Zhao Zhai

The distributed computing framework MapReduce has been widely used in large companies as a powerful tool for processing large-scale data. This paper introduces the existing MapReduce job-scheduling algorithms and analyzes the two major ones. It points out the defects of multi-task scheduling when processing massive numbers of jobs and proposes a multi-task cluster scheduling algorithm, MSBACO, based on ant colony optimization. Experimental results demonstrate its effectiveness and stability in heterogeneous environments.
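MSBACO's internals are not given in the abstract, but the general ant-colony scheduling idea can be sketched: ants probabilistically assign tasks to machines, guided by pheromone trails that are evaporated each round and reinforced along the best (shortest-makespan) schedule found so far. The load-aware heuristic and all parameters below are illustrative assumptions, not the MSBACO algorithm itself.

```python
import random

def aco_schedule(task_times, n_machines, ants=20, iters=30, rho=0.1, seed=0):
    """Toy ant-colony scheduler minimizing makespan."""
    rng = random.Random(seed)
    n = len(task_times)
    tau = [[1.0] * n_machines for _ in range(n)]  # pheromone per (task, machine)
    best, best_ms = None, float("inf")
    for _ in range(iters):
        for _ in range(ants):
            load = [0.0] * n_machines
            assign = []
            for t in range(n):
                # Prefer high pheromone and lightly loaded machines.
                weights = [tau[t][m] / (1.0 + load[m]) for m in range(n_machines)]
                m = rng.choices(range(n_machines), weights=weights)[0]
                assign.append(m)
                load[m] += task_times[t]
            ms = max(load)  # makespan of this ant's schedule
            if ms < best_ms:
                best, best_ms = assign, ms
        for t in range(n):  # evaporate, then reinforce the best schedule
            for m in range(n_machines):
                tau[t][m] *= (1 - rho)
            tau[t][best[t]] += 1.0 / best_ms
    return best, best_ms
```

Reinforcing only the global best schedule keeps the colony converging toward balanced assignments even when machine speeds or loads are heterogeneous.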


Author(s):  
Ragmi Mustafa ◽  
Basri Ahmedi ◽  
Kujtim Mustafa

Nowadays, a huge number of images is produced by different types of devices, and when we need to store them or transfer them to other devices or over the Internet, we need to compress them, because images usually take up a large amount of space. Compressing them reduces the time needed to transfer files. Compression can be done with different methods and software in order to reduce their size, which for collections of files can range from megabytes to hundreds of gigabytes. It is well known that the speed of information transmission depends mainly on its quantity, that is, the size of the information package. Image compression is a very important task for data transfer and data storage, especially nowadays, given the proliferation of image acquisition devices. Without compression, these data may occupy an immense amount of memory or make data transmission difficult. Artificial neural networks (ANNs) have demonstrated good capabilities for lossy image compression. The ANN algorithm we investigate is BEP-SOFM, which uses a Backward Error Propagation (BEP) algorithm to quickly obtain initial weights; these weights are then used to shorten the training time required by the Self-Organizing Feature Map (SOFM) algorithm. To obtain these initial weights with the BEP algorithm, we analyze a hierarchical approach, which consists of preparing the image for compression using the quadtree data structure, segmenting the image into blocks of different sizes. Small blocks represent image areas with a large amount of detail, while larger blocks represent areas with few observed details. Tests demonstrate that the quadtree segmentation approach quickly leads to the initial weights using the BEP algorithm.
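The quadtree preparation step can be sketched directly: recursively split a block into four quadrants while its pixel variance is high, so detailed regions end up as small blocks and smooth regions as large ones. The variance threshold and minimum block size below are illustrative assumptions; the paper does not specify its split criterion.

```python
import numpy as np

def quadtree_blocks(img, x=0, y=0, size=None, var_thresh=100.0, min_size=2):
    """Segment a square power-of-two image into variable-size blocks:
    split a block into four quadrants while its variance is high."""
    if size is None:
        size = img.shape[0]
    block = img[y:y + size, x:x + size]
    if size <= min_size or block.var() <= var_thresh:
        return [(x, y, size)]  # homogeneous (or smallest) block: keep whole
    h = size // 2
    return (quadtree_blocks(img, x, y, h, var_thresh, min_size)
            + quadtree_blocks(img, x + h, y, h, var_thresh, min_size)
            + quadtree_blocks(img, x, y + h, h, var_thresh, min_size)
            + quadtree_blocks(img, x + h, y + h, h, var_thresh, min_size))
```

The resulting block list is what the BEP stage would train on: many small blocks where detail is dense, a few large blocks where the image is smooth.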


2021 ◽  
Vol 2021 ◽  
pp. 1-11
Author(s):  
Yunsheng Song ◽  
Xiaohan Kong ◽  
Shuoping Huang ◽  
Chao Zhang

Logistic regression has been widely used in artificial intelligence and machine learning due to its deep theoretical basis and good practical performance. Its training process solves a large-scale optimization problem defined by a likelihood function, for which gradient descent is the most commonly used approach. However, when the data size is large, training is very time-consuming, because the gradient is computed over all the training data in every iteration. Although this difficulty can be addressed by random sampling, the appropriate sample size is hard to predetermine, and the resulting estimate may not be robust. To overcome this deficiency, we propose a novel algorithm for fast training of logistic regression via adaptive sampling. The proposed method decomposes the gradient-estimation problem into several subproblems, one per dimension; each subproblem is then solved independently by adaptive sampling. Each element of the gradient estimate is obtained by repeatedly sampling a fixed-size batch of training examples until a stopping criterion is satisfied. The final estimate combines the results of all the subproblems. We prove that the obtained gradient estimate is robust and keeps the objective function value decreasing across iterations. Compared with representative algorithms using random sampling, experimental results show that the proposed algorithm obtains comparable classification performance with much less training time.
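The per-coordinate adaptive sampling loop can be sketched as: for each gradient dimension, keep drawing fixed-size batches until the uncertainty of that coordinate's running mean falls below a threshold. The standard-error stopping rule used here is one plausible choice and an assumption on our part, not necessarily the paper's criterion.

```python
import numpy as np

def adaptive_gradient(w, X, y, batch=64, eps=0.05, max_draws=50, seed=0):
    """Estimate each coordinate of the logistic-regression gradient by
    repeatedly sampling fixed-size batches until the standard error of
    the running mean drops below eps (per-coordinate stopping rule)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    g = np.zeros(d)
    for j in range(d):  # one subproblem per dimension
        draws = []
        for _ in range(max_draws):
            idx = rng.integers(0, n, size=batch)
            p = 1.0 / (1.0 + np.exp(-X[idx] @ w))
            draws.append(np.mean((p - y[idx]) * X[idx, j]))
            if len(draws) >= 3 and np.std(draws) / np.sqrt(len(draws)) < eps:
                break  # this coordinate's estimate is stable enough
        g[j] = np.mean(draws)  # combine draws into the final element
    return g
```

Coordinates whose gradient is noisy automatically draw more batches, while stable coordinates stop early, which is where the training-time savings over fixed-size random sampling come from.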

