Collaborative Learning Based Straggler Prevention in Large-Scale Distributed Computing Framework

2021 ◽  
Vol 2021 ◽  
pp. 1-9
Author(s):  
Shyam Deshmukh ◽  
Komati Thirupathi Rao ◽  
Mohammad Shabaz

Modern big data applications tend to prefer a cluster computing approach, as they rely on a distributed computing framework that serves users' jobs on demand. The framework processes jobs rapidly by subdividing them into tasks that execute in parallel. Because of the complex environment and hardware and software issues, some tasks may run slowly and delay job completion; such slow-running tasks are known as stragglers. The performance of a distributed computing framework is bottlenecked by straggling nodes, which prolong job execution time due to factors such as shared resources, heavy system load, or hardware issues. Many state-of-the-art approaches use independent models per node and workload. As nodes and workloads increase, so does the number of models, and even with large numbers of nodes, not every node is able to capture stragglers, since sufficient training data on straggler patterns may not be available, yielding suboptimal straggler prediction. To alleviate these problems, we propose a novel collaborative learning-based approach for straggler prediction built on the alternating direction method of multipliers (ADMM), which is resource-efficient and learns how to mitigate stragglers without moving data to a centralized location. The proposed framework shares information among the various models, allowing us to use larger training data and to reduce training time by avoiding data transfer. We rigorously evaluate the proposed method on various datasets and achieve high accuracy.
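The collaborative idea in the abstract, where each node fits a local straggler-prediction model while only model parameters (never raw data) are exchanged with a coordinator, can be sketched with consensus ADMM. This is an illustrative toy, not the paper's implementation: the use of logistic regression as the local predictor, the gradient-step x-update, and all hyperparameters are assumptions.

```python
import numpy as np

def local_update(w_local, z, u, X, y, rho=1.0, lr=0.1, steps=20):
    """ADMM x-update: gradient steps on the local logistic loss
    plus the proximal term pulling toward the consensus z."""
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w_local))
        grad = X.T @ (p - y) / len(y) + rho * (w_local - z + u)
        w_local = w_local - lr * grad
    return w_local

def consensus_admm(node_data, d, rounds=30, rho=1.0):
    """Consensus ADMM: each node trains on its own shard; only weight
    vectors are shared, so no raw data leaves a node."""
    n = len(node_data)
    ws = [np.zeros(d) for _ in range(n)]
    us = [np.zeros(d) for _ in range(n)]
    z = np.zeros(d)
    for _ in range(rounds):
        ws = [local_update(w, z, u, X, y, rho)
              for w, u, (X, y) in zip(ws, us, node_data)]
        z = np.mean([w + u for w, u in zip(ws, us)], axis=0)  # z-update
        us = [u + w - z for w, u in zip(ws, us)]              # dual update
    return z
```

The consensus vector z plays the role of the shared straggler model: every node contributes its locally learned patterns, so nodes with few straggler examples still benefit from the others.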

2013 ◽  
Vol 765-767 ◽  
pp. 1087-1091
Author(s):  
Hong Lin ◽  
Shou Gang Chen ◽  
Bao Hui Wang

Recently, with the development of the Internet and the emergence of new application modes, data storage has taken on new characteristics and new requirements. In this paper, a Distributed Computing Framework Mass Small File storage System (Dnet FS for short), based on Windows Communication Foundation on the .NET platform, is presented. The system is lightweight, highly extensible, runs on inexpensive hardware, supports large-scale concurrent access, and provides a degree of fault tolerance. The framework of the system is analyzed, and its performance is tested and compared; the results show that the system meets these requirements.
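Dnet FS's internal design is not detailed in the abstract, but mass small-file storage systems commonly avoid per-file filesystem overhead by packing many small files into one container with an index of offsets. The sketch below (in Python rather than .NET, and purely illustrative; the class name and layout are assumptions, not Dnet FS's actual design) shows that packing idea.

```python
import io

class SmallFilePacker:
    """Illustrative small-file container: append each file's bytes to a
    single blob and keep an index of (offset, length) per name."""
    def __init__(self):
        self.blob = io.BytesIO()
        self.index = {}

    def put(self, name, data: bytes):
        off = self.blob.seek(0, io.SEEK_END)  # always append at the end
        self.blob.write(data)
        self.index[name] = (off, len(data))

    def get(self, name) -> bytes:
        off, length = self.index[name]
        self.blob.seek(off)
        return self.blob.read(length)
```

One large container file keeps metadata overhead low and makes large-scale concurrent reads cheap, since each lookup is a single seek-and-read against a known offset.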


2021 ◽  
Vol 15 (1) ◽  
pp. 127-140
Author(s):  
Muhammad Adnan ◽  
Yassaman Ebrahimzadeh Maboud ◽  
Divya Mahajan ◽  
Prashant J. Nair

Recommender models are commonly used to suggest relevant items to users in e-commerce and online-advertising applications. These models use massive embedding tables to store numerical representations of items' and users' categorical variables (memory intensive) and employ neural networks (compute intensive) to generate final recommendations. Training these large-scale recommendation models requires ever-increasing data and compute resources. The highly parallel neural-network portion of these models can benefit from GPU acceleration; however, large embedding tables often cannot fit in the limited-capacity GPU device memory. Hence, this paper takes a deep dive into the semantics of training data and obtains insights about the feature access, transfer, and usage patterns of these models. We observe that, due to the popularity of certain inputs, accesses to the embeddings are highly skewed, with a few embedding entries being accessed up to 10000X more often than others. This paper leverages this asymmetric access pattern to offer a framework, called FAE, which proposes a hot-embedding-aware data layout for training recommender models. The layout uses the scarce GPU memory to store the most frequently accessed embeddings, thus reducing data transfers from CPU to GPU. At the same time, FAE engages the GPU to accelerate execution on these hot embedding entries. Experiments on production-scale recommendation models with real datasets show that FAE reduces overall training time by 2.3X and 1.52X compared to XDL CPU-only and XDL CPU-GPU execution, respectively, while maintaining baseline accuracy.
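The hot-embedding layout can be illustrated with a small sketch: count accesses in a profiling log, pin the most popular embedding rows in the fast (GPU-resident) store, and fall back to the large slow (CPU-resident) store for the rest. The function names and dict-based stores below are illustrative assumptions, not FAE's actual implementation.

```python
import numpy as np
from collections import Counter

def split_hot_cold(access_log, table, gpu_capacity):
    """Place the most frequently accessed embedding rows in the scarce
    fast store; everything else stays in the large slow store."""
    freq = Counter(access_log)
    hot_ids = [i for i, _ in freq.most_common(gpu_capacity)]
    hot = {i: table[i] for i in hot_ids}
    cold = {i: table[i] for i in range(len(table)) if i not in hot}
    return hot, cold

def lookup(idx, hot, cold):
    # The hot path avoids a CPU-to-GPU transfer for popular entries.
    return hot[idx] if idx in hot else cold[idx]
```

Because accesses are highly skewed, even a small `gpu_capacity` captures most lookups, which is what lets FAE cut CPU-to-GPU traffic so sharply.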


2015 ◽  
Vol 2015 ◽  
pp. 1-13 ◽  
Author(s):  
Yang Liu ◽  
Jie Yang ◽  
Yuan Huang ◽  
Lixiong Xu ◽  
Siguang Li ◽  
...  

Artificial neural networks (ANNs) have been widely used in pattern recognition and classification applications. However, ANNs are notably slow to compute, especially when the data size is large. Big data has recently gained momentum in both industry and academia. To fulfill the potential of ANNs for big data applications, the computation process must be sped up. To this end, this paper parallelizes neural networks based on MapReduce, which has become a major computing model for data-intensive applications. Three data-intensive scenarios are considered in the parallelization process, in terms of the volume of classification data, the size of the training data, and the number of neurons in the neural network. The performance of the parallelized neural networks is evaluated on an experimental MapReduce computer cluster in terms of classification accuracy and computational efficiency.
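The map/reduce split for the training-data scenario can be sketched as: each mapper computes the network's gradient over its own data shard, and the reducer averages the partial gradients before the next update. The single logistic unit standing in for the network, and all names below, are simplifying assumptions for illustration, not the paper's architecture.

```python
import numpy as np

def map_gradient(w, shard):
    """Mapper: gradient of the logistic loss over one data shard,
    plus the shard size for weighting in the reducer."""
    X, y = shard
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return X.T @ (p - y), len(y)

def reduce_gradients(partials):
    """Reducer: sum per-shard gradients and counts, return the average."""
    g = sum(p[0] for p in partials)
    n = sum(p[1] for p in partials)
    return g / n

def train(shards, d, lr=0.5, epochs=50):
    """Driver: one map/reduce pass per epoch, then a gradient step."""
    w = np.zeros(d)
    for _ in range(epochs):
        w -= lr * reduce_gradients([map_gradient(w, s) for s in shards])
    return w
```

Since each mapper touches only its shard, the per-epoch cost shrinks roughly linearly with the number of workers, which is the speedup the paper targets.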


2014 ◽  
Vol 513-517 ◽  
pp. 1989-1993
Author(s):  
Ya Ping Zhang ◽  
Jun Qin ◽  
Zhao Zhai

The distributed computing framework MapReduce has been widely used in large companies as a powerful tool for processing large-scale data. This paper introduces the existing MapReduce job-scheduling algorithms and analyzes the two major ones. It points out the defects of multi-task scheduling when processing massive numbers of jobs and proposes a multi-task cluster scheduling algorithm, MSBACO, based on ant colony optimization. Experimental results demonstrate its effectiveness and stability in heterogeneous environments.
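MSBACO's internals are not given in the abstract, but the general ant-colony scheduling idea can be sketched: ants probabilistically assign tasks to machines, guided by pheromone trails that are evaporated each round and reinforced along the best (shortest-makespan) schedule found so far. The load-aware heuristic and all parameters below are illustrative assumptions, not the MSBACO algorithm itself.

```python
import random

def aco_schedule(task_times, n_machines, ants=20, iters=30, rho=0.1, seed=0):
    """Toy ant-colony scheduler minimizing makespan."""
    rng = random.Random(seed)
    n = len(task_times)
    tau = [[1.0] * n_machines for _ in range(n)]  # pheromone per (task, machine)
    best, best_ms = None, float("inf")
    for _ in range(iters):
        for _ in range(ants):
            load = [0.0] * n_machines
            assign = []
            for t in range(n):
                # Prefer high pheromone and lightly loaded machines.
                weights = [tau[t][m] / (1.0 + load[m]) for m in range(n_machines)]
                m = rng.choices(range(n_machines), weights=weights)[0]
                assign.append(m)
                load[m] += task_times[t]
            ms = max(load)  # makespan of this ant's schedule
            if ms < best_ms:
                best, best_ms = assign, ms
        for t in range(n):  # evaporate, then reinforce the best schedule
            for m in range(n_machines):
                tau[t][m] *= (1 - rho)
            tau[t][best[t]] += 1.0 / best_ms
    return best, best_ms
```

Reinforcing only the global best schedule keeps the colony converging toward balanced assignments even when machine speeds or loads are heterogeneous.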


Author(s):  
Ragmi Mustafa ◽  
Basri Ahmedi ◽  
Kujtim Mustafa

Nowadays, a huge number of images is produced by different types of devices, and when we need to store them or transfer them to other devices or over the Internet, we need to compress them, because images usually take up a large amount of space. Compressing them reduces the time needed to transfer files. Compression can be done with different methods and software in order to reduce their size, which for collections of files can range from megabytes to hundreds of gigabytes. It is well known that the speed of information transmission depends mainly on its quantity, that is, the size of the information package. Image compression is a very important task for data transfer and data storage, especially nowadays, given the proliferation of image acquisition devices. Without compression, these data may occupy an immense amount of memory or make data transmission difficult. Artificial neural networks (ANNs) have demonstrated good capabilities for lossy image compression. The ANN algorithm we investigate is BEP-SOFM, which uses a Backward Error Propagation (BEP) algorithm to quickly obtain initial weights; these weights are then used to shorten the training time required by the Self-Organizing Feature Map (SOFM) algorithm. To obtain these initial weights with the BEP algorithm, we analyze a hierarchical approach, which consists of preparing the image for compression using the quadtree data structure, segmenting the image into blocks of different sizes. Small blocks represent image areas with a large amount of detail, while larger blocks represent areas with few observed details. Tests demonstrate that the quadtree segmentation approach quickly leads to the initial weights using the BEP algorithm.
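The quadtree preparation step can be sketched directly: recursively split a block into four quadrants while its pixel variance is high, so detailed regions end up as small blocks and smooth regions as large ones. The variance threshold and minimum block size below are illustrative assumptions; the paper does not specify its split criterion.

```python
import numpy as np

def quadtree_blocks(img, x=0, y=0, size=None, var_thresh=100.0, min_size=2):
    """Segment a square power-of-two image into variable-size blocks:
    split a block into four quadrants while its variance is high."""
    if size is None:
        size = img.shape[0]
    block = img[y:y + size, x:x + size]
    if size <= min_size or block.var() <= var_thresh:
        return [(x, y, size)]  # homogeneous (or smallest) block: keep whole
    h = size // 2
    return (quadtree_blocks(img, x, y, h, var_thresh, min_size)
            + quadtree_blocks(img, x + h, y, h, var_thresh, min_size)
            + quadtree_blocks(img, x, y + h, h, var_thresh, min_size)
            + quadtree_blocks(img, x + h, y + h, h, var_thresh, min_size))
```

The resulting block list is what the BEP stage would train on: many small blocks where detail is dense, a few large blocks where the image is smooth.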


2021 ◽  
Vol 2021 ◽  
pp. 1-11
Author(s):  
Yunsheng Song ◽  
Xiaohan Kong ◽  
Shuoping Huang ◽  
Chao Zhang

Logistic regression has been widely used in artificial intelligence and machine learning due to its deep theoretical basis and good practical performance. Its training process solves a large-scale optimization problem defined by a likelihood function, for which gradient descent is the most commonly used approach. However, when the data size is large, training is very time-consuming, because the gradient is computed over all the training data in every iteration. Although this difficulty can be addressed by random sampling, the appropriate sample size is hard to predetermine, and the resulting estimate may not be robust. To overcome this deficiency, we propose a novel algorithm for fast training of logistic regression via adaptive sampling. The proposed method decomposes the gradient-estimation problem into several subproblems, one per dimension; each subproblem is then solved independently by adaptive sampling. Each element of the gradient estimate is obtained by repeatedly sampling a fixed-size batch of training examples until a stopping criterion is satisfied. The final estimate combines the results of all the subproblems. We prove that the obtained gradient estimate is robust and keeps the objective function value decreasing across iterations. Compared with representative algorithms using random sampling, experimental results show that the proposed algorithm obtains comparable classification performance with much less training time.
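The per-coordinate adaptive sampling loop can be sketched as: for each gradient dimension, keep drawing fixed-size batches until the uncertainty of that coordinate's running mean falls below a threshold. The standard-error stopping rule used here is one plausible choice and an assumption on our part, not necessarily the paper's criterion.

```python
import numpy as np

def adaptive_gradient(w, X, y, batch=64, eps=0.05, max_draws=50, seed=0):
    """Estimate each coordinate of the logistic-regression gradient by
    repeatedly sampling fixed-size batches until the standard error of
    the running mean drops below eps (per-coordinate stopping rule)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    g = np.zeros(d)
    for j in range(d):  # one subproblem per dimension
        draws = []
        for _ in range(max_draws):
            idx = rng.integers(0, n, size=batch)
            p = 1.0 / (1.0 + np.exp(-X[idx] @ w))
            draws.append(np.mean((p - y[idx]) * X[idx, j]))
            if len(draws) >= 3 and np.std(draws) / np.sqrt(len(draws)) < eps:
                break  # this coordinate's estimate is stable enough
        g[j] = np.mean(draws)  # combine draws into the final element
    return g
```

Coordinates whose gradient is noisy automatically draw more batches, while stable coordinates stop early, which is where the training-time savings over fixed-size random sampling come from.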

