Distributed training and evaluation of projection-based descriptors in Siamese Neural Networks

Author(s):
G. Kertész, S. Szénási, Z. Vámossy

Author(s):
Hao Yu, Sen Yang, Shenghuo Zhu

In distributed training of deep neural networks, parallel mini-batch SGD is widely used to speed up training with multiple workers. It uses the workers to sample local stochastic gradients in parallel, aggregates all gradients on a single server to obtain their average, and updates each worker's local model with an SGD step using the averaged gradient. Ideally, parallel mini-batch SGD achieves a linear speed-up in training time (with respect to the number of workers) compared with SGD on a single worker. In practice, however, such linear scalability is significantly limited by the growing demand for gradient communication as more workers are involved. Model averaging, which periodically averages the individual models trained on parallel workers, is another common practice for distributed training of deep neural networks since (Zinkevich et al. 2010; McDonald, Hall, and Mann 2010). Compared with parallel mini-batch SGD, the communication overhead of model averaging is significantly reduced. Extensive experimental work has verified that model averaging can still achieve a good speed-up in training time as long as the averaging interval is carefully controlled. However, it remains a theoretical mystery why such a simple heuristic works so well. This paper provides a thorough and rigorous theoretical study of why model averaging can work as well as parallel mini-batch SGD with significantly less communication overhead.
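The contrast between the two schemes can be made concrete with a small simulation. The sketch below, in plain NumPy on a toy least-squares problem (all sizes, the learning rate, and the averaging interval tau are illustrative assumptions, not taken from the paper), runs both regimes: tau = 1 communicates every step and is equivalent to parallel mini-batch SGD, while tau = 10 performs ten local SGD steps between averaging rounds and needs a tenth of the communication.

import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares problem: each worker holds its own data shard.
# Hypothetical sizes chosen for illustration only.
K, d, n = 4, 10, 256                      # workers, dimension, samples per worker
w_true = rng.normal(size=d)
X = [rng.normal(size=(n, d)) for _ in range(K)]
y = [X[k] @ w_true + 0.1 * rng.normal(size=n) for k in range(K)]

def stoch_grad(k, w, batch=32):
    """Stochastic gradient of 0.5*||X_k w - y_k||^2 on a random mini-batch."""
    idx = rng.integers(0, n, size=batch)
    Xb, yb = X[k][idx], y[k][idx]
    return Xb.T @ (Xb @ w - yb) / batch

def model_averaging(tau, rounds, lr=0.01):
    """Each worker runs `tau` local SGD steps; models are averaged each round.
    tau = 1 recovers parallel mini-batch SGD (one communication per step)."""
    w = np.zeros(d)
    comms = 0
    for _ in range(rounds):
        local_models = []
        for k in range(K):
            wk = w.copy()
            for _ in range(tau):            # local steps, no communication
                wk -= lr * stoch_grad(k, wk)
            local_models.append(wk)
        w = np.mean(local_models, axis=0)   # one round of communication
        comms += 1
    return w, comms

# Same total number of local gradient steps, very different communication cost.
w_sync, c_sync = model_averaging(tau=1, rounds=200)   # parallel mini-batch SGD
w_avg,  c_avg  = model_averaging(tau=10, rounds=20)   # periodic model averaging
print("comms (tau=1): ", c_sync, " error:", np.linalg.norm(w_sync - w_true))
print("comms (tau=10):", c_avg,  " error:", np.linalg.norm(w_avg - w_true))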


2014
Author(s):
Haşim Sak, Oriol Vinyals, Georg Heigold, Andrew Senior, Erik McDermott, ...

2019, Vol 340, pp. 233-244
Author(s):
Li He, Shuxin Zheng, Wei Chen, Zhi-Ming Ma, Tie-Yan Liu

Author(s):
Qi Xia, Zeyi Tao, Zijiang Hao, Qun Li

Training a large-scale deep neural network on a single machine becomes increasingly difficult as network models grow more complex. Distributed training provides an efficient solution, but Byzantine attacks may occur on participating workers: they may be compromised or suffer hardware failures. If they upload poisonous gradients, training becomes unstable or may even converge to a saddle point. In this paper, we propose FABA, a Fast Aggregation algorithm against Byzantine Attacks, which removes outliers from the uploaded gradients and obtains an aggregate that is close to the true gradient. We show the convergence of our algorithm. Experiments demonstrate that our algorithm achieves performance similar to the non-Byzantine case and higher efficiency than previous algorithms.
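The core removal-and-average idea can be sketched in a few lines. The function below is an illustrative reconstruction, not the paper's exact algorithm: assuming at most num_byzantine uploads are poisonous, it repeatedly drops the gradient farthest from the mean of the remaining ones, then averages what is left (the scoring rule and all names here are assumptions).

import numpy as np

def faba_style_aggregate(grads, num_byzantine):
    """Aggregate worker gradients by iteratively discarding outliers.

    Sketch of the idea described in the abstract: repeatedly drop the
    gradient farthest from the mean of the remaining ones, assuming at
    most `num_byzantine` uploads are poisonous. FABA's exact scoring
    rule may differ; this is an illustrative reconstruction.
    """
    kept = list(grads)
    for _ in range(num_byzantine):
        mean = np.mean(kept, axis=0)
        dists = [np.linalg.norm(g - mean) for g in kept]
        kept.pop(int(np.argmax(dists)))     # remove the worst outlier
    return np.mean(kept, axis=0)

# Usage: 8 honest workers near the true gradient, 2 Byzantine uploads.
rng = np.random.default_rng(1)
true_grad = np.ones(5)
grads = [true_grad + 0.05 * rng.normal(size=5) for _ in range(8)]
grads += [10.0 * rng.normal(size=5) for _ in range(2)]   # poisonous gradients
agg = faba_style_aggregate(grads, num_byzantine=2)
print(np.linalg.norm(agg - true_grad))      # close to the true gradient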


Geophysics, 2021, Vol 86 (6), pp. KS151-KS160
Author(s):
Claire Birnie, Haithem Jarraya, Fredrik Hansteen

Deep learning applications are progressing rapidly in seismic processing and interpretation tasks. However, most approaches subsample data volumes and restrict model sizes to minimize computational requirements. Subsampling the data risks losing vital spatiotemporal information that could aid training, whereas restricting model sizes can degrade model performance or, in extreme cases, render more complicated tasks such as segmentation impossible. We have determined how to tackle the two main issues in training large neural networks (NNs): memory limitations and impracticably long training times. Typically, training data are preloaded into memory prior to training, a particular challenge for seismic applications in which the data type is typically four times larger than that used for standard image processing tasks (float32 versus uint8). Based on an example from microseismic monitoring, we evaluate how more than 750 GB of data can be used to train a model via a data-generator approach, which stores in memory only the data required for the current training batch. Furthermore, efficient training of large models is illustrated by training a seven-layer U-Net with input data dimensions of [Formula: see text] (approximately [Formula: see text] million parameters). Through a batch-splitting distributed training approach, training times are reduced by a factor of four. The combination of data generators and distributed training removes any need for data subsampling or restriction of NN sizes, offering the opportunity to use larger networks, higher-resolution input data, or move from 2D to 3D problem spaces.
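A minimal sketch of such a data generator is shown below, assuming the seismic traces sit on disk as a raw float32 array (the file name, array shape, and batch size are all hypothetical, not the paper's setup). np.memmap keeps the volume on disk, so only the slices belonging to the current batch are read into memory, which is the property the abstract relies on; a batch-splitting distributed scheme would then shard each yielded batch further across devices.

import numpy as np

PATH, N_TRACES, N_SAMPLES = "volume.f32", 10_000, 1024   # assumed layout

def write_dummy_volume():
    """Create a small stand-in file so the sketch runs end to end."""
    vol = np.memmap(PATH, dtype=np.float32, mode="w+",
                    shape=(N_TRACES, N_SAMPLES))
    vol[:] = 0.0
    vol.flush()

def batch_generator(batch_size=32, shuffle=True):
    """Yield (inputs, targets) batches without preloading the volume."""
    volume = np.memmap(PATH, dtype=np.float32, mode="r",
                       shape=(N_TRACES, N_SAMPLES))
    order = np.arange(N_TRACES)
    while True:                              # loop over epochs indefinitely
        if shuffle:
            np.random.shuffle(order)
        for start in range(0, N_TRACES - batch_size + 1, batch_size):
            idx = np.sort(order[start:start + batch_size])
            batch = np.asarray(volume[idx])  # only this batch is in RAM
            yield batch, batch               # e.g. denoising-style targets

write_dummy_volume()
gen = batch_generator()
x, y = next(gen)
print(x.shape, x.dtype)                      # (32, 1024) float32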


Author(s):
Suyog Gupta, Wei Zhang, Fei Wang

Deep learning with a large number of parameters requires distributed training, where model accuracy and runtime are two important factors to be considered. However, there has been no systematic study of the tradeoff between these two factors during the model training process. This paper presents Rudra, a parameter-server-based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm, we study the impact of the synchronization protocol, stale gradient updates, mini-batch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning-rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance, and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system, to preserve model accuracy. We validate this approach using commonly used image classification benchmarks: CIFAR10 and ImageNet.
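The two mechanisms named in the abstract, learning-rate modulation for stale gradients and a bound on gradient staleness, can be illustrated with a toy parameter-server update. This is a hedged sketch: the modulation rule lr / (staleness + 1), the staleness bound, and all names below are assumptions for illustration, not Rudra's actual formulas.

import numpy as np

def stale_sgd_update(w, grad, base_lr, staleness, max_staleness=8):
    """One parameter-server update with staleness-modulated learning rate.

    Sketch of the two ideas in the abstract (the paper's exact rules may
    differ): (1) scale the learning rate down for stale gradients, here
    lr / (staleness + 1); (2) reject updates whose staleness exceeds a
    bound, which a synchronization protocol would prevent from occurring.
    """
    if staleness > max_staleness:            # staleness-bounding protocol
        return w                             # drop (or block) the update
    lr = base_lr / (staleness + 1.0)         # modulate for stale gradients
    return w - lr * grad

# Usage: a worker computed its gradient against model version 10, but the
# server is now at version 13, so the gradient is 3 steps stale.
w = np.zeros(4)
g = np.ones(4)
w = stale_sgd_update(w, g, base_lr=0.1, staleness=13 - 10)
print(w)                                     # smaller step than a fresh update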

