Distributed training and evaluation of projection-based descriptors in Siamese Neural Networks

Author(s):
G. Kertész, S. Szénási, Z. Vámossy

Author(s):
Hao Yu, Sen Yang, Shenghuo Zhu

In distributed training of deep neural networks, parallel mini-batch SGD is widely used to speed up training with multiple workers. It uses the workers to sample local stochastic gradients in parallel, aggregates all gradients on a single server to obtain their average, and updates each worker's local model with an SGD step using the averaged gradient. Ideally, parallel mini-batch SGD achieves a linear speed-up in training time (with respect to the number of workers) compared with SGD on a single worker. In practice, however, such linear scalability is significantly limited by the growing demand for gradient communication as more workers are involved. Model averaging, which periodically averages the individual models trained on parallel workers, is another common practice for distributed training of deep neural networks since (Zinkevich et al. 2010; McDonald, Hall, and Mann 2010). Compared with parallel mini-batch SGD, the communication overhead of model averaging is significantly reduced. Extensive experimental work has verified that model averaging can still achieve a good speed-up in training time as long as the averaging interval is carefully controlled. However, it remains a theoretical mystery why such a simple heuristic works so well. This paper provides a thorough and rigorous theoretical study of why model averaging can work as well as parallel mini-batch SGD with significantly less communication overhead.
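The contrast between the two schemes can be made concrete with a small simulation. The sketch below, in plain NumPy on a toy least-squares problem (all sizes, the learning rate, and the averaging interval tau are illustrative assumptions, not taken from the paper), runs both regimes: tau = 1 communicates every step and is equivalent to parallel mini-batch SGD, while tau = 10 performs ten local SGD steps between averaging rounds and needs a tenth of the communication.

import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares problem: each worker holds its own data shard.
# Hypothetical sizes chosen for illustration only.
K, d, n = 4, 10, 256                      # workers, dimension, samples per worker
w_true = rng.normal(size=d)
X = [rng.normal(size=(n, d)) for _ in range(K)]
y = [X[k] @ w_true + 0.1 * rng.normal(size=n) for k in range(K)]

def stoch_grad(k, w, batch=32):
    """Stochastic gradient of 0.5*||X_k w - y_k||^2 on a random mini-batch."""
    idx = rng.integers(0, n, size=batch)
    Xb, yb = X[k][idx], y[k][idx]
    return Xb.T @ (Xb @ w - yb) / batch

def model_averaging(tau, rounds, lr=0.01):
    """Each worker runs `tau` local SGD steps; models are averaged each round.
    tau = 1 recovers parallel mini-batch SGD (one communication per step)."""
    w = np.zeros(d)
    comms = 0
    for _ in range(rounds):
        local_models = []
        for k in range(K):
            wk = w.copy()
            for _ in range(tau):            # local steps, no communication
                wk -= lr * stoch_grad(k, wk)
            local_models.append(wk)
        w = np.mean(local_models, axis=0)   # one round of communication
        comms += 1
    return w, comms

# Same total number of local gradient steps, very different communication cost.
w_sync, c_sync = model_averaging(tau=1, rounds=200)   # parallel mini-batch SGD
w_avg,  c_avg  = model_averaging(tau=10, rounds=20)   # periodic model averaging
print("comms (tau=1): ", c_sync, " error:", np.linalg.norm(w_sync - w_true))
print("comms (tau=10):", c_avg,  " error:", np.linalg.norm(w_avg - w_true))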


2014
Author(s):
Haşim Sak, Oriol Vinyals, Georg Heigold, Andrew Senior, Erik McDermott, ...

2019, Vol 340, pp. 233-244
Author(s):
Li He, Shuxin Zheng, Wei Chen, Zhi-Ming Ma, Tie-Yan Liu

Author(s):
Qi Xia, Zeyi Tao, Zijiang Hao, Qun Li

Training a large-scale deep neural network on a single machine becomes increasingly difficult as network models grow more complex. Distributed training provides an efficient solution, but Byzantine attacks may occur on participating workers: they may be compromised or suffer hardware failures. If they upload poisonous gradients, training becomes unstable or may even converge to a saddle point. In this paper, we propose FABA, a Fast Aggregation algorithm against Byzantine Attacks, which removes outliers from the uploaded gradients and obtains an aggregate that is close to the true gradient. We show the convergence of our algorithm. Experiments demonstrate that our algorithm achieves performance similar to the non-Byzantine case and higher efficiency than previous algorithms.
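The core removal-and-average idea can be sketched in a few lines. The function below is an illustrative reconstruction, not the paper's exact algorithm: assuming at most num_byzantine uploads are poisonous, it repeatedly drops the gradient farthest from the mean of the remaining ones, then averages what is left (the scoring rule and all names here are assumptions).

import numpy as np

def faba_style_aggregate(grads, num_byzantine):
    """Aggregate worker gradients by iteratively discarding outliers.

    Sketch of the idea described in the abstract: repeatedly drop the
    gradient farthest from the mean of the remaining ones, assuming at
    most `num_byzantine` uploads are poisonous. FABA's exact scoring
    rule may differ; this is an illustrative reconstruction.
    """
    kept = list(grads)
    for _ in range(num_byzantine):
        mean = np.mean(kept, axis=0)
        dists = [np.linalg.norm(g - mean) for g in kept]
        kept.pop(int(np.argmax(dists)))     # remove the worst outlier
    return np.mean(kept, axis=0)

# Usage: 8 honest workers near the true gradient, 2 Byzantine uploads.
rng = np.random.default_rng(1)
true_grad = np.ones(5)
grads = [true_grad + 0.05 * rng.normal(size=5) for _ in range(8)]
grads += [10.0 * rng.normal(size=5) for _ in range(2)]   # poisonous gradients
agg = faba_style_aggregate(grads, num_byzantine=2)
print(np.linalg.norm(agg - true_grad))      # close to the true gradient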


Geophysics, 2021, Vol 86 (6), pp. KS151-KS160
Author(s):
Claire Birnie, Haithem Jarraya, Fredrik Hansteen

Deep learning applications are progressing rapidly in seismic processing and interpretation tasks. However, most approaches subsample data volumes and restrict model sizes to minimize computational requirements. Subsampling the data risks losing vital spatiotemporal information that could aid training, whereas restricting model sizes can degrade model performance or, in extreme cases, render more complicated tasks such as segmentation impossible. We have determined how to tackle the two main issues in training large neural networks (NNs): memory limitations and impracticably long training times. Typically, training data are preloaded into memory prior to training, a particular challenge for seismic applications in which the data type is typically four times larger than that used for standard image processing tasks (float32 versus uint8). Based on an example from microseismic monitoring, we evaluate how more than 750 GB of data can be used to train a model via a data-generator approach, which stores in memory only the data required for the current training batch. Furthermore, efficient training of large models is illustrated by training a seven-layer U-Net with input data dimensions of [Formula: see text] (approximately [Formula: see text] million parameters). Through a batch-splitting distributed training approach, training times are reduced by a factor of four. The combination of data generators and distributed training removes any need for data subsampling or restriction of NN sizes, offering the opportunity to use larger networks, higher-resolution input data, or move from 2D to 3D problem spaces.
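A minimal sketch of such a data generator is shown below, assuming the seismic traces sit on disk as a raw float32 array (the file name, array shape, and batch size are all hypothetical, not the paper's setup). np.memmap keeps the volume on disk, so only the slices belonging to the current batch are read into memory, which is the property the abstract relies on; a batch-splitting distributed scheme would then shard each yielded batch further across devices.

import numpy as np

PATH, N_TRACES, N_SAMPLES = "volume.f32", 10_000, 1024   # assumed layout

def write_dummy_volume():
    """Create a small stand-in file so the sketch runs end to end."""
    vol = np.memmap(PATH, dtype=np.float32, mode="w+",
                    shape=(N_TRACES, N_SAMPLES))
    vol[:] = 0.0
    vol.flush()

def batch_generator(batch_size=32, shuffle=True):
    """Yield (inputs, targets) batches without preloading the volume."""
    volume = np.memmap(PATH, dtype=np.float32, mode="r",
                       shape=(N_TRACES, N_SAMPLES))
    order = np.arange(N_TRACES)
    while True:                              # loop over epochs indefinitely
        if shuffle:
            np.random.shuffle(order)
        for start in range(0, N_TRACES - batch_size + 1, batch_size):
            idx = np.sort(order[start:start + batch_size])
            batch = np.asarray(volume[idx])  # only this batch is in RAM
            yield batch, batch               # e.g. denoising-style targets

write_dummy_volume()
gen = batch_generator()
x, y = next(gen)
print(x.shape, x.dtype)                      # (32, 1024) float32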


Author(s):
Suyog Gupta, Wei Zhang, Fei Wang

Deep learning with a large number of parameters requires distributed training, where model accuracy and runtime are two important factors to be considered. However, there has been no systematic study of the tradeoff between these two factors during the model training process. This paper presents Rudra, a parameter-server-based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm, we study the impact of the synchronization protocol, stale gradient updates, mini-batch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning-rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance, and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system, to preserve model accuracy. We validate this approach using commonly used image classification benchmarks: CIFAR10 and ImageNet.
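The two mechanisms named in the abstract, learning-rate modulation for stale gradients and a bound on gradient staleness, can be illustrated with a toy parameter-server update. This is a hedged sketch: the modulation rule lr / (staleness + 1), the staleness bound, and all names below are assumptions for illustration, not Rudra's actual formulas.

import numpy as np

def stale_sgd_update(w, grad, base_lr, staleness, max_staleness=8):
    """One parameter-server update with staleness-modulated learning rate.

    Sketch of the two ideas in the abstract (the paper's exact rules may
    differ): (1) scale the learning rate down for stale gradients, here
    lr / (staleness + 1); (2) reject updates whose staleness exceeds a
    bound, which a synchronization protocol would prevent from occurring.
    """
    if staleness > max_staleness:            # staleness-bounding protocol
        return w                             # drop (or block) the update
    lr = base_lr / (staleness + 1.0)         # modulate for stale gradients
    return w - lr * grad

# Usage: a worker computed its gradient against model version 10, but the
# server is now at version 13, so the gradient is 3 steps stale.
w = np.zeros(4)
g = np.ones(4)
w = stale_sgd_update(w, g, base_lr=0.1, staleness=13 - 10)
print(w)                                     # smaller step than a fresh update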

