scholarly journals A data-parallel implementation of O(N) hierarchical N-body methods

Author(s):  
Yu Hu ◽  
S. Lennart Johnsson
2011 ◽  
Vol 21 (01) ◽  
pp. 31-47 ◽  
Author(s):  
NOEL LOPES ◽  
BERNARDETE RIBEIRO

The Graphics Processing Unit (GPU) originally designed for rendering graphics and which is difficult to program for other tasks, has since evolved into a device suitable for general-purpose computations. As a result graphics hardware has become progressively more attractive yielding unprecedented performance at a relatively low cost. Thus, it is the ideal candidate to accelerate a wide variety of data parallel tasks in many fields such as in Machine Learning (ML). As problems become more and more demanding, parallel implementations of learning algorithms are crucial for a useful application. In particular, the implementation of Neural Networks (NNs) in GPUs can significantly reduce the long training times during the learning process. In this paper we present a GPU parallel implementation of the Back-Propagation (BP) and Multiple Back-Propagation (MBP) algorithms, and describe the GPU kernels needed for this task. The results obtained on well-known benchmarks show faster training times and improved performances as compared to the implementation in traditional hardware, due to maximized floating-point throughput and memory bandwidth. Moreover, a preliminary GPU based Autonomous Training System (ATS) is developed which aims at automatically finding high-quality NNs-based solutions for a given problem.


2014 ◽  
Vol 519-520 ◽  
pp. 719-723
Author(s):  
Guang Wang

A data parallel implementation of geometric operations is proposed and conclusions are proved. It shows that the computation complexity of data parallel implementation scheme presented in this paper is Ο(M+N). It can be used to improve the efficiency of geometric operations and can easily meet the real time requirements of the digital image processing.


2009 ◽  
Vol 2009 ◽  
pp. 1-14 ◽  
Author(s):  
Xinyu Li ◽  
Omar Hammami

Embedded system design is increasingly based on single chip multiprocessors because of the high performance and flexibility requirements. Embedded multiprocessors on FPGA provide the additional flexibility by allowing customization through addition of hardware accelerators on FPGA when parallel software implementation does not provide the expected performance. And the overall multiprocessor architecture is still kept for additional applications. This provides a transition to software only parallel implementation while avoiding pure hardware implementation. An automatic design flow is proposed well suited for data flow signal processing exhibiting both pipelining and data parallel mode of execution. Fork-Join model-based software parallelization is explored to find out the best parallelization configuration. C-based synthesis coprocessor is added to improve performance with more hardware resource usage. The Triple Data Encryption Standard (TDES) cryptographic algorithm on a 48-PE single-chip distributed memory multiprocessor is selected as an application example of the flow.


Sign in / Sign up

Export Citation Format

Share Document