Strassen’s Algorithm Reloaded on GPUs

2020 ◽  
Vol 46 (1) ◽  
pp. 1-22 ◽  
Author(s):  
Jianyu Huang ◽  
Chenhan D. Yu ◽  
Robert A. van de Geijn

2011 ◽  
Vol 21 (03) ◽  
pp. 359-375 ◽  
Author(s):  
NIALL EMMART ◽  
CHARLES C. WEEMS

We have improved our prior implementation of Strassen's algorithm for high-performance multiplication of very large integers on a general-purpose graphics processor (GPU). A combination of algorithmic and implementation optimizations results in a speedup of up to a factor of 13.9 over our previous work, running on an NVIDIA GTX 295. We have also reoptimized the implementation for an NVIDIA GTX 480, from which we obtain a speedup of up to a factor of 19 in comparison with a Core i7 processor core of the same technology generation. To provide a fairer chip-to-chip comparison, we also measured total GPU throughput on a set of multiplications relative to all of the cores of a multicore chip running in parallel. We find that the GTX 480 provides a factor of six higher throughput than all four cores/eight threads of the Core i7. This paper discusses how we adapted the algorithm to operate within the limitations of the GPU and how we dealt with other issues encountered in the implementation process, including details of the memory layout of our FFTs. Compared with our earlier work, which used Karatsuba's algorithm to guide multiplication of different operand sizes built on top of Strassen's algorithm applied to fixed-size segments of the operands, we are now able to apply Strassen's algorithm directly to operands ranging in size from 255K bits to 16,320K bits.
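The earlier approach mentioned above used Karatsuba's algorithm to split large operands into smaller pieces. As a hedged illustration only (not the authors' GPU implementation, which combines this with FFT-based Strassen multiplication of fixed-size segments), the Karatsuba splitting idea can be sketched for Python integers; the `cutoff` parameter is an assumption for where recursion stops:

```python
def karatsuba(x, y, cutoff=64):
    """Multiply non-negative integers with Karatsuba splitting:
    three half-size products instead of four."""
    # Base case: fall back to the builtin multiply for small operands.
    if x.bit_length() <= cutoff or y.bit_length() <= cutoff:
        return x * y
    # Split both operands at roughly half the larger bit length.
    n = max(x.bit_length(), y.bit_length()) // 2
    mask = (1 << n) - 1
    xh, xl = x >> n, x & mask
    yh, yl = y >> n, y & mask
    # Three recursive products (instead of the four of schoolbook splitting).
    a = karatsuba(xh, yh, cutoff)          # high * high
    b = karatsuba(xl, yl, cutoff)          # low * low
    c = karatsuba(xh + xl, yh + yl, cutoff)  # cross terms, combined
    # Recombine: x*y = a*2^(2n) + (c - a - b)*2^n + b.
    return (a << (2 * n)) + ((c - a - b) << n) + b
```

The payoff is asymptotic: each level replaces four half-size multiplications with three, giving O(n^log2(3)) ≈ O(n^1.585) instead of O(n^2).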


2018 ◽  
Vol 40 (3) ◽  
pp. C305-C326 ◽  
Author(s):  
Jianyu Huang ◽  
Devin A. Matthews ◽  
Robert A. van de Geijn

1996 ◽  
Vol 06 (01) ◽  
pp. 3-12 ◽  
Author(s):  
BRIAN GRAYSON ◽  
ROBERT VAN DE GEIJN

In this paper, we give a practical high-performance parallel implementation of Strassen’s algorithm for matrix multiplication. We show how, under restricted conditions, this algorithm can be implemented plug-compatible with standard parallel matrix multiplication algorithms. Results obtained on a large Intel Paragon system show a 10–20% reduction in execution time compared to what we believe to be the fastest standard parallel matrix multiplication implementation available at this time.
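For readers unfamiliar with the underlying recursion, a minimal single-node sketch of Strassen's matrix multiplication follows; it is not the paper's parallel Paragon implementation, and the `leaf` cutoff and the power-of-two size restriction are simplifying assumptions:

```python
import numpy as np

def strassen(A, B, leaf=32):
    """Strassen recursion for square matrices whose dimension is a
    power of two; falls back to a plain multiply at small sizes."""
    n = A.shape[0]
    if n <= leaf:
        return A @ B
    h = n // 2
    # Partition each operand into four half-size blocks.
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    # Seven recursive products instead of the classical eight.
    M1 = strassen(A11 + A22, B11 + B22, leaf)
    M2 = strassen(A21 + A22, B11, leaf)
    M3 = strassen(A11, B12 - B22, leaf)
    M4 = strassen(A22, B21 - B11, leaf)
    M5 = strassen(A11 + A12, B22, leaf)
    M6 = strassen(A21 - A11, B11 + B12, leaf)
    M7 = strassen(A12 - A22, B21 + B22, leaf)
    # Recombine the seven products into the four result blocks.
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C
```

The "plug-compatible" claim in the abstract is about the interface: under the stated restrictions, a call like this can stand in for a standard parallel matrix multiply without changing the caller.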

