Strassen’s Algorithm Reloaded on GPUs

2020 ◽  
Vol 46 (1) ◽  
pp. 1-22 ◽  
Author(s):  
Jianyu Huang ◽  
Chenhan D. Yu ◽  
Robert A. van de Geijn

2011 ◽  
Vol 21 (03) ◽  
pp. 359-375 ◽  
Author(s):  
NIALL EMMART ◽  
CHARLES C. WEEMS

We have improved our prior implementation of Strassen's algorithm for high-performance multiplication of very large integers on a general-purpose graphics processor (GPU). A combination of algorithmic and implementation optimizations results in a speedup of up to a factor of 13.9 over our previous work, running on an NVIDIA GTX 295. We have also reoptimized the implementation for an NVIDIA GTX 480, from which we obtain a speedup of up to a factor of 19 in comparison with a Core i7 processor core of the same technology generation. To provide a fairer chip-to-chip comparison, we also measured total GPU throughput on a set of multiplications relative to all of the cores of a multicore chip running in parallel. We find that the GTX 480 provides a factor of six higher throughput than all four cores/eight threads of the Core i7. This paper discusses how we adapted the algorithm to operate within the limitations of the GPU and how we dealt with other issues encountered in the implementation process, including details of the memory layout of our FFTs. Compared with our earlier work, which used Karatsuba's algorithm to guide multiplication of different operand sizes built on top of Strassen's algorithm applied to fixed-size segments of the operands, we are now able to apply Strassen's algorithm directly to operands ranging in size from 255K bits to 16,320K bits.
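The earlier approach mentioned above used Karatsuba's algorithm to split large operands into smaller pieces. As a hedged illustration only (not the authors' GPU implementation, which combines this with FFT-based Strassen multiplication of fixed-size segments), the Karatsuba splitting idea can be sketched for Python integers; the `cutoff` parameter is an assumption for where recursion stops:

```python
def karatsuba(x, y, cutoff=64):
    """Multiply non-negative integers with Karatsuba splitting:
    three half-size products instead of four."""
    # Base case: fall back to the builtin multiply for small operands.
    if x.bit_length() <= cutoff or y.bit_length() <= cutoff:
        return x * y
    # Split both operands at roughly half the larger bit length.
    n = max(x.bit_length(), y.bit_length()) // 2
    mask = (1 << n) - 1
    xh, xl = x >> n, x & mask
    yh, yl = y >> n, y & mask
    # Three recursive products (instead of the four of schoolbook splitting).
    a = karatsuba(xh, yh, cutoff)          # high * high
    b = karatsuba(xl, yl, cutoff)          # low * low
    c = karatsuba(xh + xl, yh + yl, cutoff)  # cross terms, combined
    # Recombine: x*y = a*2^(2n) + (c - a - b)*2^n + b.
    return (a << (2 * n)) + ((c - a - b) << n) + b
```

The payoff is asymptotic: each level replaces four half-size multiplications with three, giving O(n^log2(3)) ≈ O(n^1.585) instead of O(n^2).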


2018 ◽  
Vol 40 (3) ◽  
pp. C305-C326 ◽  
Author(s):  
Jianyu Huang ◽  
Devin A. Matthews ◽  
Robert A. van de Geijn

1996 ◽  
Vol 06 (01) ◽  
pp. 3-12 ◽  
Author(s):  
BRIAN GRAYSON ◽  
ROBERT VAN DE GEIJN

In this paper, we give a practical high-performance parallel implementation of Strassen’s algorithm for matrix multiplication. We show how, under restricted conditions, this algorithm can be implemented plug-compatible with standard parallel matrix multiplication algorithms. Results obtained on a large Intel Paragon system show a 10–20% reduction in execution time compared to what we believe to be the fastest standard parallel matrix multiplication implementation available at this time.
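For readers unfamiliar with the underlying recursion, a minimal single-node sketch of Strassen's matrix multiplication follows; it is not the paper's parallel Paragon implementation, and the `leaf` cutoff and the power-of-two size restriction are simplifying assumptions:

```python
import numpy as np

def strassen(A, B, leaf=32):
    """Strassen recursion for square matrices whose dimension is a
    power of two; falls back to a plain multiply at small sizes."""
    n = A.shape[0]
    if n <= leaf:
        return A @ B
    h = n // 2
    # Partition each operand into four half-size blocks.
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    # Seven recursive products instead of the classical eight.
    M1 = strassen(A11 + A22, B11 + B22, leaf)
    M2 = strassen(A21 + A22, B11, leaf)
    M3 = strassen(A11, B12 - B22, leaf)
    M4 = strassen(A22, B21 - B11, leaf)
    M5 = strassen(A11 + A12, B22, leaf)
    M6 = strassen(A21 - A11, B11 + B12, leaf)
    M7 = strassen(A12 - A22, B21 + B22, leaf)
    # Recombine the seven products into the four result blocks.
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C
```

The "plug-compatible" claim in the abstract is about the interface: under the stated restrictions, a call like this can stand in for a standard parallel matrix multiply without changing the caller.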

