Memory latency-tolerance approaches for Itanium processors: out-of-order execution vs. speculative precomputation

Author(s):  
P.H. Wang ◽  
Hong Wang ◽  
J.D. Collins ◽  
E. Grochowski ◽  
R.M. Kling ◽  
...  
IEEE Micro ◽  
2006 ◽  
Vol 26 (1) ◽  
pp. 10-20 ◽  
Author(s):  
O. Mutlu ◽  
Hyesoon Kim ◽  
Y.N. Patt

2015 ◽  
Vol 43 (1) ◽  
pp. 471-472
Author(s):  
Russell Clapp ◽  
Martin Dimitrov ◽  
Karthik Kumar ◽  
Vish Viswanathan ◽  
Thomas Willhalm

1993 ◽  
Vol 42 (1) ◽  
pp. 122-127 ◽  
Author(s):  
H.C. Torng ◽  
M. Day

2010 ◽  
Vol 20 (02) ◽  
pp. 103-121 ◽  
Author(s):  
MOSTAFA I. SOLIMAN ◽  
ABDULMAJID F. Al-JUNAID

Technological advances in IC manufacturing provide the capability to integrate more and more functionality into a single chip; today's modern processors have nearly one billion transistors on a single chip. With the increasing complexity of such systems, designs have to be modeled at a high level of abstraction before being partitioned into hardware and software components for final implementation. This paper explains in detail the implementation and performance evaluation of a matrix processor called Mat-Core in SystemC (a system-level modeling language). Mat-Core is a research processor that aims to exploit the increasing number of transistors per IC to improve the performance of a wide range of applications. It extends a general-purpose scalar processor with a matrix unit. To hide memory latency, the extended matrix unit is decoupled into two components, address generation and data computation, which communicate through data queues. Like vector architectures, the data computation unit is organized into parallel lanes. However, on its parallel lanes Mat-Core can execute matrix-scalar, matrix-vector, and matrix-matrix instructions in addition to vector-scalar and vector-vector instructions. To control the execution of vector/matrix instructions on the matrix core, this paper extends the well-known scoreboard technique. Furthermore, the performance of Mat-Core is evaluated on vector and matrix kernels. Our results show that a four-lane Mat-Core with matrix registers of size 4 × 4 (16 elements each), a queue size of 10, a startup time of 6 clock cycles, and a memory latency of 10 clock cycles achieves about 0.94, 1.3, 2.3, 1.6, 2.3, and 5.5 FLOPs per clock cycle on scalar-vector multiplication, SAXPY, Givens rotation, rank-1 update, vector-matrix multiplication, and matrix-matrix multiplication, respectively.
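The decoupled access/execute organization described in the abstract can be illustrated in software. The Python toy model below (names such as `address_unit`, `compute_unit`, and `QUEUE_SIZE` are illustrative assumptions, not taken from Mat-Core) shows an address-generation stage feeding a data-computation stage through a bounded queue, so the access side can run ahead of computation and overlap memory latency with useful work:

```python
# Toy model of a decoupled access/execute pipeline (illustrative only;
# names and parameters are assumptions, not Mat-Core internals).
from queue import Queue
from threading import Thread

QUEUE_SIZE = 10  # bounded data queue between the two units
memory = {addr: float(addr) * 0.5 for addr in range(32)}  # fake memory

def address_unit(addrs, data_q):
    """Address generation: fetch operands and push them into the queue."""
    for a in addrs:
        data_q.put(memory[a])   # blocks only when the queue is full
    data_q.put(None)            # sentinel: no more operands

def compute_unit(data_q, results):
    """Data computation: a SAXPY-like per-element operation y = 2.0 * x."""
    while (x := data_q.get()) is not None:
        results.append(2.0 * x)

data_q = Queue(maxsize=QUEUE_SIZE)
results = []
t1 = Thread(target=address_unit, args=(range(8), data_q))
t2 = Thread(target=compute_unit, args=(data_q, results))
t1.start(); t2.start(); t1.join(); t2.join()
print(results[:3])  # [0.0, 1.0, 2.0]
```

The bounded queue is the key design point: it lets the two units proceed at independent rates while limiting how far the address unit can run ahead, which is the same role the data queues play between Mat-Core's address-generation and data-computation components.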


1992 ◽  
Vol 14 (2) ◽  
pp. 281 ◽  
Author(s):  
Thomas V. Greer ◽  
B. Wade Brorsen ◽  
Shi-Miin Liu

1994 ◽  
Vol 23 (1) ◽  
pp. 10-26 ◽  
Author(s):  
C. Martel ◽  
A. Raghunathan

2012 ◽  
Vol 2012 ◽  
pp. 1-12 ◽  
Author(s):  
Kaveh Aasaraai ◽  
Andreas Moshovos

Soft processors often use data caches to reduce the gap between processor and main-memory speeds. To achieve high efficiency, simple blocking caches are used. Such caches are not appropriate for processor designs such as Runahead and out-of-order execution, which require non-blocking caches to tolerate main-memory latencies; these processors use non-blocking caches to extract memory-level parallelism and improve performance. However, conventional non-blocking cache designs are expensive and slow on FPGAs because they use content-addressable memories (CAMs). This work proposes NCOR, an FPGA-friendly non-blocking cache that exploits the key properties of Runahead execution. NCOR does not require CAMs, relying instead on a smart cache controller. A 4 KB NCOR operates at 329 MHz on Stratix III FPGAs while using only 270 logic elements. A 32 KB NCOR operates at 278 MHz and uses 269 logic elements.
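The miss-tracking structure that conventional non-blocking caches implement with CAMs can be sketched in software. The toy model below (an illustrative assumption, not NCOR's actual controller) keeps outstanding misses in a table keyed by block address, so a second miss to a block already in flight is merged rather than re-issued; in hardware, searching this table associatively is exactly the CAM cost that NCOR is designed to avoid:

```python
# Toy non-blocking cache front end (illustrative only, not NCOR's design).
# Outstanding misses live in `mshr`, keyed by block address. A hardware
# implementation would need an associative (CAM) search over this table,
# which is the structure NCOR eliminates.
BLOCK = 16  # bytes per cache block

class NonBlockingCache:
    def __init__(self):
        self.lines = {}  # block_addr -> data (filled lines)
        self.mshr = {}   # block_addr -> list of waiting byte addresses

    def access(self, addr):
        blk = addr // BLOCK * BLOCK
        if blk in self.lines:
            return "hit"
        if blk in self.mshr:          # secondary miss: merge with in-flight
            self.mshr[blk].append(addr)
            return "merged"
        self.mshr[blk] = [addr]       # primary miss: allocate a new entry
        return "miss"

    def fill(self, blk, data):
        """Memory returns a block: install it and retire the MSHR entry."""
        self.lines[blk] = data
        return self.mshr.pop(blk)     # addresses that can now be serviced

c = NonBlockingCache()
print(c.access(0x100))                  # miss
print(c.access(0x104))                  # merged (same block in flight)
print(c.fill(0x100, b"\x00" * BLOCK))   # [256, 260]
print(c.access(0x108))                  # hit
```

Because the cache keeps accepting requests while misses are outstanding, independent accesses overlap their memory latencies, which is the memory-level parallelism that Runahead and out-of-order designs depend on.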

