Memory latency-tolerance approaches for Itanium processors: out-of-order execution vs. speculative precomputation

Author(s):  
P.H. Wang ◽  
Hong Wang ◽  
J.D. Collins ◽  
E. Grochowski ◽  
R.M. Kling ◽  
...  
IEEE Micro ◽  
2006 ◽  
Vol 26 (1) ◽  
pp. 10-20 ◽  
Author(s):  
O. Mutlu ◽  
Hyesoon Kim ◽  
Y.N. Patt

2015 ◽  
Vol 43 (1) ◽  
pp. 471-472
Author(s):  
Russell Clapp ◽  
Martin Dimitrov ◽  
Karthik Kumar ◽  
Vish Viswanathan ◽  
Thomas Willhalm

1993 ◽  
Vol 42 (1) ◽  
pp. 122-127 ◽  
Author(s):  
H.C. Torng ◽  
M. Day

2010 ◽  
Vol 20 (02) ◽  
pp. 103-121 ◽  
Author(s):  
MOSTAFA I. SOLIMAN ◽  
ABDULMAJID F. Al-JUNAID

Technological advances in IC manufacturing provide the capability to integrate more and more functionality into a single chip; today's modern processors have nearly one billion transistors on a single chip. With the increasing complexity of such systems, designs have to be modeled at a high level of abstraction before being partitioned into hardware and software components for final implementation. This paper explains in detail the implementation and performance evaluation of a matrix processor called Mat-Core in SystemC (a system-level modeling language). Mat-Core is a research processor that aims to exploit the increasing number of transistors per IC to improve the performance of a wide range of applications. It extends a general-purpose scalar processor with a matrix unit. To hide memory latency, the extended matrix unit is decoupled into two components, address generation and data computation, which communicate through data queues. Like vector architectures, the data computation unit is organized into parallel lanes. However, on its parallel lanes Mat-Core can execute matrix-scalar, matrix-vector, and matrix-matrix instructions in addition to vector-scalar and vector-vector instructions. To control the execution of vector/matrix instructions on the matrix core, this paper extends the well-known scoreboard technique. Furthermore, the performance of Mat-Core is evaluated on vector and matrix kernels. Our results show that a four-lane Mat-Core with matrix registers of size 4 × 4 (16 elements each), a queue size of 10, a startup time of 6 clock cycles, and a memory latency of 10 clock cycles achieves about 0.94, 1.3, 2.3, 1.6, 2.3, and 5.5 FLOPs per clock cycle on scalar-vector multiplication, SAXPY, Givens rotation, rank-1 update, vector-matrix multiplication, and matrix-matrix multiplication, respectively.
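The decoupled access/execute organization described in the abstract can be illustrated in software. The Python toy model below (names such as `address_unit`, `compute_unit`, and `QUEUE_SIZE` are illustrative assumptions, not taken from Mat-Core) shows an address-generation stage feeding a data-computation stage through a bounded queue, so the access side can run ahead of computation and overlap memory latency with useful work:

```python
# Toy model of a decoupled access/execute pipeline (illustrative only;
# names and parameters are assumptions, not Mat-Core internals).
from queue import Queue
from threading import Thread

QUEUE_SIZE = 10  # bounded data queue between the two units
memory = {addr: float(addr) * 0.5 for addr in range(32)}  # fake memory

def address_unit(addrs, data_q):
    """Address generation: fetch operands and push them into the queue."""
    for a in addrs:
        data_q.put(memory[a])   # blocks only when the queue is full
    data_q.put(None)            # sentinel: no more operands

def compute_unit(data_q, results):
    """Data computation: a SAXPY-like per-element operation y = 2.0 * x."""
    while (x := data_q.get()) is not None:
        results.append(2.0 * x)

data_q = Queue(maxsize=QUEUE_SIZE)
results = []
t1 = Thread(target=address_unit, args=(range(8), data_q))
t2 = Thread(target=compute_unit, args=(data_q, results))
t1.start(); t2.start(); t1.join(); t2.join()
print(results[:3])  # [0.0, 1.0, 2.0]
```

The bounded queue is the key design point: it lets the two units proceed at independent rates while limiting how far the address unit can run ahead, which is the same role the data queues play between Mat-Core's address-generation and data-computation components.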


1992 ◽  
Vol 14 (2) ◽  
pp. 281 ◽  
Author(s):  
Thomas V. Greer ◽  
B. Wade Brorsen ◽  
Shi-Miin Liu

1994 ◽  
Vol 23 (1) ◽  
pp. 10-26 ◽  
Author(s):  
C. Martel ◽  
A. Raghunathan

2012 ◽  
Vol 2012 ◽  
pp. 1-12 ◽  
Author(s):  
Kaveh Aasaraai ◽  
Andreas Moshovos

Soft processors often use data caches to reduce the gap between processor and main-memory speeds. To achieve high efficiency, simple blocking caches are used. Such caches are not appropriate for processor designs such as Runahead and out-of-order execution, which require non-blocking caches to tolerate main-memory latencies; these processors use non-blocking caches to extract memory-level parallelism and improve performance. However, conventional non-blocking cache designs are expensive and slow on FPGAs because they use content-addressable memories (CAMs). This work proposes NCOR, an FPGA-friendly non-blocking cache that exploits the key properties of Runahead execution. NCOR does not require CAMs, relying instead on a smart cache controller. A 4 KB NCOR operates at 329 MHz on Stratix III FPGAs while using only 270 logic elements. A 32 KB NCOR operates at 278 MHz and uses 269 logic elements.
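The miss-tracking structure that conventional non-blocking caches implement with CAMs can be sketched in software. The toy model below (an illustrative assumption, not NCOR's actual controller) keeps outstanding misses in a table keyed by block address, so a second miss to a block already in flight is merged rather than re-issued; in hardware, searching this table associatively is exactly the CAM cost that NCOR is designed to avoid:

```python
# Toy non-blocking cache front end (illustrative only, not NCOR's design).
# Outstanding misses live in `mshr`, keyed by block address. A hardware
# implementation would need an associative (CAM) search over this table,
# which is the structure NCOR eliminates.
BLOCK = 16  # bytes per cache block

class NonBlockingCache:
    def __init__(self):
        self.lines = {}  # block_addr -> data (filled lines)
        self.mshr = {}   # block_addr -> list of waiting byte addresses

    def access(self, addr):
        blk = addr // BLOCK * BLOCK
        if blk in self.lines:
            return "hit"
        if blk in self.mshr:          # secondary miss: merge with in-flight
            self.mshr[blk].append(addr)
            return "merged"
        self.mshr[blk] = [addr]       # primary miss: allocate a new entry
        return "miss"

    def fill(self, blk, data):
        """Memory returns a block: install it and retire the MSHR entry."""
        self.lines[blk] = data
        return self.mshr.pop(blk)     # addresses that can now be serviced

c = NonBlockingCache()
print(c.access(0x100))                  # miss
print(c.access(0x104))                  # merged (same block in flight)
print(c.fill(0x100, b"\x00" * BLOCK))   # [256, 260]
print(c.access(0x108))                  # hit
```

Because the cache keeps accepting requests while misses are outstanding, independent accesses overlap their memory latencies, which is the memory-level parallelism that Runahead and out-of-order designs depend on.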

