DStride: data-cache miss-address-based stride prefetching scheme for multimedia processors

Compiler Techniques for Reducing Data Cache Miss Rate on a Multithreaded Architecture

High Performance Embedded Architectures and Compilers - Lecture Notes in Computer Science ◽

10.1007/978-3-540-77560-7_24 ◽

2008 ◽

pp. 353-368 ◽

Cited By ~ 11

Author(s):

Subhradyuti Sarkar ◽

Dean M. Tullsen

Keyword(s):

Data Cache ◽

Multithreaded Architecture ◽

Cache Miss ◽

Compiler Techniques


A SPLIT L2 DATA CACHE FOR SCALABLE CC-NUMA MULTIPROCESSORS

Journal of Circuits, Systems and Computers ◽

10.1142/s021812660500243x ◽

2005 ◽

Vol 14 (03) ◽

pp. 605-617 ◽

Cited By ~ 2

Author(s):

SUNG WOO CHUNG ◽

HYONG-SHIK KIM ◽

CHU SHIK JHON

Keyword(s):

Execution Time ◽

Memory Access ◽

Remote Memory ◽

Access Time ◽

Data Cache ◽

Total Execution Time ◽

Memory Address ◽

L2 Cache ◽

Cache Miss

In scalable CC-NUMA multiprocessors, it is crucial to reduce the average memory access time. For applications where the second-level (L2) cache is large enough, we propose a split L2 cache to utilize the surplus space. The split L2 cache is composed of a traditional LRU cache and an RVC (Remote Victim Cache), which stores only data from the remote memory address range. Thus, it reduces the average L2 cache miss time by keeping remote blocks that would otherwise be discarded. Though the split cache does not reduce the miss rates, it is observed to reduce the total execution time by up to 27%. It even outperforms an LRU cache of double the size.
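The split organization described in this abstract can be illustrated with a toy model. This is a minimal sketch, not the authors' design: the class name `SplitL2Cache`, the `is_remote` predicate, the block-granular capacities, and the choice to move a block back into the LRU part on an RVC hit are all assumptions made for illustration.

```python
from collections import OrderedDict


class SplitL2Cache:
    """Toy model: a plain LRU cache plus a remote victim cache (RVC).

    Hypothetical sketch. Capacities are counted in blocks, and
    `is_remote` is an assumed predicate that tells whether a block
    address belongs to the remote memory address range.
    """

    def __init__(self, lru_blocks, rvc_blocks, is_remote):
        self.lru = OrderedDict()  # block address -> None, oldest first
        self.rvc = OrderedDict()  # remote victims only, oldest first
        self.lru_blocks = lru_blocks
        self.rvc_blocks = rvc_blocks
        self.is_remote = is_remote

    def access(self, addr):
        """Return where the access was served from."""
        if addr in self.lru:
            self.lru.move_to_end(addr)  # refresh recency
            return "lru_hit"
        if addr in self.rvc:
            # Assumption: an RVC hit moves the block back into the LRU part.
            del self.rvc[addr]
            self._fill(addr)
            return "rvc_hit"
        self._fill(addr)
        return "remote_miss" if self.is_remote(addr) else "local_miss"

    def _fill(self, addr):
        """Insert a block into the LRU part, handling the evicted victim."""
        self.lru[addr] = None
        if len(self.lru) > self.lru_blocks:
            victim, _ = self.lru.popitem(last=False)
            if self.is_remote(victim):
                # Remote victims are kept; local victims are simply dropped.
                self.rvc[victim] = None
                if len(self.rvc) > self.rvc_blocks:
                    self.rvc.popitem(last=False)


cache = SplitL2Cache(lru_blocks=2, rvc_blocks=2, is_remote=lambda a: a >= 100)
trace = [cache.access(a) for a in (100, 1, 2, 100, 1)]
```

In this trace, remote block 100 is evicted from the LRU part but survives in the RVC, so its second access avoids a costly remote miss, while the evicted local block 1 has to be refetched from local memory.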


Author retrospective improving data cache performance by pre-executing instructions under a cache miss

25th Anniversary International Conference on Supercomputing Anniversary Volume - ◽

10.1145/2591635.2591655 ◽

2014 ◽

Author(s):

Trevor Mudge

Keyword(s):

Data Cache ◽

Cache Performance ◽

Cache Miss


Improving data cache performance by pre-executing instructions under a cache miss

25th Anniversary International Conference on Supercomputing Anniversary Volume - ◽

10.1145/2591635.2667173 ◽

2014 ◽

Author(s):

James Dundas ◽

Trevor Mudge

Keyword(s):

Data Cache ◽

Cache Performance ◽

Cache Miss


Improving data cache performance by pre-executing instructions under a cache miss

Proceedings of the 11th international conference on Supercomputing - ICS '97 ◽

10.1145/263580.263597 ◽

1997 ◽

Cited By ~ 132

Author(s):

James Dundas ◽

Trevor Mudge

Keyword(s):

Data Cache ◽

Cache Performance ◽

Cache Miss


Research of Web-OLAP system based on distributed data cache technology

Journal of Computer Applications ◽

10.3724/sp.j.1087.2008.00515 ◽

2008 ◽

Vol 28 (2) ◽

pp. 515-518

Author(s):

Li-juan CAO

Keyword(s):

Data Cache ◽

Distributed Data


Decreasing the Miss Rate and Eliminating the Performance Penalty of a Data Filter Cache

ACM Transactions on Architecture and Code Optimization ◽

10.1145/3449043 ◽

2021 ◽

Vol 18 (3) ◽

pp. 1-22

Author(s):

Michael Stokes ◽

David Whalley ◽

Soner Onder

Keyword(s):

Energy Efficient ◽

Data Access ◽

Performance Degradation ◽

Access Time ◽

Data Cache ◽

Energy Usage ◽

Single Cycle ◽

Performance Penalty

While data filter caches (DFCs) have been shown to be effective at reducing data access energy, they have not been adopted in processors due to the associated performance penalty caused by high DFC miss rates. In this article, we present a design that both decreases the DFC miss rate and completely eliminates the DFC performance penalty, even for a level-one data cache (L1 DC) with a single-cycle access time. First, we show that a DFC that lazily fills each word in a DFC line from an L1 DC only when the word is referenced is more energy-efficient than eagerly filling the entire DFC line. For a 512B DFC, we are able to eliminate loads of words into the DFC that are never referenced before being evicted, which occurred for about 75% of the words in 32B lines. Second, we demonstrate that a lazily word-filled DFC line can effectively share and pack data words from multiple L1 DC lines to lower the DFC miss rate. For a 512B DFC, we completely avoid accessing the L1 DC for loads about 23% of the time and avoid a fully associative L1 DC access for loads 50% of the time, where the DFC only requires about 2.5% of the size of the L1 DC. Finally, we present a method that completely eliminates the DFC performance penalty by speculatively performing DFC tag checks early and only accessing DFC data when a hit is guaranteed. For a 512B DFC, we improve data access energy usage for the DTLB and L1 DC by 33% with no performance degradation.
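The difference between eager and lazy line fill mentioned in this abstract can be shown with a small counting sketch. It is an illustration under stated assumptions, not the authors' hardware design: the function name `words_filled` is hypothetical, and a 4-byte word size (giving 8 words per 32B line) is assumed.

```python
WORDS_PER_LINE = 8  # 32B line / 4B word; the word size is an assumption


def words_filled(referenced_offsets, lazy):
    """Count words copied from the L1 DC into one DFC line before eviction.

    `referenced_offsets` is the sequence of word offsets (0..7) touched
    while the line is resident. Hypothetical helper for illustration.
    """
    if not referenced_offsets:
        return 0  # line was never allocated
    if not lazy:
        return WORDS_PER_LINE  # eager fill copies the whole line at once
    valid = [False] * WORDS_PER_LINE  # per-word valid bits
    filled = 0
    for offset in referenced_offsets:
        if not valid[offset]:  # fill a word only on its first reference
            valid[offset] = True
            filled += 1
    return filled


eager = words_filled([0, 1, 0], lazy=False)
lazy = words_filled([0, 1, 0], lazy=True)
```

With two of eight words referenced, the lazy policy copies 2 words where the eager policy copies 8, which matches the case the abstract reports as typical (about 75% of the words in a 32B line never referenced before eviction).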
