PERFORMANCE OF SCALABLE SHARED-MEMORY ARCHITECTURES

Journal of Circuits, Systems and Computers ◽

10.1142/s0218126600000068 ◽

2000 ◽

Vol 10 (01n02) ◽

pp. 1-22

Author(s):

BAHMAN S. MOTLAGH ◽

RONALD F. DeMARA

Keyword(s):

Memory Access ◽

Analytical Models ◽

Access Time ◽

System Parameters ◽

Hit Rates ◽

Working Set ◽

Memory Accesses ◽

Memory Architectures ◽

Block Sizes ◽

Shared Memory Architectures

Analytical models were developed and simulations of memory latency were performed for Uniform Memory Access (UMA), Non-Uniform Memory Access (NUMA), Local-Remote-Global (LRG), and RCR architectures for hit rates from 0.1 to 0.9 in steps of 0.1, memory access times of 10 to 100 ns, proportions of read/write access from 0.01 to 0.1, and block sizes of 8 to 64 words. The RCR architecture provides favorable performance over UMA and NUMA architectures for all ranges of application and system parameters. RCR outperforms LRG architectures when the hit rate of the processor cache exceeds 80% and replicated memory exceeds 25%. Thus, inclusion of a small replicated memory at each processor significantly reduces expected access time, since all replicated memory hits become independent of global traffic. For configurations of up to 32 processors, results show that latency is further reduced by distinguishing burst-mode transfers between isolated memory accesses and those which are incrementally outside the working set.
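The effect of a replicated memory on expected access time can be illustrated with a minimal sketch. This is not the model from the paper: it assumes a simple three-level hierarchy (processor cache, replicated memory, global memory) with independent hit probabilities, and the function name and parameters are hypothetical.

```python
def expected_access_time(cache_hit, t_cache, replicated_hit, t_replicated, t_global):
    """Expected access time (ns) for an assumed cache -> replicated -> global hierarchy.

    cache_hit and replicated_hit are hit probabilities in [0, 1]; the t_* values
    are access times in ns. This weighting is an illustrative assumption, not
    the analytical model of the paper.
    """
    for p in (cache_hit, replicated_hit):
        if not 0.0 <= p <= 1.0:
            raise ValueError("hit rates must lie in [0, 1]")
    miss = 1.0 - cache_hit
    # A cache miss is served by replicated memory on a hit there,
    # and by global memory otherwise.
    miss_cost = replicated_hit * t_replicated + (1.0 - replicated_hit) * t_global
    return cache_hit * t_cache + miss * miss_cost


if __name__ == "__main__":
    # Sweep cache hit rates 0.1 .. 0.9, as in the abstract's parameter range.
    for h in [i / 10 for i in range(1, 10)]:
        print(h, round(expected_access_time(h, 10, 0.25, 50, 100), 2))
```

With these assumed latencies, any nonzero replicated hit rate lowers the expected time relative to going straight to global memory, because those hits no longer pay the global access cost.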

Memory Access Behavior Analysis of NUMA-Based Shared Memory Programs

Scientific Programming ◽

10.1155/2002/790749 ◽

2002 ◽

Vol 10 (1) ◽

pp. 45-53 ◽

Cited By ~ 3

Author(s):

Jie Tao ◽

Wolfgang Karl ◽

Martin Schulz

Keyword(s):

Shared Memory ◽

Data Locality ◽

Memory Access ◽

Remote Memory ◽

Data Layout ◽

Performance Improvements ◽

Significant Performance ◽

Working Set ◽

Memory Accesses ◽

Memory Applications

Shared memory applications running transparently on top of NUMA architectures often face severe performance problems due to bad data locality and excessive remote memory accesses. Optimizations with respect to data locality are therefore necessary, but require a fundamental understanding of an application's memory access behavior. The information necessary for this cannot be obtained using simple code instrumentation due to the implicit nature of the communication handled by the NUMA hardware, the large amount of traffic produced at runtime, and the fine access granularity in shared memory codes. In this paper an approach to overcome these problems and thereby to enable an easy and efficient optimization process is presented. Based on a low-level hardware monitoring facility in coordination with a comprehensive visualization tool, it enables the generation of memory access histograms capable of showing all memory accesses across the complete address space of an application's working set. This information can be used to identify access hot spots, to understand the dynamic behavior of shared memory applications, and to optimize applications using an application specific data layout resulting in significant performance improvements.
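A memory access histogram of the kind described above can be sketched in a few lines. This is only an illustration in software: the paper gathers its data with a hardware monitor, whereas the sketch assumes a trace of (node_id, address) pairs is already available. The page size and all function names are hypothetical.

```python
from collections import Counter

PAGE_SIZE = 4096  # assumed page size in bytes


def access_histogram(trace, page_size=PAGE_SIZE):
    """Count accesses per (page, node) from an iterable of (node_id, address) pairs."""
    hist = Counter()
    for node_id, address in trace:
        hist[(address // page_size, node_id)] += 1
    return hist


def hot_pages(hist, top=3):
    """Return the `top` pages with the most accesses, summed over all nodes."""
    per_page = Counter()
    for (page, _node), count in hist.items():
        per_page[page] += count
    return per_page.most_common(top)


if __name__ == "__main__":
    trace = [(0, 0), (0, 8), (1, 4096), (1, 16), (0, 8192)]
    hist = access_histogram(trace)
    print(hot_pages(hist, top=1))
```

A page that is hot and mostly accessed from a node other than the one holding it is a candidate for a changed data layout.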

Scaling Non-Regular Shared-Memory Codes by Reusing Custom Loop Schedules

Scientific Programming ◽

10.1155/2003/379739 ◽

2003 ◽

Vol 11 (2) ◽

pp. 143-158 ◽

Cited By ~ 1

Author(s):

Dimitrios S. Nikolopoulos ◽

Ernest Artiaga ◽

Eduard Ayguadé ◽

Jesús Labarta

Keyword(s):

Shared Memory ◽

Memory Access ◽

Access Latency ◽

Programming Paradigm ◽

Work Distribution ◽

Irregular Data ◽

Memory Architectures ◽

Memory Codes ◽

Shared Memory Architectures

In this paper we explore the idea of customizing and reusing loop schedules to improve the scalability of non-regular numerical codes in shared-memory architectures with non-uniform memory access latency. The main objective is to implicitly set up affinity links between threads and data, by devising loop schedules that achieve balanced work distribution within irregular data spaces and reusing them as much as possible during the execution of the program for better memory access locality. This transformation provides a great deal of flexibility in optimizing locality, without compromising the simplicity of the shared-memory programming paradigm. In particular, the programmer does not need to explicitly distribute data between processors. The paper presents practical examples from real applications and experiments showing the efficiency of the approach.
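The idea of computing a balanced schedule once and reusing it can be sketched as follows. This is a hedged illustration, not the authors' implementation: it assumes per-iteration work estimates are known, uses a simple greedy contiguous partition, and the names `balanced_schedule` and `ScheduleCache` are hypothetical.

```python
def balanced_schedule(weights, num_threads):
    """Split iterations into contiguous chunks of roughly equal total weight.

    Greedy prefix-sum partition (an assumed heuristic): thread t takes
    iterations while the running weight stays within t/num_threads of the total.
    """
    total = sum(weights)
    chunks, start, running = [], 0, 0
    for t in range(1, num_threads + 1):
        target = total * t / num_threads
        end = start
        # The last thread takes every remaining iteration.
        while end < len(weights) and (t == num_threads or running + weights[end] <= target):
            running += weights[end]
            end += 1
        chunks.append(range(start, end))
        start = end
    return chunks


class ScheduleCache:
    """Reuse one schedule per (loop, thread count) so each thread revisits the same data."""

    def __init__(self):
        self._schedules = {}

    def get(self, loop_id, weights, num_threads):
        key = (loop_id, num_threads)
        if key not in self._schedules:
            self._schedules[key] = balanced_schedule(weights, num_threads)
        return self._schedules[key]
```

Because every invocation of the same loop hands each thread the same iterations, pages first touched by a thread tend to stay close to it on later invocations, which is the affinity effect the abstract describes.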

Off-chip prefetching based on Hidden Markov Model for non-volatile memory architectures

PLoS ONE ◽

10.1371/journal.pone.0257047 ◽

2021 ◽

Vol 16 (9) ◽

pp. e0257047

Author(s):

Adrián Lamela ◽

Óscar G. Ossorio ◽

Guillermo Vinuesa ◽

Benjamín Sahelices

Keyword(s):

Markov Model ◽

Hidden Markov Model ◽

Hidden Markov ◽

Multicore Processors ◽

Memory Access ◽

Non Volatile Memory ◽

Volatile Memory ◽

Memory Accesses ◽

Access Patterns ◽

Memory Architectures

Non-volatile memory technology is now available in commodity hardware. This technology can be used as a backup memory for an external DRAM cache memory without needing to modify the software. However, the higher read and write latencies of non-volatile memory may exacerbate the memory wall problem. In this work we present a novel off-chip prefetch technique based on a Hidden Markov Model that specifically deals with the latency problem caused by the complexity of off-chip memory access patterns. First, we present a thorough analysis of off-chip memory access patterns to identify their complexity in multicore processors. Based on this study, we propose a prefetching module located in the LLC which uses two small tables and whose computational complexity is linear in the number of computing threads. Our Markov-based technique is able to track and cluster several simultaneous groups of memory accesses coming from multiple simultaneous threads in a multicore processor. It can quickly identify complex address groups and trigger prefetches with very high accuracy. Our simulations show an improvement of up to 76% in the hit ratio of an off-chip DRAM cache for multicore architectures over the conventional prefetch technique (G/DC). Also, the overhead of prefetch requests (failed prefetches) is reduced by 48% in single-core simulations and by 83% in multicore simulations.

Pre-Emphasis Pulse Design for Random-Access Memory

Electronics ◽

10.3390/electronics10121454 ◽

2021 ◽

Vol 10 (12) ◽

pp. 1454

Author(s):

Yoshihiro Sugiura ◽

Toru Tanzawa

Keyword(s):

Time Constant ◽

Random Access ◽

Memory Cell ◽

Random Access Memory ◽

Memory Access ◽

Access Time ◽

Access Memory ◽

Delay Times ◽

Cell Current ◽

The Impact

This paper describes how one can reduce the memory access time with pre-emphasis (PE) pulses even in non-volatile random-access memory. Optimum PE pulse widths and resultant minimum word-line (WL) delay times are investigated as a function of column address. The impact of the process variation in the time constant of WL, the cell current, and the resistance of deciding path on optimum PE pulses is discussed. Optimum PE pulse widths and resultant minimum WL delay times are modeled with fitting curves as a function of the column address of the accessed memory cell, which provides designers with the ability to set the optimum timing for WL and BL (bit-line) operations, reducing average memory access time.

Decision Tree-Based Adaptive Reconfigurable Cache Scheme

Algorithms ◽

10.3390/a14060176 ◽

2021 ◽

Vol 14 (6) ◽

pp. 176

Author(s):

Wei Zhu ◽

Xiaoyang Zeng

Keyword(s):

Decision Tree ◽

Adaptive Algorithms ◽

Memory Access ◽

Access Time ◽

Decision Tree Algorithm ◽

Verilog Hdl ◽

Tree Model ◽

Cache Associativity ◽

Cache Scheme ◽

Reconfigurable Cache

Applications have different preferences for caches, sometimes even within different running phases. Caches with fixed parameters may compromise the performance of a system. To solve this problem, we propose a real-time adaptive reconfigurable cache based on the decision tree algorithm, which can optimize the average memory access time of the cache without modifying the cache coherence protocol. By monitoring the application running state, the cache associativity is periodically tuned to the optimal cache associativity, which is determined by the decision tree model. This paper implements the proposed decision tree-based adaptive reconfigurable cache in the GEM5 simulator and designs the key modules using Verilog HDL. The simulation results show that the proposed decision tree-based adaptive reconfigurable cache reduces the average memory access time compared with other adaptive algorithms.

Realising a concurrent object-based programming model on parallel virtual shared memory architectures

Programming Models for Massively Parallel Computers ◽

10.1109/pmmpc.1995.504345 ◽

2002 ◽

Cited By ~ 1

Author(s):

M. Fisher ◽

J. Keane

Keyword(s):

Shared Memory ◽

Programming Model ◽

Object Based ◽

Virtual Shared Memory ◽

Memory Architectures ◽

Shared Memory Architectures

Analytic evaluation of shared-memory architectures

IEEE Transactions on Parallel and Distributed Systems ◽

10.1109/tpds.2003.1178880 ◽

2003 ◽

Vol 14 (2) ◽

pp. 166-180 ◽

Cited By ~ 6

Author(s):

D.J. Sorin ◽

J.L. Lemon ◽

D.L. Eager ◽

M.K. Vernon

Keyword(s):

Shared Memory ◽

Analytic Evaluation ◽

Memory Architectures ◽

Shared Memory Architectures

Adaptive software cache management for distributed shared memory architectures

[1990] Proceedings. The 17th Annual International Symposium on Computer Architecture ◽

10.1109/isca.1990.134515 ◽

2002 ◽

Cited By ~ 37

Author(s):

J.K. Bennett ◽

J.B. Carter ◽

W. Zwaenepoel

Keyword(s):

Shared Memory ◽

Distributed Shared Memory ◽

Cache Management ◽

Adaptive Software ◽

Memory Architectures ◽

Shared Memory Architectures ◽

Software Cache

Online Thread and Data Mapping Using a Sharing-Aware Memory Management Unit

ACM Transactions on Modeling and Performance Evaluation of Computing Systems ◽

10.1145/3433687 ◽

2021 ◽

Vol 5 (4) ◽

pp. 1-28

Author(s):

Eduardo H. M. Cruz ◽

Matthias Diener ◽

Laércio L. Pilla ◽

Philippe O. A. Navaux

Keyword(s):

Energy Efficiency ◽

Memory Management ◽

Substantial Reduction ◽

Management Unit ◽

Memory Access ◽

Parallel Applications ◽

Data Mapping ◽

Wide Range ◽

Memory Accesses ◽

Level Parallelism

Current and future architectures rely on thread-level parallelism to sustain performance growth. These architectures have introduced a complex memory hierarchy, consisting of several cores organized hierarchically with multiple cache levels and NUMA nodes. These memory hierarchies can have an impact on the performance and energy efficiency of parallel applications as the importance of memory access locality is increased. In order to improve locality, the analysis of the memory access behavior of parallel applications is critical for mapping threads and data. Nevertheless, most previous work relies on indirect information about the memory accesses, or does not combine thread and data mapping, resulting in less accurate mappings. In this paper, we propose the Sharing-Aware Memory Management Unit (SAMMU), an extension to the memory management unit that allows it to detect the memory access behavior in hardware. With this information, the operating system can perform online mapping without any previous knowledge about the behavior of the application. In the evaluation with a wide range of parallel applications (NAS Parallel Benchmarks and PARSEC Benchmark Suite), performance was improved by up to 35.7% (10.0% on average) and energy efficiency was improved by up to 11.9% (4.1% on average). These improvements happened due to a substantial reduction of cache misses and interconnection traffic.

Or-Parallel Prolog on Distributed Shared-Memory Architectures

Implementations of Logic Programming Systems ◽

10.1007/978-1-4615-2690-2_14 ◽

1994 ◽

pp. 203-215

Author(s):

Fernando M. A. Silva

Keyword(s):

Shared Memory ◽

Distributed Shared Memory ◽

Memory Architectures ◽

Shared Memory Architectures
