Online Thread and Data Mapping Using a Sharing-Aware Memory Management Unit

Author(s):  
Eduardo H. M. Cruz ◽  
Matthias Diener ◽  
Laércio L. Pilla ◽  
Philippe O. A. Navaux

Current and future architectures rely on thread-level parallelism to sustain performance growth. These architectures introduce a complex memory hierarchy, consisting of several cores organized hierarchically with multiple cache levels and NUMA nodes. Such hierarchies affect the performance and energy efficiency of parallel applications because they increase the importance of memory access locality. Analyzing the memory access behavior of parallel applications is therefore critical for mapping threads and data to improve locality. Nevertheless, most previous work relies on indirect information about the memory accesses, or does not combine thread and data mapping, resulting in less accurate mappings. In this paper, we propose the Sharing-Aware Memory Management Unit (SAMMU), an extension to the memory management unit that allows it to detect the memory access behavior in hardware. With this information, the operating system can perform online mapping without any prior knowledge of the application's behavior. In an evaluation with a wide range of parallel applications (NAS Parallel Benchmarks and PARSEC Benchmark Suite), performance improved by up to 35.7% (10.0% on average) and energy efficiency by up to 11.9% (4.1% on average). These improvements resulted from a substantial reduction of cache misses and interconnection traffic.

2021 ◽  
Vol 40 (2) ◽  
pp. 1-17
Author(s):  
Milan Jaroš ◽  
Lubomír Říha ◽  
Petr Strakoš ◽  
Matěj Špeťko

This article presents a solution to path tracing of massive scenes on multiple GPUs. Our approach analyzes the memory access pattern of a path tracer and defines how the scene data should be distributed across up to 16 GPUs with minimal effect on performance. The key concept is that the parts of the scene with the highest number of memory accesses are replicated on all GPUs. We propose two methods for maximizing the performance of path tracing when working with partially distributed scene data. Both methods work at the memory management level, so path tracer data structures do not have to be redesigned, making our approach applicable to other path tracers with only minor changes to their code. As a proof of concept, we have enhanced the open-source Blender Cycles path tracer. The approach was validated on scenes of up to 169 GB. We show that for such large scenes only 1–5% of the scene data needs to be replicated on all GPUs. On smaller scenes, we have verified that the performance is very close to rendering a fully replicated scene. In terms of scalability, we have achieved a parallel efficiency of over 94% using up to 16 GPUs.


2011 ◽  
Vol 2011 ◽  
pp. 1-13 ◽  
Author(s):  
Mateus B. Rutzig ◽  
Antonio C. S. Beck ◽  
Felipe Madruga ◽  
Marco A. Alves ◽  
Henrique C. Freitas ◽  
...  

Limits of instruction-level parallelism and higher transistor density sustain the increasing need for multiprocessor systems: they are rapidly taking over both general-purpose and embedded processor domains. Current multiprocessing systems are composed either of many homogeneous, simple cores or of complex superscalar, simultaneous multithreading processing elements. As parallel applications are becoming increasingly present in embedded and general-purpose domains and multiprocessing systems must handle a wide range of application classes, there is no consensus over which hardware solutions best exploit instruction-level parallelism (ILP) and thread-level parallelism (TLP) together. Therefore, in this work, we have extended the DIM (dynamic instruction merging) technique to a multiprocessing scenario, demonstrating the need for adaptable ILP exploitation even in TLP-oriented architectures. We have successfully coupled a dynamic reconfigurable system to a SPARC-based multiprocessor and obtained performance gains of up to 40%, even for applications that show a great level of parallelism at the thread level.


2012 ◽  
Vol 21 (02) ◽  
pp. 1240006 ◽  
Author(s):  
RAGAVENDRA NATARAJAN ◽  
VINEETH MEKKAT ◽  
WEI-CHUNG HSU ◽  
ANTONIA ZHAI

For today's increasingly power-constrained multicore systems, integrating simpler and more energy-efficient in-order cores becomes attractive. However, since in-order processors lack complex hardware support for tolerating long-latency memory accesses, developing compiler technologies to hide such latencies becomes critical. Compiler-directed prefetching has been demonstrated effective on some applications. On the application side, a large class of data-centric applications has emerged to explore the underlying properties of explosively growing data. These applications, in contrast to traditional benchmarks, are characterized by substantial thread-level parallelism, complex and unpredictable control flow, and intensive, irregular memory access patterns, and they are expected to be the dominating workloads on future microprocessors. Thus, in this paper, we investigate the effectiveness of compiler-directed prefetching on data mining applications in in-order multicore systems. Our study reveals that although properly inserted prefetch instructions can often effectively reduce memory access latencies for data mining applications, the compiler is not always able to exploit this potential. Compiler-directed prefetching can become inefficient in the presence of complex control flow, irregular memory access patterns, and architecture-dependent behaviors. The integration of multithreaded execution onto a single die makes it even more difficult for the compiler to insert prefetch instructions, since optimizations that are effective for single-threaded execution may or may not be effective in multithreaded execution. Thus, compiler-directed prefetching must be judiciously deployed to avoid creating performance bottlenecks that otherwise would not exist. Our experience suggests that dynamic performance tuning techniques that adjust to the behaviors of a program can potentially facilitate the deployment of aggressive optimizations in data mining applications.


2020 ◽  
Vol 76 (4) ◽  
pp. 3129-3154
Author(s):  
Juan Fang ◽  
Mengxuan Wang ◽  
Zelin Wei

Multiple CPUs and GPUs integrated on the same chip share memory, and access requests from different cores interfere with each other. Memory requests from the GPU seriously interfere with CPU memory access performance; requests from multiple CPUs become intertwined when accessing memory, greatly affecting their performance; and the difference in access latency between GPU cores increases the average memory access latency. To address these problems in the shared memory of heterogeneous multi-core systems, we propose a step-by-step memory scheduling strategy that improves system performance. The strategy first creates separate memory request queues based on the request source when the memory controller receives a request, isolating CPU requests from GPU requests and thereby preventing GPU requests from interfering with CPU requests. Then, for the CPU request queue, a dynamic bank partitioning strategy maps applications to different bank sets according to their memory access characteristics, eliminating memory request interference between CPU applications without sacrificing bank-level parallelism. Finally, for the GPU request queue, criticality is introduced to measure the difference in memory access latency between cores. Building on the first-ready, first-come first-served (FR-FCFS) policy, we implement criticality-aware memory scheduling to balance the locality and criticality of application accesses.


2017 ◽  
Author(s):  
Eduardo H. M. Cruz ◽  
Philippe O. A. Navaux

Current computer architectures include complex memory hierarchies that introduce different memory access times. One solution adopted to reduce access time is to increase the locality of memory accesses through thread and data mapping. In this doctoral thesis, we propose novel solutions to identify a mapping that optimizes memory accesses, using the memory management unit to monitor them. In the experimental evaluation, the solutions improved performance by up to 39% and energy efficiency by up to 12.2%, due to a substantial reduction in cache misses, inter-processor traffic, and accesses to remote memory banks.


2020 ◽  
Vol 14 ◽  
Author(s):  
M. Sivaram ◽  
V. Porkodi ◽  
Amin Salih Mohammed ◽  
S. Anbu Karuppusamy

Background: With the advent of IoT, the deployment of batteries with a limited lifetime in remote areas is a major concern. Under certain conditions, the network lifetime is restricted by battery constraints. Collaborative approaches to key facilities can help reduce the resource demands of current security protocols. Aim: This work covers and combines a wide range of IoT concepts related to security and energy efficiency. Specifically, it examines the WSN energy efficiency problem in IoT and the management of threats in IoT through collaborative approaches, and finally outlines future directions. We develop the concept of energy-efficient key protocols that cover heterogeneous IoT communications among peers with different resources. Because of the low capacity of sensor nodes, energy efficiency has long been an important concern in WSNs. Methods: Hence, in this paper, we present an Artificial Bee Colony (ABC) based algorithm and review security and energy consumption to discuss their constraints in IoT scenarios. Results: The results of a detailed experimental assessment are analyzed in terms of communication cost, energy consumption, and security, demonstrating the relevance of the proposed ABC approach and key establishment scheme. Conclusion: The validation of DTLS-ABC consists of designing an inter-node cooperation trust model to create a trusted community of mutually supportive elements. Initial attempts to design key management methods are appropriate for individual IoT devices, giving system designers an option that addresses the question of scalability.


Author(s):  
Mark Endrei ◽  
Chao Jin ◽  
Minh Ngoc Dinh ◽  
David Abramson ◽  
Heidi Poxon ◽  
...  

Rising power costs and constraints are driving a growing focus on the energy efficiency of high performance computing systems. The unique characteristics of a particular system and workload, and their effect on performance and energy efficiency, are typically difficult for application users to assess and to control. Settings for optimum performance and energy efficiency can also diverge, so we need to identify trade-off options that guide a suitable balance between energy use and performance. We present statistical and machine learning models that require only a small number of runs to make accurate Pareto-optimal trade-off predictions using parameters that users can control. We study model training and validation using several parallel kernels and more complex workloads, including Algebraic Multigrid (AMG), the Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS), and Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics (LULESH). We demonstrate that we can train the models using as few as 12 runs, with prediction error of less than 10%. Our AMG results identify trade-off options that provide up to 45% improvement in energy efficiency for around 10% performance loss. We reduce the sample measurement time required for AMG by 90%, from 13 h to 74 min.


Author(s):  
Aleix Roca Nonell ◽  
Balazs Gerofi ◽  
Leonardo Bautista-Gomez ◽  
Dominique Martinet ◽  
Vicenç Beltran Querol ◽  
...  

2013 ◽  
Vol 305 (2) ◽  
pp. R164-R170 ◽  
Author(s):  
D. Xu ◽  
J. K. Shoemaker ◽  
A. P. Blaber ◽  
P. Arbeille ◽  
K. Fraser ◽  
...  

Limited data are available to describe the regulation of heart rate (HR) during sleep in spaceflight. Sleep provides a stable supine baseline during preflight Earth recordings for comparison of heart rate variability (HRV) over a wide range of frequencies using linear, complexity, and fractal indicators. The current study investigated the effect of long-duration spaceflight on HR and HRV during sleep in seven astronauts aboard the International Space Station for up to 6 mo. Measurements included electrocardiographic waveforms from Holter monitors and simultaneous movement records from accelerometers before, during, and after the flights. HR was unchanged inflight and elevated postflight [59.6 ± 8.9 beats per minute (bpm) compared with preflight 53.3 ± 7.3 bpm; P < 0.01]. Compared with preflight data, HRV indicators from both time-domain and power spectral analysis methods were diminished inflight from ultralow to high frequencies and partially recovered to preflight levels after landing. Inflight and postflight, complexity and fractal properties of HR were not different from preflight properties. Slow fluctuations (<0.04 Hz) in HR presented moderate correlations with movements during sleep, partially accounting for the reduction in HRV. In summary, substantial reduction in HRV was observed with linear, but not with complexity and fractal, methods of analysis. These results suggest that periodic elements that influence regulation of HR through reflex mechanisms are altered during sleep in spaceflight but that the underlying system complexity and fractal dynamics are not.


2021 ◽  
Vol 9 (6) ◽  
pp. 13-25
Author(s):  
Michail Angelopoulos ◽  
Yannis Pollalis

This research focuses on providing insights toward a solution for collecting, storing, analyzing, and visualizing data on customer energy consumption patterns. The data analysis part of our research provides the models for knowledge discovery that can be used to improve energy efficiency at both the producer and consumer ends. The study sets a new analytical framework for assessing the role of behavioral knowledge in energy efficiency, drawing on a wide range of case studies, experiments, research, and Information and Communication Technologies (ICT), combined with modern econometric methods and large-scale analytical data that take into account the characteristics of the study participants (household energy customers).

