Communicating Efficiently on Cluster-Based Remote Direct Memory Access (RDMA) over InfiniBand Protocol

Masoud Hemmatpour; Bartolomeo Montrucchio; Maurizio Rebaudengo

doi:10.3390/app8112034

Applied Sciences ◽

10.3390/app8112034 ◽

2018 ◽

Vol 8 (11) ◽

pp. 2034

Author(s):

Masoud Hemmatpour ◽

Bartolomeo Montrucchio ◽

Maurizio Rebaudengo

Keyword(s):

Distributed Systems ◽

Real World ◽

High Performance ◽

Direct Memory Access ◽

Distributed Applications ◽

Memory Access ◽

Experimental Results ◽

Distributed Application ◽

Communication Paradigm

Distributed systems are commonly built under the assumption that the network is the primary bottleneck; however, this assumption no longer holds with the emergence of high-performance RDMA-enabled protocols in datacenters. Designing distributed applications over such protocols requires a fundamental rethinking of communication components in comparison with traditional protocols (i.e., TCP/IP). In this paper, communication paradigms in existing systems and possible new paradigms are investigated. The advantages and drawbacks of each paradigm are comprehensively analyzed and experimentally evaluated. The experimental results show that writing requests to the server and reading the response delivers up to 10 times better performance than other communication paradigms. To further expand the investigation, the proposed communication paradigm was substituted into a real-world distributed application, improving its performance by up to seven times.
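The "write the request, read the response" paradigm named in the abstract can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the authors' implementation: MemoryRegion, remote_write, remote_read, the slot layout, and the length-prefixed message format are all hypothetical stand-ins for a registered memory region and one-sided RDMA WRITE/READ operations, and the server is polled inline rather than running concurrently.

```python
import struct


class MemoryRegion:
    """Hypothetical stand-in for a registered RDMA memory region on the server."""

    def __init__(self, size):
        self.buf = bytearray(size)

    def remote_write(self, offset, data):
        # Models a one-sided RDMA WRITE issued by the client.
        self.buf[offset:offset + len(data)] = data

    def remote_read(self, offset, length):
        # Models a one-sided RDMA READ issued by the client.
        return bytes(self.buf[offset:offset + length])


# Assumed layout: one 64-byte request slot followed by one 64-byte response slot.
REQ_OFF, RESP_OFF, SLOT = 0, 64, 64


def pack(payload):
    # Length-prefixed message; a zero length means "slot empty".
    return struct.pack("<I", len(payload)) + payload


def unpack(raw):
    (n,) = struct.unpack_from("<I", raw, 0)
    return raw[4:4 + n] if n else None


def server_poll(region, handler):
    # The server polls its own memory for a request and stores the response locally.
    req = unpack(bytes(region.buf[REQ_OFF:REQ_OFF + SLOT]))
    if req is None:
        return False
    resp = pack(handler(req))
    region.buf[RESP_OFF:RESP_OFF + len(resp)] = resp
    region.buf[REQ_OFF:REQ_OFF + 4] = b"\x00" * 4  # mark the request slot free
    return True


def client_call(region, payload, poll_server):
    region.remote_write(RESP_OFF, b"\x00" * 4)   # clear any stale response
    region.remote_write(REQ_OFF, pack(payload))  # write the request to the server
    while True:
        poll_server()  # stands in for the concurrently running server
        resp = unpack(region.remote_read(RESP_OFF, SLOT))  # read the response
        if resp is not None:
            return resp
```

For example, calling client_call(region, b"ping", lambda: server_poll(region, bytes.upper)) on a 128-byte region returns the upper-cased payload. Payloads are assumed to fit in a slot (at most 60 bytes here); a real design would also need memory registration, ordering guarantees, and a way to bound polling.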

Data structures access model for remote shared memory

E3S Web of Conferences ◽

10.1051/e3sconf/202124407001 ◽

2021 ◽

Vol 244 ◽

pp. 07001

Author(s):

Anatoliy Nyrkov ◽

Konstantin Ianiushkin ◽

Andrey Nyrkov ◽

Yulia Romanova ◽

Vagiz Gaskarov

Keyword(s):

Shared Memory ◽

Data Structures ◽

Data Model ◽

High Performance ◽

Direct Memory Access ◽

Performance Comparison ◽

Memory Access ◽

Memory Storage ◽

Race Conditions ◽

Performance Computing

Recent achievements in high-performance computing have significantly narrowed the performance gap between single-node and multi-node computing and opened up opportunities for systems with remote shared memory. The combination of in-memory storage, remote direct memory access, and remote calls requires rethinking how data is organized, protected, and queried in distributed systems. The reviewed models let us implement new interpretations of distributed algorithms, allowing us to validate different approaches to avoiding race conditions and decreasing resource acquisition or synchronization time. In this paper, we describe a data model for mixed memory access, with an analysis of optimized data structures. We also provide experimental results, which compare the performance of data structures operating under different approaches, evaluate the limitations of these models, and show that the model does not always meet expectations. The purpose of this paper is to assist developers in designing data structures that help achieve architectural benefits or improve the design of existing distributed systems.

Design and Verification of a Scalable Enhanced High Performance DMA Architecture for Complex SoC

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.556-562.4303 ◽

2014 ◽

Vol 556-562 ◽

pp. 4303-4308

Author(s):

Hua Long Zhao

Keyword(s):

Image Quality ◽

High Performance ◽

Stream Processing ◽

Direct Memory Access ◽

System On Chip ◽

Memory Access ◽

High Data ◽

Bandwidth Requirement ◽

Chip Processing ◽

On Chip

As the demand for higher image quality and greater processing capability grows, obtaining higher data bandwidth for on-chip processing is becoming an increasingly important issue. The DMA (Direct Memory Access) component, as the key element in a stream-processing SoC (System on Chip) [1], must be carefully researched and designed to satisfy the high data bandwidth requirements of the processing units. In this paper, we introduce a scalable high-performance DMA architecture for complex SoCs that satisfies rigorous requirements for high sustained bandwidth and versatile functionality. Several techniques and structures are proposed. A state-of-the-art verification environment is built to fully verify the design's functionality. At the end of the paper, the tape-out results are provided. The whole implementation has been silicon-proven to be functional and efficient.

Flexible IDL Compilation for Complex Communication Patterns

Scientific Programming ◽

10.1155/1999/926915 ◽

1999 ◽

Vol 7 (3-4) ◽

pp. 275-287 ◽

Cited By ~ 1

Author(s):

Eric Eide ◽

James L. Simister ◽

Tim Stack ◽

Jay Lepreau

Keyword(s):

High Performance ◽

Distributed Applications ◽

Application Performance ◽

Distributed Application ◽

Fine Grain ◽

Development Tools ◽

Patterns Of Communication ◽

High Performance Systems ◽

Software Development Tools ◽

Compilation Techniques

Distributed applications are complex by nature, so it is essential that there be effective software development tools to aid in the construction of these programs. Commonplace “middleware” tools, however, often impose a tradeoff between programmer productivity and application performance. For instance, many CORBA IDL compilers generate code that is too slow for high‐performance systems. More importantly, these compilers provide inadequate support for sophisticated patterns of communication. We believe that these problems can be overcome, thus making IDL compilers and similar middleware tools useful for a broader range of systems. To this end we have implemented Flick, a flexible and optimizing IDL compiler, and are using it to produce specialized high‐performance code for complex distributed applications. Flick can produce specially “decomposed” stubs that encapsulate different aspects of communication in separate functions, thus providing application programmers with fine‐grain control over all messages. The design of our decomposed stubs was inspired by the requirements of a particular distributed application called Khazana, and in this paper we describe our experience to date in refitting Khazana with Flick‐generated stubs. We believe that the special IDL compilation techniques developed for Khazana will be useful in other applications with similar communication requirements.

OpenCL-enabled High Performance Direct Memory Access for GPU-FPGA Cooperative Computation

Proceedings of the HPC Asia 2019 Workshops on ZZZ - HPCAsia'19 Workshops ◽

10.1145/3317576.3317581 ◽

2019 ◽

Author(s):

Ryohei Kobayashi ◽

Norihisa Fujita ◽

Yoshiki Yamaguchi ◽

Taisuke Boku

Keyword(s):

High Performance ◽

Direct Memory Access ◽

Memory Access ◽

Cooperative Computation

Embedded Real-Time MPEG-4 Decoder Based on ADSP-BF527

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.58-60.1560 ◽

2011 ◽

Vol 58-60 ◽

pp. 1560-1565

Author(s):

Chou Chen Wang ◽

Huei Shiung Lin ◽

Feng Yu Liou ◽

Ji De Hung

Keyword(s):

Real Time ◽

Direct Memory Access ◽

Memory Access ◽

Experimental Results ◽

Time Requirement ◽

The Real ◽

Parallel Decoding

In this paper, we propose an embedded real-time MPEG-4 decoder based on the ADSP-BF527 Blackfin DSP. To meet the real-time requirement of MPEG-4 decoding, we modify and optimize the decoding modules. First, we analyze the number of operations for the various modules, and then use two buffer groups (BG) as a parallel decoding mechanism of broadcast transformation. Finally, we make use of direct memory access (DMA) to carry out program steps. Experimental results demonstrate that the proposed method can decode a CIF video with a reduction of approximately 40.3 MHz in core cycles. In addition, the playing rate of decoded frames increases from 3 fps to 25 fps when the PPI procedure is applied. The playing rate can exceed 30 fps when using QCIF video, so the proposed method achieves a real-time decoder and player.

A Real-Time H.264 BP Decoder Based on ADSP-BF548

Advanced Engineering Forum ◽

10.4028/www.scientific.net/aef.6-7.410 ◽

2012 ◽

Vol 6-7 ◽

pp. 410-415

Author(s):

Chi Wei Tung ◽

Chou Chen Wang ◽

Wan Ying Jhuang ◽

Yi Chieh Tsai

Keyword(s):

Real Time ◽

Direct Memory Access ◽

Memory Access ◽

Experimental Results ◽

Time Requirement ◽

Assembly Code ◽

Data Movement ◽

The Real ◽

Parallel Decoding

In this paper, we propose an embedded real-time H.264 Baseline Profile (BP) decoder based on the ADSP-BF548 Blackfin processor. To meet the real-time requirement of H.264 decoding, we modify and optimize the decoding modules and code. First, we analyze the number of operations for the various modules using assembly code, and then use direct memory access (DMA) to run algorithm execution and data movement in parallel. Finally, we use two buffer groups (BG) as a parallel decoding mechanism of broadcast transformation. Experimental results demonstrate that the play rate can exceed 25 fps when using QCIF video. According to some tests, real-time H.264 BP decoding can be achieved with a 600 MHz DSP.
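The overlap of data movement and algorithm execution described in this abstract is commonly realized with two alternating ("ping-pong") buffers. The following is a minimal sketch of that general technique, not the authors' decoder: a background thread plays the role of the DMA engine, and ping_pong_process, mover, and the queue-based handshake are hypothetical names chosen for illustration.

```python
import queue
import threading


def ping_pong_process(blocks, process):
    """Overlap data movement and computation using two buffers."""
    buffers = [None, None]   # the two buffers, filled and consumed alternately
    free = queue.Queue()     # indices of buffers the mover may fill
    ready = queue.Queue()    # indices of buffers that are ready to process
    free.put(0)
    free.put(1)

    def mover():
        # Stands in for the DMA engine: copies each input block into a free buffer.
        for block in blocks:
            idx = free.get()
            buffers[idx] = list(block)
            ready.put(idx)
        ready.put(None)      # sentinel: no more data

    worker = threading.Thread(target=mover)
    worker.start()

    results = []
    while True:
        idx = ready.get()
        if idx is None:
            break
        results.append(process(buffers[idx]))
        free.put(idx)        # hand the buffer back to the mover
    worker.join()
    return results
```

While the main thread processes one buffer, the mover can already be filling the other, which is the source of the overlap. On a DSP the copy would be issued to DMA hardware and completion signalled by an interrupt or status flag rather than by a thread and queues.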

The Need for HPC Computing in Network Science

Advances in Computer and Electrical Engineering - Creativity in Load-Balance Schemes for Multi/Many-Core Heterogeneous Graph Computing ◽

10.4018/978-1-5225-3799-1.ch001 ◽

2018 ◽

pp. 1-29

Keyword(s):

High Performance Computing ◽

Real World ◽

High Performance ◽

Heterogeneous Computing ◽

Network Science ◽

Memory Access ◽

Combined Use ◽

Hardware Architectures ◽

Many Core ◽

Performance Computing

The size of complex networks leads to long traversal times, which can be tackled by exploiting pervasive multi-core and many-core parallel hardware architectures. However, several factors make the design of efficient parallel traversal algorithms for graphs difficult: unstructured problems, data-driven computation, irregular memory access, poor locality, and low computing load. In this chapter, the authors introduce the synergy between Network Science and High Performance Computing and motivate the combined use of multi/many-core heterogeneous computing and Network Science techniques to tackle the above-mentioned challenges and to efficiently traverse the structure of massive real-world graphs.

A Fault Tolerant Decentralized Scheduling in Large Scale Distributed Systems

Handbook of Research on P2P and Grid Systems for Service-Oriented Computing ◽

10.4018/978-1-61520-686-5.ch024 ◽

2010 ◽

pp. 566-588 ◽

Cited By ~ 2

Author(s):

Florin Pop

Keyword(s):

Distributed Systems ◽

High Performance ◽

Large Scale ◽

Fault Tolerant ◽

Optimal Algorithm ◽

Distributed Applications ◽

Distributed Scheduling ◽

Agent Based ◽

Decentralized Scheduling ◽

Optimization Schemes

This chapter presents a fault-tolerant framework for application scheduling in large scale distributed systems (LSDS). Due to the specific characteristics and requirements of distributed systems, a good scheduling model should be dynamic. More specifically, it should adapt the scheduling decisions to resource state changes, which are commonly captured through monitoring. The scheduler and the monitor are two important middleware pieces that correlate their actions to ensure the high-performance execution of distributed applications. The chapter presents and analyses an agent-based architecture for scheduling in large scale distributed systems. Then user and resource management are presented. Optimization schemes for scheduling consider the near-optimal algorithm for distributed scheduling. The chapter presents the solution for scheduling optimization. The chapter covers and explains the fault tolerance cases for Grid environments and describes two possible scenarios for the scheduling system.

MuPeG—The Multiple Person Gait Framework

Sensors ◽

10.3390/s20051358 ◽

2020 ◽

Vol 20 (5) ◽

pp. 1358

Author(s):

Rubén Delgado-Escaño ◽

Francisco M. Castro ◽

Julián R. Cózar ◽

Manuel J. Marín-Jiménez ◽

Nicolás Guil

Keyword(s):

Real World ◽

High Performance ◽

Gait Recognition ◽

Real Life ◽

Experimental Results ◽

Experimental Methodology ◽

Single Subject ◽

Different Types ◽

New Framework

Gait recognition is being employed as an effective approach to identify people without requiring subject collaboration. Nowadays, techniques developed for this task obtain high performance on current datasets (usually more than 90% accuracy). However, those datasets are simple, as they contain only one subject in the scene at a time. This fact limits the extrapolation of the results to real-world conditions where, usually, multiple subjects are simultaneously present in the scene, generating different types of occlusions and requiring better tracking methods and models trained to deal with those situations. Thus, with the aim of evaluating the more realistic and challenging situations that appear in scenarios with multiple subjects, we release a new framework (MuPeG) that generates augmented datasets with multiple subjects using existing datasets as input. In this way, it is not necessary to record and label new videos, since this is done automatically by our framework. In addition, based on the use of datasets generated by our framework, we propose an experimental methodology that describes how to use datasets with multiple subjects and the recommended experiments to perform. Moreover, we release the first experimental results using datasets with multiple subjects. In our case, we use augmented versions of the TUM-GAID and CASIA-B datasets obtained with our framework. On these augmented datasets the obtained accuracies are 54.8% and 42.3%, whereas on the original datasets (single subject) the same model achieved 99.7% and 98.0% for TUM-GAID and CASIA-B, respectively. The performance drop clearly shows that datasets with multiple subjects in the scene are much more difficult than those reported in the literature for a single subject. Thus, our proposed framework is able to generate useful datasets with multiple subjects that are more similar to real-life situations.

Process arrival pattern aware algorithms for acceleration of scatter and gather operations

Cluster Computing ◽

10.1007/s10586-019-03040-x ◽

2020 ◽

Vol 23 (4) ◽

pp. 2735-2751 ◽

Cited By ~ 1

Author(s):

Jerzy Proficz

Keyword(s):

Distributed Systems ◽

Open Source ◽

Execution Time ◽

High Performance ◽

Experimental Results ◽

Arrival Times ◽

Additional Process ◽

Process Arrival Pattern ◽

Application Execution ◽

Arrival Pattern

Imbalanced process arrival patterns (PAPs) are ubiquitous in many parallel and distributed systems, especially in HPC ones. Collective operations, e.g. in MPI, are designed for equal process arrival times and are not optimized for deviations in those times. We propose eight new PAP-aware algorithms for the scatter and gather operations. They are binomial or linear tree adaptations that introduce additional process ordering and (in some cases) additional activities in a special background thread. The solution was implemented using one of the most popular open source MPI-compliant libraries (OpenMPI) and evaluated in a typical HPC environment using a specially developed benchmark as well as a real application: FFT. The experimental results show a significant advantage of the proposed approach over the default OpenMPI implementation, with good scalability and high performance; the FFT acceleration is 16.7% for the communication run time and 3.3% for the total application execution time.
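The benefit of ordering processes by arrival time can be illustrated with a toy model of a linear (flat) gather. This sketch is not one of the eight algorithms proposed in the paper and does not use MPI: it assumes that arrival times are known in advance and that each receive costs a fixed recv_cost, and the function names are hypothetical.

```python
def linear_gather_finish_time(arrivals, order, recv_cost):
    """Time at which the root finishes receiving when it serves senders in `order`."""
    t = 0.0
    for rank in order:
        # The root cannot start a receive before the sender has arrived.
        t = max(t, arrivals[rank]) + recv_cost
    return t


def arrival_aware_order(arrivals):
    # Serve senders in order of arrival time (ties broken by rank).
    return sorted(range(len(arrivals)), key=lambda r: (arrivals[r], r))


# Sender 0 arrives late; serving in rank order makes the root wait for it first.
arrivals = [5.0, 0.0, 1.0, 0.5]
rank_order = list(range(len(arrivals)))
pap_order = arrival_aware_order(arrivals)

rank_finish = linear_gather_finish_time(arrivals, rank_order, 1.0)
pap_finish = linear_gather_finish_time(arrivals, pap_order, 1.0)
```

In this model the rank-ordered gather stalls on the late sender before handling anyone else, while the arrival-ordered gather handles the early senders during the wait. In a real system arrival times would have to be observed or estimated at run time, which this sketch does not address.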
