Communicating Efficiently on Cluster-Based Remote Direct Memory Access (RDMA) over InfiniBand Protocol

2018 ◽  
Vol 8 (11) ◽  
pp. 2034
Author(s):  
Masoud Hemmatpour ◽  
Bartolomeo Montrucchio ◽  
Maurizio Rebaudengo

Distributed systems are commonly built under the assumption that the network is the primary bottleneck; however, this assumption no longer holds with the emergence of high-performance RDMA-enabled protocols in datacenters. Designing distributed applications over such protocols requires a fundamental rethinking of communication components in comparison with traditional protocols (i.e., TCP/IP). In this paper, communication paradigms in existing systems and possible new paradigms are investigated. The advantages and drawbacks of each paradigm are comprehensively analyzed and experimentally evaluated. The experimental results show that writing the requests to the server and reading the response delivers up to 10 times better performance than other communication paradigms. To further expand the investigation, the proposed communication paradigm was substituted into a real-world distributed application, improving its performance by up to seven times.
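The paradigm singled out in the results, in which the client writes a request into server memory and later reads the response back, can be illustrated without RDMA hardware. The following Python sketch only simulates that pattern with an in-process byte buffer and a polling server thread. The names `Region`, `rdma_write`, `rdma_read`, `server_loop` and `client_call`, and the slot layout, are hypothetical and are not taken from the paper or from any verbs API; nothing about performance should be inferred from it.

```python
import threading
import time

REQ_OFF, RESP_OFF = 0, 64  # hypothetical layout: [flag][length][payload] per slot


class Region:
    """Simulated server memory region that a peer accesses one-sidedly."""

    def __init__(self, size):
        self.buf = bytearray(size)
        self.lock = threading.Lock()  # stands in for atomic NIC access to memory

    def rdma_write(self, offset, data):
        # Simulated one-sided write: the server is not notified.
        with self.lock:
            self.buf[offset:offset + len(data)] = data

    def rdma_read(self, offset, length):
        # Simulated one-sided read.
        with self.lock:
            return bytes(self.buf[offset:offset + length])


def server_loop(region, stop, handler):
    """Server polls its own memory for a request flag and publishes a response."""
    while not stop.is_set():
        head = region.rdma_read(REQ_OFF, 2)  # local poll of flag and length
        if head[0] == 1:
            payload = region.rdma_read(REQ_OFF + 2, head[1])
            result = handler(payload)
            region.rdma_write(REQ_OFF, b"\x00")  # mark the request as consumed
            region.rdma_write(RESP_OFF + 1, bytes([len(result)]) + result)
            region.rdma_write(RESP_OFF, b"\x01")  # response flag is written last
        else:
            time.sleep(0.0005)


def client_call(region, payload, timeout=2.0):
    """Client writes the request, then polls the response slot with reads."""
    region.rdma_write(RESP_OFF, b"\x00")  # clear any stale response flag
    region.rdma_write(REQ_OFF + 1, bytes([len(payload)]) + payload)
    region.rdma_write(REQ_OFF, b"\x01")  # request flag is written last
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        head = region.rdma_read(RESP_OFF, 2)
        if head[0] == 1:
            return region.rdma_read(RESP_OFF + 2, head[1])
        time.sleep(0.0005)
    raise TimeoutError("no response written by server")


def demo():
    region = Region(128)
    stop = threading.Event()
    server = threading.Thread(target=server_loop, args=(region, stop, bytes.upper))
    server.start()
    try:
        return client_call(region, b"ping")
    finally:
        stop.set()
        server.join()
```

Calling `demo()` runs one request through the simulated region. The flag byte is written after the payload in both directions so that a poller never observes a flag without its data, which is the ordering concern this kind of design has to address.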

2021 ◽  
Vol 244 ◽  
pp. 07001
Author(s):  
Anatoliy Nyrkov ◽  
Konstantin Ianiushkin ◽  
Andrey Nyrkov ◽  
Yulia Romanova ◽  
Vagiz Gaskarov

Recent achievements in high-performance computing significantly narrow the performance gap between single-node and multi-node computing and open up opportunities for systems with remote shared memory. The combination of in-memory storage, remote direct memory access and remote calls requires rethinking how data is organized, protected and queried in distributed systems. The reviewed models let us implement new interpretations of distributed algorithms, allowing us to validate different approaches to avoiding race conditions and decreasing resource acquisition or synchronization time. In this paper, we describe the data model for mixed memory access with an analysis of optimized data structures. We also provide the results of experiments, which contain a performance comparison of data structures operating with different approaches, evaluate the limitations of these models, and show that the model does not always meet expectations. The purpose of this paper is to assist developers in designing data structures that will help to achieve architectural benefits or improve the design of existing distributed systems.


2014 ◽  
Vol 556-562 ◽  
pp. 4303-4308
Author(s):  
Hua Long Zhao

As the demand for higher image quality and greater processing capabilities grows, obtaining higher data bandwidth for on-chip processing is becoming an increasingly important issue. The DMA (Direct Memory Access) component, as the key element in stream-processing SoCs (Systems on Chip) [1], should be thoroughly researched and designed to satisfy the high data bandwidth requirements of processing units. In this paper, we introduce a scalable high-performance DMA architecture for complex SoCs to satisfy rigorous requirements for high sustained bandwidth and versatile functionality. Several techniques and structures are proposed. A state-of-the-art verification environment is built for our design to fully verify its functionality. At the end of the paper, the tape-out results are provided. The whole implementation has been silicon-proven to be functional and efficient.


1999 ◽  
Vol 7 (3-4) ◽  
pp. 275-287 ◽  
Author(s):  
Eric Eide ◽  
James L. Simister ◽  
Tim Stack ◽  
Jay Lepreau

Distributed applications are complex by nature, so it is essential that there be effective software development tools to aid in the construction of these programs. Commonplace “middleware” tools, however, often impose a tradeoff between programmer productivity and application performance. For instance, many CORBA IDL compilers generate code that is too slow for high-performance systems. More importantly, these compilers provide inadequate support for sophisticated patterns of communication. We believe that these problems can be overcome, thus making IDL compilers and similar middleware tools useful for a broader range of systems. To this end we have implemented Flick, a flexible and optimizing IDL compiler, and are using it to produce specialized high-performance code for complex distributed applications. Flick can produce specially “decomposed” stubs that encapsulate different aspects of communication in separate functions, thus providing application programmers with fine-grain control over all messages. The design of our decomposed stubs was inspired by the requirements of a particular distributed application called Khazana, and in this paper we describe our experience to date in refitting Khazana with Flick-generated stubs. We believe that the special IDL compilation techniques developed for Khazana will be useful in other applications with similar communication requirements.


2011 ◽  
Vol 58-60 ◽  
pp. 1560-1565
Author(s):  
Chou Chen Wang ◽  
Huei Shiung Lin ◽  
Feng Yu Liou ◽  
Ji De Hung

In this paper, we propose an embedded real-time MPEG-4 decoder based on the ADSP-BF527 Blackfin DSP. To meet the real-time requirement of MPEG-4 decoding, we modify and optimize the decoding modules. First, we analyze the number of operations for the various modules, and then use two buffer groups (BG) as a parallel decoding mechanism of broadcast transformation. Finally, we make use of direct memory access (DMA) to carry out program steps. Experimental results demonstrate that the proposed method can decode a CIF video with a reduction of approximately 40.3 MHz in core cycles. In addition, the playing rate of decoded frames increases from 3 fps to 25 fps when the PPI procedure is applied. The playing rate can exceed 30 fps when using QCIF video, so the proposed method achieves a real-time decoder and player.


2012 ◽  
Vol 6-7 ◽  
pp. 410-415
Author(s):  
Chi Wei Tung ◽  
Chou Chen Wang ◽  
Wan Ying Jhuang ◽  
Yi Chieh Tsai

In this paper, we propose an embedded real-time H.264 Baseline Profile (BP) decoder based on the ADSP-BF548 Blackfin processor. To meet the real-time requirement of H.264 decoding, we modify and optimize both the decoding modules and the code. First, we analyze the number of operations for the various modules using assembly code, and then use direct memory access (DMA) to run algorithm execution and data movement in parallel. Finally, we make use of two buffer groups (BG) as a parallel decoding mechanism of broadcast transformation. Experimental results demonstrate that the play rate can exceed 25 fps when using QCIF video. According to our tests, real-time H.264 BP decoding can be achieved with a 600 MHz DSP.
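The idea of overlapping data movement with algorithm execution by alternating between two buffers can be sketched in a few lines. The Python example below is only an illustration of generic ping-pong (double) buffering, assuming a background thread stands in for the DMA engine; the function name `double_buffered` and its parameters are hypothetical, and it does not reproduce the decoder or its buffer groups.

```python
import threading


def double_buffered(blocks, process):
    """Process each block while the next one is copied into the other buffer."""
    buffers = [None, None]
    results = []

    def fetch(slot, block):
        # Stands in for a DMA transfer from external memory into a local buffer.
        buffers[slot] = bytes(block)

    if not blocks:
        return results

    fetch(0, blocks[0])  # prime the first buffer before the loop starts
    for i in range(len(blocks)):
        cur = i % 2
        dma = None
        if i + 1 < len(blocks):
            # Start filling the idle buffer with the next block.
            dma = threading.Thread(target=fetch, args=(1 - cur, blocks[i + 1]))
            dma.start()
        results.append(process(buffers[cur]))  # compute on the ready buffer
        if dma is not None:
            dma.join()  # wait for the transfer to finish before swapping buffers
    return results
```

For example, `double_buffered([b"ab", b"cd", b"e"], len)` processes three blocks in order. In CPython the thread does not provide true hardware parallelism, so the sketch shows the schedule (start transfer, compute, wait, swap) rather than any speedup.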


The size of complex networks leads to long traversal times, which can be tackled by exploiting pervasive multi-core and many-core parallel hardware architectures. However, several factors make the design of efficient parallel traversal algorithms for graphs difficult: unstructured problems, data-driven computation, irregular memory access, poor locality, and low computing load. In this chapter, the authors introduce the synergy between Network Science and High Performance Computing and motivate the combined use of multi/many-core heterogeneous computing and Network Science techniques to tackle the above-mentioned challenges and to efficiently traverse the structure of massive real-world graphs.


Author(s):  
Florin Pop

This chapter presents a fault-tolerant framework for application scheduling in large-scale distributed systems (LSDS). Due to the specific characteristics and requirements of distributed systems, a good scheduling model should be dynamic. More specifically, it should adapt the scheduling decisions to resource state changes, which are commonly captured through monitoring. The scheduler and the monitor are two important middleware pieces that coordinate their actions to ensure the high-performance execution of distributed applications. The chapter presents and analyses an agent-based architecture for scheduling in large-scale distributed systems. User and resource management are then presented. The optimization schemes for scheduling consider a near-optimal algorithm for distributed scheduling, and the chapter presents the solution for scheduling optimization. The chapter also covers and explains the fault tolerance cases for Grid environments and describes two possible scenarios for the scheduling system.


Sensors ◽  
2020 ◽  
Vol 20 (5) ◽  
pp. 1358
Author(s):  
Rubén Delgado-Escaño ◽  
Francisco M. Castro ◽  
Julián R. Cózar ◽  
Manuel J. Marín-Jiménez ◽  
Nicolás Guil

Gait recognition is being employed as an effective approach to identify people without requiring subject collaboration. Current techniques for this task obtain high performance on existing datasets (usually more than 90% accuracy). However, those datasets are simple, as they contain only one subject in the scene at a time. This limits the extrapolation of the results to real-world conditions, where multiple subjects are usually present in the scene simultaneously, generating different types of occlusions and requiring better tracking methods and models trained to deal with those situations. Thus, with the aim of evaluating the more realistic and challenging situations that appear in scenarios with multiple subjects, we release a new framework (MuPeG) that generates augmented datasets with multiple subjects using existing datasets as input. In this way, it is not necessary to record and label new videos, since this is done automatically by our framework. In addition, based on the datasets generated by our framework, we propose an experimental methodology that describes how to use datasets with multiple subjects and which experiments are recommended. Moreover, we release the first experimental results using datasets with multiple subjects. In our case, we use augmented versions of the TUM-GAID and CASIA-B datasets obtained with our framework. On these augmented datasets the obtained accuracies are 54.8% and 42.3%, whereas on the original datasets (single subject) the same model achieved 99.7% and 98.0% for TUM-GAID and CASIA-B, respectively. The performance drop clearly shows that datasets with multiple subjects in the scene are much more difficult than the single-subject datasets reported in the literature. Thus, our proposed framework is able to generate useful datasets with multiple subjects that are more similar to real-life situations.
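The size of the reported performance drop follows directly from the four accuracies quoted in the abstract. The short Python snippet below only restates that arithmetic; the dictionary and function names are illustrative and the derived differences are not figures stated by the authors.

```python
# Accuracies (%) quoted in the abstract: (single subject, multiple subjects).
reported = {
    "TUM-GAID": (99.7, 54.8),
    "CASIA-B": (98.0, 42.3),
}


def accuracy_drop(single, multi):
    """Absolute drop in percentage points, rounded to one decimal place."""
    return round(single - multi, 1)


drops = {name: accuracy_drop(*pair) for name, pair in reported.items()}
```

With these inputs the drop is about 44.9 percentage points for TUM-GAID and about 55.7 percentage points for CASIA-B.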


2020 ◽  
Vol 23 (4) ◽  
pp. 2735-2751 ◽  
Author(s):  
Jerzy Proficz

Imbalanced process arrival patterns (PAPs) are ubiquitous in many parallel and distributed systems, especially in HPC ones. Collective operations, e.g. in MPI, are designed for equal process arrival times and are not optimized for deviations in arrival. We propose eight new PAP-aware algorithms for the scatter and gather operations. They are binomial or linear tree adaptations introducing additional process ordering and (in some cases) additional activities in a special background thread. The solution was implemented using one of the most popular open-source MPI-compliant libraries (OpenMPI) and evaluated in a typical HPC environment using a specially developed benchmark as well as a real application: FFT. The experimental results show a significant advantage of the proposed approach over the default OpenMPI implementation, with good scalability and high performance: FFT is accelerated by 16.7% in communication run time and by 3.3% in total application execution time.
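For readers unfamiliar with the binomial tree that these adaptations start from, the following Python sketch builds the round-by-round schedule of a plain binomial-tree gather and simulates it. It is a generic textbook construction, not any of the eight proposed algorithms. The optional `order` argument is a hypothetical hook showing where a process ordering (for instance, one derived from arrival times) could be plugged in; how the paper actually orders processes is not stated in the abstract.

```python
def binomial_gather_schedule(n, order=None):
    """Return rounds of (sender, receiver) pairs that gather data to order[0].

    `order` maps virtual tree positions to process ids (assumption: any
    permutation of range(n)); by default position i is process i.
    """
    order = list(range(n)) if order is None else list(order)
    rounds = []
    step = 1
    while step < n:
        pairs = []
        # Virtual positions that are odd multiples of `step` send in this round.
        for v in range(step, n, 2 * step):
            pairs.append((order[v], order[v - step]))
        rounds.append(pairs)
        step *= 2
    return rounds


def simulate_gather(n, order=None):
    """Run the schedule on toy data; each process starts with its own id."""
    data = {p: [p] for p in range(n)}
    for pairs in binomial_gather_schedule(n, order):
        for sender, receiver in pairs:
            data[receiver].extend(data.pop(sender))
    return data
```

With eight processes the gather completes in three rounds and the root holds every contribution. Passing a permutation such as `[2, 0, 3, 1]` places process 2 at the root and changes which processes communicate in each round, which is the kind of degree of freedom a process-ordering scheme can exploit.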

