Transparent Control Flow Transfer between CPU and Accelerators for HPC

Daniel Granhão; João Canas Ferreira

doi:10.3390/electronics10040406

Electronics ◽

10.3390/electronics10040406 ◽

2021 ◽

Vol 10 (4) ◽

pp. 406

Author(s):

Daniel Granhão ◽

João Canas Ferreira

Keyword(s):

High Performance ◽

Small Time ◽

Control Flow ◽

Software Applications ◽

Complex Process ◽

Function Call ◽

Specialized Hardware ◽

The Cost ◽

Software Profiling ◽

Flow Transfer

Heterogeneous platforms with FPGAs have started to be employed in the High-Performance Computing (HPC) field to improve performance and overall efficiency. These platforms allow the use of specialized hardware to accelerate software applications, but require the software to be adapted in what can be a prolonged and complex process. The main goal of this work is to describe and evaluate mechanisms that can transparently transfer the control flow between CPU and FPGA within the scope of HPC. Combining such a mechanism with transparent software profiling and accelerator configuration could lead to an automatic way of accelerating regular applications. In this work, a mechanism based on the ptrace system call is proposed, and its performance on the Intel Xeon+FPGA platform is evaluated. The feasibility of the proposed approach is demonstrated by a working prototype that performs the transparent control flow transfer of any function call to a matching hardware accelerator. This approach is more general than shared library interposition at the cost of a small time overhead in each accelerator use (about 1.3 ms in the prototype implementation).
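The reported per-use overhead suggests a simple break-even check for a single call. The sketch below is only illustrative arithmetic based on the 1.3 ms figure quoted in the abstract; the function names and the timing inputs are hypothetical and are not part of the authors' prototype.

```python
# Per-call control-flow transfer overhead quoted for the prototype (about 1.3 ms).
TRANSFER_OVERHEAD_S = 1.3e-3


def offload_pays_off(cpu_time_s: float, accel_time_s: float,
                     overhead_s: float = TRANSFER_OVERHEAD_S) -> bool:
    """Return True when accelerator time plus transfer overhead beats CPU time."""
    return accel_time_s + overhead_s < cpu_time_s


def min_speedup_needed(cpu_time_s: float,
                       overhead_s: float = TRANSFER_OVERHEAD_S) -> float:
    """Smallest accelerator speedup that still breaks even for one call.

    Returns infinity when the call is not longer than the overhead itself,
    because no speedup can then recover the transfer cost.
    """
    budget = cpu_time_s - overhead_s
    if budget <= 0:
        return float("inf")
    return cpu_time_s / budget
```

Under these assumptions, a function that runs for 10 ms on the CPU is worth offloading even to a modest accelerator, whereas a function that runs for less than 1.3 ms can never benefit.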

Buffer Placement and Sizing for High-Performance Dataflow Circuits

ACM Transactions on Reconfigurable Technology and Systems ◽

10.1145/3477053 ◽

2022 ◽

Vol 15 (1) ◽

pp. 1-32

Author(s):

Lana Josipović ◽

Shabnam Sheikhha ◽

Andrea Guerrieri ◽

Paolo Ienne ◽

Jordi Cortadella

Keyword(s):

Performance Optimization ◽

Optimization Model ◽

High Performance ◽

Control Flow ◽

High Level Synthesis ◽

Software Applications ◽

Marked Graphs ◽

Variable Latency ◽

High Level ◽

Strong Contrast

Commercial high-level synthesis tools typically produce statically scheduled circuits. Yet, effective C-to-circuit conversion of arbitrary software applications calls for dataflow circuits, as they can handle efficiently variable latencies (e.g., caches), unpredictable memory dependencies, and irregular control flow. Dataflow circuits exhibit an unconventional property: registers (usually referred to as “buffers”) can be placed anywhere in the circuit without changing its semantics, in strong contrast to what happens in traditional datapaths. Yet, although functionally irrelevant, this placement has a significant impact on the circuit’s timing and throughput. In this work, we show how to strategically place buffers into a dataflow circuit to optimize its performance. Our approach extracts a set of choice-free critical loops from arbitrary dataflow circuits and relies on the theory of marked graphs to optimize the buffer placement and sizing. Our performance optimization model supports important high-level synthesis features such as pipelined computational units, units with variable latency and throughput, and if-conversion. We demonstrate the performance benefits of our approach on a set of dataflow circuits obtained from imperative code.
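The link between marked graphs and throughput can be illustrated with a standard result: in a choice-free loop, throughput is bounded by the number of tokens in the loop divided by its total latency. The sketch below applies that bound to a hand-written list of loops. It is not the paper's optimization model, and the `(tokens, latency)` loop description is an assumed input format.

```python
from fractions import Fraction


def loop_throughput(tokens: int, latency: int) -> Fraction:
    """Throughput bound of one loop: tokens per cycle of latency, capped at 1."""
    if latency <= 0:
        raise ValueError("latency must be positive")
    return min(Fraction(1), Fraction(tokens, latency))


def circuit_throughput(loops: list[tuple[int, int]]) -> Fraction:
    """The slowest loop limits the whole circuit.

    `loops` is a list of (tokens, latency) pairs, one per critical loop.
    """
    return min(loop_throughput(tokens, latency) for tokens, latency in loops)


# Two loops: one token over 2 cycles, and two tokens over 3 cycles.
print(circuit_throughput([(1, 2), (2, 3)]))  # 1/2
```

In this simplified view, registers that add cycles to a loop lower its bound, which is one reason why a functionally irrelevant placement can still change performance.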

Scheduling Data Intensive Scientific Workflows in Cloud Environment Using Nature Inspired Algorithms

Nature-Inspired Algorithms for Big Data Frameworks - Advances in Computational Intelligence and Robotics ◽

10.4018/978-1-5225-5852-1.ch008 ◽

2019 ◽

pp. 196-217 ◽

Cited By ~ 3

Author(s):

Shikha Mehta ◽

Parmeet Kaur

Keyword(s):

High Performance ◽

Heuristic Algorithms ◽

Control Flow ◽

Grey Wolf ◽

Data Intensive ◽

Np Hard Problem ◽

Shuffled Frog Leaping ◽

Nature Inspired Algorithms ◽

The Cost ◽

Performance Computing

Workflows are a commonly used model to describe applications consisting of computational tasks with data or control flow dependencies. They are used in domains such as bioinformatics, astronomy, and physics for data-driven scientific applications. Executing data-intensive workflow applications in a reasonable amount of time demands a high-performance computing environment. Cloud computing is a way of purchasing computing resources on demand through virtualization technologies. It provides the infrastructure to build and run workflow applications, which is called 'Infrastructure as a Service'. However, workflows must be scheduled on the cloud in a way that reduces the cost of leasing resources. Scheduling tasks on resources is an NP-hard problem, and meta-heuristic algorithms are an obvious choice for solving it. This chapter presents the application of three nature-inspired algorithms (particle swarm optimization, the shuffled frog leaping algorithm, and the grey wolf optimization algorithm) to the workflow scheduling problem on the cloud. Simulation results prove the efficacy of the suggested algorithms.
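Whatever metaheuristic is used, it needs a fitness function that scores one candidate assignment of tasks to leased machines. The sketch below is one possible scoring function under a deliberately simple, assumed cost model (each VM is billed only for its busy time, and data transfer is ignored); the names and the model are illustrative and are not taken from the chapter.

```python
def evaluate_schedule(runtimes, deps, assignment, speeds, prices):
    """Return (makespan, lease_cost) for one task-to-VM assignment.

    runtimes:   {task: base runtime}, listed in topological order
    deps:       {task: [predecessor tasks]}
    assignment: {task: vm name}
    speeds:     {vm name: relative speed}
    prices:     {vm name: price per unit of busy time}
    """
    finish = {}
    vm_free = {vm: 0.0 for vm in speeds}   # time at which each VM becomes idle
    vm_busy = {vm: 0.0 for vm in speeds}   # accumulated busy time per VM
    for task, base in runtimes.items():
        vm = assignment[task]
        # A task starts once its predecessors are done and its VM is idle.
        ready = max((finish[p] for p in deps.get(task, [])), default=0.0)
        start = max(ready, vm_free[vm])
        duration = base / speeds[vm]
        finish[task] = start + duration
        vm_free[vm] = finish[task]
        vm_busy[vm] += duration
    makespan = max(finish.values())
    cost = sum(vm_busy[vm] * prices[vm] for vm in speeds)
    return makespan, cost


runtimes = {"a": 4.0, "b": 2.0, "c": 2.0}
deps = {"c": ["a", "b"]}
speeds = {"slow": 1.0, "fast": 2.0}
prices = {"slow": 1.0, "fast": 3.0}
print(evaluate_schedule(runtimes, deps, {"a": "fast", "b": "slow", "c": "fast"},
                        speeds, prices))  # (3.0, 11.0)
```

A particle, frog, or wolf in the respective algorithm would encode one `assignment`, and the search would minimize some combination of the two returned values.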

DDT: A Research Tool for Automatic Data Distribution in High Performance Fortran

Scientific Programming ◽

10.1155/1997/780152 ◽

1997 ◽

Vol 6 (1) ◽

pp. 73-94 ◽

Cited By ~ 4

Author(s):

Eduard Ayguadé ◽

Jordi Garcia ◽

Mercé Gironès ◽

M. Luz Grande ◽

Jesús Labarta

Keyword(s):

High Performance ◽

Data Distribution ◽

Control Flow ◽

Research Tool ◽

Automatic Data ◽

Computation Cost ◽

Data Movement ◽

High Performance Fortran ◽

Automatic Data Distribution ◽

The Cost

This article describes the main features and implementation of our automatic data distribution research tool. The tool (DDT) accepts programs written in Fortran 77 and generates High Performance Fortran (HPF) directives to map arrays onto the memories of the processors and parallelize loops, and executable statements to remap these arrays. DDT works by identifying a set of computational phases (procedures and loops). The algorithm builds a search space of candidate solutions for these phases which is explored looking for the combination that minimizes the overall cost; this cost includes data movement cost and computation cost. The movement cost reflects the cost of accessing remote data during the execution of a phase and the remapping costs that have to be paid in order to execute the phase with the selected mapping. The computation cost includes the cost of executing a phase in parallel according to the selected mapping and the owner computes rule. The tool supports interprocedural analysis and uses control flow information to identify how phases are sequenced during the execution of the application.
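The search for the cheapest combination of mappings can be pictured as a shortest-path problem over phases. The sketch below assumes a purely linear sequence of phases, folds remote-access cost into each phase's cost, and uses made-up numbers; DDT's actual search space and cost model are richer than this, and all names here are hypothetical.

```python
def best_mappings(phase_costs, remap_cost):
    """Pick one mapping per phase so that total cost is minimal.

    phase_costs: list with one {mapping: cost of running the phase} per phase
    remap_cost:  function (previous mapping, next mapping) -> remapping cost
    Returns (total cost, list of chosen mappings).
    """
    # best[m] = cheapest (cost, path) that ends the current phase in mapping m
    best = {m: (c, [m]) for m, c in phase_costs[0].items()}
    for phase in phase_costs[1:]:
        nxt = {}
        for m, c in phase.items():
            prev_cost, prev_path = min(
                ((pc + remap_cost(pm, m), pp) for pm, (pc, pp) in best.items()),
                key=lambda item: item[0],
            )
            nxt[m] = (prev_cost + c, prev_path + [m])
        best = nxt
    return min(best.values(), key=lambda item: item[0])


phases = [
    {"row": 10, "col": 30},
    {"row": 40, "col": 10},
    {"row": 12, "col": 30},
]
cost, plan = best_mappings(phases, lambda a, b: 0 if a == b else 5)
print(cost, plan)  # 42 ['row', 'col', 'row']
```

In this example remapping twice is cheaper than keeping a single distribution throughout, which is the kind of trade-off between movement and computation cost that the abstract describes.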

Adaptive Precision Block-Jacobi for High Performance Preconditioning in the Ginkgo Linear Algebra Software

ACM Transactions on Mathematical Software ◽

10.1145/3441850 ◽

2021 ◽

Vol 47 (2) ◽

pp. 1-28

Author(s):

Goran Flegar ◽

Hartwig Anzt ◽

Terry Cojean ◽

Enrique S. Quintana-Ortí

Keyword(s):

Linear Algebra ◽

Graphics Processing Units ◽

High Performance ◽

Numerical Algorithms ◽

Mixed Precision ◽

Before And After ◽

Memory Accesses ◽

Specialized Hardware ◽

The Individual ◽

Graphics Processing

The use of mixed precision in numerical algorithms is a promising strategy for accelerating scientific applications. In particular, the adoption of specialized hardware and data formats for low-precision arithmetic in high-end GPUs (graphics processing units) has motivated numerous efforts aiming at carefully reducing the working precision in order to speed up the computations. For algorithms whose performance is bound by the memory bandwidth, the idea of compressing their data before (and after) memory accesses has received considerable attention. One idea is to store an approximate operator, such as a preconditioner, in lower than working precision, hopefully without impacting the algorithm output. We realize the first high-performance implementation of an adaptive precision block-Jacobi preconditioner which selects the precision format used to store the preconditioner data on-the-fly, taking into account the numerical properties of the individual preconditioner blocks. We implement the adaptive block-Jacobi preconditioner as production-ready functionality in the Ginkgo linear algebra library, considering not only the precision formats that are part of the IEEE standard, but also customized formats which optimize the length of the exponent and significand to the characteristics of the preconditioner blocks. Experiments run on a state-of-the-art GPU accelerator show that our implementation offers attractive runtime savings.
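A per-block precision choice of this kind can be sketched as a rule that compares each block's condition number against the unit roundoff of the candidate formats. The rule, the threshold `tau`, and the function name below are assumptions made for illustration; they are not Ginkgo's API and not necessarily the exact criterion used by the authors.

```python
# Unit roundoff (round to nearest) of the IEEE formats considered here.
UNIT_ROUNDOFF = {
    "fp16": 2.0 ** -11,
    "fp32": 2.0 ** -24,
    "fp64": 2.0 ** -53,
}


def select_precision(condition_number: float, tau: float = 1e-2) -> str:
    """Return the cheapest format whose expected error stays below tau.

    Assumed rule: a block may be stored in a format with unit roundoff u
    when condition_number * u <= tau. Falls back to fp64 otherwise.
    """
    for fmt in ("fp16", "fp32", "fp64"):
        if condition_number * UNIT_ROUNDOFF[fmt] <= tau:
            return fmt
    return "fp64"


# Condition numbers of three hypothetical diagonal blocks.
for kappa in (1e1, 1e4, 1e8):
    print(kappa, select_precision(kappa))
# 10.0 fp16
# 10000.0 fp32
# 100000000.0 fp64
```

Well-conditioned blocks end up in a compact format, which reduces memory traffic, while ill-conditioned blocks keep full working precision.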

Based on Numerical Simulation of High-Performance Parallel Machine Muffler Experimental Calibration

Advanced Materials Research ◽

10.4028/www.scientific.net/amr.718-720.1645 ◽

2013 ◽

Vol 718-720 ◽

pp. 1645-1650

Author(s):

Gen Yin Cheng ◽

Sheng Chen Yu ◽

Zhi Yong Wei ◽

Shao Jie Chen ◽

You Cheng

Keyword(s):

Numerical Simulation ◽

Finite Element ◽

Boundary Element ◽

Message Passing ◽

High Performance ◽

Message Passing Interface ◽

Parallel Machine ◽

Simulation Software ◽

Experimental Calibration ◽

The Cost

Commonly used commercial simulation software such as SYSNOISE and ANSYS runs on a single machine (it cannot run directly on a parallel machine) when the finite element and boundary element methods are used to simulate muffler performance. Because of the large amount of numerical computation involved, it can take more than ten days, sometimes even twenty, to work out an exact solution. To reduce the cost of numerical simulation, a high-performance parallel machine was built from 32 commercial computers, and the finite element and boundary element simulation software was transformed into a program that can run in an MPI (Message Passing Interface) parallel environment. The data obtained from the simulation experiment demonstrate that the numerical simulation results are good and that the computing speed of the high-performance parallel machine is 25 to 30 times that of a microcomputer.
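The reported speedup can be put in perspective by computing the parallel efficiency. The short calculation below assumes that the reference microcomputer is comparable to one of the 32 nodes, which the abstract does not state.

```python
def parallel_efficiency(speedup: float, nodes: int) -> float:
    """Fraction of ideal linear speedup that is actually achieved."""
    return speedup / nodes


NODES = 32  # number of commercial computers in the parallel machine
for speedup in (25, 30):
    print(speedup, parallel_efficiency(speedup, NODES))
# 25 0.78125
# 30 0.9375
```

Under that assumption, the machine reaches roughly 78 to 94 percent of ideal linear scaling.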

Insights on Cost Estimation Methods and its Uses in Software Project Design

International Journal of Advanced Information and Communication Technology ◽

10.46532/ijaict-2020033 ◽

2020 ◽

pp. 166-169

Author(s):

Aravindhan K

Keyword(s):

Project Management ◽

Cost Estimation ◽

Estimation Methods ◽

Software Project ◽

Estimation Model ◽

Software Projects ◽

Software Applications ◽

Software Companies ◽

Cost Estimation Model ◽

The Cost

Cost estimation of software projects is a risky task in the project management field. It is the process of predicting the cost and effort required to develop a software application. Several cost estimation models have been proposed over the last thirty to forty years. Many software companies track and analyse the current project by measuring the planned cost and estimating its accuracy. If the estimation is not done properly, it leads to the failure of the project. One of the challenging tasks in project management is evaluating the different cost estimation models and selecting the proper one for the current project. This paper summarizes the different cost estimation models and their techniques. It also provides guidance on selecting the proper model for different types of projects.

Digital monitoring system of equipment for the analysis of fuel and lubricants quality

World of OIL Products the Oil Companies Bulletin ◽

10.32758/2071-5951-2021-1-4-54-59 ◽

2021 ◽

Vol 04 (1) ◽

pp. 54-59

Author(s):

V. R. Nigmatullin ◽

I. R. Nigmatullin ◽

R. G. Nigmatullin ◽

A.M. Migranov ◽

...

Keyword(s):

Monitoring System ◽

High Performance ◽

Chemical Elements ◽

Large Data ◽

Climatic Conditions ◽

Point Of View ◽

Friction Units ◽

Digital Monitoring ◽

Wear Products ◽

The Cost

Currently, to increase the efficiency of industrial production, high-performance and expensive technological equipment is increasingly used. From the point of view of efficiency and reliability, the weakest link in such equipment is the components and parts of heavily loaded tribo-couplings, which operate at significantly different temperatures (under nominally lighter conditions, the temperature difference can be 100-120 degrees) and in varying climatic conditions (high humidity, the presence of abrasives and other chemical elements in the atmosphere). Analysis of the frequency of failures of friction units shows that the cost of their restoration reaches 9-20 percent of the cost of all equipment, without taking into account the enterprise's significant losses of income (profit) from downtime. The solution to this problem is based on studying the wear rate of friction units through the wear products accumulated in working oils, cooling lubricants, and greases. A digital equipment monitoring system (DSMT) has been developed and implemented, which includes dynamic recording of the number of wear products and the oil temperature by original modern recording devices, followed by the technology for processing and using these data. The system also includes methods for finding, in large data sets, information that is useful in theoretical and practical terms for similar equipment controlled by a digital monitoring system. The advantages of SMT are the ability to predict the reliability of the equipment, reduce production risks, and significantly reduce inefficient costs.

A Bit String Content Aware Chunking Strategy for Reduced CPU Energy on Cloud Storage

Journal of Electrical and Computer Engineering ◽

10.1155/2015/242086 ◽

2015 ◽

Vol 2015 ◽

pp. 1-8 ◽

Cited By ~ 1

Author(s):

Bin Zhou ◽

ShuDao Zhang ◽

Ying Zhang ◽

JiaHao Tan

Keyword(s):

Energy Consumption ◽

Data Center ◽

System Performance ◽

Cloud Storage ◽

High Performance ◽

Factors Affecting ◽

Redundant Data ◽

Energy Factors ◽

The Cost ◽

Content Aware

In order to save energy and reduce the total cost of ownership, green storage has become the first priority for data centers. Detecting and deleting redundant data is key to reducing the energy consumption of the CPU, while a high-performance, stable chunking strategy provides the groundwork for detecting redundant data. The existing chunking algorithm greatly reduces system performance when confronted with big data, and it wastes a lot of energy. This paper analyzes and discusses the factors affecting chunking performance and implements a new fingerprint signature calculation. Furthermore, a Bit String Content Aware Chunking Strategy (BCCS) is put forward. This strategy reduces the cost of signature computation in the chunking process, which improves system performance and cuts the energy consumption of the cloud storage data center. The advantages of the chunking strategy are verified on the basis of the paper's test scenarios and test data.
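For readers unfamiliar with content-aware chunking, the sketch below shows the general idea: a cheap fingerprint is updated byte by byte, and a chunk boundary is declared wherever its low bits match a fixed pattern. This is a generic shift-and-add scheme written for illustration, not the BCCS algorithm or its signature calculation; the mask and size limits are arbitrary assumptions.

```python
def chunk_boundaries(data: bytes, mask: int = 0x3F,
                     min_size: int = 16, max_size: int = 256) -> list[int]:
    """Return the end offsets of content-defined chunks in `data`.

    A boundary is placed where the fingerprint's masked bits are all zero,
    subject to a minimum and maximum chunk size.
    """
    boundaries = []
    start = 0
    fingerprint = 0
    for i, byte in enumerate(data):
        # Shift-and-add update: old bytes fall out of the 32-bit window.
        fingerprint = ((fingerprint << 1) + byte) & 0xFFFFFFFF
        length = i - start + 1
        if (length >= min_size and (fingerprint & mask) == 0) or length >= max_size:
            boundaries.append(i + 1)
            start = i + 1
            fingerprint = 0
    if start < len(data):
        boundaries.append(len(data))  # trailing partial chunk
    return boundaries


data = bytes(range(256)) * 4
ends = chunk_boundaries(data)
chunks = [data[a:b] for a, b in zip([0] + ends[:-1], ends)]
assert b"".join(chunks) == data
```

Because boundaries depend on content rather than on fixed offsets, identical regions of two files tend to produce identical chunks, which is what makes redundant data detectable. The cost of the per-byte fingerprint update is the part of the pipeline that the abstract aims to reduce.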

Constructing a Bioinformatics Platform with Web and Mobile Services Based on NVIDIA Jetson TK1

International Journal of Grid and High Performance Computing ◽

10.4018/ijghpc.2015100105 ◽

2015 ◽

Vol 7 (4) ◽

pp. 57-73 ◽

Cited By ~ 2

Author(s):

Chun-Yuan Lin ◽

Jin Ye ◽

Che-Lun Hung ◽

Chung-Hung Wang ◽

Min Su ◽

...

Keyword(s):

Power Consumption ◽

High Performance Computing ◽

Graphics Processing Units ◽

High Performance ◽

Low Cost ◽

Research Direction ◽

Mobile Services ◽

Performance Ratio ◽

The Cost ◽

Performance Computing

Current high-end graphics processing units (abbreviated as GPUs), such as the NVIDIA Tesla, Fermi, and Kepler series cards, which contain up to thousands of cores per chip, are widely used in high performance computing. These GPU cards (called desktop GPUs) must be installed in personal computers or servers with desktop CPUs; moreover, the cost and power consumption of constructing a high performance computing platform with these desktop CPUs and GPUs are high. NVIDIA has released a Tegra K1 board, called Jetson TK1, which contains 4 ARM Cortex-A15 CPUs and 192 CUDA cores (Kepler GPU); it is an embedded board with the advantages of low cost, low power consumption, and high applicability for embedded applications. NVIDIA Jetson TK1 has thus become a new research direction. Hence, in this paper, a bioinformatics platform was constructed based on NVIDIA Jetson TK1. The ClustalWtk and MCCtk tools were designed on this platform for sequence alignment and compound comparison, respectively. Moreover, web and mobile services with user-friendly interfaces were provided for these two tools. The experimental results showed that the cost-performance ratio of NVIDIA Jetson TK1 is higher than that of the Intel XEON E5-2650 CPU and the NVIDIA Tesla K20m GPU card.

Development of Filters with Minimal Hydraulic Resistance for Underground Water Intakes

Civil Engineering Journal ◽

10.28991/cej-2020-03091517 ◽

2020 ◽

Vol 6 (5) ◽

pp. 919-927

Author(s):

A. A. Akulshin ◽

N. V. Bredikhina ◽

An. A. Akulshin ◽

I. Y. Aksenteva ◽

N. P. Ermakova

Keyword(s):

Hydraulic Resistance ◽

Cross Section ◽

High Performance ◽

Underground Water ◽

Performance Characteristics ◽

Design Stage ◽

Performance Loss ◽

Filter Performance ◽

Optimal Section ◽

The Cost

The development of modern filtering equipment for water wells with enhanced performance characteristics is a vital task. The purpose of this work was to create filters for taking water from underground sources that have high performance and a long service life and that can be replaced or repaired quickly and economically in case of performance loss. The filter device must be selected taking into account all the geological features of the aquifers, the performance characteristics of the filter devices, and the size of the future structure. Filter equipment designs for water intake wells have been developed in this study. These filters have low hydraulic resistance and high performance and are easy to repair. This article presents the flow dependency inside the receiving part of the well, the dependence of filter resistance on the cross-sectional shape of the filter wire, and the selected optimal section. The paper proposes a method for selecting the optimal cross-section of the filter wire used in the manufacture of a water well filter. The proposed designs of easy-to-remove well filters with increased productivity allow the sealed well filter to be replaced with a new one easily, reducing capital and operating costs and increasing the intervals between repairs. Based on the presented method, examples are given for selecting the parameters of the filter wire cross-section. The calculations showed that using the hydraulic resistance criterion at the design stage of underground water intakes can significantly reduce the cost of well construction. Studies have found that the minimum hydraulic resistance, which ensures maximum filter performance, is achieved when using filter wire with teardrop and elliptical cross-sections.
