Effects of Loop Unrolling and Loop Fusion on Register Pressure and Code Performance.

Research of Register Pressure Aware Loop Unrolling Optimizations for Compiler

MATEC Web of Conferences ◽

10.1051/matecconf/201822803008 ◽

2018 ◽

Vol 228 ◽

pp. 03008

Author(s):

Xuehua Liu ◽

Liping Ding ◽

Yanfeng Li ◽

Guangxuan Chen ◽

Jin Du

Keyword(s):

Finite Number ◽

Infinite Number ◽

Performance Degradation ◽

Transformation Process ◽

Fine Grained ◽

Loop Unrolling ◽

Average Improvement ◽

Register Pressure ◽

Linpack Benchmark ◽

Loop Optimizations

Register pressure is a well-known problem for compilers because of the mismatch between the infinite number of pseudo registers and the finite number of hard registers. Excessive register pressure may result in register spilling, which in turn leads to performance degradation. Many compiler optimizations, especially loop optimizations, suffer from register spilling. To reduce register pressure and thereby improve the effectiveness of the compiler, this research takes register pressure into account when applying loop unrolling during the transformation process. In addition, a register-pressure-aware transformation can reduce the performance overhead of some fine-grained randomization transformations that can be used to defend against ROP attacks. Experiments showed a peak improvement of about 3.6% and an average improvement of about 1% for the SPEC CPU 2006 benchmarks, and a peak improvement of about 3% and an average improvement of about 1% for the LINPACK benchmark.
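The trade-off described in this abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: the cost model in `choose_unroll_factor` (live values grow linearly with the unroll factor) and all names and parameters are assumptions made for illustration only. `dot_unrolled` shows why unrolling raises pressure: each replicated copy of the body keeps its own accumulator live.

```python
def choose_unroll_factor(live_per_copy, live_invariant, hard_registers, max_factor=8):
    """Pick the largest power-of-two unroll factor whose estimated number of
    live values still fits in the hard registers (hypothetical linear model)."""
    factor = 1
    while factor * 2 <= max_factor:
        estimated_live = live_invariant + (factor * 2) * live_per_copy
        if estimated_live > hard_registers:
            break  # doubling again would be expected to cause spilling
        factor *= 2
    return factor


def dot_unrolled(a, b, factor):
    """Dot product with the loop body replicated `factor` times."""
    n = len(a)
    acc = [0.0] * factor            # one accumulator per unrolled copy
    main_end = n - n % factor
    for i in range(0, main_end, factor):
        for k in range(factor):     # stands in for the replicated loop body
            acc[k] += a[i + k] * b[i + k]
    total = sum(acc)
    for i in range(main_end, n):    # remainder loop for leftover iterations
        total += a[i] * b[i]
    return total
```

Under this assumed model, a loop with 3 live values per copy and 4 loop-invariant values would be unrolled 4 times on a 16-register target, but left as is on an 8-register one.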

On the Transformation Optimization for Stencil Computation

Electronics ◽

10.3390/electronics11010038 ◽

2021 ◽

Vol 11 (1) ◽

pp. 38

Author(s):

Huayou Su ◽

Kaifang Zhang ◽

Songzhu Mei

Keyword(s):

Load Balance ◽

Loop Transformation ◽

Redundancy Elimination ◽

Stencil Computation ◽

Loop Unrolling ◽

Loop Fusion ◽

Potential Benefits ◽

Successful Employment ◽

2D And 3D

Stencil computation optimizations have been investigated extensively, and various approaches have been proposed. Loop transformation is a vital kind of optimization in modern production compilers and has been employed successfully within them. In this paper, we combine the two aspects to study the potential benefits that some common transformation recipes may have for stencils. The recipes consist of loop unrolling, loop fusion, address precalculation, redundancy elimination, instruction reordering, load balance, and a forward and backward update algorithm named semi-stencil. Experimental evaluations of diverse stencil kernels, including 1D, 2D, and 3D computation patterns, on two typical ARM and Intel platforms, demonstrate the respective effects of the transformation recipes. An average speedup of 1.65× is obtained, and the best is 1.88× for the single transformation recipes we analyze. The compound recipes demonstrate a maximum speedup of 1.92×.
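Of the recipes listed in this abstract, loop fusion is perhaps the easiest to show on a toy case. The sketch below is an assumption-laden illustration, not code from the paper: the 3-point averaging stencil, the scaling pass, and both function names are hypothetical. Fusion is legal here because the second loop reads only the value produced in the same iteration of the first.

```python
def stencil_then_scale(a, scale):
    """Unfused version: a 3-point stencil pass followed by a scaling pass."""
    n = len(a)
    b = [0.0] * n
    for i in range(1, n - 1):
        b[i] = (a[i - 1] + a[i] + a[i + 1]) / 3.0
    c = [0.0] * n
    for i in range(1, n - 1):
        c[i] = scale * b[i]
    return c


def stencil_scale_fused(a, scale):
    """Fused version: one pass, and the intermediate array b disappears."""
    n = len(a)
    c = [0.0] * n
    for i in range(1, n - 1):
        smoothed = (a[i - 1] + a[i] + a[i + 1]) / 3.0  # was b[i]
        c[i] = scale * smoothed
    return c
```

The fused loop avoids writing and re-reading the intermediate array, at the cost of a longer loop body with more values live at once, which is the kind of interaction with register pressure that the page title refers to.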

Code performance assessment: application to the pull-out strength of concrete

Materials and Structures ◽

10.1617/14219 ◽

2005 ◽

Vol 38 (281) ◽

pp. 691-699

Author(s):

J. Al-Hajjar

Keyword(s):

Performance Assessment ◽

Pull Out ◽

Code Performance ◽

Pull Out Strength

MILC Code Performance on High End CPU and GPU Supercomputer Clusters

EPJ Web of Conferences ◽

10.1051/epjconf/201817502009 ◽

2018 ◽

Vol 175 ◽

pp. 02009

Author(s):

Carleton DeTar ◽

Steven Gottlieb ◽

Ruizi Li ◽

Doug Toussaint

Keyword(s):

Conjugate Gradient ◽

Memory Hierarchy ◽

Xeon Phi ◽

Intel Xeon Phi ◽

Code Performance ◽

Recent Developments ◽

Knights Landing ◽

Many Core ◽

Intel Xeon

With recent developments in parallel supercomputing architecture, many-core, multi-core, and GPU processors are now commonplace, resulting in more levels of parallelism, memory hierarchy, and programming complexity. It has been necessary to adapt the MILC code to these new processors, starting with NVIDIA GPUs and, more recently, the Intel Xeon Phi processors. We report on our efforts to port and optimize our code for the Intel Knights Landing architecture. We consider performance of the MILC code with MPI and OpenMP, and optimizations with QOPQDP and QPhiX. For the latter approach, we concentrate on the staggered conjugate gradient and gauge force. We also consider performance on recent NVIDIA GPUs using the QUDA library.

Modeling cesium migration through Opalinus clay: a benchmark for single- and multi-species sorption-diffusion models

Computational Geosciences ◽

10.1007/s10596-021-10050-5 ◽

2021 ◽

Author(s):

Jesús F. Águila ◽

Vanessa Montoya ◽

Javier Samper ◽

Luis Montenegro ◽

Georg Kosakowski ◽

...

Keyword(s):

Time Discretization ◽

Single Species ◽

Opalinus Clay ◽

Sorption Behavior ◽

Diffusion Experiment ◽

Modeling Tools ◽

Cesium Sorption ◽

Code Performance ◽

Good Agreement

Sophisticated modeling of the migration of sorbing radionuclides in compacted claystones is needed to support the safety analysis of deep geological repositories for radioactive waste, which requires robust modeling tools and codes. Here, a benchmark related to a long-term laboratory-scale diffusion experiment of cesium, a moderately sorbing radionuclide, through Opalinus clay is presented. The benchmark was performed with the following codes: CORE2DV5, Flotran, COMSOL Multiphysics, OpenGeoSys-GEM, MCOTAC and PHREEQC v.3. The migration setup was solved with two different conceptual models: i) a single-species model using a look-up table for a cesium sorption isotherm, and ii) a multi-species diffusion model including a complex mechanistic cesium sorption model. The calculations were performed for three different cesium boundary concentrations (10⁻³, 10⁻⁵, 10⁻⁷ mol/L) to investigate the capabilities of the models and codes, taking into account the nonlinear sorption behavior of cesium. Generally, good agreement could be achieved for both the single- and multi-species benchmark concepts. However, some discrepancies were identified, especially near the boundaries, where code-specific spatial (and time) discretization had to be improved to achieve better agreement at the expense of longer computation times. In addition, the benchmark exercise yielded useful information on code performance, setup options, input and output data management, and post-processing options. Finally, the comparison of single-species and multi-species model concepts showed that the single-species approach generally yielded earlier breakthrough, because this approach accounts neither for cation exchange of Cs⁺ with K⁺ and Na⁺, nor for K⁺ and Na⁺ diffusion in the pore water.
