Effects of Loop Unrolling and Loop Fusion on Register Pressure and Code Performance.

Author(s):  
Dale Shires
2018 ◽  
Vol 228 ◽  
pp. 03008
Author(s):  
Xuehua Liu ◽  
Liping Ding ◽  
Yanfeng Li ◽  
Guangxuan Chen ◽  
Jin Du

Register pressure is a well-known problem for compilers because of the mismatch between the infinite number of pseudo registers and the finite number of hard registers. Excessive register pressure may result in register spilling, which in turn degrades performance. Many compiler optimizations, especially loop optimizations, suffer from register spilling. To reduce register pressure and thereby improve the effectiveness of the compiler, this research takes register pressure into account when applying the loop unrolling transformation. In addition, a register-pressure-aware transformation can reduce the performance overhead of some fine-grained randomization transformations that are used to defend against ROP attacks. Experiments showed a peak improvement of about 3.6% and an average improvement of about 1% for the SPEC CPU 2006 benchmarks, and a peak improvement of about 3% and an average improvement of about 1% for the LINPACK benchmark.
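As a rough illustration of the trade-off this abstract describes, the following C sketch contrasts a rolled loop with a hand-unrolled version. It is not the authors' method: the function names `dot_rolled` and `dot_unrolled4` and the unroll factor of 4 are assumptions chosen for the example. Each extra accumulator in the unrolled body is another value live across the loop, which is how larger unroll factors can raise register pressure and eventually cause spilling.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Baseline: a single accumulator is live across the loop. */
int64_t dot_rolled(const int32_t *a, const int32_t *b, size_t n) {
    assert(n == 0 || (a != NULL && b != NULL));
    int64_t acc = 0;
    for (size_t i = 0; i < n; ++i)
        acc += (int64_t)a[i] * b[i];
    return acc;
}

/* Hypothetical unroll by 4 with independent accumulators.
 * Fewer branches and more instruction-level parallelism, but four
 * accumulators (plus the loaded operands) are live at once, so the
 * register allocator has more simultaneously live values to place. */
int64_t dot_unrolled4(const int32_t *a, const int32_t *b, size_t n) {
    assert(n == 0 || (a != NULL && b != NULL));
    int64_t acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        acc0 += (int64_t)a[i]     * b[i];
        acc1 += (int64_t)a[i + 1] * b[i + 1];
        acc2 += (int64_t)a[i + 2] * b[i + 2];
        acc3 += (int64_t)a[i + 3] * b[i + 3];
    }
    int64_t acc = acc0 + acc1 + acc2 + acc3;
    /* Remainder loop for the last n % 4 elements. */
    for (; i < n; ++i)
        acc += (int64_t)a[i] * b[i];
    return acc;
}
```

Integer arithmetic is used so that both versions return exactly the same value; with floating-point data the reassociated sums could differ in the last bits.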


Electronics ◽  
2021 ◽  
Vol 11 (1) ◽  
pp. 38
Author(s):  
Huayou Su ◽  
Kaifang Zhang ◽  
Songzhu Mei

Stencil computation optimizations have been studied extensively, and various approaches have been proposed. Loop transformation is a vital class of optimization in modern production compilers and has been employed successfully within them. In this paper, we combine the two aspects to study the potential benefits that some common transformation recipes may have for stencils. The recipes consist of loop unrolling, loop fusion, address precalculation, redundancy elimination, instruction reordering, load balancing, and a forward and backward update algorithm named semi-stencil. Experimental evaluations of diverse stencil kernels, including 1D, 2D, and 3D computation patterns, on two typical ARM and Intel platforms demonstrate the respective effects of the transformation recipes. An average speedup of 1.65× is obtained, and the best is 1.88×, for the single transformation recipes we analyze. The compound recipes demonstrate a maximum speedup of 1.92×.
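To make one of the recipes named above concrete, here is a minimal C sketch of loop fusion on a 1D three-point stencil. The kernel, the function names `stencil_two_pass` and `stencil_fused`, and the integer `scale` factor are all assumptions for illustration and are not taken from the paper. The fused form removes the intermediate array and one pass over memory, at the cost of a slightly larger loop body.

```c
#include <assert.h>
#include <stddef.h>

/* Unfused: the neighbor sum is written to tmp, then re-read for scaling.
 * Only interior points 1..n-2 are updated; boundaries are left untouched. */
void stencil_two_pass(const int *in, int *tmp, int *out, size_t n, int scale) {
    assert(in != NULL && tmp != NULL && out != NULL);
    for (size_t i = 1; i + 1 < n; ++i)
        tmp[i] = in[i - 1] + in[i] + in[i + 1];
    for (size_t i = 1; i + 1 < n; ++i)
        out[i] = scale * tmp[i];
}

/* Fused: both steps run in one loop, so the intermediate value stays in
 * a scalar (ideally a register) instead of a temporary array. */
void stencil_fused(const int *in, int *out, size_t n, int scale) {
    assert(in != NULL && out != NULL);
    for (size_t i = 1; i + 1 < n; ++i) {
        int sum = in[i - 1] + in[i] + in[i + 1];
        out[i] = scale * sum;
    }
}
```

Both functions produce the same interior values; whether the fused version is actually faster depends on the platform and on how much else competes for registers in the loop body.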


2018 ◽  
Vol 175 ◽  
pp. 02009
Author(s):  
Carleton DeTar ◽  
Steven Gottlieb ◽  
Ruizi Li ◽  
Doug Toussaint

With recent developments in parallel supercomputing architecture, many-core, multi-core, and GPU processors are now commonplace, resulting in more levels of parallelism, memory hierarchy, and programming complexity. It has been necessary to adapt the MILC code to these new processors, starting with NVIDIA GPUs and, more recently, the Intel Xeon Phi processors. We report on our efforts to port and optimize our code for the Intel Knights Landing architecture. We consider performance of the MILC code with MPI and OpenMP, and optimizations with QOPQDP and QPhiX. For the latter approach, we concentrate on the staggered conjugate gradient and gauge force. We also consider performance on recent NVIDIA GPUs using the QUDA library.


Author(s):  
Jesús F. Águila ◽  
Vanessa Montoya ◽  
Javier Samper ◽  
Luis Montenegro ◽  
Georg Kosakowski ◽  
...  

Sophisticated modeling of the migration of sorbing radionuclides in compacted claystones is needed to support the safety analysis of deep geological repositories for radioactive waste, which requires robust modeling tools/codes. Here, a benchmark related to a long-term laboratory-scale diffusion experiment of cesium, a moderately sorbing radionuclide, through Opalinus clay is presented. The benchmark was performed with the following codes: CORE2DV5, Flotran, COMSOL Multiphysics, OpenGeoSys-GEM, MCOTAC and PHREEQC v.3. The migration setup was solved with two different conceptual models: i) a single-species model using a look-up table for a cesium sorption isotherm and ii) a multi-species diffusion model including a complex mechanistic cesium sorption model. The calculations were performed for three different cesium boundary concentrations (10⁻³, 10⁻⁵, 10⁻⁷ mol/L) to investigate the capabilities of the models/codes, taking into account the nonlinear sorption behavior of cesium. Generally, good agreement was achieved for both the single- and multi-species benchmark concepts; however, some discrepancies were identified, especially near the boundaries, where code-specific spatial (and time) discretization had to be improved to achieve better agreement at the expense of longer computation times. In addition, the benchmark exercise yielded useful information on code performance, setup options, input and output data management, and post-processing options. Finally, the comparison of single-species and multi-species model concepts showed that the single-species approach generally yielded earlier breakthrough, because this approach accounts neither for cation exchange of Cs⁺ with K⁺ and Na⁺, nor for K⁺ and Na⁺ diffusion in the pore water.


2008 ◽  
Vol 43 (7) ◽  
pp. 141-150
Author(s):  
Mounira Bachir ◽  
Sid-Ahmed-Ali Touati ◽  
Albert Cohen

2018 ◽  
Vol 53 (3) ◽  
pp. 17-30
Author(s):  
Wenwen Wang ◽  
Jiacheng Wu ◽  
Xiaoli Gong ◽  
Tao Li ◽  
Pen-Chung Yew