NUMA-BTDM: A Thread Mapping Algorithm for Balanced Data Locality on NUMA Systems

Author(s):  
Iulia Stirb
2017 ◽  
Vol 2017 ◽  
pp. 1-8

Author(s):  
Thomas Mezmur Birhanu ◽  
Zhetao Li ◽  
Hiroo Sekiya ◽  
Nobuyoshi Komuro ◽  
Young-June Choi

This paper proposes a thread scheduling mechanism designed for heterogeneously configured multicore systems. Our approach considers CPU utilization to map each running thread to the core that can deliver the capacity it actually needs. The paper also introduces a mapping algorithm that maps threads to cores in O(N log M) time, where N is the number of cores and M is the number of core types. In addition, we introduce a method for profiling heterogeneous architectures based on the discrepancy between the performance of individual cores. Our heterogeneity-aware scheduler speeds up processing by 52.62% and saves 2.22% of power compared to the CFS scheduler, the default in Linux systems.
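The abstract does not spell out the mapping algorithm, but the stated O(N log M) bound is consistent with a per-decision binary search over the M core types. A minimal sketch under that assumption (the function name and the sorted-capacity representation are illustrative, not from the paper):

```c
#include <stddef.h>

/* Hypothetical sketch: pick the weakest core type whose capacity still
 * covers a thread's measured CPU utilization. Core-type capacities are
 * assumed sorted ascending, so each lookup is a binary search over the
 * M core types, i.e. O(log M) per decision. */
static size_t pick_core_type(const double *capacity, size_t m, double demand)
{
    size_t lo = 0, hi = m;          /* search window [lo, hi) */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (capacity[mid] < demand)
            lo = mid + 1;           /* too weak, search to the right */
        else
            hi = mid;               /* strong enough, try something weaker */
    }
    return lo < m ? lo : m - 1;     /* fall back to the strongest type */
}
```

For example, with capacities {0.25, 0.5, 1.0}, a thread demanding 0.4 of a core lands on type 1, while a demand the platform cannot satisfy falls back to the strongest type.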


10.29007/55pq ◽  
2019 ◽  
Author(s):  
Lifeng Liu ◽  
Meilin Liu ◽  
Chongjun Wang

General purpose GPU (GPGPU) is an effective many-core architecture that can yield high throughput for many scientific applications with thread-level parallelism. However, several challenges still limit further performance improvements and make GPU programming challenging for programmers who lack knowledge of the GPU hardware architecture. In this paper, we design a compiler-assisted locality-aware CTA (cooperative thread array) mapping scheme for GPUs that takes advantage of inter-CTA data reuse in GPU kernels. Using data reuse analysis based on the polyhedron model, we detect inter-CTA data reuse patterns in GPU kernels and control the CTA mapping pattern to improve data locality on each SM. The compiler-assisted locality-aware CTA mapping scheme can also be combined with the programmable warp scheduler to further improve performance. The experimental results show that our CTA mapping algorithm improves the overall performance of the input GPU programs by 23.3% on average, and by 56.7% when combined with the programmable warp scheduler.
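The abstract describes steering CTAs that reuse each other's data onto the same SM rather than spreading them round-robin. A hedged sketch of that idea as a plain index mapping (the function, the `group` parameter, and grouping along the x dimension are illustrative assumptions, not the paper's actual scheme):

```c
/* Hypothetical locality-aware CTA-to-SM assignment: a cluster of
 * `group` consecutive CTAs along the reuse dimension is assigned to
 * one SM, so data they share stays resident in that SM's cache,
 * instead of the default round-robin spread across all SMs. */
static unsigned cta_to_sm(unsigned cta_x, unsigned group, unsigned num_sms)
{
    return (cta_x / group) % num_sms;   /* whole cluster -> one SM */
}
```

With `group = 1` this degenerates to plain round-robin; larger groups trade SM load balance for intra-SM data reuse, which is the tension a compiler-assisted scheme can resolve per kernel.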


Computers ◽  
2018 ◽  
Vol 7 (4) ◽  
pp. 66
Author(s):  
Iulia Știrb

The paper presents a Non-Uniform Memory Access (NUMA)-aware compiler optimization for task-level parallel code. The optimization is based on the Non-Uniform Memory Access—Balanced Task and Loop Parallelism (NUMA-BTLP) algorithm (Ştirb, 2018). The algorithm determines the type of each thread in the source code based on a static analysis of the code. After assigning a type to each thread, NUMA-BTLP (Ştirb, 2018) calls the NUMA-BTDM mapping algorithm (Ştirb, 2016), which uses the PThreads routine pthread_setaffinity_np to set the CPU affinities of the threads (i.e., thread-to-core associations) based on their type. The algorithms improve thread mapping on NUMA systems by mapping threads that share data to the same core(s), allowing fast access to data in the L1 cache. The paper shows that PThreads-based task-level parallel code optimized at compile time by NUMA-BTLP (Ştirb, 2018) and NUMA-BTDM (Ştirb, 2016) runs time- and energy-efficiently on NUMA systems. The results show that energy consumption improves by up to 5% at the same execution time for one of the tested real benchmarks, and by up to 15% for another benchmark running in an infinite loop. The algorithms can be used in real-time control systems such as client/server-based applications that require efficient access to shared resources. Most often, task parallelism is used in the implementation of the server and loop parallelism in the client.
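The affinity call the abstract names, pthread_setaffinity_np, is a GNU extension that pins a thread to a CPU set. A minimal sketch of the mechanics (which core each thread type gets is NUMA-BTDM's decision; the helper names below are illustrative):

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Build a CPU set containing exactly one core. */
static cpu_set_t one_core_set(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);            /* allow only this core */
    return set;
}

/* Pin the calling thread to `core`, as NUMA-BTDM does per thread type.
 * Returns 0 on success, an errno value on failure. */
static int pin_self_to(int core)
{
    cpu_set_t set = one_core_set(core);
    return pthread_setaffinity_np(pthread_self(), sizeof set, &set);
}
```

On Linux, compile with `-pthread`; the same call also accepts a handle from pthread_create, which is how a compiler pass could pin each created thread rather than only the caller.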


2013 ◽  
Vol 33 (1) ◽  
pp. 76-79
Author(s):  
Jiamin LIU ◽  
Huiyan WANG ◽  
Xiaoli ZHOU ◽  
Fulin LUO

2011 ◽  
Vol 33 (10) ◽  
pp. 2347-2352
Author(s):  
Bo Lü ◽  
Fan Yang ◽  
Zhen-kai Wang ◽  
Jian-ya Chen ◽  
Yun-jie Liu
