A thermal resilient integration of many-core microprocessors and main memory by 2.5D TSI I/Os

The stochastic heat conduction differential equation, despite its complexity, admits stationary solutions that are valid over a certain range of the variables characterizing heat flow in multi-processor cores. The heat conduction equation is recast to account for the anisotropy of a many-core multi-processor, in which the heat generated at a given location depends on whether that location is a cache, processor, bus controller, or memory controller; within the core, the heat generated depends on the hit rate, processor utilization, cache organization, and the technology used. The thermal conductivity of the core and the heat generated in it are treated as stochastic variables, and the influence of workloads, hitherto unrecognized, is explicitly accounted for in determining the temperature distribution and its variation with processor clock frequency. Relationships derived from first principles indicate that the rise in temperature with processor frequency for an OLTP workload is not as catastrophic as some industry brochures predict. A general framework is developed for heat conduction in an orthotropic rectangular slab (representing a many-core processor) with stochastic values of thermal conductivity and heat generation; the theoretical trend is validated using published data for OLTP workloads to obtain the temperature at the core surface as a function of clock frequency for the deterministic case. The Transaction Processing Council's (TPC) openly available data from controlled, closely audited experiments for TPC-C workloads during the period 2000–2011 were analyzed to determine the relation between throughput, clock frequency, main memory size, number of cores, and power consumed. Operating systems, compilers, linkers, processor architecture, cache, main memory, and storage sizes changed drastically during this ten-year period, and hyper-threading was unknown in 2000. This analysis yields equations for throughput and power consumed which, for the specific case of 64 processors with a main memory of 32 GB and a million users, reduce to $W = 1075\,f^{0.22}$. For the isotropic case, the temperature difference at the surface may be expressed for the case under study as $\Delta T = 71.1\,f^{0.22}$. This demonstrates that chip temperature for OLTP workloads does not increase to catastrophic values as frequency increases. This behavior varies for other types of workloads.
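
As a rough illustration of the two fits quoted in the abstract, the sketch below evaluates how W and ΔT scale with clock frequency. It assumes that the exponent 0.22 applies to f in both fits and that W is the power consumed. The function names are hypothetical, and the unit of f is not stated in the abstract, so only unit-independent ratios are meaningful here.

```python
# Scaling implied by the quoted fits W = 1075 * f**0.22 and dT = 71.1 * f**0.22.
# Assumptions: the exponent reading, W as power consumed, and an unspecified
# unit for f (only ratios between two frequencies are unit-independent).

EXPONENT = 0.22


def power_consumed(f):
    """Fit for the 64-processor, 32 GB, one-million-user case."""
    return 1075.0 * f ** EXPONENT


def delta_t(f):
    """Fit for the surface temperature difference in the isotropic case."""
    return 71.1 * f ** EXPONENT


def scaling_ratio(factor):
    """Growth of W or dT when the clock frequency is multiplied by `factor`."""
    return factor ** EXPONENT


if __name__ == "__main__":
    for factor in (1.5, 2.0, 4.0):
        print(f"frequency x{factor}: W and dT grow by {scaling_ratio(factor):.3f}x")
```

Under these assumptions, doubling the frequency multiplies both quantities by 2^0.22, roughly 1.16, which is consistent with the abstract's claim that the temperature rise for OLTP workloads stays modest.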


A parallel discord discovery algorithm for time series on many-core accelerators

Numerical Methods and Programming (Vychislitel'nye Metody i Programmirovanie) ◽

10.26089/nummet.v20r320 ◽

2019 ◽

pp. 211-223

Author(s):

М.Л. Цымблер

Keyword(s):

Time Series ◽

Climate Modeling ◽

Main Memory ◽

Wide Range ◽

Euclidean Distances ◽

Nvidia Gpu ◽

Two Stages ◽

Many Core ◽

Intel Mic ◽

Many Integrated Core

Discord is a refinement of the concept of an anomalous subsequence (one that is substantially dissimilar to the other subsequences) of a time series. The discord discovery problem arises in a wide range of application areas related to time series: medicine, economics, climate modeling, etc. In this paper we propose a new parallel discord discovery algorithm for many-core accelerators for the case when the input data fit in main memory. The algorithm exploits the ability to calculate the Euclidean distances between the subsequences of the time series independently. The algorithm consists of two stages: precomputation and discovery. At the precomputation stage, we construct auxiliary matrix data structures that enable parallelization and efficient vectorization of the computations on an accelerator. At the discovery stage, the algorithm searches for a discord using the constructed structures. The algorithm is implemented for the Intel MIC (Many Integrated Core) and NVIDIA GPU accelerator architectures, with computations parallelized using OpenMP and OpenACC, respectively. Numerical experiments confirm the high scalability of the proposed algorithm.
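
To make the notion of a discord concrete, here is a minimal serial sketch of the brute-force definition: the subsequence whose Euclidean distance to its nearest non-overlapping neighbor is largest. This is not the paper's parallel algorithm and does not use its auxiliary matrix structures. The function names, the plain (non-normalized) Euclidean distance, and the rule for excluding overlapping subsequences are assumptions made for illustration.

```python
import math


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def find_discord(series, m):
    """Return (start_index, distance) of the top discord of length m.

    The discord is the subsequence whose nearest neighbor, among
    subsequences that do not overlap it, is farthest away.
    """
    if len(series) < 2 * m:
        raise ValueError("series must hold two non-overlapping subsequences")
    n = len(series) - m + 1
    subs = [series[i:i + m] for i in range(n)]
    best_idx, best_dist = -1, -1.0
    for i in range(n):
        nearest = math.inf
        for j in range(n):
            if abs(i - j) >= m:  # skip overlapping (trivial) matches
                nearest = min(nearest, euclidean(subs[i], subs[j]))
        if nearest > best_dist:
            best_idx, best_dist = i, nearest
    return best_idx, best_dist


if __name__ == "__main__":
    data = [0, 1, 0, -1] * 8
    data[14] = 5  # inject an anomaly into an otherwise periodic series
    print(find_discord(data, 4))
```

Each distance in the inner loop depends only on its own pair of subsequences, so the iterations are independent of one another. That independence is the property the abstract says the parallel algorithm exploits.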


Monolithically Integrating Non-Volatile Main Memory over the Last-Level Cache

ACM Transactions on Architecture and Code Optimization ◽

10.1145/3462632 ◽

2021 ◽

Vol 18 (4) ◽

pp. 1-26

Author(s):

Candace Walden ◽

Devesh Singh ◽

Meenatchi Jagasivamani ◽

Shang Li ◽

Luyi Kang ◽

...

Keyword(s):

Regular Structure ◽

Main Memory ◽

Monolithically Integrated ◽

Area Efficiency ◽

Memory Area ◽

Simulation Results ◽

Memory Interface ◽

Performance Penalty ◽

Many Core ◽

Design Ideas

Many emerging non-volatile memories are compatible with CMOS logic, potentially enabling their integration into a CPU’s die. This article investigates such monolithically integrated CPU–main memory chips. We exploit non-volatile memories employing 3D crosspoint subarrays, such as resistive RAM (ReRAM), and integrate them over the CPU’s last-level cache (LLC). The regular structure of cache arrays enables co-design of the LLC and ReRAM main memory for area efficiency. We also develop a streamlined LLC/main memory interface that employs a single shared internal interconnect for both the cache and main memory arrays, and uses a unified controller to service both LLC and main memory requests. We apply our monolithic design ideas to a many-core CPU by integrating 3D ReRAM over each core’s LLC slice. We find that co-design of the LLC and ReRAM saves 27% of the total LLC–main memory area at the expense of slight increases in delay and energy. The streamlined LLC/main memory interface saves an additional 12% in area. Our simulation results show monolithic integration of CPU and main memory improves performance by 5.3× and 1.7× over HBM2 DRAM for several graph and streaming kernels, respectively. It also reduces the memory system’s energy by 6.0× and 1.7×, respectively. Moreover, we show that the area savings of co-design permit the CPU to have 23% more cores and main memory, and that streamlining the LLC/main memory interface incurs a small 4% performance penalty.


Parallel Algorithm for Frequent Itemset Mining on Intel Many-core Systems

Journal of Computing and Information Technology ◽

10.20532/cit.2018.1004382 ◽

2019 ◽

Vol 26 (4) ◽

pp. 209-221

Keyword(s):

Parallel Implementation ◽

Main Memory ◽

Frequent Itemset ◽

Frequent Itemset Mining ◽

Xeon Phi ◽

Intel Xeon Phi ◽

Memory Space ◽

Itemset Mining ◽

Many Core ◽

Intel Xeon

Frequent itemset mining leads to the discovery of associations and correlations among items in large transactional databases. Apriori is a classical frequent itemset mining algorithm that makes iterative passes over the database, generating candidate itemsets from the frequent itemsets found in the previous iteration and pruning clearly infrequent itemsets. The Dynamic Itemset Counting (DIC) algorithm is a variation of Apriori that tries to reduce the number of passes made over a transactional database while keeping the number of itemsets counted in a pass relatively low. In this paper, we address the problem of accelerating DIC on the Intel Xeon Phi many-core system for the case when the transactional database fits in main memory. Intel Xeon Phi provides a large number of small compute cores with vector processing units. The paper presents a parallel implementation of DIC based on OpenMP technology and thread-level parallelism. We exploit a bit-based internal layout for transactions and itemsets. This technique reduces the memory space needed to store the transactional database, simplifies support counting via logical bitwise operations, and allows this step to be vectorized. Experimental evaluation on the Intel Xeon CPU and Intel Xeon Phi coprocessor platforms with large synthetic and real databases showed good performance and scalability of the proposed algorithm.
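
The bit-based layout mentioned in the abstract can be sketched as follows: each transaction and each itemset becomes a bitmask, and an itemset is contained in a transaction when a bitwise AND leaves the itemset mask unchanged. This is a simplified serial illustration, not the paper's OpenMP implementation. The function names and the use of Python integers as bitmasks are assumptions.

```python
def encode(items, item_index):
    """Pack an iterable of items into an integer bitmask.

    `item_index` maps each item to its bit position (hypothetical helper data).
    """
    mask = 0
    for item in items:
        mask |= 1 << item_index[item]
    return mask


def support(itemset_mask, transaction_masks):
    """Count the transactions that contain every item of the itemset."""
    return sum(
        1 for t in transaction_masks if (t & itemset_mask) == itemset_mask
    )


if __name__ == "__main__":
    index = {"bread": 0, "milk": 1, "eggs": 2, "tea": 3}
    database = [
        encode(t, index)
        for t in (["bread", "milk"], ["bread", "eggs", "milk"], ["tea"], ["milk", "tea"])
    ]
    print(support(encode(["bread", "milk"], index), database))
```

The containment check is a single AND and compare per transaction, with no per-item lookups, which is what makes this step a natural candidate for vectorization.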


Efficient many-core query execution in main memory column-stores

2013 IEEE 29th International Conference on Data Engineering (ICDE) ◽

10.1109/icde.2013.6544838 ◽

2013 ◽

Cited By ~ 9

Author(s):

J. Dees ◽

P. Sanders

Keyword(s):

Main Memory ◽

Query Execution ◽

Column Stores ◽

Many Core


A Study of Main-Memory Hash Joins on Many-core Processor

Proceedings of the 2017 ACM on Conference on Information and Knowledge Management - CIKM '17 ◽

10.1145/3132847.3132916 ◽

2017 ◽

Cited By ~ 14

Author(s):

Xuntao Cheng ◽

Bingsheng He ◽

Xiaoli Du ◽

Chiew Tong Lau

Keyword(s):

Main Memory ◽

Many Core
