Moving Multiscale Modelling to the Edge: Benchmarking and Load Optimization for Cellular Automata on Low Power Microcomputers

Processes ◽  
2021 ◽  
Vol 9 (12) ◽  
pp. 2225
Author(s):  
Piotr Hajder ◽  
Łukasz Rauch

Numerical computations are usually associated with High Performance Computing. Nevertheless, both industry and science increasingly involve lower-power devices in computations, especially when data-collecting devices can partially process the data in place, thus increasing system reliability. This paradigm is known as Edge Computing. In this paper, we propose the use of lower-power devices at the edge for multi-scale modelling calculations. A system was created consisting of a high-power device (a two-processor workstation), 8 Raspberry Pi 4B microcomputers and 8 NVIDIA Jetson Nano units, each equipped with a GPU. As part of this research, benchmarking was performed, on the basis of which the computational capabilities of the devices were classified. Two parameters were considered: the number and performance of computing units (CPUs and GPUs) and the energy consumption of the loaded machines. Then, using the calculated weak scalability and energy consumption, a min–max-based load optimization algorithm was proposed. The system was tested in laboratory conditions, giving similar computation time at the same power consumption for 24 physical workstation cores versus 8 Raspberry Pi 4B and 8 Jetson Nano devices. The work ends with a proposal to apply this solution to industrial processes, using hot rolling of flat products as an example.
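The min–max load split the abstract describes can be sketched as below: for divisible cellular-automaton work, minimizing the slowest device's completion time reduces to a throughput-proportional partition. The device names and throughput figures are illustrative, not the paper's measured benchmarks.

```python
def min_max_split(total_cells, throughputs):
    """Split `total_cells` CA cells across devices so that the slowest
    device's completion time (the max) is minimized.

    For divisible work, the min-max optimum assigns load proportionally
    to throughput, so every device finishes at (nearly) the same time.
    """
    total_rate = sum(throughputs.values())
    shares = {dev: round(total_cells * rate / total_rate)
              for dev, rate in throughputs.items()}
    # Fix rounding drift so the shares sum exactly to total_cells.
    drift = total_cells - sum(shares.values())
    busiest = max(shares, key=shares.get)
    shares[busiest] += drift
    return shares

# Hypothetical throughputs (cells/s) -- not the paper's measurements.
rates = {"workstation": 2400.0, "rpi4b": 300.0, "jetson_nano": 500.0}
plan = min_max_split(320_000, rates)
```

An energy-aware variant would scale each device's rate by its measured power draw before splitting, which is the role the paper's benchmarking phase plays.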

Membranes ◽  
2020 ◽  
Vol 10 (10) ◽  
pp. 285
Author(s):  
Kang Yang Toh ◽  
Yong Yeow Liang ◽  
Woei Jye Lau ◽  
Gustavo A. Fimbres Weihs

Simulation via Computational Fluid Dynamics (CFD) offers a convenient way of visualising hydrodynamics and mass transport in spacer-filled membrane channels, facilitating further developments in spiral wound membrane (SWM) modules for desalination processes. This paper provides a review of the use of CFD modelling for the development of novel spacers used in SWM modules for three types of osmotic membrane processes: reverse osmosis (RO), forward osmosis (FO) and pressure retarded osmosis (PRO). Currently, the modelling of mass transfer and fouling for complex spacer geometries is still limited. Compared with RO, CFD modelling for PRO is very rare owing to the relative infancy of this osmotically driven membrane process. Despite the rising popularity of multi-scale modelling of osmotic membrane processes, CFD can only be used for predicting process performance in the absence of fouling. This paper also reviews the most common metrics used for evaluating membrane module performance at the small and large scales.


2019 ◽  
Vol 9 (4) ◽  
pp. 30
Author(s):  
Prashanthi Metku ◽  
Ramu Seva ◽  
Minsu Choi

Stochastic computing (SC) is an emerging low-cost computation paradigm for efficient approximation. It processes data in the form of probabilities and offers excellent progressive accuracy. Since SC’s accuracy heavily depends on the stochastic bitstream length, generating acceptable approximate results while minimizing the bitstream length is one of the major challenges in SC, as energy consumption tends to increase linearly with bitstream length. To address this issue, a novel energy-performance scalable approach based on quasi-stochastic number generators is proposed and validated in this work. Compared to conventional approaches, the proposed methodology utilizes a novel algorithm to estimate the computation time based on the required accuracy. The proposed methodology is tested and verified on a stochastic edge detection circuit to showcase its viability. Results prove that the proposed approach offers a 12–60% reduction in execution time and a 12–78% decrease in energy consumption relative to the conventional counterpart. This excellent scalability between energy and performance could be potentially beneficial to certain application domains such as image processing and machine learning, where power- and time-efficient approximation is desired.
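The length/accuracy trade-off at the heart of this abstract is easy to see in the textbook SC example: multiplying two unipolar probabilities takes a single AND gate per bit pair, and the estimate tightens as the bitstream grows. This is the standard pseudo-random formulation, not the paper's quasi-stochastic number generators.

```python
import random

def stochastic_stream(p, length, rng):
    """Unipolar stochastic bitstream: each bit is 1 with probability p."""
    return [1 if rng.random() < p else 0 for _ in range(length)]

def sc_multiply(p_a, p_b, length, rng):
    """Multiply two probabilities with one AND gate per bit pair.

    The estimate converges as the bitstream length grows, which is why
    energy (proportional to length) trades off directly against accuracy.
    """
    a = stochastic_stream(p_a, length, rng)
    b = stochastic_stream(p_b, length, rng)
    return sum(x & y for x, y in zip(a, b)) / length

rng = random.Random(42)
short = sc_multiply(0.5, 0.3, 64, rng)      # coarse estimate of 0.15
long_ = sc_multiply(0.5, 0.3, 65536, rng)   # much tighter estimate
```

Quasi-stochastic generators, as used in the paper, replace the pseudo-random source with low-discrepancy sequences so that a given accuracy is reached with a shorter (cheaper) stream.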


2019 ◽  
Vol 2019 ◽  
pp. 1-12
Author(s):  
Sun Min ◽  
Yufeng Bi ◽  
Mulian Zheng ◽  
Sai Chen ◽  
Jingjing Li

The energy consumption and greenhouse gas emissions of asphalt pavement have become a serious global problem. Polyurethane (PU), which offers very good high-temperature stability and durability, has recently been studied as an alternative binder to asphalt. However, the strength-forming mechanism and mixture structure of the PU mixture differ from those of the asphalt mixture. This work explored the design and performance evaluation of the PU mixture. The PU content of the mixtures was determined from the creep slope (K), tensile strength ratio (TSR), immersion Cantabro loss (ICL), and volume of air voids (VV) to ensure better water stability. The high- and low-temperature stability, water stability, dynamic mechanical properties, and sustainability of the PU mixture were evaluated and compared with those of a stone matrix asphalt (SMA) mixture. The test results showed that the dynamic stability and bending strain of the PU mixture were about 7.5 and 2.3 times those of SMA. The adhesion level of PU to basalt aggregate was one level greater than to limestone, so basalt aggregates were proposed for use in the PU mixture to improve water stability. Although the initial TSR and ICL of the PU mixture were lower, the long-term values were higher, indicating better long-term water-damage resistance. The dynamic modulus and phase angles (φ) of the PU mixture were much higher. The energy consumption and CO2 emissions of the PU mixture were lower than those of SMA. Therefore, the cold-mixed PU mixture is a sustainable material with excellent performance and can be used as a substitute for asphalt mixture.


2021 ◽  
Vol 20 (5s) ◽  
pp. 1-26
Author(s):  
Guihong Li ◽  
Sumit K. Mandal ◽  
Umit Y. Ogras ◽  
Radu Marculescu

Neural architecture search (NAS) is a promising technique to design efficient and high-performance deep neural networks (DNNs). As the performance requirements of ML applications grow continuously, the hardware accelerators start playing a central role in DNN design. This trend makes NAS even more complicated and time-consuming for most real applications. This paper proposes FLASH, a very fast NAS methodology that co-optimizes the DNN accuracy and performance on a real hardware platform. As the main theoretical contribution, we first propose the NN-Degree, an analytical metric to quantify the topological characteristics of DNNs with skip connections (e.g., DenseNets, ResNets, Wide-ResNets, and MobileNets). The newly proposed NN-Degree allows us to do training-free NAS within one second and build an accuracy predictor by training as few as 25 samples out of a vast search space with more than 63 billion configurations. Second, by performing inference on the target hardware, we fine-tune and validate our analytical models to estimate the latency, area, and energy consumption of various DNN architectures while executing standard ML datasets. Third, we construct a hierarchical algorithm based on simplicial homology global optimization (SHGO) to optimize the model-architecture co-design process, while considering the area, latency, and energy consumption of the target hardware. We demonstrate that, compared to the state-of-the-art NAS approaches, our proposed hierarchical SHGO-based algorithm enables more than four orders of magnitude speedup (specifically, the execution time of the proposed algorithm is about 0.1 seconds). Finally, our experimental evaluations show that FLASH is easily transferable to different hardware architectures, thus enabling us to do NAS on a Raspberry Pi-3B processor in less than 3 seconds.
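The third contribution names simplicial homology global optimization (SHGO), which is available off the shelf in SciPy. The sketch below shows the optimizer driving a toy weighted latency-energy objective over two architecture parameters; the objective functions are invented stand-ins for the paper's fitted analytical models, not FLASH itself.

```python
from scipy.optimize import shgo

# Toy stand-ins for the paper's fitted latency/energy models;
# x = (width_scale, depth_scale) of a candidate network, both in [0.5, 2].
def latency(x):
    return 0.8 * x[0] ** 2 + 0.6 * x[1]

def energy(x):
    return 0.5 * x[0] + 0.9 * x[1] ** 2

def objective(x):
    # Weighted co-design objective; the accuracy term is omitted for brevity.
    return 0.5 * latency(x) + 0.5 * energy(x)

# SHGO samples the bounded domain, builds a simplicial complex, and runs
# local minimization from each candidate basin.
result = shgo(objective, bounds=[(0.5, 2.0), (0.5, 2.0)])
```

Because the surrogate models are cheap analytical functions, each SHGO evaluation is microseconds, which is what makes the reported sub-second search times plausible.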


Computation ◽  
2020 ◽  
Vol 8 (2) ◽  
pp. 37
Author(s):  
Kaijie Fan ◽  
Biagio Cosenza ◽  
Ben Juurlink

Energy optimization is an increasingly important aspect of today’s high-performance computing applications. In particular, dynamic voltage and frequency scaling (DVFS) has become a widely adopted solution to balance performance and energy consumption, and hardware vendors provide management libraries that allow the programmer to change both memory and core frequencies manually to minimize energy consumption while maximizing performance. This article focuses on modeling the energy consumption and speedup of GPU applications while using different frequency configurations. The task is not straightforward, because of the large set of possible and uniformly distributed configurations and because of the multi-objective nature of the problem, which minimizes energy consumption and maximizes performance. This article proposes a machine learning-based method to predict the best core and memory frequency configurations on GPUs for an input OpenCL kernel. The method is based on two models for speedup and normalized energy predictions over the default frequency configuration. Those are later combined into a multi-objective approach that predicts a Pareto set of frequency configurations. Results show that our approach is very accurate at predicting extrema and the Pareto set, and finds frequency configurations that dominate the default configuration in either energy or performance.
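The final step of the method above is extracting the Pareto set from the per-configuration predictions. That filter is small enough to show in full; the normalized (energy, runtime) numbers are hypothetical, not the paper's GPU measurements.

```python
def pareto_set(points):
    """Return the non-dominated (energy, runtime) pairs, both minimized.

    A point is dominated if some other point is no worse in both
    objectives and strictly better in at least one.
    """
    front = []
    for p in points:
        dominated = any(q[0] <= p[0] and q[1] <= p[1] and q != p
                        for q in points)
        if not dominated:
            front.append(p)
    return front

# Hypothetical normalized (energy, runtime) predictions, one pair per
# core/memory frequency configuration; (1.0, 1.0) is the default config.
preds = [(1.00, 1.00), (0.70, 1.30), (0.85, 0.95), (1.10, 1.05)]
front = pareto_set(preds)
```

Here the default configuration (1.00, 1.00) is dominated by (0.85, 0.95), illustrating the abstract's claim that some frequency configurations beat the default in both objectives at once.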


2020 ◽  
Vol 63 (6) ◽  
pp. 880-899
Author(s):  
Lixia Chen ◽  
Jian Li ◽  
Ruhui Ma ◽  
Haibing Guan ◽  
Hans-Arno Jacobsen

With energy consumption in high-performance computing clouds growing rapidly, energy saving has become an important topic. Virtualization provides opportunities to save energy by enabling one physical machine (PM) to host multiple virtual machines (VMs). Dynamic voltage and frequency scaling (DVFS) is another technology to reduce energy consumption. However, in heterogeneous cloud environments where DVFS may be applied at the chip level or the core level, it is a great challenge to combine these two technologies efficiently. On per-core DVFS servers, cloud managers should carefully determine VM placements to minimize performance interference. On full-chip DVFS servers, cloud managers further face the choice of whether to combine VMs with different characteristics to reduce performance interference or to combine VMs with similar characteristics to take better advantage of DVFS. This paper presents a novel mechanism combining a VM placement algorithm and a frequency scaling method. We formulate this VM placement problem as an integer programming (IP) to find appropriate placement configurations, and we utilize support vector machines to select suitable frequencies. We conduct detailed experiments and simulations, showing that our scheme effectively reduces energy consumption with modest impact on performance. Particularly, the total energy delay product is reduced by up to 60%.
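The placement side of this problem can be made concrete with a toy exhaustive search: assign VMs to PMs so that the total co-location interference is minimized. The paper formulates this as an integer program solved at scale; brute force is only viable for the invented three-VM example below, and the pairwise interference scores are hypothetical.

```python
from itertools import product

def best_placement(vms, pms, interference):
    """Exhaustively assign VMs to PMs, minimizing the summed
    interference of co-located VM pairs.
    """
    best, best_cost = None, float("inf")
    for assign in product(range(len(pms)), repeat=len(vms)):
        cost = 0.0
        for i in range(len(vms)):
            for j in range(i + 1, len(vms)):
                if assign[i] == assign[j]:  # pair shares a PM
                    cost += interference[vms[i]][vms[j]]
        if cost < best_cost:
            best, best_cost = assign, cost
    return best, best_cost

# Hypothetical symmetric pairwise interference scores.
vms = ["cpu_heavy", "mem_heavy", "io_heavy"]
inter = {
    "cpu_heavy": {"mem_heavy": 0.2, "io_heavy": 0.1},
    "mem_heavy": {"cpu_heavy": 0.2, "io_heavy": 0.8},
    "io_heavy": {"cpu_heavy": 0.1, "mem_heavy": 0.8},
}
assign, cost = best_placement(vms, ["pm0", "pm1"], inter)
```

With two PMs and three VMs, some pair must share a machine; the search co-locates the cheapest pair (cpu_heavy with io_heavy) and isolates the memory-heavy VM, mirroring the interference-aware placement the paper targets on per-core DVFS servers.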


PLoS ONE ◽  
2017 ◽  
Vol 12 (9) ◽  
pp. e0183437
Author(s):  
Dasaraden Mauree ◽  
Silvia Coccolo ◽  
Jérôme Kaempf ◽  
Jean-Louis Scartezzini

2021 ◽  
Vol 11 (3) ◽  
pp. 1169
Author(s):  
Erol Gelenbe ◽  
Miltiadis Siavvas

Long-running software may operate on hardware platforms with limited energy resources, such as batteries or photovoltaics, or on high-performance platforms that consume a large amount of energy. Since such systems may be subject to hardware failures, checkpointing is often used to assure the reliability of the application. Since checkpointing introduces additional computation time and energy consumption, we study how checkpoint intervals need to be selected so as to minimize a cost function that includes the execution time and the energy. Expressions for both the program’s energy consumption and execution time are derived as a function of the failure probability per instruction. A first-principles analysis yields the checkpoint interval that minimizes a linear combination of the average energy consumption and execution time of the program, in terms of the classical “Lambert function”. The sensitivity of the checkpoint interval to the importance attributed to energy consumption is also derived. The results are illustrated with numerical examples regarding programs of various lengths, showing the relation between the checkpoint interval that minimizes energy consumption and execution time, and the one that minimizes a weighted sum of the two. In addition, our results are applied to a popular software benchmark, and posted on a publicly accessible web site, together with the optimization software that we have developed.
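The paper derives the exact optimum via the Lambert W function; for small failure rates, the classical first-order result (Young's formula) gives a closed form that is easy to evaluate. The sketch below uses Young's approximation, not the paper's exact Lambert-W derivation, and the numeric example is invented.

```python
import math

def young_checkpoint_interval(checkpoint_cost, failure_rate):
    """First-order optimal checkpoint interval (Young's classical formula).

    checkpoint_cost: time (or energy) overhead of writing one checkpoint.
    failure_rate: expected failures per unit of useful work.
    The exact optimum involves the Lambert W function; this sqrt form is
    the standard approximation valid when failures are rare.
    """
    return math.sqrt(2.0 * checkpoint_cost / failure_rate)

# E.g. a 10 s checkpoint cost and one failure per ~10^3 s of work:
interval = young_checkpoint_interval(10.0, 1e-3)  # ~141 s between checkpoints
```

The same formula applies with energy in place of time: substituting the energy cost of a checkpoint and the energy burned per unit of work yields the energy-minimizing interval, and a weighted blend of the two costs reproduces the paper's combined objective.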


Author(s):  
Juan P. Silva ◽  
Ernesto Dufrechou ◽  
Pablo Ezzatti ◽  
Enrique S. Quintana-Ortí ◽  
Alfredo Remón ◽  
...  

The high performance computing community has traditionally focused solely on the reduction of execution time, though in recent years the optimization of energy consumption has become a major issue. A reduction of energy usage without a degradation of performance requires the adoption of energy-efficient hardware platforms, accompanied by the development of energy-aware algorithms and computational kernels. The solution of linear systems is a key operation for many scientific and engineering problems. Its relevance has motivated a substantial amount of work, and consequently, high performance solvers can be found for a wide variety of hardware platforms. In this work, we aim to develop a high performance and energy-efficient linear system solver. In particular, we develop two solvers for a low-power CPU-GPU platform, the NVIDIA Jetson TK1. These solvers implement the Gauss-Huard algorithm, yielding an efficient usage of the target hardware as well as efficient memory access. The experimental evaluation shows that the novel proposal reports important savings in both time and energy consumption when compared with the state-of-the-art solvers for the platform.
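The Gauss-Huard scheme named above eliminates row by row and sweeps the column above the diagonal at each step, so the augmented column holds the solution when elimination finishes. The sketch below is an unpivoted, unblocked illustration of that elimination pattern; the paper's GPU solvers add pivoting and blocking for the Jetson TK1, which are omitted here.

```python
import numpy as np

def gauss_huard_solve(A, b):
    """Solve Ax = b with the Gauss-Huard elimination scheme (no pivoting).

    Illustrative sketch only: assumes a well-conditioned matrix with
    nonzero leading entries, since no pivoting is performed.
    """
    M = np.hstack([A.astype(float), b.reshape(-1, 1).astype(float)])
    n = A.shape[0]
    for k in range(n):
        # Eliminate columns 0..k-1 of row k using already-processed rows.
        M[k, k:] -= M[k, :k] @ M[:k, k:]
        # Normalize the trailing part of row k by its diagonal element.
        M[k, k + 1:] /= M[k, k]
        # Annihilate column k above the diagonal (Gauss-Jordan-style sweep).
        M[:k, k + 1:] -= np.outer(M[:k, k], M[k, k + 1:])
    return M[:, -1]  # the augmented column now holds x
```

Unlike plain LU, there is no separate triangular solve at the end, which is part of what makes the method attractive on throughput-oriented hardware.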


2020 ◽  
Author(s):  
Lucas Silva ◽  
Michael Canesche ◽  
Ricardo Ferreira ◽  
José Augusto Nacif

Recently, the increasing adoption of domain-specific architectures to execute kernels with high computing density, together with the exploration of sparse architectures using Systolic Arrays, created the ideal scenario for using Coarse-grained reconfigurable architectures (CGRAs) to accelerate applications. Unlike a Systolic Array, a CGRA can run different kernel sets and keep a good balance between energy consumption and performance. In this work, we present HPCGRA, an orthogonally designed CGRA generator for high-performance spatial accelerators. Our tool does not require any expertise in Verilog design. In our approach, the CGRA is designed and implemented in an orthogonal fashion by wrapping the main building blocks: functional units, interconnection patterns, routing and elastic buffer capabilities, configuration words, and memories. It optimizes and simplifies the process of creating CGRA architectures, using a portable description (a JSON file) and generating generic, scalable, and efficient Verilog RTL code with Veriloggen. The tool automatically generates CGRAs with up to 46×66 functional units, reaching 1.2 Tera ops/s.
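The describe-then-generate flow this abstract outlines can be sketched at toy scale: a portable JSON description goes in, Verilog text comes out. The schema, module names, and emitted RTL below are invented for illustration; HPCGRA's actual JSON format and its use of the Veriloggen library are richer than this stub.

```python
import json

def generate_cgra_stub(config_json):
    """Emit a (heavily simplified) Verilog stub for a rows x cols CGRA grid.

    HPCGRA builds full RTL with Veriloggen from a richer schema; this stub
    only illustrates turning a portable description into generated RTL.
    """
    cfg = json.loads(config_json)
    rows, cols = cfg["rows"], cfg["cols"]
    lines = [f"module cgra_{rows}x{cols} (input clk, input rst);"]
    for r in range(rows):
        for c in range(cols):
            # One functional-unit instance per grid position.
            lines.append(f"  fu #(.OPS(\"{cfg['fu_ops']}\")) "
                         f"fu_{r}_{c} (.clk(clk), .rst(rst));")
    lines.append("endmodule")
    return "\n".join(lines)

cfg = json.dumps({"rows": 2, "cols": 2, "fu_ops": "add,mul"})
rtl = generate_cgra_stub(cfg)
```

The orthogonality the paper claims amounts to each building block (functional units, interconnect, buffering, configuration) being generated by an independent wrapper like the one above, so any block can be swapped without touching the others.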

