Efficient multithreading for manycore processor: Multidimensional domain decomposition using Intel® TBB

The Insight Journal ◽

10.54294/73dn1l ◽

2017 ◽

Author(s):

Etienne St-Onge ◽

Benoit Scherrer ◽

Simon Warfield

Keyword(s):

Domain Decomposition ◽

High Performance ◽

Job Scheduling ◽

Building Blocks ◽

Massively Parallel ◽

Decomposition Approach ◽

Task Decomposition ◽

Dynamic Task ◽

Current Implementation ◽

Generic Design

The Insight Toolkit (ITK) utilizes a generic design for image processing filters that allows many developers to rapidly implement new algorithms. While ITK filters benefit from a platform-independent and versatile multithreading capability, the current implementation does not easily achieve high performance. First, ITK relies on a static decomposition of the image into subsets of equal size, which is highly inefficient when the computational complexity varies between subsets (unbalanced workloads). Second, the current domain decomposition is limited to subdividing the input domain along a single dimension (typically the slice dimension in a 3-D volume), which leaves threads under-utilized on massively parallel compute systems when the number of threads is larger than the size of this dimension. We previously presented a new itk::TBBImageToImageFilter class that replaced the static task decomposition with a dynamic task decomposition for improved workload balancing, in which the job scheduling task was optimized using the Intel® Threading Building Blocks (TBB) library. In this work, we propose a new multidimensional dynamic image decomposition approach that allows decomposition over an arbitrary number of dimensions. This new generic multithreading capability, combined with the TBB dynamic task scheduler, substantially improves multithreading performance when using massively parallel processors.
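
The single-dimension limit described in this abstract can be illustrated with a small sketch. It is not the ITK or TBB implementation: the helper names `split_axis` and `decompose`, the volume shape and the thread count are all hypothetical, chosen only to show why splitting along one axis caps the number of work items while splitting along several axes does not.

```python
from itertools import product


def split_axis(length, parts):
    """Split range(length) into at most `parts` contiguous (start, stop) pieces."""
    parts = max(1, min(parts, length))  # cannot have more pieces than elements
    base, extra = divmod(length, parts)
    pieces, start = [], 0
    for i in range(parts):
        stop = start + base + (1 if i < extra else 0)
        pieces.append((start, stop))
        start = stop
    return pieces


def decompose(shape, splits):
    """Return every block as a tuple of per-axis (start, stop) pairs."""
    return list(product(*(split_axis(n, s) for n, s in zip(shape, splits))))


shape = (16, 512, 512)                  # hypothetical volume with 16 slices
single = decompose(shape, (64, 1, 1))   # 64 threads requested, slice axis only
multi = decompose(shape, (4, 4, 4))     # same request spread over three axes
print(len(single), len(multi))          # 16 64
```

With 64 threads and only 16 slices, the single-axis split can feed at most 16 threads, whereas the three-axis split yields 64 blocks that still cover the volume exactly once.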

Massively Parallel Simulation of Antenna Array Using Domain Decomposition Method and a High-Performance Computing Scheme

2019 IEEE International Symposium on Antennas and Propagation and USNC-URSI Radio Science Meeting ◽

10.1109/apusncursinrsm.2019.8888517 ◽

2019 ◽

Cited By ~ 2

Author(s):

Hao-Xuan Zhang ◽

Li. Huang ◽

Liang Zhou ◽

Z.G Zhao ◽

Yu-Teng Zheng ◽

...

Keyword(s):

Domain Decomposition ◽

High Performance Computing ◽

Antenna Array ◽

Decomposition Method ◽

High Performance ◽

Domain Decomposition Method ◽

Parallel Simulation ◽

Massively Parallel ◽

Performance Computing ◽

Computing Scheme

High performance domain decomposition methods on massively parallel architectures with freefem++

Journal of Numerical Mathematics ◽

10.1515/jnum-2012-0015 ◽

2012 ◽

Vol 20 (3-4) ◽

Cited By ~ 5

Author(s):

P. Jolivet ◽

V. Dolean ◽

F. Hecht ◽

F. Nataf ◽

C. Prud’Homme ◽

...

Keyword(s):

Domain Decomposition ◽

High Performance ◽

Decomposition Methods ◽

Parallel Architectures ◽

Massively Parallel ◽

Domain Decomposition Methods ◽

Massively Parallel Architectures

Hardware Trends and Implications for Programming Models

Handbook of Research on Computational Science and Engineering - Advances in Computer and Electrical Engineering ◽

10.4018/978-1-61350-116-0.ch001 ◽

2012 ◽

pp. 1-21

Author(s):

Gabriele Jost ◽

Alice E. Koniges

Keyword(s):

High Performance Computing ◽

High Performance ◽

Building Blocks ◽

Programming Models ◽

Massively Parallel ◽

New Challenges ◽

Hardware Designs ◽

Performance Computing

The upcoming years bring new challenges in high-performance computing (HPC) technology. Fundamental changes in the building blocks of HPC hardware are forcing corresponding changes in programming models to effectively use these new architectures. The changes in store for HPC will rival the vector to massively parallel transition that scientific and engineering codes and methodologies endured several years ago. We describe some of the upcoming trends in hardware designs, and suggest ways in which software and programming models will advance accordingly.

A new implementation of itk::ImageToImageFilter for efficient parallelization of image processing algorithms using Intel Threading Building Blocks

The Insight Journal ◽

10.54294/mq1gt4 ◽

2016 ◽

Author(s):

Amir Jaberzadeh ◽

Benoit Scherrer ◽

Simon Warfield

Keyword(s):

Image Processing ◽

Image Analysis ◽

Medical Imaging ◽

Open Source ◽

High Performance ◽

Job Scheduling ◽

Building Blocks ◽

Reconstruction Image ◽

Imaging Algorithms ◽

Very High

Modern medical imaging makes use of high performance computing to accelerate image acquisition, image reconstruction, image visualization and image analysis. Software libraries that provide implementations of key medical imaging algorithms need to efficiently exploit modern CPU architectures. In particular, workstations with small numbers of cores are being replaced by very high core count architectures, and by many integrated core architectures, which offer acceleration by vectorization and multi-threading. The Insight Toolkit (ITK) is the premier open source implementation of medical imaging algorithms, with a generic design for image processing filters that allows many developers to rapidly incorporate these algorithms into new applications. While ITK filters benefit from a generic, platform independent multithreading capability, the current implementation is difficult to exploit to achieve very high performance. Specifically, ITK relies on a static decomposition of the image into subsets of equal size, which can be highly inefficient. Threads that finish early due to uneven work throughout the image do not contribute further to the processing of more complex regions, leading to idle computational resources and longer execution times. Performance is also difficult to coordinate across multiple algorithms, as the ITK filter assumes each filter operates independently but the global implementation has an impact across filters. In this work, we propose a novel, simple to use, high performance multithreading capability for ITK that accelerates the itk::ImageToImageFilter. We utilise a workpile data decomposition strategy, and leave the task of optimal job scheduling on CPU cores to the library called Threading Building Blocks (TBB). We demonstrate the efficacy of multi-threading with TBB in comparison to the itk::Multithreader class, through three simple example image analysis algorithms. Our implementation provides a new multi-threaded itk::ImageToImageFilter that can be conveniently reused to provide simple and efficient multi-threaded code across applications and algorithm libraries. Our new implementation is distributed as open-source software to the community and is straightforward to adopt.
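
The cost of static decomposition under uneven work, as described in this abstract, can be made concrete with an idealized simulation. This is only a sketch under stated assumptions: the function names and the task costs are hypothetical, and the shared-pile model below is a simplification that does not reproduce how TBB actually schedules tasks.

```python
import heapq


def static_makespan(costs, workers):
    """Equal-size contiguous blocks, one per worker; the slowest block sets the finish time."""
    base, extra = divmod(len(costs), workers)
    finish, start = 0, 0
    for w in range(workers):
        stop = start + base + (1 if w < extra else 0)
        finish = max(finish, sum(costs[start:stop]))
        start = stop
    return finish


def workpile_makespan(costs, workers):
    """Each worker takes the next task from a shared pile as soon as it becomes free."""
    free_at = [0] * workers            # time at which each worker is next idle
    heapq.heapify(free_at)
    for cost in costs:
        t = heapq.heappop(free_at)     # earliest idle worker picks up the task
        heapq.heappush(free_at, t + cost)
    return max(free_at)


# Hypothetical workload: the last quarter of the image is nine times more expensive.
costs = [1] * 12 + [9] * 4
print(static_makespan(costs, 4), workpile_makespan(costs, 4))  # 36 12
```

In this toy workload, the static split leaves three workers idle while one processes all the expensive tasks, and the shared pile spreads those tasks evenly.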

Multi-level Parallelization of Genotype Imputation on Supercomputers

Current Bioinformatics ◽

10.2174/1574893615999200420071307 ◽

2020 ◽

Vol 15 ◽

Author(s):

Weiwen Zhang ◽

Long Wang ◽

Theint Theint Aye ◽

Juniarto Samsudin ◽

Yongqing Zhu

Keyword(s):

Association Study ◽

Message Passing ◽

High Performance ◽

Message Passing Interface ◽

Genome Wide Association Study ◽

Job Scheduling ◽

Genotype Imputation ◽

Job Level ◽

Multi Level ◽

High Performance Requirement

Background: Genotype imputation as a service is developed to enable researchers to estimate genotypes on haplotyped data without performing whole genome sequencing. However, genotype imputation is computationally intensive and thus it remains a challenge to satisfy the high performance requirement of genome wide association study (GWAS). Objective: In this paper, we propose a high performance computing solution for genotype imputation on supercomputers to enhance its execution performance. Method: We design and implement a multi-level parallelization that includes job level, process level and thread level parallelization, enabled by job scheduling management, message passing interface (MPI) and OpenMP, respectively. It involves job distribution, chunk partition and execution, parallelized iteration for imputation and data concatenation. Due to the design of multi-level parallelization, we can exploit the multi-machine/multi-core architecture to improve the performance of genotype imputation. Results: Experimental results show that our proposed method can outperform the Hadoop-based implementation of genotype imputation. Moreover, we conduct the experiments on supercomputers to evaluate the performance of the proposed method. The evaluation shows that it can significantly shorten the execution time, thus improving the performance for genotype imputation. Conclusion: The proposed multi-level parallelization, when deployed as imputation as a service, will facilitate bioinformatics researchers in Singapore to conduct genotype imputation and enhance the association study.

Collapsed carbon nanotubes as building blocks for high-performance thermal materials

Physical Review Materials ◽

10.1103/physrevmaterials.1.056001 ◽

2017 ◽

Vol 1 (5) ◽

Cited By ~ 5

Author(s):

Jihong Al-Ghalith ◽

Hao Xu ◽

Traian Dumitrică

Keyword(s):

Carbon Nanotubes ◽

High Performance ◽

Building Blocks

Novel Bloch wave excitation platform based on few-layer photonic crystal deposited on D-shaped optical fiber

Scientific Reports ◽

10.1038/s41598-021-90504-z ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Esteban Gonzalez-Valencia ◽

Ignacio Del Villar ◽

Pedro Torres

Keyword(s):

Optical Fibers ◽

High Performance ◽

Light Propagation ◽

Fiber Optic ◽

Layer Structure ◽

Optical Network ◽

Building Blocks ◽

Bloch Wave ◽

Nanophotonic Devices ◽

Wide Range

With the goal of ultimate control over the light propagation, photonic crystals currently represent the primary building blocks for novel nanophotonic devices. Bloch surface waves (BSWs) in periodic dielectric multilayer structures with a surface defect are a well-known phenomenon, which offers new opportunities for controlling the light propagation and has many applications in the physical and biological sciences. However, most of the reported structures based on BSWs require depositing a large number of alternating layers or exploiting a large refractive index (RI) contrast between the materials constituting the multilayer structure, thereby increasing the complexity and costs of manufacturing. The combination of fiber-optic-based platforms with nanotechnology is opening the opportunity for the development of high-performance photonic devices that strongly enhance the light-matter interaction compared to other optical platforms. Here, we report a BSW-supporting platform that uses geometrically modified commercial optical fibers such as D-shaped optical fibers, where a few-layer structure is deposited on the flat surface using metal oxides with a moderate difference in RI. In this novel fiber optic platform, BSWs are excited through the evanescent field of the core-guided fundamental mode, which indicates that the structure proposed here can be used as a sensing probe, along with other intrinsic properties of fiber optic sensors, such as light weight, multiplexing capacity and ease of integration in an optical network. As a demonstration, fiber optic BSW excitation is shown to be suitable for measuring RI variations. The designed structure is easy to manufacture and could be adapted to a wide range of applications in the fields of telecommunications, environment, health, and material characterization.

Optimization of Dynamic Task Location within a Manipulator’s Workspace for the Utilization of the Minimum Required Joint Torques

Electronics ◽

10.3390/electronics10030288 ◽

2021 ◽

Vol 10 (3) ◽

pp. 288

Author(s):

Adam Wolniakowski ◽

Charalampos Valsamos ◽

Kanstantsin Miatliuk ◽

Vassilis Moulianitis ◽

Nikos Aspragathos

Keyword(s):

High Performance ◽

Human Robot Interaction ◽

Optimal Position ◽

End Effector ◽

Joint Torques ◽

Dynamic Task ◽

Arbitrary Position ◽

Simulation Results ◽

Set Up ◽

Task Placement

The determination of the optimal position of a robotic task within a manipulator’s workspace is crucial for the manipulator to achieve high performance regarding selected aspects of its operation. In this paper, a method for determining the optimal task placement for a serial manipulator is presented, so that the required joint torques are minimized. The task considered comprises the exercise of a given force in a given direction along a 3D path followed by the end effector. Given that many such tasks are usually conducted by human workers and as such the utilized trajectories are quite complex to model, a Human Robot Interaction (HRI) approach was chosen to define the task, where the robot is taught the task trajectory by a human operator. Furthermore, the presented method considers the singularity-free paths of the manipulator’s end-effector motion in the configuration space. Simulation results are utilized to set up a physical execution of the task in the optimal derived position within a UR-3 manipulator’s workspace. For reference, the task is also placed at an arbitrary “bad” location in order to validate the simulation results. Experimental results verify that the positioning of the task at the optimal location derived by the presented method allows for the task execution with minimum joint torques as opposed to the arbitrary position.

High-performance sampling of generic determinantal point processes

Philosophical Transactions of The Royal Society A Mathematical Physical and Engineering Sciences ◽

10.1098/rsta.2019.0059 ◽

2020 ◽

Vol 378 (2166) ◽

pp. 20190059 ◽

Cited By ~ 1

Author(s):

Jack Poulson

Keyword(s):

Point Process ◽

Spectral Decomposition ◽

High Performance ◽

Point Processes ◽

Cholesky Factorization ◽

Determinantal Point Processes ◽

Determinantal Point Process ◽

Decomposition Approach ◽

Map Inference ◽

Sampling Schemes

Determinantal point processes (DPPs) were introduced by Macchi (Macchi 1975 Adv. Appl. Probab. 7, 83–122) as a model for repulsive (fermionic) particle distributions. But their recent popularization is largely due to their usefulness for encouraging diversity in the final stage of a recommender system (Kulesza & Taskar 2012 Found. Trends Mach. Learn. 5, 123–286). The standard sampling scheme for finite DPPs is a spectral decomposition followed by an equivalent of a randomly diagonally pivoted Cholesky factorization of an orthogonal projection, which is only applicable to Hermitian kernels and has an expensive set-up cost. Researchers Launay et al. 2018 ( http://arxiv.org/abs/1802.08429 ); Chen & Zhang 2018 NeurIPS ( https://papers.nips.cc/paper/7805-fast-greedy-map-inference-for-determinantal-point-process-to-improve-recommendation-diversity.pdf ) have begun to connect DPP sampling to LDL^H factorizations as a means of avoiding the initial spectral decomposition, but existing approaches have only outperformed the spectral decomposition approach in special circumstances, where the number of kept modes is a small percentage of the ground set size. This article proves that trivial modifications of LU and LDL^H factorizations yield efficient direct sampling schemes for non-Hermitian and Hermitian DPP kernels, respectively. Furthermore, it is experimentally shown that even dynamically scheduled, shared-memory parallelizations of high-performance dense and sparse-direct factorizations can be trivially modified to yield DPP sampling schemes with essentially identical performance. The software developed as part of this research, Catamari ( hodgestar.com/catamari ) is released under the Mozilla Public License v.2.0. It contains header-only, C++14 plus OpenMP 4.0 implementations of dense and sparse-direct, Hermitian and non-Hermitian DPP samplers. This article is part of a discussion meeting issue ‘Numerical algorithms for high-performance computational science’.
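
The idea of sampling a DPP by modifying an LU factorization can be sketched in a few lines. This is an unoptimized dense illustration written for this listing, not code from Catamari: the function name `sample_dpp` is hypothetical, the input is assumed to be a valid marginal kernel given as a list of rows, and no pivoting or blocking is attempted.

```python
import random


def sample_dpp(kernel, rng=random):
    """Draw one sample (list of indices) from the DPP with the given marginal kernel."""
    K = [row[:] for row in kernel]   # work on a copy of the kernel
    n = len(K)
    sample = []
    for j in range(n):
        # K[j][j] now holds P(j is selected | decisions already made for 0..j-1).
        if rng.random() < K[j][j]:
            sample.append(j)
        else:
            K[j][j] -= 1.0           # the only departure from a plain LU step
        pivot = K[j][j]
        # Standard right-looking LU update (Schur complement) of the trailing block.
        for i in range(j + 1, n):
            K[i][j] /= pivot
            for k in range(j + 1, n):
                K[i][k] -= K[i][j] * K[j][k]
    return sample


# Rank-one projection kernel: every sample contains exactly one of the two items.
K = [[0.5, 0.5], [0.5, 0.5]]
print(sample_dpp(K, random.Random(0)))
```

Each diagonal entry is used as a conditional inclusion probability just before it becomes the pivot, so the sampler costs about the same as the factorization it is built on.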

High performance mapping for massively parallel hierarchical structures

[1990 Proceedings] The Third Symposium on the Frontiers of Massively Parallel Computation ◽

10.1109/fmpc.1990.89467 ◽

2002 ◽

Author(s):

S.G. Ziavras

Keyword(s):

High Performance ◽

Hierarchical Structures ◽

Massively Parallel
