Efficient SIMDization and Data Management of the Lattice QCD Computation on the Cell Broadband Engine

2009 ◽  
Vol 17 (1-2) ◽  
pp. 153-172 ◽  
Author(s):  
Khaled Z. Ibrahim ◽  
François Bodin

Lattice Quantum Chromodynamics (QCD) models subatomic interactions on a four-dimensional discretization of the space–time continuum. Lattice QCD computation is one of the grand challenges in physics, especially when modeling a lattice with small spacing. In this work, we study the implementation, on the Cell Broadband Engine, of the main kernel routine of Lattice QCD that dominates the execution time. We tackle the problems of efficient SIMD execution and of the limited bandwidth for data transfers to and from off-chip memory. For efficient SIMD execution, we present a runtime data fusion technique that groups data processed similarly at runtime. We also introduce the analysis needed to reduce pressure on the scarce memory bandwidth that limits the performance of this computation. We study two implementations of the main kernel routine that exhibit different memory access patterns and thus allow different sets of optimizations, and we show the attributes that make one implementation more favorable in terms of performance. For a lattice significantly larger than the local store, our implementation achieves 31.2 GFlops in single precision and 16.6 GFlops in double precision on the PowerXCell 8i, an order of magnitude better than the performance achieved on most general-purpose processors.
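The runtime data fusion idea can be illustrated with a small, hypothetical sketch (not the authors' code): work items that need the same arithmetic are grouped at runtime so that each group can be consumed in fixed-width SIMD batches. The names, the four-wide vector width, and the toy work list are all assumptions for illustration.

```python
# Toy runtime data fusion: work items are grouped by the operation they
# need, so each group can be consumed in fixed-width SIMD batches.
SIMD_WIDTH = 4  # e.g. four single-precision floats per vector register

def fuse(work_items):
    """Group (operation, value) work items by operation tag at runtime."""
    groups = {}
    for op, value in work_items:
        groups.setdefault(op, []).append(value)
    return groups

def batch(groups):
    """Cut each group into SIMD_WIDTH-sized batches."""
    batches = []
    for op, values in groups.items():
        for i in range(0, len(values), SIMD_WIDTH):
            batches.append((op, values[i:i + SIMD_WIDTH]))
    return batches

work = [("mul", v) for v in range(6)] + [("add", v) for v in range(5)]
batches = batch(fuse(work))
# 6 "mul" items -> batches of 4 and 2; 5 "add" items -> batches of 4 and 1
```

Without fusion, the mixed "mul"/"add" stream would force scalar execution or wasted vector lanes; grouping first lets most batches run with all lanes doing the same operation.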

2003 ◽  
Vol 14 (06) ◽  
pp. 723-746 ◽  
Author(s):  
TING-WAI CHIU ◽  
TUNG-HAN HSIEH ◽  
CHAO-HSI HUANG ◽  
TSUNG-REN HUANG

A computational system for lattice QCD with overlap Dirac quarks is described. The platform is a home-made Linux PC cluster built with off-the-shelf components. At present the system consists of 64 nodes, each with one Pentium 4 processor (1.6/2.0/2.5 GHz), one Gbyte of PC800/1066 RDRAM, one 40/80/120 Gbyte hard disk, and a network card. The computationally intensive parts of our program are written in SSE2 code. The speed of our system is estimated to be 70 Gflops, and its price/performance ratio is better than $1.0/Mflops for 64-bit (double-precision) computations in quenched QCD. We discuss how to optimize its hardware and software for computing propagators of overlap Dirac quarks.
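The quoted price/performance figure implies a simple cost bound, worked out below as a back-of-envelope check (the per-node figure is a derived estimate, not a number from the paper):

```python
# 70 Gflops sustained at better than $1.0 per Mflops bounds the
# total hardware cost of the 64-node cluster.
gflops = 70
mflops = gflops * 1000            # 70 000 Mflops
max_cost = mflops * 1.0           # price/performance < $1.0/Mflops -> < $70 000
per_node = max_cost / 64          # roughly $1 094 per node at most
```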


2018 ◽  
Vol 616 ◽  
pp. A69 ◽  
Author(s):  
M. Bilicki ◽  
H. Hoekstra ◽  
M. J. I. Brown ◽  
V. Amaro ◽  
C. Blake ◽  
...  

We present a machine-learning photometric redshift (ML photo-z) analysis of the Kilo-Degree Survey Data Release 3 (KiDS DR3), using two neural-network-based techniques: ANNz2 and MLPQNA. Despite the limited coverage of spectroscopic training sets, these ML codes provide photo-zs of quality comparable to, if not better than, those from the Bayesian Photometric Redshift (BPZ) code, at least up to zphot ≲ 0.9 and r ≲ 23.5. At the bright end of r ≲ 20, where very complete spectroscopic data overlapping with KiDS are available, the performance of the ML photo-zs clearly surpasses that of BPZ, currently the primary photo-z method for KiDS. Using the Galaxy And Mass Assembly (GAMA) spectroscopic survey as calibration, we furthermore study how photo-zs improve for bright sources when photometric parameters beyond magnitudes are included in the photo-z derivation, as well as when VIKING and WISE infrared (IR) bands are added. While the fiducial four-band ugri setup gives a photo-z bias 〈δz/(1 + z)〉 = −2 × 10−4 and scatter σδz/(1+z) < 0.022 at mean 〈z〉 = 0.23, combining magnitudes, colours, and galaxy sizes reduces the scatter by ~7% and the bias by an order of magnitude. Once the ugri and IR magnitudes are joined into 12-band photometry spanning up to 12 μm, the scatter decreases by more than 10% over the fiducial case. Finally, using the 12 bands together with optical colours and linear sizes gives 〈δz/(1 + z)〉 < 4 × 10−5 and σδz/(1+z) < 0.019. This paper also serves as a reference for two public photo-z catalogues accompanying KiDS DR3, both obtained with the ANNz2 code. The first, a general-purpose catalogue, includes all 39 million KiDS sources with four-band ugri measurements in DR3. The second, optimised for low-redshift studies such as galaxy-galaxy lensing, is limited to r ≲ 20 and provides photo-zs of much better quality than the full-depth case, thanks to incorporating optical magnitudes, colours, and sizes in the GAMA-calibrated photo-z derivation.
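The quality metrics quoted above can be sketched as follows. The sample values are made up, and the paper may use a robust scatter estimator; the plain standard deviation used here is an assumption for illustration.

```python
# Photo-z quality metrics: normalised residual dz = (z_phot - z_spec)/(1 + z_spec),
# bias = mean(dz), scatter = standard deviation of dz.
def photoz_metrics(z_phot, z_spec):
    dz = [(zp - zs) / (1.0 + zs) for zp, zs in zip(z_phot, z_spec)]
    n = len(dz)
    bias = sum(dz) / n
    scatter = (sum((d - bias) ** 2 for d in dz) / n) ** 0.5
    return bias, scatter

# Hypothetical sample: photo-zs within ~0.01 of the spectroscopic values.
z_spec = [0.10, 0.20, 0.30, 0.40]
z_phot = [0.11, 0.19, 0.31, 0.40]
bias, scatter = photoz_metrics(z_phot, z_spec)
```

Dividing by (1 + z) is the standard convention so that a fixed wavelength-calibration error contributes the same normalised residual at every redshift.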


2013 ◽  
Vol 46 ◽  
pp. 1-45 ◽  
Author(s):  
P. Nightingale ◽  
I. P. Gent ◽  
C. Jefferson ◽  
I. Miguel

Special-purpose constraint propagation algorithms frequently make implicit use of short supports -- by examining a subset of the variables, they can infer support (a justification that a variable-value pair may still form part of an assignment that satisfies the constraint) for all other variables and values and save substantial work -- but short supports have not been studied in their own right. The two main contributions of this paper are the identification of short supports as important for constraint propagation, and the introduction of HaggisGAC, an efficient and effective general-purpose propagation algorithm for exploiting short supports. Given the complexity of HaggisGAC, we present it as an optimised version of a simpler algorithm, ShortGAC. Although experiments demonstrate the efficiency of ShortGAC compared with other general-purpose propagation algorithms where a compact set of short supports is available, we show theoretically and experimentally that HaggisGAC is even better. We also find that HaggisGAC performs better than GAC-Schema on full-length supports. We also introduce a variant algorithm, HaggisGAC-Stable, which is adapted to avoid work on backtracking and in some cases can be faster and use significantly less memory. All the proposed algorithms are excellent for propagating disjunctions of constraints. In all experiments with disjunctions we found our algorithms to be faster than Constructive Or and GAC-Schema by at least an order of magnitude, and up to three orders of magnitude.
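A minimal, hypothetical illustration of the short-support idea: for the disjunctive constraint x0 = 1 ∨ x1 = 1, the partial assignment {x0: 1} already guarantees satisfaction, so it supports every value of every other variable. The brute-force checker below is for exposition only, not any of the algorithms in the paper.

```python
from itertools import product

def satisfies(assignment):
    """The example constraint: x0 = 1 or x1 = 1."""
    return assignment.get("x0") == 1 or assignment.get("x1") == 1

def is_short_support(partial, variables, domains):
    """A partial assignment is a short support if every completion of the
    remaining variables satisfies the constraint (brute-force check)."""
    rest = [v for v in variables if v not in partial]
    for values in product(*(domains[v] for v in rest)):
        full = dict(partial, **dict(zip(rest, values)))
        if not satisfies(full):
            return False
    return True

variables = ["x0", "x1", "x2", "x3"]
domains = {v: [0, 1] for v in variables}
short = is_short_support({"x0": 1}, variables, domains)      # True
not_short = is_short_support({"x2": 0}, variables, domains)  # False
```

A propagator holding the short support {x0: 1} need not re-examine x1, x2, or x3 at all, which is exactly the saving the paper's algorithms exploit systematically.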


1964 ◽  
Vol 179 (1) ◽  
pp. 222-233 ◽  
Author(s):  
A. P. Vafiadakis ◽  
W. Johnson ◽  
I. S. Donaldson

Earlier work on a water-hammer technique for high-rate forming of sheet metal has been extended to include work on deep drawing using lead plugs. A study of the pressure-time history of a deforming blank during its initial movement is reported. An assessment of the overall efficiency of the process has been made and is found to be about 50 per cent; this is an order of magnitude better than that found with comparable electro-hydraulic and explosive methods.


2021 ◽  
pp. 1-19
Author(s):  
Habib Ghanbarpourasl

Abstract This paper introduces a power-series-based method for attitude reconstruction from a triad of orthogonal strap-down gyros. The method is implemented and validated using quaternions and the direction cosine matrix, in both single- and double-precision forms. Gyro data are assumed to be sampled at high frequency, and a fitted polynomial provides an analytical description of the angular velocity vector. The method is compared with the well-known Taylor series approach, and the stability of the coefficients' norm in higher-order terms is analysed for both methods. It is shown that the norm of the quaternion derivatives in the Taylor series is larger than that of the equivalent coefficients in the power series. In the proposed method, more terms can therefore be used in the power series before the coefficients saturate, and its error is smaller than that of the other methods. The numerical results show that the proposed method with quaternions performs better than the other methods. The method is robust with respect to sensor noise and has a low computational load compared with other methods.
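The kinematic relation all such schemes integrate is q̇ = ½ q ⊗ (0, ω). The sketch below is a plain first-order integration with re-normalisation, the kind of baseline series methods are compared against; it is not the paper's power-series scheme, and the step size and test rotation are assumptions.

```python
import math

def quat_mul(a, b):
    """Hamilton product of quaternions (w, x, y, z)."""
    w1, x1, y1, z1 = a
    w2, x2, y2, z2 = b
    return (w1*w2 - x1*x2 - y1*y2 - z1*z2,
            w1*x2 + x1*w2 + y1*z2 - z1*y2,
            w1*y2 - x1*z2 + y1*w2 + z1*x2,
            w1*z2 + x1*y2 - y1*x2 + z1*w2)

def integrate(omega, dt, steps):
    """First-order integration of q_dot = 0.5 * q ⊗ (0, omega)."""
    q = (1.0, 0.0, 0.0, 0.0)
    wq = (0.0, *omega)
    for _ in range(steps):
        dq = tuple(0.5 * dt * c for c in quat_mul(q, wq))
        q = tuple(qi + di for qi, di in zip(q, dq))
        norm = math.sqrt(sum(c * c for c in q))
        q = tuple(c / norm for c in q)  # re-normalise each step
    return q

# Rotate at 1 rad/s about z for ~pi seconds -> half-turn: q ≈ (0, 0, 0, 1).
q = integrate((0.0, 0.0, 1.0), dt=1e-4, steps=int(math.pi / 1e-4))
```

Series methods such as the paper's replace these many tiny steps with one analytic expansion per sample interval, which is where their accuracy and computational-load advantages come from.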


2009 ◽  
Vol 17 (1-2) ◽  
pp. 43-57 ◽  
Author(s):  
Michael Kistler ◽  
John Gunnels ◽  
Daniel Brokenshire ◽  
Brad Benton

In this paper we present the design and implementation of the Linpack benchmark for the IBM BladeCenter QS22, which incorporates two IBM PowerXCell 8i processors. The PowerXCell 8i is a new implementation of the Cell Broadband Engine™ architecture and contains a set of special-purpose processing cores known as Synergistic Processing Elements (SPEs). The SPEs can be used as computational accelerators to augment the main PowerPC processor. The added computational capability of the SPEs results in a peak double precision floating point capability of 108.8 GFLOPS. We explain how we modified the standard open source implementation of Linpack to accelerate key computational kernels using the SPEs of the PowerXCell 8i processors. We describe in detail the implementation and performance of the computational kernels and also explain how we employed the SPEs for high-speed data movement and reformatting. The result of these modifications is a Linpack benchmark optimized for the IBM PowerXCell 8i processor that achieves 170.7 GFLOPS on a BladeCenter QS22 with 32 GB of DDR2 SDRAM memory. Our implementation of Linpack also supports clusters of QS22s, and was used to achieve a result of 11.1 TFLOPS on a cluster of 84 QS22 blades. We compare our results on a single BladeCenter QS22 with the base Linpack implementation without SPE acceleration to illustrate the benefits of our optimizations.
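A back-of-envelope check on the quoted figures, assuming the 108.8 GFLOPS peak is per PowerXCell 8i processor (so a two-processor QS22 blade peaks at 217.6 GFLOPS):

```python
# Linpack efficiency implied by the numbers in the abstract.
peak_per_processor = 108.8           # GFLOPS, double precision
blade_peak = 2 * peak_per_processor  # 217.6 GFLOPS per QS22 blade
blade_linpack = 170.7                # GFLOPS achieved on one blade
blade_eff = blade_linpack / blade_peak        # ~78% of peak

cluster_peak = 84 * blade_peak / 1000.0       # ~18.3 TFLOPS for 84 blades
cluster_linpack = 11.1                        # TFLOPS achieved
cluster_eff = cluster_linpack / cluster_peak  # ~61% of peak
```

The drop from ~78% single-blade efficiency to ~61% at cluster scale is typical of Linpack, where interconnect and panel-broadcast costs grow with node count.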


2003 ◽  
Vol 36 (6) ◽  
pp. 1319-1323 ◽  
Author(s):  
A. Morawiec

A method that improves the accuracy of misorientations determined from Kikuchi patterns is described. It is based on the fact that some parameters of a misorientation calculated from two orientations are more accurate than others. A procedure that eliminates the inaccurate elements is devised; it requires at least two foil inclinations. The quality of the approach relies on the ability to set large sample-to-detector distances and on the good spatial resolution of transmission electron microscopy. The achievable accuracy is one order of magnitude better than that of the standard procedure.


2019 ◽  
pp. 155-168
Author(s):  
Murukesan Loganathan ◽  
Thennarasan Sabapathy ◽  
Mohamed Elobaid Elshaikh ◽  
Mohamed Nasrun Osman ◽  
Rosemizi Abd Rahim ◽  
...  

An efficient collision arbitration protocol facilitates fast tag identification in radio frequency identification (RFID) systems. The EPCGlobal-Class1-Generation2 (EPC-C1G2) protocol is the current standard for collision arbitration in commercial RFID systems. However, its main drawback is that it requires excessive message exchanges between the tags and the reader, wasting the energy of already resource-constrained RFID readers. Hence, in this work a reinforcement-learning-based anti-collision protocol (RL-DFSA) is proposed to address the energy-efficient collision arbitration problem in RFID systems. The proposed algorithm continuously learns and adapts to changes in the environment by devising an optimal policy. RL-DFSA was evaluated through extensive simulations and compared with the variants of the EPC-C1G2 algorithm currently used in commercial readers. Based on the results, we conclude that RL-DFSA performs as well as or better than the EPC-C1G2 protocol in delay, throughput, and time system efficiency when simulated in sparse and dense environments, while requiring an order of magnitude fewer control message exchanges between the reader and the tags.
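A hypothetical sketch of the idea behind RL-DFSA: treat the frame size of dynamic framed slotted ALOHA as the action of a stateless bandit learner and reward the fraction of singleton (successful) slots. The names, parameters, and reward are assumptions for illustration, not the paper's exact formulation.

```python
import random

random.seed(0)
FRAME_SIZES = [8, 16, 32, 64]

def run_frame(n_tags, frame):
    """Simulate one framed-slotted-ALOHA frame: each tag picks a slot
    uniformly; reward is the fraction of singleton (successful) slots."""
    slots = [0] * frame
    for _ in range(n_tags):
        slots[random.randrange(frame)] += 1
    return sum(1 for s in slots if s == 1) / frame

counts = {f: 0 for f in FRAME_SIZES}
q = {f: 0.0 for f in FRAME_SIZES}
for _ in range(2000):
    # epsilon-greedy action selection over frame sizes
    if random.random() < 0.2:
        f = random.choice(FRAME_SIZES)
    else:
        f = max(q, key=q.get)
    r = run_frame(16, f)
    counts[f] += 1
    q[f] += (r - q[f]) / counts[f]  # incremental sample-average update

best = max(q, key=q.get)
# framed-ALOHA throughput peaks when the frame size ≈ the tag count,
# so with 16 tags the learner should settle on a frame size of 16
```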


Author(s):  
Norikazu Ikoma ◽  
Akihiro Asahara

Real-time visual tracking by particle filter has been implemented in parallel on the Cell Broadband Engine. The major problem for the implementation is the small Local Store (LS) of the SPEs (Synergistic Processing Elements), the computational cores, when dealing with large images. As a first step, we focus on color single-object tracking, one of the simplest cases of visual tracking. By compressing the color-extracted image into a bit-wise representation of a binary image, all information of the color-extracted image can be stored in the LS for a 640×480 original image. By applying our previous implementation of a general particle filter algorithm on the Cell/B.E. to this specific case, we achieve real-time visual tracking on a PlayStation®3 at about 7 fps with a camera of at most 15 fps.
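The bit-wise compression argument can be made concrete: at one bit per pixel, a 640×480 binary image needs only 38 400 bytes, comfortably inside an SPE's 256 KB Local Store. The helper names below are assumptions for illustration.

```python
# One bit per pixel: a 640x480 binary (color-extracted) image packs
# into 640*480/8 = 38 400 bytes, versus 307 200 bytes at one byte/pixel.
W, H = 640, 480

def pack(binary_image):
    """Pack a row-major list of 0/1 pixels into a bytearray, 8 per byte."""
    out = bytearray((W * H + 7) // 8)
    for i, bit in enumerate(binary_image):
        if bit:
            out[i >> 3] |= 1 << (i & 7)
    return out

def get_pixel(packed, x, y):
    """Read pixel (x, y) back out of the packed representation."""
    i = y * W + x
    return (packed[i >> 3] >> (i & 7)) & 1

image = [(x + y) % 2 for y in range(H) for x in range(W)]  # checkerboard
packed = pack(image)
```

On the real hardware the same layout also cuts DMA traffic between main memory and the LS by a factor of eight, which matters as much as the storage saving.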


2007 ◽  
Vol 7 (1) ◽  
pp. 151-167 ◽  
Author(s):  
Dmitri B. Strukov ◽  
Konstantin K. Likharev

We have calculated the maximum useful bit density that may be achieved by the synergy of bad-bit exclusion and advanced (BCH) error-correcting codes in prospective crossbar nanoelectronic memories, as a function of the defective memory cell fraction. While our calculations are based on a particular ("CMOL") memory topology, with naturally segmented nanowires and an area-distributed nano/CMOS interface, for realistic parameters our results are also applicable to "global" crossbar memories with peripheral interfaces. The results indicate that crossbar memories with a nano/CMOS pitch ratio close to 1/3 (typical for the current, initial stage of nanoelectronics development) may surpass purely semiconductor memories in useful bit density if the fraction of nanodevice defects (stuck-on faults) is below ∼15%, even under a rather tough 30 ns upper bound on the total access time. Moreover, as the technology matures and the pitch ratio approaches an order of magnitude, crossbar memories may be far superior to the densest semiconductor memories, providing, e.g., a 1 Tbit/cm² density even for a plausible defect fraction of 2%. These highly encouraging results are much better than those reported in the literature earlier, including our own early work, mostly due to the more advanced error-correcting codes.

