Improving Fine-Grained Irregular Shared-Memory Benchmarks by Data Reordering

Author(s):  
Y.C. Hu ◽  
A. Cox ◽  
W. Zwaenepoel
Author(s):  
Vladimir Vlassov ◽  
Oscar Sierra Merino ◽  
Csaba Andras Moritz ◽  
Konstantin Popov

2021 ◽  
Vol 5 (OOPSLA) ◽  
pp. 1-28
Author(s):  
Dan Iorga ◽  
Alastair F. Donaldson ◽  
Tyler Sorensen ◽  
John Wickerson

Heterogeneous CPU/FPGA devices, in which a CPU and an FPGA can execute together while sharing memory, are becoming popular in several computing sectors. In this paper, we study the shared-memory semantics of these devices, with a view to providing a firm foundation for reasoning about the programs that run on them. Our focus is on Intel platforms that combine an Intel FPGA with a multicore Xeon CPU. We describe the weak-memory behaviours that are allowed (and observable) on these devices when CPU threads and an FPGA thread access common memory locations in a fine-grained manner through multiple channels. Some of these behaviours are familiar from well-studied CPU and GPU concurrency; others are weaker still. We encode these behaviours in two formal memory models: one operational, one axiomatic. We develop executable implementations of both models, using the CBMC bounded model-checking tool for our operational model and the Alloy modelling language for our axiomatic model. Using these, we cross-check our models against each other via a translator that converts Alloy-generated executions into queries for the CBMC model. We also validate our models against actual hardware by translating 583 Alloy-generated executions into litmus tests that we run on CPU/FPGA devices; when doing this, we avoid the prohibitive cost of synthesising a hardware design per litmus test by creating our own 'litmus-test processor' in hardware. We expect that our models will be useful for low-level programmers, compiler writers, and designers of analysis tools. Indeed, as a demonstration of the utility of our work, we use our operational model to reason about a producer/consumer buffer implemented across the CPU and the FPGA. When the buffer uses insufficient synchronisation -- a situation that our model is able to detect -- we observe that its performance improves at the cost of occasional data corruption.
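The producer/consumer case study above hinges on the order in which data writes become visible relative to the index that publishes them. As a loose illustration of that handshake (not the paper's CPU/FPGA code; the class and function names below are invented for this sketch, and in Python the GIL supplies the ordering that real hardware would need fences or atomics for), a single-producer/single-consumer ring buffer might look like:

```python
import threading

class SPSCRingBuffer:
    """Single-producer/single-consumer ring buffer (illustrative sketch).

    The producer writes the slot *before* advancing `tail`; the consumer
    reads the slot before advancing `head`. On a weakly ordered device
    this write-then-publish order must be enforced explicitly.
    """

    def __init__(self, capacity):
        self.capacity = capacity + 1          # one slot kept empty to tell full from empty
        self.slots = [None] * self.capacity
        self.head = 0                         # next slot to consume
        self.tail = 0                         # next slot to fill

    def try_push(self, item):
        nxt = (self.tail + 1) % self.capacity
        if nxt == self.head:                  # buffer full
            return False
        self.slots[self.tail] = item          # write the data first...
        self.tail = nxt                       # ...then publish it
        return True

    def try_pop(self):
        if self.head == self.tail:            # buffer empty
            return None
        item = self.slots[self.head]
        self.head = (self.head + 1) % self.capacity
        return item

def run_demo(n=1000):
    """Move n integers through the buffer on two threads, in order."""
    buf = SPSCRingBuffer(capacity=8)
    received = []

    def producer():
        for i in range(n):
            while not buf.try_push(i):
                pass                          # spin until space is free

    def consumer():
        while len(received) < n:
            item = buf.try_pop()
            if item is not None:
                received.append(item)

    t1 = threading.Thread(target=producer)
    t2 = threading.Thread(target=consumer)
    t1.start(); t2.start()
    t1.join(); t2.join()
    return received
```

On a weakly ordered CPU/FPGA pairing, letting the hardware reorder the data write and the `tail` update is exactly the kind of insufficient synchronisation the abstract describes: faster, but with occasional corruption.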


2000 ◽  
Vol 10 (01) ◽  
pp. 111-132 ◽  
Author(s):  
VOON-YEE VEE ◽  
WEN-JING HSU

In the past decade, many synchronous algorithms have been proposed for parallel discrete-event simulations. However, the actual performance of these algorithms has been far from ideal, especially when event granularity is small. Barring the case of low parallelism in the given simulation models, one of the main reasons for low speedups is uneven load distribution among processors. To remedy this, both static and dynamic load balancing approaches have been proposed. Nevertheless, static schemes based on partitioning of logical processes (LPs) are often subject to the dynamic behavior of the specific simulation models and are therefore application dependent; dynamic load balancing schemes, on the other hand, often suffer from loss of locality and hence cache misses, which can severely penalize fine-grained event processing. In this paper, we present several new locality-preserving load balancing mechanisms for synchronous simulations on shared-memory multiprocessors. We focus on the type of synchronous simulations where the number of LPs to be processed within a cycle decreases monotonically. We show both theoretically and empirically that some of these mechanisms incur very low overhead. The mechanisms have been implemented using MIT's Cilk and tested with a number of simulation applications. The results confirm that one of the new mechanisms is indeed more efficient and scalable than common existing approaches.
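As a rough sketch of the locality-preserving idea (based only on the abstract; `contiguous_partition` is an illustrative helper, not the paper's mechanism): handing each worker a contiguous block of the surviving LPs means LPs that were neighbours in one cycle tend to stay on the same processor in the next, unlike round-robin redistribution.

```python
def contiguous_partition(active_lps, num_workers):
    """Split the active LP list into contiguous, near-equal chunks.

    Contiguity keeps each worker's LPs adjacent in memory, which helps
    preserve cache locality across cycles as the active set shrinks
    monotonically. Chunk sizes differ by at most one.
    """
    base, extra = divmod(len(active_lps), num_workers)
    chunks, start = [], 0
    for w in range(num_workers):
        size = base + (1 if w < extra else 0)  # spread the remainder
        chunks.append(active_lps[start:start + size])
        start += size
    return chunks
```

Re-running the partition each cycle on the (smaller) surviving list rebalances the load while keeping each worker's share contiguous.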


Author(s):  
Thomas Pani ◽  
Georg Weissenbacher ◽  
Florian Zuleger

We present a thread-modular proof method for complexity and resource bound analysis of concurrent, shared-memory programs. To this end, we lift Jones’ rely-guarantee reasoning to assumptions and commitments capable of expressing bounds. The compositionality (thread-modularity) of this framework allows us to reason about parameterized programs, i.e., programs that execute arbitrarily many concurrent threads. We automate reasoning in our logic by reducing bound analysis of concurrent programs to the sequential case. As an application, we automatically infer time complexity for a family of fine-grained concurrent algorithms, lock-free data structures, to our knowledge for the first time.


Author(s):  
Richard S. Chemock

One of the most common tasks in a typical analysis lab is the recording of images. Many analytical techniques (TEM, SEM, and metallography, for example) produce images as their primary output. Until recently, the most common method of recording images was on film. Current PS/2 systems offer very large capacity data storage devices and high resolution displays, making it practical to work with analytical images on PS/2s, thereby sidestepping the traditional film and darkroom steps. This change in operational mode offers many benefits: cost savings, throughput, and archiving and searching capabilities, as well as direct incorporation of the image data into reports.

The conventional way to record images involves film, either sheet film (with its associated wet chemistry) for TEM or Polaroid film for SEM and light microscopy. Although film is inconvenient, it has the highest quality of all available image recording techniques. The fine-grained film used for TEM has a resolution that would exceed a 4096x4096x16-bit digital image.


Author(s):  
Steven D. Toteda

Zirconia oxygen sensors, in such applications as power plants and automobiles, generally utilize platinum electrodes for the catalytic reaction of dissociating O2 at the surface. The microstructure of the platinum electrode defines the resulting electrical response. The electrode must be porous enough to allow the oxygen to reach the zirconia surface while still remaining electrically continuous. At low sintering temperatures, the platinum is highly porous and fine grained. The platinum particles sinter together as the firing temperatures are increased. As the sintering temperatures are raised even further, the surface of the platinum begins to facet with lower-energy surfaces. These microstructural changes can be seen in Figures 1 and 2, but the goal of the work is to characterize the microstructure by its fractal dimension and then relate the fractal dimension to the electrical response. The sensors were fabricated from zirconia powder stabilized in the cubic phase with 8 mol% yttria. Each substrate was sintered for 14 hours at 1200°C. The resulting zirconia pellets, 13 mm in diameter and 2 mm in thickness, were roughly 97 to 98 percent of theoretical density. The Engelhard #6082 platinum paste was applied to the zirconia disks after they were mechanically polished (diamond). The electrodes were then sintered at temperatures ranging from 600°C to 1000°C. Each sensor was tested to determine the impedance response from 1 Hz to 5,000 Hz. These frequencies correspond to the electrode response at the test temperature of 600°C.
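Fractal dimensions of micrographs like these are commonly estimated by box counting on a binarised image. As a hedged sketch (the abstract does not state the method used; `fractal_dimension` and its box sizes are assumptions for illustration), the dimension is the slope of log N(s) against log(1/s), where N(s) is the number of boxes of side s that contain any filled pixel:

```python
import math

def box_count(grid, box):
    """Count boxes of side `box` containing at least one filled cell
    of a binary 2D grid (list of lists of 0/1)."""
    rows, cols = len(grid), len(grid[0])
    count = 0
    for r in range(0, rows, box):
        for c in range(0, cols, box):
            if any(grid[rr][cc]
                   for rr in range(r, min(r + box, rows))
                   for cc in range(c, min(c + box, cols))):
                count += 1
    return count

def fractal_dimension(grid, sizes=(1, 2, 4, 8)):
    """Estimate the box-counting dimension as the slope of
    log N(s) versus log(1/s), via ordinary least squares."""
    xs = [math.log(1.0 / s) for s in sizes]
    ys = [math.log(box_count(grid, s)) for s in sizes]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope
```

A fully filled region gives an estimate of 2 and an isolated point gives 0; a porous, fine-grained electrode surface would fall between the two.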

