Fast and Scalable Pattern Matching for Memory Architecture

International Journal of Computer and Communication Technology ◽

10.47893/ijcct.2016.1344 ◽

2016 ◽

pp. 84-89

Author(s):

J. Santhi ◽

L. Srinivas

Keyword(s):

Pattern Matching ◽

Large Fraction ◽

Bloom Filters ◽

Memory Architecture ◽

Embedded Memory ◽

Content Filtering ◽

Pattern Length ◽

Memory Accesses ◽

On Chip ◽

Specialized Hardware

Multi-pattern matching is known to require intensive memory accesses and is often a performance bottleneck. Hence, specialized hardware-accelerated algorithms are being developed for line-speed packet processing. While several pattern matching algorithms have already been developed for such applications, we find that most of them suffer from scalability issues. We present a hardware-implementable pattern matching algorithm for content filtering applications, which is scalable in terms of speed, the number of patterns, and the pattern length. We modify the classic Aho-Corasick algorithm to consider multiple characters at a time for higher throughput. Furthermore, we suppress a large fraction of memory accesses by using Bloom filters implemented with a small amount of on-chip memory. The resulting algorithm can support matching of several thousand patterns at more than 10 Gbps with the help of less than 50 KBytes of embedded memory and a few megabytes of external SRAM.
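
The idea of suppressing memory accesses with a Bloom filter can be illustrated with a minimal software sketch. It is not the authors' hardware design: the class names, the filter size, the number of hash functions, and the use of BLAKE2b as the hash are all assumptions chosen for illustration. A small in-memory bit array stands in for the on-chip filter, and a Python set stands in for the slower external pattern table.

```python
import hashlib


class BloomFilter:
    """Small bit-array filter; stands in for the on-chip memory (sizes are assumptions)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item: bytes):
        # Derive num_hashes bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(item, digest_size=8, salt=bytes([seed])).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: bytes) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


class FilteredPatternTable:
    """Hypothetical wrapper: consult the filter before the expensive lookup."""

    def __init__(self, patterns):
        self.table = set(patterns)  # stands in for the external memory table
        self.filter = BloomFilter()
        self.slow_lookups = 0       # counts accesses that reach the table
        for pattern in patterns:
            self.filter.add(pattern)

    def contains(self, candidate: bytes) -> bool:
        if not self.filter.might_contain(candidate):
            return False            # access suppressed by the filter
        self.slow_lookups += 1      # possible match: verify in the table
        return candidate in self.table
```

A Bloom filter never reports a stored pattern as absent, so the slow table is consulted only for true matches and for the occasional false positive.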

Adaptive Precision Block-Jacobi for High Performance Preconditioning in the Ginkgo Linear Algebra Software

ACM Transactions on Mathematical Software ◽

10.1145/3441850 ◽

2021 ◽

Vol 47 (2) ◽

pp. 1-28

Author(s):

Goran Flegar ◽

Hartwig Anzt ◽

Terry Cojean ◽

Enrique S. Quintana-Ortí

Keyword(s):

Linear Algebra ◽

Graphics Processing Units ◽

High Performance ◽

Numerical Algorithms ◽

Mixed Precision ◽

Before And After ◽

Memory Accesses ◽

Specialized Hardware ◽

The Individual ◽

Graphics Processing

The use of mixed precision in numerical algorithms is a promising strategy for accelerating scientific applications. In particular, the adoption of specialized hardware and data formats for low-precision arithmetic in high-end GPUs (graphics processing units) has motivated numerous efforts aiming at carefully reducing the working precision in order to speed up the computations. For algorithms whose performance is bound by the memory bandwidth, the idea of compressing their data before (and after) memory accesses has received considerable attention. One idea is to store an approximate operator, such as a preconditioner, in lower than working precision, ideally without impacting the algorithm output. We realize the first high-performance implementation of an adaptive precision block-Jacobi preconditioner which selects the precision format used to store the preconditioner data on-the-fly, taking into account the numerical properties of the individual preconditioner blocks. We implement the adaptive block-Jacobi preconditioner as production-ready functionality in the Ginkgo linear algebra library, considering not only the precision formats that are part of the IEEE standard, but also customized formats which optimize the length of the exponent and significand to the characteristics of the preconditioner blocks. Experiments run on a state-of-the-art GPU accelerator show that our implementation offers attractive runtime savings.
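
A rough NumPy sketch of per-block precision selection follows. It is not Ginkgo's API or the authors' selection rule: the function names, the tolerance value, and the criterion (condition number times unit roundoff) are assumptions, and the sketch ignores overflow and underflow in the reduced formats.

```python
import numpy as np


def choose_storage_dtype(block, tolerance=1e-2):
    """Pick the narrowest IEEE format whose roundoff, scaled by the block's
    condition number, stays below `tolerance` (an assumed criterion)."""
    cond = np.linalg.cond(block)
    for dtype in (np.float16, np.float32):
        if cond * np.finfo(dtype).eps <= tolerance:
            return dtype
    return np.float64


def build_block_jacobi(matrix, block_size):
    """Invert each diagonal block and store it in its selected precision."""
    blocks = []
    for start in range(0, matrix.shape[0], block_size):
        end = min(start + block_size, matrix.shape[0])
        diag_block = matrix[start:end, start:end]
        inverse = np.linalg.inv(diag_block)
        blocks.append(inverse.astype(choose_storage_dtype(diag_block)))
    return blocks


def apply_block_jacobi(blocks, vector):
    """Apply the preconditioner; arithmetic is done in float64 even though
    each block is stored in a possibly narrower format."""
    out = np.empty_like(vector, dtype=np.float64)
    start = 0
    for inverse in blocks:
        end = start + inverse.shape[0]
        out[start:end] = inverse.astype(np.float64) @ vector[start:end]
        start = end
    return out
```

Well-conditioned blocks end up in half precision and ill-conditioned ones stay in double, so only the storage format varies while the computation itself remains in working precision.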

On-Device Deep Learning Inference for System-on-Chip (SoC) Architectures

Electronics ◽

10.3390/electronics10060689 ◽

2021 ◽

Vol 10 (6) ◽

pp. 689

Author(s):

Tom Springer ◽

Elia Eiroa-Lledo ◽

Elizabeth Stevens ◽

Erik Linstead

Keyword(s):

Machine Learning ◽

Deep Learning ◽

Real Time ◽

Operating Systems ◽

System On Chip ◽

Low Latency ◽

Management Framework ◽

On Chip ◽

Specialized Hardware ◽

Deterministic Behavior

As machine learning becomes ubiquitous, the need to deploy models on real-time, embedded systems will become increasingly critical. This is especially true for deep learning solutions, whose large models pose interesting challenges for target architectures at the “edge” that are resource-constrained. The realization of machine learning, and deep learning, is being driven by the availability of specialized hardware, such as system-on-chip solutions, which provide some alleviation of constraints. Equally important, however, are the operating systems that run on this hardware, and specifically the ability to leverage commercial real-time operating systems which, unlike general purpose operating systems such as Linux, can provide the low-latency, deterministic execution required for embedded, and potentially safety-critical, applications at the edge. Despite this, studies considering the integration of real-time operating systems, specialized hardware, and machine learning/deep learning algorithms remain limited. In particular, better mechanisms for real-time scheduling in the context of machine learning applications will prove to be critical as these technologies move to the edge. In order to address some of these challenges, we present a resource management framework designed to provide a dynamic on-device approach to the allocation and scheduling of limited resources in a real-time processing environment. These types of mechanisms are necessary to support the deterministic behavior required by the control components contained in the edge nodes. To validate the effectiveness of our approach, we applied rigorous schedulability analysis to a large set of randomly generated simulated task sets and then verified that the most time-critical applications, such as the control tasks, maintained low-latency deterministic behavior even during off-nominal conditions. The practicality of our scheduling framework was demonstrated by integrating it into a commercial real-time operating system (VxWorks) and then running a typical deep learning image processing application to perform simple object detection. The results indicate that our proposed resource management framework can be leveraged to facilitate integration of machine learning algorithms with real-time operating systems and embedded platforms, including widely used, industry-standard real-time operating systems.

Request, Coalesce, Serve, and Forget: Miss-Optimized Memory Systems for Bandwidth-Bound Cache-Unfriendly Applications on FPGAs

ACM Transactions on Reconfigurable Technology and Systems ◽

10.1145/3466823 ◽

2022 ◽

Vol 15 (2) ◽

pp. 1-33

Author(s):

Mikhail Asiatici ◽

Paolo Ienne

Keyword(s):

Large Scale ◽

Sparse Matrix ◽

Memory Systems ◽

Graph Analytics ◽

Matrix Vector Multiplication ◽

Area Reduction ◽

Cache Line ◽

Speed Up ◽

Memory Accesses ◽

On Chip

Applications such as large-scale sparse linear algebra and graph analytics are challenging to accelerate on FPGAs due to their short, irregular memory accesses, which result in low cache hit rates. Nonblocking caches reduce the bandwidth required by misses by requesting each cache line only once, even when there are multiple misses corresponding to it. However, such a reuse mechanism is traditionally implemented using an associative lookup. This limits the number of misses that are considered for reuse to a few tens, at most. In this article, we present an efficient pipeline that can process and store thousands of outstanding misses in cuckoo hash tables in on-chip SRAM with minimal stalls. This brings the same bandwidth advantage as a larger cache for a fraction of the area budget, because outstanding misses do not need a data array, which can significantly speed up irregular memory-bound latency-insensitive applications. In addition, we extend nonblocking caches to generate variable-length bursts to memory, which increases the bandwidth delivered by DRAMs and their controllers. The resulting miss-optimized memory system provides up to 25% speedup with 24× area reduction on 15 large sparse matrix-vector multiplication benchmarks evaluated on an embedded and a datacenter FPGA system.
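
The miss-coalescing behavior described in this abstract (each cache line is requested once, however many misses map to it) can be sketched in software as follows. This is only a behavioral illustration: the class and method names are hypothetical, and a Python dict stands in for the cuckoo hash tables used in the actual FPGA pipeline.

```python
class MissCoalescer:
    """Tracks outstanding misses per cache line so each line is fetched once."""

    def __init__(self, line_size=64):
        self.line_size = line_size
        self.pending = {}          # line index -> list of (request_id, byte offset)
        self.memory_requests = 0   # line fetches actually sent to memory

    def request(self, request_id, address):
        """Record a miss; issue a memory request only for the first miss on a line."""
        line, offset = divmod(address, self.line_size)
        if line not in self.pending:
            self.pending[line] = []
            self.memory_requests += 1
        self.pending[line].append((request_id, offset))

    def fill(self, line, data):
        """Serve every waiting request from the returned line, then forget it."""
        waiters = self.pending.pop(line, [])
        return [(request_id, data[offset]) for request_id, offset in waiters]
```

For example, misses on addresses 0 and 8 share line 0 and trigger a single fetch, while a miss on address 64 belongs to line 1 and triggers a second one. Note that only request metadata is stored per miss, not the line data, which is the reason the abstract gives for the area saving.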

Zero capacitor embedded memory technology for system on chip

2005 IEEE International Workshop on Memory Technology, Design, and Testing (MTDT'05) ◽

10.1109/mtdt.2005.4655409 ◽

2005 ◽

Cited By ~ 11

Author(s):

S. Okhonin ◽

P. Fazan ◽

M.-E. Jones

Keyword(s):

System On Chip ◽

Embedded Memory ◽

On Chip

Embedded Memory Architecture for Low-Power Application Processor

Integrated Circuits and Systems - Embedded Memories for Nano-Scale VLSIs ◽

10.1007/978-0-387-88497-4_2 ◽

2009 ◽

pp. 7-38 ◽

Cited By ~ 1

Author(s):

Hoi Jun Yoo ◽

Donghyun Kim

Keyword(s):

Low Power ◽

Memory Architecture ◽

Embedded Memory ◽

Application Processor ◽

Power Application

Prototype implementation and evaluation of a multibank embedded memory architecture in programmable logic

2003 IEEE Pacific Rim Conference on Communications Computers and Signal Processing (PACRIM 2003) (Cat. No.03CH37490) ◽

10.1109/pacrim.2003.1235707 ◽

2004 ◽

Author(s):

Huang Jin ◽

N. Manjikian

Keyword(s):

Programmable Logic ◽

Memory Architecture ◽

Embedded Memory ◽

Prototype Implementation

Three-dimensional image processing VLSI system with network-on-chip system and reconfigurable memory architecture

IEEE Transactions on Consumer Electronics ◽

10.1109/tce.2011.6018893 ◽

2011 ◽

Vol 57 (3) ◽

pp. 1345-1353 ◽

Cited By ~ 2

Author(s):

Yun Yang

Keyword(s):

Image Processing ◽

Three Dimensional ◽

Network On Chip ◽

Memory Architecture ◽

Dimensional Image ◽

Vlsi System ◽

Reconfigurable Memory ◽

On Chip

Exact pattern matching with feed-forward bloom filters

Journal of Experimental Algorithmics ◽

10.1145/2133803.2330085 ◽

2012 ◽

Vol 17 ◽

Cited By ~ 8

Author(s):

Iulian Moraru ◽

David G. Andersen

Keyword(s):

Pattern Matching ◽

Bloom Filters ◽

Feed Forward

Designing Coalescing Network-on-Chip for Efficient Memory Accesses of GPGPUs

Advanced Information Systems Engineering - Lecture Notes in Computer Science ◽

10.1007/978-3-662-44917-2_15 ◽

2014 ◽

pp. 169-180 ◽

Cited By ~ 2

Author(s):

Chien-Ting Chen ◽

Yoshi Shih-Chieh Huang ◽

Yuan-Ying Chang ◽

Chiao-Yun Tu ◽

Chung-Ta King ◽

...

Keyword(s):

Network On Chip ◽

Memory Accesses ◽

On Chip ◽

Efficient Memory

Striping input feature map cache for reducing off-chip memory traffic in CNN accelerators

Telfor Journal ◽

10.5937/telfor2002116s ◽

2020 ◽

Vol 12 (2) ◽

pp. 116-121

Author(s):

Rastislav Struharik ◽

Vuk Vranjković

Keyword(s):

Power Consumption ◽

Data Reuse ◽

Memory Architecture ◽

Cache Memories ◽

Data Movement ◽

Feature Map ◽

Input Feature ◽

On Chip ◽

Memory Resources ◽

Embedded Applications

Data movement between Convolutional Neural Network (CNN) accelerators and off-chip memory is critical to overall power consumption. Minimizing power consumption is particularly important for low-power embedded applications. Specific CNN compute patterns offer a possibility of significant data reuse, leading to the idea of using specialized on-chip cache memories which enable a significant improvement in power consumption. However, due to the unique caching pattern present within CNNs, standard cache memories would not be efficient. In this paper, a novel on-chip cache memory architecture, based on the idea of input feature map striping, is proposed, which requires significantly less on-chip memory resources compared to previously proposed solutions. Experimental results show that the proposed cache architecture can reduce on-chip memory size by a factor of 16 or more, while increasing power consumption by no more than 15%, compared to some of the previously proposed solutions.
