An Impulse-C Hardware Accelerator for Packet Classification Based on Fine/Coarse Grain Optimization

International Journal of Reconfigurable Computing ◽

10.1155/2013/130765 ◽

2013 ◽

Vol 2013 ◽

pp. 1-23 ◽

Cited By ~ 1

Author(s):

O. Ahmed ◽

S. Areibi ◽

R. Collier ◽

G. Grewal

Keyword(s):

Poor Performance ◽

Electronic System ◽

General Purpose ◽

Packet Classification ◽

Optimization Techniques ◽

System Level ◽

Coarse Grain ◽

Hardware Accelerator ◽

General Purpose Processor ◽

Incremental Update

Current software-based packet classification algorithms exhibit relatively poor performance, prompting many researchers to concentrate on novel frameworks and architectures that employ both hardware and software components. The Packet Classification with Incremental Update (PCIU) algorithm, Ahmed et al. (2010), is a novel and efficient packet classification algorithm with a unique incremental update capability that demonstrated excellent results and was shown to be scalable for many different tasks and clients. While a pure software implementation can generate powerful results on a server machine, an embedded solution may be more desirable for some applications and clients. Embedded, specialized hardware accelerator based solutions are typically much more efficient in speed, cost, and size than solutions that are implemented on general-purpose processor systems. This paper seeks to explore the design space of translating the PCIU algorithm into hardware by utilizing several optimization techniques, ranging from fine grain to coarse grain and parallel coarse grain approaches. The paper presents a detailed implementation of a hardware accelerator of the PCIU based on an Electronic System Level (ESL) approach. Results obtained indicate that the hardware accelerator achieves on average 27x speedup over a state-of-the-art Xeon processor.

Optimizing many-field packet classification on FPGA, multi-core general purpose processor, and GPU

2015 ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS) ◽

10.1109/ancs.2015.7110123 ◽

2015 ◽

Cited By ~ 11

Author(s):

Yun R. Qu ◽

Hao H. Zhang ◽

Shijie Zhou ◽

Viktor K. Prasanna

Keyword(s):

General Purpose ◽

Packet Classification ◽

General Purpose Processor

PCIU: Hardware Implementations of an Efficient Packet Classification Algorithm with an Incremental Update Capability

International Journal of Reconfigurable Computing ◽

10.1155/2011/648483 ◽

2011 ◽

Vol 2011 ◽

pp. 1-21 ◽

Cited By ~ 5

Author(s):

O. Ahmed ◽

S. Areibi ◽

K. Chattha ◽

B. Kelly

Keyword(s):

State Of The Art ◽

Packet Classification ◽

Classification Algorithm ◽

Software Implementation ◽

Network Services ◽

Hardware Accelerator ◽

Memory Consumption ◽

Hardware Implementations ◽

Speed Up ◽

Incremental Update

Packet classification plays a crucial role in a number of network services such as policy-based routing, firewalls, and traffic billing, to name a few. However, classification can become a bottleneck in these applications if it is not implemented properly and efficiently. In this paper, we propose PCIU, a novel classification algorithm, which improves upon previously published work. PCIU provides lower preprocessing time, lower memory consumption, ease of incremental rule update, and reasonable classification time compared to state-of-the-art algorithms. The proposed algorithm was evaluated and compared to RFC and HiCut using several benchmarks. Results obtained indicate that PCIU outperforms these algorithms in terms of speed, memory usage, incremental update capability, and preprocessing time. The algorithm, furthermore, was improved and made more accessible for a variety of applications through implementation in hardware. Two such implementations are detailed and discussed in this paper. The results indicate that a hardware/software codesign approach yields a PCIU solution that is slower, but easier to optimize and improve within time constraints. A hardware accelerator based on an ESL approach using Handel-C, on the other hand, resulted in a 31x speed-up over a pure software implementation running on a state-of-the-art Xeon processor.
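For readers unfamiliar with the problem the abstract describes, the following is a minimal sketch of packet classification as a first-match linear search over a prioritized rule list. It is not the PCIU algorithm (whose data structures are not described here), nor RFC or HiCut; the names `Rule`, `Packet`, `classify`, and `EXAMPLE_RULES`, and the five-field range layout, are hypothetical choices made for illustration only.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Range = Tuple[int, int]  # inclusive (low, high) bounds for one header field


@dataclass(frozen=True)
class Rule:
    # Hypothetical rule layout: one inclusive range per header field.
    src_ip: Range
    dst_ip: Range
    src_port: Range
    dst_port: Range
    protocol: Range
    action: str


@dataclass(frozen=True)
class Packet:
    src_ip: int
    dst_ip: int
    src_port: int
    dst_port: int
    protocol: int


def matches(rule: Rule, pkt: Packet) -> bool:
    """True when every header field of the packet lies inside the rule's range."""
    fields = (
        (rule.src_ip, pkt.src_ip),
        (rule.dst_ip, pkt.dst_ip),
        (rule.src_port, pkt.src_port),
        (rule.dst_port, pkt.dst_port),
        (rule.protocol, pkt.protocol),
    )
    return all(low <= value <= high for (low, high), value in fields)


def classify(rules: List[Rule], pkt: Packet) -> Optional[str]:
    """Return the action of the first matching rule (list order = priority)."""
    for rule in rules:
        if matches(rule, pkt):
            return rule.action
    return None  # no rule matched


# Wildcards for 32-bit addresses, 16-bit ports, and the 8-bit protocol field.
ANY32: Range = (0, 2**32 - 1)
ANY16: Range = (0, 2**16 - 1)
ANY8: Range = (0, 2**8 - 1)

# Illustrative rule set: allow TCP (protocol 6) to port 80, deny everything else.
EXAMPLE_RULES = [
    Rule(ANY32, ANY32, ANY16, (80, 80), (6, 6), "allow"),
    Rule(ANY32, ANY32, ANY16, ANY16, ANY8, "deny"),
]
```

A linear scan like this costs time proportional to the number of rules per packet, which is the kind of bottleneck that motivates specialized algorithms and hardware accelerators. Adding a rule here is a simple list insertion; more elaborate schemes generally trade that simplicity for faster lookups.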

An Efficient FPGA-Based Hardware Accelerator for Convex Optimization-Based SVM Classifier for Machine Learning on Embedded Platforms

Electronics ◽

10.3390/electronics10111323 ◽

2021 ◽

Vol 10 (11) ◽

pp. 1323

Author(s):

Srikanth Ramadurgam ◽

Darshika G. Perera

Keyword(s):

Machine Learning ◽

Convex Optimization ◽

Learning Algorithms ◽

Optimization Techniques ◽

Machine Learning Algorithms ◽

System Level ◽

Svm Classifier ◽

Hardware Accelerator ◽

Machine Learning Applications ◽

Embedded Platforms

Machine learning is becoming a cornerstone of smart and autonomous systems. Machine learning algorithms can be categorized into supervised learning (classification) and unsupervised learning (clustering). Among many classification algorithms, the Support Vector Machine (SVM) classifier is one of the most commonly used machine learning algorithms. By incorporating convex optimization techniques into the SVM classifier, we can further enhance the accuracy and classification process of the SVM by finding the optimal solution. Many machine learning algorithms, including SVM classification, are compute-intensive and data-intensive, requiring significant processing power. Furthermore, many machine learning algorithms have found their way into portable and embedded devices, which have stringent requirements. In this research work, we introduce a novel, unique, and efficient Field Programmable Gate Array (FPGA)-based hardware accelerator for a convex optimization-based SVM classifier for embedded platforms, considering the constraints associated with these platforms and the requirements of the applications running on these devices. We incorporate suitable mathematical kernels and decomposition methods to systematically solve the convex optimization for machine learning applications with a large volume of data. Our proposed architectures are generic, parameterized, and scalable; hence, without changing internal architectures, our designs can be used to process different datasets with varying sizes, can be executed on different platforms, and can be utilized for various machine learning applications. We also introduce system-level architectures and techniques to facilitate real-time processing. Experiments are performed using two different benchmark datasets to evaluate the feasibility and efficiency of our hardware architecture, in terms of timing, speedup, area, and accuracy. Our embedded hardware design achieves up to 79 times speedup compared to its embedded software counterpart, and can also achieve up to 100% classification accuracy.

INSTRUCTION-SET EXTENSION FOR CRYPTOGRAPHIC APPLICATIONS ON RECONFIGURABLE PLATFORM

Journal of Circuits System and Computers ◽

10.1142/s0218126607004076 ◽

2007 ◽

Vol 16 (06) ◽

pp. 911-927

Author(s):

S. MAJZOUB ◽

H. DIAB

Keyword(s):

Reconfigurable Computing ◽

General Purpose ◽

Coarse Grain ◽

Instruction Set ◽

General Purpose Processor ◽

Instruction Set Extension ◽

Custom Hardware ◽

Reconfigurable Platform ◽

And Performance ◽

Bitwise Operations

Reconfigurable systems represent a middle ground between speed and flexibility in the processor design world. They provide performance close to that of custom hardware, yet preserve some of the flexibility of a general-purpose processor. Recently, the area of reconfigurable computing has received considerable interest in both its forms: the FPGA and coarse-grain hardware. Since the field is still in its developing stage, it is important to perform hardware analysis and evaluation of certain key applications on target reconfigurable architectures to identify potential limitations and improvements. This paper presents the mapping and performance analysis of two encryption algorithms, namely Rijndael and Twofish, on a coarse-grain reconfigurable platform, namely MorphoSys. MorphoSys is a reconfigurable architecture targeted at multimedia applications. Since many cryptographic algorithms involve bitwise operations, a bitwise instruction-set extension was proposed to enhance performance. We present the details of the mapping of the bitwise operations involved in the algorithms, with thorough analysis. The methodology we used can be applied to other systems.

Corrigendum to “An Impulse-C Hardware Accelerator for Packet Classification Based on Fine/Coarse Grain Optimization”

International Journal of Reconfigurable Computing ◽

10.1155/2018/6075043 ◽

2018 ◽

Vol 2018 ◽

pp. 1-1

Author(s):

O. Ahmed ◽

S. Areibi ◽

R. Collier ◽

G. Grewal

Keyword(s):

Packet Classification ◽

Coarse Grain ◽

Hardware Accelerator

Design and Implementation of Low Energy Wireless Network Nodes based on Hardware Compression Acceleration

Recent Patents on Computer Science ◽

10.2174/2213275912666190715164024 ◽

2019 ◽

Vol 12 ◽

Author(s):

Hui Yang ◽

Anand Nayyar

Keyword(s):

Energy Consumption ◽

Data Compression ◽

Energy Saving ◽

Optimization Design ◽

Hardware Acceleration ◽

Transmission Efficiency ◽

General Purpose ◽

Storage Space ◽

General Purpose Processor ◽

Compression Time

With the rapid development of information technology, the volume of data is growing geometrically, placing higher demands on transmission speed and storage space. To reduce storage use and further improve transmission efficiency, data need to be compressed. In the process of data compression, it is very important to preserve the data without loss, which motivates lossless data compression algorithms. Gradual optimization of the algorithm can often achieve energy savings in data compression. Similarly, energy savings can also be obtained by improving the hardware structure of the node. In this paper, a new structure is designed for the sensor node, which adopts hardware acceleration, and the data compression module is separated from the node microprocessor. On the basis of the ASIC design of the algorithm, by introducing hardware acceleration, the energy consumption of data compression was successfully reduced; relative to the general-purpose processor, the savings in energy consumption and compression time were as high as 98.4% and 95.8%, respectively. The design thus greatly reduces compression time and energy consumption.

SYSTEMC IMPLEMENTATION AND PERFORMANCE EVALUATION OF A DECOUPLED GENERAL-PURPOSE MATRIX PROCESSOR

Parallel Processing Letters ◽

10.1142/s0129626410000090 ◽

2010 ◽

Vol 20 (02) ◽

pp. 103-121 ◽

Cited By ~ 1

Author(s):

MOSTAFA I. SOLIMAN ◽

ABDULMAJID F. Al-JUNAID

Keyword(s):

Performance Evaluation ◽

Matrix Multiplication ◽

General Purpose ◽

System Level ◽

Memory Latency ◽

Single Chip ◽

Wide Range ◽

Matrix Unit ◽

And Performance ◽

Vector Matrix

Technological advances in IC manufacturing provide us with the capability to integrate more and more functionality into a single chip. Today's modern processors have nearly one billion transistors on a single chip. With the increasing complexity of today's systems, designs have to be modeled at a high level of abstraction before partitioning into hardware and software components for final implementation. This paper explains in detail the implementation and performance evaluation of a matrix processor called Mat-Core with SystemC (a system-level modeling language). Mat-Core is a research processor that aims to exploit the increasing number of transistors per IC to improve the performance of a wide range of applications. It extends a general-purpose scalar processor with a matrix unit. To hide memory latency, the extended matrix unit is decoupled into two components: address generation and data computation, which communicate through data queues. Like vector architectures, the data computation unit is organized in parallel lanes. However, on parallel lanes, Mat-Core can execute matrix-scalar, matrix-vector, and matrix-matrix instructions in addition to vector-scalar and vector-vector instructions. For controlling the execution of vector/matrix instructions on the matrix core, this paper extends the well-known scoreboard technique. Furthermore, the performance of Mat-Core is evaluated on vector and matrix kernels. Our results show that the performance of a four-lane Mat-Core with matrix registers of size 4 × 4 (16 elements each), a queue size of 10, a start-up time of 6 clock cycles, and a memory latency of 10 clock cycles is about 0.94, 1.3, 2.3, 1.6, 2.3, and 5.5 FLOPs per clock cycle, achieved on scalar-vector multiplication, SAXPY, Givens, rank-1 update, vector-matrix multiplication, and matrix-matrix multiplication, respectively.
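To make the FLOPs-per-cycle figures above more concrete, here is a minimal sketch of the SAXPY kernel (y ← a·x + y) with a rough cycle estimate. The function names `saxpy`, `saxpy_flops`, and `estimated_cycles` are hypothetical, and treating the reported 1.3 FLOPs per clock cycle as a constant sustained rate for any vector length is an assumption made only for illustration; it is not a model of Mat-Core itself.

```python
from typing import List


def saxpy(a: float, x: List[float], y: List[float]) -> List[float]:
    """Compute a * x + y element-wise (the SAXPY kernel)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return [a * xi + yi for xi, yi in zip(x, y)]


def saxpy_flops(n: int) -> int:
    """SAXPY on n elements performs one multiply and one add per element."""
    return 2 * n


def estimated_cycles(flops: int, flops_per_cycle: float) -> float:
    """Rough cycle count, assuming a constant sustained FLOP rate (an assumption)."""
    if flops_per_cycle <= 0:
        raise ValueError("flops_per_cycle must be positive")
    return flops / flops_per_cycle


# Example: a 1000-element SAXPY at the reported rate of about 1.3 FLOPs per cycle.
cycles_at_reported_rate = estimated_cycles(saxpy_flops(1000), 1.3)
```

Under that assumption, a 1000-element SAXPY (2000 floating-point operations) would take on the order of 1500 clock cycles. Actual figures depend on vector length, start-up time, and memory latency, which this estimate does not model separately.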

A Modular and Distributed Setup for Power and Performance Analysis of Multi-Processor System-on-Chip at Electronic System Level

2020 IEEE 39th International Performance Computing and Communications Conference (IPCCC) ◽

10.1109/ipccc50635.2020.9391516 ◽

2020 ◽

Author(s):

Muhammad Mudussir Ayub ◽

Franz Kreupl

Keyword(s):

Performance Analysis ◽

Electronic System ◽

System On Chip ◽

System Level ◽

Electronic System Level ◽

And Performance ◽

On Chip

SoC-FPGA systems for the acquisition and processing of electroencephalographic signals

International Journal of Reconfigurable and Embedded Systems (IJRES) ◽

10.11591/ijres.v10.i3.pp237-248 ◽

2021 ◽

Vol 10 (3) ◽

pp. 237

Author(s):

Matias Javier Oliva ◽

Pablo Andrés García ◽

Enrique Mario Spinelli ◽

Alejandro Luis Veiga

Keyword(s):

Embedded System ◽

Real Time ◽

General Purpose ◽

System Response ◽

Single Chip ◽

Real Time Processing ◽

General Purpose Processor ◽

Time Operation ◽

Electroencephalographic Signals ◽

High Level

Real-time acquisition and processing of electroencephalographic signals have promising applications in the implementation of brain-computer interfaces. These devices allow the user to control a device without performing motor actions, and are usually made up of a biopotential acquisition stage and a personal computer (PC). This structure is very flexible and appropriate for research, but for final users it is necessary to migrate to an embedded system, eliminating the PC from the scheme. The strict real-time processing requirements of such systems justify the choice of a system-on-chip field-programmable gate array (SoC-FPGA) for its implementation. This article proposes a platform for the acquisition and processing of electroencephalographic signals using this type of device, which combines the parallelism and speed capabilities of an FPGA with the simplicity of a general-purpose processor on a single chip. In this scheme, the FPGA is in charge of the real-time operation, acquiring and processing the signals, while the processor solves the high-level tasks, with the interconnection between processing elements solved by buses integrated into the chip. The proposed scheme was used to implement a brain-computer interface based on steady-state visual evoked potentials, which was used to command a speller. The first tests of the system show that a selection time of 5 seconds per command can be achieved. The time delay between the user’s selection and the system response has been estimated at 343 µs.

An Approach to the Construction of a Network Processing Unit

Modeling and Analysis of Information Systems ◽

10.18255/1818-1015-2019-1-39-62 ◽

2019 ◽

Vol 26 (1) ◽

pp. 39-62

Author(s):

Stanislav O. Bezzubtsev ◽

Vyacheslav V. Vasin ◽

Dmitry Yu. Volkanov ◽

Shynar R. Zhailauova ◽

Vladislav A. Miroshnik ◽

...

Keyword(s):

Simulation Model ◽

General Purpose ◽

Network Processor ◽

Processing Unit ◽

Use Case ◽

General Purpose Processor ◽

Software Products ◽

Processor Architectures ◽

Advantages And Disadvantages ◽

Processor Unit

The paper proposes the architecture and basic requirements for a network processor for OpenFlow switches of software-defined networks. An analysis of the architectures of two well-known network processors is presented: NP-5 from EZchip (now Mellanox) and Tofino from Barefoot Networks. The advantages and disadvantages of two different versions of network processor architectures are considered: a pipeline-based architecture whose stages are represented by a set of general-purpose processor cores, and a pipeline-based architecture whose stages correspond to cores specialized for specific packet processing operations. Based on a dedicated set of the most common use case scenarios, a new architecture of the network processor unit (NPU) with functionally specialized pipeline stages is proposed. The article presents a description of the simulation model of the NPU of the proposed architecture. The simulation model of the network processor is implemented in C++ using SystemC, the open-source C++ library. For the functional testing of the obtained NPU model, the described use case scenarios were implemented in C. To evaluate the performance of the proposed NPU architecture, a set of software products developed by the KM211 company and the KMX32 family of microcontrollers were used. The evaluation of NPU performance was based on the simulation model. Estimates of the processing time of one packet and the average throughput of the NPU model are obtained for each scenario.
