scholarly journals An Impulse-C Hardware Accelerator for Packet Classification Based on Fine/Coarse Grain Optimization

2013 ◽  
Vol 2013 ◽  
pp. 1-23 ◽  
Author(s):  
O. Ahmed ◽  
S. Areibi ◽  
R. Collier ◽  
G. Grewal

Current software-based packet classification algorithms exhibit relatively poor performance, prompting many researchers to concentrate on novel frameworks and architectures that employ both hardware and software components. The Packet Classification with Incremental Update (PCIU) algorithm, Ahmed et al. (2010), is a novel and efficient packet classification algorithm with a unique incremental update capability that demonstrated excellent results and was shown to be scalable for many different tasks and clients. While a pure software implementation can generate powerful results on a server machine, an embedded solution may be more desirable for some applications and clients. Embedded, specialized hardware accelerator based solutions are typically much more efficient in speed, cost, and size than solutions that are implemented on general-purpose processor systems. This paper seeks to explore the design space of translating the PCIU algorithm into hardware by utilizing several optimization techniques, ranging from fine grain to coarse grain and parallel coarse grain approaches. The paper presents a detailed implementation of a hardware accelerator of the PCIU based on an Electronic System Level (ESL) approach. Results obtained indicate that the hardware accelerator achieves on average 27x speedup over a state-of-the-art Xeon processor.

2011 ◽  
Vol 2011 ◽  
pp. 1-21 ◽  
Author(s):  
O. Ahmed ◽  
S. Areibi ◽  
K. Chattha ◽  
B. Kelly

Packet classification plays a crucial role for a number of network services such as policy-based routing, firewalls, and traffic billing, to name a few. However, classification can be a bottleneck in the above-mentioned applications if not implemented properly and efficiently. In this paper, we propose PCIU, a novel classification algorithm, which improves upon previously published work. PCIU provides lower preprocessing time, lower memory consumption, ease of incremental rule update, and reasonable classification time compared to state-of-the-art algorithms. The proposed algorithm was evaluated and compared to RFC and HiCut using several benchmarks. Results obtained indicate that PCIU outperforms these algorithms in terms of speed, memory usage, incremental update capability, and preprocessing time. The algorithm, furthermore, was improved and made more accessible for a variety of applications through implementation in hardware. Two such implementations are detailed and discussed in this paper. The results indicate that a hardware/software codesign approach results in a slower, but easier to optimize and improve within time constraints, PCIU solution. A hardware accelerator based on an ESL approach using Handel-C, on the other hand, resulted in a 31x speed-up over a pure software implementation running on a state of the art Xeon processor.


Electronics ◽  
2021 ◽  
Vol 10 (11) ◽  
pp. 1323
Author(s):  
Srikanth Ramadurgam ◽  
Darshika G. Perera

Machine learning is becoming the cornerstones of smart and autonomous systems. Machine learning algorithms can be categorized into supervised learning (classification) and unsupervised learning (clustering). Among many classification algorithms, the Support Vector Machine (SVM) classifier is one of the most commonly used machine learning algorithms. By incorporating convex optimization techniques into the SVM classifier, we can further enhance the accuracy and classification process of the SVM by finding the optimal solution. Many machine learning algorithms, including SVM classification, are compute-intensive and data-intensive, requiring significant processing power. Furthermore, many machine learning algorithms have found their way into portable and embedded devices, which have stringent requirements. In this research work, we introduce a novel, unique, and efficient Field Programmable Gate Array (FPGA)-based hardware accelerator for a convex optimization-based SVM classifier for embedded platforms, considering the constraints associated with these platforms and the requirements of the applications running on these devices. We incorporate suitable mathematical kernels and decomposition methods to systematically solve the convex optimization for machine learning applications with a large volume of data. Our proposed architectures are generic, parameterized, and scalable; hence, without changing internal architectures, our designs can be used to process different datasets with varying sizes, can be executed on different platforms, and can be utilized for various machine learning applications. We also introduce system-level architectures and techniques to facilitate real-time processing. Experiments are performed using two different benchmark datasets to evaluate the feasibility and efficiency of our hardware architecture, in terms of timing, speedup, area, and accuracy. Our embedded hardware design achieves up to 79 times speedup compared to its embedded software counterpart, and can also achieve up to 100% classification accuracy.


2007 ◽  
Vol 16 (06) ◽  
pp. 911-927
Author(s):  
S. MAJZOUB ◽  
H. DIAB

Reconfigurable Systems represent a middle trade-off between speed and flexibility in the processor design world. It provides performance close to the custom-hardware and yet preserves some of the general-purpose processor flexibility. Recently, the area of reconfigurable computing has received considerable interest in both its forms: the FPGA and coarse-grain hardware. Since the field is still in its developing stage, it is important to perform hardware analysis and evaluation of certain key applications on target reconfigurable architectures to identify potential limitations and improvements. This paper presents the mapping and performance analysis of two encryption algorithms, namely Rijndael and Twofish, on a coarse grain reconfigurable platform, namely MorphoSys. MorphoSys is a reconfigurable architecture targeted for multimedia applications. Since many cryptographic algorithms involve bitwise operations, bitwise instruction set extension was proposed to enhance the performance. We present the details of the mapping of the bitwise operations involved in the algorithms with thorough analysis. The methodology we used can be utilized in other systems.


Author(s):  
Hui Yang ◽  
Anand Nayyar

: In the fast development of information, the information data is increasing in geometric multiples, and the speed of information transmission and storage space are required to be higher. In order to reduce the use of storage space and further improve the transmission efficiency of data, data need to be compressed. processing. In the process of data compression, it is very important to ensure the lossless nature of data, and lossless data compression algorithms appear. The gradual optimization design of the algorithm can often achieve the energy-saving optimization of data compression. Similarly, The effect of energy saving can also be obtained by improving the hardware structure of node. In this paper, a new structure is designed for sensor node, which adopts hardware acceleration, and the data compression module is separated from the node microprocessor.On the basis of the ASIC design of the algorithm, by introducing hardware acceleration, the energy consumption of the compressed data was successfully reduced, and the proportion of energy consumption and compression time saved by the general-purpose processor was as high as 98.4 % and 95.8 %, respectively. It greatly reduces the compression time and energy consumption.


2010 ◽  
Vol 20 (02) ◽  
pp. 103-121 ◽  
Author(s):  
MOSTAFA I. SOLIMAN ◽  
ABDULMAJID F. Al-JUNAID

Technological advances in IC manufacturing provide us with the capability to integrate more and more functionality into a single chip. Today's modern processors have nearly one billion transistors on a single chip. With the increasing complexity of today's system, the designs have to be modeled at a high-level of abstraction before partitioning into hardware and software components for final implementation. This paper explains in detail the implementation and performance evaluation of a matrix processor called Mat-Core with SystemC (system level modeling language). Mat-Core is a research processor aiming at exploiting the increasingly number of transistors per IC to improve the performance of a wide range of applications. It extends a general-purpose scalar processor with a matrix unit. To hide memory latency, the extended matrix unit is decoupled into two components: address generation and data computation, which communicate through data queues. Like vector architectures, the data computation unit is organized in parallel lanes. However, on parallel lanes, Mat-Core can execute matrix-scalar, matrix-vector, and matrix-matrix instructions in addition to vector-scalar and vector-vector instructions. For controlling the execution of vector/matrix instructions on the matrix core, this paper extends the well known scoreboard technique. Furthermore, the performance of Mat-Core is evaluated on vector and matrix kernels. Our results show that the performance of four lanes Mat-Core with matrix registers of size 4 × 4 or 16 elements each, queues size of 10, start up time of 6 clock cycles, and memory latency of 10 clock cycles is about 0.94, 1.3, 2.3, 1.6, 2.3, and 5.5 FLOPs per clock cycle; achieved on scalar-vector multiplication, SAXPY, Givens, rank-1 update, vector-matrix multiplication, and matrix-matrix multiplication, respectively.


Author(s):  
Matias Javier Oliva ◽  
Pablo Andrés García ◽  
Enrique Mario Spinelli ◽  
Alejandro Luis Veiga

<span lang="EN-US">Real-time acquisition and processing of electroencephalographic signals have promising applications in the implementation of brain-computer interfaces. These devices allow the user to control a device without performing motor actions, and are usually made up of a biopotential acquisition stage and a personal computer (PC). This structure is very flexible and appropriate for research, but for final users it is necessary to migrate to an embedded system, eliminating the PC from the scheme. The strict real-time processing requirements of such systems justify the choice of a system on a chip field-programmable gate arrays (SoC-FPGA) for its implementation. This article proposes a platform for the acquisition and processing of electroencephalographic signals using this type of device, which combines the parallelism and speed capabilities of an FPGA with the simplicity of a general-purpose processor on a single chip. In this scheme, the FPGA is in charge of the real-time operation, acquiring and processing the signals, while the processor solves the high-level tasks, with the interconnection between processing elements solved by buses integrated into the chip. The proposed scheme was used to implement a brain-computer interface based on steady-state visual evoked potentials, which was used to command a speller. The first tests of the system show that a selection time of 5 seconds per command can be achieved. The time delay between the user’s selection and the system response has been estimated at 343 µs.</span>


2019 ◽  
Vol 26 (1) ◽  
pp. 39-62
Author(s):  
Stanislav O. Bezzubtsev ◽  
Vyacheslav V. Vasin ◽  
Dmitry Yu. Volkanov ◽  
Shynar R. Zhailauova ◽  
Vladislav A. Miroshnik ◽  
...  

The paper proposes the architecture and basic requirements for a network processor for OpenFlow switches of software-defined networks. An analysis of the architectures of well-known network processors is presented − NP-5 from EZchip (now Mellanox) and Tofino from Barefoot Networks. The advantages and disadvantages of two different versions of network processor architectures are considered: pipeline-based architecture, the stages of which are represented by a set of general-purpose processor cores, and pipeline-based architecture whose stages correspond to cores specialized for specific packet processing operations. Based on a dedicated set of the most common use case scenarios, a new architecture of the network processor unit (NPU) with functionally specialized pipeline stages was proposed. The article presents a description of the simulation model of the NPU of the proposed architecture. The simulation model of the network processor is implemented in C ++ languages using SystemC, the open-source C++ library. For the functional testing of the obtained NPU model, the described use case scenarios were implemented in C. In order to evaluate the performance of the proposed NPU architecture a set of software products developed by KM211 company and the KMX32 family of microcontrollers were used. Evaluation of NPU performance was made on the basis of a simulation model. Estimates of the processing time of one packet and the average throughput of the NPU model for each scenario are obtained.


Sign in / Sign up

Export Citation Format

Share Document