scholarly journals PCIU: Hardware Implementations of an Efficient Packet Classification Algorithm with an Incremental Update Capability

2011 ◽  
Vol 2011 ◽  
pp. 1-21 ◽  
Author(s):  
O. Ahmed ◽  
S. Areibi ◽  
K. Chattha ◽  
B. Kelly

Packet classification plays a crucial role for a number of network services such as policy-based routing, firewalls, and traffic billing, to name a few. However, classification can be a bottleneck in the above-mentioned applications if not implemented properly and efficiently. In this paper, we propose PCIU, a novel classification algorithm, which improves upon previously published work. PCIU provides lower preprocessing time, lower memory consumption, ease of incremental rule update, and reasonable classification time compared to state-of-the-art algorithms. The proposed algorithm was evaluated and compared to RFC and HiCut using several benchmarks. Results obtained indicate that PCIU outperforms these algorithms in terms of speed, memory usage, incremental update capability, and preprocessing time. The algorithm, furthermore, was improved and made more accessible for a variety of applications through implementation in hardware. Two such implementations are detailed and discussed in this paper. The results indicate that a hardware/software codesign approach results in a slower, but easier to optimize and improve within time constraints, PCIU solution. A hardware accelerator based on an ESL approach using Handel-C, on the other hand, resulted in a 31x speed-up over a pure software implementation running on a state of the art Xeon processor.

2020 ◽  
Vol 2020 ◽  
pp. 1-11
Author(s):  
Chuanhong Li ◽  
Xuewen Zeng ◽  
Lei Song ◽  
Yan Jiang

Packet classification algorithms have been the focus of research for the last few years, due to the vital role they play in various services based on packet forwarding. However, as the number of rules in the rule set increases, not only the preprocessing time but also the memory consumption is increasing greatly. In this paper, we first model and analyze the above issue in depth. Then, a fast, smart packet classification algorithm based on decomposition is proposed. By boundary-based rule traversal and smart rule set partitioning, both the preprocessing time and memory consumption are reduced dramatically. Experimental results show that the preprocessing time of our method achieves 8.8-time improvement at maximum compared with the PCIU and achieves about 31.5-time improvement on average compared with CutSplit for large rule sets. Meanwhile, the memory overhead is reduced by 40% at maximum and 27.5% on average compared with the PCIU.


2013 ◽  
Vol 2013 ◽  
pp. 1-23 ◽  
Author(s):  
O. Ahmed ◽  
S. Areibi ◽  
R. Collier ◽  
G. Grewal

Current software-based packet classification algorithms exhibit relatively poor performance, prompting many researchers to concentrate on novel frameworks and architectures that employ both hardware and software components. The Packet Classification with Incremental Update (PCIU) algorithm, Ahmed et al. (2010), is a novel and efficient packet classification algorithm with a unique incremental update capability that demonstrated excellent results and was shown to be scalable for many different tasks and clients. While a pure software implementation can generate powerful results on a server machine, an embedded solution may be more desirable for some applications and clients. Embedded, specialized hardware accelerator based solutions are typically much more efficient in speed, cost, and size than solutions that are implemented on general-purpose processor systems. This paper seeks to explore the design space of translating the PCIU algorithm into hardware by utilizing several optimization techniques, ranging from fine grain to coarse grain and parallel coarse grain approaches. The paper presents a detailed implementation of a hardware accelerator of the PCIU based on an Electronic System Level (ESL) approach. Results obtained indicate that the hardware accelerator achieves on average 27x speedup over a state-of-the-art Xeon processor.


2013 ◽  
Vol 2013 ◽  
pp. 1-33 ◽  
Author(s):  
O. Ahmed ◽  
S. Areibi ◽  
G. Grewal

Packet classification is a ubiquitous and key building block for many critical network devices. However, it remains as one of the main bottlenecks faced when designing fast network devices. In this paper, we propose a novel Group Based Search packet classification Algorithm (GBSA) that is scalable, fast, and efficient. GBSA consumes an average of 0.4 megabytes of memory for a 10 k rule set. The worst-case classification time per packet is 2 microseconds, and the preprocessing speed is 3 M rules/second based on an Xeon processor operating at 3.4 GHz. When compared with other state-of-the-art classification techniques, the results showed that GBSA outperforms the competition with respect to speed, memory usage, and processing time. Moreover, GBSA is amenable to implementation in hardware. Three different hardware implementations are also presented in this paper including an Application Specific Instruction Set Processor (ASIP) implementation and two pure Register-Transfer Level (RTL) implementations based on Impulse-C and Handel-C flows, respectively. Speedups achieved with these hardware accelerators ranged from 9x to 18x compared with a pure software implementation running on an Xeon processor.


2020 ◽  
Vol 36 (10-12) ◽  
pp. 2327-2340 ◽  
Author(s):  
Daniel Ströter ◽  
Johannes S. Mueller-Roemer ◽  
André Stork ◽  
Dieter W. Fellner

Abstract We present a novel bounding volume hierarchy for GPU-accelerated direct volume rendering (DVR) as well as volumetric mesh slicing and inside-outside intersection testing. Our novel octree-based data structure is laid out linearly in memory using space filling Morton curves. As our new data structure results in tightly fitting bounding volumes, boundary markers can be associated with nodes in the hierarchy. These markers can be used to speed up all three use cases that we examine. In addition, our data structure is memory-efficient, reducing memory consumption by up to 75%. Tree depth and memory consumption can be controlled using a parameterized heuristic during construction. This allows for significantly shorter construction times compared to the state of the art. For GPU-accelerated DVR, we achieve performance gain of 8.4$$\times $$ × –13$$\times $$ × . For 3D printing, we present an efficient conservative slicing method that results in a 3$$\times $$ × –25$$\times $$ × speedup when using our data structure. Furthermore, we improve volumetric mesh intersection testing speed by 5$$\times $$ × –52$$\times $$ × .


2021 ◽  
Vol 297 ◽  
pp. 126645
Author(s):  
Gajanan Sampatrao Ghodake ◽  
Surendra Krushna Shinde ◽  
Avinash Ashok Kadam ◽  
Rijuta Ganesh Saratale ◽  
Ganesh Dattatraya Saratale ◽  
...  

2020 ◽  
Vol 14 (4) ◽  
pp. 653-667
Author(s):  
Laxman Dhulipala ◽  
Changwan Hong ◽  
Julian Shun

Connected components is a fundamental kernel in graph applications. The fastest existing multicore algorithms for solving graph connectivity are based on some form of edge sampling and/or linking and compressing trees. However, many combinations of these design choices have been left unexplored. In this paper, we design the ConnectIt framework, which provides different sampling strategies as well as various tree linking and compression schemes. ConnectIt enables us to obtain several hundred new variants of connectivity algorithms, most of which extend to computing spanning forest. In addition to static graphs, we also extend ConnectIt to support mixes of insertions and connectivity queries in the concurrent setting. We present an experimental evaluation of ConnectIt on a 72-core machine, which we believe is the most comprehensive evaluation of parallel connectivity algorithms to date. Compared to a collection of state-of-the-art static multicore algorithms, we obtain an average speedup of 12.4x (2.36x average speedup over the fastest existing implementation for each graph). Using ConnectIt, we are able to compute connectivity on the largest publicly-available graph (with over 3.5 billion vertices and 128 billion edges) in under 10 seconds using a 72-core machine, providing a 3.1x speedup over the fastest existing connectivity result for this graph, in any computational setting. For our incremental algorithms, we show that our algorithms can ingest graph updates at up to several billion edges per second. To guide the user in selecting the best variants in ConnectIt for different situations, we provide a detailed analysis of the different strategies. Finally, we show how the techniques in ConnectIt can be used to speed up two important graph applications: approximate minimum spanning forest and SCAN clustering.


Sign in / Sign up

Export Citation Format

Share Document