Designing efficient algorithms for querying large corpora

Paul Meurer

doi:10.5617/osla.8504


Oslo Studies in Language ◽

10.5617/osla.8504 ◽

2021 ◽

Vol 11 (2) ◽

pp. 283-302

Author(s):

Paul Meurer

Keyword(s):

Regular Expression ◽

Linear Time ◽

Suffix Array ◽

Efficient Algorithms ◽

Regular Expressions ◽

Efficient Treatment ◽

Suffix Arrays ◽

Regular Expression Matching ◽

Finite State ◽

Query System

I describe several new efficient algorithms for querying large annotated corpora. The search algorithms as they are implemented in several popular corpus search engines are less than optimal in two respects: regular expression string matching in the lexicon is done in linear time, and regular expressions over corpus positions are evaluated starting in those corpus positions that match the constraints of the initial edges of the corresponding network. To address these shortcomings, I have developed an algorithm for regular expression matching on suffix arrays that allows fast lexicon lookup, and a technique for running finite state automata from the edges with the lowest corpus counts. The implementation of the lexicon as a suffix array also lends itself to an elegant and efficient treatment of multi-valued and set-valued attributes. The described techniques have been implemented in a fully functional corpus management system and are also used in a treebank query system.
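The lexicon lookup idea can be illustrated with a minimal Python sketch. It is not the algorithm from the paper: it only shows how a suffix array over a toy lexicon lets a literal substring be located by binary search instead of a linear scan, with an ordinary regex applied afterwards to the few candidates. All names (`build_suffix_array`, `words_containing`, `regex_lookup`) are hypothetical, and the naive sort-based construction is an assumption made for brevity.

```python
import re

def build_suffix_array(words):
    """Return sorted (suffix, word_index) pairs for every suffix of every word.

    Naive construction by sorting; acceptable for a toy lexicon only.
    """
    return sorted((w[i:], idx) for idx, w in enumerate(words) for i in range(len(w)))

def _bound(sa, literal, strict):
    """Binary search on suffixes truncated to len(literal).

    strict=False: first position whose truncated suffix is >= literal.
    strict=True:  first position whose truncated suffix is >  literal.
    """
    lo, hi, k = 0, len(sa), len(literal)
    while lo < hi:
        mid = (lo + hi) // 2
        prefix = sa[mid][0][:k]
        if prefix < literal or (strict and prefix == literal):
            lo = mid + 1
        else:
            hi = mid
    return lo

def words_containing(sa, words, literal):
    """All words containing `literal`, found in O(log n) suffix comparisons."""
    start, end = _bound(sa, literal, False), _bound(sa, literal, True)
    return sorted({words[idx] for _, idx in sa[start:end]})

def regex_lookup(sa, words, literal, pattern):
    """Use a literal factor of the regex to narrow candidates, then verify."""
    return [w for w in words_containing(sa, words, literal) if re.fullmatch(pattern, w)]

words = ["corpus", "corpora", "suffix", "array", "query"]
sa = build_suffix_array(words)
print(words_containing(sa, words, "or"))          # ['corpora', 'corpus']
print(regex_lookup(sa, words, "or", r"c.*or.*a"))  # ['corpora']
```

The design choice here is to let the suffix array handle only a literal factor of the pattern; running the automaton directly over the suffix array, as the abstract describes, is a more involved technique that this sketch does not attempt.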


Software Toolchain for Large-Scale RE-NFA Construction on FPGA

International Journal of Reconfigurable Computing ◽

10.1155/2009/301512 ◽

2009 ◽

Vol 2009 ◽

pp. 1-10 ◽

Cited By ~ 3

Author(s):

Yi-Hua E. Yang ◽

Viktor K. Prasanna

Keyword(s):

High Performance ◽

Large Scale ◽

Regular Expression ◽

Finite Automata ◽

Fixed Number ◽

Regular Expressions ◽

Pattern Complexity ◽

Regular Expression Matching ◽

Area Increase ◽

Prototype Software

We present a software toolchain for constructing large-scale regular expression matching (REM) on FPGA. The software automates the conversion of regular expressions into compact and high-performance nondeterministic finite automata (RE-NFA). Each RE-NFA is described as an RTL regular expression matching engine (REME) in VHDL for FPGA implementation. Assuming a fixed number of fan-out transitions per state, an n-state m-bytes-per-cycle RE-NFA can be constructed in O(n×m) time and O(n×m) memory by our software. A large number of RE-NFAs are placed onto a two-dimensional staged pipeline, allowing scalability to thousands of RE-NFAs with linear area increase and little clock rate penalty due to scaling. On a PC with a 2 GHz Athlon64 processor and 2 GB memory, our prototype software constructs hundreds of RE-NFAs used by Snort in less than 10 seconds. We also designed a benchmark generator which can produce RE-NFAs with configurable pattern complexity parameters, including state count, state fan-in, loop-back and feed-forward distances. Several regular expressions with various complexities are used to test the performance of our RE-NFA construction software.


Parallel Finite State Machines for Very Fast Distributable Regular Expression Matching

Proceedings of the 7th International Conference on Software Paradigm Trends ◽

10.5220/0003949901050110 ◽

2012 ◽

Keyword(s):

Regular Expression ◽

Finite State Machines ◽

State Machines ◽

Regular Expression Matching ◽

Finite State


Regular expressions for language engineering

Natural Language Engineering ◽

10.1017/s1351324997001563 ◽

1996 ◽

Vol 2 (4) ◽

pp. 305-328 ◽

Cited By ~ 46

Author(s):

L. KARTTUNEN ◽

J-P. CHANOD ◽

G. GREFENSTETTE ◽

A. SCHILLE

Keyword(s):

Natural Language ◽

Regular Expression ◽

Regular Expressions ◽

Language Engineering ◽

Finite State Transducers ◽

Finite State ◽

Processing Steps

Many of the processing steps in natural language engineering can be performed using finite state transducers. An optimal way to create such transducers is to compile them from regular expressions. This paper is an introduction to the regular expression calculus, extended with certain operators that have proved very useful in natural language applications ranging from tokenization to light parsing. The examples in the paper illustrate in concrete detail some of these applications.


Proof-directed program transformation: A functional account of efficient regular expression matching

Journal of Functional Programming ◽

10.1017/s0956796820000295 ◽

2021 ◽

Vol 31 ◽

Author(s):

ANDRZEJ FILINSKI

Keyword(s):

Program Transformation ◽

Formal Language ◽

Regular Expression ◽

State Machine ◽

Automata Theory ◽

Regular Expressions ◽

Transformation Techniques ◽

Standard Specification ◽

Correctness Proofs ◽

Regular Expression Matching

We show how to systematically derive an efficient regular expression (regex) matcher using a variety of program transformation techniques, but very little specialized formal language and automata theory. Starting from the standard specification of the set-theoretic semantics of regular expressions, we proceed via a continuation-based backtracking matcher, to a classical, table-driven state machine. All steps of the development are supported by self-contained (and machine-verified) equational correctness proofs.
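The intermediate stage named in this abstract, a continuation-based backtracking matcher, can be sketched in a few lines of Python. This is an illustrative assumption about what such a matcher looks like, not the derivation from the paper: the tuple encoding of regexes and the names `match` and `accepts` are hypothetical.

```python
def match(regex, s, i, k):
    """Match `regex` against s starting at index i.

    The continuation k(j) is called for each candidate end index j and
    decides whether the rest of the match succeeds; returning a falsy
    value makes the matcher backtrack to the next alternative.
    """
    tag = regex[0]
    if tag == "eps":
        return k(i)
    if tag == "chr":
        return i < len(s) and s[i] == regex[1] and k(i + 1)
    if tag == "seq":
        return match(regex[1], s, i, lambda j: match(regex[2], s, j, k))
    if tag == "alt":
        return match(regex[1], s, i, k) or match(regex[2], s, i, k)
    if tag == "star":
        # Either stop iterating here, or match the body once and loop.
        # Requiring j > i rules out empty iterations, which guarantees
        # termination when the body can match the empty string.
        return k(i) or match(regex[1], s, i,
                             lambda j: j > i and match(regex, s, j, k))
    raise ValueError(f"unknown regex node: {tag!r}")

def accepts(regex, s):
    """True if the whole string s is in the language of `regex`."""
    return bool(match(regex, s, 0, lambda j: j == len(s)))

# (a|b)*c
ab_star_c = ("seq", ("star", ("alt", ("chr", "a"), ("chr", "b"))), ("chr", "c"))
print(accepts(ab_star_c, "abac"))  # True
print(accepts(ab_star_c, "abca"))  # False
```

A matcher of this shape can take exponential time on adversarial inputs and recurses once per consumed character, which is the kind of inefficiency a table-driven state machine avoids.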


Two Efficient Algorithms for Linear Time Suffix Array Construction

IEEE Transactions on Computers ◽

10.1109/tc.2010.188 ◽

2011 ◽

Vol 60 (10) ◽

pp. 1471-1484 ◽

Cited By ~ 78

Author(s):

Ge Nong ◽

Sen Zhang ◽

Wai Hong Chan

Keyword(s):

Linear Time ◽

Suffix Array ◽

Efficient Algorithms


ON REGULAR EXPRESSION HASHING TO REDUCE FA SIZE

International Journal of Foundations of Computer Science ◽

10.1142/s0129054109007042 ◽

2009 ◽

Vol 20 (06) ◽

pp. 1069-1086

Author(s):

WIKUS COETSER ◽

DERRICK G. KOURIE ◽

BRUCE W. WATSON

Keyword(s):

Hash Function ◽

Regular Expression ◽

Empirical Work ◽

Hash Functions ◽

Small Sample ◽

Finite State Automaton ◽

Regular Expressions ◽

Large Sample ◽

Finite State

The consequences of regular expression hashing as a means of finite state automaton reduction are explored, based on variations of Brzozowski's algorithm. In this approach, each hash collision results in the merging of the automaton's states, and it is subsequently shown that a super-automaton will always be constructed, regardless of the hash function used. Since direct adaptation of the classical Brzozowski algorithm leads to a non-deterministic super-automaton, a new algorithm is put forward for constructing a deterministic FA. Approaches are proposed for measuring the quality of a hash function. These ideas are empirically tested on a large sample of relatively small regular expressions and their associated automata, as well as on a small sample of relatively large regular expressions. Differences in the quality of tested hash functions are observed. Possible reasons for this are mentioned, but future empirical work is required to investigate the matter.


THE VIRTUAL SUFFIX TREE

International Journal of Foundations of Computer Science ◽

10.1142/s0129054109007066 ◽

2009 ◽

Vol 20 (06) ◽

pp. 1109-1133 ◽

Cited By ~ 2

Author(s):

JIE LIN ◽

YUE JIANG ◽

DON ADJEROH

Keyword(s):

Suffix Tree ◽

Linear Time ◽

Suffix Array ◽

Intermediate Step ◽

Suffix Trees ◽

String Length ◽

Space Requirement ◽

Suffix Arrays ◽

Tree Construction ◽

Efficient Data

We introduce the VST (virtual suffix tree), an efficient data structure for suffix trees and suffix arrays. Starting from the suffix array, we construct the suffix tree, from which we derive the virtual suffix tree. Later, we remove the intermediate step of suffix tree construction, and build the VST directly from the suffix array. The VST provides the same functionality as the suffix tree, including suffix links, but at a much smaller space requirement. It has the same linear time construction even for large alphabets, Σ, requires O(n) space to store (n is the string length), and allows searching for a pattern of length m to be performed in O(m log |Σ|) time, the same time needed for a suffix tree. Given the VST, we show an algorithm that computes all the suffix links in linear time, independent of Σ. The VST requires less space than other recently proposed data structures for suffix trees and suffix arrays, such as the enhanced suffix array [1], and the linearized suffix tree [17]. On average, the space requirement (including that for suffix arrays and suffix links) is 13.8n bytes for the regular VST, and 12.05n bytes in its compact form.


Accelerated preprocessing in task of searching substrings in a string

Vestnik of Don State Technical University ◽

10.23947/1992-5980-2019-19-3-290-300 ◽

2019 ◽

Vol 19 (3) ◽

pp. 290-300

Author(s):

A. V. Mazurenko ◽

N. V. Boldyrikhin

Keyword(s):

Database Management ◽

Linear Time ◽

Rapid Development ◽

Suffix Array ◽

Database Management Systems ◽

Management Systems ◽

Research Results ◽

Error Functions ◽

Suffix Arrays ◽

Associative Search

Introduction. The rapid development of systems such as Yandex and Google has made the task of searching for substrings in a string highly relevant, and approaches to its solution are being actively investigated. This task arises in building database management systems that support associative search. It is also applicable to information security problems and to antivirus programs, where substring search algorithms are used for signature-based detection. Materials and Methods. The solution is based on the Aho-Corasick algorithm, a standard technique for searching for substrings in a string, combined with a new approach to preprocessing. Research Results. It is shown that the transition function and suffix references can be constructed through suffix arrays and special mappings. The relationship between the prefix tree and suffix arrays was investigated, which led to a fundamentally new method of constructing the transition and error functions. The results make it possible to substantially shorten the time spent on preprocessing a set of pattern strings when an integer alphabet is used. The paper lists eight algorithms. The developed algorithms are evaluated, and the results are compared with previously known ones. Two theorems and eight lemmas are proved. Two examples illustrating the practical application of the developed preprocessing procedure are given. Discussion and Conclusions. The preprocessing procedure proposed in this paper is based on the connection between the suffix array built from a set of pattern strings and the construction of the transition and error functions at the initial stages of the Aho-Corasick algorithm. This approach differs from the traditional one and requires algorithms that construct a suffix array in linear time.
Thus, the paper describes algorithms that, for a certain type of alphabet, significantly reduce the time for preprocessing a set of pattern strings compared with the known approach of the Aho-Corasick algorithm. The research results can be used in antivirus programs that search for signatures of malicious data objects in the memory of a computer system. In addition, this approach to substring search will significantly speed up database management systems that use associative search.
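For orientation, the traditional preprocessing that this abstract contrasts itself with can be sketched as follows. This is the textbook Aho-Corasick construction (a trie for the transition function, then a breadth-first pass for the failure function, which the abstract calls the error function), not the suffix-array-based construction proposed in the paper. The names `build_automaton` and `find_all` are hypothetical.

```python
from collections import deque

def build_automaton(patterns):
    """Classical Aho-Corasick preprocessing over a set of pattern strings."""
    goto = [{}]   # goto[state][char] -> next state (trie edges)
    fail = [0]    # fail[state] -> state to fall back to on a mismatch
    out = [[]]    # out[state] -> patterns that end at this state
    for p in patterns:
        state = 0
        for ch in p:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(p)
    # Breadth-first traversal: depth-1 states keep fail = 0, deeper states
    # inherit their fallback from the fallback chain of their parent.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out

def find_all(text, patterns):
    """Return (start_index, pattern) for every occurrence in text."""
    goto, fail, out = build_automaton(patterns)
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        hits.extend((i - len(p) + 1, p) for p in out[state])
    return hits

print(sorted(find_all("ushers", ["he", "she", "his", "hers"])))
# [(1, 'she'), (2, 'he'), (2, 'hers')]
```

The breadth-first pass is the step whose cost the paper aims to reduce by deriving the same functions from a suffix array of the pattern set.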


Regular-expression derivatives re-examined

Journal of Functional Programming ◽

10.1017/s0956796808007090 ◽

2009 ◽

Vol 19 (2) ◽

pp. 173-190 ◽

Cited By ~ 51

Author(s):

SCOTT OWENS ◽

JOHN REPPY ◽

AARON TURON

Keyword(s):

Regular Expression ◽

Finite State Machines ◽

Regular Expressions ◽

Functional Language ◽

State Machines ◽

Boolean Operations ◽

Traditional Algorithm ◽

Computer Scientists ◽

Finite State

Regular-expression derivatives are an old, but elegant, technique for compiling regular expressions to deterministic finite-state machines. It easily supports extending the regular-expression operators with boolean operations, such as intersection and complement. Unfortunately, this technique has been lost in the sands of time and few computer scientists are aware of it. In this paper, we reexamine regular-expression derivatives and report on our experiences in the context of two different functional-language implementations. The basic implementation is simple and we show how to extend it to handle large character sets (e.g., Unicode). We also show that the derivatives approach leads to smaller state machines than the traditional algorithm given by McNaughton and Yamada.
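The derivative technique itself is compact enough to sketch in Python. The version below is a minimal matcher based on Brzozowski's rules, including intersection and complement; it is an illustration under stated assumptions, not either of the implementations described in the paper. The tuple encoding and the names `nullable`, `deriv` and `matches` are hypothetical, and no simplification of intermediate terms is done, so it matches strings directly rather than building a state machine.

```python
EMPTY, EPS = ("empty",), ("eps",)

def nullable(r):
    """True if the language of r contains the empty string."""
    tag = r[0]
    if tag in ("eps", "star"):
        return True
    if tag in ("empty", "chr"):
        return False
    if tag in ("seq", "and"):
        return nullable(r[1]) and nullable(r[2])
    if tag == "alt":
        return nullable(r[1]) or nullable(r[2])
    if tag == "not":
        return not nullable(r[1])
    raise ValueError(f"unknown regex node: {tag!r}")

def deriv(r, c):
    """Derivative of r with respect to character c."""
    tag = r[0]
    if tag in ("empty", "eps"):
        return EMPTY
    if tag == "chr":
        return EPS if r[1] == c else EMPTY
    if tag == "seq":
        left = ("seq", deriv(r[1], c), r[2])
        return ("alt", left, deriv(r[2], c)) if nullable(r[1]) else left
    if tag in ("alt", "and"):
        return (tag, deriv(r[1], c), deriv(r[2], c))
    if tag == "not":
        return ("not", deriv(r[1], c))
    if tag == "star":
        return ("seq", deriv(r[1], c), r)
    raise ValueError(f"unknown regex node: {tag!r}")

def matches(r, s):
    """Take one derivative per input character, then test nullability."""
    for c in s:
        r = deriv(r, c)
    return nullable(r)

# Strings over {a, b} that do not contain "bb":  (a|b)* & ~((a|b)* b b (a|b)*)
ab = ("star", ("alt", ("chr", "a"), ("chr", "b")))
has_bb = ("seq", ab, ("seq", ("chr", "b"), ("seq", ("chr", "b"), ab)))
no_bb = ("and", ab, ("not", has_bb))
print(matches(no_bb, "abab"))  # True
print(matches(no_bb, "abba"))  # False
```

Compiling to a deterministic machine additionally requires identifying equivalent derivatives so that the set of states stays finite; this sketch omits that step, so its terms grow with the input length.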


NORMALIZED EXPRESSIONS AND FINITE AUTOMATA

International Journal of Algebra and Computation ◽

10.1142/s021819670700355x ◽

2007 ◽

Vol 17 (01) ◽

pp. 141-154 ◽

Cited By ~ 11

Author(s):

J.-M. CHAMPARNAUD ◽

F. OUARDI ◽

D. ZIADI

Keyword(s):

Partial Derivative ◽

Regular Expression ◽

Linear Time ◽

Finite Automata ◽

Experimental Studies ◽

Regular Expressions ◽

Theoretical Comparison ◽

Theoretical Question

There exist two well-known quotients of the position automaton of a regular expression. The first one, called the equation automaton, was first introduced by Mirkin from the notion of prebase and has been redefined by Antimirov from the notion of partial derivative. The second one, due to Ilie and Yu and called the follow automaton, can be obtained by eliminating ε-transitions in an ε-NFA that is always smaller than the classical ε-NFAs (Thompson, Sippu and Soisalon–Soininen). Ilie and Yu discussed the difficulty of a theoretical comparison between the size of the follow automaton and the size of the equation automaton and concluded that experimental studies would very likely be necessary. In this paper we solve the theoretical question by first defining a set of regular expressions, called normalized expressions, such that every regular expression can be normalized in linear time, and then proving that the equation automaton of a normalized expression is always smaller than its follow automaton.
