# Changeset 3630

Timestamp:
Feb 21, 2014, 1:54:04 PM
Message:

Modify data

Location:
docs/Working/re
Files:
4 edited

\documentclass[pageno]{jpaper}
%Earlier working title: Fast Regular Expression Matching with Bit-parallel Data Streams
%replace XXX with the submission number you are given from the PACT submission site.
\newcommand{\pactsubmissionnumber}{XXX}
\usepackage[normalem]{ulem}
\usepackage{amssymb}
\usepackage{amsmath}
\usepackage{graphicx}
\usepackage{tikz}
\usepackage{pgfplots}
\begin{document}
\title{Bitwise Data Parallelism in Regular Expression Matching}
\author{Robert D. Cameron \and Kenneth S. Herdy \and Ben Hull \and Thomas C. Shermer\\
School of Computing Science\\
Simon Fraser University}
\date{}
\maketitle
\thispagestyle{empty}
\begin{abstract}
A parallel regular expression matching method is introduced and compared with the state of the art in software tools designed for efficient on-line search.
%such as the {\em grep} family pattern matching tools.
The method is based on the concept of bit-parallel data streams, in which parallel streams of bits are formed such that each stream comprises bits in one-to-one correspondence with the character code units of a source data stream. An implementation of the method in the form of a regular expression compiler is discussed. The compiler accepts a regular expression and forms unbounded bit-parallel data stream operations. Bit-parallel operations are then transformed into a low-level C-based implementation for compilation into native pattern matching applications. These low-level C-based implementations take advantage of the SIMD (single-instruction multiple-data) capabilities of commodity processors to yield a dramatic speed-up over traditional byte-at-a-time approaches. On processors supporting W-bit addition operations, the method processes W source characters in parallel and performs up to W finite state transitions per clock cycle. We introduce a new bit-parallel scanning primitive, {\em Match Star},
which performs parallel Kleene closure over character classes without backtracking.
% expand and rephrase description of Match Star
We evaluate the performance of our method in comparison with several widely known {\em grep} family implementations, {\em Gnu grep}, {\em agrep}, {\em nr-grep}, and regular expression engines such as {\em Google's re2}. Performance results are analyzed using the performance monitoring counters of commodity hardware. Overall, our results demonstrate a dramatic speed-up over publicly available alternatives.
%\input{abstract}
\end{abstract}
\section{Introduction}
\label{Introduction}
%\input{introduction.tex}
Regular expression matching is an extensively studied problem with applications in text processing and bioinformatics, and with numerous algorithms and software tools developed to address its particular processing demands.
% reword
The pattern matching problem can be stated as follows. Given a text T$_{1..n}$ of n characters and a pattern P, find all the text positions of T that start an occurrence of P. Alternatively, one may want all the final positions of occurrences. Some applications require slightly different output, such as the line that matches the pattern. A pattern P can be a simple string, but it can also be a regular expression. A regular expression is an expression that specifies a set of strings and is composed of (i) simple strings and (ii) the union, concatenation and Kleene closure of other regular expressions. To avoid parentheses it is assumed that the Kleene star has the highest priority, followed by concatenation and then alternation; however, most formalisms provide grouping operators to allow the definition of scope and operator precedence. Readers unfamiliar with the concept of regular expression matching are referred to classical texts such as \cite{aho2007}.
Regular expression matching is commonly performed using a wide variety of publicly available software tools for on-line pattern matching.
These include UNIX grep, Gnu grep, agrep, cgrep, nrgrep, and Perl regular expressions \cite{abou-assaleh2004}. Amongst these tools, Gnu grep (egrep), agrep, and nrgrep are widely known, are considered the fastest regular expression matching tools in practice \cite{navarro2000}, and are of particular interest to this study.
% simple patterns, extended patterns, regular expressions
% motivation / previous work
Although the traditional finite state machine methods used in the scanning and parsing of text streams are considered to be the hardest of the ``13 dwarves'' to parallelize [1], parallel bitstream technology shows considerable promise for these types of applications [3, 4]. In this approach, character streams are processed W positions at a time using the W-bit SIMD registers commonly found on commodity processors (e.g., 128-bit XMM registers on Intel/AMD chips). This is achieved by first slicing the byte streams into eight separate basis bitstreams, one for each bit position within the byte. These basis bitstreams are then combined with bitwise logic and shifting operations to compute further parallel bit streams of interest. We further increase the potential parallelism in our approach by introducing a new parallel scanning primitive which we have termed {\em Match Star}.
% expand and reword desc. of Match Star
The use of regular expressions to search texts for occurrences of string patterns has a long history and remains a pervasive technique throughout computing applications today.
% {\em a brief history}
The origins of regular expression matching date back to automata theory developed by Kleene in the 1950s \cite{kleene1951}. Thompson \cite{thompson1968} is credited with the first construction to convert regular expressions to nondeterministic finite automata (NFA). Following Thompson's approach, a regular expression of length $m$ is converted to an NFA with $O(m)$ states. It is then possible to search a text of length $n$ using the NFA in worst case $O(mn)$ time.
Often, a more efficient choice is to convert an NFA into a DFA. A DFA has a single active state at any time throughout the matching process and hence it is possible to search a text of length $n$ in $O(n)$ time\footnote{It is well known that the conversion of an NFA to an equivalent DFA may result in state explosion. That is, the number of resultant DFA states may increase exponentially.}. A significant portion of the research in fast regular expression matching can be regarded as the ``quest for efficient automata'' \cite{navarro98fastand}.
In \cite{baeza1992new}, Baeza-Yates and Gonnet presented a new approach to string search based on bit-parallelism. This technique takes advantage of the intrinsic parallelism of bitwise operations within a computer word. Given a $w$-bit word, the number of operations that a string search algorithm performs can be reduced by a factor $w$. Using this fact, the Shift-Or algorithm simulates an NFA using bitwise operations and achieves $O(nm/w)$ worst-case time \cite{navarro2000}. A disadvantage of the bit-parallel Shift-Or pattern matching approach is an inability to skip input characters. Simple string matching algorithms, such as the Boyer-Moore family of algorithms \cite{boyer1977fast,horspool1980practical}, skip input characters to achieve sublinear times in the average case. Backward Dawg Matching (BDM) string matching algorithms \cite{crochemore1994text} based on suffix automata are able to skip characters. The Backward Nondeterministic Dawg Matching (BNDM) pattern matching algorithm \cite{wu1992fast} combines the bit-parallel advantages of the Shift-Or approach with the character skipping property of BDM algorithms. The nrgrep pattern matching tool is built over the BNDM algorithm.
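To make the Shift-Or idea concrete, the following Python sketch (illustrative only; the function name and structure are ours, not taken from any of the cited tools) simulates the pattern NFA with one shift and one OR per input character:

```python
def shift_or_search(pattern, text):
    """Classic Shift-Or (Baeza-Yates and Gonnet): state bit i is 0 iff
    pattern[0..i] currently matches a suffix of the scanned text.
    Returns the end positions of all occurrences of `pattern`."""
    m = len(pattern)
    # B[c] has a 0 bit at position i iff pattern[i] == c.
    B = {}
    for i, c in enumerate(pattern):
        B[c] = B.get(c, ~0) & ~(1 << i)
    D = ~0  # all states inactive (Shift-Or uses inverted logic)
    matches = []
    for pos, c in enumerate(text):
        # one shift and one OR per character: all NFA states in parallel
        D = (D << 1) | B.get(c, ~0)
        if D & (1 << (m - 1)) == 0:
            matches.append(pos)  # a match ends at this position
    return matches
```

Python's arbitrary-precision integers stand in for the $w$-bit computer word; in a C implementation the state and masks would be single machine words, giving the $O(nm/w)$ bound.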
Prior to the data-parallel approach to simultaneous processing of data stream elements as discussed herein, nrgrep was by far the fastest grep tool for matching complex patterns and achieved similar performance to that of the fastest existing string matching tools for simple patterns \cite{navarro2000}.
There has been considerable interest in accelerating regular expression matching on parallel hardware such as multi-core processors (CPUs), graphics processing units (GPUs), field-programmable gate arrays (FPGAs), and specialized architectures such as the Cell Broadband Engine (Cell BE).
% FPGA results (synthesis of patterns into logic circuits) vs. memory based approaches (STTs in memory)
%CPU
Scarpazza and Braudaway \cite{scarpazza2008fast} demonstrated that text processing algorithms that exhibit irregular memory access patterns can be efficiently executed on multicore hardware. In related work, Pasetto et al. presented a flexible tool that performs small-ruleset regular expression matching at a rate of 2.88 Gbps per chip on Intel Xeon E5472 hardware \cite{pasetto2010}. Naghmouchi et al. \cite{scarpazza2011top,naghmouchi2010} demonstrated that the Aho-Corasick (AC) string matching algorithm \cite{aho1975} is well suited for parallel implementation on multi-core CPUs, GPUs and the Cell BE. On each platform, both thread-level parallelism (cores) and data-level parallelism (SIMD units) were leveraged for performance. Salapura et al. \cite{salapura2012accelerating} advocated the use of vector-style processing for regular expressions in business analytics applications and leveraged the SIMD hardware available on multi-core processors to achieve a speedup of greater than 1.8 over a range of data sizes of interest.
%Cell
In \cite{scarpazza2008}, Scarpazza and Russell presented a SIMD tokenizer that delivered 1.00--1.78 Gbps on a single Cell BE chip and extended this approach for emulation on the Intel Larrabee instruction set \cite{scarpazza2009larrabee}.
On the Cell BE, Scarpazza \cite{scarpazza2009cell} described a pattern matching implementation that delivered a throughput of 40 Gbps for a small dictionary of approximately 100 patterns and a throughput of 3.3--3.4 Gbps for a larger dictionary of thousands of patterns. Iorio and van Lunteren \cite{iorio2008} presented a string matching implementation for automata that achieved 4 Gbps on the Cell BE.
% GPU
In more recent work, Tumeo et al. \cite{tumeo2010efficient} presented a chunk-based implementation of the AC algorithm for accelerating string matching on GPUs. Lin et al. proposed the Parallel Failureless Aho-Corasick (PFAC) algorithm to accelerate pattern matching on GPU hardware and achieved 143 Gbps throughput, 14.74 times faster than the AC algorithm performed on a four-core multi-core processor using OpenMP \cite{lin2013accelerating}.
Whereas the existing approaches to parallelization have been based on adapting traditional sequential algorithms to emergent parallel architectures, we introduce both a new algorithmic approach and its implementation on SIMD and GPU architectures. This approach relies on a bitwise data parallel view of text streams as well as a surprising use of addition to match runs of characters in a single step. The closest previous work is that underlying bit-parallel XML parsing using 128-bit SSE2 SIMD technology together with a parallel scanning primitive also based on addition \cite{cameron2011parallel}. However, in contrast to the deterministic, longest-match scanning associated with the ScanThru primitive of that work, we introduce here a new primitive MatchStar that can be used in full generality for nondeterministic regular expression matching. We also introduce a long-stream addition technique involving a further application of MatchStar that enables us to scale the technique to $n$-bit addition in $\lceil\log_{64}{n}\rceil$ steps.
We ultimately apply this technique, for example, to perform synchronized 4096-bit addition on GPU warps of 64 threads.
There is also a close relationship between the bit-parallel data streams used in our approach and the bit-parallelism used for NFA state transitions in the classical algorithms of Wu and Manber \cite{wu1992agrep}, Baeza-Yates and Gonnet \cite{baeza1992new} and Navarro and Raffinot \cite{navarro1998bit}. However, those algorithms use bit-parallelism in a fundamentally different way: representing all possible current NFA states as a bit vector and performing parallel transitions to a new set of states using table lookups and bitwise logic. Whereas our approach can match multiple characters per step, bit-parallel NFA algorithms proceed through the input one byte at a time. Nevertheless, the agrep \cite{wu1992agrep} and nrgrep \cite{navarro2000} programs implemented using these techniques remain among the strongest competitors in regular expression matching performance, so we include them in our comparative evaluation.
The remainder of this paper is organized as follows. Section~\ref{Basic Concepts} introduces the notations and basic concepts used throughout this paper. Section~\ref{Background} presents background material on classical automata-based approaches to regular expression matching as well as describes the efficient {\em grep} family utilities and regular expression pattern matching software libraries. Next, Section~\ref{Bit-parallel Data Streams} describes our parallel regular expression matching techniques. Section~\ref{Compiler Technology} presents our software toolchain for constructing pattern matching applications. Section~\ref{Methodology} describes the evaluation framework and Section~\ref{Experimental Results} presents a detailed performance analysis of our data parallel bitstream techniques in comparison to several software tools. Section~\ref{Conclusion} concludes the paper.
The fundamental contribution of this paper is a fully general approach to regular expression matching using bit-parallel data stream operations. The algorithmic aspects of this paper build upon the fundamental concepts of our previous work \cite{cameron2008high, cameron2009parallel, cameron2011parallel, lin2012parabix}. Specific contributions include:
\begin{itemize}
\item compilation of regular expressions into unbounded bit-parallel data stream equations;
\item compilation of character classes into bit-parallel character class data streams;
\item the Match Star parallel scanning primitive; and
\item efficient support for Unicode characters.
\end{itemize}
Section \ref{sec:grep} briefly describes regular expression notation and the grep problem. Section \ref{sec:bitwise} presents our basic algorithm and MatchStar using a model of arbitrary-length bit-parallel data streams. Section \ref{sec:blockwise} discusses the block-by-block implementation of our techniques including the long stream addition techniques for 256-bit addition with AVX2 and 4096-bit additions with GPGPU SIMT. Section \ref{sec:SSE2} describes our overall SSE2 implementation and carries out a performance study in comparison with existing grep implementations. Given the dramatic variation in grep performance across different implementation techniques, expressions and data sets, Section \ref{sec:analysis} considers a comparison between the bit-stream and NFA approaches from a theoretical perspective. Section \ref{sec:AVX2} then examines and demonstrates the scalability of our bitwise data-parallel approach in moving from 128-bit to 256-bit SIMD on the Intel Haswell architecture. To further investigate scalability, Section \ref{sec:GPU} addresses the implementation of our matcher using groups of 64 threads working together SIMT-style on a GPGPU system. Section \ref{sec:Concl} concludes the paper with a discussion of results and areas for future work.
\section{Regular Expression Notation and Grep}\label{sec:grep}
We follow common POSIX notation for regular expressions. A regular expression specifies a set of strings through a pattern notation. Individual characters normally stand for themselves, unless they are one of the special characters \verb:*+?[{\(|^$.: that serve as metacharacters of the notation system.
Thus the regular expression \verb:cat: is a pattern for the set consisting of the single 3-character string ``\verb:cat:''. The special characters must be escaped with a backslash to prevent interpretation as a metacharacter; thus \verb:\$: represents the dollar-sign and \verb:\\\\: represents the string consisting of two backslash characters. Character class bracket expressions are pattern elements that allow any character in a given class to be used in a particular context. For example, \verb:[@#%]: is a regular expression that stands for any of the three given symbols. Contiguous ranges of characters may be specified using hyphens; for example \verb:[0-9]: for digits and \verb:[A-Za-z0-9_]: for any alphanumeric character or underscore. If the caret character immediately follows the opening bracket, the class is negated; thus \verb:[^0-9]: stands for any character except a digit. The period metacharacter \verb:.: stands for the class of all characters.
Consecutive pattern elements stand for strings formed by concatenation; thus \verb:[cd][ao][tg]: stands for the set of 8 three-letter strings ``\verb:cat:'' through ``\verb:dog:''. The alternation operator \verb:|: allows a pattern to be defined to have two alternative forms; thus \verb:cat|dog: matches either ``\verb:cat:'' or ``\verb:dog:''. Concatenation takes precedence over alternation, but parentheses may be used to change this; thus \verb:(ab|cd)[0-9]: stands for any digit following one of the two prefixes ``\verb:ab:'' or ``\verb:cd:''.
Repetition operators may be appended to a pattern to specify a variable number of occurrences of that pattern. The Kleene star \verb:*: specifies zero-or-more occurrences matching the previous pattern, while \verb:+: specifies one-or-more occurrences. Thus \verb:[a-z][a-z]*: and \verb:[a-z]+: both specify the same set: strings of at least one lower-case letter. The postfix operator \verb:?: specifies an optional component, i.e., zero-or-one occurrence of strings matching the element.
Specific bounds may be given within braces: \verb:(ab){3}: specifies the string ``\verb:ababab:'', \verb:[0-9A-Fa-f]{2,4}: specifies strings of two, three or four hexadecimal digits, and \verb:[A-Z]{4,}: specifies strings of at least 4 consecutive capital letters.
The grep program searches a file for lines containing matches to a regular expression using any of the above notations. In addition, the pattern elements \verb:^: and \verb:$: may be used to match respectively the beginning or the end of a line. In line-based tools such as grep, \verb:.: matches any character except newlines; matches cannot extend over lines. Normally, grep prints all matching lines to its output. However, grep programs typically allow a command line flag such as \verb:-c: to specify that only a count of matching lines be produced; we use this option in our experimental evaluation to focus our comparisons on the performance of the underlying matching algorithms.
\section{Matching with Bit-Parallel Data Streams}\label{sec:bitwise}
Whereas the traditional approaches to regular expression matching using NFAs, DFAs or backtracking all rely on a byte-at-a-time processing model, the approach we introduce in this paper is based on quite a different concept: a data-parallel approach to simultaneous processing of data stream elements. Indeed, our most abstract model is that of unbounded data parallelism: processing all elements of the input data stream simultaneously. In essence, data streams are viewed as (very large) integers. The fundamental operations are bitwise logic, stream shifting and long-stream addition.
Depending on the available parallel processing resources, an actual implementation may divide an input stream into blocks and process the blocks sequentially. Within each block all elements of the input stream are processed together, relying on the availability of bitwise logic and addition scaled to the block size.
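The unbounded model can be mimicked directly with arbitrary-precision integers. In this illustrative Python sketch (the names and sample data are ours, not from the paper), bit $j$ of each integer corresponds to input position $j$, so a one-position stream shift is a single left shift of the whole stream:

```python
# Illustrative only: bit streams over the whole input held as big integers.
text = b"a4q-a9z"

def stream(predicate):
    """Bit j is set iff the predicate holds at input position j."""
    return sum(1 << j for j, c in enumerate(text) if predicate(c))

digits = stream(lambda c: c in b"0123456789")   # class stream for [0-9]
a_marks = stream(lambda c: c == ord("a"))       # class stream for [a]

after_a = a_marks << 1        # advance every marker one position at once
on_digit = after_a & digits   # markers now sitting on a digit
```

Every position of the input is examined in a single bitwise operation; a block-at-a-time implementation simply performs the same logic on fixed-width slices of these integers.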
On commodity Intel and AMD processors with 128-bit SIMD capabilities (SSE2), we typically process input streams 128 bytes at a time. In this case, we rely on the Parabix tool chain \cite{lin2012parabix} to handle the details of compilation to block-by-block processing. On the latest processors supporting the 256-bit AVX2 SIMD operations, we also use the Parabix tool chain, but substitute a parallelized long-stream addition technique to avoid the sequential chaining of 4 64-bit additions. Our GPGPU implementation uses scripts to modify the output of the Parabix tools, effectively dividing the input into blocks of 4K bytes. We have also adapted our long-stream addition technique to perform 4096-bit additions using 64 threads working in lock-step SIMT fashion. A similar technique is known to the GPU programming community\cite{}.
\begin{figure}[tbh]
\begin{center}
\begin{tabular}{cr}\\
input data & \verb:a4534q--b29z---az---a4q--bca22z--:\\
$B_7$ & \verb:.................................:\\
$B_6$ & \verb:1....1..1..1...11...1.1..111..1..:\\
$B_5$ & \verb:111111111111111111111111111111111:\\
$B_4$ & \verb:.11111...111....1....11.....111..:\\
$B_3$ & \verb:......11..11111.1111...11.....111:\\
$B_2$ & \verb:.11.1.11....111..111.1.11......11:\\
$B_1$ & \verb:...1....11.1....1........11.111..:\\
$B_0$ & \verb:1.11.111..1.1111.1111.111.11...11:\\
\verb:[a]: & \verb:1..............1....1......1.....:\\
\verb:[z]: & \verb:...........1....1.............1..:\\
\verb:[0-9]: & \verb:.1111....11..........1......11...:
\end{tabular}
\end{center}
\caption{Basis and Character Class Streams}
\label{fig:streams}
\end{figure}
A key concept in this streaming approach is the derivation of bit streams that are parallel to the input data stream, i.e., in one-to-one correspondence with the data element positions of the input streams. Typically, the input stream is a byte stream comprising the 8-bit character code units of a particular encoding such as extended ASCII, ISO-8859-1 or UTF-8.
However, the method may also easily be used with wider code units such as the 16-bit code units of UTF-16. In the case of a byte stream, the first step is to transpose the byte stream into eight parallel bit streams, such that bit stream $i$ comprises the $i^\text{th}$ bit of each byte. These streams form a set of basis bit streams from which many other parallel bit streams can be calculated, such as character class bit streams in which bit $j$ of the stream specifies whether character $j$ of the input stream is in the class or not. Figure \ref{fig:streams} shows an example of an input byte stream in ASCII, the eight basis bit streams of the transposed representation, and the character class bit streams \verb:[a]:, \verb:[z]:, and \verb:[0-9]: that may be computed from the basis bit streams using bitwise logic. Zero bits are marked with periods ({\tt .}) so that the one bits stand out. Transposition and character class construction are straightforward using the Parabix tool chain \cite{lin2012parabix}.
\begin{figure}[tbh]
\begin{center}
\begin{tabular}{cr}\\
input data & \verb:a4534q--b29z---az---a4q--bca22z--:\\
$M_1$ & \verb:.1..............1....1......1....:\\
$M_2$ & \verb:.11111..........1....11.....111..:\\
$M_3$ & \verb:.................1.............1.:
\end{tabular}
\end{center}
\caption{Marker Streams in Matching {\tt a[0-9]*z}}
\label{fig:streams2}
\end{figure}
\paragraph*{Marker Streams.} Now consider how bit-parallel data streams can be used in regular expression matching. Consider the problem of searching the input stream of Figure \ref{fig:streams} to find occurrences of strings matching the regular expression \verb:a[0-9]*z:. The matching process involves the concept of {\em marker streams}, that is, streams that mark the positions of current matches during the overall process.
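The transposition step can be sketched in Python as follows (a naive bit-at-a-time model for clarity; the Parabix tool chain performs this with SIMD operations, and the helper names here are ours):

```python
def basis_bit_streams(data: bytes):
    """Transpose a byte stream into 8 parallel basis bit streams:
    bit j of stream i is bit i (0 = least significant) of byte j.
    Each stream is a Python int with position j at bit 1 << j."""
    streams = [0] * 8
    for j, byte in enumerate(data):
        for i in range(8):
            if byte >> i & 1:
                streams[i] |= 1 << j
    return streams

def char_class(data: bytes, members: bytes) -> int:
    """Character class stream: bit j set iff data[j] is in `members`.
    (A Parabix implementation derives this from the basis streams with
    bitwise logic; direct membership is used here for clarity.)"""
    cc = 0
    for j, byte in enumerate(data):
        if byte in members:
            cc |= 1 << j
    return cc
```

The transposition is information-preserving: the original bytes can be rebuilt by gathering bit $j$ of each of the eight streams.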
In this case there are three marker streams computed during the match process, namely, $M_1$ representing match positions after an initial \verb:a: character has been found, $M_2$ representing positions reachable from positions marked by $M_1$ by further matching zero or more digits (\verb:[0-9]*:), and finally $M_3$, the stream marking positions after a final \verb:z: has been found. Leaving aside for the moment the details of how these streams are computed, Figure \ref{fig:streams2} shows what each of these streams should be for our example matching problem. Note our convention that a marker stream contains a 1 bit at the next character position to be matched, that is, immediately past the last position that was matched.
\paragraph*{MatchStar.} MatchStar takes a marker bitstream and a character class bitstream as input. It returns all positions that can be reached by advancing the marker bitstream zero or more times through the character class bitstream.
\begin{figure}[tbh]
\begin{center}
\begin{tabular}{cr}\\
input data & \verb:a4534q--b29z---az---a4q--bca22z--:\\
$M_1$ & \verb:.1..............1....1......1....:\\
$D = \text{\tt [0-9]}$ & \verb:.1111....11..........1......11...:\\
$T_0 = M_1 \wedge D$ & \verb:.1...................1......1....:\\
$T_1 = T_0 + D$ & \verb:.....1...11...........1.......1..:\\
$T_2 = T_1 \oplus D$ & \verb:.11111...............11.....111..:\\
$M_2 = T_2 \, | \, M_1$ & \verb:.11111..........1....11.....111..:
\end{tabular}
\end{center}
\caption{$M_2 = \text{MatchStar}(M_1, D)$}
\label{fig:matchstar}
\end{figure}
Figure \ref{fig:matchstar} illustrates the MatchStar method. In this figure, it is important to note that our bitstreams are shown in natural left-to-right order reflecting the conventional presentation of our character data input. However, this reverses the normal order of presentation when considering bitstreams as numeric values. The key point here is that when we perform bitstream addition, we will show bit movement from left-to-right.
For example, \verb:111.: + \verb:1...: = \verb:...1:.
The first row of the figure is the input data; the second and third rows are the input bitstreams: the initial marker position bitstream and the character class bitstream for digits derived from the input data.
In the first operation ($T_0$), marker positions that cannot be advanced are temporarily removed from consideration by masking off marker positions that are not character class positions using bitwise logic. Next, the temporary marker bitstream is added to the character class bitstream. The addition produces 1s in three types of positions: immediately following a block of character class positions that spanned one or more marker positions, at any character class positions that were not affected by the addition (and are not part of the desired output), and at any marker position that was not the first in its block of character class positions. Any character class positions that have a 0 in $T_1$ were affected by the addition and are part of the desired output. These positions are obtained and the undesired 1 bits are removed by XORing with the character class stream. $T_2$ is now only missing marker positions that were removed in the first step as well as marker positions that were 1s in $T_1$. The output marker stream is obtained by ORing $T_2$ with the initial marker stream.
In general, given a marker stream $M$ and a character class stream $C$, the operation of MatchStar is defined by the following equation.
$\text{MatchStar}(M, C) = (((M \wedge C) + C) \oplus C) \vee M$
Given a set of initial marker positions, the result stream marks all possible positions that can be reached by 0 or more occurrences of characters in class $C$ from each position in $M$.
\input{compilation}
\input{re-Unicode}
\section{Block-at-a-Time Processing}\label{sec:blockwise}
The unbounded stream model of the previous section must of course be translated into an implementation that proceeds block-at-a-time for realistic application.
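The MatchStar equation can be checked directly in Python: with bit $j$ representing position $j$, integer addition carries bits toward later positions, exactly the left-to-right movement shown in the figures (the `bits` helper for decoding the figure notation is ours):

```python
def match_star(M: int, C: int) -> int:
    """MatchStar(M, C) = (((M & C) + C) ^ C) | M.
    With bit j <-> input position j, the addition carries from earlier to
    later positions, advancing each marker through an entire run of class
    characters in a single operation."""
    return (((M & C) + C) ^ C) | M

def bits(s: str) -> int:
    """Decode the paper's stream notation: '1' = one bit, '.' = zero bit."""
    return sum(1 << j for j, ch in enumerate(s) if ch == "1")
```

Applying `match_star` to the $M_1$ and $D$ streams of the example reproduces the $M_2$ stream of Figure \ref{fig:matchstar}.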
For this, we primarily rely on the Pablo compiler of the Parabix toolchain \cite{lin2012parabix}. Given input statements expressed as arbitrary-length bitstream equations, Pablo produces block-at-a-time C++ code that initializes and maintains all the necessary carry bits for each of the additions and shifts involved in the bitstream calculations.
In the present work, our principal contribution to the Parabix tool chain is to incorporate the technique of long-stream addition described below. Otherwise, we were able to use Pablo directly in compiling our SSE2 and AVX2 implementations. Our GPU implementation required some scripting to modify the output of the Pablo compiler for our purpose.
\paragraph*{Long-Stream Addition.} The maximum word size for addition on commodity processors is typically 64 bits. In order to implement long-stream addition for block sizes of 256 or larger, a method for propagating carries through the individual stages of 64-bit addition is required. However, the normal technique of sequential addition using add-with-carry instructions, for example, is far from ideal. We have developed a general model using SIMD methods for constant-time long-stream addition up to 4096 bits.
We assume the availability of the following SIMD/SIMT operations operating on vectors of $f$ 64-bit fields.
\begin{itemize}
\item \verb#simd<64>::add(X, Y)#: vertical SIMD addition of corresponding 64-bit fields in two vectors to produce a result vector of $f$ 64-bit fields,
\item \verb#simd<64>::eq(X, -1)#: comparison of the 64-bit fields of \verb:X: each with the constant value -1 (all bits 1), producing an $f$-bit mask value,
\item \verb#hsimd<64>::mask(X)#: gathering the high bit of each 64-bit field into a single compressed $f$-bit mask value,
\item normal bitwise logic operations on $f$-bit masks, and
\item \verb#simd<64>::spread(x)#: distributing the bits of an $f$-bit mask, one bit each, to the $f$ 64-bit fields of a vector.
\end{itemize}
\section{Basic Concepts}
\label{Basic Concepts}
In this section, we define the notation and basic concepts used throughout this paper.
\subsection{Notation}
We use the following notations. Let $P=p_{1}p_{2}\ldots{}p_{m}$ be a pattern of length $m$ and $T=t_{1}t_{2}\ldots{}t_{n}$ be a text of length $n$, both defined over a finite alphabet $\Sigma$ of size $\alpha$. The goal of simple pattern expression matching is to find all the text positions of T that follow an occurrence of P. P may be a simple pattern, extended pattern, or a regular expression.
We use the symbols $\lnot{}$, $\lor{}$, $\land{}$, and $\oplus{}$ to represent bitwise NOT, OR, AND, XOR, respectively. Operators $\ll{}$ and $\gg{}$ represent logical left shift and logical right shift, respectively, and are further modulated by the number of bit positions a given value shall be shifted by, for example ``shift left by $n$''. Vacant bit-positions are filled in with zeroes.
\subsection{Regular Expressions}
% Define regular expressions (a recursive def), character classes, bounded repetition
\section{Background}
\label{Background}
%\input{background.tex}
\subsection{Classical Methods}
\subsection{Regular Expressions and Finite Automata}
The origins of regular expression matching date back to automata theory and formal language theory developed by Kleene in the 1950s \cite{kleene1951}.
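A reference model of these operations may clarify their semantics. The following Python sketch is ours (a vector of $f$ fields is represented as a list of 64-bit integers, with field 0 least significant):

```python
MASK64 = (1 << 64) - 1  # one 64-bit field, all ones

def simd_add(X, Y):
    """simd<64>::add(X, Y): independent (vertical) 64-bit additions."""
    return [(x + y) & MASK64 for x, y in zip(X, Y)]

def simd_eq_all_ones(X):
    """simd<64>::eq(X, -1): f-bit mask, bit i set iff field i is all ones."""
    return sum(1 << i for i, x in enumerate(X) if x == MASK64)

def hsimd_mask(X):
    """hsimd<64>::mask(X): gather the high bit of each field into an f-bit mask."""
    return sum(((x >> 63) & 1) << i for i, x in enumerate(X))

def simd_spread(m, f):
    """simd<64>::spread(x): one bit of the mask to each of f 64-bit fields."""
    return [(m >> i) & 1 for i in range(f)]
```

Note that `hsimd_mask` and `simd_spread` are inverses in the communication sense used below: each moves exactly one bit of information per 64-bit field.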
Thompson \cite{thompson1968} is credited with the first construction to convert regular expressions to nondeterministic finite automata (NFA) for regular expression matching. Following Thompson's approach, a regular expression of length m is first converted to an NFA with O(m) nodes. It is possible to search a text of length n using the NFA directly in O(mn) worst case time. Often, a more efficient choice is to convert the NFA into a DFA. A DFA has only a single active state, which allows the text to be searched in O(n) worst-case optimal time. It is well known that the conversion of the NFA to the DFA may result in the state explosion problem. That is, the resultant DFA may have O($2^m$) states.
Thompson's original work marked the beginning of a long line of regular expression pattern matching methods that process an input string character-at-a-time, and that transition from state to state according to the current state and the next input character.
Whereas traditional automata techniques achieve O(n) worst-case optimal efficiency, simple string matching algorithms, such as the Boyer-Moore family of algorithms, skip input characters to achieve sublinear times in the average case \cite{boyer1977fast}. Boyer-Moore methods begin comparison from the end of the pattern instead of the beginning and precompute skip information to determine how far ahead a pattern search can skip in the input whenever a non-matching character is encountered. Generally, Boyer-Moore family algorithms become faster as the pattern being searched for becomes longer. In many cases, the techniques used to skip characters in simple string matching approaches can be extended to regular expression matching. Widely known techniques used to facilitate character skipping in regular expression matching include necessary strings and backward suffix matching inspired by the Backward Dawg Matching (BDM) algorithm \cite{Navarro02patternmatching}.
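As an illustration of the skipping behaviour described above, here is a compact Boyer-Moore-Horspool variant in Python (a simplified member of the Boyer-Moore family using only the bad-character rule, not the exact algorithm of \cite{boyer1977fast}):

```python
def horspool_search(pattern, text):
    """Boyer-Moore-Horspool: align the pattern, compare, then skip ahead
    using a precomputed bad-character table keyed on the text character
    under the pattern's last position. Returns all match start positions."""
    m, n = len(pattern), len(text)
    if m == 0 or n < m:
        return []
    # Shift = distance from a character's last occurrence (excluding the
    # final position) to the end of the pattern; default shift is m.
    skip = {c: m - i - 1 for i, c in enumerate(pattern[:-1])}
    matches, pos = [], 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            matches.append(pos)
        pos += skip.get(text[pos + m - 1], m)  # characters skipped, not scanned
    return matches
```

When the text character under the pattern's end does not occur in the pattern at all, the search jumps a full pattern length, which is why longer patterns tend to search faster.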
\subsection{Bit-parallel Simulation of Automata}
Bit-parallelism \cite{Baeza-yates_anew} exploits the parallelism of bitwise operations within a machine word to simulate the active states of an NFA, as in the Shift-Or and Shift-And algorithms \cite{wu1992fast}. Bit-parallel suffix automata, exemplified by the Backward Non-Deterministic Dawg Matching (BNDM) algorithm \cite{navarro1998bit}, combine bit-parallel NFA simulation with BDM-style character skipping.

\subsection{Software Tools}
%Thompson created the first grep (UNIX grep) as a standalone adaptation
%of the regular expression parser he had written for the UNIX ed utility.
%In 1976, Aho improved upon Thompson's implementation that with a DFA-based implementation called egrep.
%Egrep was faster then grep for simple patterns but for more complex searches it lagged behind because of the
%time it took to build a complete finite automaton for the regular expression before it could even start searching.
%Since grep used a nondeterministic finite automaton it took less time to build the state machine but more time
%to evaluate strings with it. Aho later used a technique called cached lazy evaluation to improve the performance of egrep.
%It took zero set-up time and just one additional test in the inner loop.
%http://pages.cs.wisc.edu/~mdant/cs520_4.html
%Given a regular expression R and a test T the regular expression matching
%problem finds all ending position of substrings in Q that matches a string in
%the language denoted by R.
%The behaviour of Gnu grep, agrep, and nr-grep are differ ...
%Gnu grep
%agrep
%nr-grep
%re2

\section{Bit-parallel Data Streams}
\label{Bit-parallel Data Streams}
The bit-parallel data stream method uses the wide SIMD registers commonly found on commodity processors to process multiple byte positions at a time using bitwise logic, shifting and other operations. A significant advantage of the bit-parallel data stream method over pattern matching methods that rely on bit-parallel automata simulation is the potential to skip a full register width of characters at a time in text with a low occurrence frequency of the pattern.
\subsection{Match Star}
%Wikipedia
Backtracking is a general algorithm for finding all solutions to some computational problem, one that incrementally builds candidates to the solutions.

\section{Compiler Technology}
\label{Compiler Technology}

\section{Methodology}
\label{Methodology}
%\input{methodology.tex}
We compare the performance of our bit-parallel data stream techniques against Gnu grep, agrep, nr-grep, and re2.

\section{Experimental Results}
\label{Experimental Results}
%\input{results.tex}

\section{Conclusion}
\label{Conclusion}
%\input{conclusion.tex}

In this model, the \verb#hsimd<64>::mask(X)# and \verb#simd<64>::spread(x)# operations model the minimum communication requirements between the parallel processing units (SIMD lanes or SIMT processors). In essence, we just need the ability to quickly send and receive 1 bit of information per parallel unit. The \verb#hsimd<64>::mask(X)# operation gathers 1 bit from each of the processors to a central resource. After calculations on the gathered bits are performed, we then just need an operation to invert the communication, i.e., sending 1 bit from the central resource back to each of the parallel units. There are a variety of ways in which these facilities may be implemented depending on the underlying architecture; details of our AVX2 and GPU implementations are presented later.

Given these operations, our method for long stream addition of two $f \times 64$ bit values \verb:X: and \verb:Y: is the following.
\begin{enumerate}
\item Form the vector of 64-bit sums of \verb:X: and \verb:Y:.
$$\text{\tt R} = \text{\tt simd<64>::add(X, Y)}$$
\item Extract the $f$-bit masks of \verb:X:, \verb:Y: and \verb:R:.
$$\text{\tt x} = \text{\tt hsimd<64>::mask(X)}$$
$$\text{\tt y} = \text{\tt hsimd<64>::mask(Y)}$$
$$\text{\tt r} = \text{\tt hsimd<64>::mask(R)}$$
\item Compute an $f$-bit mask of carries generated for each of the 64-bit additions of \verb:X: and \verb:Y:.
$$\text{\tt c} = (\text{\tt x} \wedge \text{\tt y}) \vee ((\text{\tt x} \vee \text{\tt y}) \wedge \neg \text{\tt r})$$
\item Compute an $f$-bit mask of all fields of {\tt R} that will overflow with an incoming carry bit. This is the {\em bubble mask}.
$$\text{\tt b} = \text{\tt simd<64>::eq(R, -1)}$$
\item Determine an $f$-bit mask identifying the fields of {\tt R} that need to be incremented to produce the final sum. Here we find a new application of MatchStar.
$$\text{\tt i} = \text{\tt MatchStar(c*2, b)}$$

This is the key step. The mask {\tt c} of outgoing carries must be shifted one position ({\tt c*2}) so that each outgoing carry bit becomes associated with the next digit. At the incoming position, the carry will increment the 64-bit digit. However, if this digit is all ones (as signalled by the corresponding bit of the bubble mask {\tt b}), then the addition will generate another carry. In fact, if there is a sequence of digits that are all ones, then the carry must bubble through each of them. This is just MatchStar.
\item Compute the final result {\tt Z}.
$$\text{\tt Z} = \text{\tt simd<64>::add(R, simd<64>::spread(i))}$$
\end{enumerate}

\begin{figure}
\begin{center}
\begin{tabular}{c||r|r|r|r|r|r|r|r||}\cline{2-9}
{\tt X} & {\tt 19} & {\tt 31} & {\tt BA} & {\tt 4C} & {\tt 3D} & {\tt 45} & {\tt 21} & {\tt F1} \\ \cline{2-9}
{\tt Y} & {\tt 22} & {\tt 12} & {\tt 45} & {\tt B3} & {\tt E2} & {\tt 16} & {\tt 17} & {\tt 36} \\ \cline{2-9}
{\tt R} & {\tt 3B} & {\tt 43} & {\tt FF} & {\tt FF} & {\tt 1F} & {\tt 5B} & {\tt 38} & {\tt 27} \\ \cline{2-9}
{\tt x} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} \\ \cline{2-9}
{\tt y} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} \\ \cline{2-9}
{\tt r} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} \\ \cline{2-9}
{\tt c} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 1} \\ \cline{2-9}
{\tt c*2} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} \\ \cline{2-9}
{\tt b} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} \\ \cline{2-9}
{\tt i} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} \\ \cline{2-9}
{\tt Z} & {\tt 3B} & {\tt 44} & {\tt 00} & {\tt 00} & {\tt 1F} & {\tt 5B} & {\tt 39} & {\tt 27} \\ \cline{2-9}
\end{tabular}
\end{center}
\caption{Long Stream Addition}\label{fig:longadd}
\end{figure}

Figure \ref{fig:longadd} illustrates the process using 8-bit fields rather than 64-bit fields, with all field values shown in hexadecimal notation. Note that two of the individual 8-bit additions produce carries, while two others produce {\tt FF} values that generate bubble bits. The net result is that four of the original 8-bit sums must be incremented to produce the long stream result.

A slight extension to the process produces a long-stream full adder that can be used in chained addition.
In this case, the adder must take an additional carry-in bit {\tt p} and produce a carry-out bit {\tt q}. This may be accomplished by incorporating {\tt p} into the low bit position when calculating the increment mask, and then extracting the carry-out {\tt q} from the high bit position.
$$\text{\tt i} = \text{\tt MatchStar(c*2+p, b)}$$
$$\text{\tt q} = \text{\tt i >> f}$$

As described subsequently, we use a two-level long-stream addition technique in both our AVX2 and GPU implementations. In principle, one can extend the technique to additional levels. Using 64-bit adders throughout, $\lceil\log_{64}{n}\rceil$ steps are needed for $n$-bit addition. A three-level scheme could coordinate 64 groups each performing 4096-bit long additions in a two-level structure. However, whether there are reasonable architectures that can support fine-grained SIMT style coordination at this level is an open question.

Using the methods outlined, it is quite conceivable that instruction set extensions to support long-stream addition could be added to future SIMD and GPU processors. Given the fundamental nature of addition as a primitive and its particular application to regular expression matching as shown herein, it seems reasonable to expect such instructions to become available. Alternatively, it may be worthwhile to simply ensure that the \verb#hsimd<64>::mask(X)# and \verb#simd<64>::spread(X)# operations are efficiently supported.

\input{sse2}
\input{analysis}
\input{avx2}

\section{GPU Implementation}\label{sec:GPU}

To further assess the scalability of our regular expression matching using bit-parallel data streams, we implemented a GPGPU version in OpenCL. We arranged for 64 work groups, each having 64 threads. Input files are divided in data parallel fashion among the 64 work groups. Each work group carries out the regular expression matching operations 4096 bytes at a time using SIMT processing. Figure \ref{fig:GPUadd} shows our implementation of long-stream addition on the GPU.
Each thread maintains its own carry and bubble values and performs synchronized updates with the other threads using a six-step parallel-prefix style process. Our GPU test machine was an AMD A10-6800K APU with Radeon HD Graphics, having a processor speed of 2 GHz and 32 GB of memory.

\begin{figure*}[tbh]
\begin{center}\small
\begin{verbatim}
inline BitBlock adc(int idx, BitBlock a, BitBlock b,
                    __local BitBlock *carry, __local BitBlock *bubble,
                    BitBlock *group_carry, const int carryno){
  BitBlock carry_mask;
  BitBlock bubble_mask;

  BitBlock partial_sum = a+b;
  BitBlock gen = a&b;
  BitBlock prop = a^b;
  carry[idx] = ((gen | (prop & ~partial_sum)) & CARRY_BIT_MASK)
               >> (WORK_GROUP_SIZE-1-idx);
  bubble[idx] = (partial_sum + 1)? 0:(((BitBlock)1)<<idx);

  barrier(CLK_LOCAL_MEM_FENCE);
  for(int offset=WORK_GROUP_SIZE/2; offset>0; offset=offset>>1){
    carry[idx] = carry[idx]|carry[idx^offset];
    bubble[idx] = bubble[idx]|bubble[idx^offset];
    barrier(CLK_LOCAL_MEM_FENCE);
  }

  carry_mask = (carry[0]<<1)|group_carry[carryno];
  bubble_mask = bubble[0];

  BitBlock s = (carry_mask + bubble_mask) & ~bubble_mask;
  BitBlock inc = s | (s-carry_mask);
  BitBlock rslt = partial_sum + ((inc>>idx)&0x1);
  group_carry[carryno] = (carry[0]|(bubble_mask & inc))>>63;
  return rslt;
}
\end{verbatim}
\end{center}
\caption{OpenCL 4096-bit Addition}
\label{fig:GPUadd}
\end{figure*}

Figure \ref{fig:SSE-AVX-GPU} compares the performance of our SSE2, AVX2 and GPU implementations.
\begin{figure}
\begin{center}
\begin{tikzpicture}
\begin{axis}[
xtick=data,
ylabel={Running Time (ms per megabyte)},
xticklabels={@,Date,Email,URIorEmail,xquote},
tick label style={font=\tiny},
enlarge x limits=0.15,
%enlarge y limits={0.15, upper},
ymin=0,
legend style={at={(0.5,-0.15)},
anchor=north,legend columns=-1},
ybar,
bar width=7pt,
]
\addplot file {data/ssetime.dat};
\addplot file {data/avxtime.dat};
\addplot file {data/gputime.dat};
\legend{SSE2,AVX2,GPU}
\end{axis}
\end{tikzpicture}
\end{center}
\caption{Running Time}\label{fig:SSE-AVX-GPU}
\end{figure}

\input{conclusion}

%\appendix
%\section{Appendix Title}
%This is the text of the appendix, if you need one.

\section*{Acknowledgments}
This research was supported by grants from the Natural Sciences and Engineering Research Council of Canada and MITACS, Inc.

\bibliographystyle{IEEEtranS}
\bibliography{reference}
\end{document}