Changeset 3630 for docs/Working

Feb 21, 2014, 1:54:04 PM (5 years ago)

Modify data

4 edited


  • docs/Working/re/data/gputime.dat

    r3626 r3630  
    221 0.17
    332 0.26
    4 3 0.47
    5 4 0.36
     43 0.42
     54 0.41
  • docs/Working/re/re-main.tex

    r3625 r3630  
    581581style process.
    583 Our GPU test machine was an AMD A10-5800K APU with Radeon(tm) HD Graphics
    584 having a processor speed of 4.30 GHz and 32.0GB of memory.
     621Our GPU test machine was an AMD A10-6800K APU with Radeon(tm) HD Graphics.
    624622Figure \ref{fig:SSE-AVX-GPU} compares the performance of
    625623our SSE2, AVX and GPU implementations.
  • docs/Working/re/re-main.tex.backup

    r3154 r3630  
    1 \documentclass[a4paper,10pt]{article}
    2 \usepackage[utf8]{inputenc}
    4 %opening
    5 \title{Fast Regular Expression Matching with Bit-parallel Data Streams}
    6 \author{
    7 {Robert D. Cameron} \\
    8 \and
    9 {Kenneth S. Herdy} \\
    10 \and
    11 {Ben Hull} \\
    12 \and
    13 {Thomas C. Shermer} \\
    14 \\School of Computing Science
    15 \\Simon Fraser University
    16 }
     3%replace XXX with the submission number you are given from the PACT submission site.
     15\title{Bitwise Data Parallelism in Regular Expression Matching}
24 A parallel regular expression matching method is
    25 introduced and compared with the state of the art in software tools designed for
    26 efficient on-line search. %such as the {\em grep} family pattern matching tools.
    27 The method is based on the concept of bit-parallel data streams,
    28 in which parallel streams of bits are formed such
    29 that each stream comprises bits in one-to-one correspondence with the
    30 character code units of a source data stream.
    32 An implementation of the method in the form of a regular expression
    33 compiler is discussed. The compiler accepts a regular expression and
    34 forms unbounded bit-parallel data stream operations. Bit-parallel operations
    35 are then transformed into a low-level C-based implementation for compilation into native
    36 pattern matching applications. These low-level C-based implementations take advantage of
    37 the SIMD (single-instruction multiple-data) capabilities of commodity
    38 processors to yield a dramatic speed-up over traditional byte-at-a-time approaches.
    39 On processors supporting W-bit addition operations, the method processes W source characters
    40 in parallel and performs up to W finite state transitions per clock cycle.
42 We introduce a new bit-parallel scanning primitive, {\em Match Star},
43 which performs parallel Kleene closure over character classes
44 without backtracking. % expand and rephrase description of Match Star
46 We evaluate the performance of our method in comparison with several widely known {\em grep}
47 family implementations, {\em Gnu grep}, {\em agrep}, {\em nr-grep},
    48 and regular expression engines such as {\em Google's re2}.
    49 Performance results are analyzed using the performance monitoring counters
    50 of commodity hardware. Overall, our results demonstrate a dramatic speed-up
51 over publicly available alternatives.
    56 \label{Introduction}
    57 %\input{introduction.tex}
59 Regular expression matching is an extensively studied problem with applications to
60 text processing and bioinformatics, and with numerous algorithms
61 and software tools developed to address its particular
62 processing demands.
    64 The pattern matching problem can be
    65 stated as follows. Given a text T$_{1..n}$ of n characters and a pattern P,
    66 find all the text positions of T that start an occurrence of P. Alternatively,
    67 one may want all the final positions of occurrences. Some applications require
    68 slightly different output such as the line that matches the pattern.
    70 A pattern P can be a simple string, but it can also be a regular expression.
71 A regular expression is an expression that specifies a set of strings
    72 and is composed of (i) simple strings and (ii) the
    73 union, concatenation and Kleene closure of other regular expressions.
74 To avoid parentheses it is assumed that the Kleene star has the highest priority,
75 followed by concatenation and then alternation; however, most formalisms provide grouping
76 operators to allow the definition of scope and operator precedence.
77 Readers unfamiliar with the concept of regular expression matching are referred
78 to classical texts such as \cite{aho2007}.
    80 Regular expression matching is commonly performed using a wide variety of
81 publicly available software tools for on-line pattern matching. For instance,
    82 UNIX grep, Gnu grep, agrep, cgrep, nrgrep, and Perl regular
83 expressions \cite{abou-assaleh2004}. Amongst these tools, Gnu grep (egrep),
    84 agrep, and nrgrep are widely known and considered as
85 the fastest regular expression matching tools in practice \cite{navarro2000},
86 and are of particular interest to this study.
    88 % simple patterns, extended patterns, regular expressions
    90 % motivation / previous work
91 Although traditional finite state machine methods used in the scanning and parsing of
92 text streams are considered to be the hardest of the ``13 dwarves'' to parallelize
93 [1], parallel bitstream technology shows considerable promise for these types of
    94 applications [3, 4]. In this approach, character streams are processed W positions
    95 at a time using the W-bit SIMD registers commonly found on commodity processors
    96 (e.g., 128-bit XMM registers on Intel/AMD chips). This is achieved by
    97 first slicing the byte streams into eight separate basis bitstreams, one for each bit
    98 position within the byte. These basis bitstreams are then combined with bitwise
    99 logic and shifting operations to compute further parallel bit streams of interest.
    101 We further increase the potential parallelism in our approach by introducing a new parallel
    102 scanning primitive which we have termed {\em Match Star}. % expand and reword desc. of Match Star
     30The use of regular expressions to search texts for occurrences
     31of string patterns has a long history and
     32remains a pervasive technique throughout computing applications today.
     33% {\em a brief history}
     34The origins of regular expression matching date back to automata theory
     35developed by Kleene in the 1950s \cite{kleene1951}.
     36Thompson \cite{thompson1968} is credited with the first construction to convert regular expressions
     37to nondeterministic finite automata (NFA).
     38Following Thompson's approach, a regular expression of length $m$ is converted
     39to an NFA with $O(m)$ states. It is then possible to search a text of length $n$ using the
     40NFA in worst case $O(mn)$ time. Often, a more efficient choice
     41is to convert an NFA into a DFA. A DFA has a single active state at any time
     42throughout the matching process and
 43hence it is possible to search a text of length $n$ in $O(n)$ time
     44\footnote{It is well known that the conversion of an NFA to an equivalent DFA may result
     45in state explosion. That is, the number of resultant DFA states may increase exponentially.}.
     47A significant portion of the research in fast regular expression matching can be
     48regarded as the ``quest for efficient automata'' \cite{navarro98fastand}.
     49In \cite{baeza1992new}, Baeza-Yates and Gonnet
     50presented a new approach to string search based on bit-parallelism.
     51This technique takes advantage of the intrinsic parallelism of bitwise operations
     52within a computer word.
 53Given a $w$-bit word, the number of operations that a string search algorithm
 54performs can be reduced by a factor of $w$.
     55Using this fact, the Shift-Or algorithm simulates an NFA using
     56bitwise operations and achieves $O(nm/w)$ worst-case time \cite{navarro2000}.
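As a concrete illustration of this style of bit-parallelism, the classic Shift-Or recurrence can be sketched in a few lines of Python. This is an expository model only (the function name `shift_or` and its interface are ours, not from any of the cited tools); Python's arbitrary-precision integers stand in for machine words, and a `0` bit marks an active NFA state:

```python
def shift_or(text, pattern):
    """Bit-parallel Shift-Or search: one bitwise step per input character."""
    m = len(pattern)
    full = (1 << m) - 1
    # B[c] has a 0 bit at position i wherever pattern[i] == c
    B = {}
    for i, c in enumerate(pattern):
        B[c] = B.get(c, full) & ~(1 << i)
    d = full                          # all states inactive (0 = active)
    ends = []
    for j, c in enumerate(text):
        d = ((d << 1) | B.get(c, full)) & full
        if not d & (1 << (m - 1)):
            ends.append(j)            # an occurrence ends at position j
    return ends

print(shift_or("abracadabra", "abra"))   # [3, 10]
```

Each input character costs a constant number of word operations regardless of pattern length (up to $m \le w$), which is the source of the $O(nm/w)$ bound quoted above.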
     57A disadvantage of the bit-parallel Shift-Or pattern matching approach
     58is an inability to skip input characters.
     59Simple string matching algorithms,
     60such as the Boyer-Moore family of algorithms \cite{boyer1977fast,horspool1980practical} skip input characters
     61to achieve sublinear times in the average case.
     62Backward Dawg Matching (BDM) string matching algorithms \cite{crochemore1994text}
     63based on suffix automata are able to skip characters.
     64The Backward Nondeterministic Dawg Matching (BNDM) pattern matching algorithm \cite{wu1992fast}
     65combines the bit-parallel advantages of the Shift-Or approach
 66with the character skipping property of BDM algorithms. The nrgrep pattern matching tool
     67is built over the BNDM algorithm. Prior to the data-parallel approach to simultaneous
     68processing of data stream elements as discussed herein, nrgrep
     69was by far the fastest grep tool
     70for matching complex patterns and achieved similar performance
     71to that of the fastest existing string
     72matching tools for simple patterns \cite{navarro2000}.
     74There has been considerable
     75interest in accelerating regular expression matching
     76on parallel hardware
     77such as multi-core processors (CPUs), graphics processing units (GPUs),
     78field-programmable gate arrays (FPGAs), and specialized architectures such as
     79the Cell Broadband Engine (Cell BE). % FPGA results (synthesis of patterns into logic circuits) vs. memory based approaches (STTs in memory)
     81Scarpazza and Braudaway \cite{scarpazza2008fast} demonstrated that
     82text processing algorithms that exhibit irregular memory access patterns
     83can be efficiently executed on multicore hardware.
     84In related work, Pasetto et al. presented a flexible tool that
     85performs small-ruleset regular expression matching at a rate of
     862.88 Gbps per chip on Intel Xeon E5472 hardware \cite{pasetto2010}.
     87Naghmouchi et al. \cite{scarpazza2011top,naghmouchi2010} demonstrated that the Aho-Corasick (AC)
     88string matching algorithm \cite{aho1975} is well suited for parallel
     89implementation on multi-core CPUs, GPUs and the Cell BE.
 90On each platform, both thread-level parallelism (cores) and data-level parallelism
     91(SIMD units) were leveraged for performance.
 92Salapura et al. \cite{salapura2012accelerating} advocated the use of vector-style processing for regular expressions
 93in business analytics applications and leveraged the SIMD hardware available
 94on multi-core processors to achieve a speedup of greater than 1.8 over a
     95range of data sizes of interest.
     97In \cite{scarpazza2008}, Scarpazza and Russell presented a SIMD tokenizer
     98that delivered 1.00--1.78 Gbps on a single
     99Cell BE chip and extended this approach for emulation on the Intel Larrabee
     100instruction set \cite{scarpazza2009larrabee}.
     101On the Cell BE, Scarpazza \cite{scarpazza2009cell} described a pattern matching
     102implementation that delivered a throughput of 40
 103Gbps for a small dictionary of approximately 100 patterns and a throughput of 3.3--3.4
     104Gbps for a larger dictionary of thousands of patterns. Iorio and van Lunteren \cite{iorio2008}
     105presented a string matching implementation for automata that achieved
     1064 Gbps on the Cell BE.
     107% GPU
     108In more recent work, Tumeo et al. \cite{tumeo2010efficient} presented a chunk-based
     109implementation of the AC algorithm for
 110accelerating string matching on GPUs. Lin et al. proposed
     111the Parallel Failureless Aho-Corasick (PFAC)
     112algorithm to accelerate pattern matching on GPU hardware and
     113achieved 143 Gbps throughput, 14.74 times faster
     114than the AC algorithm performed on a four core
     115multi-core processor using OpenMP \cite{lin2013accelerating}.
     117Whereas the existing approaches to parallelization have been
     118based on adapting traditional sequential algorithms to emergent
     119parallel architectures, we introduce both a new algorithmic
     120approach and its implementation on SIMD and GPU architectures.
     121This approach relies on a bitwise data parallel view of text
     122streams as well as a surprising use of addition to match
     123runs of characters in a single step.  The closest previous
     124work is that underlying bit-parallel XML parsing using 128-bit SSE2 SIMD
     125technology together with a parallel scanning primitive also
     126based on addition \cite{cameron2011parallel}.   
     127However, in contrast to the deterministic, longest-match
     128scanning associated with the ScanThru primitive of that
     129work, we introduce here a new primitive MatchStar
     130that can be used in full generality for nondeterministic
     131regular expression matching.   We also introduce a long-stream
     132addition technique involving a further application of MatchStar
     133that enables us to scale the technique to $n$-bit addition
     134in $\lceil\log_{64}{n}\rceil$ steps.   We ultimately apply this technique,
     135for example, to perform
     136synchronized 4096-bit addition on GPU warps of 64 threads.
     138There is also a strong keyword match between the bit-parallel
     139data streams used in our approach and the bit-parallelism
     140used for NFA state transitions in the classical algorithms of
 141Wu and Manber \cite{wu1992agrep}, Baeza-Yates and Gonnet \cite{baeza1992new}
     142and Navarro and Raffinot \cite{navarro1998bit}.
     143However those algorithms use bit-parallelism in a fundamentally
     144different way: representing all possible current NFA states
     145as a bit vector and performing parallel transitions to a new
     146set of states using table lookups and bitwise logic.    Whereas
     147our approach can match multiple characters per step, bit-parallel
     148NFA algorithms proceed through the input one byte at a time.
     149Nevertheless, the agrep \cite{wu1992agrep} and
     150nrgrep \cite{navarro2000} programs implemented using these techniques remain
     151among the strongest competitors in regular expression matching
     152performance, so we include them in our comparative evaluation.
    104154The remainder of this paper is organized as follows.
    105 Section~\ref{Basic Concepts} introduces the notations and basic concepts used throughout this paper.
    106 Section~\ref{Background} presents background material on classical automata-based approaches 
107 to regular expression matching and describes the efficient {\em grep} family utilities
    108 and regular expression pattern matching software libraries.
    109 Next, Section~\ref{Bit-parallel Data Streams} describes our parallel regular expression matching techniques.
    110 Section~\ref{Compiler Technology} presents our software toolchain for constructing pattern matching applications.
    111 Section~\ref{Methodology} describes the evaluation framework and Section~\ref{Experimental Results}
    112 presents a detailed performance analysis of our data parallel bitstream techniques in comparison to
    113 several software tools. Section~\ref{Conclusion} concludes the paper.
115 The fundamental contribution of this paper is a fully general approach to regular expression matching
    116 using bit-parallel data stream operations. The algorithmic aspects of this paper build upon
    117 the fundamental concepts of our previous work \cite{cameron2008high, cameron2009parallel, cameron2011parallel, lin2012parabix}.
    118 Specific contributions include:
     155Section \ref{sec:grep} briefly describes regular expression
     156notation and the grep problem.
     157Section \ref{sec:bitwise} presents our basic algorithm and MatchStar
     158using a model of arbitrary-length bit-parallel data streams.
     159Section \ref{sec:blockwise} discusses the block-by-block
     160implementation of our techniques including the long stream
     161addition techniques for 256-bit addition with AVX2 and
     1624096-bit additions with GPGPU SIMT.
     163Section \ref{sec:SSE2} describes our overall SSE2 implementation
     164and carries out a performance study in comparison with
     165existing grep implementations.
     166Given the dramatic variation in grep performance across
     167different implementation techniques, expressions and data sets,
     168Section \ref{sec:analysis} considers a comparison between
     169the bit-stream and NFA approaches from a theoretical perspective.
     170Section \ref{sec:AVX2} then examines and demonstrates
     171the scalability of our
     172bitwise data-parallel approach in moving from
     173128-bit to 256-bit SIMD on Intel Haswell architecture.
     174To further investigate scalability, Section \ref{sec:GPU}
     175addresses the implementation of our matcher using
     176groups of 64 threads working together SIMT-style on a GPGPU system.
     177Section \ref{sec:Concl} concludes the paper with a discussion of results and
     178areas for future work.
     180\section{Regular Expression Notation and Grep}\label{sec:grep}
     182We follow common POSIX notation for regular expressions.
     183A regular expression specifies a set of strings through
     184a pattern notation.   Individual characters normally
     185stand for themselves, unless they are one of the
     186special characters \verb:*+?[{\(|^$.: that serve as metacharacters
     187of the notation system.  Thus the regular expression \verb:cat:
     188is a pattern for the set consisting of the single 3-character
     189string ``\verb:cat:''.   The special characters must be escaped
 190with a backslash to prevent interpretation as a metacharacter; thus
 191\verb:\$: represents the dollar-sign and \verb:\\\\: represents
     192the string consisting of two backslash characters.
     193Character class bracket expressions are pattern elements
     194that allow any character in a given class to be used in a particular
     195context.  For example, \verb:[@#%]: is a regular expression
     196that stands for any of the three given symbols.  Contiguous
     197ranges of characters may be specified using hyphens;
     198for example \verb:[0-9]: for digits and \verb:[A-Za-z0-9_]:
     199for any alphanumeric character or underscore.  If the
     200caret character immediately follows the opening bracket,
     201the class is negated, thus \verb:[^0-9]: stands for
     202any character except a digit.  The period metacharacter
     203\verb:.: stands for the class of all characters.
     205Consecutive pattern elements stand for strings formed by
     206concatenation, thus \verb:[cd][ao][tg]: stands for the
     207set of 8 three-letter strings ``\verb:cat:'' through ``\verb:dog:''.
     208The alternation operator \verb:|: allows a pattern to be
 209defined to have two alternative forms, thus \verb:cat|dog:
     210matches either ``\verb:cat:'' or ``\verb:dog:''.  Concatenation
 211takes precedence over alternation, but parentheses may be
     212used to change this, thus \verb:(ab|cd)[0-9]: stands for any
     213digit following one of the two prefixes  ``\verb:ab:'' or ``\verb:cd:''.
     215Repetition operators may be appended to a pattern to specify
     216a variable number of occurrences of that pattern. 
     217The Kleene \verb:*: specifies zero-or-more occurrences
 218matching the previous pattern, while \verb:+: specifies one-or-more
 219occurrences.  Thus \verb:[a-z][a-z]*: and \verb:[a-z]+:
     220both specify the same set: strings of at least one lower-case
     221letter.  The postfix operator \verb:?: specifies an optional
     222component, i.e., zero-or-one occurrence of strings matching
     223the element.  Specific bounds may be given within braces:
     224\verb:(ab){3}: specifies the string ``\verb:ababab:'',
     225\verb:[0-9A-Fa-f]{2,4}: specifies strings of two, three
     226or four hexadecimal digits, and \verb:[A-Z]{4,}: specifies
     227strings of at least 4 consecutive capital letters.
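These notations behave the same way in most modern regex engines. As a quick sanity check of the repetition operators just described, using Python's re module purely for illustration (it is not part of the toolchain under discussion):

```python
import re

# (ab){3} matches exactly the string "ababab"
assert re.fullmatch(r"(ab){3}", "ababab")
assert not re.fullmatch(r"(ab){3}", "abab")

# [0-9A-Fa-f]{2,4}: strings of two to four hexadecimal digits
assert re.fullmatch(r"[0-9A-Fa-f]{2,4}", "BEef")
assert not re.fullmatch(r"[0-9A-Fa-f]{2,4}", "F")

# [a-z][a-z]* and [a-z]+ denote the same set of strings
for s in ("a", "dog", "Dog", ""):
    assert bool(re.fullmatch(r"[a-z][a-z]*", s)) == bool(re.fullmatch(r"[a-z]+", s))

print("notation checks passed")
```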
     229The grep program searches a file for lines containing matches
     230to a regular expression using any of the above notations.
     231In addition, the pattern elements \verb:^: and \verb:$:
     232may be used to match respectively the beginning or the
     233end of a line.  In line-based tools such as grep, \verb:.:
     234matches any character except newlines; matches cannot extend
     235over lines.
     236Normally, grep prints all matching
     237lines to its output.  However, grep programs typically
     238allow a command line flag such as \verb:-c: to specify
     239that only a count of matching lines be produced; we use
     240this option in our experimental evaluation to focus
     241our comparisons on the performance of the underlying matching
     244\section{Matching with Bit-Parallel Data Streams}\label{sec:bitwise}
     246Whereas the traditional approaches to regular expression matching
     247using NFAs, DFAs or backtracking all rely on a byte-at-a-time
     248processing model, the approach  we introduce in this paper is based
     249on quite a different concept:  a data-parallel approach to simultaneous
     250processing of data stream elements.  Indeed, our most abstract model
     251is that of unbounded data parallelism: processing all elements of
     252the input data stream simultaneously.   In essence, data streams are viewed
     253as (very large) integers.   The fundamental operations are bitwise
     254logic, stream shifting and long-stream addition.
     256Depending on the available parallel processing resources, an actual
     257implementation may divide an input stream into blocks  and process
     258the blocks sequentially.   Within each block  all elements of the
 259input stream are processed together, relying on the availability of
     260bitwise logic and addition scaled to the block size.   On commodity
     261Intel and AMD processors with 128-bit SIMD capabilities (SSE2),
     262we typically process input streams 128 bytes at a time.   
     263In this
     264case, we rely on the Parabix tool chain \cite{lin2012parabix}
     265to handle the details of compilation to block-by-block processing.
     266On the
     267latest processors supporting the 256-bit AVX2 SIMD operations,
     268we also use the Parabix tool chain, but substitute a parallelized
     269long-stream addition technique to avoid the sequential chaining
     270of 4 64-bit additions.
     271Our GPGPU implementation uses scripts to modify the output
     272of the Parabix tools, effectively dividing the input into blocks
     273of 4K bytes.   
     274We also have adapted our long-stream addition technique
     275to perform 4096-bit additions using 64 threads working in lock-step
     276SIMT fashion.  A similar technique is known to the GPU programming
     282input data  & \verb`a4534q--b29z---az---a4q--bca22z--`\\
     283$B_7$ & \verb`.................................`\\
     284$B_6$ & \verb`1....1..1..1...11...1.1..111..1..`\\
     285$B_5$ & \verb`111111111111111111111111111111111`\\
     286$B_4$ & \verb`.11111...111....1....11.....111..`\\
     287$B_3$ & \verb`......11..11111.1111...11.....111`\\
     288$B_2$ & \verb`.11.1.11....111..111.1.11......11`\\
     289$B_1$ & \verb`...1....11.1....1........11.111..`\\
     290$B_0$ & \verb`1.11.111..1.1111.1111.111.11...11`\\
     291\verb:[a]: & \verb`1..............1....1......1.....`\\
     292\verb:[z]: & \verb`...........1....1.............1..`\\
     293\verb:[0-9]: & \verb`.1111....11..........1......11...`
     297\caption{Basis and Character Class Streams}
     301A key concept in this streaming approach is the derivation of bit streams
     302that are parallel to the input data stream, i.e., in one-to-one
     303correspondence with the data element positions of the input
     304streams.   Typically, the input stream is a byte stream comprising
     305the 8-bit character code units of a particular encoding such
     306as extended ASCII, ISO-8859-1 or UTF-8.   However, the method may also
     307easily be used with wider code units such as the 16-bit code units of
     308UTF-16.   In the case of a byte stream, the first step is to transpose
     309the byte stream into eight parallel bit streams, such that bit stream
     310$i$ comprises the $i^\text{th}$ bit of each byte.   These streams form
     311a set of basis bit streams from which many other parallel bit
 312streams can be calculated, such as character class bit
 313streams, in which each bit $j$ of the stream specifies
 314whether character $j$ of the input stream is in the class
 315or not.  Figure \ref{fig:streams} shows an example of an
     316input byte stream in ASCII, the eight basis bit streams of the
     317transposed representation, and the character class bit streams
     319\verb:[z]:, and
     321that may be computed from the basis bit streams using bitwise logic.
     322Zero bits are marked with periods ({\tt .}) so that the one bits stand out.
     323Transposition and character class construction are straightforward
     324using the Parabix tool chain \cite{lin2012parabix}.
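The transposition step can be modeled directly in Python, treating each unbounded bit stream as an arbitrary-precision integer with bit $j$ holding the value for character position $j$. This is a sketch for exposition only; the Parabix tool chain performs the transposition with SIMD instructions rather than the character-at-a-time loop shown here:

```python
def transpose(data: bytes):
    """Return basis bit streams B[0]..B[7]; bit j of B[i] is bit i of byte j."""
    basis = [0] * 8
    for j, byte in enumerate(data):
        for i in range(8):
            if (byte >> i) & 1:
                basis[i] |= 1 << j
    return basis

def show(stream, n):
    """Display a stream left-to-right, with '.' for zero bits as in the figures."""
    return "".join("1" if (stream >> j) & 1 else "." for j in range(n))

data = b"a4534q--b29z---az---a4q--bca22z--"
B = transpose(data)
n = len(data)
mask = (1 << n) - 1
# Character class [a]: 'a' is 0x61 = 01100001, so select by bitwise logic
# on the basis streams (bits 0, 5 and 6 set, all others clear):
cc_a = (~B[7] & B[6] & B[5] & ~B[4] & ~B[3] & ~B[2] & ~B[1] & B[0]) & mask
print(show(cc_a, n))   # 1..............1....1......1.....
```

The printed stream reproduces the \verb:[a]: row of Figure \ref{fig:streams}: eight bitwise operations classify all $n$ positions at once.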
     329input data  & \verb`a4534q--b29z---az---a4q--bca22z--`\\
     330$M_1$ & \verb`.1..............1....1......1....`\\
     331$M_2$ & \verb`.11111..........1....11.....111..`\\
     332$M_3$ & \verb`.................1.............1.`
     336\caption{Marker Streams in Matching {\tt a[0-9]*z}}
     340\paragraph*{Marker Streams.}  Now consider how bit-parallel data
     341streams can be used in regular expression matching.   Consider
     342the problem of searching the input stream of Figure \ref{fig:streams}
 343to find occurrences of strings matching
     344the regular expression \verb:a[0-9]*z:.
     345The matching process involves the concept of {\em marker streams}, that
 346is, streams that mark the positions of current matches during the
     347overall process.  In this case there are three marker streams computed
     348during the match process, namely,
     349$M_1$ representing match positions after an initial \verb:a:
     350character has been found, $M_2$ representing positions
     351reachable from positions marked by $M_1$ by further matching zero or
     352more digits (\verb:[0-9]*:) and finally $M_3$ the stream
     353marking positions after a final \verb:z: has been found.
 354Without yet describing the details of how these streams are computed,
 355Figure \ref{fig:streams2} shows what each
     356of these streams should be for our example matching problem.
     357Note our convention that a marker stream contains a 1 bit
     358at the next character position to be matched, that is,
     359immediately past the last position that was matched.
     363MatchStar takes a marker bitstream and a character class bitstream as input.  It returns all positions that can be reached by advancing the marker bitstream zero or more times through the character class bitstream.
     368input data  & \verb`a4534q--b29z---az---a4q--bca22z--`\\
     369$M_1$ & \verb`.1..............1....1......1....`\\
     370$D = \text{\tt [0-9]}$ & \verb`.1111....11..........1......11...`\\
     371$T_0 = M_1 \wedge D$ & \verb`.1...................1......1....`\\
     372$T_1 = T_0 + D$ & \verb`.....1...11...........1.......1..`\\
     373$T_2 = T_1 \oplus D$ & \verb`.11111...............11.....111..`\\
     374$M_2 = T_2 \, | \, M_1$ & \verb`.11111..........1....11.....111..`
     378\caption{$M_2 = \text{MatchStar}(M_1, D)$}
     383Figure \ref{fig:matchstar} illustrates the MatchStar method. In this figure,
     384it is important to note that our bitstreams are shown in natural left-to-right order reflecting the
     385conventional presentation of our character data input.   However, this reverses the normal
     386order of presentation when considering bitstreams as numeric values.  The key point here is
     387that when we perform bitstream addition, we will show bit movement from left-to-right.
     388For example, $\verb:111.: + \verb:1...: = \verb:...1:$.
     390The first row of the figure is the input data,
     391the second and third rows are the input bitstreams: the initial marker position bitstream and the
     392character class bitstream for digits derived from input data. 
 394In the first operation ($T_0$), marker positions that cannot be advanced are temporarily removed from consideration by masking off marker positions that are not character class positions using bitwise logic.  Next, the temporary marker bitstream is added to the character class bitstream. 
 395The addition produces 1s in three types of positions.  There will be a 1 immediately following a block of character class positions that spanned one or more marker positions, at any character class positions that were not affected by the addition (and are not part of the desired output), and at any marker position that was not the first in its block of character class positions.  Any character class positions that have a 0 in $T_1$ were affected by the addition and are part of the desired output.  These positions are obtained and the undesired 1 bits are removed by XORing with the character class stream. $T_2$ is now only missing marker positions that were removed in the first step as well as marker positions that were 1s in $T_1$.  The
     396output marker stream is obtained by ORing $T_2$ with the initial marker stream.
     398In general, given a marker stream $M$ and a character class stream $C$,
     399the operation of MatchStar is defined by the following equation. 
     400\[\text{MatchStar}(M, C) = (((M \wedge C) + C)  \oplus C) | M\]
     401Given a set of initial marker positions, the result stream marks
     402all possible positions that can be reached by 0 or more occurrences
     403of characters in class $C$ from each position in $M$. 
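Modeling unbounded streams as Python integers (bit $j$ for character position $j$, so the numeric carry direction corresponds to the left-to-right bit movement shown in the figures), the MatchStar equation and the complete \verb:a[0-9]*z: match can be checked against Figures \ref{fig:streams2} and \ref{fig:matchstar}. The helper names below are ours, for illustration only:

```python
def match_star(m, c):
    # MatchStar(M, C) = (((M & C) + C) ^ C) | M
    return (((m & c) + c) ^ c) | m

def cc(data, chars):
    """Character class stream: bit j is set iff data[j] is in chars."""
    s = 0
    for j, ch in enumerate(data):
        if ch in chars:
            s |= 1 << j
    return s

def show(stream, n):
    return "".join("1" if (stream >> j) & 1 else "." for j in range(n))

data = "a4534q--b29z---az---a4q--bca22z--"
n = len(data)
digit = cc(data, "0123456789")
m1 = cc(data, "a") << 1              # markers just past each 'a'
m2 = match_star(m1, digit)           # advance through [0-9]*
m3 = (m2 & cc(data, "z")) << 1      # then match a final 'z'
print(show(m2, n))   # .11111..........1....11.....111..
print(show(m3, n))   # .................1.............1.
```

The two printed streams reproduce the $M_2$ and $M_3$ rows of Figure \ref{fig:streams2}: a single addition advances every marker through its entire run of digits simultaneously.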
     409\section{Block-at-a-Time Processing}\label{sec:blockwise}
 411The unbounded stream model of the previous section must of course
 412be translated into an implementation that proceeds block-at-a-time for
 413realistic application.  For this, we primarily rely on the Pablo
     414compiler of the Parabix toolchain \cite{lin2012parabix}.  Given input
     415statements expressed as arbitrary-length bitstream equations, Pablo
     416produces block-at-a-time C++ code that initializes and maintains all the necessary
     417carry bits for each of the additions and shifts involved in the
     418bitstream calculations.   
     420In the present work, our principal contribution to the Parabix tool
     421chain is to incorporate the technique of long-stream addition described below.
     422Otherwise, we were able to use Pablo directly in compiling our
     423SSE2 and AVX2 implementations.   Our GPU implementation required
     424some scripting to modify the output of the Pablo compiler for our
     427\paragraph*{Long-Stream Addition.}  The maximum word size for
     428addition on commodity processors is typically 64 bits.  In order
     429to implement long-stream addition for block sizes of 256 or larger,
     430a method for propagating carries through the individual stages of
     43164-bit addition is required.  However, the normal technique of
     432sequential addition using add-with-carry instructions, for example,
     433is far from ideal.
     435We have developed a general model using SIMD methods for constant-time
     436long-stream addition up to 4096 bits.   
     437We assume the availability of the following SIMD/SIMT operations
     438operating on vectors of $f$ 64-bit fields.
\begin{itemize}
\item \verb#simd<64>::add(X, Y)#: vertical SIMD addition of corresponding 64-bit fields
in two vectors to produce a result vector of $f$ 64-bit fields;
\item \verb#simd<64>::eq(X, -1)#: comparison of the 64-bit fields
of \verb:X: each with the constant value $-1$ (all bits 1), producing
an $f$-bit mask value;
\item \verb#hsimd<64>::mask(X)#: gathering the high bit of each 64-bit
field into a single compressed $f$-bit mask value;
\item normal bitwise logic operations on $f$-bit masks; and
\item \verb#simd<64>::spread(x)#: distributing the bits of
an $f$-bit mask, one bit each, to the $f$ 64-bit fields of a vector.
\end{itemize}
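These operations can be modeled in plain Python for exposition, representing a vector as a list of $f$ 64-bit integers with field 0 first. The function names below are hypothetical stand-ins for the intrinsics, not an actual SIMD implementation.

```python
F, W = 4, 64                      # f fields of w bits each
MASK = (1 << W) - 1

def simd_add(X, Y):
    """Vertical addition: each 64-bit field wraps independently."""
    return [(x + y) & MASK for x, y in zip(X, Y)]

def simd_eq_neg1(X):
    """f-bit mask of the fields equal to -1 (all ones)."""
    return sum(1 << i for i, x in enumerate(X) if x == MASK)

def hsimd_mask(X):
    """Gather the high bit of each field into one compressed f-bit mask."""
    return sum(((x >> (W - 1)) & 1) << i for i, x in enumerate(X))

def simd_spread(m):
    """Distribute the f bits of mask m, one to each 64-bit field."""
    return [(m >> i) & 1 for i in range(F)]
```

Note that the fields wrap without interaction, e.g. `simd_add([MASK, 1, 2, 3], [1, 0, 0, 0])` yields `[0, 1, 2, 3]`; propagating the lost carries is exactly the problem the long-stream addition method solves.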
\section{Basic Concepts}
\label{Basic Concepts}
In this section, we define the notation and basic concepts used throughout this paper.

\subsection{Notation}
We use the following notation. Let $P=p_{1}p_{2}\ldots{}p_{m}$
be a pattern of length $m$ and $T=t_{1}t_{2}\ldots{}t_{n}$
be a text of length $n$, both defined over a finite alphabet $\Sigma$ of size $\sigma$.
The goal of simple pattern matching is to find all positions of $T$ that follow
an occurrence of $P$.  $P$ may be a simple pattern, an extended pattern, or a regular expression.

Following C notation, the operators $\lnot{}$, $\lor{}$, $\land{}$ and $\oplus{}$
represent bitwise NOT, OR, AND and XOR, respectively. Operators $\ll{}k$ and $\gg{}k$
represent logical left shift and logical right shift by $k$ bit positions, respectively.
Vacant bit positions are filled in with zeroes.
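For concreteness, these operators behave as in C; the Python snippet below illustrates them on 8-bit values (the masking emulates a fixed register width).

```python
W = 8
MASK = (1 << W) - 1
x, y = 0b11001010, 0b10100110

print(bin((~x) & MASK))       # bitwise NOT  -> 0b110101
print(bin(x | y))             # OR           -> 0b11101110
print(bin(x & y))             # AND          -> 0b10000010
print(bin(x ^ y))             # XOR          -> 0b1101100
print(bin((x << 2) & MASK))   # left shift by 2, vacant bits zero-filled
print(bin(x >> 2))            # logical right shift by 2
```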
\subsection{Regular Expressions}

% Define regular expressions (a recursive def), character classes, bounded repetition

\section{Background}
\label{Background}
%\input{background.tex}

\subsection{Classical Methods}

\subsection{Regular Expressions and Finite Automata}
The origins of regular expression matching date back to the automata theory
and formal language theory developed by Kleene in the 1950s \cite{kleene1951}.
Thompson \cite{thompson1968} is credited with the first construction to convert regular expressions
to nondeterministic finite automata (NFA) for regular expression matching.
Following Thompson's approach, a regular expression of length $m$ is first converted
to an NFA with $O(m)$ nodes. It is possible to search a text of length $n$ using the
NFA directly in $O(mn)$ worst-case time. Often, a more efficient choice
is to convert the NFA into a DFA. A DFA has only a single active state, which
allows the text to be searched in $O(n)$ worst-case optimal time. It is well known, however,
that conversion of an NFA to a DFA may cause the state explosion problem:
the resulting DFA may have $O(2^m)$ states.
Thompson's original work marked the beginning of a long line of
regular expression pattern matching methods that
process an input string character-at-a-time
and transition from state to state
according to the current state and the next input character.
Whereas traditional automata techniques achieve
$O(n)$ worst-case optimal efficiency, simple string matching algorithms,
such as the Boyer-Moore family of algorithms, skip input characters
to achieve sublinear times in the average case \cite{boyer1977fast}.
Boyer-Moore methods begin comparison from the end of the pattern instead of the
beginning and precompute skip information
to determine how far ahead a pattern search can skip in the input whenever
a non-matching character is encountered. Generally, Boyer-Moore family
algorithms become faster as the pattern being searched for becomes longer.
In many cases, the techniques used to skip characters in simple string matching
approaches can be extended to regular expression matching.
Widely known techniques used to facilitate character skipping in regular
expression matching include necessary strings and backward suffix matching
inspired by the Backward Dawg Matching (BDM) algorithm \cite{Navarro02patternmatching}.
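As a concrete sketch of such skip precomputation, the following Python implements the simplified Horspool variant of Boyer-Moore; it is illustrative only and not one of the tools evaluated here.

```python
def horspool(text, pattern):
    """Return the starting positions of all occurrences of pattern in text."""
    m, n = len(pattern), len(text)
    if m == 0 or n < m:
        return []
    # Bad-character table: how far the window may slide based on the
    # text character aligned with the window's last position.
    shift = {pattern[i]: m - 1 - i for i in range(m - 1)}
    matches, i = [], 0
    while i <= n - m:
        if text[i:i + m] == pattern:
            matches.append(i)
        # Characters not in the pattern allow a full skip of m positions.
        i += shift.get(text[i + m - 1], m)
    return matches

print(horspool("abracadabra", "cad"))  # -> [4]
```

The skip table grows with the pattern length, which is why longer patterns permit larger skips and faster average-case searching.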
\subsection{Bit-parallel Simulation of Automata}

Define bit-parallelism \cite{Baeza-yates_anew}

Shift-Or / Shift-And \cite{wu1992fast}

Bit-parallel suffix automata (Backward Non-Deterministic Dawg Matching (BNDM) algorithm \cite{navarro1998bit})
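For illustration, a minimal Shift-And matcher in Python is given below; the state register {\tt D} keeps one bit per pattern position. This is a textbook sketch of the Wu--Manber technique, not the implementation evaluated in this paper.

```python
def shift_and(text, pattern):
    """Return the end positions of all matches of pattern in text."""
    m = len(pattern)
    B = {}                       # B[c]: bit i is set iff pattern[i] == c
    for i, c in enumerate(pattern):
        B[c] = B.get(c, 0) | (1 << i)
    D, ends = 0, []
    for j, c in enumerate(text):
        D = ((D << 1) | 1) & B.get(c, 0)   # extend all partial matches by c
        if D & (1 << (m - 1)):             # bit m-1 set: a full match ends at j
            ends.append(j)
    return ends

print(shift_and("xababa", "aba"))  # -> [3, 5]
```

Each text character is processed with a constant number of bitwise operations, so all prefixes of the pattern are tracked simultaneously in one machine word.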
\subsection{Software Tools}

%Thompson created the first grep (UNIX grep) as a standalone adaptation
%of the regular expression parser he had written for the UNIX ed utility.
%In 1976, Aho improved upon Thompson's implementation with a DFA-based implementation called egrep.
%Egrep was faster than grep for simple patterns, but for more complex searches it lagged behind because of the
%time it took to build a complete finite automaton for the regular expression before it could even start searching.
%Since grep used a nondeterministic finite automaton, it took less time to build the state machine but more time
%to evaluate strings with it. Aho later used a technique called cached lazy evaluation to improve the performance of egrep.
%It took zero set-up time and just one additional test in the inner loop.
%
%Given a regular expression R and a text T, the regular expression matching
%problem finds all ending positions of substrings in T that match a string in
%the language denoted by R.
%The behaviours of Gnu grep, agrep, and nr-grep differ ...
%Gnu grep
%agrep
%nr-grep
%re2
\section{Bit-parallel Data Streams}
\label{Bit-parallel Data Streams}

The bit-parallel data stream method uses the wide
SIMD registers commonly found on commodity processors
to process many byte positions at a time using
bitwise logic, shifting and other operations.

A significant advantage of the bit-parallel
data stream method over other
pattern matching methods that rely on
bit-parallel automata simulation
is the potential to skip ahead by a full register width
of characters in text where matches occur with low frequency.
\subsection{Match Star}

%Wikipedia
Backtracking is a general algorithm for finding all solutions to a computational problem
by incrementally building candidates to the solutions.
\section{Compiler Technology}
\label{Compiler Technology}

\section{Methodology}
\label{Methodology}
%\input{methodology.tex}

We compare the performance of our bit-parallel data stream techniques against
Gnu grep, agrep, nr-grep, and re2.

\section{Experimental Results}
\label{Experimental Results}
%\input{results.tex}

\section{Conclusion}
\label{Conclusion}
%\input{conclusion.tex}

{
\bibliographystyle{acm}
\bibliography{reference}
}
In this model, the \verb#hsimd<64>::mask(X)# and
\verb#simd<64>::spread(x)# operations model the minimum
communication requirements between the parallel processing units
(SIMD lanes or SIMT processors).  In essence, we just need
the ability to quickly send and receive 1 bit of information
per parallel unit.  The \verb#hsimd<64>::mask(X)# operation
gathers 1 bit from each of the processors to a central resource.
After calculations on the gathered bits are performed, we then
just need an operation to invert the communication, i.e.,
sending 1 bit each from the central processor to each of
the parallel units.  There are a variety of ways in which
these facilities may be implemented depending on the
underlying architecture; details of our AVX2 and GPU implementations
are presented later.
Given these operations, our method for long-stream addition of
two $f \times 64$ bit values \verb:X: and \verb:Y: is the following.
\begin{enumerate}
\item Form the vector of 64-bit sums of \verb:X: and \verb:Y:.
\[\text{\tt R} = \text{\tt simd<64>::add(X, Y)} \]

\item Extract the $f$-bit masks of \verb:X:, \verb:Y: and \verb:R:.
\[\text{\tt x} = \text{\tt hsimd<64>::mask(X)} \]
\[\text{\tt y} = \text{\tt hsimd<64>::mask(Y)} \]
\[\text{\tt r} = \text{\tt hsimd<64>::mask(R)} \]

\item Compute an $f$-bit mask of carries generated for each of the
64-bit additions of \verb:X: and \verb:Y:.
\[\text{\tt c} = (\text{\tt x} \wedge \text{\tt y}) \vee ((\text{\tt x} \vee \text{\tt y}) \wedge \neg \text{\tt r})\]

\item Compute an $f$-bit mask of all fields of {\tt R} that will overflow with
an incoming carry bit.  This is the {\em bubble mask}.
\[\text{\tt b} = \text{\tt simd<64>::eq(R, -1)}\]

\item Determine an $f$-bit mask identifying the fields of {\tt R} that need to be
incremented to produce the final sum.  Here we find a new application of MatchStar.
\[\text{\tt i} = \text{\tt MatchStar(c*2, b)}\]

This is the key step.  The mask {\tt c} of outgoing carries must be
shifted one position ({\tt c*2}) so that each outgoing carry bit becomes associated
with the next digit.  At the incoming position, the carry will
increment the 64-bit digit.  However, if this digit is all ones (as
signalled by the corresponding bit of the bubble mask {\tt b}), then the addition
will generate another carry.  In fact, if there is a sequence of
digits that are all ones, then the carry must bubble through
each of them.  This is just MatchStar.

\item Compute the final result {\tt Z}.
\[\text{\tt Z} = \text{\tt simd<64>::add(R, simd<64>::spread(i))}\]
\end{enumerate}
\begin{figure}
\begin{center}
\begin{tabular}{c|c|c|c|c|c|c|c|c|}
\cline{2-9}
{\tt X} & {\tt 19} & {\tt 31} & {\tt BA} & {\tt 4C} & {\tt 3D} & {\tt 45} & {\tt 21} & {\tt F1} \\ \cline{2-9}
{\tt Y} & {\tt 22} & {\tt 12} & {\tt 45} & {\tt B3} & {\tt E2} & {\tt 16} & {\tt 17} & {\tt 36} \\ \cline{2-9}
{\tt R} & {\tt 3B} & {\tt 43} & {\tt FF} & {\tt FF} & {\tt 1F} & {\tt 5B} & {\tt 38} & {\tt 27} \\ \cline{2-9}
{\tt x} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} \\ \cline{2-9}
{\tt y} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} \\ \cline{2-9}
{\tt r} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} \\ \cline{2-9}
{\tt c} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 1} \\ \cline{2-9}
{\tt c*2} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} \\ \cline{2-9}
{\tt b} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} \\ \cline{2-9}
{\tt i} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} \\ \cline{2-9}
{\tt Z} & {\tt 3B} & {\tt 44} & {\tt 0} & {\tt 0} & {\tt 1F} & {\tt 5B} & {\tt 39} & {\tt 27} \\ \cline{2-9}
\end{tabular}
\end{center}
\caption{Long-Stream Addition}\label{fig:longadd}
\end{figure}
Figure \ref{fig:longadd} illustrates the process, using 8-bit fields
rather than 64-bit fields and showing all field values in hexadecimal
notation.  Note that
two of the individual 8-bit additions produce carries, while two
others produce {\tt FF} values that generate bubble bits.  The
net result is that four of the original 8-bit sums must be
incremented to produce the long-stream result.
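The entire procedure can be traced in plain Python on the 8-bit example of Figure \ref{fig:longadd}. This is a behavioural model only (field 0, the least significant, is written rightmost in the figure); the function names are ours, not intrinsics.

```python
W, F = 8, 8                      # eight 8-bit fields, as in the figure
MASK = (1 << W) - 1

def match_star(M, C):
    return (((M & C) + C) ^ C) | M

def high_bits(V):
    """f-bit mask of the high bit of each field (models hsimd mask)."""
    return sum(((v >> (W - 1)) & 1) << k for k, v in enumerate(V))

def long_add(X, Y):
    R = [(a + b) & MASK for a, b in zip(X, Y)]             # independent field sums
    x, y, r = high_bits(X), high_bits(Y), high_bits(R)
    c = (x & y) | ((x | y) & ~r)                           # outgoing carries
    b = sum(1 << k for k, s in enumerate(R) if s == MASK)  # bubble mask
    inc = match_star(c << 1, b)                            # increment mask via MatchStar
    return [(s + ((inc >> k) & 1)) & MASK for k, s in enumerate(R)]

# The figure's values, least-significant field first:
X = [0xF1, 0x21, 0x45, 0x3D, 0x4C, 0xBA, 0x31, 0x19]
Y = [0x36, 0x17, 0x16, 0xE2, 0xB3, 0x45, 0x12, 0x22]
print([hex(z) for z in long_add(X, Y)])
# -> ['0x27', '0x39', '0x5b', '0x1f', '0x0', '0x0', '0x44', '0x3b']
```

Concatenating the result fields reproduces the ordinary 64-bit sum of the concatenated inputs, which is the correctness condition for the method.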
A slight extension to the process produces a long-stream full adder
that can be used in chained addition.  In this case, the
adder must take an additional carry-in bit
{\tt p} and produce a carry-out bit {\tt q}.
This may be accomplished by incorporating {\tt p}
in calculating the increment mask in the low bit position,
and then extracting the carry-out {\tt q} from the high bit position.
\[\text{\tt i} = \text{\tt MatchStar(c*2+p, b)}\]
\[\text{\tt q} = \text{\tt i >> f}\]
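The full-adder extension above can be modeled the same way; in this hypothetical sketch {\tt p} is the incoming carry and the returned {\tt q} the outgoing one (again an 8-field, 8-bit behavioural model, not an intrinsic API).

```python
W, F = 8, 8
MASK = (1 << W) - 1

def match_star(M, C):
    return (((M & C) + C) ^ C) | M

def long_adc(X, Y, p):
    """Long-stream full adder: returns (sum fields, carry-out bit)."""
    R = [(a + b) & MASK for a, b in zip(X, Y)]
    x = sum(((a >> (W - 1)) & 1) << k for k, a in enumerate(X))
    y = sum(((b >> (W - 1)) & 1) << k for k, b in enumerate(Y))
    r = sum(((s >> (W - 1)) & 1) << k for k, s in enumerate(R))
    c = (x & y) | ((x | y) & ~r)
    b = sum(1 << k for k, s in enumerate(R) if s == MASK)
    inc = match_star((c << 1) | p, b)    # carry-in enters at bit 0
    q = inc >> F                         # carry-out from beyond the top field
    Z = [(s + ((inc >> k) & 1)) & MASK for k, s in enumerate(R)]
    return Z, q

# A carry-in of 1 bubbles through a block of all-ones fields and emerges as q:
print(long_adc([0xFF] * 8, [0x00] * 8, 1))  # -> ([0, 0, 0, 0, 0, 0, 0, 0], 1)
```

Chained block-at-a-time processing simply feeds each block's {\tt q} back in as the next block's {\tt p}.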
As described subsequently, we use a two-level long-stream addition technique
in both our AVX2 and GPU implementations.  In principle, one can extend
the technique to additional levels.  Using 64-bit adders throughout,
$\lceil\log_{64}{n}\rceil$ steps are needed for $n$-bit addition.
A three-level scheme could coordinate
64 groups each performing 4096-bit long additions in a two-level structure.
However, whether there are reasonable architectures that can support fine-grained
SIMT-style coordination at this level is an open question.
Using the methods outlined, it is quite conceivable that instruction
set extensions to support long-stream addition could be added to
future SIMD and GPU processors.  Given the fundamental nature
of addition as a primitive and its particular application to regular
expression matching as shown herein, it seems reasonable to expect
such instructions to become available.  Alternatively, it may
be worthwhile to simply ensure that the \verb#hsimd<64>::mask(X)#
and \verb#simd<64>::spread(X)# operations are efficiently supported.
\section{GPU Implementation}\label{sec:GPU}

To further assess the scalability of our regular expression matching
using bit-parallel data streams, we implemented a GPGPU version
in OpenCL.
We arranged for 64 work groups, each having 64
threads.  Input files are divided in data-parallel fashion among
the 64 work groups.  Each work group carries out the regular
expression matching operations 4096 bytes at a time using SIMT
processing.  Figure \ref{fig:GPUadd} shows our implementation
of long-stream addition on the GPU.  Each thread maintains
its own carry and bubble values and performs synchronized
updates with the other threads using a six-step parallel-prefix
style process.
Our GPU test machine was an AMD A10-6800K APU with Radeon(tm) HD Graphics.
\begin{figure}
\begin{verbatim}
inline BitBlock adc(int idx, BitBlock a, BitBlock b, __local BitBlock *carry,
                    __local BitBlock *bubble, BitBlock *group_carry, const int carryno){
        BitBlock carry_mask;
        BitBlock bubble_mask;

        BitBlock partial_sum = a+b;
        BitBlock gen = a&b;
        BitBlock prop = a^b;
        carry[idx] = ((gen | (prop & ~partial_sum))&CARRY_BIT_MASK)>>(WORK_GROUP_SIZE-1-idx);
        bubble[idx] = (partial_sum + 1)? 0:(((BitBlock)1)<<idx);

        barrier(CLK_LOCAL_MEM_FENCE);
        for(int offset=WORK_GROUP_SIZE/2; offset>0; offset=offset>>1){
                carry[idx] = carry[idx]|carry[idx^offset];
                bubble[idx] = bubble[idx]|bubble[idx^offset];
                barrier(CLK_LOCAL_MEM_FENCE);
        }

        carry_mask = (carry[0]<<1)|group_carry[carryno];
        bubble_mask = bubble[0];

        BitBlock s = (carry_mask + bubble_mask) & ~bubble_mask;
        BitBlock inc = s | (s-carry_mask);
        BitBlock rslt = partial_sum + ((inc>>idx)&0x1);
        group_carry[carryno] = (carry[0]|(bubble_mask & inc))>>63;
        return rslt;
}
\end{verbatim}
\caption{OpenCL 4096-bit Addition}\label{fig:GPUadd}
\end{figure}
Figure \ref{fig:SSE-AVX-GPU} compares the performance of
our SSE2, AVX and GPU implementations.
\begin{figure}
% Bar chart (pgfplots) comparing running time in ms per megabyte for the
% SSE2, AVX and GPU implementations, plotted from data/ssetime.dat,
% data/avxtime.dat and data/gputime.dat.
\caption{Running Time}\label{fig:SSE-AVX-GPU}
\end{figure}
%\section{Appendix Title}

%This is the text of the appendix, if you need one.

This research was supported by grants from the Natural Sciences and Engineering Research Council of Canada and
MITACS, Inc.