Feb 10, 2015, 5:55:14 PM (4 years ago)

Intro and Background, remove section 3

1 edited


  • docs/Working/icGrep/background.tex

    r4488 r4490  
    11\section{Background} \label{sec:background}
    2 \subsection{Unicode Regular Expression Requirements}
    3 Unicode is system for organizing the characters from all languages
    4 and symbol systems into a single numeric space and encoding those
    5 values in convenient formats for computerized processing.
    6 The numeric values form a space of integer {\em codepoints} from 0
    7 through hexadecimal 10FFFF.  The computerized formats represent
    8 codepoint values as (possibly variable length) sequences of fixed-width {\em code units}.
    9 UTF-8 represents each codepoint using a sequence of one to four octets (8-bit bytes),
    10 UTF-16 represents each codepoint using one or two 16-bit code units and UTF-32
    11 represents each codepoint as a single 32-bit unit.  The format used most
    12 often for storage and transmission of Unicode data is UTF-8; this is the format
    13 assumed through this paper.
    15 Traditional grep syntax is oriented towards
     3%\subsection{Unicode Regular Expression Requirements}
     4\paragraph{Unicode Regular Expressions.}
     5Traditional regular expression syntax is oriented towards
    166string search using regular expressions over ASCII or extended-ASCII byte sequences.
    177A grep search for a line beginning with a capitalized word might use the
    2818practical to use named character classes, such as \verb:Lu: for upper case letters and
    2919\verb:Ll: for lower case letters.   Using these names, our search might be rewritten
    30 to find capitalized words in any language as ``\verb!^[[:Lu:]][[:Ll:]]+!'' (Posix
    31 syntax)  or ``\verb:^\p{Lu}\p{Ll}+:'' (Perl-compatible syntax).   
     20to find capitalized words in any language as %``\verb!^[[:Lu:]][[:Ll:]]+!'' (Posix syntax)  or
     21``\verb:^\p{Lu}\p{Ll}+:'' (Perl-compatible syntax).   
    3222The Unicode consortium has defined an extensive list of named properties that can
    3323be used in regular expressions.
    5141denotes the class of all non-ASCII lower case letters.
    53 \subsection{Parabix}
    55 The Parabix toolchain is a set of compilers and run-time libraries designed
    56 to take advantage of the SIMD features of commodity processors
    57 to support high-performance streaming text processing.
    58 % based on a bit-parallel
    59 %transform representation of text.
    60 Parabix leverages a bit-parallel transform representation of text, where a
    61 text $T$ is represented as a set of parallel
    62 bit streams $B_i$, such that bit $j$ of stream $B_i$ is the $i^{\mbox{th}}$ bit of
    63 character code unit $j$ of $T$.
    64 The Parabix methods have been used to
    65 accelerate Unicode transcoding~\cite{cameron2008case}, protein search~\cite{green2009modeling},
    66 XML parsing~\cite{cameron2011parallel}, and, most recently, regular expression search~\cite{cameron2014bitwise}.
     43%\subsection{Bitwise Data Parallel Regular Expression Matching}
     44\paragraph{Bitwise Data Parallel Matching.}
     45Regular expression search using bitwise data parallelism has been
     46recently introduced and shown to considerably outperform methods
     47based on DFAs or NFAs \cite{cameron2014bitwise}.   In essence, the
     48method is 100\% data parallel, considering all input positions in a file
     49simultaneously.   A set of parallel bit streams is computed, with each bit
     50position corresponding to one code-unit position within input character stream.
     51{\em Character class streams}, such as {\tt [d]} for the stream that marks the
     52position of ``d'' characters and {\tt [a-z]} for the stream of lower case
     53ASCII alphabetics are first computed in a fully data-parallel manner.
     54Then the matching process proper begins taking advance of bitwise logic
     55and shifting operations as well as an operation for finding all matches to a
     56character class repetition known as MatchStar.  At each step of the
     57process a {\em marker} bit stream identifying the full set of positions
     58within the input data stream that match the regular expression to this point.
    83 The central concept in regular expression matching is that of marker bitstreams.
    84 At each step in the matching process, a marker bitstream indicates the matches
    85 that have been found to this point.   The bitwise matching method uses the
    86 operations Advance to move an entire stream of markers one step forward, and MatchStar
    87 to find all positions matched after a character class repetition.
    8874For example,
    8975Figure \ref{fig:bitwisematch} shows how the regular expression {\tt d[a-z]*ed} is matched
    90 against some input text using bitwise methods. 
     76against some input text using bitwise methods.  In this diagram we use periods to denote
     770 bits so that the 1 bits stand out.
    9178In the first step the character class stream
    92 {\tt [d]} is matched and the results advanced one position to produce marker bitstream $M_1$.   Four match results are now in play
    93 simultaneously. 
    94 The next step applies the
    95 MatchStar operation to find all the matches that may then be reached with the Kleene-* repetition
    96 {\tt [a-z]*} ($M_2$).  We now have pending matches at almost all possible positions.   However, most of
    97 the positions are filtered out with the match to {\tt [e]} in the computation of $M_3$.
    98 The final step produces marker stream $M_4$ indicating that single position at which the entire regular expression is matched.
     79{\tt [d]} is matched and the results shifted one position (Advance) to produce marker bitstream $M_1$.
     80Four matches indicated by marker bits are now in play simultaneously. 
     81The next step applies the  MatchStar operation to find all the matches that may then be
     82reached with the Kleene-* repetition
     83{\tt [a-z]*} ($M_2$).   This produces pending matches at many positions.  However, there
     84is no need to consider these matches one at a time using lazy or greedy matching strategies.
     85Rather, the full marker stream $M_3$ of remaining possibilites after matching {\tt [e]} is easily
     86computed using a shift and bitwise and.
     87The final step produces marker stream $M_4$ indicating that single position
     88at which the entire regular expression is matched.
     90The MatchStar operation turns out to be surprisingly simple \cite{cameron2014bitwise}.
     91\[\text{MatchStar}(M, C) = (((M \wedge C) + C)  \oplus C) \vee M\]
     92A single bit stream addition operation suffices to find all reachable
     93positions from marker stream $M$ through character class stream $C$.
     94Interestingly, the MatchStar operation also has application to the
     95parallelized long-stream addition itself \cite{cameron2014bitwise}, as well as
     96the bit-parallel edit distance algorithm of Myers\cite{myers1999fast}.
     98The Parabix toolchain \cite{lin2012parabix} provides a set of compilers
     99and run-time libraries that target the SIMD instructions of commodity
     100processors (e.g., SSE or AVX instructions on x86-64 architecture).
     101Input is processed in blocks of code units equal to the size in bits
     102of the SIMD registers, for example, 128 bytes at a time using 128-bit registers.
     103Using the Parabix facilities, the bitwise data parallel approach to
     104regular expression search was shown to deliver substantial performance
     105acceleration for traditional ASCII regular expression matching tasks,
     106often 5X or better \cite{cameron2014bitwise}.
    117124targetting the SIMD integer instructions of commodity processors.
    119128\subsection{Parabix Regular Expression Matching}
Note: See TracChangeset for help on using the changeset viewer.