Changeset 4490 for docs/Working/icGrep

Feb 10, 2015, 5:55:14 PM (4 years ago)

Intro and Background, remove section 3

6 edited


  • docs/Working/icGrep/background.tex

    r4488 r4490  
    11\section{Background} \label{sec:background}
    2 \subsection{Unicode Regular Expression Requirements}
    3 Unicode is system for organizing the characters from all languages
    4 and symbol systems into a single numeric space and encoding those
    5 values in convenient formats for computerized processing.
    6 The numeric values form a space of integer {\em codepoints} from 0
    7 through hexadecimal 10FFFF.  The computerized formats represent
    8 codepoint values as (possibly variable length) sequences of fixed-width {\em code units}.
    9 UTF-8 represents each codepoint using a sequence of one to four octets (8-bit bytes),
    10 UTF-16 represents each codepoint using one or two 16-bit code units and UTF-32
    11 represents each codepoint as a single 32-bit unit.  The format used most
    12 often for storage and transmission of Unicode data is UTF-8; this is the format
    13 assumed through this paper.
    15 Traditional grep syntax is oriented towards
     3%\subsection{Unicode Regular Expression Requirements}
     4\paragraph{Unicode Regular Expressions.}
     5Traditional regular expression syntax is oriented towards
    166string search using regular expressions over ASCII or extended-ASCII byte sequences.
    177A grep search for a line beginning with a capitalized word might use the
    2818practical to use named character classes, such as \verb:Lu: for upper case letters and
    2919\verb:Ll: for lower case letters.   Using these names, our search might be rewritten
    30 to find capitalized words in any language as ``\verb!^[[:Lu:]][[:Ll:]]+!'' (Posix
    31 syntax)  or ``\verb:^\p{Lu}\p{Ll}+:'' (Perl-compatible syntax).   
     20to find capitalized words in any language as %``\verb!^[[:Lu:]][[:Ll:]]+!'' (Posix syntax)  or
     21``\verb:^\p{Lu}\p{Ll}+:'' (Perl-compatible syntax).   
    3222The Unicode consortium has defined an extensive list of named properties that can
    3323be used in regular expressions.
    5141denotes the class of all non-ASCII lower case letters.
    53 \subsection{Parabix}
    55 The Parabix toolchain is a set of compilers and run-time libraries designed
    56 to take advantage of the SIMD features of commodity processors
    57 to support high-performance streaming text processing.
    58 % based on a bit-parallel
    59 %transform representation of text.
    60 Parabix leverages a bit-parallel transform representation of text, where a
    61 text $T$ is represented as a set of parallel
    62 bit streams $B_i$, such that bit $j$ of stream $B_i$ is the $i^{\mbox{th}}$ bit of
    63 character code unit $j$ of $T$.
    64 The Parabix methods have been used to
    65 accelerate Unicode transcoding~\cite{cameron2008case}, protein search~\cite{green2009modeling},
    66 XML parsing~\cite{cameron2011parallel}, and, most recently, regular expression search~\cite{cameron2014bitwise}.
     43%\subsection{Bitwise Data Parallel Regular Expression Matching}
     44\paragraph{Bitwise Data Parallel Matching.}
     45Regular expression search using bitwise data parallelism has been
     46recently introduced and shown to considerably outperform methods
     47based on DFAs or NFAs \cite{cameron2014bitwise}.   In essence, the
     48method is 100\% data parallel, considering all input positions in a file
     49simultaneously.   A set of parallel bit streams is computed, with each bit
     50position corresponding to one code-unit position within input character stream.
     51{\em Character class streams}, such as {\tt [d]} for the stream that marks the
     52position of ``d'' characters and {\tt [a-z]} for the stream of lower case
     53ASCII alphabetics are first computed in a fully data-parallel manner.
     54Then the matching process proper begins taking advance of bitwise logic
     55and shifting operations as well as an operation for finding all matches to a
     56character class repetition known as MatchStar.  At each step of the
     57process a {\em marker} bit stream identifying the full set of positions
     58within the input data stream that match the regular expression to this point.
    83 The central concept in regular expression matching is that of marker bitstreams.
    84 At each step in the matching process, a marker bitstream indicates the matches
    85 that have been found to this point.   The bitwise matching method uses the
    86 operations Advance to move an entire stream of markers one step forward, and MatchStar
    87 to find all positions matched after a character class repetition.
    8874For example,
    8975Figure \ref{fig:bitwisematch} shows how the regular expression {\tt d[a-z]*ed} is matched
    90 against some input text using bitwise methods. 
     76against some input text using bitwise methods.  In this diagram we use periods to denote
     770 bits so that the 1 bits stand out.
    9178In the first step the character class stream
    92 {\tt [d]} is matched and the results advanced one position to produce marker bitstream $M_1$.   Four match results are now in play
    93 simultaneously. 
    94 The next step applies the
    95 MatchStar operation to find all the matches that may then be reached with the Kleene-* repetition
    96 {\tt [a-z]*} ($M_2$).  We now have pending matches at almost all possible positions.   However, most of
    97 the positions are filtered out with the match to {\tt [e]} in the computation of $M_3$.
    98 The final step produces marker stream $M_4$ indicating that single position at which the entire regular expression is matched.
     79{\tt [d]} is matched and the results shifted one position (Advance) to produce marker bitstream $M_1$.
     80Four matches indicated by marker bits are now in play simultaneously. 
     81The next step applies the  MatchStar operation to find all the matches that may then be
     82reached with the Kleene-* repetition
     83{\tt [a-z]*} ($M_2$).   This produces pending matches at many positions.  However, there
     84is no need to consider these matches one at a time using lazy or greedy matching strategies.
     85Rather, the full marker stream $M_3$ of remaining possibilites after matching {\tt [e]} is easily
     86computed using a shift and bitwise and.
     87The final step produces marker stream $M_4$ indicating that single position
     88at which the entire regular expression is matched.
     90The MatchStar operation turns out to be surprisingly simple \cite{cameron2014bitwise}.
     91\[\text{MatchStar}(M, C) = (((M \wedge C) + C)  \oplus C) \vee M\]
     92A single bit stream addition operation suffices to find all reachable
     93positions from marker stream $M$ through character class stream $C$.
     94Interestingly, the MatchStar operation also has application to the
     95parallelized long-stream addition itself \cite{cameron2014bitwise}, as well as
     96the bit-parallel edit distance algorithm of Myers\cite{myers1999fast}.
     98The Parabix toolchain \cite{lin2012parabix} provides a set of compilers
     99and run-time libraries that target the SIMD instructions of commodity
     100processors (e.g., SSE or AVX instructions on x86-64 architecture).
     101Input is processed in blocks of code units equal to the size in bits
     102of the SIMD registers, for example, 128 bytes at a time using 128-bit registers.
     103Using the Parabix facilities, the bitwise data parallel approach to
     104regular expression search was shown to deliver substantial performance
     105acceleration for traditional ASCII regular expression matching tasks,
     106often 5X or better \cite{cameron2014bitwise}.
    117124targetting the SIMD integer instructions of commodity processors.
    119128\subsection{Parabix Regular Expression Matching}
  • docs/Working/icGrep/bitgrep.bib

    r4452 r4490  
    324324  title={GPU-based NFA implementation for memory efficient high speed regular expression matching},
    325325  author={Zu, Yuan and Yang, Ming and Xu, Zhonghu and Wang, Lin and Tian, Xin and Peng, Kunyang and Dong, Qunfeng},
    326   booktitle={ACM SIGPLAN Notices},
    327   volume={47},
    328   number={8},
     326  booktitle={PPoPP '12 - Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming},
    329327  pages={129--140},
    330328  year={2012},
  • docs/Working/icGrep/icgrep.tex

    r4476 r4490  
    54 \input{paradigm.tex}
  • docs/Working/icGrep/introduction.tex

    r4489 r4490  
    3 Unix {\tt grep} is a tool widely used
    4 to search for lines in text files matching a given regular expression
    5 pattern.   Historical comments ...
    7 Unicode regular expression matching adds performance challenges...
     4Although well-established technical standards exist for
     5Unicode regular expressions \cite{davis2012unicode}, most of today's
     6regular expression processing toolsets fail to support the full set of processing
     7features even at the most basic level \cite{stewart2013unicode}. 
     8One of the fundamental issues is performance and so it makes good sense
     9to consider the ways in which parallel processing approaches can help
     10address the gap.
    912Efforts to improve the performance of regular expression matching through
    3336require thousands of DFA states for named Unicode properties.
    3437Building on the Parabix framework, Cameron et al.~\cite{cameron2014bitwise} introduce
    35 regular expression matching using the bitwise
    36 data parallel approach together with the MatchStar primitive
    37 for efficient implementation of
    38 Kleene-* character-class repetitions.
     38regular expression matching using a new bitwise
     39data parallel approach.
    4041In this paper, we report on the use of the implementation of a full
    4445The result is \icGrep{},
    4546a high-performance, full-featured open-source grep implementation
    46 with systematic support for Unicode regular expressions addressing the
    47 requirements of Unicode Technical Standard \#18 \cite{davis2012unicode}.  As an alternative
     47with systematic support for Unicode regular expressions.  As an alternative
    4848to classical grep implementations, \icGrep{} offers dramatic performance
    49 acceleration in ASCII-based and Unicode matching performance alike.
     49acceleration in Unicode regular expression matching.
    5151The remainder of this paper is organized as follows.   Section \ref{sec:background}
    5454the Parabix framework and regular expression matching techniques
    5555using bitwise data parallelism.   
    56 Section \ref{sec:seqpar} expands on previous work on bitwise data
    57 parallelism by more fully characterizing the paradigm and documenting
    58 important techniques.
     56%Section \ref{sec:seqpar} expands on previous work on bitwise data
     57%parallelism by more fully characterizing the paradigm and documenting
     58%important techniques.
    5959Section \ref{sec:Unicode} addresses
    6060the issues and performance challenges associated with meeting Unicode
  • docs/Working/icGrep/unicode-re.tex

    r4489 r4490  
    1 \section{Unicode Regular Expression Methods}\label{sec:Unicode}
     1\section{Bitwise Methods for Unicode}\label{sec:Unicode}
    32\subsection{UTF-8 Transformation}\label{sec:Unicode:toUTF8}
    2120% The UTF-8 encoded regular expression for the range \verb:[\u{2030}-\u{2137}]: becomes:
    2221% \newline
    25 The \icGrep{} parser produces a representation of an input regular expression over variable-length code \emph{points}.
    26 %
    27 Parabix, however, operates on fixed-size code \emph{units}.
    28 %
    29 To support code points, the toUTF8 transformation converts the expression into an equivalent expression over code units
    30 by splitting code points into corresponding sequences of bytes (UTF-8 code units).
    31 %and assigning them to new character classes.
    32 %
     22The \icGrep{} regular expression parser produces a representation of an input regular expression over Unicode code points.
     23To process UTF-8 data streams, however, these expressions must first be converted to equivalent expressions in terms of UTF-8 code units.   
    3325Consider the Unicode regular expression `\verb:\u{244}[\u{2030}-\u{2137}]:`.
    3527The parser produces a sequence starting with a \verb:0x244: followed by the range of \verb:0x2030: to \verb:0x2137:.
    37 After toUTF8, the first code point in the sequence becomes the two byte sequence ``\verb:\u{C9}\u{84}:'',
     29After toUTF8, the first codepoint in the sequence becomes the two byte sequence ``\verb:\u{C9}\u{84}:'',
    3830and the range expands into the series of sequences and alternations shown below:
Note: See TracChangeset for help on using the changeset viewer.