Changeset 4472


Ignore:
Timestamp:
Feb 6, 2015, 10:30:04 PM (5 years ago)
Author:
nmedfort
Message:

General clean up and to-UTF8

Location:
docs/Working/icGrep
Files:
8 edited

Legend:

Unmodified
Added
Removed
  • docs/Working/icGrep/architecture.tex

    r4466 r4472  
    55
    66As show in Figure \ref{fig:compiler},
    7 icGrep is composed of three logical layers: \RegularExpression{}, \Pablo{} and the LLVM layer, each with their own intermediate representation
     7\icGrep{} is composed of three logical layers: \RegularExpression{}, \Pablo{} and the LLVM layer, each with their own intermediate representation
    88(IR), transformation and compilation modules.
    99%
     
    1414The AST is a minimalistic representation that, unlike traditional \RegularExpression{}, is not converted into a NFA or DFA for further processing.
    1515%
    16 Instead, icGrep passes the AST into the transformation module, which includes a set of \RegularExpression{} specific optimization passes.
     16Instead, \icGrep{} passes the AST into the transformation module, which includes a set of \RegularExpression{} specific optimization passes.
    1717%
    1818The initial \emph{Nullable} pass, determines whether the \RegularExpression{} contains any prefixes or suffixes that may be removed or
     
    6767the only complexities it introduces is the generation of Phi nodes, linking of statically-compiled functions, and assignment of carry variables.
    6868%
    69 It produces the dynamically-generated match function used by the icGrep.
     69It produces the dynamically-generated match function used by the \icGrep{}.
    7070
    7171\subsection{Dynamic Grep Engine}
     
    7373
    7474
    75 As shown in Figure \ref{fig:execution}, icGrep takes the input data and transposed it into 8 parallel bit streams through S2P module.
     75As shown in Figure \ref{fig:execution}, \icGrep{} takes the input data and transposed it into 8 parallel bit streams through S2P module.
    7676The required streams, e.g. line break stream, can then be generated using the 8 basis bits streams.
    7777The JIT function retrieves the 8 basis bits and the required streams from their memory addresses and starts the matching process.
     
    8080A match scanner will scan through this bitstream and calculate the total counts or write the context of each match position.
    8181
    82 We can also apply a pipeline parallelism strategy to further speed up the process of icGrep.
     82We can also apply a pipeline parallelism strategy to further speed up the process of \icGrep{}.
    8383S2P and Required Streams Generator can be process in a separate thread and start even before the dynamic compilation starts.
    8484The output of S2P and Required Streams Generator, that is the 8 basis bits streams and the required streams,
  • docs/Working/icGrep/background.tex

    r4452 r4472  
    1818pattern ``\verb:^[A-Z][a-z]+:'' (``extended'' syntax).  Here, ``\verb:^:'' is a zero-width assertion
    1919matching only at the start of a line, ``\verb:[A-Z]:'' is a character class
    20 that matches any single character in the contiguous range of characters froms A through Z,
     20that matches any single character in the contiguous range of characters from A through Z,
    2121while the  plus operator in ``\verb:[a-z]+:'' denotes repetition of one or more lower
    2222case ASCII letters.   
     
    3535Beyond named properties, Unicode Technical Standard \#18 defines additional
    3636requirements for Unicode regular expressions, at three levels of
    37 complexity \cite{davis2012unicode}.   Level 1 generally relates to
     37complexity~\cite{davis2012unicode}.   Level 1 generally relates to
    3838properties expressed in terms of individual Unicode codepoints, while level 2
    3939introduces complexities due to codepoint sequences that form grapheme clusters,
     
    5555The Parabix toolchain is a set of compilers and run-time libraries designed
    5656to take advantage of the SIMD features of commodity processors
    57 to support high-performance streaming text processing based on a bit-parallel
    58 transform represenation of text.  In this representation, a text $T$ is represented as a set of parallel
     57to support high-performance streaming text processing.
     58% based on a bit-parallel
     59%transform representation of text.
     60Parabix leverages a bit-parallel transform representation of text, where a
     61text $T$ is represented as a set of parallel
    5962bit streams $B_i$, such that bit $j$ of stream $B_i$ is the $i^{\mbox{th}}$ bit of
    60 character code unit $j$ of $T$.  The Parabix methods have been used to
    61 accelerate Unicode transcoding \cite{cameron2008case}, protein search \cite{green2009modeling},
    62 XML parsing \cite{cameron2011parallel}, and, most recently, regular expression search \cite{cameron2014bitwise}.
     63character code unit $j$ of $T$.
     64The Parabix methods have been used to
     65accelerate Unicode transcoding~\cite{cameron2008case}, protein search~\cite{green2009modeling},
     66XML parsing~\cite{cameron2011parallel}, and, most recently, regular expression search~\cite{cameron2014bitwise}.
    6367
     68
     69\comment{
    6470
    6571\subsection{LLVM}
     
    7480LLVM features a flexible multi-stage compilation structure that can be organized in
    7581passes in many ways, including fully dynamic just-in-time compilation.   Of particular
    76 importance to the icGrep project, the LLVM IR supports arbitrarily-sized vectors of
     82importance to the \icGrep{} project, the LLVM IR supports arbitrarily-sized vectors of
    7783arbitrary-width integers, well suited to code generation
    7884targetting the SIMD integer instructions of commodity processors.
    7985
    8086\subsection{Parabix Regular Expression Matching}
     87}
  • docs/Working/icGrep/evaluation.tex

    r4469 r4472  
    4949
    5050In order to support evaluation of bitwise methods, as well as to support
    51 the teaching of those methods and ongoing research, icGrep has an array
     51the teaching of those methods and ongoing research, \icGrep{} has an array
    5252of command-line options.   This makes it relatively straightforward
    5353to report on certain performance aspects of ICgrep, while others require
     
    5656For example, the command-line switch {\tt -disable-matchstar} can be used
    5757to eliminate the use of the MatchStar operation for handling
    58 Kleene-* repetition of character classes.   In this case, icGrep substitutes
     58Kleene-* repetition of character classes.   In this case, \icGrep{} substitutes
    5959a while loop that iteratively extends match results.   
    6060Surprisingly, this
     
    7676number of non-nullable pattern elements between the if-tests
    7777can be set with the {\tt -if-insertion-gap=} option.   The
    78 default value in icGrep is 3, setting the gap to 100 effectively
     78default value in \icGrep{} is 3, setting the gap to 100 effectively
    7979turns of if-insertion.   Eliminating if-insertion sometimes improves
    8080performance by avoiding the extra if tests and branch mispredications.
  • docs/Working/icGrep/icgrep.tex

    r4455 r4472  
    1010\usepackage[utf8]{inputenc}
    1111
     12
     13
     14\newcommand{\comment}[1]{}
     15\newcommand{\icGrep}[1]{icGrep}
     16\newcommand{\ICgrep}[1]{ICgrep}
    1217
    1318
  • docs/Working/icGrep/introduction.tex

    r4452 r4472  
    11\section{Introduction}
    22
    3 The venerable Unix {\tt grep} program is an everyday tool widely used
     3Unix {\tt grep} is a tool widely used
    44to search for lines in text files matching a given regular expression
    55pattern.   Historical comments ...
     
    1111or GPU technologies to accelerate multiple instances of independent
    1212matching problems. 
    13 Scarpazza \cite{scarpazza2011top} used SIMD and multicore parallelism
     13Scarpazza~\cite{scarpazza2011top} used SIMD and multicore parallelism
    1414to accelerate small ruleset tokenization applications on the Cell
    15 Broadband Engine while Valaspura \cite{salapura2012accelerating}
     15Broadband Engine. Valaspura~\cite{salapura2012accelerating}
    1616built on these techniques to accelerate business analytics applications
    1717using SSE instructions on commodity processors.   
    18 Zu et al \cite{zu2012gpu} use GPU technology to implement NFA-based regular
     18Zu et al~\cite{zu2012gpu} use GPU technology to implement NFA-based regular
    1919expression matching with parallelism devoted both to processing
    2020a compressed active state array as well as to handling matching of
     
    4141Unicode regular expression search tool, building on the bitwise
    4242data parallel methods of the Parabix framework combined with the
    43 dynamic compilation capabilities of LLVM.   The result is ICgrep,
     43dynamic compilation capabilities of LLVM~\cite{lattner2004llvm}.
     44The result is \icGrep{},
    4445a high-performance, full-featured open-source grep implementation
    4546with systematic support for Unicode regular expressions addressing the
    4647requirements of Unicode Technical Standard \#18 \cite{davis2012unicode}.  As an alternative
    47 to classical grep implementations, ICgrep offers dramatic performance
     48to classical grep implementations, \icGrep{} offers dramatic performance
    4849acceleration in ASCII-based and Unicode matching performance alike.
    4950
    5051The remainder of this paper is organized as follows.   Section \ref{sec:background}
    5152presents background material dealing with Unicode regular expressions,
    52 LLVM, the Parabix framework and regular expression matching techniques
     53%LLVM,
     54the Parabix framework and regular expression matching techniques
    5355using bitwise data parallelism.   
    5456Section \ref{sec:seqpar} expands on previous work on bitwise data
     
    6062Parabix techniques that we have developed to address them. 
    6163Section \ref{sec:architecture} describes the overall architecture of
    62 the ICgrep implementation with a focus on the integration of Parabix
     64the \icGrep{} implementation with a focus on the integration of Parabix
    6365and LLVM technologies.
    64 Section \ref{sec:evaluation} evaluates the performance of ICgrep on
     66Section \ref{sec:evaluation} evaluates the performance of \icGrep{} on
    6567several types of matching problems with contemporary competitors,
    6668including the latest versions of GNU grep, pcregrep, ugrep of the
  • docs/Working/icGrep/paradigm.tex

    r4448 r4472  
    2020is transposition of input data.   In previous work, the Parabix transform
    2121has been reported as imposing an amoritized cost of 1 CPU cycle/input byte,
    22 when working with SSE2 \cite{lin2012parabix}.  This is consistent with {\tt icGrep}
     22when working with SSE2 \cite{lin2012parabix}.  This is consistent with \icGrep{}
    2323results.  However, the cost of the cost of this transposition can
    2424be hidden through multithreading and pipeline parallelism, having one
     
    4646
    4747We have incorporated an elegant technique for bounded repetitions
    48 in icGrep.   This technique allows the matches to $C{m,n}$ for some
     48in \icGrep{}.   This technique allows the matches to $C{m,n}$ for some
    4949character class $C$, lower bound $m$ and upper bound $n$ to be determined
    5050in $\lceil{\log_2 m}\rceil + \lceil{\log_2{n-m}}\rceil$ steps. 
     
    6666is yes: whenever the bit vector of match positions in play for the current
    6767input block reduce to all zero, the remainder of the pattern can be
    68 skipped for processing the block.   This method has been implemented in icGrep.
     68skipped for processing the block.   This method has been implemented in \icGrep{}.
    6969
    7070
  • docs/Working/icGrep/unicode-re.tex

    r4471 r4472  
    33\subsection{toUTF8 Transformation}\label{sec:Unicode:toUTF8}
    44
    5 The icGrep parser generates an abstract syntax tree (AST) that represents an input regular expression over code points.  This AST is passed as input to a toUTF-8 transformation that generates a new AST that represents the equivalent regular expression over UTF-8 byte sequences. The transformation accomplishes this by first determining the number of UTF-8 bytes that are required to represent each code point contained within each character class.  The code points are then split into sequences of bytes, with each byte containing the necessary UTF-8 prefix. The UTF-8 encoded bytes are each assigned to a new character class in the new AST.  For an example, consider the following regular expression that consists entirely of multibyte Unicode characters: `\verb:\u{244}[\u{2030}-\u{2137}]:`.  The AST from the parser would represent this as a sequence starting with a character class containing the code point \verb:0x244: followed by a second character class containing the range from \verb:0x2030: to \verb:0x2137:.  After being passed through the toUTF-8 transformation this AST would become considerably more complex.  The first code point in the sequence would be encoded as the two byte sequence `\verb:\u{C9}\u{84}:`. The character class containing the range, which is a range of three byte sequences would be expanded into the series of sequences and alternations that are necessary to specify all of the possible byte encodings that would be contained within the range.  The UTF-8 encoded regular expression for the range \verb:[\u{2030}-\u{2137}]: would be encoded as follows:
     5The \icGrep{} parser produces a representation of an input regular expression over code points.
     6Parabix, however, operates on fixed size code \emph{units} instead of varyingly sized code \emph{points}.
     7We devised a toUTF-8 transformation that converts the expression over code points into an equivalent expression over code units.
     8%The transformation accomplishes this by first determining the number of UTF-8 bytes
     9%required to represent each code point contained within each character class.
     10The process splits code points into corresponding sequences of bytes (UTF-8 code units), %, with each byte containing the necessary UTF-8 prefix.
     11%The UTF-8 encoded bytes are each assigned to
     12and assigns these code units to new character classes.% in the new AST.
     13Consider the multibyte Unicode regular expression `\verb:\u{244}[\u{2030}-\u{2137}]:`.
     14The parser produces a sequence starting with a character class for \verb:0x244:
     15followed by a character class for the range from \verb:0x2030: to \verb:0x2137:.
     16After toUTF-8, %this AST becomes more complex.
     17the first code point in the sequence becomes the two byte sequence `\verb:\u{C9}\u{84}:`,
     18and the character class for the range%, which is a range of three byte sequences,
     19expands into a series of sequences and alternations necessary to specify the byte encodings within the range.
     20The UTF-8 encoded regular expression for the range \verb:[\u{2030}-\u{2137}]: becomes:
    621\newline
    722
     
    1227}
    1328
    14 The benefit of transforming the regular expression immediately after parsing from being a regular expression over code points into a regular expression over bytes is that it simplifies the rest of the compiler, as the compiler then only needs to be concerned with single bytes as opposed to code points, which vary in size.
     29%The benefit of transforming the regular expression immediately after parsing from being a regular expression over code points into a regular expression over bytes is that it simplifies the rest of the compiler, as the compiler then only needs to be concerned with single bytes as opposed to code points, which vary in size.
    1530
    1631
Note: See TracChangeset for help on using the changeset viewer.