# Changeset 4472 for docs

Timestamp:
Feb 6, 2015, 10:30:04 PM
Message:

General clean up and to-UTF8

Location:
docs/Working/icGrep
Files:
8 edited

• ## docs/Working/icGrep/architecture.tex

 r4466 As shown in Figure \ref{fig:compiler}, \icGrep{} is composed of three logical layers: \RegularExpression{}, \Pablo{} and the LLVM layer, each with their own intermediate representation (IR), transformation and compilation modules.
 %
 The AST is a minimalistic representation that, unlike traditional \RegularExpression{} engines, is not converted into an NFA or DFA for further processing.
 %
 Instead, \icGrep{} passes the AST into the transformation module, which includes a set of \RegularExpression{}-specific optimization passes.
 %
 The initial \emph{Nullable} pass determines whether the \RegularExpression{} contains any prefixes or suffixes that may be removed or
 the only complexities it introduces are the generation of Phi nodes, linking of statically-compiled functions, and assignment of carry variables.
 %
 It produces the dynamically-generated match function used by \icGrep{}.
 \subsection{Dynamic Grep Engine}
 As shown in Figure \ref{fig:execution}, \icGrep{} takes the input data and transposes it into 8 parallel bit streams through the S2P module. The required streams, e.g.\ the line break stream, can then be generated from the 8 basis bit streams. The JIT function retrieves the 8 basis bits and the required streams from their memory addresses and starts the matching process. A match scanner scans through this bit stream and calculates the total counts or writes the context of each match position.
 We can also apply a pipeline parallelism strategy to further speed up \icGrep{}. S2P and the Required Streams Generator can be processed in a separate thread and started even before the dynamic compilation starts. The output of S2P and the Required Streams Generator, that is, the 8 basis bit streams and the required streams,
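The S2P transposition step described in this hunk can be sketched in plain Python (a scalar model only; the function name and loop structure are illustrative and not from the icGrep sources, which perform this with SIMD operations):

```python
def s2p(data: bytes) -> list[int]:
    """Transpose a byte stream into 8 parallel basis bit streams.

    Basis bit stream i holds, at position j, bit i (MSB = stream 0)
    of input byte j -- a plain-Python model of the Parabix S2P step.
    Streams are modeled as unbounded integers, bit j = position j.
    """
    basis = [0] * 8
    for j, byte in enumerate(data):
        for i in range(8):
            if byte & (0x80 >> i):
                basis[i] |= 1 << j  # set bit j of stream i
    return basis

# 'a'=0x61, 'b'=0x62, 'c'=0x63 share their three high bits, so
# streams 0..2 are constant across all three positions:
streams = s2p(b"abc")
```

A real implementation amortizes this to about one cycle per input byte (as the paper reports for SSE2) by transposing whole SIMD registers at a time rather than looping over bits.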
• ## docs/Working/icGrep/background.tex

 r4452 pattern ``\verb:^[A-Z][a-z]+:'' (``extended'' syntax).  Here, ``\verb:^:'' is a zero-width assertion matching only at the start of a line, ``\verb:[A-Z]:'' is a character class that matches any single character in the contiguous range of characters from A through Z, while the plus operator in ``\verb:[a-z]+:'' denotes repetition of one or more lower case ASCII letters.
 Beyond named properties, Unicode Technical Standard \#18 defines additional requirements for Unicode regular expressions, at three levels of complexity~\cite{davis2012unicode}.  Level 1 generally relates to properties expressed in terms of individual Unicode codepoints, while level 2 introduces complexities due to codepoint sequences that form grapheme clusters,
 The Parabix toolchain is a set of compilers and run-time libraries designed to take advantage of the SIMD features of commodity processors to support high-performance streaming text processing. Parabix leverages a bit-parallel transform representation of text, where a text $T$ is represented as a set of parallel bit streams $B_i$, such that bit $j$ of stream $B_i$ is the $i^{\mbox{th}}$ bit of character code unit $j$ of $T$.
 The Parabix methods have been used to accelerate Unicode transcoding~\cite{cameron2008case}, protein search~\cite{green2009modeling}, XML parsing~\cite{cameron2011parallel}, and, most recently, regular expression search~\cite{cameron2014bitwise}.
 \comment{ \subsection{LLVM} LLVM features a flexible multi-stage compilation structure that can be organized in passes in many ways, including fully dynamic just-in-time compilation.  Of particular importance to the \icGrep{} project, the LLVM IR supports arbitrarily-sized vectors of arbitrary-width integers, well suited to code generation targeting the SIMD integer instructions of commodity processors. \subsection{Parabix Regular Expression Matching} }
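A toy model of regular expression matching over this bit-stream representation (pure Python on unbounded integers; `cc`, `advance`, and `match_ab` are illustrative names, not Parabix APIs):

```python
def cc(text: bytes, ch: int) -> int:
    """Character-class bit stream: bit j is set iff text[j] == ch.
    (Parabix derives such streams from the basis bits with bitwise
    logic; we compute them naively here.)"""
    m = 0
    for j, b in enumerate(text):
        if b == ch:
            m |= 1 << j
    return m

def advance(marker: int) -> int:
    """Move every match marker one position forward in the stream."""
    return marker << 1

def match_ab(text: bytes) -> int:
    """Bit-parallel match of the two-character pattern 'ab': the
    result has a bit set at the final character of each occurrence."""
    m = advance(cc(text, ord('a')))
    return m & cc(text, ord('b'))
```

All positions of the input are examined simultaneously by each bitwise operation, which is what makes the approach attractive on wide SIMD registers.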
• ## docs/Working/icGrep/evaluation.tex

 r4469 In order to support evaluation of bitwise methods, as well as to support the teaching of those methods and ongoing research, \icGrep{} has an array of command-line options.  This makes it relatively straightforward to report on certain performance aspects of \icGrep{}, while others require
 For example, the command-line switch {\tt -disable-matchstar} can be used to eliminate the use of the MatchStar operation for handling Kleene-* repetition of character classes.  In this case, \icGrep{} substitutes a while loop that iteratively extends match results. Surprisingly, this
 number of non-nullable pattern elements between the if-tests can be set with the {\tt -if-insertion-gap=} option.  The default value in \icGrep{} is 3; setting the gap to 100 effectively turns off if-insertion.  Eliminating if-insertion sometimes improves performance by avoiding the extra if-tests and branch mispredictions.
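For reference, the MatchStar operation that {\tt -disable-matchstar} turns off, and the iterative while-loop fallback, can both be modeled in Python on unbounded integers. This follows the published bitwise data parallel formulation~\cite{cameron2014bitwise}; it is a sketch, not code from the icGrep sources:

```python
def match_star(m: int, c: int) -> int:
    """MatchStar(M, C) = (((M & C) + C) ^ C) | M.

    Extends each marker in M through a contiguous run of positions
    in the character-class stream C: the addition's carry ripples
    through the 1-run, so C* costs a constant number of operations.
    """
    return (((m & c) + c) ^ c) | m

def match_star_loop(m: int, c: int) -> int:
    """Iterative fallback modeling the -disable-matchstar while
    loop: advance markers one C position at a time to a fixed point."""
    while True:
        nxt = m | ((m & c) << 1)
        if nxt == m:
            return m
        m = nxt
```

The two agree on the cases shown below; the single-addition form is what makes Kleene-* cheap on SIMD registers, at the price of long-stream carry propagation across blocks.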
• ## docs/Working/icGrep/icgrep.tex

• ## docs/Working/icGrep/introduction.tex

 r4452 \section{Introduction}
 Unix {\tt grep} is a tool widely used to search for lines in text files matching a given regular expression pattern.  Historical comments ...
 or GPU technologies to accelerate multiple instances of independent matching problems. Scarpazza~\cite{scarpazza2011top} used SIMD and multicore parallelism to accelerate small ruleset tokenization applications on the Cell Broadband Engine. Salapura~\cite{salapura2012accelerating} built on these techniques to accelerate business analytics applications using SSE instructions on commodity processors. Zu et al.~\cite{zu2012gpu} use GPU technology to implement NFA-based regular expression matching with parallelism devoted both to processing a compressed active state array as well as to handling matching of
 Unicode regular expression search tool, building on the bitwise data parallel methods of the Parabix framework combined with the dynamic compilation capabilities of LLVM~\cite{lattner2004llvm}.  The result is \icGrep{}, a high-performance, full-featured open-source grep implementation with systematic support for Unicode regular expressions, addressing the requirements of Unicode Technical Standard \#18~\cite{davis2012unicode}.  As an alternative to classical grep implementations, \icGrep{} offers dramatic performance acceleration in ASCII-based and Unicode matching alike. The remainder of this paper is organized as follows.
 Section \ref{sec:background} presents background material dealing with Unicode regular expressions
 %LLVM, the Parabix framework and regular expression matching techniques
 using bitwise data parallelism. Section \ref{sec:seqpar} expands on previous work on bitwise data
 Parabix techniques that we have developed to address them. Section \ref{sec:architecture} describes the overall architecture of the \icGrep{} implementation with a focus on the integration of Parabix and LLVM technologies. Section \ref{sec:evaluation} evaluates the performance of \icGrep{} on several types of matching problems with contemporary competitors, including the latest versions of GNU grep, pcregrep, ugrep of the
 r4448 is transposition of input data.  In previous work, the Parabix transform has been reported as imposing an amortized cost of 1 CPU cycle/input byte when working with SSE2~\cite{lin2012parabix}.  This is consistent with \icGrep{} results.  However, the cost of this transposition can be hidden through multithreading and pipeline parallelism, having one
 We have incorporated an elegant technique for bounded repetitions in \icGrep{}.  This technique allows the matches to $C\{m,n\}$ for some character class $C$, lower bound $m$ and upper bound $n$ to be determined in $\lceil{\log_2 m}\rceil + \lceil{\log_2{(n-m)}}\rceil$ steps.
 is yes: whenever the bit vector of match positions in play for the current input block reduces to all zero, the remainder of the pattern can be skipped for processing the block.  This method has been implemented in \icGrep{}.
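The logarithmic step bound for bounded repetition rests on binary doubling of repetition counts. A Python sketch of the doubling idea for the exact-repetition case $C\{k\}$ (marker and class streams modeled as unbounded integers; the helper name is illustrative, not from icGrep):

```python
def bounded_repeat(m: int, c: int, k: int) -> int:
    """Advance markers in m through exactly k class-C positions
    in O(log k) bitwise steps via binary doubling.

    s marks positions that end a run of at least 2^i consecutive
    C positions; it doubles each iteration as s & (s << 2^i).
    A marker may jump 2^i places only if every skipped position
    is in C, so composing jumps for the set bits of k gives C{k}.
    """
    s, shift = c, 1
    while k:
        if k & 1:
            m = (m << shift) & s
        s &= s << shift   # runs of 2^i -> runs of 2^(i+1)
        shift <<= 1
        k >>= 1
    return m
```

Handling $C\{m,n\}$ then takes the $m$ mandatory repetitions this way, followed by up to $n-m$ optional ones accumulated with the same doubling, matching the stated $\lceil\log_2 m\rceil + \lceil\log_2 (n-m)\rceil$ bound.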
 r4471 \subsection{toUTF8 Transformation}\label{sec:Unicode:toUTF8}
 The \icGrep{} parser produces a representation of an input regular expression over code points. Parabix, however, operates on fixed-size code \emph{units} instead of variably-sized code \emph{points}. We devised a toUTF-8 transformation that converts the expression over code points into an equivalent expression over code units.
 The process splits code points into corresponding sequences of bytes (UTF-8 code units) and assigns these code units to new character classes.
 Consider the multibyte Unicode regular expression \verb:\u{244}[\u{2030}-\u{2137}]:. The parser produces a sequence starting with a character class for \verb:0x244: followed by a character class for the range from \verb:0x2030: to \verb:0x2137:. After toUTF-8, the first code point in the sequence becomes the two-byte sequence \verb:\u{C9}\u{84}:, and the character class for the range expands into a series of sequences and alternations necessary to specify the byte encodings within the range. The UTF-8 encoded regular expression for the range \verb:[\u{2030}-\u{2137}]: becomes: \newline }
 The benefit of transforming the regular expression into an expression over bytes immediately after parsing is that it simplifies the rest of the compiler, which then only needs to be concerned with single bytes as opposed to code points, which vary in size.
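The splitting step described in this hunk is UTF-8 encoding applied per code point. A small Python sketch, hand-rolled to make the prefix assignment explicit (`utf8_units` is an illustrative name, not an icGrep function):

```python
def utf8_units(cp: int) -> list[int]:
    """Split a code point into its UTF-8 code units (bytes),
    attaching the appropriate UTF-8 prefix to each unit."""
    if cp < 0x80:        # 1 byte:  0xxxxxxx
        return [cp]
    if cp < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)]
    if cp < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return [0xE0 | (cp >> 12),
                0x80 | ((cp >> 6) & 0x3F),
                0x80 | (cp & 0x3F)]
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return [0xF0 | (cp >> 18),
            0x80 | ((cp >> 12) & 0x3F),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F)]

# \u{244} becomes the two-byte sequence C9 84, as in the text:
assert utf8_units(0x244) == [0xC9, 0x84]
```

The range endpoints \verb:0x2030: and \verb:0x2137: both map to three-byte sequences (E2 80 B0 and E2 84 B7), which is why the transformed class becomes a tree of sequences and alternations over the middle and final bytes.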