# Changeset 4490 for docs

Ignore:
Timestamp:
Feb 10, 2015, 5:55:14 PM (4 years ago)
Message:

Intro and Background, remove section 3

Location:
docs/Working/icGrep
Files:
6 edited

Unmodified
Removed
• ## docs/Working/icGrep/background.tex

 r4488 \section{Background} \label{sec:background} \subsection{Unicode Regular Expression Requirements} Unicode is system for organizing the characters from all languages and symbol systems into a single numeric space and encoding those values in convenient formats for computerized processing. The numeric values form a space of integer {\em codepoints} from 0 through hexadecimal 10FFFF.  The computerized formats represent codepoint values as (possibly variable length) sequences of fixed-width {\em code units}. UTF-8 represents each codepoint using a sequence of one to four octets (8-bit bytes), UTF-16 represents each codepoint using one or two 16-bit code units and UTF-32 represents each codepoint as a single 32-bit unit.  The format used most often for storage and transmission of Unicode data is UTF-8; this is the format assumed through this paper. Traditional grep syntax is oriented towards %\subsection{Unicode Regular Expression Requirements} \paragraph{Unicode Regular Expressions.} Traditional regular expression syntax is oriented towards string search using regular expressions over ASCII or extended-ASCII byte sequences. A grep search for a line beginning with a capitalized word might use the practical to use named character classes, such as \verb:Lu: for upper case letters and \verb:Ll: for lower case letters.   Using these names, our search might be rewritten to find capitalized words in any language as \verb!^[[:Lu:]][[:Ll:]]+!'' (Posix syntax)  or \verb:^\p{Lu}\p{Ll}+:'' (Perl-compatible syntax). to find capitalized words in any language as %\verb!^[[:Lu:]][[:Ll:]]+!'' (Posix syntax)  or \verb:^\p{Lu}\p{Ll}+:'' (Perl-compatible syntax). The Unicode consortium has defined an extensive list of named properties that can be used in regular expressions. denotes the class of all non-ASCII lower case letters. \subsection{Parabix} The Parabix toolchain is a set of compilers and run-time libraries designed to take advantage of the SIMD features of commodity processors to support high-performance streaming text processing. % based on a bit-parallel %transform representation of text. Parabix leverages a bit-parallel transform representation of text, where a text $T$ is represented as a set of parallel bit streams $B_i$, such that bit $j$ of stream $B_i$ is the $i^{\mbox{th}}$ bit of character code unit $j$ of $T$. The Parabix methods have been used to accelerate Unicode transcoding~\cite{cameron2008case}, protein search~\cite{green2009modeling}, XML parsing~\cite{cameron2011parallel}, and, most recently, regular expression search~\cite{cameron2014bitwise}. %\subsection{Bitwise Data Parallel Regular Expression Matching} \paragraph{Bitwise Data Parallel Matching.} Regular expression search using bitwise data parallelism has been recently introduced and shown to considerably outperform methods based on DFAs or NFAs \cite{cameron2014bitwise}.   In essence, the method is 100\% data parallel, considering all input positions in a file simultaneously.   A set of parallel bit streams is computed, with each bit position corresponding to one code-unit position within input character stream. {\em Character class streams}, such as {\tt [d]} for the stream that marks the position of d'' characters and {\tt [a-z]} for the stream of lower case ASCII alphabetics are first computed in a fully data-parallel manner. Then the matching process proper begins taking advance of bitwise logic and shifting operations as well as an operation for finding all matches to a character class repetition known as MatchStar.  At each step of the process a {\em marker} bit stream identifying the full set of positions within the input data stream that match the regular expression to this point. \end{figure} The central concept in regular expression matching is that of marker bitstreams. At each step in the matching process, a marker bitstream indicates the matches that have been found to this point.   The bitwise matching method uses the operations Advance to move an entire stream of markers one step forward, and MatchStar to find all positions matched after a character class repetition. For example, Figure \ref{fig:bitwisematch} shows how the regular expression {\tt d[a-z]*ed} is matched against some input text using bitwise methods. against some input text using bitwise methods.  In this diagram we use periods to denote 0 bits so that the 1 bits stand out. In the first step the character class stream {\tt [d]} is matched and the results advanced one position to produce marker bitstream $M_1$.   Four match results are now in play simultaneously. The next step applies the MatchStar operation to find all the matches that may then be reached with the Kleene-* repetition {\tt [a-z]*} ($M_2$).  We now have pending matches at almost all possible positions.   However, most of the positions are filtered out with the match to {\tt [e]} in the computation of $M_3$. The final step produces marker stream $M_4$ indicating that single position at which the entire regular expression is matched. {\tt [d]} is matched and the results shifted one position (Advance) to produce marker bitstream $M_1$. Four matches indicated by marker bits are now in play simultaneously. The next step applies the  MatchStar operation to find all the matches that may then be reached with the Kleene-* repetition {\tt [a-z]*} ($M_2$).   This produces pending matches at many positions.  However, there is no need to consider these matches one at a time using lazy or greedy matching strategies. Rather, the full marker stream $M_3$ of remaining possibilites after matching {\tt [e]} is easily computed using a shift and bitwise and. The final step produces marker stream $M_4$ indicating that single position at which the entire regular expression is matched. The MatchStar operation turns out to be surprisingly simple \cite{cameron2014bitwise}. $\text{MatchStar}(M, C) = (((M \wedge C) + C) \oplus C) \vee M$ A single bit stream addition operation suffices to find all reachable positions from marker stream $M$ through character class stream $C$. Interestingly, the MatchStar operation also has application to the parallelized long-stream addition itself \cite{cameron2014bitwise}, as well as the bit-parallel edit distance algorithm of Myers\cite{myers1999fast}. The Parabix toolchain \cite{lin2012parabix} provides a set of compilers and run-time libraries that target the SIMD instructions of commodity processors (e.g., SSE or AVX instructions on x86-64 architecture). Input is processed in blocks of code units equal to the size in bits of the SIMD registers, for example, 128 bytes at a time using 128-bit registers. Using the Parabix facilities, the bitwise data parallel approach to regular expression search was shown to deliver substantial performance acceleration for traditional ASCII regular expression matching tasks, often 5X or better \cite{cameron2014bitwise}. targetting the SIMD integer instructions of commodity processors. \subsection{Parabix Regular Expression Matching} }
• ## docs/Working/icGrep/bitgrep.bib

 r4452 title={GPU-based NFA implementation for memory efficient high speed regular expression matching}, author={Zu, Yuan and Yang, Ming and Xu, Zhonghu and Wang, Lin and Tian, Xin and Peng, Kunyang and Dong, Qunfeng}, booktitle={ACM SIGPLAN Notices}, volume={47}, number={8}, booktitle={PPoPP '12 - Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming}, pages={129--140}, year={2012},
• ## docs/Working/icGrep/icgrep.tex

 r4489 \section{Unicode Regular Expression Methods}\label{sec:Unicode} \section{Bitwise Methods for Unicode}\label{sec:Unicode} \subsection{UTF-8 Transformation}\label{sec:Unicode:toUTF8} % The UTF-8 encoded regular expression for the range \verb:[\u{2030}-\u{2137}]: becomes: % \newline The \icGrep{} parser produces a representation of an input regular expression over variable-length code \emph{points}. % Parabix, however, operates on fixed-size code \emph{units}. % To support code points, the toUTF8 transformation converts the expression into an equivalent expression over code units by splitting code points into corresponding sequences of bytes (UTF-8 code units). %and assigning them to new character classes. % The \icGrep{} regular expression parser produces a representation of an input regular expression over Unicode code points. To process UTF-8 data streams, however, these expressions must first be converted to equivalent expressions in terms of UTF-8 code units. \% Consider the Unicode regular expression \verb:\u{244}[\u{2030}-\u{2137}]:. % The parser produces a sequence starting with a \verb:0x244: followed by the range of \verb:0x2030: to \verb:0x2137:. % After toUTF8, the first code point in the sequence becomes the two byte sequence \verb:\u{C9}\u{84}:'', After toUTF8, the first codepoint in the sequence becomes the two byte sequence \verb:\u{C9}\u{84}:'', and the range expands into the series of sequences and alternations shown below: \newline