Changeset 4490
- Timestamp:
- Feb 10, 2015, 5:55:14 PM (4 years ago)
- Location:
- docs/Working/icGrep
- Files:
-
- 6 edited
Legend:
- Unmodified
- Added
- Removed
-
docs/Working/icGrep/background.tex
r4488 r4490 1 1 \section{Background} \label{sec:background} 2 \subsection{Unicode Regular Expression Requirements}3 Unicode is system for organizing the characters from all languages4 and symbol systems into a single numeric space and encoding those5 values in convenient formats for computerized processing.6 The numeric values form a space of integer {\em codepoints} from 07 through hexadecimal 10FFFF. The computerized formats represent8 codepoint values as (possibly variable length) sequences of fixed-width {\em code units}.9 UTF-8 represents each codepoint using a sequence of one to four octets (8-bit bytes),10 UTF-16 represents each codepoint using one or two 16-bit code units and UTF-3211 represents each codepoint as a single 32-bit unit. The format used most12 often for storage and transmission of Unicode data is UTF-8; this is the format13 assumed through this paper.14 2 15 Traditional grep syntax is oriented towards 3 %\subsection{Unicode Regular Expression Requirements} 4 \paragraph{Unicode Regular Expressions.} 5 Traditional regular expression syntax is oriented towards 16 6 string search using regular expressions over ASCII or extended-ASCII byte sequences. 17 7 A grep search for a line beginning with a capitalized word might use the … … 28 18 practical to use named character classes, such as \verb:Lu: for upper case letters and 29 19 \verb:Ll: for lower case letters. Using these names, our search might be rewritten 30 to find capitalized words in any language as ``\verb!^[[:Lu:]][[:Ll:]]+!'' (Posix31 syntax) or``\verb:^\p{Lu}\p{Ll}+:'' (Perl-compatible syntax).20 to find capitalized words in any language as %``\verb!^[[:Lu:]][[:Ll:]]+!'' (Posix syntax) or 21 ``\verb:^\p{Lu}\p{Ll}+:'' (Perl-compatible syntax). 32 22 The Unicode consortium has defined an extensive list of named properties that can 33 23 be used in regular expressions. … … 51 41 denotes the class of all non-ASCII lower case letters. 52 42 53 \subsection{Parabix} 54 55 The Parabix toolchain is a set of compilers and run-time libraries designed 56 to take advantage of the SIMD features of commodity processors 57 to support high-performance streaming text processing. 58 % based on a bit-parallel 59 %transform representation of text. 60 Parabix leverages a bit-parallel transform representation of text, where a 61 text $T$ is represented as a set of parallel 62 bit streams $B_i$, such that bit $j$ of stream $B_i$ is the $i^{\mbox{th}}$ bit of 63 character code unit $j$ of $T$. 64 The Parabix methods have been used to 65 accelerate Unicode transcoding~\cite{cameron2008case}, protein search~\cite{green2009modeling}, 66 XML parsing~\cite{cameron2011parallel}, and, most recently, regular expression search~\cite{cameron2014bitwise}. 43 %\subsection{Bitwise Data Parallel Regular Expression Matching} 44 \paragraph{Bitwise Data Parallel Matching.} 45 Regular expression search using bitwise data parallelism has been 46 recently introduced and shown to considerably outperform methods 47 based on DFAs or NFAs \cite{cameron2014bitwise}. In essence, the 48 method is 100\% data parallel, considering all input positions in a file 49 simultaneously. A set of parallel bit streams is computed, with each bit 50 position corresponding to one code-unit position within input character stream. 51 {\em Character class streams}, such as {\tt [d]} for the stream that marks the 52 position of ``d'' characters and {\tt [a-z]} for the stream of lower case 53 ASCII alphabetics are first computed in a fully data-parallel manner. 54 Then the matching process proper begins taking advance of bitwise logic 55 and shifting operations as well as an operation for finding all matches to a 56 character class repetition known as MatchStar. At each step of the 57 process a {\em marker} bit stream identifying the full set of positions 58 within the input data stream that match the regular expression to this point. 67 59 68 60 … … 80 72 \end{figure} 81 73 82 83 The central concept in regular expression matching is that of marker bitstreams.84 At each step in the matching process, a marker bitstream indicates the matches85 that have been found to this point. The bitwise matching method uses the86 operations Advance to move an entire stream of markers one step forward, and MatchStar87 to find all positions matched after a character class repetition.88 74 For example, 89 75 Figure \ref{fig:bitwisematch} shows how the regular expression {\tt d[a-z]*ed} is matched 90 against some input text using bitwise methods. 76 against some input text using bitwise methods. In this diagram we use periods to denote 77 0 bits so that the 1 bits stand out. 91 78 In the first step the character class stream 92 {\tt [d]} is matched and the results advanced one position to produce marker bitstream $M_1$. Four match results are now in play 93 simultaneously. 94 The next step applies the 95 MatchStar operation to find all the matches that may then be reached with the Kleene-* repetition 96 {\tt [a-z]*} ($M_2$). We now have pending matches at almost all possible positions. However, most of 97 the positions are filtered out with the match to {\tt [e]} in the computation of $M_3$. 98 The final step produces marker stream $M_4$ indicating that single position at which the entire regular expression is matched. 79 {\tt [d]} is matched and the results shifted one position (Advance) to produce marker bitstream $M_1$. 80 Four matches indicated by marker bits are now in play simultaneously. 81 The next step applies the MatchStar operation to find all the matches that may then be 82 reached with the Kleene-* repetition 83 {\tt [a-z]*} ($M_2$). This produces pending matches at many positions. However, there 84 is no need to consider these matches one at a time using lazy or greedy matching strategies. 85 Rather, the full marker stream $M_3$ of remaining possibilites after matching {\tt [e]} is easily 86 computed using a shift and bitwise and. 87 The final step produces marker stream $M_4$ indicating that single position 88 at which the entire regular expression is matched. 99 89 90 The MatchStar operation turns out to be surprisingly simple \cite{cameron2014bitwise}. 91 \[\text{MatchStar}(M, C) = (((M \wedge C) + C) \oplus C) \vee M\] 92 A single bit stream addition operation suffices to find all reachable 93 positions from marker stream $M$ through character class stream $C$. 94 Interestingly, the MatchStar operation also has application to the 95 parallelized long-stream addition itself \cite{cameron2014bitwise}, as well as 96 the bit-parallel edit distance algorithm of Myers\cite{myers1999fast}. 97 98 The Parabix toolchain \cite{lin2012parabix} provides a set of compilers 99 and run-time libraries that target the SIMD instructions of commodity 100 processors (e.g., SSE or AVX instructions on x86-64 architecture). 101 Input is processed in blocks of code units equal to the size in bits 102 of the SIMD registers, for example, 128 bytes at a time using 128-bit registers. 103 Using the Parabix facilities, the bitwise data parallel approach to 104 regular expression search was shown to deliver substantial performance 105 acceleration for traditional ASCII regular expression matching tasks, 106 often 5X or better \cite{cameron2014bitwise}. 100 107 101 108 … … 117 124 targetting the SIMD integer instructions of commodity processors. 118 125 126 127 119 128 \subsection{Parabix Regular Expression Matching} 120 129 } -
docs/Working/icGrep/bitgrep.bib
r4452 r4490 324 324 title={GPU-based NFA implementation for memory efficient high speed regular expression matching}, 325 325 author={Zu, Yuan and Yang, Ming and Xu, Zhonghu and Wang, Lin and Tian, Xin and Peng, Kunyang and Dong, Qunfeng}, 326 booktitle={ACM SIGPLAN Notices}, 327 volume={47}, 328 number={8}, 326 booktitle={PPoPP '12 - Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming}, 329 327 pages={129--140}, 330 328 year={2012}, -
docs/Working/icGrep/icgrep.tex
r4476 r4490 52 52 \input{background.tex} 53 53 54 \input{paradigm.tex}54 %\input{paradigm.tex} 55 55 56 56 \input{unicode-re.tex} -
docs/Working/icGrep/introduction.tex
r4489 r4490 1 1 \section{Introduction} 2 2 3 Unix {\tt grep} is a tool widely used4 to search for lines in text files matching a given regular expression5 pattern. Historical comments ...6 3 7 Unicode regular expression matching adds performance challenges... 4 Although well-established technical standards exist for 5 Unicode regular expressions \cite{davis2012unicode}, most of today's 6 regular expression processing toolsets fail to support the full set of processing 7 features even at the most basic level \cite{stewart2013unicode}. 8 One of the fundamental issues is performance and so it makes good sense 9 to consider the ways in which parallel processing approaches can help 10 address the gap. 8 11 9 12 Efforts to improve the performance of regular expression matching through … … 33 36 require thousands of DFA states for named Unicode properties. 34 37 Building on the Parabix framework, Cameron et al.~\cite{cameron2014bitwise} introduce 35 regular expression matching using the bitwise 36 data parallel approach together with the MatchStar primitive 37 for efficient implementation of 38 Kleene-* character-class repetitions. 38 regular expression matching using a new bitwise 39 data parallel approach. 39 40 40 41 In this paper, we report on the use of the implementation of a full … … 44 45 The result is \icGrep{}, 45 46 a high-performance, full-featured open-source grep implementation 46 with systematic support for Unicode regular expressions addressing the 47 requirements of Unicode Technical Standard \#18 \cite{davis2012unicode}. As an alternative 47 with systematic support for Unicode regular expressions. As an alternative 48 48 to classical grep implementations, \icGrep{} offers dramatic performance 49 acceleration in ASCII-based and Unicode matching performance alike.49 acceleration in Unicode regular expression matching. 50 50 51 51 The remainder of this paper is organized as follows. Section \ref{sec:background} … … 54 54 the Parabix framework and regular expression matching techniques 55 55 using bitwise data parallelism. 56 Section \ref{sec:seqpar} expands on previous work on bitwise data57 parallelism by more fully characterizing the paradigm and documenting58 important techniques.56 %Section \ref{sec:seqpar} expands on previous work on bitwise data 57 %parallelism by more fully characterizing the paradigm and documenting 58 %important techniques. 59 59 Section \ref{sec:Unicode} addresses 60 60 the issues and performance challenges associated with meeting Unicode -
docs/Working/icGrep/unicode-re.tex
r4489 r4490 1 \section{Unicode Regular Expression Methods}\label{sec:Unicode} 2 1 \section{Bitwise Methods for Unicode}\label{sec:Unicode} 3 2 \subsection{UTF-8 Transformation}\label{sec:Unicode:toUTF8} 4 3 … … 21 20 % The UTF-8 encoded regular expression for the range \verb:[\u{2030}-\u{2137}]: becomes: 22 21 % \newline 23 24 25 The \icGrep{} parser produces a representation of an input regular expression over variable-length code \emph{points}. 26 % 27 Parabix, however, operates on fixed-size code \emph{units}. 28 % 29 To support code points, the toUTF8 transformation converts the expression into an equivalent expression over code units 30 by splitting code points into corresponding sequences of bytes (UTF-8 code units). 31 %and assigning them to new character classes. 32 % 22 The \icGrep{} regular expression parser produces a representation of an input regular expression over Unicode code points. 23 To process UTF-8 data streams, however, these expressions must first be converted to equivalent expressions in terms of UTF-8 code units. 24 \% 33 25 Consider the Unicode regular expression `\verb:\u{244}[\u{2030}-\u{2137}]:`. 34 26 % 35 27 The parser produces a sequence starting with a \verb:0x244: followed by the range of \verb:0x2030: to \verb:0x2137:. 36 28 % 37 After toUTF8, the first code 29 After toUTF8, the first codepoint in the sequence becomes the two byte sequence ``\verb:\u{C9}\u{84}:'', 38 30 and the range expands into the series of sequences and alternations shown below: 39 31 \newline
Note: See TracChangeset
for help on using the changeset viewer.