source: docs/ICA3PP2015/background.tex @ 5433

Last change on this file since 5433 was 4782, checked in by cameron, 4 years ago

Figure placement according to Springer

File size: 7.6 KB
1\section{Background} \label{sec:background}
3\paragraph{Unicode Regular Expressions.}
4Traditional regular expression syntax is oriented towards
5string search using regular expressions over ASCII or extended-ASCII byte sequences.
6A grep search for a line beginning with a capitalized word might use the
7pattern ``\verb:^[A-Z][a-z]+:'' (``extended'' syntax).  Here, ``\verb:^:'' is a zero-width assertion
8matching only at the start of a line, ``\verb:[A-Z]:'' is a character class
9that matches any single character in the contiguous range of characters from A through Z,
10while the  plus operator in ``\verb:[a-z]+:'' denotes repetition of one or more lower
11case ASCII letters.   
13While explicit listing of characters of interest is
14practical with ASCII, it is less so with Unicode.   In the Unicode 7.0 database,
15there are 1490 characters categorized as upper case and 1841 categorized as lower case.
16Rather than explicitly listing the individual characters, then, it is more
17practical to use named character classes, such as \verb:Lu: for upper case letters and
18\verb:Ll: for lower case letters.   Using these names, our search might be rewritten
19to find capitalized words in any language as %``\verb!^[[:Lu:]][[:Ll:]]+!'' (Posix syntax)  or
20``\verb:^\p{Lu}\p{Ll}+:'' (Perl-compatible syntax).   
21The Unicode consortium has defined an extensive list of named properties that can
22be used in regular expressions.
24Beyond named properties, Unicode Technical Standard \#18 defines additional
25requirements for Unicode regular expressions, at three levels of
26complexity~\cite{davis2012unicode}.   Level 1 generally relates to
27properties expressed in terms of individual Unicode codepoints, while level 2
28introduces complexities due to codepoint sequences that form grapheme clusters,
29and level 3 relates to tailored locale-specific support.
30We consider only Unicode level 1 requirements
31in this paper, as most grep implementations are incomplete with respect
32the requirements even at this level.   
33The additional level 1 regular expression requirements primarily relate to larger
34classes of characters that are used in identifying line breaks, word breaks
35and case-insensitive matching.   
36Beyond this, there is one important syntactic extension: the ability to refine
37character class specifications using set
38intersection and subtraction.   For example, \verb:[\p{Greek}&&\p{Lu}]:
39denotes the class of upper case Greek letters, while \verb:[\p{Ll}--\p{ASCII}]:
40denotes the class of all non-ASCII lower case letters.
42\paragraph{Bitwise Data Parallel Matching.}
43Regular expression search using bitwise data parallelism has been
44recently introduced and shown to considerably outperform methods
45based on DFAs or NFAs \cite{cameron2014bitwise}.   In essence, the
46method is 100\% data parallel, considering all input positions in a file
47simultaneously.   A set of parallel bit streams is computed, with each bit
48position corresponding to one code-unit position within input character stream.
49Each bit stream may be considered an $L$-bit integer, where $L$ is the length
50of the input stream.  Matching is performed using bitwise logic and addition on these long integers.
52Consider the following simplified definition of regular expressions.
53A regular expression may be a character class $C$ or formed by
54composition.  If $R$ and $S$ are regular expressions, then the concatenation $RS$ is the regular
55expression standing for the set of all strings formed by concatenating
56one string from $R$ and one string from $S$, $R|S$ is the alternation
57standing for the union of the sets of strings from $R$ and $S$, and
58$R*$ is the Kleene-star closure denoting the set of strings
59formed from 0 or more occurrences of strings from $R$.
61{\em Character class streams}, such as CharClass({\tt [d]}) for the stream that marks the
62position of ``d'' characters and CharClass({\tt [a-z]}) for the stream of lower case
63ASCII alphabetics are first computed in a fully data-parallel manner. 
64Then the search process proceeds by computing {\em marker streams} that mark
65the positions of matches so far.  Each bit in a marker stream indicates
66that all characters prior to the given position have been matched.
67The initial marker stream $m_0$ consists of all
68ones, i.e., $m_0 = 2^{L} - 1$, indicating that all positions are in play.
69Matching then proceeds in accord with the  following equations.
72\mbox{\rm Match}(m, C) & = &  \mbox{\rm Advance}(\mbox{\rm CharClass}(C) \wedge m) \\
73\mbox{\rm Match}(m, RS) & = &  \mbox{\rm Match}(\mbox{\rm Match}(m, R), S) \\
74\mbox{\rm Match}(m, R|S) & = &  \mbox{\rm Match}(m, R) \vee \mbox{\rm Match}(m, S)) \\
75\mbox{\rm Match}(m, C*) & = &  \mbox{\rm MatchStar}(m, \mbox{\rm CharClass}(C)) \\
76\mbox{\rm Match}(m, R*) & = &  m \vee \mbox{\rm Match}(\mbox{\rm Match}(m, R), R*)
78Here, Advance is an operation that advances all markers by a single position. \[\mbox{\rm Advance}(m) = m+m\]
79MatchStar finds all matches of character class repetitions in a
80surprisingly simple manner \cite{cameron2014bitwise}.
81\[\text{MatchStar}(M, C) = (((M \wedge C) + C)  \oplus C) \vee M\]
82Interestingly, the MatchStar operation also has application to the
83parallelized long-stream addition itself \cite{cameron2014bitwise}, as well as
84the bit-parallel edit distance algorithm of Myers\cite{myers1999fast}.
85For repetitions of other regular expressions, the final recursive equation
86accounts for the repeated extension of current matches; this
87equation is applied until no new match positions result.
89For example,
90Figure \ref{fig:bitwisematch} shows how the regular expression {\tt d[a-z]*ed} is matched
91against some input text using bitwise methods.  In this diagram we use periods to denote
920 bits so that the 1 bits stand out.
93In the first step the character class stream
94{\tt [d]} is matched and the results shifted one position (Advance) to produce marker bitstream $M_1$.
95Five matches indicated by marker bits are now in play simultaneously. 
96The next step applies the  MatchStar operation to find all the matches that may then be
97reached with the Kleene-* repetition
98{\tt [a-z]*} ($M_2$).   This produces pending matches at many positions.  However, there
99is no need to consider these matches one at a time using lazy or greedy matching strategies.
100Rather, the full marker stream $M_3$ of remaining possibilities after matching {\tt [e]} is easily
101computed using bitwise logic and shift.
102The final step produces marker stream $M_4$ indicating the single position
103at which the entire regular expression is matched.
109input data  & \verb`dead dreams defeated.`\\
110$M_1 = {\text{Advance}}({\text{\tt [d]}})$ & \verb`.1..1.1......1......1`\\
111$M_2 = {\text{MatchStar}}(M_1, {\text{\tt [a-z]}})$ & \verb`.1111.111111.11111111`\\
112$M_3 = {\text{Advance}}(M_2 \wedge {\text{\tt [e]}})$ & \verb`..1.....1.....1.1..1.`\\
113$M_4 = {\text{Advance}}(M_3 \wedge {\text{\tt [d]}})$ & \verb`....................1`
117\caption{Matching {\tt d[a-z]*ed} using bitwise data parallelism}\label{fig:bitwisematch}
121The Parabix toolchain \cite{lin2012parabix} provides a set of compilers
122and run-time libraries that target the SIMD instructions of commodity
123processors (e.g., SSE or AVX instructions on x86-64 architecture).
124Input is processed in blocks of code units equal to the size in bits
125of the SIMD registers, for example, 128 bytes at a time using 128-bit registers.
126Using the Parabix facilities, the bitwise data parallel approach to
127regular expression search was shown to deliver substantial performance
128acceleration for traditional ASCII regular expression matching tasks,
129often 5$\times$ or better \cite{cameron2014bitwise}.
Note: See TracBrowser for help on using the repository browser.