source: docs/Working/icGrep/unicode-re.tex @ 4562

Last change on this file since 4562 was 4561, checked in by cameron, 4 years ago

Wrap up

File size: 7.2 KB
Line 
1\section{Bitwise Methods for UTF-8}\label{sec:Unicode}
2
3As described in the following section, \icGrep{} is a reimplementation
4of the bitwise data parallel method implemented on top of LLVM
5infrastructure and adapted for Unicode regular expression search
6through data streams represented in UTF-8.  In this section,
7we present the techniques we have used to extend the bitwise
8matching techniques to the variable-length encodings of UTF-8.
9
10The first requirement in implementing a regular expression processor
11over UTF-8 data streams is to translate Unicode regular expressions
12over codepoints to corresponding regular expressions over
13sequences of UTF-8 bytes or \emph{code units}.   The {\tt toUTF8} transformation
14performs this as a regular expression transformation, transforming
15input expressions such as `\verb:\u{244}[\u{2030}-\u{2137}]:`
16to the corresponding UTF-8 regular expression consisting of the series of sequences and alternations shown below:
17\newline
18{
19\small
20\verb:\xE2((\x84[\x80-\xB7])|(([\x81-\x83][\x80-\xBF])|(\x80[\xB0-\xBF]))):
21\newline
22}
23
24\paragraph{UTF-8 Byte Classification and Validation.}
25In UTF-8, bytes are classified as individual ASCII bytes, or as
26prefixes of two-, three-, or four-byte sequences, or as suffix bytes.
27In addition, we say that the {\em scope} bytes of a prefix are the
28immediately following byte positions at which a suffix byte is
29expected.
30Mismatches between scope expectations and occurrences of suffix
31bytes indicate errors  (we omit other error equations for brevity).
32Two helper streams are also useful.
33The Initial stream marks ASCII bytes and prefixes of multibyte sequences,
34while the NonFinal stream marks any position that is not the final
35position of a Unicode character.
36\begin{eqnarray*}
37\mbox{\rm ASCII} & = & \mbox{\rm CharClass}(\verb:[\x00-\x7F]:) \\
38\mbox{\rm Prefix} & = & \mbox{\rm CharClass}(\verb:[\xC2-\F4]:) \\
39\mbox{\rm Prefix3or4} & = & \mbox{\rm CharClass}(\verb:[\xE0-\xF4]:) \\
40\mbox{\rm Prefix4} & = & \mbox{\rm CharClass}(\verb:[\xF0-\xF4]:) \\
41\mbox{\rm Suffix} & = & \mbox{\rm CharClass}(\verb:[\x80-\xBF]:) \\
42\mbox{\rm Scope} & = & \mbox{\rm Advance}(\mbox{Prefix}) \vee \mbox{\rm
43  Advance}(\mbox{Prefix3or4},2) \vee \mbox{\rm
44  Advance}(\mbox{Prefix4}, 3) \\
45\mbox{\rm Mismatch} & = & \mbox{\rm Scope} \oplus \mbox{Suffix} \\
46\mbox{\rm Initial} & = & \mbox{\rm ASCII} \vee \mbox{Prefix} \\
47\mbox{\rm NonFinal} & = & \mbox{\rm Prefix} \vee \mbox{\rm
48  Advance}(\mbox{Prefix3or4}) \vee \mbox{\rm Advance}(\mbox{Prefix4}, 2)
49\end{eqnarray*}
50
51\paragraph{Unicode Character Classes.}  Whereas ASCII character classes
52can be determined by simple bitwise logic at a single code unit position,
53the UnicodeClass stream for a given class involves logic for up to four positions.
54By convention, we define UnicodeClass($U$) for a given Unicode character class
55$U$ to be the stream marking the {\em final} position of any characters in
56the class.
57
58Using these definitions, it is then possible to extend the matching
59equations to operate with Unicode character classes as follows.
60
61\begin{eqnarray*}
62\mbox{\rm Match}(m, U) & = &  \mbox{\rm Advance}(\mbox{\rm ScanThru}(m, \mbox{\rm NonFinal}) \wedge \mbox{\rm UnicodeClass}(U)) \\
63\mbox{\rm Match}(m, U*) & = &  \mbox{\rm MatchStar}(m, \mbox{\rm UnicodeClass}(U) \vee \mbox{\rm NonFinal}) \wedge \mbox{\rm Initial}\\
64\end{eqnarray*}
65
66Here, we use
67the ScanThru~\cite{cameron2011parallel} operation to move a set of markers
68each through the nonfinal bytes of UTF-8 sequences to the final
69byte position.
70\[{\mbox ScanThru}(m, c) = (m + c)  \wedge \neg c\]
71Figure~\ref{fig:multibytesequence} shows this
72technique in operation in the case of advancing through byte
73sequences (each 3 bytes in length) corresponding to Chinese characters.
74To better demonstrate the process, we use \texttt{ni3}, \texttt{hao}
75and \texttt{men} to represent these characters.
76$\text{CC}_{\texttt{ni3}}$ is the bitstream that marks character
77\texttt{ni3} and $\text{CC}_{\texttt{hao}}$ is the bitstream that
78marks character \texttt{hao}.
79We start with the marker stream $m_0$ initialized to Initial, indicating all positions are in play.
80Using ScanThru, we move to the final position of each character $t_1$.
81Applying bitwise and with $\text{CC}_{\texttt{ni3}}$ and advancing gives the
82two matches $m_1$ for ni3.  Applying ScanThru once more advances to the
83final position of the character after \texttt{ni3}
84The final result stream $m_2$ shows the lone match for the multibyte sequence
85\texttt{ni3hao}.
86
87\begin{figure}[tbh]
88\vspace{-0.5em}
89\begin{center}
90%\begin{tabular}{cr}\\
91\begin{tabular}{c@{\hspace{1em}}r}\\
92input data                                                         & \verb`ni3hao(Hello),ni3men(You),`\\
93$\text{CC}_{\text{ni3}}$                                           & \verb`..1.............1.........`\\
94$\text{CC}_{\text{hao}}$                                           & \verb`.....1....................`\\
95$m_0 = \mbox{\rm Initial}$                                         & \verb`1..1..111111111..1..111111`\\
96NonFinal                                                           & \verb`11.11.........11.11.......`\\
97$t_1 = \text{ScanThru}(m_0, \text{NonFinal})$                      & \verb`..1..111111111..1..1111111`\\
98$m_1 = \text{Advance}(t_1 \land \text{CC}_{\text{ni3}})$           & \verb`...1.............1........`\\
99$t_2 = \text{ScanThru}(t_2, \text{NonFinal})$                      & \verb`.....1.............1......`\\
100$m_2 = \text{Advance}(t_2 \land CC_{\text{hao}})$                  & \verb`......1...................`
101\end{tabular}
102\end{center}
103\vspace{-1em}
104\caption{Processing of a Multibyte Sequence ni3hao}
105\label{fig:multibytesequence}
106\vspace{-0.5em}
107\end{figure}
108
109\paragraph{Unicode MatchStar.}
110The $\text{MatchStar}(M, C)$ operation directly implements
111the operation of finding all positions reachable from a
112marker bit in $M$ through a character class repetition of
113an ASCII byte class $C$.   In UTF-8 matching, however,
114the character class byte streams are marked at their
115{\em final} positions.   Thus the one bits of a Unicode character
116class stream are not necessarily contiguous.  This in turn
117means that carry propagation within the MatchStar
118operation may terminate prematurely.
119
120In order to remedy this problem, \icGrep{} again uses the NonFinal
121stream  to ``fill in the gaps'' in the UnicodeClass($U$) bitstream so that the
122 MatchStar addition can move through a contiguous sequence of one
123 bits.  In this way, matching of an arbitrary Unicode character class
124 $U$
125can be implemented using ${\mbox{MatchStar}(m, \mbox{UnicodeClass}(U) \vee \mbox{NonFinal})}$.
126
127\paragraph{Predefined Unicode Classes.}
128\icGrep{} employs a set of bitstreams that are precompiled
129into the executable.  These include all bitstreams corresponding
130to Unicode property expressions such as \verb:\p{Greek}:.
131Each property potentially contains many code points, so we further
132embed the calculations within an if hierarchy.   Each if-statement
133within the hierarchy determines whether the current block contains
134any codepoints at all in a given Unicode range.   At the outer
135level, the ranges are quite coarse, becoming successively refined
136at deeper levels.  This technique works well when input documents
137contain long runs of text confined to one or a few ranges.
138
139%\subsection{Character Class Intersection and Difference}
140%\subsection{Unicode Case-Insensitive Matching}
Note: See TracBrowser for help on using the repository browser.