source: docs/Working/icGrep/unicode-re.tex @ 4555

Last change on this file since 4555 was 4555, checked in by cameron, 4 years ago

Initial revision of UTF-8 matching

File size: 7.8 KB
Line 
1\section{Bitwise Methods for UTF-8}\label{sec:Unicode}
2
3As described in the following section, \icGrep{} is a reimplementation
4of the bitwise data parallel method implemented on top of LLVM
5infrastructure and adapted for Unicode regular expression search
6through data streams represented in UTF-8.  In this section,
7we present the techniques we have used to extend the bitwise
8matching techniques to the variable-length encodings of UTF-8.
9
10The first requirement in implementing a regular expression processor
11over UTF-8 data streams is to translate Unicode regular expressions
12over codepoints to corresponding regular expressions over
13sequences of UTF-8 bytes or \emph{code units}.   The {\tt toUTF8} transformation
14performs this as a regular expression transformation, transforming
15input expressions such as `\verb:\u{244}[\u{2030}-\u{2137}]:`
16to the corresponding UTF-8 regular expression consisting of the series of sequences and alternations shown below:
17\newline
18{
19\small
20\verb:\xE2((\x84[\x80-\xB7])|(([\x81-\x83][\x80-\xBF])|(\x80[\xB0-\xBF]))):
21\newline
22}
23
24\paragraph{UTF-8 Byte Classification and Validation.}
25In UTF-8, bytes are classified as individual ASCII bytes, or as
26prefixes of two-, three-, or four-byte sequences, or as suffix bytes.
27In addition, we say that the {\em scope} bytes of a prefix are the
28immediately following byte positions at which a suffix byte is expected.
29Two helper streams are also useful.
30The Initial stream marks ASCII bytes and prefixes of multibyte sequences,
31while the NonFinal stream marks any position that is not the final
32position of a Unicode character.
33\begin{eqnarray*}
34\mbox{\rm ASCII} & = & \mbox{\rm CharClass}(\verb:[\x00-\x7F]:) \\
35\mbox{\rm Prefix2} & = & \mbox{\rm CharClass}(\verb:[\xC2-\xDF]:) \\
36\mbox{\rm Prefix3} & = & \mbox{\rm CharClass}(\verb:[\xE0-\xEF]:) \\
37\mbox{\rm Prefix4} & = & \mbox{\rm CharClass}(\verb:[\xF0-\xF4]:) \\
38\mbox{\rm Suffix} & = & \mbox{\rm CharClass}(\verb:[\x80-\xBF]:) \\
39\mbox{\rm Scope2} & = & \mbox{\rm Advance}(\mbox{Prefix2} \vee \mbox{Prefix3} \vee \mbox{Prefix4}) \\
40\mbox{\rm Scope3} & = & \mbox{\rm Advance}(\mbox{\rm Advance}(\mbox{Prefix3} \vee \mbox{Prefix4})) \\
41\mbox{\rm Scope4} & = & \mbox{\rm Advance}(\mbox{\rm Advance}(\mbox{\rm Advance}(\mbox{Prefix4}))) \\
42\mbox{\rm Initial} & = & \mbox{\rm ASCII} \vee \mbox{Prefix2} \vee \mbox{Prefix3} \vee \mbox{Prefix4} \\
43\mbox{\rm NonFinal} & = & \mbox{\rm Prefix} \vee \mbox{\rm Advance}(\mbox{Prefix3} \vee \mbox{Prefix4}) \vee \mbox{\rm Advance}(\mbox{\rm Advance}(\mbox{Prefix4}))
44\end{eqnarray*}
45
46\paragraph{Unicode Character Classes.}  Whereas ASCII character classes
47can be determined by simple bitwise logic at a single code unit position,
48the UnicodeClass stream for a given class involves logic for up to four positions.
49By convention, we define UnicodeClass($U$) for a given Unicode character class
50$U$ to be the stream marking the {\em final} position of
51Unicode character classes.
52
53Using these definitions, it is then possible to extend the matching
54equations to operate with Unicode character classes as follows.
55
56\begin{eqnarray*}
57\mbox{\rm Match}(m, U) & = &  \mbox{\rm Advance}(\mbox{\rm ScanThru}(m, \mbox{\rm NonFinal}) \wedge \mbox{\rm UnicodeClass}(U)) \\
58\mbox{\rm Match}(m, U*) & = &  \mbox{\rm MatchStar}(m, \mbox{\rm UnicodeClass}(U) \vee \mbox{\rm NonFinal}) \wedge \mbox{\rm Initial}\\
59\end{eqnarray*}
60
61Here, we use
62the ScanThru~\cite{cameron2011parallel} operation to move a set of markers
63each through the nonfinal bytes of UTF-8 sequences to the final
64byte position.
65\[{\mbox ScanThru}(m, c) = (m + c)  \wedge \neg c\]
66Figure~\ref{fig:multibytesequence} shows this
67technique in operation in the case of advancing through byte
68sequences (each 3 bytes in length) corresponding to Chinese characters.
69To better demonstrate the process, we use \texttt{ni3}, \texttt{hao}
70and \texttt{men} to represent these characters.
71$\text{CC}_{\texttt{ni3}}$ is the bitstream that marks character
72\texttt{ni3} and $\text{CC}_{\texttt{hao}}$ is the bitstream that
73marks character \texttt{hao}.
74To match a two UTF-8 character sequence \texttt{ni3hao}, we first
75create an Initial stream that marks the first byte of all the valid characters.
76We also produce a NonFinal stream that marks every byte of all
77multibyte characters \emph{except for} the last byte.
78Using Initial to ScanThru NonFinal, we construct bitstream $M_1$, which
79marks the positions of the last byte of every character.
80An overlap between $M_1$ and $\text{CC}_{\texttt{ni3}}$ gives the start
81position for matching the next character.
82As illustrated by $M_2$, we find two matches for \texttt{ni3},
83and from these two positions we can start the matching process for
84the next character \texttt{hao}.
85The final result stream $M_4$ shows one match for the multibyte sequence
86\texttt{ni3hao}.
87
88\begin{figure}[tbh]
89\vspace{-0.5em}
90\begin{center}
91%\begin{tabular}{cr}\\
92\begin{tabular}{c@{\hspace{1em}}r}\\
93input data                                                         & \verb`ni3hao(Hello),ni3men(You),`\\
94$\text{CC}_{\text{ni3}}$                                           & \verb`..1.............1.........`\\
95$\text{CC}_{\text{hao}}$                                           & \verb`.....1....................`\\
96Initial                                                            & \verb`1..1..111111111..1..111111`\\
97NonFinal                                                           & \verb`11.11.........11.11.......`\\
98$M_1 = \text{ScanThru}(\text{Initial}, \text{NonFinal})$           & \verb`..1..111111111..1..1111111`\\
99$M_2 = \text{Advance}(M_1 \land \text{CC}_{\text{ni3}})$           & \verb`...1.............1........`\\
100$M_3 = \text{ScanThru}(\text{Initial} \land M_2, \text{NonFinal})$ & \verb`.....1.............1......`\\
101$M_4 = M_3 \land CC_{\text{hao}}$                                  & \verb`.....1....................`
102\end{tabular}
103\end{center}
104\vspace{-1em}
105\caption{Processing of a Multibyte Sequence ni3hao}
106\label{fig:multibytesequence}
107\vspace{-0.5em}
108\end{figure}
109
110\paragraph{Unicode MatchStar.}
111The $\mathit{MatchStar}(M, C)$ operation directly implements
112the operation of finding all positions reachable from a
113marker bit in $M$ through a character class repetition of
114an ASCII byte class $C$.   In UTF-8 matching, however,
115the character class byte streams are marked at their
116{\em final} positions.   Thus the one bits of a Unicode character
117class stream are not necessarily contiguous.  This in turn
118means that carry propagation within the MatchStar
119operation may terminate prematurely.
120
121In order to remedy this problem, \icGrep{} again uses the two helper bitstreams
122\emph{Initial} and \emph{NonFinal}.   Any full match to a multibyte sequence must
123reach the initial position of the next character. 
124The {\em NonFinal} bitstream consists of all positions except those
125that are final positions of UTF-8 sequences.
126It is used to ``fill in the gaps'' in the CC bitstream so that the
127 MatchStar addition can move through a contiguous sequence of one
128 bits.  In this way, matching of an arbitrary Unicode character class
129 $C$ (with a 1 bit set at final positions of any members of the class),
130can be implemented using ${\mathit{MatchStar}(M, C |\mathit{NonFinal})}$.
131
132\paragraph{Predefined Unicode Classes.}
133\icGrep{} employs a set of bitstreams that are precompiled
134into the executable.  These include all bitstreams corresponding
135to Unicode property expressions such as \verb:\p{Greek}:.
136Each property potentially contains many code points, so we further
137embed the calculations within an if hierarchy.   Each if-statement
138within the hierarchy determines whether the current block contains
139any codepoints at all in a given Unicode range.   At the outer
140level, the ranges are quite coarse, becoming successively refined
141at deeper levels.  This technique works well when input documents
142contain long runs of text confined to one or a few ranges.
143
144%\subsection{Character Class Intersection and Difference}
145%\subsection{Unicode Case-Insensitive Matching}
Note: See TracBrowser for help on using the repository browser.