source: docs/Working/icGrep/unicode-re.tex @ 4493

Last change on this file since 4493 was 4493, checked in by cameron, 4 years ago

Updates

File size: 6.1 KB
Line 
1\section{Bitwise Methods for UTF-8}\label{sec:Unicode}
2
3As described in the following section, \icGrep{} is a reimplementation
4of the bitwise data parallel method implemented on top of LLVM
5infrastructure and adapted for Unicode regular expression search
6through data streams represented in UTF-8.  In this section,
7we present the techniques we have used to extend the bitwise
8matching techniques to the variable-length encodings of UTF-8.
9
10The first requirement in implementing a regular expression processor
11over UTF-8 data streams is to translate Unicode regular expressions
12over codepoints to corresponding regular expressions over
13sequences of UTF-8 bytes.   The {\tt toUTF8} transformation
14performs this as a regular expression transformation, transforming
15input expressions such as `\verb:\u{244}[\u{2030}-\u{2137}]:`
16to the corresponding UTF-8 regular expression consisting of the series of sequences and alternations shown below:
17\newline
18{
19\small
20\verb:\xE2((\x84[\x80-\xB7])|(([\x81-\x83][\x80-\xBF])|(\x80[\xB0-\xBF]))):
21\newline
22}
23
24\paragraph{Unicode Advance.}  As illustrated above,  a bitwise shift
25(Advance) operation is frequently used in shifting a marker stream
26from positions matched by one regular expression element to the next.
27However, characters are represented by variable-length byte sequences
28in UTF-8, so that a simple shift operation is inadequate to implement
29the operation of advancing bit stream positions from one Unicode
30character to the next. 
31
32In order to address the requirements of Unicode advance, we use
33the $\mathit{ScanThru}$ \cite{cameron2011parallel} operation to move a set of markers
34each through the nonfinal bytes of UTF-8 sequences to the final
35byte position.   Figure \ref{fig:multibytesequence} shows this
36technique in operation in the case of advancing through byte
37sequences (each 3 bytes in length) corresponding to Chinese characters.   
38To better demonstrate the process, we use \emph{ni3}, \emph{hao} and \emph{men} to represent these characters.
39$CC_{\mbox{ni3}}$ is the bitstream that marks character \emph{ni3} and $CC_{\mbox{hao}}$ is the bitstream that marks character stream \emph{hao}.
40To match a two UTF-8 character sequence \emph{ni3hao}, we first create an \emph{Initial} steam that marks the first byte of all the valid characters.
41We also produce a \emph{NonFinal} stream that marks every byte of all the multibyte characters except for the last byte.
42Using \emph{Initial} to ScanThru \emph{NonFinal}, we get the bitstream $M_2$, which marks the positions of the last byte of every character.
43An overlap between $M_2$ and $CC_{\mbox{ni3}}$ gives the start position for matching the next character.
44As illustrated by \emph{Adv}, we find two matches for \emph{ni3} and from these two positions we can start the matching process for the next character \emph{hao}.
45The final result stream shows 1 match for the multibyte sequence ni3hao.
46
47\begin{figure}[tbh]
48\begin{center}
49%\begin{tabular}{cr}\\
50
51\begin{tabular}{r@{\hspace{1em}}r}\\
52input data  & \verb`ni3hao(Hello),ni3men(You),`\\
53$CC_{\mbox{ni3}}$ & \verb`..1.............1.........`\\
54$CC_{\mbox{hao}}$ & \verb`.....1....................`\\
55\emph{Initial} & \verb`1..1..111111111..1..111111`\\
56\emph{NonFinal} & \verb`11.11.........11.11.......`\\
57$M_1 = \mathit{ScanThru}(\mathit{Initial}, \mathit{NonFinal})$ & \verb`..1..111111111..1..1111111`\\
58$\mathit{Adv} = \mathit{Advance}(M_1 \land CC_{\mbox{ni3}})$ & \verb`...1.............1........`\\
59$M_2 = \mathit{ScanThru}(\mathit{Initial} \land \mathit{Adv}, \mathit{NonFinal})$ & \verb`.....1.............1......`\\
60$\mathit{match} = M_2 \land CC_{\mbox{hao}}$ & \verb`.....1....................`
61\end{tabular}
62\end{center}
63\caption{Processing of a Multibyte Sequence ni3hao}
64\label{fig:multibytesequence}
65\end{figure}
66
67\paragraph{Unicode MatchStar.}
68
69The $\mathit{MatchStar}(M, C)$ operation directly implements
70the operation of finding all positions reachable from a
71marker bit in $M$ through a character class repetition of
72an ASCII byte class $C$.   In UTF-8 matching, however,
73the character class byte streams are marked at their
74{\em final} positions.   Thus the one bits of a Unicode character
75class stream are not necessarily contiguous.  This in turn
76means that the addition operation within the MatchStar
77operation may terminate prematurely.
78
79In order to remedy this problem, \icGrep{} forms two helper bitstreams
80\emph{Initial} and \emph{NonFinal}.  The {\em Initial} bitstream marks the
81locations of all single byte characters and the first bytes of all
82multibyte characters.  Any full match to a multibyte sequence must
83reach the initial position of the next character. 
84The {\em NonFinal} bitstream consists of all positions except those
85that are final positions of UTF-8 sequences. 
86 It is used to ``fill in the gaps'' in the CC bitstream so that the
87 MatchStar addition can move through a contiguous sequence of one
88 bits.  In this way, matching of an arbitrary Unicode character class
89 $C$ (with a 1 bit set at final positions of any members of the
90 class),
91can be implemented using ${\mathit{MatchStar}(M, C |
92  \mathit{NonFinal})}$.
93
94\paragraph{Predefined Unicode Classes.}
95\icGrep{} employs a set of predefined bitstreams that are precompiled
96into the executable.  These include all bitstreams corresponding
97to Unicode property expressions such as \verb:\p{Greek}:.
98Each category potentially contains  many code points, so we further
99devised an \emph{If Hierarchy} optimization for
100these properties.
101This optimization applies whenever an input document contains
102codepoints from only a small number of writing systems -- the normal case.
103The \emph{If Hierarchy} optimization determines the ranges of the code points in the input text and only processes the character class
104equations for observed code point ranges.
105The optimization tests the input text with a series of nested \emph{if else} statements, similar to a static binary search.
106As the nesting of the \emph{if} statements increases, the range of the code points narrows and the equations for a precise range of
107code points are selected.
108
109%\subsection{Character Class Intersection and Difference}
110%\subsection{Unicode Case-Insensitive Matching}
Note: See TracBrowser for help on using the repository browser.