source: docs/Working/icGrep/unicode-re.tex @ 4505

Last change on this file since 4505 was 4504, checked in by nmedfort, 4 years ago

Small fix for unicode matching example along with formatting changes

File size: 6.0 KB
Line 
1\section{Bitwise Methods for UTF-8}\label{sec:Unicode}
2
3As described in the following section, \icGrep{} is a reimplementation
4of the bitwise data parallel method implemented on top of LLVM
5infrastructure and adapted for Unicode regular expression search
6through data streams represented in UTF-8.  In this section,
7we present the techniques we have used to extend the bitwise
8matching techniques to the variable-length encodings of UTF-8.
9
10The first requirement in implementing a regular expression processor
11over UTF-8 data streams is to translate Unicode regular expressions
12over codepoints to corresponding regular expressions over
13sequences of UTF-8 bytes or \emph{code units}.   The {\tt toUTF8} transformation
14performs this as a regular expression transformation, transforming
15input expressions such as `\verb:\u{244}[\u{2030}-\u{2137}]:`
16to the corresponding UTF-8 regular expression consisting of the series of sequences and alternations shown below:
17\newline
18{
19\small
20\verb:\xE2((\x84[\x80-\xB7])|(([\x81-\x83][\x80-\xBF])|(\x80[\xB0-\xBF]))):
21\newline
22}
23
24\paragraph{Unicode Advance.}  As illustrated in Section \ref{sec:background},  a bitwise shift
25(Advance) operation is frequently used in shifting a marker stream
26from positions matched by one regular expression element to the next.
27However, characters are represented by variable-length byte sequences
28in UTF-8, so that a simple shift operation is inadequate to implement
29the operation of advancing bit stream positions from one Unicode
30character to the next. 
31
32To address the requirements of Unicode advance, we use
33the ScanThru~\cite{cameron2011parallel} operation to move a set of markers
34each through the nonfinal bytes of UTF-8 sequences to the final
35byte position.
36Figure~\ref{fig:multibytesequence} shows this
37technique in operation in the case of advancing through byte
38sequences (each 3 bytes in length) corresponding to Chinese characters.
39To better demonstrate the process, we use \texttt{ni3}, \texttt{hao}
40and \texttt{men} to represent these characters.
41$\text{CC}_{\texttt{ni3}}$ is the bitstream that marks character
42\texttt{ni3} and $\text{CC}_{\texttt{hao}}$ is the bitstream that
43marks character \texttt{hao}.
44To match a two UTF-8 character sequence \texttt{ni3hao}, we first
45create an Initial steam that marks the first byte of all the valid characters.
46We also produce a NonFinal stream that marks every byte of all
47multibyte characters \emph{except for} the last byte.
48Using Initial to ScanThru NonFinal, we construct bitstream $M_1$, which
49marks the positions of the last byte of every character.
50An overlap between $M_1$ and $\text{CC}_{\texttt{ni3}}$ gives the start
51position for matching the next character.
52As illustrated by $M_2$, we find two matches for \texttt{ni3},
53and from these two positions we can start the matching process for
54the next character \texttt{hao}.
55The final result stream $M_4$ shows one match for the multibyte sequence
56\texttt{ni3hao}.
57
58\begin{figure}[tbh]
59\vspace{-0.5em}
60\begin{center}
61%\begin{tabular}{cr}\\
62\begin{tabular}{c@{\hspace{1em}}r}\\
63input data                                                         & \verb`ni3hao(Hello),ni3men(You),`\\
64$\text{CC}_{\text{ni3}}$                                           & \verb`..1.............1.........`\\
65$\text{CC}_{\text{hao}}$                                           & \verb`.....1....................`\\
66Initial                                                            & \verb`1..1..111111111..1..111111`\\
67NonFinal                                                           & \verb`11.11.........11.11.......`\\
68$M_1 = \text{ScanThru}(\text{Initial}, \text{NonFinal})$           & \verb`..1..111111111..1..1111111`\\
69$M_2 = \text{Advance}(M_1 \land \text{CC}_{\text{ni3}})$           & \verb`...1.............1........`\\
70$M_3 = \text{ScanThru}(\text{Initial} \land M_2, \text{NonFinal})$ & \verb`.....1.............1......`\\
71$M_4 = M_3 \land CC_{\text{hao}}$                                  & \verb`.....1....................`
72\end{tabular}
73\end{center}
74\vspace{-1em}
75\caption{Processing of a Multibyte Sequence ni3hao}
76\label{fig:multibytesequence}
77\vspace{-0.5em}
78\end{figure}
79
80\paragraph{Unicode MatchStar.}
81The $\mathit{MatchStar}(M, C)$ operation directly implements
82the operation of finding all positions reachable from a
83marker bit in $M$ through a character class repetition of
84an ASCII byte class $C$.   In UTF-8 matching, however,
85the character class byte streams are marked at their
86{\em final} positions.   Thus the one bits of a Unicode character
87class stream are not necessarily contiguous.  This in turn
88means that carry propagation within the MatchStar
89operation may terminate prematurely.
90
91In order to remedy this problem, \icGrep{} again uses the two helper bitstreams
92\emph{Initial} and \emph{NonFinal}.   Any full match to a multibyte sequence must
93reach the initial position of the next character. 
94The {\em NonFinal} bitstream consists of all positions except those
95that are final positions of UTF-8 sequences.
96It is used to ``fill in the gaps'' in the CC bitstream so that the
97 MatchStar addition can move through a contiguous sequence of one
98 bits.  In this way, matching of an arbitrary Unicode character class
99 $C$ (with a 1 bit set at final positions of any members of the class),
100can be implemented using ${\mathit{MatchStar}(M, C |\mathit{NonFinal})}$.
101
102\paragraph{Predefined Unicode Classes.}
103\icGrep{} employs a set of bitstreams that are precompiled
104into the executable.  These include all bitstreams corresponding
105to Unicode property expressions such as \verb:\p{Greek}:.
106Each property potentially contains many code points, so we further
107embed the calculations within an if hierarchy.   Each if-statement
108within the hierarchy determines whether the current block contains
109any codepoints at all in a given Unicode range.   At the outer
110level, the ranges are quite coarse, becoming successively refined
111at deeper levels.  This technique works well when input documents
112contain long runs of text confined to one or a few ranges.
113
114%\subsection{Character Class Intersection and Difference}
115%\subsection{Unicode Case-Insensitive Matching}
Note: See TracBrowser for help on using the repository browser.