# source:docs/Working/icGrep/unicode-re.tex@4490

Last change on this file since 4490 was 4490, checked in by cameron, 4 years ago

Intro and Background, remove section 3

File size: 5.9 KB
Line
1\section{Bitwise Methods for Unicode}\label{sec:Unicode}
2\subsection{UTF-8 Transformation}\label{sec:Unicode:toUTF8}
3
4% The \icGrep{} parser produces a representation of an input regular expression over code points.
5% Parabix, however, operates on fixed size code \emph{units} instead of varying sized code \emph{points}.
6% We devised a toUTF-8 transformation that converts the expression over code points into an equivalent expression over code units.
7% %The transformation accomplishes this by first determining the number of UTF-8 bytes
8% %required to represent each code point contained within each character class.
9% The process splits code points into corresponding sequences of bytes (UTF-8 code units), %, with each byte containing the necessary UTF-8 prefix.
10% %The UTF-8 encoded bytes are each assigned to
11% and assigns these code units to new character classes.% in the new AST.
12% %
13% Consider the multibyte Unicode regular expression \verb:\u{244}[\u{2030}-\u{2137}]:.
14% The parser produces a sequence starting with a character class for \verb:0x244:
15% followed by a character class for the range from \verb:0x2030: to \verb:0x2137:.
16% After toUTF-8, %this AST becomes more complex.
17% the first code point in the sequence becomes the two byte sequence \verb:\u{C9}\u{84}:,
18% and the character class for the range%, which is a range of three byte sequences,
19% expands into a series of sequences and alternations necessary to specify the byte encodings within the range.
20% The UTF-8 encoded regular expression for the range \verb:[\u{2030}-\u{2137}]: becomes:
21% \newline
22The \icGrep{} regular expression parser produces a representation of an input regular expression over Unicode code points.
23To process UTF-8 data streams, however, these expressions must first be converted to equivalent expressions in terms of UTF-8 code units.
24\%
25Consider the Unicode regular expression \verb:\u{244}[\u{2030}-\u{2137}]:.
26%
27The parser produces a sequence starting with a \verb:0x244: followed by the range of \verb:0x2030: to \verb:0x2137:.
28%
29After toUTF8, the first codepoint in the sequence becomes the two byte sequence \verb:\u{C9}\u{84}:'',
30and the range expands into the series of sequences and alternations shown below:
31\newline
32
33{
34\small
35\verb:\xE2((\x84[\x80-\xB7])|(([\x81-\x83][\x80-\xBF])|(\x80[\xB0-\xBF]))):
36\newline
37}
38
39%The benefit of transforming the regular expression immediately after parsing from being a regular expression over code points into a regular expression over bytes is that it simplifies the rest of the compiler, as the compiler then only needs to be concerned with single bytes as opposed to code points, which vary in size.
40
41
42\subsection{Match Unicode Characters}
43After the transformation, we can then generate the character class bitstreams with our character class compiler.
44As shown in Figure \ref{fig:multibytesequence}, the input data has 4 Chinese characters, each formed by 3 bytes.
45To better demonstrate the process, we use \emph{ni3}, \emph{hao} and \emph{men} to represent these characters.
46$CC_{\mbox{ni3}}$ is the bitstream that marks character \emph{ni3} and $CC_{\mbox{hao}}$ is the bitstream that marks character stream \emph{hao}.
47To match a two UTF-8 character sequence \emph{ni3hao}, we need to first create an \emph{Initial} steam that marks the first byte of all the valid characters.
48We also need to generate a \emph{NonFinal} stream that marks every byte of all the multibyte characters except for the last byte.
49Using \emph{Initial} to ScanThru \emph{NonFinal}, we get the bitstream $M_2$, which marks the positions of the last byte of every character.
50An overlap between $M_2$ and $CC_{\mbox{ni3}}$ gives the start position for matching the next character.
51As illustrated by \emph{Adv}, we find two matches for \emph{ni3} and from these two positions we can start the matching process for the next character \emph{hao}.
52The final result stream shows 1 match for Multibyte Sequence ni3hao.
53
54
55
56\begin{figure}[tbh]
57\begin{center}
58%\begin{tabular}{cr}\\
59
60\begin{tabular}{r@{\hspace{1em}}r}\\
61input data  & \verbni3hao(Hello),ni3men(You),\\
62$CC_{\mbox{ni3}}$ & \verb..1.............1.........\\
63$CC_{\mbox{hao}}$ & \verb.....1....................\\
64\emph{Initial} & \verb1..1..111111111..1..111111\\
65\emph{NonFinal} & \verb11.11.........11.11.......\\
66$M_1 = \mathit{ScanThru}(\mathit{Initial}, \mathit{NonFinal})$ & \verb..1..111111111..1..1111111\\
67$\mathit{Adv} = \mathit{Advance}(M_1 \land CC_{\mbox{ni3}})$ & \verb...1.............1........\\
68$M_2 = \mathit{ScanThru}(\mathit{Initial} \land \mathit{Adv}, \mathit{NonFinal})$ & \verb.....1.............1......\\
69$\mathit{match} = M_2 \land CC_{\mbox{hao}}$ & \verb.....1....................
70\end{tabular}
71
72
73\end{center}
74\caption{Processing of a Multibyte Sequence ni3hao}
75\label{fig:multibytesequence}
76\end{figure}
77
78\subsection{Predefined Unicode classes}
79
80The Unicode Consortium assigned every Unicode character to a general category, such as letter or punctuation.
81These categories seldom change, so we precomputed their parallel bitstream equations and statically compiled them into icGrep.
82Each category contains many code points, so we further devised an \emph{If Hierarchy} optimization for each category.
83This optimization assumes that most input documents only contain code points from a small number of writing systems.
84Including equations for code points outside of the used range is unnecessary and will only add to the total running time of the application.
85The \emph{If Hierarchy} optimization determines the ranges of the code points in the input text and only processes the character class
86equations for observed code point ranges.
87The optimization tests the input text with a series of nested \emph{if else} statements, similar to a static binary search.
88As the nesting of the \emph{if} statements increases, the range of the code points narrows and the equations for a precise range of
89code points are selected.
90
91%\subsection{Character Class Intersection and Difference}
92%\subsection{Unicode Case-Insensitive Matching}
Note: See TracBrowser for help on using the repository browser.