source: docs/Working/icGrep/unicode-re.tex @ 4489

Last change on this file since 4489 was 4489, checked in by nmedfort, 4 years ago

Minor revision to toUTF8 Transformation

File size: 6.1 KB
Line 
1\section{Unicode Regular Expression Methods}\label{sec:Unicode}
2
3\subsection{UTF-8 Transformation}\label{sec:Unicode:toUTF8}
4
5% The \icGrep{} parser produces a representation of an input regular expression over code points.
6% Parabix, however, operates on fixed size code \emph{units} instead of varying sized code \emph{points}.
7% We devised a toUTF-8 transformation that converts the expression over code points into an equivalent expression over code units.
8% %The transformation accomplishes this by first determining the number of UTF-8 bytes
9% %required to represent each code point contained within each character class.
10% The process splits code points into corresponding sequences of bytes (UTF-8 code units), %, with each byte containing the necessary UTF-8 prefix.
11% %The UTF-8 encoded bytes are each assigned to
12% and assigns these code units to new character classes.% in the new AST.
13% %
14% Consider the multibyte Unicode regular expression `\verb:\u{244}[\u{2030}-\u{2137}]:`.
15% The parser produces a sequence starting with a character class for \verb:0x244:
16% followed by a character class for the range from \verb:0x2030: to \verb:0x2137:.
17% After toUTF-8, %this AST becomes more complex.
18% the first code point in the sequence becomes the two byte sequence `\verb:\u{C9}\u{84}:`,
19% and the character class for the range%, which is a range of three byte sequences,
20% expands into a series of sequences and alternations necessary to specify the byte encodings within the range.
21% The UTF-8 encoded regular expression for the range \verb:[\u{2030}-\u{2137}]: becomes:
22% \newline
23
24
25The \icGrep{} parser produces a representation of an input regular expression over variable-length code \emph{points}.
26%
27Parabix, however, operates on fixed-size code \emph{units}.
28%
29To support code points, the toUTF8 transformation converts the expression into an equivalent expression over code units
30by splitting code points into corresponding sequences of bytes (UTF-8 code units).
31%and assigning them to new character classes.
32%
33Consider the Unicode regular expression `\verb:\u{244}[\u{2030}-\u{2137}]:`.
34%
35The parser produces a sequence starting with a \verb:0x244: followed by the range of \verb:0x2030: to \verb:0x2137:.
36%
37After toUTF8, the first code point in the sequence becomes the two byte sequence ``\verb:\u{C9}\u{84}:'',
38and the range expands into the series of sequences and alternations shown below:
39\newline
40
41{
42\small
43\verb:\xE2((\x84[\x80-\xB7])|(([\x81-\x83][\x80-\xBF])|(\x80[\xB0-\xBF]))):
44\newline
45}
46
47%The benefit of transforming the regular expression immediately after parsing from being a regular expression over code points into a regular expression over bytes is that it simplifies the rest of the compiler, as the compiler then only needs to be concerned with single bytes as opposed to code points, which vary in size.
48
49
50\subsection{Match Unicode Characters}
51After the transformation, we can then generate the character class bitstreams with our character class compiler.
52As shown in Figure \ref{fig:multibytesequence}, the input data has 4 Chinese characters, each formed by 3 bytes.
53To better demonstrate the process, we use \emph{ni3}, \emph{hao} and \emph{men} to represent these characters.
54$CC_{\mbox{ni3}}$ is the bitstream that marks character \emph{ni3} and $CC_{\mbox{hao}}$ is the bitstream that marks character stream \emph{hao}.
55To match a two UTF-8 character sequence \emph{ni3hao}, we need to first create an \emph{Initial} steam that marks the first byte of all the valid characters.
56We also need to generate a \emph{NonFinal} stream that marks every byte of all the multibyte characters except for the last byte.
57Using \emph{Initial} to ScanThru \emph{NonFinal}, we get the bitstream $M_2$, which marks the positions of the last byte of every character.
58An overlap between $M_2$ and $CC_{\mbox{ni3}}$ gives the start position for matching the next character.
59As illustrated by \emph{Adv}, we find two matches for \emph{ni3} and from these two positions we can start the matching process for the next character \emph{hao}.
60The final result stream shows 1 match for Multibyte Sequence ni3hao.
61
62
63
64\begin{figure}[tbh]
65\begin{center}
66%\begin{tabular}{cr}\\
67
68\begin{tabular}{r@{\hspace{1em}}r}\\
69input data  & \verb`ni3hao(Hello),ni3men(You),`\\
70$CC_{\mbox{ni3}}$ & \verb`..1.............1.........`\\
71$CC_{\mbox{hao}}$ & \verb`.....1....................`\\
72\emph{Initial} & \verb`1..1..111111111..1..111111`\\
73\emph{NonFinal} & \verb`11.11.........11.11.......`\\
74$M_1 = \mathit{ScanThru}(\mathit{Initial}, \mathit{NonFinal})$ & \verb`..1..111111111..1..1111111`\\
75$\mathit{Adv} = \mathit{Advance}(M_1 \land CC_{\mbox{ni3}})$ & \verb`...1.............1........`\\
76$M_2 = \mathit{ScanThru}(\mathit{Initial} \land \mathit{Adv}, \mathit{NonFinal})$ & \verb`.....1.............1......`\\
77$\mathit{match} = M_2 \land CC_{\mbox{hao}}$ & \verb`.....1....................`
78\end{tabular}
79
80
81\end{center}
82\caption{Processing of a Multibyte Sequence ni3hao}
83\label{fig:multibytesequence}
84\end{figure}
85
86\subsection{Predefined Unicode classes}
87
88The Unicode Consortium assigned every Unicode character to a general category, such as letter or punctuation.
89These categories seldom change, so we precomputed their parallel bitstream equations and statically compiled them into icGrep.
90Each category contains many code points, so we further devised an \emph{If Hierarchy} optimization for each category.
91This optimization assumes that most input documents only contain code points from a small number of writing systems.
92Including equations for code points outside of the used range is unnecessary and will only add to the total running time of the application.
93The \emph{If Hierarchy} optimization determines the ranges of the code points in the input text and only processes the character class
94equations for observed code point ranges.
95The optimization tests the input text with a series of nested \emph{if else} statements, similar to a static binary search.
96As the nesting of the \emph{if} statements increases, the range of the code points narrows and the equations for a precise range of
97code points are selected.
98
99%\subsection{Character Class Intersection and Difference}
100%\subsection{Unicode Case-Insensitive Matching}
Note: See TracBrowser for help on using the repository browser.