\section{Unicode Regular Expression Methods}\label{sec:Unicode}

\subsection{toUTF8 Transformation}\label{sec:Unicode:toUTF8}
The \icGrep{} parser produces a representation of an input regular expression over code points.
Parabix, however, operates on fixed-size code \emph{units} rather than variably sized code \emph{points}.
We therefore devised a toUTF8 transformation that converts an expression over code points into an equivalent expression over code units.
The transformation splits each code point into its corresponding sequence of bytes (UTF-8 code units), with each byte carrying the appropriate UTF-8 prefix,
and assigns these code units to new character classes.
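The splitting step can be illustrated with a minimal sketch (not \icGrep{} code); it relies on Python's built-in UTF-8 encoder to supply the prefix bits described above.

```python
# Minimal sketch (not icGrep code) of splitting one code point into its
# UTF-8 code units; Python's built-in encoder supplies the prefix bits.
def code_units(cp: int) -> list[int]:
    """Return the UTF-8 byte sequence encoding a single code point."""
    return list(chr(cp).encode("utf-8"))

print([hex(b) for b in code_units(0x244)])  # ['0xc9', '0x84']
```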
Consider the multibyte Unicode regular expression `\verb:\u{244}[\u{2030}-\u{2137}]:`.
The parser produces a sequence beginning with a character class for \verb:0x244:,
followed by a character class for the range from \verb:0x2030: to \verb:0x2137:.
After toUTF8, the first code point in the sequence becomes the two-byte sequence `\verb:\u{C9}\u{84}:`,
and the character class for the range expands into the series of sequences and alternations
necessary to specify the byte encodings within the range.
The UTF-8 encoded regular expression for the range \verb:[\u{2030}-\u{2137}]: becomes:
\begin{verbatim}
\u{E2}((\u{80}[\u{B0}-\u{BF}])|([\u{81}-\u{83}][\u{80}-\u{BF}])|(\u{84}[\u{80}-\u{B7}]))
\end{verbatim}
%The benefit of transforming the regular expression immediately after parsing from being a regular expression over code points into a regular expression over bytes is that it simplifies the rest of the compiler, as the compiler then only needs to be concerned with single bytes as opposed to code points, which vary in size.
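As a cross-check on the expansion of the range \verb:[\u{2030}-\u{2137}]:, the following sketch (illustrative only, not \icGrep{} code; it uses Python's \texttt{re} module over bytes) verifies that a byte-level alternation accepts exactly the UTF-8 encodings of the code points in the range, and nothing just outside it.

```python
import re

# Cross-check (illustrative, not icGrep code): this byte-level pattern
# is one expansion of the code point range [U+2030, U+2137].
BYTE_RANGE = re.compile(
    rb"\xE2(?:\x80[\xB0-\xBF]|[\x81-\x83][\x80-\xBF]|\x84[\x80-\xB7])"
)

def equivalent_to_range(lo: int, hi: int) -> bool:
    """True iff BYTE_RANGE accepts exactly the encodings of [lo, hi]."""
    accepted = {cp for cp in range(lo - 1, hi + 2)
                if BYTE_RANGE.fullmatch(chr(cp).encode("utf-8"))}
    return accepted == set(range(lo, hi + 1))

print(equivalent_to_range(0x2030, 0x2137))  # True
```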
\subsection{Match Unicode Characters}
After the transformation, we generate the character class bitstreams with our character class compiler.
As shown in Figure \ref{fig:multibytesequence}, the input data contains four Chinese characters, each encoded in three bytes.
To better demonstrate the process, we use \emph{ni3}, \emph{hao} and \emph{men} to represent these characters.
$CC_{\mbox{ni3}}$ is the bitstream that marks the character \emph{ni3} and $CC_{\mbox{hao}}$ is the bitstream that marks the character \emph{hao}.
To match the two-character UTF-8 sequence \emph{ni3hao}, we first create an \emph{Initial} stream that marks the first byte of every valid character.
We also generate a \emph{NonFinal} stream that marks every byte of each multibyte character except its last.
Using \emph{Initial} to ScanThru \emph{NonFinal}, we obtain the bitstream $M_1$, which marks the position of the last byte of every character.
The conjunction of $M_1$ and $CC_{\mbox{ni3}}$ marks the final byte of each occurrence of \emph{ni3}; advancing it by one position gives the starting point for matching the next character.
As illustrated by \emph{Adv}, we find two matches for \emph{ni3}, and from these two positions we begin the matching process for the next character, \emph{hao}.
The final result stream shows one match for the multibyte sequence \emph{ni3hao}.
\begin{figure}[tbh]
\begin{center}
\begin{tabular}{cr}
input data  & \verb`ni3hao(Hello),ni3men(You),`\\
$CC_{\mbox{ni3}}$ & \verb`..1.............1.........`\\
$CC_{\mbox{hao}}$ & \verb`.....1....................`\\
\emph{Initial} & \verb`1..1..111111111..1..111111`\\
\emph{NonFinal} & \verb`11.11.........11.11.......`\\
$M_1 = \mathit{ScanThru}(\mathit{Initial}, \mathit{NonFinal})$ & \verb`..1..111111111..1..1111111`\\
$\mathit{Adv} = \mathit{Advance}(M_1 \land CC_{\mbox{ni3}})$ & \verb`...1.............1........`\\
$M_2 = \mathit{ScanThru}(\mathit{Initial} \land \mathit{Adv}, \mathit{NonFinal})$ & \verb`.....1.............1......`\\
$\mathit{match} = M_2 \land CC_{\mbox{hao}}$ & \verb`.....1....................`
\end{tabular}
\end{center}
\caption{Processing of the multibyte sequence \emph{ni3hao}}
\label{fig:multibytesequence}
\end{figure}
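The marker operations in the figure can be reproduced with a short sketch, representing each bitstream as a Python integer in which bit $i$ corresponds to byte position $i$. The stream-building helpers are invented for the example and are not \icGrep{} code.

```python
# Sketch of the marker operations from the figure (illustrative only).
# Bit i of each integer corresponds to byte position i of the input.
TEXT = "你好(Hello),你们(You),".encode("utf-8")  # ni3 hao ... ni3 men ...

def stream(pred):
    """Bitstream with a 1 at every byte position satisfying pred."""
    return sum(1 << i for i in range(len(TEXT)) if pred(i))

def cc(ch):
    """Character class stream: marks the FINAL byte of each occurrence."""
    enc = ch.encode("utf-8")
    return stream(lambda i: i + 1 >= len(enc)
                  and TEXT[i + 1 - len(enc):i + 1] == enc)

Initial = stream(lambda i: TEXT[i] & 0xC0 != 0x80)   # non-continuation bytes
NonFinal = stream(lambda i: i + 1 < len(TEXT)
                  and TEXT[i + 1] & 0xC0 == 0x80)    # all but a char's last byte

def scan_thru(marker, chars):
    # Carry propagation moves each marker through its run of `chars`.
    return (marker + chars) & ~chars

def advance(marker):
    return marker << 1

M1 = scan_thru(Initial, NonFinal)   # last byte of every character
Adv = advance(M1 & cc("你"))        # start positions following each ni3
M2 = scan_thru(Initial & Adv, NonFinal)
match = M2 & cc("好")
print(bin(match).count("1"))        # 1: exactly one match for ni3hao
```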
\subsection{Predefined Unicode Classes}
The Unicode Consortium assigns every Unicode character to a general category, such as letter or punctuation.
These categories seldom change, so we precompute their parallel bitstream equations and statically compile them into \icGrep{}.
Each category contains many code points, so we further devised an \emph{If Hierarchy} optimization for each category.
This optimization assumes that most input documents contain code points from only a small number of writing systems;
evaluating equations for code points outside of the ranges actually used is unnecessary and only adds to the total running time of the application.
The \emph{If Hierarchy} optimization determines the ranges of the code points in the input text and processes the character class
equations only for the observed code point ranges.
The optimization tests the input text with a series of nested \emph{if-else} statements, similar to a static binary search:
as the nesting of the \emph{if} statements deepens, the range of code points narrows and the equations for a precise range of
code points are selected.
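The nesting idea can be illustrated with a toy sketch. The range boundaries, the Greek-letter class, and the \texttt{groups\_run} counter are all invented for the example and do not reflect \icGrep{}'s generated code.

```python
# Toy illustration of the If Hierarchy (invented ranges and classes,
# not icGrep's generated code): nested range tests skip the equation
# groups for code point ranges that do not occur in the input.
def greek_letters(code_points):
    present = set(code_points)
    marks, groups_run = set(), 0
    if any(cp <= 0x7FF for cp in present):               # 1-/2-byte range
        if any(cp <= 0x7F for cp in present):            # ASCII sub-range
            groups_run += 1                              # no Greek letters here
        if any(0x370 <= cp <= 0x3FF for cp in present):  # Greek block present
            groups_run += 1
            marks |= {cp for cp in present if 0x391 <= cp <= 0x3C9}
    if any(cp > 0x7FF for cp in present):                # 3-/4-byte ranges
        groups_run += 1                                  # (equations elided)
    return marks, groups_run

# ASCII-only input: only one equation group runs.
print(greek_letters(map(ord, "hello")))  # (set(), 1)
# Mixed ASCII and Greek: two groups run; the Greek letters are marked.
print(greek_letters(map(ord, "abγ")))
```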
%\subsection{Character Class Intersection and Difference}
%\subsection{Unicode Case-Insensitive Matching}