\section{Unicode Regular Expression Methods}\label{sec:Unicode}

\subsection{toUTF8 Transformation}\label{sec:Unicode:toUTF8}
The icGrep parser generates an abstract syntax tree (AST) that represents the input regular expression over code points.  This AST is passed as input to a toUTF8 transformation that generates a new AST representing the equivalent regular expression over UTF-8 byte sequences. The transformation first determines the number of UTF-8 bytes required to represent each code point contained within each character class.  Each code point is then split into a sequence of bytes, with each byte carrying the necessary UTF-8 prefix, and each byte is assigned to a new character class in the new AST.  As an example, consider the following regular expression, which consists entirely of multibyte Unicode characters: `\verb:\u{244}[\u{2030}-\u{2137}]:`.  The AST from the parser represents this as a sequence starting with a character class containing the code point \verb:0x244:, followed by a second character class containing the range from \verb:0x2030: to \verb:0x2137:.  After the toUTF8 transformation, this AST becomes considerably more complex.  The first code point in the sequence is encoded as the two-byte sequence `\verb:\u{C9}\u{84}:`. The character class containing the range, which is a range of three-byte sequences, is expanded into the series of sequences and alternations necessary to specify all of the possible byte encodings contained within the range.  The range \verb:[\u{2030}-\u{2137}]: is thus encoded as \verb:\u{E2}(\u{80}[\u{B0}-\u{BF}]|[\u{81}-\u{83}][\u{80}-\u{BF}]|\u{84}[\u{80}-\u{B7}]):.
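The byte splitting described above can be sketched with a small stand-alone UTF-8 encoder that makes the length-dependent prefix bytes explicit. This is only an illustration of the encoding rules; icGrep performs the expansion symbolically on character-class ASTs rather than on individual code points.

```python
def to_utf8(cp: int) -> list[int]:
    """Encode a code point as its UTF-8 byte sequence, ORing the
    length-dependent prefix onto each byte."""
    if cp < 0x80:                      # 1 byte:  0xxxxxxx
        return [cp]
    if cp < 0x800:                     # 2 bytes: 110xxxxx 10xxxxxx
        return [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)]
    if cp < 0x10000:                   # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return [0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F),
                0x80 | (cp & 0x3F)]
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return [0xF0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3F),
            0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)]

assert to_utf8(0x244) == [0xC9, 0x84]          # \u{244} -> \u{C9}\u{84}
assert to_utf8(0x2030) == [0xE2, 0x80, 0xB0]   # low end of the range
assert to_utf8(0x2137) == [0xE2, 0x84, 0xB7]   # high end of the range
```

Encoding the two endpoints of a range this way is the first step of the expansion; the alternations then enumerate how the trailing bytes may vary between the two encoded endpoints.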
The benefit of transforming the regular expression immediately after parsing, from a regular expression over code points into a regular expression over bytes, is that it simplifies the rest of the compiler: the compiler then only needs to deal with single bytes rather than with code points of varying size.
\subsection{UTF-8 Advance using ScanThru}

Each bit position in the character class bitstream of a single-byte ASCII character marks either the presence or the absence of the search character. To match such a character, the bit at the current cursor position is tested and the cursor is then advanced by one position. Matching a multibyte search character requires a different procedure. For a multibyte UTF-8 character of length \verb:k:, it is the final (\verb:k:th) byte of the sequence that marks the character's location in the bitstream. Figure~\ref{fig:multibytesequence} illustrates the process of matching a character class of three-byte characters.  The locations of the first two bytes of each character in the character class bitstream CC are marked with zeros, while the bitstream $M_1$ marks the current cursor positions. To match multibyte characters, a \emph{nonfinal} helper bitstream must first be formed, marking the first byte of every two-byte sequence, the first two bytes of every three-byte sequence, and the first three bytes of every four-byte sequence. The \verb:ScanThru(current, nonfinal): operation is then applied to advance each cursor through its \verb:k-1: nonfinal bytes to the corresponding final byte position.  The result is ANDed with the UTF-8 character class bitstream to find the matches, and the cursor is then advanced by one position, ready for the next matching operation.
\begin{figure}[tbh]
\begin{center}
\begin{tabular}{rclr} \\ 
$                           CC$ & \verb`001...001.........`\\
$                          M_1$ & \verb`1.....1....1......`\\
$                     nonfinal$ & \verb`11....11..........`\\
$T_1 = ScanThru(M_1, nonfinal)$ & \verb`..1.....1..1......`\\
$           T_2 = CC \land T_1$ & \verb`..1.....1.........`\\
$           M_2 = Advance(T_2)$ & \verb`...1.....1........`
\end{tabular}
\end{center}
\caption{Processing of a Multibyte Sequence}
\label{fig:multibytesequence}
\end{figure}
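The ScanThru step can be simulated with arbitrary-precision integers standing in for bitstreams. This is a sketch of the bitstream semantics, not icGrep's generated code: bit $i$ of the integer models stream position $i$, so integer addition models the rightward carry propagation that ScanThru relies on.

```python
def scan_thru(markers: int, charclass: int) -> int:
    """Advance each marker through the contiguous run of ones in
    charclass, landing on the first position past that run."""
    return (markers + charclass) & ~charclass

def bits(*positions: int) -> int:
    """Build a bitstream with ones at the given positions."""
    stream = 0
    for p in positions:
        stream |= 1 << p
    return stream

# Two cursors at the first bytes of two three-byte characters;
# nonfinal marks the first two bytes of each character.
cursors = bits(0, 6)
nonfinal = bits(0, 1, 6, 7)
assert scan_thru(cursors, nonfinal) == bits(2, 8)  # final byte positions
```

A cursor at a position with no nonfinal run ahead of it is left in place, since adding a zero produces no carry.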
\subsection{MatchStar for Unicode character classes}

Figure~\ref{fig:multibytesequence_matchstar} shows how the MatchStar operation can be used to find all matches of a multibyte UTF-8 sequence.   The problem is to find all matches to the character class CC that can be reached from the current cursor positions in $M_1$. First, two helper bitstreams, \emph{initial} and \emph{nonfinal}, are formed.  The initial bitstream marks the locations of all single-byte characters and the first bytes of all multibyte characters; any full match to a multibyte sequence must reach the initial position of the next character.  The nonfinal bitstream consists of all positions except those that are final positions of UTF-8 sequences.   It is used to ``fill in the gaps'' in the CC bitstream so that the addition within MatchStar can carry through a contiguous sequence of one bits.  In the figure, the gaps in CC are filled in by a bitwise-or with the nonfinal bitstream to produce $T_1$.   This is then used as the basis of the MatchStar operation to yield $T_2$.  These results are then filtered by ANDing with the initial bitstream to produce the final set of complete matches in $M_2$.
\begin{figure}[tbh]
\begin{center}
\begin{tabular}{rclr} \\ 
$                       CC$ & \verb`001001001.........`\\
$                      M_1$ & \verb`1.................`\\
$                  initial$ & \verb`1..1..1..1..1..1..`\\
$                 nonfinal$ & \verb`11.11.11.11.11.11.`\\
$   T_1 = nonfinal \lor CC$ & \verb`11111111111.11.11.`\\
$T_2 = MatchStar(M_1, T_1)$ & \verb`111111111111......`\\
$  M_2 = T_2 \land initial$ & \verb`1..1..1..1........`\\
\end{tabular}
\end{center}
\caption{Processing of MatchStar for a Multibyte Sequence}
\label{fig:multibytesequence_matchstar}
\end{figure}
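Under the same integer-as-bitstream model, the MatchStar step can be sketched as follows, using the Parabix formulation $MatchStar(M, C) = (((M \land C) + C) \oplus C) \lor M$. The stream values below mirror a text of three-byte characters in which the first three characters match CC; this is an illustrative simulation, not icGrep's implementation.

```python
def match_star(markers: int, charclass: int) -> int:
    """All positions reachable from markers by advancing through
    contiguous runs of ones in charclass, including markers itself
    (the zero-repetition case)."""
    return (((markers & charclass) + charclass) ^ charclass) | markers

def bits(*positions: int) -> int:
    """Build a bitstream with ones at the given positions."""
    stream = 0
    for p in positions:
        stream |= 1 << p
    return stream

CC = bits(2, 5, 8)                             # final bytes of CC matches
nonfinal = bits(*[p for p in range(18) if p % 3 != 2])
initial = bits(0, 3, 6, 9, 12, 15)             # first byte of each character
M1 = bits(0)                                   # one cursor at the start

T1 = nonfinal | CC                             # fill the gaps in CC
T2 = match_star(M1, T1)                        # carry through the run
M2 = T2 & initial                              # keep complete matches only
assert M2 == bits(0, 3, 6, 9)                  # 0, 1, 2, or 3 repetitions
```

The four marks in $M_2$ correspond to matching the character class zero, one, two, or three times, exactly the closure semantics of a starred class.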
\subsection{Predefined Unicode classes}

Every character in the Unicode database is assigned to a general category classification based upon the character's type. As the categories seldom change, the parallel bitstream equations for the categories have been statically compiled into icGrep. Each category contains a large number of code points, so an \emph{If Hierarchy} optimization has been included in the statically compiled implementation of each category.  The optimization works under the assumption that most input documents contain code points from only a small number of writing systems; processing the blocks of code points for characters outside this range is unnecessary and only adds to the total running time of the application.  The optimization therefore tests the input text to determine which code point ranges it contains, and processes the character class equations and the regular expression matching equations only for those ranges. The test is performed with a series of nested \emph{if-else} statements, in a process similar to a binary search: as the nesting of the statements deepens, the code point ranges in the \emph{if} conditions narrow until the exact ranges of the code points present in the text have been found.
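The narrowing process above can be sketched as a recursive range pruning. This is a hypothetical illustration of the If Hierarchy idea, not icGrep's actual API: the function name, the range-splitting policy, and the cutoff span are all invented for the example.

```python
# Hypothetical sketch of the If Hierarchy: before evaluating the
# equations for a block of code points, test whether the input contains
# any code point in that block, narrowing the ranges binary-search style.

def ranges_present(code_points, lo, hi, min_span=0x100):
    """Return the narrow sub-ranges of [lo, hi] that the input touches;
    blocks containing no input code points are pruned without work."""
    if not any(lo <= c <= hi for c in code_points):
        return []                       # skip this whole block of equations
    if hi - lo < min_span:
        return [(lo, hi)]               # narrow enough: evaluate this range
    mid = (lo + hi) // 2
    return (ranges_present(code_points, lo, mid, min_span)
            + ranges_present(code_points, mid + 1, hi, min_span))

cps = [ord(ch) for ch in "résumé Δδ"]   # Latin plus a little Greek
spans = ranges_present(cps, 0x0, 0x10FFFF)
# Only a few narrow ranges in the Basic Multilingual Plane survive; the
# rest of the code space is skipped without evaluating any equations.
assert all(any(lo <= c <= hi for c in cps) for lo, hi in spans)
```

In icGrep the pruning decision is made at run time on bitstream blocks rather than on individual code points, but the effect is the same: equations for absent writing systems are never evaluated.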
\subsection{Character Class Intersection and Difference}
\subsection{Unicode Case-Insensitive Matching}