# Changeset 4464 for docs

Ignore:
Timestamp:
Feb 5, 2015, 10:22:21 PM (4 years ago)
Message:

A few changes.

File:
1 edited

### Legend:

Unmodified
 r4458 \subsection{toUTF8 Transformation}\label{sec:Unicode:toUTF8} The icGrep parser generates an abstract syntax tree (AST) that represents an input regular expression over code points.  This AST is passed as input to a toUTF-8 transformation that generates a new AST that represents the equivalent regular expression over UTF-8 byte sequences. The transformation accomplishes this by first determining the number of UTF-8 bytes that are required to represent each code point contained within each character class.  The code points are then split into sequences of bytes, where each byte is encoded with the necessary UTF-8 prefix after which they are then each assigned to a new character class in the new AST.  For an example, consider the following regular expression that consists entirely of multibyte Unicode characters: \verb:\u{244}[\u{2030}-\u{2137}]:.  The AST from the parser would represent this as a sequence starting with a character class containing the code point \verb:0x244: followed by a second character class containing the range from \verb:0x2030: to \verb:0x2137:.  After being passed through the toUTF-8 transformation this AST would become considerably more complex.  The first code point in the sequence would be encoded as the two byte sequence \verb:\u{C9}\u{84}:. The character class containing the range, which is a range of three byte sequences would be expanded into the series of sequences and alternations that are necessary to specify all of the possible byte encodings that would be contained within the range.  The UTF-8 encoded regular expression for the range \verb:[\u{2030}-\u{2137}]: would be encoded as follows: The icGrep parser generates an abstract syntax tree (AST) that represents an input regular expression over code points.  This AST is passed as input to a toUTF-8 transformation that generates a new AST that represents the equivalent regular expression over UTF-8 byte sequences. The transformation accomplishes this by first determining the number of UTF-8 bytes that are required to represent each code point contained within each character class.  The code points are then split into sequences of bytes, with each byte containing the necessary UTF-8 prefix. The UTF-8 encoded bytes are each assigned to a new character class in the new AST.  For an example, consider the following regular expression that consists entirely of multibyte Unicode characters: \verb:\u{244}[\u{2030}-\u{2137}]:.  The AST from the parser would represent this as a sequence starting with a character class containing the code point \verb:0x244: followed by a second character class containing the range from \verb:0x2030: to \verb:0x2137:.  After being passed through the toUTF-8 transformation this AST would become considerably more complex.  The first code point in the sequence would be encoded as the two byte sequence \verb:\u{C9}\u{84}:. The character class containing the range, which is a range of three byte sequences would be expanded into the series of sequences and alternations that are necessary to specify all of the possible byte encodings that would be contained within the range.  The UTF-8 encoded regular expression for the range \verb:[\u{2030}-\u{2137}]: would be encoded as follows: \newline \subsection{MatchStar for Unicode character classes} Figure~\ref{fig:multibytesequence_matchstar} shows how the MatchStar operation can be used to find all matches of a multibyte UTF-8 sequence.   The problem is to find all matches to the character class CC that can be reached from the current cursor positions in $M_1$. First we form two helper bitstreams \emph{initial} and \emph{nonfinal}.  The initial bitstream marks the locations of all single byte characters and the first bytes of all multibyte characters.  Any full match to a multibyte sequence must reach the initial position of the next character.  The nonfinal bitstream consists of all positions except those that are final positions of UTF-8 sequences.   It is used to "fill in the gaps" in the CC bitstream so that the MatchStar addition can move through a contiguous sequence of one bits.  In the figure, the gaps in CC are filled in by a bitwise-or with the nonfinal bitstream to produce $T_1$.   This is then used as the basis of the MatchStar operation to yield $T_2$.  We then filter these results using the initial bitstream to produce the final set of complete matches $M_2$. Figure~\ref{fig:multibytesequence_matchstar} shows how the MatchStar operation can be used to find all matches of a multibyte UTF-8 sequence.   The problem is to find all matches to the character class CC that can be reached from the current cursor positions in $M_1$. First we form two helper bitstreams \emph{initial} and \emph{nonfinal}.  The initial bitstream marks the locations of all single byte characters and the first bytes of all multibyte characters.  Any full match to a multibyte sequence must reach the initial position of the next character.  The nonfinal bitstream consists of all positions except those that are final positions of UTF-8 sequences.   It is used to "fill in the gaps" in the CC bitstream so that the MatchStar addition can move through a contiguous sequence of one bits.  In the figure, the gaps in CC are filled in by a bitwise-or with the nonfinal bitstream to produce $T_1$.   This is then used as the basis of the MatchStar operation to yield $T_2$.  We then filter these results using the initial bitstream to produce the final set of complete matches in $M_2$. \begin{figure}[tbh]