Changeset 4458
Timestamp: Feb 4, 2015, 6:19:06 PM
File: 1 edited
docs/Working/icGrep/unicode-re.tex
\subsection{toUTF8 Transformation}\label{sec:Unicode:toUTF8}

The icGrep parser generates an abstract syntax tree (AST) that represents an input regular expression over code points. This AST is passed as input to a toUTF-8 transformation that generates a new AST representing the equivalent regular expression over UTF-8 byte sequences.
The transformation accomplishes this by first determining the number of UTF-8 bytes required to represent each code point contained within each character class. The code points are then split into sequences of bytes, where each byte is encoded with the necessary UTF-8 prefix; each byte is then assigned to a new character class in the new AST. For example, consider the following regular expression, which consists entirely of multibyte Unicode characters: \verb:\u{244}[\u{2030}-\u{2137}]:. The AST from the parser represents this as a sequence starting with a character class containing the code point \verb:0x244:, followed by a second character class containing the range from \verb:0x2030: to \verb:0x2137:. After being passed through the toUTF-8 transformation, this AST becomes considerably more complex. The first code point in the sequence is encoded as the two-byte sequence \verb:\u{C9}\u{84}:. The character class containing the range, which is a range of three-byte sequences, is expanded into the series of sequences and alternations necessary to specify all of the possible byte encodings contained within the range. The UTF-8 encoded regular expression for the range \verb:[\u{2030}-\u{2137}]: is encoded as follows:
\newline
…
}

The benefit of transforming the regular expression immediately after parsing, from a regular expression over code points into one over bytes, is that it simplifies the rest of the compiler: the compiler then only needs to be concerned with single bytes, as opposed to code points, which vary in size.
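The byte-sequence splitting above can be illustrated with ordinary UTF-8 encoding. This is a sketch using Python's built-in codec rather than icGrep's internals, and the range expansion in the final comment is worked out by hand:

```python
def utf8_bytes(cp):
    """Return the UTF-8 encoding of a code point as a list of byte values."""
    return list(chr(cp).encode("utf-8"))

# \u{244} is the two-byte sequence C9 84, as in the example above.
assert utf8_bytes(0x244) == [0xC9, 0x84]

# The endpoints of the range [\u{2030}-\u{2137}] are three-byte sequences.
lo, hi = utf8_bytes(0x2030), utf8_bytes(0x2137)
assert lo == [0xE2, 0x80, 0xB0]
assert hi == [0xE2, 0x84, 0xB7]

# Both endpoints share the prefix byte E2, so the range splits into an
# alternation over the remaining bytes (worked out by hand):
#   E2 80 [B0-BF]  |  E2 [81-83] [80-BF]  |  E2 84 [80-B7]
```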
\subsection{UTF-8 Advance using ScanThru}

Each bit position in the character class bitstream of a single-byte ASCII character marks either the location of, or the absence of, the search character. To match the location of such a character, the current cursor position is checked to see whether the bit is set, and the cursor is then advanced by one position. To match the position of a multibyte search character the procedure is different.
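Both the single-byte step and the \verb:ScanThru: step for multibyte characters described in this subsection can be modelled with Python integers as bitstreams, where bit $i$ corresponds to byte position $i$. The sketch below assumes the carry-based formulation ScanThru(M, C) = (M + C) \& $\lnot$C from the Parabix literature; it is an illustrative model, not icGrep's code:

```python
# Bitstream model: a Python int in which bit i marks byte position i.
# The input "a€b" encodes to the bytes 61 E2 82 AC 62 (positions 0..4).
text = "a€b".encode("utf-8")
assert list(text) == [0x61, 0xE2, 0x82, 0xAC, 0x62]

def scan_thru(marker, through):
    # ScanThru(M, C) = (M + C) & ~C: integer addition carries each marker
    # bit forward past the contiguous run of C bits starting at it.
    return (marker + through) & ~through

# Nonfinal stream: the first two bytes of the three-byte '€' (positions 1, 2).
nonfinal = (1 << 1) | (1 << 2)
# Character class stream for '€': marks its final byte, AC, at position 3.
cc_euro = 1 << 3

cursor = 1 << 1                       # cursor at the first byte of '€'
cursor = scan_thru(cursor, nonfinal)  # advance through the nonfinal bytes
assert cursor & cc_euro               # final byte matches the class stream
cursor <<= 1                          # advance one position
assert cursor == 1 << 4               # the cursor now sits on 'b'
```

For a single-byte character the same model degenerates to checking `cursor & cc` and shifting by one, since there are no nonfinal bytes to scan through.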
For multibyte UTF-8 characters of length \verb:k:, it is the final byte of the sequence (the \verb:(k-1):th byte, counting from zero) whose bitstream position marks the character's location. Figure~\ref{fig:multibytesequence} illustrates the process of matching a character class of a three-byte character. The locations of the first two bytes of each character in the character class CC have been marked with zeros, while the bitstream $M_1$ marks the current cursor positions. To match multibyte characters, a \emph{nonfinal} helper bitstream must first be formed. The \emph{nonfinal} bitstream is formed by marking the locations of the first bytes of two-byte sequences, the first two bytes of three-byte sequences, and the first three bytes of four-byte sequences. The \verb:ScanThru(current, nonfinal): operation is then applied to advance all of the current cursor positions to the corresponding final byte positions. To find any matches, the result is then compared with the bits that are set in the UTF-8 character class bitstream. After this, the cursor is advanced by one position, ready for the next matching operation.

\begin{figure}[tbh]
…

\subsection{Predefined Unicode classes}
Every character in the Unicode database is assigned a general category classification based upon the character's type. As the categories seldom change, the parallel bitstream equations for the categories have been statically compiled into icGrep. Because each category contains a large number of code points, an \emph{If Hierarchy} optimization has been included in the statically compiled implementation of each category. The optimization works under the assumption that most input documents contain code points from only a small number of writing systems; processing the blocks of code points for characters outside this range is unnecessary and only adds to the total running time of the application. The optimization therefore tests the input text to determine the ranges of the code points it contains, and processes the character class equations and the regular expression matching equations only for those code point ranges. The test is performed with a series of nested \emph{if else} statements, using a process similar to a binary search: as the nesting of the statements deepens, the code point ranges in the conditions of the \emph{if} statements narrow, until the exact ranges of the code points in the text have been found.
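The nested-test strategy can be sketched as follows. The block boundaries and the returned labels here are hypothetical, chosen only to show how nested range tests narrow, binary-search style, which groups of statically compiled equations need to be evaluated:

```python
def active_blocks(code_points):
    """Return the (illustrative) equation blocks an If Hierarchy would run."""
    evaluated = []
    if any(cp <= 0xFFFF for cp in code_points):            # BMP text present?
        if any(cp <= 0x07FF for cp in code_points):        # narrow further
            evaluated.append("equations for 0000-07FF")
        if any(0x0800 <= cp <= 0xFFFF for cp in code_points):
            evaluated.append("equations for 0800-FFFF")
    if any(cp > 0xFFFF for cp in code_points):             # supplementary planes
        evaluated.append("equations for 10000-10FFFF")
    return evaluated

# Pure-ASCII input touches only the lowest block of equations:
assert active_blocks([0x41, 0x62]) == ["equations for 0000-07FF"]
# Mixed input widens the set of blocks that must be processed:
assert active_blocks([0x41, 0x2030]) == ["equations for 0000-07FF",
                                         "equations for 0800-FFFF"]
```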
\subsection{Character Class Intersection and Difference}