Changeset 4458 for docs


Timestamp: Feb 4, 2015, 6:19:06 PM
Author: daled
Message:

Another pass on sections 4.1 - 4.4

File:
1 edited

  • docs/Working/icGrep/unicode-re.tex

    r4453 r4458  
\subsection{toUTF8 Transformation}\label{sec:Unicode:toUTF8}

- The icGrep parser generates an abstract syntax tree (AST) that represents an input regular expression over code points.  This AST is passed as input to the toUTF-8 transformation that generates a new AST that represents the equivalent regular expression over UTF-8 byte sequences. The transformation accomplishes this by first determining the number of UTF-8 bytes that are required to represent each code point contained within each character class.  The code points are then split into sequences of bytes, where each byte is encoded with the necessary UTF-8 prefix after which they are each assigned to a new character class in the new AST.  For an example, consider the following regular expression that consists entirely of multibyte Unicode characters: \verb:\u{244}[\u{2030}-\u{2137}]:.  The AST from the parser would represent this as a sequence starting with a character class containing the code point \verb:0x244: followed by a second character class containing the range from \verb:0x2030: to \verb:0x2137:.  After being passed through the toUTF-8 transformation this AST would become considerably more complex.  The first code point in the sequence would be encoded as the two byte sequence \verb:\u{C9}\u{84}:. The character class containing the range, which is a range of three byte sequences would be expanded into the series of sequences and alternations that are necessary to specify all of the possible byte encodings that would be contained within the range.  The UTF-8 encoded regular expression for the range \verb:[\u{2030}-\u{2137}]: would be encoded as follows:
+ The icGrep parser generates an abstract syntax tree (AST) that represents the input regular expression over code points.  This AST is passed to the toUTF-8 transformation, which generates a new AST representing the equivalent regular expression over UTF-8 byte sequences.  The transformation first determines the number of UTF-8 bytes required to represent each code point in each character class.  Each code point is then split into its sequence of bytes, each byte is encoded with the necessary UTF-8 prefix, and each byte is assigned to a new character class in the new AST.  For example, consider the following regular expression, which consists entirely of multibyte Unicode characters: \verb:\u{244}[\u{2030}-\u{2137}]:.  The AST from the parser represents this as a sequence starting with a character class containing the code point \verb:0x244:, followed by a second character class containing the range \verb:0x2030: to \verb:0x2137:.  After the toUTF-8 transformation, this AST becomes considerably more complex.  The first code point in the sequence is encoded as the two-byte sequence \verb:\u{C9}\u{84}:.  The character class containing the range, which is a range of three-byte sequences, is expanded into the series of sequences and alternations necessary to specify all of the possible byte encodings within the range.  The UTF-8 encoded regular expression for the range \verb:[\u{2030}-\u{2137}]: is as follows:
\newline

     
}

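The range expansion described above can be sketched in a few lines of Python. This is illustrative only, not icGrep's actual implementation: `split_range` is a hypothetical helper that splits a code point range (whose endpoints have the same UTF-8 length) into alternations of per-byte ranges, the same shape of output the toUTF-8 transformation produces.

```python
# A minimal sketch, not icGrep code: split a code point range whose endpoints
# have the same UTF-8 length into alternations of per-byte ranges.

def split_range(lo: bytes, hi: bytes):
    """Yield byte-range sequences [(low, high), ...] covering [lo, hi]."""
    if len(lo) == 1:
        yield [(lo[0], hi[0])]
    elif lo[0] == hi[0]:
        # Shared leading byte: fix it and recurse on the remaining bytes.
        for tail in split_range(lo[1:], hi[1:]):
            yield [(lo[0], lo[0])] + tail
    else:
        # Differing leading bytes: low fringe, full middle, high fringe.
        n = len(lo) - 1
        for tail in split_range(lo[1:], bytes([0xBF] * n)):
            yield [(lo[0], lo[0])] + tail
        if hi[0] - lo[0] > 1:
            yield [(lo[0] + 1, hi[0] - 1)] + [(0x80, 0xBF)] * n
        for tail in split_range(bytes([0x80] * n), hi[1:]):
            yield [(hi[0], hi[0])] + tail

# The single code point case from the text: U+0244 encodes as C9 84.
assert "\u0244".encode("utf-8") == b"\xc9\x84"

lo, hi = "\u2030".encode("utf-8"), "\u2137".encode("utf-8")
for seq in split_range(lo, hi):
    print(" ".join(f"[{a:02X}-{b:02X}]" for a, b in seq))
# Prints three alternatives:
#   [E2-E2] [80-80] [B0-BF]
#   [E2-E2] [81-83] [80-BF]
#   [E2-E2] [84-84] [80-B7]
```

Each printed line corresponds to one alternation branch in the transformed AST: a sequence of single-byte character classes, one per UTF-8 byte position.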
- Transforming the regular expression immediately after parsing from being a regular expression over code points into a regular expression over bytes simplifies the rest of the compiler, as the compiler only needs to be concerned with single bytes as opposed to code points, which vary in size.
+ The benefit of transforming the regular expression from one over code points into one over bytes immediately after parsing is that it simplifies the rest of the compiler: the compiler then only needs to be concerned with single bytes, as opposed to code points, which vary in size.

\subsection{UTF-8 Advance using ScanThru}

- Each bit position in the character class bitstream of a single byte ASCII character marks either the location of, or the absence of the search character. To match the location of a character the current position of the cursor is checked to see if the bit is set and then the cursor is advanced by one position. To match the position of a multibyte search character the procedure is different. For multibyte UTF-8 characters of length \verb:k:, it is the last \verb:(k-1):th byte of the multibyte sequence in the bitstream that marks the location of the  multibyte character. Figure~\ref{fig:multibytesequence} illustrates the process of matching a character class of a three byte character.  The locations of the first two bytes of each character in the character class CC have been marked with zeros. To match multibyte characters, first a \emph{nonfinal} helper bitstream must be formed.  The \emph{Nonfinal} bitstream is formed by marking the locations of the first bytes of two byte sequences, the first two bytes of three byte sequences, and the first three bytes of any four byte sequences. The \verb:ScanThru(current, nonfinal): operation is then applied, in order to advance all of the current cursor positions to the locations of the \verb:(k-1):th final character positions.  To find any matches the result is then compared with the bits that are set in the UTF-8 character class bitstream. After this, the cursor is advanced by one position to be ready for the next matching operation.
+ Each bit position in the character class bitstream of a single-byte ASCII character marks either the location or the absence of the search character. To match the location of such a character, the bit at the current cursor position is checked and the cursor is then advanced by one position. To match the position of a multibyte search character, the procedure is different. For a multibyte UTF-8 character of length \verb:k:, it is the final byte of the \verb:k:-byte sequence that marks the character's location in the bitstream. Figure~\ref{fig:multibytesequence} illustrates the process of matching a character class of a three-byte character.  The locations of the first two bytes of each character in the character class CC are marked with zeros, while the bitstream $M_1$ marks the current cursor positions. To match multibyte characters, a \emph{nonfinal} helper bitstream must first be formed.  The \emph{nonfinal} bitstream marks the first byte of every two-byte sequence, the first two bytes of every three-byte sequence, and the first three bytes of every four-byte sequence. The \verb:ScanThru(current, nonfinal): operation is then applied to advance each cursor position past the \verb:k-1: nonfinal bytes to the final byte position of its character.  To find matches, the result is compared with the bits that are set in the UTF-8 character class bitstream. After this, the cursor is advanced by one position, ready for the next matching operation.
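The ScanThru step can be modeled on Python integers used as unbounded bitstreams, with bit i corresponding to byte position i of the input. This is a rough sketch with hypothetical data, not icGrep code; the `(marker + thru) & ~thru` formula is the standard Parabix definition of ScanThru.

```python
def scan_thru(marker: int, thru: int) -> int:
    # Carry each marker bit through the contiguous run of 'thru' bits at its
    # position; the bit lands on the first position not in 'thru'.
    return (marker + thru) & ~thru

# Input "a‰b" as UTF-8 bytes: 61 E2 80 B0 62 (bit i = byte position i).
nonfinal = 0b00110   # first two bytes (E2, 80) of the three-byte '‰'
cc       = 0b01000   # CC(‰): marks the final byte position of '‰'
cursor   = 0b00010   # cursor at position 1, the start of '‰'

at_final = scan_thru(cursor, nonfinal)  # cursor moved to the final byte
match    = at_final & cc                # compare with the class bitstream
cursor   = match << 1                   # advance for the next operation
print(bin(match), bin(cursor))          # 0b1000 0b10000
```

The cursor, initially at the first byte of the character, is carried through the two nonfinal bytes, lands on the final byte where CC is set, and is then advanced by one to the following byte.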

\begin{figure}[tbh]
     
\subsection{Predefined Unicode classes}

- With each of the Unicode general categories containing such a large number of code points, an \emph{If Hierarchy} optimization has been included in the statically compiled implementation of each category.  The optimization works under the assumption that most input documents will only contain the code points of the characters from a small number of writing systems. Processing the blocks of code points for characters that exist outside of this range is unnecessary and will only add to the total running time of the application.  The optimization tests the input text to determine the ranges of the code points that are contained in the input text and it only processes the character class equations and the regular expression matching equations for the code point ranges that the input text contains. The optimization tests the input text with a series of nested \emph{if else} statements, using a process similar to that of a binary search.  As the nesting of the statements increases, the range of the code points in the conditions of the \emph{if} statements narrow until the exact ranges of the code points in the text has been found.
+ Every character in the Unicode database is assigned to a general category classification based upon the character's type. As the categories seldom change, the parallel bitstream equations for the categories have been statically compiled into icGrep. Each category contains a large number of code points; therefore, an \emph{If Hierarchy} optimization has been included in the statically compiled implementation of each category.  The optimization works under the assumption that most input documents contain code points from only a small number of writing systems. Processing the blocks of code points for characters outside those ranges is unnecessary and only adds to the total running time of the application.  The optimization tests the input text to determine the ranges of code points it contains, and it processes the character class equations and the regular expression matching equations only for those code point ranges. The test is performed with a series of nested \emph{if else} statements, using a process similar to a binary search.  As the nesting of the statements increases, the code point ranges in the conditions of the \emph{if} statements narrow until the exact ranges of the code points in the text have been found.
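The binary-search-like narrowing can be sketched as follows. This is a hypothetical model of the structure only; the real implementation generates nested if/else blocks guarding bitstream equations rather than running a Python function, and the range-halving and leaf size here are illustrative assumptions.

```python
def active_ranges(cps, lo=0x0, hi=0x10FFFF, leaf=0x800):
    """Return the leaf code point ranges the input actually contains,
    narrowing by halving as the nested if/else hierarchy does."""
    if not any(lo <= cp <= hi for cp in cps):
        return []                # outer 'if' fails: skip this whole block
    if hi - lo + 1 <= leaf:
        return [(lo, hi)]        # innermost range: process its equations
    mid = (lo + hi) // 2
    return (active_ranges(cps, lo, mid, leaf)
            + active_ranges(cps, mid + 1, hi, leaf))

# ASCII text plus one CJK character: only two small ranges survive,
# so equations for all other blocks of the code space are skipped.
cps = [ord(c) for c in "Hello, \u4e16"]
print(active_ranges(cps))
```

For mostly-ASCII input the recursion prunes almost the entire code space at the outermost tests, which is exactly the saving the If Hierarchy is designed to capture.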

\subsection{Character Class Intersection and Difference}