Feb 10, 2015
Minor revision to toUTF8 Transformation

icGrep
r4476
\section{Unicode Regular Expression Methods}\label{sec:Unicode}
\subsection{toUTF8 Transformation}\label{sec:Unicode:toUTF8}
The \icGrep{} parser produces a representation of an input regular expression over variable-length code \emph{points}.
Parabix, however, operates on fixed-size code \emph{units}.
To support code points, the toUTF8 transformation converts the expression into an equivalent expression over code units by splitting code points into corresponding sequences of bytes (UTF-8 code units).
Consider the Unicode regular expression \verb:\u{244}[\u{2030}-\u{2137}]:.
The parser produces a sequence starting with \verb:0x244:, followed by the range \verb:0x2030: to \verb:0x2137:.
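As an illustration only, the following Python sketch shows one way in which a code point, and a range of code points, can be split into UTF-8 byte sequences. It is not the \icGrep{} implementation: the names \verb:utf8_bytes:, \verb:utf8_range: and \verb:show: are hypothetical, and the sketch assumes that ranges contain no surrogate code points (\verb:0xD800: to \verb:0xDFFF:).

```python
def utf8_bytes(cp):
    """Return the UTF-8 code units (bytes) of one code point."""
    return chr(cp).encode("utf-8")


def utf8_range(lo, hi):
    """Split the code point range [lo, hi] into byte-range sequences.

    Each result is a list of (low_byte, high_byte) pairs, one pair per
    byte position. A sequence matches exactly the byte strings whose
    i-th byte lies within the i-th pair.
    Assumption: the range contains no surrogate code points.
    """
    # Step 1: split at the boundaries between 1-, 2-, 3- and 4-byte encodings.
    for max_cp in (0x7F, 0x7FF, 0xFFFF):
        if lo <= max_cp < hi:
            return utf8_range(lo, max_cp) + utf8_range(max_cp + 1, hi)
    # Step 2: split until the trailing continuation bytes cover full spans
    # wherever the leading bytes of lo and hi differ.
    for i in (1, 2, 3):
        m = (1 << (6 * i)) - 1  # mask for the low i continuation bytes
        if (lo & ~m) != (hi & ~m):
            if (lo & m) != 0:
                return utf8_range(lo, lo | m) + utf8_range((lo | m) + 1, hi)
            if (hi & m) != m:
                return utf8_range(lo, (hi & ~m) - 1) + utf8_range(hi & ~m, hi)
    # Step 3: the range is now a simple product of per-byte ranges.
    return [list(zip(utf8_bytes(lo), utf8_bytes(hi)))]


def show(seq):
    """Format one byte-range sequence, e.g. 'E2 80 [B0-BF]'."""
    return " ".join(
        f"{a:02X}" if a == b else f"[{a:02X}-{b:02X}]" for a, b in seq
    )


if __name__ == "__main__":
    print(utf8_bytes(0x244).hex())  # c984
    for seq in utf8_range(0x2030, 0x2137):
        print(show(seq))
    # E2 80 [B0-BF]
    # E2 [81-83] [80-BF]
    # E2 84 [80-B7]
```

Each printed line is one alternative; the alternation of the three sequences covers exactly the encodings of the code points in the range.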
After toUTF8, the first code point in the sequence becomes the two-byte sequence \verb:\u{C9}\u{84}:, and the range expands into the series of sequences and alternations shown below: \newline