Changeset 4490 for docs/Working/icGrep/unicode-re.tex

Timestamp: Feb 10, 2015, 5:55:14 PM
Message:

Intro and Background, remove section 3

File:
1 edited

• docs/Working/icGrep/unicode-re.tex

r4489
\section{Unicode Regular Expression Methods}\label{sec:Unicode}
\section{Bitwise Methods for Unicode}\label{sec:Unicode}

\subsection{UTF-8 Transformation}\label{sec:Unicode:toUTF8}

% The UTF-8 encoded regular expression for the range \verb:[\u{2030}-\u{2137}]: becomes:
% \newline
The \icGrep{} parser produces a representation of an input regular expression over variable-length code \emph{points}.
%
Parabix, however, operates on fixed-size code \emph{units}.
%
To support code points, the toUTF8 transformation converts the expression into an equivalent expression over code units by splitting code points into corresponding sequences of bytes (UTF-8 code units).
%and assigning them to new character classes.
%
The \icGrep{} regular expression parser produces a representation of an input regular expression over Unicode code points.
To process UTF-8 data streams, however, these expressions must first be converted to equivalent expressions in terms of UTF-8 code units.
%
Consider the Unicode regular expression \verb:\u{244}[\u{2030}-\u{2137}]:.
%
The parser produces a sequence starting with a \verb:0x244: followed by the range of \verb:0x2030: to \verb:0x2137:.
%
After toUTF8, the first code point in the sequence becomes the two byte sequence ``\verb:\u{C9}\u{84}:'',
After toUTF8, the first codepoint in the sequence becomes the two byte sequence ``\verb:\u{C9}\u{84}:'',
and the range expands into the series of sequences and alternations shown below:
\newline
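The code-point-to-code-unit conversion described in this subsection can be illustrated with a small Python sketch. This is not the \icGrep{} implementation: the function names (\verb:encode_utf8:, \verb:split_same_length:, \verb:utf8_range:, \verb:show:) are hypothetical, surrogate code points are not excluded, and the order and grouping of the alternatives need not match what \icGrep{} emits. The sketch assumes only the standard UTF-8 encoding rules, and that a code point range can be split into alternatives in which each position is a single byte or a byte range.

```python
# Illustrative sketch only; all names here are hypothetical, not icGrep's API.

def encode_utf8(cp):
    """Encode one code point as UTF-8 code units (surrogates are not rejected)."""
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def split_same_length(lo, hi):
    """Split the byte-string interval [lo, hi] (equal lengths, lo <= hi) into
    alternatives, each a list of (low_byte, high_byte) ranges per position."""
    n = len(lo)
    if n == 1:
        return [[(lo[0], hi[0])]]
    if lo[0] == hi[0]:
        # Shared leading byte: fix it and recurse on the remaining bytes.
        return [[(lo[0], lo[0])] + rest
                for rest in split_same_length(lo[1:], hi[1:])]
    full = [(0x80, 0xBF)] * (n - 1)   # any continuation bytes
    out, tail = [], []
    first, last = lo[0], hi[0]
    if any(b != 0x80 for b in lo[1:]):
        # Partial block under the lowest leading byte.
        out += [[(lo[0], lo[0])] + rest
                for rest in split_same_length(lo[1:], bytes([0xBF] * (n - 1)))]
        first += 1
    if any(b != 0xBF for b in hi[1:]):
        # Partial block under the highest leading byte.
        tail = [[(hi[0], hi[0])] + rest
                for rest in split_same_length(bytes([0x80] * (n - 1)), hi[1:])]
        last -= 1
    if first <= last:
        # Leading bytes whose continuation bytes are unconstrained.
        out.append([(first, last)] + full)
    return out + tail


def utf8_range(lo, hi):
    """Express the code point range [lo, hi] as alternatives over UTF-8 bytes."""
    out = []
    # Split at the boundaries between 1-, 2-, 3- and 4-byte encodings.
    for first, last in ((0x0, 0x7F), (0x80, 0x7FF),
                        (0x800, 0xFFFF), (0x10000, 0x10FFFF)):
        a, b = max(lo, first), min(hi, last)
        if a <= b:
            out += split_same_length(encode_utf8(a), encode_utf8(b))
    return out


def show(seqs):
    """Render the alternatives in a regex-like notation over bytes."""
    def unit(a, b):
        return f"\\x{a:02X}" if a == b else f"[\\x{a:02X}-\\x{b:02X}]"
    return "|".join("".join(unit(a, b) for a, b in seq) for seq in seqs)
```

Under these assumptions, \verb:encode_utf8(0x244): returns the bytes C9 84, agreeing with the two byte sequence quoted above, and \verb:show(utf8_range(0x2030, 0x2137)): returns \verb:\xE2\x80[\xB0-\xBF]|\xE2[\x81-\x83][\x80-\xBF]|\xE2\x84[\x80-\xB7]:. Splitting at the encoding-length boundaries first keeps every alternative a fixed-length byte sequence, so each position can be handled as a separate byte-level character class.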