Timestamp:
Feb 10, 2015, 5:55:14 PM
Author:
cameron
Message:

Intro and Background, remove section 3

File:
1 edited

  • docs/Working/icGrep/unicode-re.tex

    r4489  r4490
        1         \section{Unicode Regular Expression Methods}\label{sec:Unicode}
        2
               1  \section{Bitwise Methods for Unicode}\label{sec:Unicode}
        3      2  \subsection{UTF-8 Transformation}\label{sec:Unicode:toUTF8}
        4      3

       21     20  % The UTF-8 encoded regular expression for the range \verb:[\u{2030}-\u{2137}]: becomes:
       22     21  % \newline
       23
       24
       25         The \icGrep{} parser produces a representation of an input regular expression over variable-length code \emph{points}.
       26         %
       27         Parabix, however, operates on fixed-size code \emph{units}.
       28         %
       29         To support code points, the toUTF8 transformation converts the expression into an equivalent expression over code units
       30         by splitting code points into corresponding sequences of bytes (UTF-8 code units).
       31         %and assigning them to new character classes.
       32         %
              22  The \icGrep{} regular expression parser produces a representation of an input regular expression over Unicode code points.
              23  To process UTF-8 data streams, however, these expressions must first be converted to equivalent expressions in terms of UTF-8 code units.
              24  \%
       33     25  Consider the Unicode regular expression `\verb:\u{244}[\u{2030}-\u{2137}]:`.
       34     26  %
       35     27  The parser produces a sequence starting with a \verb:0x244: followed by the range of \verb:0x2030: to \verb:0x2137:.
       36     28  %
       37         After toUTF8, the first code point in the sequence becomes the two byte sequence ``\verb:\u{C9}\u{84}:'',
              29  After toUTF8, the first codepoint in the sequence becomes the two byte sequence ``\verb:\u{C9}\u{84}:'',
       38     30  and the range expands into the series of sequences and alternations shown below:
       39     31  \newline
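
Illustration (not part of the changeset): the toUTF8 step described in the changed text can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the icGrep implementation, which this changeset does not show. The helper names `utf8_encode` and `split_range` are hypothetical, the range-splitting strategy is a generic one chosen for illustration, and surrogate code points are ignored for brevity. The sketch encodes a single code point into its UTF-8 code units and expands a code point range into alternatives, each of which is a sequence of byte ranges.

```python
from typing import List, Tuple

ByteRange = Tuple[int, int]      # inclusive (low, high) range for one code unit
Sequence = List[ByteRange]       # one alternative: a sequence of byte ranges


def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 code units (hypothetical helper)."""
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def split_range(lo: int, hi: int) -> List[Sequence]:
    """Expand the code point range [lo, hi] into alternatives of byte ranges.

    Assumption: the range contains no surrogates (U+D800 to U+DFFF).
    """
    # Split where the encoded length changes (1, 2, 3, 4 bytes).
    for limit in (0x7F, 0x7FF, 0xFFFF):
        if lo <= limit < hi:
            return split_range(lo, limit) + split_range(limit + 1, hi)
    if hi < 0x80:
        return [[(lo, hi)]]
    # Split until every trailing byte position spans a contiguous range.
    for i in range(1, 4):
        mask = (1 << (6 * i)) - 1
        if (lo & ~mask) != (hi & ~mask):
            if lo & mask != 0:
                return split_range(lo, lo | mask) + split_range((lo | mask) + 1, hi)
            if hi & mask != mask:
                return split_range(lo, (hi & ~mask) - 1) + split_range(hi & ~mask, hi)
    # lo and hi now differ only in positions that cover full sub-ranges,
    # so pairing their bytes position by position gives one alternative.
    return [list(zip(utf8_encode(lo), utf8_encode(hi)))]
```

For the example in the text, `utf8_encode(0x244)` yields the bytes C9 84, and `split_range(0x2030, 0x2137)` yields three alternatives: E2 80 [B0-BF], E2 [81-83] [80-BF], and E2 84 [80-B7].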