Changeset 4489 for docs/Working/icGrep


Timestamp: Feb 10, 2015, 3:36:52 PM
Author: nmedfort
Message: Minor revision to toUTF8 Transformation
Location: docs/Working/icGrep
Files: 3 edited

  • docs/Working/icGrep/introduction.tex

    r4472 r4489  
    32 32   to apply well to Unicode regular expression matching problems, which routinely
    33 33   require thousands of DFA states for named Unicode properties.
    34    - Building on the Parabix framework, Cameron at al \cite{cameron2014bitwise} introduce
       34 + Building on the Parabix framework, Cameron et al.~\cite{cameron2014bitwise} introduce
    35 35   regular expression matching using the bitwise
    36 36   data parallel approach together with the MatchStar primitive
  • docs/Working/icGrep/unicode-re.tex

    r4476 r4489  
     1    - \textsc{}\section{Unicode Regular Expression Methods}\label{sec:Unicode}
        1 + \section{Unicode Regular Expression Methods}\label{sec:Unicode}
     2  2
     3    - \subsection{toUTF8 Transformation}\label{sec:Unicode:toUTF8}
        3 + \subsection{UTF-8 Transformation}\label{sec:Unicode:toUTF8}
     4  4
     5    - The \icGrep{} parser produces a representation of an input regular expression over code points.
     6    - Parabix, however, operates on fixed size code \emph{units} instead of varyingly sized code \emph{points}.
     7    - We devised a toUTF-8 transformation that converts the expression over code points into an equivalent expression over code units.
     8    - %The transformation accomplishes this by first determining the number of UTF-8 bytes
     9    - %required to represent each code point contained within each character class.
    10    - The process splits code points into corresponding sequences of bytes (UTF-8 code units), %, with each byte containing the necessary UTF-8 prefix.
    11    - %The UTF-8 encoded bytes are each assigned to
    12    - and assigns these code units to new character classes.% in the new AST.
    13    - Consider the multibyte Unicode regular expression `\verb:\u{244}[\u{2030}-\u{2137}]:`.
    14    - The parser produces a sequence starting with a character class for \verb:0x244:
    15    - followed by a character class for the range from \verb:0x2030: to \verb:0x2137:.
    16    - After toUTF-8, %this AST becomes more complex.
    17    - the first code point in the sequence becomes the two byte sequence `\verb:\u{C9}\u{84}:`,
    18    - and the character class for the range%, which is a range of three byte sequences,
    19    - expands into a series of sequences and alternations necessary to specify the byte encodings within the range.
    20    - The UTF-8 encoded regular expression for the range \verb:[\u{2030}-\u{2137}]: becomes:
        5 + % The \icGrep{} parser produces a representation of an input regular expression over code points.
        6 + % Parabix, however, operates on fixed size code \emph{units} instead of varying sized code \emph{points}.
        7 + % We devised a toUTF-8 transformation that converts the expression over code points into an equivalent expression over code units.
        8 + % %The transformation accomplishes this by first determining the number of UTF-8 bytes
        9 + % %required to represent each code point contained within each character class.
       10 + % The process splits code points into corresponding sequences of bytes (UTF-8 code units), %, with each byte containing the necessary UTF-8 prefix.
       11 + % %The UTF-8 encoded bytes are each assigned to
       12 + % and assigns these code units to new character classes.% in the new AST.
       13 + % %
       14 + % Consider the multibyte Unicode regular expression `\verb:\u{244}[\u{2030}-\u{2137}]:`.
       15 + % The parser produces a sequence starting with a character class for \verb:0x244:
       16 + % followed by a character class for the range from \verb:0x2030: to \verb:0x2137:.
       17 + % After toUTF-8, %this AST becomes more complex.
       18 + % the first code point in the sequence becomes the two byte sequence `\verb:\u{C9}\u{84}:`,
       19 + % and the character class for the range%, which is a range of three byte sequences,
       20 + % expands into a series of sequences and alternations necessary to specify the byte encodings within the range.
       21 + % The UTF-8 encoded regular expression for the range \verb:[\u{2030}-\u{2137}]: becomes:
       22 + % \newline
       23 +
       24 +
       25 + The \icGrep{} parser produces a representation of an input regular expression over variable-length code \emph{points}.
       26 + %
       27 + Parabix, however, operates on fixed-size code \emph{units}.
       28 + %
       29 + To support code points, the toUTF8 transformation converts the expression into an equivalent expression over code units
       30 + by splitting code points into corresponding sequences of bytes (UTF-8 code units).
       31 + %and assigning them to new character classes.
       32 + %
       33 + Consider the Unicode regular expression `\verb:\u{244}[\u{2030}-\u{2137}]:`.
       34 + %
       35 + The parser produces a sequence starting with a \verb:0x244: followed by the range of \verb:0x2030: to \verb:0x2137:.
       36 + %
       37 + After toUTF8, the first code point in the sequence becomes the two byte sequence ``\verb:\u{C9}\u{84}:'',
       38 + and the range expands into the series of sequences and alternations shown below:
    21 39   \newline
    22 40
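Reviewer illustration (not part of the changeset): the revised paragraph describes toUTF8 as splitting code points into UTF-8 byte sequences and expanding a code point range into sequences and alternations. The Python sketch below shows one common way to do that, under stated assumptions. It is not icGrep's implementation: the names `utf8_encode`, `utf8_ranges`, and `show` are hypothetical, the splitting strategy is a generic one, and ranges are assumed to contain no surrogate code points.

```python
def utf8_encode(cp):
    """Encode one code point as a list of UTF-8 code units (bytes)."""
    if cp <= 0x7F:
        return [cp]
    if cp <= 0x7FF:
        return [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)]
    if cp <= 0xFFFF:
        return [0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)]
    return [0xF0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3F),
            0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)]


def utf8_ranges(lo, hi):
    """Split the code point range [lo, hi] into alternatives.

    Each alternative is a list of (low_byte, high_byte) pairs, one pair per
    UTF-8 code unit. Assumption: the range contains no surrogates.
    """
    # First split where the encoded length changes (1, 2, 3, 4 bytes).
    for max_cp in (0x7F, 0x7FF, 0xFFFF):
        if lo <= max_cp < hi:
            return utf8_ranges(lo, max_cp) + utf8_ranges(max_cp + 1, hi)
    if hi <= 0x7F:
        return [[(lo, hi)]]
    # Then split until every trailing byte position is either fixed or
    # covers a full continuation range, so per-byte ranges are exact.
    for i in range(1, 4):
        mask = (1 << (6 * i)) - 1
        if (lo & ~mask) != (hi & ~mask):
            if (lo & mask) != 0:
                return utf8_ranges(lo, lo | mask) + utf8_ranges((lo | mask) + 1, hi)
            if (hi & mask) != mask:
                return utf8_ranges(lo, (hi & ~mask) - 1) + utf8_ranges(hi & ~mask, hi)
    return [list(zip(utf8_encode(lo), utf8_encode(hi)))]


def show(alternatives):
    """Render the alternatives as a byte-level regular expression string."""
    def unit(lo, hi):
        return f"\\x{lo:02X}" if lo == hi else f"[\\x{lo:02X}-\\x{hi:02X}]"
    return "|".join("".join(unit(lo, hi) for lo, hi in seq) for seq in alternatives)
```

With these definitions, `utf8_encode(0x244)` gives the bytes C9 84, matching the two-byte sequence quoted in the diff, and `show(utf8_ranges(0x2030, 0x2137))` gives `\xE2\x80[\xB0-\xBF]|\xE2[\x81-\x83][\x80-\xBF]|\xE2\x84[\x80-\xB7]`, that is, three alternatives of three-byte sequences.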