May 14, 2015, 7:26:46 PM (4 years ago)

Initial revision of UTF-8 matching

1 edited


  • docs/Working/icGrep/unicode-re.tex

    r4507 r4555  
    24 \paragraph{Unicode Advance.}  As illustrated in Section \ref{sec:background},  a bitwise shift
    25 (Advance) operation is frequently used in shifting a marker stream
    26 from positions matched by one regular expression element to the next.
    27 However, characters are represented by variable-length byte sequences
    28 in UTF-8, so that a simple shift operation is inadequate to implement
    29 the operation of advancing bit stream positions from one Unicode
    30 character to the next. 
     24\paragraph{UTF-8 Byte Classification and Validation.}
     25In UTF-8, bytes are classified as individual ASCII bytes, or as
     26prefixes of two-, three-, or four-byte sequences, or as suffix bytes.
     27In addition, we say that the {\em scope} bytes of a prefix are the
     28immediately following byte positions at which a suffix byte is expected.
     29Two helper streams are also useful.
     30The Initial stream marks ASCII bytes and prefixes of multibyte sequences,
     31while the NonFinal stream marks any position that is not the final
     32position of a Unicode character.
     34\mbox{\rm ASCII} & = & \mbox{\rm CharClass}(\verb:[\x00-\x7F]:) \\
     35\mbox{\rm Prefix2} & = & \mbox{\rm CharClass}(\verb:[\xC2-\xDF]:) \\
     36\mbox{\rm Prefix3} & = & \mbox{\rm CharClass}(\verb:[\xE0-\xEF]:) \\
     37\mbox{\rm Prefix4} & = & \mbox{\rm CharClass}(\verb:[\xF0-\xF4]:) \\
     38\mbox{\rm Suffix} & = & \mbox{\rm CharClass}(\verb:[\x80-\xBF]:) \\
     39\mbox{\rm Scope2} & = & \mbox{\rm Advance}(\mbox{Prefix2} \vee \mbox{Prefix3} \vee \mbox{Prefix4}) \\
     40\mbox{\rm Scope3} & = & \mbox{\rm Advance}(\mbox{\rm Advance}(\mbox{Prefix3} \vee \mbox{Prefix4})) \\
     41\mbox{\rm Scope4} & = & \mbox{\rm Advance}(\mbox{\rm Advance}(\mbox{\rm Advance}(\mbox{Prefix4}))) \\
     42\mbox{\rm Initial} & = & \mbox{\rm ASCII} \vee \mbox{Prefix2} \vee \mbox{Prefix3} \vee \mbox{Prefix4} \\
     43\mbox{\rm NonFinal} & = & \mbox{\rm Prefix} \vee \mbox{\rm Advance}(\mbox{Prefix3} \vee \mbox{Prefix4}) \vee \mbox{\rm Advance}(\mbox{\rm Advance}(\mbox{Prefix4}))
    32 To address the requirements of Unicode advance, we use
     46\paragraph{Unicode Character Classes.}  Whereas ASCII character classes
     47can be determined by simple bitwise logic at a single code unit position,
     48the UnicodeClass stream for a given class involves logic for up to four positions.
     49By convention, we define UnicodeClass($U$) for a given Unicode character class
     50$U$ to be the stream marking the {\em final} position of
     51Unicode character classes.
     53Using these definitions, it is then possible to extend the matching
     54equations to operate with Unicode character classes as follows.
     57\mbox{\rm Match}(m, U) & = &  \mbox{\rm Advance}(\mbox{\rm ScanThru}(m, \mbox{\rm NonFinal}) \wedge \mbox{\rm UnicodeClass}(U)) \\
     58\mbox{\rm Match}(m, U*) & = &  \mbox{\rm MatchStar}(m, \mbox{\rm UnicodeClass}(U) \vee \mbox{\rm NonFinal}) \wedge \mbox{\rm Initial}\\
     61Here, we use
    3362the ScanThru~\cite{cameron2011parallel} operation to move a set of markers
    3463each through the nonfinal bytes of UTF-8 sequences to the final
    3564byte position.
     65\[{\mbox ScanThru}(m, c) = (m + c)  \wedge \neg c\]
    3666Figure~\ref{fig:multibytesequence} shows this
    3767technique in operation in the case of advancing through byte
Note: See TracChangeset for help on using the changeset viewer.