Changeset 4555 for docs/Working/icGrep/unicode-re.tex

Ignore:
Timestamp:
May 14, 2015, 7:26:46 PM (4 years ago)
Message:

Initial revision of UTF-8 matching

File:
1 edited

Legend:

Unmodified
 r4507 } \paragraph{Unicode Advance.}  As illustrated in Section \ref{sec:background},  a bitwise shift (Advance) operation is frequently used in shifting a marker stream from positions matched by one regular expression element to the next. However, characters are represented by variable-length byte sequences in UTF-8, so that a simple shift operation is inadequate to implement the operation of advancing bit stream positions from one Unicode character to the next. \paragraph{UTF-8 Byte Classification and Validation.} In UTF-8, bytes are classified as individual ASCII bytes, or as prefixes of two-, three-, or four-byte sequences, or as suffix bytes. In addition, we say that the {\em scope} bytes of a prefix are the immediately following byte positions at which a suffix byte is expected. Two helper streams are also useful. The Initial stream marks ASCII bytes and prefixes of multibyte sequences, while the NonFinal stream marks any position that is not the final position of a Unicode character. \begin{eqnarray*} \mbox{\rm ASCII} & = & \mbox{\rm CharClass}(\verb:[\x00-\x7F]:) \\ \mbox{\rm Prefix2} & = & \mbox{\rm CharClass}(\verb:[\xC2-\xDF]:) \\ \mbox{\rm Prefix3} & = & \mbox{\rm CharClass}(\verb:[\xE0-\xEF]:) \\ \mbox{\rm Prefix4} & = & \mbox{\rm CharClass}(\verb:[\xF0-\xF4]:) \\ \mbox{\rm Suffix} & = & \mbox{\rm CharClass}(\verb:[\x80-\xBF]:) \\ \mbox{\rm Scope2} & = & \mbox{\rm Advance}(\mbox{Prefix2} \vee \mbox{Prefix3} \vee \mbox{Prefix4}) \\ \mbox{\rm Scope3} & = & \mbox{\rm Advance}(\mbox{\rm Advance}(\mbox{Prefix3} \vee \mbox{Prefix4})) \\ \mbox{\rm Scope4} & = & \mbox{\rm Advance}(\mbox{\rm Advance}(\mbox{\rm Advance}(\mbox{Prefix4}))) \\ \mbox{\rm Initial} & = & \mbox{\rm ASCII} \vee \mbox{Prefix2} \vee \mbox{Prefix3} \vee \mbox{Prefix4} \\ \mbox{\rm NonFinal} & = & \mbox{\rm Prefix} \vee \mbox{\rm Advance}(\mbox{Prefix3} \vee \mbox{Prefix4}) \vee \mbox{\rm Advance}(\mbox{\rm Advance}(\mbox{Prefix4})) \end{eqnarray*} To address the requirements of Unicode advance, we use \paragraph{Unicode Character Classes.}  Whereas ASCII character classes can be determined by simple bitwise logic at a single code unit position, the UnicodeClass stream for a given class involves logic for up to four positions. By convention, we define UnicodeClass($U$) for a given Unicode character class $U$ to be the stream marking the {\em final} position of Unicode character classes. Using these definitions, it is then possible to extend the matching equations to operate with Unicode character classes as follows. \begin{eqnarray*} \mbox{\rm Match}(m, U) & = &  \mbox{\rm Advance}(\mbox{\rm ScanThru}(m, \mbox{\rm NonFinal}) \wedge \mbox{\rm UnicodeClass}(U)) \\ \mbox{\rm Match}(m, U*) & = &  \mbox{\rm MatchStar}(m, \mbox{\rm UnicodeClass}(U) \vee \mbox{\rm NonFinal}) \wedge \mbox{\rm Initial}\\ \end{eqnarray*} Here, we use the ScanThru~\cite{cameron2011parallel} operation to move a set of markers each through the nonfinal bytes of UTF-8 sequences to the final byte position. ${\mbox ScanThru}(m, c) = (m + c) \wedge \neg c$ Figure~\ref{fig:multibytesequence} shows this technique in operation in the case of advancing through byte