source: docs/Working/re/re-Unicode.tex @ 3664

Last change on this file since 3664 was 3511, checked in by cameron, 6 years ago

Move Unicode section.

File size: 4.2 KB
3The introduction of Unicode as a common encoding system including
4the characters of all the world's written languages and notation
5systems has introduced some complexity for regular expression matching
6engines.   Nevertheless, most modern tools and libraries do include
7some form of Unicode support, with varying degrees of performance loss.
9UTF-8, UTF-16 and UTF-32 are the three common transformation formats
10of Unicode depending on whether character code points are encoded
11using 8-bit, 16-bit or 32-bit code units.    In the case of the 32-bit
12code units of UTF-32, each Unicode character is encoded as a
13single 32-bit unit, with the high 11 bit positions all zero.
14A stream of UTF-32 encoded text can then be processed using a
15straighforward application of the techniques described above by
16first transforming to a set of 21 parallel bit streams for each
17of the significant bit positions within UTF-32.
19For most practical purposes, UTF-16 can also be processed similarly,
20considering that each 16-bit code unit represents a single
21character.   For the rarely used characters of the Unicode
22supplementary plane, two 16-bit code units are required.
23Such a two code unit sequence is known as a surrogate pair.
24Following the common practice of treating each member of a surrogate
25pair as pseudo-character, UTF-16 can also be processed by
26the straightforward transposition to 16 parallel bit streams
27and application of the techniques above.
29UTF-8 is probably the most widely used of the Unicode formats,
30primarily because of its compatibility with widely deployed
31networking software based on 8-bit extended-ASCII character
32representations.   UTF-8 is a variable length coding system
33in which each Unicode character is represented using one to
34four 8-bit code units.   
36In order to safely process UTF-8, it is necessary to validate
37that it is well-formed.   Each byte is classified as either
38an ordinary ASCII byte (high bit clear: range 0x00-0x7F), a UTF-8 prefix
39byte of a 2-, 3- or 4- byte sequence (in ranges 0xC2-0xDF,
400xE0-0xEF, and 0xF0-F4, respectively) or as a UTF-8 suffix
41byte in the range 0x80-0xBF.   Parallel bit stream
42technology achieves this validation easily and efficiently \cite{cameron2008case}.
44The UTF-8 byte classification streams produced as a byproduct
45of validation then enable regular expression on the UTF-8
46input streams as follows.   For each multibyte character
47used in the pattern, a character classification bitstream
48may be formed to identify the occurrence of such characters
49at each position in the source stream at which the final
50byte of the character is found.   
52Using this approach to multibyte classification, matching of
53a single $k$-byte character is straightforward.   All current
54match positions are shifted forward $k-1$ positions and then
55combined with the computed character class using bitwise-and.
56The result is then shifted forward one position to produce
57the result of the multibyte character match.  This method
58easily extends to character classes comprising multibyte
59characters all of the same length.
61Matching a character class comprising characters of different
62lengths requires a slightly different strategy.   In this
63case, we take advantage of a bitstream {\tt nonfinal} formed
64from the UTF-8 byte classification streams to consist
65of all those positions at which a UTF-8 prefix byte is
66found, as well as those positions at which the second byte
67of a 3-byte sequence or the second or third byte of a
684-byte sequence are found.   We then apply {\tt ScanThru(current,nonfinal)}
69to advance all current matches to the final position of the
70next character in the input.   Bitwise-and combination
71with the character class bitstream produces the result
72identifying matches with this class, the result is then
73shifted forward 1 position.
75The MatchStar operation for matching arbitrary sequences
76of a character class can similarly take advantage of the
77{\tt nonfinal} stream.   For this case, we form the
78bitwise-or of the {\tt nonfinal} stream with the character
79class stream prior to applying MatchStar.   The result
80will propagate bits to the first character position of
81matches and to final positions of nonmatches.  We then
82clear the nonmatches by combining the result stream
83with the {\tt suffix} byte stream using bitwise-and.
Note: See TracBrowser for help on using the repository browser.