source: docs/PACT14/re-Unicode.tex

Last change on this file was 3880, checked in by cameron, 5 years ago

Various formatting items

File size: 4.2 KB
Line 
1\paragraph*{Unicode}  The introduction of Unicode as a common encoding system including
2the characters of all the world's written languages and notation
3systems has introduced some complexity for regular expression matching
4engines.   Nevertheless, most modern tools and libraries do include
5some form of Unicode support, with varying degrees of performance loss.
6
7UTF-8, UTF-16 and UTF-32 are the three common transformation formats
8of Unicode depending on whether character code points are encoded
9using 8-bit, 16-bit or 32-bit code units.    In the case of the 32-bit
10code units of UTF-32, each Unicode character is encoded as a
11single 32-bit unit, with the high 11 bit positions all zero.
12A stream of UTF-32 encoded text can then be processed using a
13straighforward application of the techniques described above by
14first transforming to a set of 21 parallel bit streams for each
15of the significant bit positions within UTF-32.
16
17For most practical purposes, UTF-16 can also be processed similarly,
18considering that each 16-bit code unit represents a single
19character.   For the rarely used characters of the Unicode
20supplementary plane, two 16-bit code units are required.
21Such a two code unit sequence is known as a surrogate pair.
22Following the common practice of treating each member of a surrogate
23pair as pseudo-character, UTF-16 can also be processed by
24the straightforward transposition to 16 parallel bit streams
25and application of the techniques above.
26
27UTF-8 is probably the most widely used of the Unicode formats,
28primarily because of its compatibility with widely deployed
29networking software based on 8-bit extended-ASCII character
30representations.   UTF-8 is a variable length coding system
31in which each Unicode character is represented using one to
32four 8-bit code units.   
33
34In order to safely process UTF-8, it is necessary to validate
35that it is well-formed.   Each byte is classified as either
36an ordinary ASCII byte (high bit clear: range 0x00-0x7F), a UTF-8 prefix
37byte of a 2-, 3- or 4- byte sequence (in ranges 0xC2-0xDF,
380xE0-0xEF, and 0xF0-F4, respectively) or as a UTF-8 suffix
39byte in the range 0x80-0xBF.   Parallel bit stream
40technology achieves this validation easily and efficiently \cite{cameron2008case}.
41
42The UTF-8 byte classification streams produced as a byproduct
43of validation then enable regular expression on the UTF-8
44input streams as follows.   For each multibyte character
45used in the pattern, a character classification bitstream
46may be formed to identify the occurrence of such characters
47at each position in the source stream at which the final
48byte of the character is found.   
49
50Using this approach to multibyte classification, matching of
51a single $k$-byte character is straightforward.   All current
52match positions are shifted forward $k-1$ positions and then
53combined with the computed character class using bitwise-and.
54The result is then shifted forward one position to produce
55the result of the multibyte character match.  This method
56easily extends to character classes comprising multibyte
57characters all of the same length.
58
59Matching a character class comprising characters of different
60lengths requires a slightly different strategy.   In this
61case, we take advantage of a bitstream {\tt nonfinal} formed
62from the UTF-8 byte classification streams to consist
63of all those positions at which a UTF-8 prefix byte is
64found, as well as those positions at which the second byte
65of a 3-byte sequence or the second or third byte of a
664-byte sequence are found.   We then apply {\tt ScanThru(current,nonfinal)}
67to advance all current matches to the final position of the
68next character in the input.   Bitwise-and combination
69with the character class bitstream produces the result
70identifying matches with this class, the result is then
71shifted forward 1 position.
72
73The MatchStar operation for matching arbitrary sequences
74of a character class can similarly take advantage of the
75{\tt nonfinal} stream.   For this case, we form the
76bitwise-or of the {\tt nonfinal} stream with the character
77class stream prior to applying MatchStar.   The result
78will propagate bits to the first character position of
79matches and to final positions of nonmatches.  We then
80clear the nonmatches by combining the result stream
81with the {\tt suffix} byte stream using bitwise-and.
82
Note: See TracBrowser for help on using the repository browser.