source: docs/Working/re/re-Unicode.tex @ 3490

Last change on this file since 3490 was 3490, checked in by cameron, 6 years ago

Scripts and updates

File size: 4.3 KB
Line 
1\subsection{Unicode}
2
3The introduction of Unicode as a common encoding system including
4the characters of all the world's written languages and notation
5systems has introduced some complexity for regular expression matching
6engines.   Nevertheless, most modern tools and libraries do include
7some form of Unicode support, with varying degrees of performance loss.
8
9UTF-8, UTF-16 and UTF-32 are the three common transformation formats
10of Unicode depending on whether character code points are encoded
11using 8-bit, 16-bit or 32-bit code units.    In the case of the 32-bit
12code units of UTF-32, each Unicode character is encoded as a
13single 32-bit unit, with the high 11 bit positions all zero.
14A stream of UTF-32 encoded text can then be processed using a
15straighforward application of the techniques described above by
16first transforming to a set of 21 parallel bit streams for each
17of the significant bit positions within UTF-32.
18
19For most practical purposes, UTF-16 can also be processed similarly,
20considering that each 16-bit code unit represents a single
21character.   For the rarely used characters of the Unicode
22supplementary plane, two 16-bit code units are required.
23Such a two code unit sequence is known as a surrogate pair.
24Following the common practice of treating each member of a surrogate
25pair as pseudo-character, UTF-16 can also be processed by
26the straightforward transposition to 16 parallel bit streams
27and application of the techniques above.
28
29UTF-8 is probably the most widely used of the Unicode formats,
30primarily because of its compatibility with widely deployed
31networking software based on 8-bit extended-ASCII character
32representations.   UTF-8 is a variable length coding system
33in which each Unicode character is represented using one to
34four 8-bit code units.   
35
36In order to safely process UTF-8, it is necessary to validate
37that it is well-formed.   Each byte is classified as either
38an ordinary ASCII byte (high bit clear: range 0x00-0x7F), a UTF-8 prefix
39byte of a 2-, 3- or 4- byte sequence (in ranges 0xC2-0xDF,
400xE0-0xEF, and 0xF0-F4, respectively) or as a UTF-8 suffix
41byte in the range 0x80-0xBF.   Parallel bit stream
42technology achieves this validation easily and efficiently \cite{cameron2008case}.
43
44The UTF-8 byte classification streams produced as a byproduct
45of validation then enable regular expression on the UTF-8
46input streams as follows.   For each multibyte character
47used in the pattern, a character classification bitstream
48may be formed to identify the occurrence of such characters
49at each position in the source stream at which the final
50byte of the character is found.   Figure \ref{multibytecharclass} illustrates.
51(Describe).
52
53Using this approach to multibyte classification, matching of
54a single $k$-byte character is straightforward.   All current
55match positions are shifted forward $k-1$ positions and then
56combined with the computed character class using bitwise-and.
57The result is then shifted forward one position to produce
58the result of the multibyte character match.  This method
59easily extends to character classes comprising multibyte
60characters all of the same length.
61
62Matching a character class comprising characters of different
63lengths requires a slightly different strategy.   In this
64case, we take advantage of a bitstream {\tt nonfinal} formed
65from the UTF-8 byte classification streams to consist
66of all those positions at which a UTF-8 prefix byte is
67found, as well as those positions at which the second byte
68of a 3-byte sequence or the second or third byte of a
694-byte sequence are found.   We then apply {\tt ScanThru(current,nonfinal)}
70to advance all current matches to the final position of the
71next character in the input.   Bitwise-and combination
72with the character class bitstream produces the result
73identifying matches with this class, the result is then
74shifted forward 1 position.
75
76The MatchStar operation for matching arbitrary sequences
77of a character class can similarly take advantage of the
78{\tt nonfinal} stream.   For this case, we form the
79bitwise-or of the {\tt nonfinal} stream with the character
80class stream prior to applying MatchStar.   The result
81will propagate bits to the first character position of
82matches and to final positions of nonmatches.  We then
83clear the nonmatches by combining the result stream
84with the {\tt suffix} byte stream using bitwise-and.
85
Note: See TracBrowser for help on using the repository browser.