Changeset 3489 for docs/Working


Timestamp:
Sep 15, 2013, 8:50:49 AM
Author:
cameron
Message:

minor updates

File:
1 edited

  • docs/Working/re/re-Unicode.tex

    --- docs/Working/re/re-Unicode.tex (r3179)
    +++ docs/Working/re/re-Unicode.tex (r3489)
    @@ -9,5 +9,5 @@
     UTF-8, UTF-16 and UTF-32 are the three common transformation formats
     of Unicode depending on whether character code points are encoded
    -using 8-bit, 16-bit or 32-bit code units.    In the case of 32-bit
    +using 8-bit, 16-bit or 32-bit code units.    In the case of the 32-bit
     code units of UTF-32, each Unicode character is encoded as a
     single 32-bit unit, with the high 11 bit positions all zero.
    @@ -21,5 +21,5 @@
     character.   For the rarely used characters of the Unicode
     supplementary plane, two 16-bit code units are required.
    -Such a two code unit sequences is known as a surrogate pair.
    +Such a two code unit sequence is known as a surrogate pair.
     Following the common practice of treating each member of a surrogate
     pair as pseudo-character, UTF-16 can also be processed by
    @@ -40,6 +40,5 @@
     0xE0-0xEF, and 0xF0-F4, respectively) or as a UTF-8 suffix
     byte in the range 0x80-0xBF.   Parallel bit stream
    -technology achieves this validation easily, as documented
    -previously for UTF-8 to UTF-16 transcoding.
    +technology achieves this validation easily and efficiently \cite{PPoPP08}.
     
     The UTF-8 byte classification streams produced as a byproduct
    @@ -85,4 +84,2 @@
     with the {\tt suffix} byte stream using bitwise-and.
     
    -
    -
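For readers of this changeset, the byte classification that the edited paragraph refers to can be sketched in Python. This is an illustration only, not code from the repository. The names `classify_utf8_byte`, `class_stream` and `SAMPLE` are hypothetical. The ASCII and two-byte prefix ranges are taken from the UTF-8 definition rather than from the lines shown in the diff. Plain Python integers stand in for the parallel bit streams, with bit i describing byte i.

```python
def classify_utf8_byte(b: int) -> str:
    """Return the UTF-8 class of a single byte value (0-255)."""
    if b <= 0x7F:
        return "ascii"
    if 0x80 <= b <= 0xBF:
        return "suffix"
    if 0xC2 <= b <= 0xDF:
        return "prefix2"
    if 0xE0 <= b <= 0xEF:
        return "prefix3"
    if 0xF0 <= b <= 0xF4:
        return "prefix4"
    # 0xC0, 0xC1 and 0xF5-0xFF never occur in well-formed UTF-8.
    return "invalid"


def class_stream(data: bytes, name: str) -> int:
    """Build a bit stream (as an int) with bit i set when data[i] is in class `name`."""
    stream = 0
    for i, b in enumerate(data):
        if classify_utf8_byte(b) == name:
            stream |= 1 << i
    return stream


# Hypothetical sample: 'a', then a 2-byte, a 3-byte and a 4-byte sequence.
SAMPLE = b"a\xc3\xa9\xe2\x82\xac\xf0\x9f\x98\x80"

suffix = class_stream(SAMPLE, "suffix")
prefix2 = class_stream(SAMPLE, "prefix2")

# A 2-byte prefix at position i expects a suffix byte at position i + 1.
# Shifting the prefix stream by one and combining it with the suffix
# stream using bitwise-and keeps only the positions where that holds.
expected_after_prefix2 = prefix2 << 1
matched = expected_after_prefix2 & suffix
all_two_byte_sequences_ok = matched == expected_after_prefix2
```

The per-byte loop is only for readability. The approach described in the document computes such streams for many bytes at once with bitwise operations, which this sketch does not attempt to reproduce.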