Changes between Version 1 and Version 2 of CharSetArch


Timestamp:
Jun 27, 2008, 8:47:19 AM (11 years ago)
Author:
cameron
Comment:

--

  • CharSetArch

    v1 v2  
    22
    33Parabix has a character set architecture that is designed to provide high-performance native parsing for a wide variety of character sets.
    4 The architecture uses both the C++ template mechanism and an object hierarchy for character sets within families.
     4The architecture uses both the C++ template mechanism and an object hierarchy for character sets within families. 
    55
    66== ASCII vs. EBCDIC ==
    77
    8 Parabix performs native parsing for character sets based on either ASCII or EBCDIC.
     8Parabix performs native parsing for character sets based on either ASCII or EBCDIC compatibility.
    99
    1010The enumerated type !CodeUnit_Base may have either the value ASCII or EBCDIC.
    1111Used as a template parameter, !CodeUnit_Base allows the selection of character codes according to the specified base set.
    1212These are provided by the Ord structure as defined in [source:trunk/src/charsets/ASCII_EBCDIC.h ASCII_EBCDIC.h].
     13
     14The Ord structure is used extensively in the implementation of multibyte recognizers for various
     15XML tokens; see [source:trunk/src/multiliteral.h multiliteral.h] and [source:trunk/src/bytelex.h bytelex.h].
     16In essence, an ASCII- and an EBCDIC-based version of each recognizer are instantiated at compile time.
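The compile-time selection described above can be sketched roughly as follows. This is a minimal illustration and not the contents of ASCII_EBCDIC.h: the member names `LAngle` and `Question`, the recognizer `at_PI_Start`, and the helper `check_PI_Start` are hypothetical, and the EBCDIC values are assumed from the invariant positions shared by the common EBCDIC code pages.

```cpp
#include <cassert>

// Hypothetical stand-in for the enumerated type described above.
enum CodeUnit_Base { ASCII, EBCDIC };

// Hypothetical Ord: one specialization per base set supplies the code
// values of the characters that the recognizers need.
template <CodeUnit_Base C> struct Ord;

template <> struct Ord<ASCII> {
    static constexpr unsigned char LAngle   = 0x3C;  // '<'
    static constexpr unsigned char Question = 0x3F;  // '?'
};

template <> struct Ord<EBCDIC> {
    static constexpr unsigned char LAngle   = 0x4C;  // '<' (assumed EBCDIC value)
    static constexpr unsigned char Question = 0x6F;  // '?' (assumed EBCDIC value)
};

// Recognizer for the two-byte token "<?". It is written once and
// instantiated separately for each base set at compile time.
template <CodeUnit_Base C>
inline bool at_PI_Start(const unsigned char* s) {
    return s[0] == Ord<C>::LAngle && s[1] == Ord<C>::Question;
}

// Exercises both instantiations on matching and mismatching input.
inline void check_PI_Start() {
    const unsigned char ascii_pi[]  = {0x3C, 0x3F};
    const unsigned char ebcdic_pi[] = {0x4C, 0x6F};
    assert(at_PI_Start<ASCII>(ascii_pi));
    assert(at_PI_Start<EBCDIC>(ebcdic_pi));
    assert(!at_PI_Start<ASCII>(ebcdic_pi));
}
```

Because the base set is a template parameter, each instantiation compares against constants and no run-time dispatch on the character set is needed inside the recognizer.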
     17
     18== Single-, Double- or Quad-byte Character Sets ==
     19
     20The ASCII-based character sets are organized into families based on the size of
     21character code units, namely 1, 2 or 4 bytes.  The single-byte family includes
     22ASCII itself, UTF-8 and various extended ASCII character sets such as the ISO-8859-X subfamily.
     23
     24The double-byte character sets include UTF-16, UTF-16BE and UTF-16LE.  In addition,
     25the older sets UCS-2, UCS-2LE and UCS-2BE are also supported.
     26
     27The quad-byte character sets consist of UCS-4/UTF-32 and variants, including the
     28unusual octet order variants identified in the XML 1.0 specification.
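To make the octet order variants concrete, here is a small sketch of how a quad-byte code unit might be assembled from a byte stream. It is an assumption-laden illustration rather than Parabix code: the names `CodeUnit_Size`, `OctetOrder`, `pack_quad`, `read_quad_unit` and `check_quad_orders` are hypothetical. The four orders follow the labels used in Appendix F of the XML 1.0 specification (1234, 4321, 2143, 3412), where digit 1 denotes the most significant octet.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names for the three code unit families (size in bytes).
enum CodeUnit_Size { SingleByte = 1, DoubleByte = 2, QuadByte = 4 };

// Octet orders for quad-byte sets, labelled as in XML 1.0 Appendix F.
enum OctetOrder { Order1234, Order4321, Order2143, Order3412 };

// Combines four octets, given from most to least significant.
inline std::uint32_t pack_quad(unsigned char b1, unsigned char b2,
                               unsigned char b3, unsigned char b4) {
    return (std::uint32_t(b1) << 24) | (std::uint32_t(b2) << 16) |
           (std::uint32_t(b3) << 8)  |  std::uint32_t(b4);
}

// Reads one quad-byte code unit from p[0..3] in the given octet order.
inline std::uint32_t read_quad_unit(const unsigned char* p, OctetOrder order) {
    switch (order) {
        case Order1234: return pack_quad(p[0], p[1], p[2], p[3]);  // big-endian
        case Order4321: return pack_quad(p[3], p[2], p[1], p[0]);  // little-endian
        case Order2143: return pack_quad(p[1], p[0], p[3], p[2]);  // unusual
        case Order3412: return pack_quad(p[2], p[3], p[0], p[1]);  // unusual
    }
    return 0;  // not reached for valid OctetOrder values
}

// Checks that '<' (0x3C) is recovered from its encoding in each order.
inline void check_quad_orders() {
    const unsigned char o1234[] = {0x00, 0x00, 0x00, 0x3C};
    const unsigned char o4321[] = {0x3C, 0x00, 0x00, 0x00};
    const unsigned char o2143[] = {0x00, 0x00, 0x3C, 0x00};
    const unsigned char o3412[] = {0x00, 0x3C, 0x00, 0x00};
    assert(read_quad_unit(o1234, Order1234) == 0x3Cu);
    assert(read_quad_unit(o4321, Order4321) == 0x3Cu);
    assert(read_quad_unit(o2143, Order2143) == 0x3Cu);
    assert(read_quad_unit(o3412, Order3412) == 0x3Cu);
}
```

The byte patterns in `check_quad_orders` are the four encodings of `<` that the XML 1.0 autodetection appendix lists for UCS-4 input without a byte order mark.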
     29
     30
     31
     32