Version 4 (modified by cameron, 11 years ago)


Character Set Architecture

Parabix has a character set architecture that is designed to provide high-performance native parsing for a wide variety of character sets. The architecture uses both the C++ template mechanism and an object hierarchy for character sets within families.


Parabix performs native parsing for character sets based on either ASCII or EBCDIC compatibility.

The enumerated type CodeUnit_Base has two values, ASCII and EBCDIC. Used as a template parameter, CodeUnit_Base selects the character codes of the specified base set. The codes themselves are provided by the Ord structure defined in ASCII_EBCDIC.h.

The Ord structure is used extensively in the implementation of single- and multi-byte recognizers for various XML tokens; see multiliteral.h and bytelex.h. In essence, an ASCII- and an EBCDIC-based version of each recognizer are instantiated at compile time.
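
The compile-time selection described above might look roughly like the following sketch. It is not the actual content of ASCII_EBCDIC.h or bytelex.h: the member names (LAngle, RAngle, QMark), the recognizer names (at_pi_start, at_pi_end) and the sample arrays are assumptions made for illustration. Only the enumeration name CodeUnit_Base, its two values and the structure name Ord come from the text.

```cpp
// Hypothetical stand-ins for the declarations in ASCII_EBCDIC.h.
enum CodeUnit_Base { ASCII, EBCDIC };

// Primary template: specialized once per base set.
template <CodeUnit_Base C> struct Ord;

// Code points of a few XML delimiters in ASCII.
template <> struct Ord<ASCII> {
    static constexpr unsigned char LAngle = 0x3C;  // '<'
    static constexpr unsigned char RAngle = 0x3E;  // '>'
    static constexpr unsigned char QMark  = 0x3F;  // '?'
};

// The same delimiters in EBCDIC.
template <> struct Ord<EBCDIC> {
    static constexpr unsigned char LAngle = 0x4C;  // '<'
    static constexpr unsigned char RAngle = 0x6E;  // '>'
    static constexpr unsigned char QMark  = 0x6F;  // '?'
};

// Illustrative two-byte recognizers (names are assumptions).
// Each is written once and instantiated per base set.
template <CodeUnit_Base C>
constexpr bool at_pi_start(const unsigned char* s) {
    return s[0] == Ord<C>::LAngle && s[1] == Ord<C>::QMark;  // "<?"
}

template <CodeUnit_Base C>
constexpr bool at_pi_end(const unsigned char* s) {
    return s[0] == Ord<C>::QMark && s[1] == Ord<C>::RAngle;  // "?>"
}

// Sample inputs: the bytes of "<?" in each base set.
constexpr unsigned char ascii_sample[]  = {0x3C, 0x3F};
constexpr unsigned char ebcdic_sample[] = {0x4C, 0x6F};
```

Because the base set is a template argument, at_pi_start<ASCII> and at_pi_start<EBCDIC> compile to separate functions with the byte constants folded in, so no per-byte test of the base set is needed at run time.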

Single-, Double- or Quad-byte Character Sets

The ASCII-based character sets are organized into families based on the size of character code units, namely 1, 2 or 4 bytes. The single-byte family includes ASCII itself, UTF-8 and various extended ASCII character sets such as the ISO-8859-X subfamily.

The double-byte character sets include UTF-16, UTF-16BE and UTF-16LE. The older sets UCS-2, UCS-2LE and UCS-2BE are also supported.

The quad-byte character sets consist of UCS-4/UTF-32 and variants, including the unusual octet order variants identified in the XML 1.0 specification.
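
As a rough illustration of how a document could be assigned to one of these families, the sketch below classifies the first four bytes of an XML entity using the byte patterns listed in Appendix F of the XML 1.0 specification, including the unusual UCS-4 octet orders (2143 and 3412). It is not Parabix's own detection code. The names Family, pack and classify are assumptions, and the sketch needs C++14 or later for the constexpr switch.

```cpp
#include <cstdint>

enum CodeUnit_Base { ASCII, EBCDIC };

// Hypothetical result type: base set plus code unit size in bytes.
struct Family {
    CodeUnit_Base base;
    int unit_bytes;  // 1, 2 or 4
};

// Pack four bytes into one 32-bit value, first byte most significant.
constexpr std::uint32_t pack(std::uint8_t b0, std::uint8_t b1,
                             std::uint8_t b2, std::uint8_t b3) {
    return (std::uint32_t(b0) << 24) | (std::uint32_t(b1) << 16) |
           (std::uint32_t(b2) << 8)  |  std::uint32_t(b3);
}

// Classify the first four bytes of an XML entity.
constexpr Family classify(std::uint8_t b0, std::uint8_t b1,
                          std::uint8_t b2, std::uint8_t b3) {
    const std::uint32_t w = pack(b0, b1, b2, b3);
    switch (w) {
        // UCS-4 with a byte order mark: orders 1234, 4321, 2143, 3412.
        case 0x0000FEFF: case 0xFFFE0000:
        case 0x0000FFFE: case 0xFEFF0000:
        // UCS-4 without a mark: '<' as one unit in the same four orders.
        case 0x0000003C: case 0x3C000000:
        case 0x00003C00: case 0x003C0000:
            return {ASCII, 4};
        // UTF-16 without a mark: "<?" big-endian, then little-endian.
        case 0x003C003F: case 0x3C003F00:
            return {ASCII, 2};
        // "<?xm" in EBCDIC.
        case 0x4C6FA794:
            return {EBCDIC, 1};
        default:
            break;
    }
    // UTF-16 byte order mark (the UCS-4 marks were handled above).
    if ((w >> 16) == 0xFEFF || (w >> 16) == 0xFFFE) return {ASCII, 2};
    // Otherwise assume the single-byte ASCII family (UTF-8 and others).
    return {ASCII, 1};
}
```

For example, classify(0x3C, 0x3F, 0x78, 0x6D), the ASCII bytes of "<?xm", yields the single-byte ASCII family, while classify(0x00, 0x00, 0x3C, 0x00) yields the quad-byte family in octet order 2143. Choosing the specific character set within a family would still depend on the encoding declaration.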