Version 3 (modified by cameron, 11 years ago)

Character Set Architecture

Parabix has a character set architecture that is designed to provide high-performance native parsing for a wide variety of character sets. The architecture uses both the C++ template mechanism and an object hierarchy for character sets within families.

ASCII vs. EBCDIC

Parabix performs native parsing for character sets that are compatible with either ASCII or EBCDIC.

The enumerated type CodeUnit_Base has two values, ASCII and EBCDIC. Used as a template parameter, CodeUnit_Base selects character codes according to the specified base set. The codes are provided by the Ord structure defined in ASCII_EBCDIC.h.
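As a rough illustration of this selection mechanism, the sketch below specializes an Ord template on a (base set, character) pair. Only the names CodeUnit_Base, ASCII, EBCDIC and Ord come from the text above. The template signature, the `value` member and the `is_left_angle` helper are assumptions made for the example and may differ from the actual contents of ASCII_EBCDIC.h. The EBCDIC values shown follow code page 037.

```cpp
#include <cassert>
#include <cstdint>

// Base character set selector named in the text.
enum CodeUnit_Base { ASCII, EBCDIC };

// Assumed shape of Ord: the primary template is declared but not defined, so
// asking for a character that has no table entry fails at compile time.
template <CodeUnit_Base C, char c> struct Ord;

// Illustrative entries only (EBCDIC values follow code page 037).
template <> struct Ord<ASCII,  '<'> { static constexpr std::uint8_t value = 0x3C; };
template <> struct Ord<EBCDIC, '<'> { static constexpr std::uint8_t value = 0x4C; };
template <> struct Ord<ASCII,  '&'> { static constexpr std::uint8_t value = 0x26; };
template <> struct Ord<EBCDIC, '&'> { static constexpr std::uint8_t value = 0x50; };

// Hypothetical helper: tests a code unit against '<' in the selected base set.
template <CodeUnit_Base C>
inline bool is_left_angle(std::uint8_t code_unit) {
    return code_unit == Ord<C, '<'>::value;
}

// Usage: the same source expression yields a different constant per base set.
inline void ord_example() {
    assert(is_left_angle<ASCII>(0x3C));
    assert(is_left_angle<EBCDIC>(0x4C));
    assert(!is_left_angle<ASCII>(0x4C));
}
```

Because the base set is a template parameter, the comparison constant is fixed when the template is instantiated and no run-time table lookup is needed.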

The Ord structure is used extensively in the implementation of multibyte recognizers for various XML tokens in multiliteral.h and bytelex.h. In essence, an ASCII-based and an EBCDIC-based version of each recognizer are instantiated at compile time.
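The following sketch suggests how one recognizer written against Ord might yield two compiled versions. The recognizer name `at_comment_start`, its byte-by-byte comparison and the Ord signature are hypothetical. The actual recognizers in multiliteral.h and bytelex.h may be organized quite differently, for example by comparing several bytes at once.

```cpp
#include <cassert>
#include <cstdint>

enum CodeUnit_Base { ASCII, EBCDIC };

// Assumed Ord shape; only the entries needed for "<!--" are listed.
template <CodeUnit_Base C, char c> struct Ord;
template <> struct Ord<ASCII,  '<'> { static constexpr std::uint8_t value = 0x3C; };
template <> struct Ord<ASCII,  '!'> { static constexpr std::uint8_t value = 0x21; };
template <> struct Ord<ASCII,  '-'> { static constexpr std::uint8_t value = 0x2D; };
template <> struct Ord<EBCDIC, '<'> { static constexpr std::uint8_t value = 0x4C; };
template <> struct Ord<EBCDIC, '!'> { static constexpr std::uint8_t value = 0x5A; };
template <> struct Ord<EBCDIC, '-'> { static constexpr std::uint8_t value = 0x60; };

// Hypothetical recognizer for the 4-byte comment opener "<!--".
// The caller must guarantee that at least 4 bytes are readable at p.
template <CodeUnit_Base C>
inline bool at_comment_start(const std::uint8_t* p) {
    return p[0] == Ord<C, '<'>::value && p[1] == Ord<C, '!'>::value &&
           p[2] == Ord<C, '-'>::value && p[3] == Ord<C, '-'>::value;
}

// Usage: each call below instantiates a separate version of the recognizer.
inline void recognizer_example() {
    const std::uint8_t ascii_input[]  = {0x3C, 0x21, 0x2D, 0x2D};
    const std::uint8_t ebcdic_input[] = {0x4C, 0x5A, 0x60, 0x60};
    assert(at_comment_start<ASCII>(ascii_input));
    assert(at_comment_start<EBCDIC>(ebcdic_input));
    assert(!at_comment_start<EBCDIC>(ascii_input));
}
```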

Single-, Double- or Quad-byte Character Sets

The ASCII-based character sets are organized into families based on the size of character code units, namely 1, 2 or 4 bytes. The single-byte family includes ASCII itself, UTF-8 and various extended ASCII character sets such as the ISO-8859-X subfamily.
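One way to picture the grouping by code unit size is a compile-time mapping from a family tag to an unsigned integer type of the matching width. The names `CodeUnit_Size`, `SingleByte`, `DoubleByte`, `QuadByte`, `CodeUnit` and `complete_code_units` are assumptions chosen for this sketch and are not taken from the Parabix sources.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical family tags, one per code unit width described in the text.
enum CodeUnit_Size { SingleByte = 1, DoubleByte = 2, QuadByte = 4 };

// Maps each family to an unsigned integer type of the matching width.
template <CodeUnit_Size S> struct CodeUnit;
template <> struct CodeUnit<SingleByte> { typedef std::uint8_t  type; };
template <> struct CodeUnit<DoubleByte> { typedef std::uint16_t type; };
template <> struct CodeUnit<QuadByte>   { typedef std::uint32_t type; };

// Number of complete code units contained in a buffer of byte_count bytes.
template <CodeUnit_Size S>
constexpr std::size_t complete_code_units(std::size_t byte_count) {
    return byte_count / sizeof(typename CodeUnit<S>::type);
}

// Usage: a 7-byte buffer holds 7, 3 or 1 complete code units per family.
inline void family_example() {
    assert(complete_code_units<SingleByte>(7) == 7);
    assert(complete_code_units<DoubleByte>(7) == 3);
    assert(complete_code_units<QuadByte>(7) == 1);
}
```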

The double-byte character sets include UTF-16, UTF-16BE and UTF-16LE. The older sets UCS-2, UCS-2LE and UCS-2BE are also supported.

The quad-byte character sets consist of UCS-4/UTF-32 and variants, including the unusual octet order variants identified in the XML 1.0 specification.
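To make the octet order variants concrete, the sketch below assembles one 4-byte code unit under each of the four orders listed in Appendix F of the XML 1.0 specification: 1234, 4321, 2143 and 3412. The enumerator names and the `read_quad` function are hypothetical and serve only to illustrate the orderings. They are not the Parabix implementation.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names for the four octet orders. The digits give the order in
// which the octets arrive, with octet 1 the most significant.
enum QuadByteOrder { Order_1234, Order_4321, Order_2143, Order_3412 };

// Assembles one 32-bit code unit from 4 input bytes in the given order.
inline std::uint32_t read_quad(const std::uint8_t* p, QuadByteOrder order) {
    // position[order][k] is the input index of the k-th most significant octet.
    static const int position[4][4] = {
        {0, 1, 2, 3},  // 1234 (big-endian)
        {3, 2, 1, 0},  // 4321 (little-endian)
        {1, 0, 3, 2},  // 2143 (unusual)
        {2, 3, 0, 1},  // 3412 (unusual)
    };
    const int* pos = position[order];
    return (std::uint32_t(p[pos[0]]) << 24) | (std::uint32_t(p[pos[1]]) << 16) |
           (std::uint32_t(p[pos[2]]) << 8)  |  std::uint32_t(p[pos[3]]);
}

// Usage: the byte order mark U+FEFF as it appears under each octet order.
inline void quad_example() {
    const std::uint8_t bom_1234[] = {0x00, 0x00, 0xFE, 0xFF};
    const std::uint8_t bom_4321[] = {0xFF, 0xFE, 0x00, 0x00};
    const std::uint8_t bom_2143[] = {0x00, 0x00, 0xFF, 0xFE};
    const std::uint8_t bom_3412[] = {0xFE, 0xFF, 0x00, 0x00};
    assert(read_quad(bom_1234, Order_1234) == 0xFEFF);
    assert(read_quad(bom_4321, Order_4321) == 0xFEFF);
    assert(read_quad(bom_2143, Order_2143) == 0xFEFF);
    assert(read_quad(bom_3412, Order_3412) == 0xFEFF);
}
```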