Version 5 (modified by cameron, 11 years ago)

Character Set Architecture

Parabix has a character set architecture that is designed to provide high-performance native parsing for a wide variety of character sets. The architecture uses both the C++ template mechanism and an object hierarchy for character sets within families.

Pseudo-ASCII

The Parabix parsing engine uses the concept of a *pseudo-ASCII* byte stream as a core abstraction. This lets a single code base parse documents encoded in any ASCII-based character set that has the following two properties:

  1. ASCII byte values always represent ASCII characters.
  2. ASCII characters are always represented as ASCII byte values.
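
To illustrate why these two properties matter, here is a minimal sketch of a byte-level scan that works unchanged on any pseudo-ASCII stream. The helper names `is_ascii_byte`, `count_left_angles` and `pseudo_ascii_demo` are hypothetical and are not taken from the Parabix sources.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: true when the byte lies in the ASCII range 0x00-0x7F.
constexpr bool is_ascii_byte(unsigned char b) { return b < 0x80; }

// Hypothetical helper: count '<' (0x3C) bytes in a pseudo-ASCII stream.
// Property 1 means every 0x3C byte really is a '<' character.
// Property 2 means no '<' character is encoded as anything other than 0x3C.
inline std::size_t count_left_angles(const std::string& bytes) {
    std::size_t n = 0;
    for (unsigned char b : bytes) {
        if (b == 0x3C) ++n;
    }
    return n;
}

// Usage: the same scan gives the same answer for two different encodings
// of the same markup, because both are pseudo-ASCII.
inline void pseudo_ascii_demo() {
    const std::string utf8   = "<caf\xC3\xA9/>";  // e-acute as two UTF-8 bytes
    const std::string latin1 = "<caf\xE9/>";      // e-acute as one ISO-8859-1 byte
    assert(count_left_angles(utf8) == 1);
    assert(count_left_angles(latin1) == 1);
}
```

The non-ASCII bytes (0xC3, 0xA9, 0xE9) are all 0x80 or above, so the scan for markup characters can skip them without knowing which single-byte encoding is in use.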

EBCDIC

Parabix performs native parsing for character sets based on either ASCII or EBCDIC compatibility.

The enumerated type CodeUnit_Base has two values, ASCII and EBCDIC. Used as a template parameter, CodeUnit_Base selects the character codes of the specified base set. The codes themselves are provided by the Ord structure defined in ASCII_EBCDIC.h.

The Ord structure is used extensively in the implementation of single- and multi-byte recognizers for various XML tokens; see multiliteral.h and bytelex.h. In essence, an ASCII- and an EBCDIC-based version of each recognizer are instantiated at compile time.
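
The following is a rough sketch of how this compile-time selection could look. Only the names CodeUnit_Base, ASCII, EBCDIC and Ord come from the text above; the member names (`LAngle`, `Slash`) and the recognizer `at_end_tag_start` are assumptions made for illustration, and the actual layout in ASCII_EBCDIC.h may differ.

```cpp
#include <cassert>
#include <cstddef>

// Assumed shape only; the real definitions live in ASCII_EBCDIC.h.
enum CodeUnit_Base { ASCII, EBCDIC };

template <CodeUnit_Base C> struct Ord;

// Character codes for the ASCII base set.
template <> struct Ord<ASCII> {
    static constexpr unsigned char LAngle = 0x3C;  // '<'
    static constexpr unsigned char Slash  = 0x2F;  // '/'
};

// Character codes for the EBCDIC base set.
template <> struct Ord<EBCDIC> {
    static constexpr unsigned char LAngle = 0x4C;  // '<'
    static constexpr unsigned char Slash  = 0x61;  // '/'
};

// Hypothetical two-byte recognizer: does the buffer start with "</"?
// One version per base set is instantiated at compile time.
template <CodeUnit_Base C>
inline bool at_end_tag_start(const unsigned char* p, std::size_t len) {
    return len >= 2 && p[0] == Ord<C>::LAngle && p[1] == Ord<C>::Slash;
}
```

Because the base set is a template parameter, each instantiation compares against constants and no run-time dispatch on the character set is needed inside the recognizer.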

Single-, Double- or Quad-byte Character Sets

The ASCII-based character sets are organized into families based on the size of character code units, namely 1, 2 or 4 bytes. The single-byte family includes ASCII itself, UTF-8 and various extended ASCII character sets such as the ISO-8859-X subfamily.

The double-byte character sets include UTF-16, UTF-16BE and UTF-16LE. The older sets UCS-2, UCS-2LE and UCS-2BE are also supported.

The quad-byte character sets consist of UCS-4/UTF-32 and variants, including the unusual octet order variants identified in the XML 1.0 specification.
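
As a sketch of how a document might be assigned to one of the three families, the function below guesses the code unit size from the first four bytes. It follows the byte patterns listed in Appendix F of the XML 1.0 specification. The names `pack4` and `guess_code_unit_size` are hypothetical, and this is not necessarily how Parabix performs the detection.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: pack four bytes into one word, first byte highest.
constexpr std::uint32_t pack4(unsigned char a, unsigned char b,
                              unsigned char c, unsigned char d) {
    return (std::uint32_t(a) << 24) | (std::uint32_t(b) << 16) |
           (std::uint32_t(c) << 8) | std::uint32_t(d);
}

// Hypothetical helper: return 1, 2 or 4, the likely code unit size in bytes.
inline int guess_code_unit_size(const unsigned char s[4]) {
    const std::uint32_t w = pack4(s[0], s[1], s[2], s[3]);
    switch (w) {
        // UCS-4 byte order marks: 1234, 4321, and the unusual 2143 and 3412.
        case 0x0000FEFF: case 0xFFFE0000:
        case 0x0000FFFE: case 0xFEFF0000:
        // UCS-4 without a byte order mark: '<' in each octet order.
        case 0x0000003C: case 0x3C000000:
        case 0x00003C00: case 0x003C0000:
            return 4;
        // UTF-16 without a byte order mark: "<?" big- or little-endian.
        case 0x003C003F: case 0x3C003F00:
            return 2;
    }
    // UTF-16 byte order mark; the UCS-4 marks were already handled above.
    const std::uint32_t hi = w >> 16;
    if (hi == 0xFEFF || hi == 0xFFFE) return 2;
    // Everything else (UTF-8, ASCII, ISO-8859-X, EBCDIC) uses 1-byte units.
    return 1;
}
```

The quad-byte patterns are checked before the double-byte ones because the UCS-4 little-endian byte order mark (FF FE 00 00) begins with the same two bytes as the UTF-16 little-endian mark.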