Character Set Architecture

The XML specification allows individual XML documents to be encoded in any of a wide variety of character sets. The Parabix character set architecture is designed to provide high-performance native parsing for any reasonable character set, while easing the burden on application developers by efficiently transcoding to UTF-8 or UTF-16 as needed.

The architecture combines the C++ template mechanism, which fixes a character-set family at compile time, with an object hierarchy that represents the individual character sets within each family.

Document Character Set, Working Character Set

Conceptually, Parabix may be considered to be a family of parsing engines, one for each possible pair of values (DCS, WCS) where DCS is the document character set in which the XML document is encoded and WCS is the working character set for strings that are delivered to the application (UTF-8 or UTF-16). By supplying specific (DCS, WCS) pairs at compile time through the template mechanism, individual members of the family can be instantiated. However, the space of potential DCS values is itself organized as a family that is encoded as a C++ object hierarchy. This allows partial specification of character-set family at compile-time, together with run-time determination of the processing required for particular members of the family.
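The split between compile-time and run-time selection can be illustrated with a minimal sketch. Every name below (WorkingCharSet, SingleByteCharSet, Latin1, Engine, working_units) is a hypothetical stand-in chosen for this illustration and is not taken from the Parabix sources. The DCS family and the WCS are template parameters, while the particular member of the family is an object supplied at run time.

```cpp
#include <cstddef>
#include <memory>
#include <utility>

// All names here are hypothetical illustrations, not Parabix identifiers.
enum class WorkingCharSet { UTF8, UTF16 };

// Run-time side: an object hierarchy for the members of one DCS family.
struct SingleByteCharSet {
    virtual ~SingleByteCharSet() = default;
    // Map one document byte to its Unicode code point.
    virtual char32_t to_code_point(unsigned char b) const = 0;
};

struct Latin1 : SingleByteCharSet {
    // ISO-8859-1 bytes map directly to U+0000..U+00FF.
    char32_t to_code_point(unsigned char b) const override { return b; }
};

// Compile-time side: one engine type per (DCS family, WCS) pair.
template <class DcsFamily, WorkingCharSet WCS>
class Engine {
public:
    explicit Engine(std::unique_ptr<DcsFamily> dcs) : dcs_(std::move(dcs)) {}

    // Number of WCS code units needed to deliver one document byte.
    std::size_t working_units(unsigned char b) const {
        const char32_t cp = dcs_->to_code_point(b);
        if (WCS == WorkingCharSet::UTF16) return cp < 0x10000 ? 1 : 2;
        if (cp < 0x80) return 1;
        if (cp < 0x800) return 2;
        return cp < 0x10000 ? 3 : 4;
    }

private:
    std::unique_ptr<DcsFamily> dcs_;  // family member chosen at run time
};
```

In this sketch, Engine<SingleByteCharSet, WorkingCharSet::UTF8> and Engine<SingleByteCharSet, WorkingCharSet::UTF16> are distinct instantiations, and either one can be handed a Latin1 object, or any other member of the single-byte family, once the document's encoding is known.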

Pseudo-ASCII

The Parabix parsing engine uses the concept of a *pseudo-ASCII* byte stream as a core abstraction that enables it to use a single code base for parsing documents encoded in any ASCII-based character set that has the following two properties.

  1. ASCII byte values always represent ASCII characters.
  2. ASCII characters are always represented as ASCII byte values.
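As a rough illustration of why these two properties matter, the sketch below scans raw bytes for an ASCII value without decoding any characters. The helper count_ascii_byte and the two sample strings are assumptions made for this example only. UTF-8 satisfies both properties, so every 0x3C byte really is a '<'. Shift_JIS is shown as a counterexample: the katakana character ソ is encoded as the bytes 83 5C, and its second byte falls in the ASCII range without representing an ASCII character.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count occurrences of one ASCII byte value
// by plain byte comparison, with no character decoding.
std::size_t count_ascii_byte(const std::string& bytes, unsigned char ascii) {
    std::size_t n = 0;
    for (unsigned char b : bytes) {
        if (b == ascii) ++n;
    }
    return n;
}

// UTF-8 text "<a>é</a>": é is the byte pair C3 A9, both outside the ASCII range.
const std::string utf8_doc = "<a>\xC3\xA9</a>";

// Shift_JIS encoding of the single character ソ: bytes 83 5C.
// The trail byte 5C equals the ASCII backslash, which violates property 1.
const std::string sjis_text = "\x83\x5C";
```

A byte scan of utf8_doc for '<' finds exactly the two markup delimiters, whereas a byte scan of sjis_text for a backslash reports a match even though the text contains no backslash character.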

EBCDIC

Parabix performs native parsing for character sets based on either ASCII or EBCDIC compatibility.

The enumerated type CodeUnit_Base may have either the value ASCII or EBCDIC. Used as a template parameter, CodeUnit_Base selects the character codes of the specified base set. These codes are provided by the Ord structure defined in ASCII_EBCDIC.h.

The Ord structure is used extensively in the implementation of single- and multi-byte recognizers for various XML tokens; see multiliteral.h and bytelex.h. In essence, an ASCII- and an EBCDIC-based version of each recognizer are instantiated at compile time.
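The following sketch suggests how a structure of this kind might be laid out and used. It is not copied from ASCII_EBCDIC.h: the member names (LAngle, Question, x, m) and the recognizer starts_xml_decl are assumptions for illustration, and only the names CodeUnit_Base, ASCII, EBCDIC and Ord come from the text above. The byte values are those the XML 1.0 specification lists for "<?xm" in ASCII-compatible and EBCDIC encodings.

```cpp
// Values named in the text above.
enum CodeUnit_Base { ASCII, EBCDIC };

// Hypothetical layout: one specialization per base set.
template <CodeUnit_Base C> struct Ord;

template <> struct Ord<ASCII> {
    static const unsigned char LAngle   = 0x3C;  // '<'
    static const unsigned char Question = 0x3F;  // '?'
    static const unsigned char x        = 0x78;  // 'x'
    static const unsigned char m        = 0x6D;  // 'm'
};

template <> struct Ord<EBCDIC> {
    static const unsigned char LAngle   = 0x4C;  // '<'
    static const unsigned char Question = 0x6F;  // '?'
    static const unsigned char x        = 0xA7;  // 'x'
    static const unsigned char m        = 0x94;  // 'm'
};

// Hypothetical recognizer: does the buffer begin with "<?xm" in base set C?
// The caller must supply at least four readable bytes.
template <CodeUnit_Base C>
bool starts_xml_decl(const unsigned char* p) {
    return p[0] == Ord<C>::LAngle && p[1] == Ord<C>::Question &&
           p[2] == Ord<C>::x && p[3] == Ord<C>::m;
}
```

Instantiating starts_xml_decl<ASCII> and starts_xml_decl<EBCDIC> yields two recognizers from one source definition, each comparing against constants that are fixed at compile time.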

Single-, Double- or Quad-byte Character Sets

The ASCII-based character sets are organized into families based on the size of character code units, namely 1, 2 or 4 bytes. The single-byte family includes ASCII itself, UTF-8 and various extended ASCII character sets such as the ISO-8859-X subfamily.

The double-byte character sets include UTF-16, UTF-16BE and UTF-16LE. The older sets UCS-2, UCS-2LE and UCS-2BE are also supported.

The quad-byte character sets consist of UCS-4/UTF-32 and variants, including the unusual octet order variants identified in the XML 1.0 specification.
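To make the three families concrete, the sketch below classifies the first four bytes of a document along the lines of the encoding autodetection table in Appendix F of the XML 1.0 specification, including the two unusual octet orders (2143 and 3412). The names Layout, detect_layout and code_unit_bytes are hypothetical, and this is not claimed to be how Parabix itself performs detection.

```cpp
#include <cstdint>

// Hypothetical classification of code-unit layouts.
enum class Layout {
    UCS4_1234, UCS4_4321, UCS4_2143, UCS4_3412,  // quad-byte family
    UTF16_BE, UTF16_LE,                          // double-byte family
    SingleByte                                   // UTF-8, ASCII, ISO-8859-X, ...
};

// Classify the first four bytes of a document.
Layout detect_layout(const unsigned char b[4]) {
    const std::uint32_t v = (std::uint32_t(b[0]) << 24) | (std::uint32_t(b[1]) << 16) |
                            (std::uint32_t(b[2]) << 8)  |  std::uint32_t(b[3]);
    switch (v) {
        // UCS-4 with a byte order mark, or with a leading '<' and no mark.
        case 0x0000FEFFu: case 0x0000003Cu: return Layout::UCS4_1234;
        case 0xFFFE0000u: case 0x3C000000u: return Layout::UCS4_4321;
        case 0x0000FFFEu: case 0x00003C00u: return Layout::UCS4_2143;
        case 0xFEFF0000u: case 0x003C0000u: return Layout::UCS4_3412;
        // UTF-16 without a byte order mark: "<?" in each byte order.
        case 0x003C003Fu: return Layout::UTF16_BE;
        case 0x3C003F00u: return Layout::UTF16_LE;
        default: break;
    }
    // UTF-16 byte order mark followed by a nonzero code unit.
    if ((v >> 16) == 0xFEFFu) return Layout::UTF16_BE;
    if ((v >> 16) == 0xFFFEu) return Layout::UTF16_LE;
    return Layout::SingleByte;
}

// Size in bytes of one code unit for the detected layout.
int code_unit_bytes(Layout l) {
    switch (l) {
        case Layout::SingleByte: return 1;
        case Layout::UTF16_BE:
        case Layout::UTF16_LE:   return 2;
        default:                 return 4;
    }
}
```

The UCS-4 patterns are tested before the UTF-16 byte order marks because FF FE 00 00 and FE FF 00 00 would otherwise be mistaken for UTF-16 text beginning with a zero code unit.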

Last modified on Jul 22, 2008, 5:12:17 AM