source: docs/Working/icXML/arch-charactersetadapters.tex @ 2866

Last change on this file since 2866 was 2866, checked in by nmedfort, 7 years ago


\subsection{Character Set Adapters}

In Xerces, all input is transcoded into UTF-16, both to simplify parsing within Xerces itself and
to provide the end consumer with a single encoding format.
In the important case of UTF-8 to UTF-16 transcoding, the transcoding costs can be significant,
because of the need to decode and classify each byte of input, mapping variable-length UTF-8
byte sequences into 16-bit UTF-16 code units with bit manipulation operations. In other
cases, transcoding may involve table lookup operations for each byte of input. In any case,
transcoding imposes at least the cost of buffer copying.
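
To make the per-byte cost concrete, the following is a minimal sketch of conventional byte-at-a-time UTF-8 to UTF-16 transcoding. It is an illustration only, not the Xerces transcoder: the function name is hypothetical, and the sketch checks sequence structure but does not reject overlong forms or encoded surrogates.

```python
def utf8_to_utf16_units(data):
    """Decode UTF-8 bytes into a list of 16-bit UTF-16 code units (hypothetical helper)."""
    units = []
    i = 0
    while i < len(data):
        lead = data[i]
        # Classify the lead byte to find the sequence length and its payload bits.
        if lead < 0x80:
            cp, n = lead, 1
        elif lead & 0xE0 == 0xC0:
            cp, n = lead & 0x1F, 2
        elif lead & 0xF0 == 0xE0:
            cp, n = lead & 0x0F, 3
        elif lead & 0xF8 == 0xF0:
            cp, n = lead & 0x07, 4
        else:
            raise ValueError("invalid lead byte at offset %d" % i)
        if i + n > len(data):
            raise ValueError("truncated sequence at offset %d" % i)
        # Each continuation byte contributes six payload bits.
        for k in range(1, n):
            cont = data[i + k]
            if cont & 0xC0 != 0x80:
                raise ValueError("invalid continuation byte at offset %d" % (i + k))
            cp = (cp << 6) | (cont & 0x3F)
        if cp >= 0x10000:
            # Code points beyond the BMP become a surrogate pair.
            cp -= 0x10000
            units.append(0xD800 | (cp >> 10))
            units.append(0xDC00 | (cp & 0x3FF))
        else:
            units.append(cp)
        i += n
    return units
```

Every input byte passes through at least one classification branch and one shift or mask, which is the cost that the remainder of this subsection seeks to avoid.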

In \icXML{}, however, the concept of Character Set Adapters (CSAs) is used to minimize transcoding costs.
Given a specified input encoding, a CSA is responsible for checking that
input code units represent valid characters, mapping the characters of the encoding into
the appropriate bit streams for XML parsing actions (i.e., producing the lexical item
streams), and supporting any eventual transcoding requirements. All of this work
is performed using the parallel bit stream representation of the source input.
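
As a rough illustration of the parallel bit stream representation, the sketch below transposes a block of 8-bit code units into eight bit streams, one per bit position. It assumes that bit 0 denotes the most significant bit of each byte and models each stream as a Python integer whose bit $i$ corresponds to byte $i$ of the block; the function name is hypothetical, and a real implementation would use SIMD operations rather than a per-byte loop.

```python
def transpose(block):
    """Return eight bit streams for a block of bytes (hypothetical helper).

    streams[k] has bit i set exactly when bit k of block[i] is set,
    where k = 0 is taken to be the most significant bit of the byte.
    """
    streams = [0] * 8
    for i, byte in enumerate(block):
        for k in range(8):
            if (byte >> (7 - k)) & 1:
                streams[k] |= 1 << i
    return streams
```

With this layout, a single bitwise operation on a stream acts on one bit of every byte in the block at once.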

An important observation is that many character sets are
extensions of the legacy 7-bit ASCII character set. These include the
various ISO Latin character sets, UTF-8, UTF-16 and many others.
Furthermore, all characters that are significant for parsing XML are confined to the
ASCII repertoire. Thus, a single common set of lexical item calculations
serves to compute lexical item streams for all such ASCII-based character sets.
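
The point can be sketched for encodings with 8-bit code units: one and the same calculation locates every \verb|<| whether the block is ISO Latin-1 or UTF-8, because every byte of a non-ASCII character has its high bit set and so can never match an ASCII code. All helper names below are hypothetical, bit 0 is assumed to be the most significant bit, and streams are modelled as Python integers.

```python
def transpose(block):
    """Eight bit streams for a block of bytes; bit 0 is the most significant bit."""
    streams = [0] * 8
    for i, byte in enumerate(block):
        for k in range(8):
            if (byte >> (7 - k)) & 1:
                streams[k] |= 1 << i
    return streams

def char_stream(streams, ch, length):
    """Bit stream marking every position holding the ASCII character ch."""
    mask = (1 << length) - 1
    code = ord(ch)
    result = mask
    for k in range(8):
        if (code >> (7 - k)) & 1:
            result &= streams[k]           # this bit must be 1
        else:
            result &= ~streams[k] & mask   # this bit must be 0
    return result

def positions(stream):
    """Indices of the set bits in a stream, in increasing order."""
    return [i for i in range(stream.bit_length()) if (stream >> i) & 1]

def left_angle_positions(text, encoding):
    """Encode text and report where '<' occurs, using only bit stream logic."""
    block = text.encode(encoding)
    return positions(char_stream(transpose(block), "<", len(block)))
```

The calculation in \verb|char_stream| never consults the encoding; only the byte offsets of the results differ between encodings.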

A second observation is that, regardless of which character set is used, quite
often all of the characters in a particular block of input will be within the ASCII range.
This is a very simple test to perform using the bit stream representation: it suffices to confirm that the
bit 0 stream is zero for the entire block. For blocks satisfying this test,
all logic dealing with non-ASCII characters can simply be skipped.
Transcoding to UTF-16 becomes trivial, as the high eight bit streams of the
UTF-16 form are each set to zero in this case.
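
A minimal sketch of this fast path follows. It assumes 8-bit code units, takes bit 0 to be the most significant bit of each byte, and emits little-endian UTF-16; the function names are hypothetical.

```python
def bit0_stream(block):
    """Bit stream of the high bits: bit i is set when block[i] >= 0x80."""
    stream = 0
    for i, byte in enumerate(block):
        stream |= (byte >> 7) << i
    return stream

def is_ascii_block(block):
    """A block is pure ASCII exactly when its bit 0 stream is all zero."""
    return bit0_stream(block) == 0

def widen_ascii_to_utf16le(block):
    """Transcode an all-ASCII block: the high byte of every code unit is zero."""
    out = bytearray(2 * len(block))
    out[0::2] = block  # low bytes carry the ASCII codes, high bytes stay zero
    return bytes(out)
```

For example, \verb|widen_ascii_to_utf16le(b"<a/>")| interleaves the four input bytes with zero bytes, with no per-character classification at all.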

A third observation is that repeated transcoding of the names of XML
elements, attributes and so on can be avoided by using a lookup mechanism.
That is, the first occurrence of each symbol is stored in a lookup
table mapping the input encoding to a numeric symbol ID. Transcoding
of the symbol is applied at this time. Subsequent lookup operations
can avoid transcoding by simply retrieving the stored representation.
As symbol lookup is required in any case to apply various XML validation rules,
this achieves the effect of transcoding each occurrence without
additional cost.
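
The lookup mechanism might be sketched as follows. The class and its members are hypothetical and are not the \icXML{} symbol table; Python's built-in codecs stand in for the transcoder, and a counter is included only to show how often transcoding actually runs.

```python
class SymbolTable:
    """Maps names in the input encoding to symbol IDs and stored UTF-16 forms."""

    def __init__(self, encoding):
        self.encoding = encoding
        self._ids = {}             # raw name bytes -> symbol ID
        self._utf16 = []           # symbol ID -> stored UTF-16 (little-endian) form
        self.transcode_count = 0   # number of times transcoding was performed

    def lookup(self, raw_name):
        """Return (symbol ID, UTF-16 form), transcoding only on first occurrence."""
        symbol_id = self._ids.get(raw_name)
        if symbol_id is None:
            symbol_id = len(self._utf16)
            self._ids[raw_name] = symbol_id
            self._utf16.append(raw_name.decode(self.encoding).encode("utf-16-le"))
            self.transcode_count += 1
        return symbol_id, self._utf16[symbol_id]
```

Looking up the same name a second time returns the stored UTF-16 form directly, so the table is keyed on the untranscoded bytes and the transcoder is never reached.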

The cost of individual character transcoding is avoided whenever a block of input is
confined to the ASCII subset, and for all but the first occurrence of any XML element or attribute name.
Furthermore, when transcoding is required, the parallel bit stream representation
supports efficient transcoding operations. In the important
case of UTF-8 to UTF-16 transcoding, the corresponding UTF-16 bit streams
can be calculated in bit-parallel fashion from the UTF-8 streams \cite{Cameron2008},
and all but the final bytes of multibyte sequences can be marked for deletion, as
discussed in the following subsection.
In other cases, transcoding within a block need only be applied to non-ASCII
bytes, which are conveniently identified by iterating through the bit 0 stream
using bit scan operations.
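
The last case can be sketched for a single-byte, table-driven encoding. ISO 8859-15 is chosen here purely as an example, and Python's codec stands in for the lookup table; the function names are hypothetical, bit 0 is assumed to be the most significant bit, and each character in such an encoding is assumed to map to a single UTF-16 code unit.

```python
def bit0_stream(block):
    """Bit stream of the high bits: bit i is set when block[i] >= 0x80."""
    stream = 0
    for i, byte in enumerate(block):
        stream |= (byte >> 7) << i
    return stream

def scan_set_bits(stream):
    """Yield the positions of set bits, lowest first, in bit scan style."""
    while stream:
        lowest = stream & -stream       # isolate the lowest set bit
        yield lowest.bit_length() - 1   # its position
        stream &= stream - 1            # clear it and continue

def transcode_block(block, encoding="iso-8859-15"):
    """Return UTF-16 code units, doing table lookup only for non-ASCII bytes."""
    units = list(block)  # for ASCII bytes the code unit equals the byte value
    for pos in scan_set_bits(bit0_stream(block)):
        units[pos] = ord(bytes([block[pos]]).decode(encoding))
    return units
```

The loop body runs once per non-ASCII byte rather than once per byte of input, so mostly-ASCII blocks incur very little table lookup work.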