% Source: docs/Working/icXML/arch-charactersetadapters.tex @ 2522 (last change in revision 2522, checked in by nmedfort, 7 years ago)
\subsection{Character Set Adapters}

In Xerces, all input is transcoded into UTF-16 to simplify parsing within Xerces itself and
to provide the end consumer with a single encoding format.
In the important case of UTF-8 to UTF-16 transcoding, the transcoding costs can be significant
because of the need to decode and classify each byte of input, mapping variable-length UTF-8
byte sequences into 16-bit UTF-16 code units with bit manipulation operations. In other
cases, transcoding may involve table lookup operations for each byte of input. In any case,
transcoding imposes at least the cost of buffer copying.
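
To make the per-byte cost concrete, the following is a minimal sketch of conventional byte-at-a-time UTF-8 to UTF-16 transcoding. It is not the Xerces implementation; the function name \texttt{utf8\_to\_utf16\_units} is hypothetical, and several validity checks (overlong three- and four-byte forms, encoded surrogates) are omitted for brevity.

```python
def utf8_to_utf16_units(data: bytes) -> list:
    """Sequential UTF-8 to UTF-16 transcoding (illustrative sketch only)."""
    units = []
    i = 0
    while i < len(data):
        b = data[i]
        # Classify the lead byte: sequence length and initial payload bits.
        if b < 0x80:
            cp, n = b, 1
        elif 0xC2 <= b <= 0xDF:
            cp, n = b & 0x1F, 2
        elif 0xE0 <= b <= 0xEF:
            cp, n = b & 0x0F, 3
        elif 0xF0 <= b <= 0xF4:
            cp, n = b & 0x07, 4
        else:
            raise ValueError(f"invalid UTF-8 lead byte at offset {i}")
        if i + n > len(data):
            raise ValueError("truncated UTF-8 sequence")
        # Fold in six payload bits from each continuation byte.
        for k in range(1, n):
            c = data[i + k]
            if c & 0xC0 != 0x80:
                raise ValueError(f"invalid continuation byte at offset {i + k}")
            cp = (cp << 6) | (c & 0x3F)
        if cp > 0x10FFFF:
            raise ValueError("code point out of range")
        i += n
        if cp < 0x10000:
            units.append(cp)  # one 16-bit code unit
        else:
            cp -= 0x10000     # surrogate pair for supplementary characters
            units.append(0xD800 | (cp >> 10))
            units.append(0xDC00 | (cp & 0x3FF))
    return units
```

Every input byte passes through at least one classification branch and one mask or shift, which is the cost the remainder of this subsection seeks to avoid.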

In \icXML{}, however, the concept of Character Set Adapters (CSAs) is used to minimize transcoding costs.
Given a specified input encoding, a CSA is responsible for checking that
input code units represent valid characters, mapping the characters of the encoding into
the appropriate bit streams for XML parsing actions (i.e., producing the lexical item
streams), and supporting any ultimate transcoding requirements. All of this work
is performed using the parallel bit stream representation of the source input.

An important observation is that many character sets are
extensions of the legacy 7-bit ASCII character set. These include the
various ISO Latin character sets, UTF-8, UTF-16 and many others.
Furthermore, all significant characters for parsing XML are confined to the
ASCII repertoire. Thus, a single common set of lexical item calculations
serves to compute lexical item streams for all such ASCII-based character sets.
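
As an illustration of why one calculation serves several encodings, the sketch below computes a marker stream for an ASCII character from eight bit streams. It assumes 8-bit code units, models each bit stream as a Python integer (bit $i$ of the integer corresponds to byte position $i$), and numbers bit 0 as the most significant bit of each byte. The names \texttt{bit\_streams} and \texttt{match\_ascii} are hypothetical, and a real implementation would operate on SIMD registers rather than arbitrary-precision integers.

```python
def bit_streams(block: bytes) -> list:
    """Transpose a block into 8 bit streams; stream k holds bit k of every
    byte, where bit 0 is the most significant bit."""
    streams = [0] * 8
    for pos, byte in enumerate(block):
        for k in range(8):
            if byte & (0x80 >> k):
                streams[k] |= 1 << pos
    return streams


def match_ascii(streams: list, ch: str, length: int) -> int:
    """Return a marker stream with a 1 bit at each position whose byte
    equals the ASCII character ch."""
    code = ord(ch)
    mask = (1 << length) - 1  # confine results to the block
    result = mask
    for k in range(8):
        if code & (0x80 >> k):
            result &= streams[k]
        else:
            result &= ~streams[k] & mask
    return result
```

Because bytes outside the ASCII range always have bit 0 set in these encodings, the same \texttt{match\_ascii} logic finds every \texttt{<} in both a UTF-8 block and an ISO Latin-1 block without any encoding-specific branch.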

A second observation is that, regardless of which character set is used, quite
often all of the characters in a particular block of input will be within the ASCII range.
This is a very simple test to perform using the bit stream representation: it suffices to confirm that the
bit 0 stream is zero for the entire block. For blocks satisfying this test,
all logic dealing with non-ASCII characters can simply be skipped.
Transcoding to UTF-16 becomes trivial, as the high eight bit streams of the
UTF-16 form are each set to zero in this case.
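
The all-ASCII fast path can be sketched as follows, again modelling each bit stream as a Python integer with bit 0 taken to be the most significant bit of each byte. All function names here are hypothetical, and \texttt{units\_from\_streams} exists only to check the result; it is not part of the technique.

```python
def transpose(block: bytes) -> list:
    """Return 8 bit streams; stream k holds bit k of every byte (bit 0 = MSB)."""
    streams = [0] * 8
    for pos, byte in enumerate(block):
        for k in range(8):
            if (byte >> (7 - k)) & 1:
                streams[k] |= 1 << pos
    return streams


def is_ascii_block(streams: list) -> bool:
    """A block is pure ASCII exactly when its bit 0 stream is all zeros."""
    return streams[0] == 0


def ascii_block_to_utf16_streams(streams: list) -> list:
    """Build the 16 UTF-16 bit streams for an all-ASCII block: the high eight
    streams are zero and the low eight are the input streams unchanged."""
    if not is_ascii_block(streams):
        raise ValueError("block contains non-ASCII bytes")
    return [0] * 8 + list(streams)


def units_from_streams(streams16: list, length: int) -> list:
    """Inverse transposition, used here only to verify the result."""
    units = []
    for pos in range(length):
        u = 0
        for k in range(16):
            u = (u << 1) | ((streams16[k] >> pos) & 1)
        units.append(u)
    return units
```

No per-character work is performed on the fast path: the test is a single comparison against zero, and the UTF-16 form reuses the existing eight streams.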

A third observation is that repeated transcoding of the names of XML
elements, attributes and so on can be avoided by using a lookup mechanism.
That is, the first occurrence of each symbol is stored in a lookup
table mapping the input encoding to a numeric symbol ID. Transcoding
of the symbol is applied at this time. Subsequent lookup operations
can avoid transcoding by simply retrieving the stored representation.
As symbol lookup is required to apply various XML validation rules,
this achieves the effect of transcoding each occurrence without
additional cost.
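
A minimal sketch of such a lookup mechanism is shown below. The class \texttt{SymbolTable}, its methods and the \texttt{transcode\_count} counter are hypothetical names introduced for illustration; the actual \icXML{} data structure may differ. The table is keyed on the raw bytes of a name in the input encoding, so a repeated name is resolved without being transcoded again.

```python
class SymbolTable:
    """Maps names in the input encoding to numeric symbol IDs, transcoding
    each distinct name to UTF-16 only once (illustrative sketch)."""

    def __init__(self, encoding: str):
        self.encoding = encoding
        self._ids = {}            # raw name bytes -> symbol ID
        self._utf16 = []          # symbol ID -> stored UTF-16 representation
        self.transcode_count = 0  # counts actual transcoding operations

    def lookup(self, raw: bytes) -> int:
        sid = self._ids.get(raw)
        if sid is None:
            # First occurrence: transcode now and remember the result.
            sid = len(self._utf16)
            self._utf16.append(raw.decode(self.encoding).encode("utf-16-le"))
            self.transcode_count += 1
            self._ids[raw] = sid
        return sid

    def name(self, sid: int) -> bytes:
        """Retrieve the stored UTF-16 form without transcoding."""
        return self._utf16[sid]
```

For example, looking up \texttt{item}, \texttt{price}, \texttt{item} and \texttt{item} in turn performs two transcoding operations rather than four.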

In short, the cost of transcoding individual characters is avoided whenever
they constitute lexical items, whenever a block of input is confined to the ASCII subset,
and for all but the first occurrence of any XML element or attribute name.
Furthermore, when transcoding is required, the parallel bit stream representation
generally supports efficient transcoding operations. In the important
case of UTF-8 to UTF-16 transcoding, the corresponding UTF-16 bit streams
can be calculated in bit-parallel fashion based on UTF-8 streams \cite{Cameron2008},
and all but the final bytes of multibyte sequences can be marked for deletion as
discussed in the following subsection.
In other cases, transcoding within a block need only be applied for non-ASCII
bytes, which are conveniently identified by iterating through the bit 0 stream
using bit scan operations.
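
The final case can be sketched for a single-byte legacy encoding. The example below uses ISO 8859-2 purely as an illustration and builds its lookup table from Python's standard codec; the function names are hypothetical. The bit 0 stream is modelled as a Python integer, and the bit scan is emulated by isolating the lowest set bit with \texttt{x \& -x}, where hardware would use a bit scan or count-trailing-zeros instruction.

```python
def high_bit_stream(block: bytes) -> int:
    """Bit 0 stream of the block: bit i is set when byte i is non-ASCII."""
    stream = 0
    for pos, byte in enumerate(block):
        if byte & 0x80:
            stream |= 1 << pos
    return stream


def non_ascii_positions(bit0_stream: int):
    """Yield the positions of set bits, lowest first (emulated bit scan)."""
    while bit0_stream:
        lowest = bit0_stream & -bit0_stream   # isolate the lowest set bit
        yield lowest.bit_length() - 1         # its position in the block
        bit0_stream &= bit0_stream - 1        # clear it and continue


def transcode_block(block: bytes, table: dict) -> list:
    """Transcode a single-byte-encoded block to UTF-16 code units, consulting
    the table only at non-ASCII positions."""
    units = list(block)  # ASCII bytes already equal their UTF-16 code units
    for pos in non_ascii_positions(high_bit_stream(block)):
        units[pos] = table[block[pos]]
    return units


# Lookup table for the upper half of ISO 8859-2 (an assumed example encoding).
LATIN2_TABLE = {b: ord(bytes([b]).decode("iso8859-2")) for b in range(0x80, 0x100)}
```

The number of table lookups is proportional to the number of non-ASCII bytes in the block rather than to the block length.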