% source: docs/Working/icXML/arch-charactersetadapters.tex @ 2517
% Last change on this file since 2517 was 2517, checked in by cameron, 7 years ago

\subsection{Character Set Adapters}

In Xerces, all input is transcoded into UTF-16 to simplify parsing within Xerces itself and
to provide the end consumer with a single encoding format.
In the important case of UTF-8 to UTF-16 transcoding, the transcoding costs can be significant
because of the need to decode and classify each byte of input, mapping variable-length UTF-8
byte sequences into 16-bit UTF-16 code units with bit manipulation operations.  In other
cases, transcoding may involve table lookup operations for each byte of input.  In any case,
transcoding imposes at least the cost of buffer copying.
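
The per-byte work described above can be illustrated with a minimal scalar sketch.
It is not the Xerces transcoder: the function name \verb|utf8_to_utf16_scalar| is hypothetical,
and the code assumes well-formed UTF-8 input and performs no validation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not part of Xerces): byte-at-a-time UTF-8 to UTF-16.
// Assumes well-formed input; no validation is attempted.
std::u16string utf8_to_utf16_scalar(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        // Classify the lead byte to find the sequence length and payload bits.
        std::uint32_t b0 = static_cast<unsigned char>(in[i]);
        std::uint32_t cp;
        std::size_t len;
        if (b0 < 0x80)      { cp = b0;        len = 1; }
        else if (b0 < 0xE0) { cp = b0 & 0x1F; len = 2; }
        else if (b0 < 0xF0) { cp = b0 & 0x0F; len = 3; }
        else                { cp = b0 & 0x07; len = 4; }
        // Each continuation byte contributes its low six bits.
        for (std::size_t k = 1; k < len && i + k < in.size(); ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        }
        i += len;
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Code points beyond the BMP become a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 | (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 | (cp & 0x3FF)));
        }
    }
    return out;
}
```

Every input byte passes through a branch and one or more shift and mask operations, and every
output code unit is written to a second buffer, which is the cost that the rest of this section seeks to avoid.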

In \icXML{}, however, the concept of Character Set Adapters (CSAs) is used to minimize transcoding costs.
Given a specified input encoding, a Character Set Adapter is responsible for validating that
input code units represent valid characters, mapping the characters of the encoding into
the appropriate bit streams for XML parsing actions (i.e., producing the lexical item
streams), and supporting any eventual transcoding requirements.   All of this work
is performed using the parallel bit stream representation of the source input.

An important observation is that many character sets are some form of
extension to the legacy 7-bit ASCII character set.  This includes the
various ISO Latin character sets, UTF-8 and UTF-16, as well as many others.
Furthermore, all significant characters for parsing XML are confined to the
ASCII repertoire.   Thus, a single common set of lexical item calculations
serves to compute lexical item streams for all such ASCII-based character sets.
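
As an illustration of this shared calculation, the following sketch transposes a block of up to 64 bytes
into eight bit streams and computes a stream marking each \verb|<| character.
The names \verb|BasisBits|, \verb|transpose_block| and \verb|left_angle_stream| are hypothetical,
the transposition is a naive scalar loop rather than the SIMD form a real implementation would use,
and the layout (bit 0 is the most significant bit of each byte; byte $i$ maps to bit $i$ of each word)
is an assumption made for the example.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical layout: bit i of basis[k] is bit k of byte i in the block,
// where bit 0 denotes the most significant bit of the byte.
using BasisBits = std::array<std::uint64_t, 8>;

// Naive scalar transposition of up to 64 bytes into eight bit streams.
BasisBits transpose_block(const unsigned char* block, std::size_t n) {
    BasisBits basis{};
    for (std::size_t i = 0; i < n && i < 64; ++i) {
        for (int k = 0; k < 8; ++k) {
            if (block[i] & (0x80u >> k)) {
                basis[k] |= std::uint64_t{1} << i;
            }
        }
    }
    return basis;
}

// '<' is 0x3C = 00111100 in binary, so the stream is one where bits 2..5 are
// set and bits 0, 1, 6 and 7 are clear. The same expression applies to any
// ASCII-based 8-bit code unit encoding, since non-ASCII bytes have bit 0 set.
std::uint64_t left_angle_stream(const BasisBits& b) {
    return ~b[0] & ~b[1] & b[2] & b[3] & b[4] & b[5] & ~b[6] & ~b[7];
}
```

The same \verb|left_angle_stream| expression gives the expected marks whether the block holds UTF-8 or
ISO Latin-1 text, because bytes outside the ASCII range always have bit 0 set and so never match.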

A second observation is that, regardless of which character set is used, it is
often the case that all of the characters in a particular block of input
happen to be within the ASCII repertoire.   This test is very simple to
perform using the bit stream representation: it only requires confirming that the
bit 0 stream is zero for the entire block.   For blocks satisfying this test,
all logic dealing with non-ASCII characters can simply be skipped.
Transcoding to UTF-16 becomes trivial: the high eight bit streams of the
UTF-16 form are each set to zero in this case.
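
A byte-space analogue of this fast path is sketched below. The function names are hypothetical,
and the code works on ordinary byte arrays rather than on bit streams: OR-ing all bytes of the block
and testing the high bit corresponds to testing that the bit 0 stream is zero, and zero-extending each
byte corresponds to setting the high eight UTF-16 bit streams to zero.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: true when no byte in the block has its high bit set,
// i.e. when the bit 0 stream for the block would be all zero.
bool block_is_ascii(const unsigned char* block, std::size_t n) {
    unsigned char acc = 0;
    for (std::size_t i = 0; i < n; ++i) {
        acc |= block[i];
    }
    return (acc & 0x80) == 0;
}

// Hypothetical fast path: for an all-ASCII block, UTF-16 output is just the
// input bytes zero-extended to 16 bits. Returns false (writing nothing) when
// the block needs the general non-ASCII logic instead.
bool widen_if_ascii(const unsigned char* block, std::size_t n, char16_t* out) {
    if (!block_is_ascii(block, n)) {
        return false;
    }
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = static_cast<char16_t>(block[i]);
    }
    return true;
}
```

A caller would try \verb|widen_if_ascii| first and fall back to the encoding-specific path only for the
blocks on which it returns false.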

A third observation is that repeated transcoding of the names of XML
elements, attributes and so on can be avoided by using a lookup mechanism.
That is, the first occurrence of each symbol is stored in a lookup
table mapping its representation in the input encoding to a numeric symbol ID.   Transcoding
of the symbol is applied at this time.  Subsequent lookup operations
can avoid transcoding by simply retrieving the stored representation.
As symbol lookup is already required to apply various XML validation rules,
this achieves the effect of transcoding each occurrence without
additional cost.
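
The lookup mechanism can be sketched as follows. The class \verb|SymbolTable| and its interface are
hypothetical and are not the \icXML{} data structure; the sketch only assumes that names arrive as raw
byte strings in the input encoding and that some transcoding function is supplied by the caller.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical symbol table: maps a name in its raw input encoding to a
// numeric ID and keeps the UTF-16 form produced on first occurrence.
class SymbolTable {
public:
    using Transcoder = std::function<std::u16string(const std::string&)>;

    explicit SymbolTable(Transcoder t) : transcode_(std::move(t)) {}

    // Returns the ID for the raw name, transcoding only if it is new.
    std::size_t lookup(const std::string& raw) {
        auto it = ids_.find(raw);
        if (it != ids_.end()) {
            return it->second;          // seen before: no transcoding
        }
        std::size_t id = names_.size();
        names_.push_back(transcode_(raw));  // first occurrence: transcode once
        ids_.emplace(raw, id);
        return id;
    }

    // Stored UTF-16 form for a previously returned ID.
    const std::u16string& name(std::size_t id) const { return names_[id]; }

private:
    Transcoder transcode_;
    std::unordered_map<std::string, std::size_t> ids_;
    std::vector<std::u16string> names_;
};
```

Because the key is the untranscoded byte sequence, a repeated element or attribute name costs one hash
lookup and no transcoding work.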

In short, the cost of transcoding individual characters is avoided whenever
the characters constitute lexical items, whenever a block of input is confined to the ASCII subset,
and for all but the first occurrence of any XML element or attribute name.
Furthermore, when transcoding is required, the parallel bit stream representation
generally supports efficient transcoding operations.   In the important
case of UTF-8 to UTF-16 transcoding, the corresponding UTF-16 bit streams
can be calculated in bit-parallel fashion from the UTF-8 streams \cite{Cameron2008},
and all but the final bytes of multibyte sequences can be marked for deletion.
In other cases, transcoding within a block need only be applied to non-ASCII
bytes, which are conveniently identified by iterating through the bit 0 stream
using bit scan operations.
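
The bit scan iteration can be sketched as below. The function name is hypothetical, byte $i$ of the block
is assumed to map to bit $i$ of the 64-bit word, and \verb|__builtin_ctzll| is a GCC and Clang builtin
(C++20 code could use \verb|std::countr_zero| instead).

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: given the bit 0 stream for a 64-byte block (one bit
// per byte, set for non-ASCII bytes), list the positions that need the
// encoding-specific transcoding path.
std::vector<int> non_ascii_positions(std::uint64_t bit0) {
    std::vector<int> positions;
    while (bit0 != 0) {
        // Bit scan forward: index of the lowest set bit.
        positions.push_back(__builtin_ctzll(bit0));
        // Clear the lowest set bit and continue with the remainder.
        bit0 &= bit0 - 1;
    }
    return positions;
}
```

The loop runs once per non-ASCII byte rather than once per byte of input, so a mostly ASCII block incurs
very little work.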