Changeset 2517 for docs/Working


Timestamp:
Oct 20, 2012, 8:56:45 AM
Author:
cameron
Message:

Character Set Adapters

File:
1 edited

  • docs/Working/icXML/arch-charactersetadapters.tex

    r2505 r2517  
    44In Xerces, all input is transcoded into UTF-16 to simplify the parsing costs of Xerces itself and
    55provide the end-consumer with a single encoding format.
    6 \icXML{} uses Character Set Adapters (CSAs) to parse data from encoding type into a set of basis and
    7 lexical bit streams.
     6In the important case of UTF-8 to UTF-16 transcoding, the transcoding costs can be significant,
     7because of the need to decode and classify each byte of input, mapping variable-length UTF-8
     8byte sequences into 16-bit UTF-16 code units with bit manipulation operations.   In other
     9cases, transcoding may involve table lookup operations for each byte of input.  In any case,
     10transcoding imposes at least a cost of buffer copying.
     11
     12\icXML{}, however, uses Character Set Adapters (CSAs) to minimize transcoding costs.
     13Given a specified input encoding, a Character Set Adapter is responsible for validating that
     14input code units represent valid characters, mapping the characters of the encoding into
     15the appropriate bit streams for XML parsing actions (i.e., producing the lexical item
     16streams), as well as supporting ultimate transcoding requirements.   All of this work
     17is performed using the parallel bit stream representation of the source input.
     18
     19An important observation is that many character sets are some form of
     20extension to the legacy 7-bit ASCII character set.  This includes the
     21various ISO Latin character sets, UTF-8 and UTF-16 as well as many others.
     22Furthermore, all significant characters for parsing XML are confined to the
     23ASCII repertoire.   Thus, a single common set of lexical item calculations
     24serves to compute lexical item streams for all such ASCII-based character sets.
     25
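As an illustration of this first observation, the following minimal Python sketch computes lexical item streams from basis bit streams. It is not the \icXML{} implementation: the names `basis_streams` and `match_char` are hypothetical, Python integers stand in for SIMD registers, and bit 0 is taken to be the most significant bit of each byte. The same calculation locates `<` and `>` whether the input is Latin-1 or UTF-8.

```python
def basis_streams(data: bytes) -> list[int]:
    """Transpose bytes into 8 basis bit streams.

    Stream k holds bit k of every byte (bit 0 = most significant bit).
    Input position i is stored at integer bit i of each stream.
    """
    streams = [0] * 8
    for i, byte in enumerate(data):
        for k in range(8):
            if byte & (0x80 >> k):
                streams[k] |= 1 << i
    return streams


def match_char(streams: list[int], ch: str, length: int) -> int:
    """Return a stream with a 1 at every position whose byte equals ch."""
    mask = (1 << length) - 1
    code = ord(ch)
    result = mask
    for k in range(8):
        if code & (0x80 >> k):
            result &= streams[k]            # bit k must be 1
        else:
            result &= (~streams[k]) & mask  # bit k must be 0
    return result


# The same lexical item calculation applied to two ASCII-based encodings.
latin1 = "<\u00e9>".encode("latin-1")  # 3 bytes
utf8 = "<\u00e9>".encode("utf-8")      # 4 bytes
lt_latin1 = match_char(basis_streams(latin1), "<", len(latin1))
gt_latin1 = match_char(basis_streams(latin1), ">", len(latin1))
lt_utf8 = match_char(basis_streams(utf8), "<", len(utf8))
gt_utf8 = match_char(basis_streams(utf8), ">", len(utf8))
```

Only the position of `>` differs between the two inputs, because the non-ASCII character occupies one byte in Latin-1 and two in UTF-8.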
     26A second observation is that, regardless of which character set is used, it is
     27often the case that all of the characters in a particular block of input
     28happen to be within the ASCII repertoire.   This is a very simple test to
     29perform using the bit stream representation, simply confirming that the
     30bit 0 stream is zero for the entire block.   For blocks satisfying this test,
     31all logic dealing with non-ASCII characters can simply be skipped.
     32Transcoding to UTF-16 becomes trivial: the high eight bit streams of the
     33UTF-16 form are each set to zero in this case.
     34
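The ASCII block test described above can be sketched as follows. This is a hedged Python illustration rather than the \icXML{} code: `bit0_stream` and `transcode_block_to_utf16` are hypothetical names, the block is assumed to contain only whole characters, and Python's built-in codecs stand in for the non-ASCII logic on the slow path.

```python
def bit0_stream(block: bytes) -> int:
    """Return the bit 0 stream (most significant bit of each byte) of a block."""
    stream = 0
    for i, byte in enumerate(block):
        if byte & 0x80:
            stream |= 1 << i
    return stream


def transcode_block_to_utf16(block: bytes, encoding: str) -> list[int]:
    """Return the UTF-16 code units for a block of input."""
    if bit0_stream(block) == 0:
        # ASCII-only block: each byte is already the low half of its
        # UTF-16 code unit and the high eight bits are all zero.
        return list(block)
    # Slow path (stand-in for the non-ASCII transcoding logic).
    raw = block.decode(encoding).encode("utf-16-le")
    return [raw[i] | (raw[i + 1] << 8) for i in range(0, len(raw), 2)]


ascii_units = transcode_block_to_utf16(b"<doc>", "utf-8")
mixed_units = transcode_block_to_utf16("a\u00e9".encode("utf-8"), "utf-8")
```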
     35A third observation is that repeated transcoding of the names of XML
     36elements, attributes and so on can be avoided by using a lookup mechanism.
     37That is, the first occurrence of each symbol is stored in a lookup
     38table mapping the input encoding to a numeric symbol ID.   Transcoding
     39of the symbol is applied at this time.  Subsequent lookup operations
     40can avoid transcoding by simply retrieving the stored representation.
     41As symbol lookup is required to apply various XML validation rules,
     42this achieves the effect of transcoding each occurrence without
     43additional cost.
     44
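A minimal sketch of such a lookup mechanism is given below, under stated assumptions: the class `SymbolTable` and its members are hypothetical, a Python dictionary stands in for whatever table structure \icXML{} actually uses, and a Python string stands in for the stored UTF-16 form. The counter exists only to show that transcoding happens once per distinct name.

```python
class SymbolTable:
    """Maps names in the input encoding to numeric symbol IDs."""

    def __init__(self, encoding: str):
        self.encoding = encoding
        self._ids = {}             # raw name bytes -> symbol ID
        self._names = []           # symbol ID -> transcoded name
        self.transcode_count = 0   # number of transcoding operations performed

    def lookup(self, raw_name: bytes) -> int:
        """Return the symbol ID, transcoding only on first occurrence."""
        symbol_id = self._ids.get(raw_name)
        if symbol_id is None:
            self.transcode_count += 1
            self._names.append(raw_name.decode(self.encoding))
            symbol_id = len(self._names) - 1
            self._ids[raw_name] = symbol_id
        return symbol_id

    def name(self, symbol_id: int) -> str:
        """Retrieve the stored transcoded form without transcoding again."""
        return self._names[symbol_id]


table = SymbolTable("utf-8")
ids = [table.lookup(n) for n in (b"item", b"price", b"item", b"item")]
```

Four lookups of two distinct names trigger only two transcoding operations; the repeated occurrences of `item` are resolved from the raw bytes alone.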
     45In short, the cost of transcoding individual characters is avoided whenever
     46they constitute lexical items, whenever a block of input is confined to the ASCII subset,
     47and for all but the first occurrence of any XML element or attribute name.
     48Furthermore, when transcoding is required, the parallel bit stream representation
     49generally supports efficient transcoding operations.   In the important
     50case of UTF-8 to UTF-16 transcoding, the corresponding UTF-16 bit streams
     51can be calculated in bit parallel fashion based on UTF-8 streams \cite{Cameron2008},
     52and all but the final bytes of multibyte sequences can be marked for deletion.
     53In other cases, transcoding within a block only need be applied for non-ASCII
     54bytes, which are conveniently identified by iterating through the bit 0 stream
     55using bit scan operations.
     56
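The last case, transcoding only the non-ASCII bytes of a block, can be sketched as follows. This is an assumption-laden illustration: the function names are hypothetical, ISO-8859-7 is chosen arbitrarily as a single-byte ASCII extension, the per-byte table is built from Python's `iso8859_7` codec, and the bit scan is emulated with integer arithmetic rather than a hardware instruction.

```python
def non_ascii_positions(stream: int):
    """Yield the positions of the 1 bits in a bit stream, lowest first."""
    while stream:
        lowest = stream & -stream        # isolate the lowest set bit
        yield lowest.bit_length() - 1    # its index (bit scan forward)
        stream ^= lowest                 # clear it and continue


def build_table(encoding: str) -> dict:
    """Map each defined non-ASCII byte to its UTF-16 code unit."""
    table = {}
    for b in range(0x80, 0x100):
        try:
            table[b] = ord(bytes([b]).decode(encoding))
        except UnicodeDecodeError:
            pass  # byte is unassigned in this character set
    return table


def transcode_block(block: bytes, table: dict) -> list[int]:
    """Transcode a single-byte-encoded block, touching only non-ASCII bytes."""
    units = list(block)  # ASCII bytes are already correct code units
    stream = 0
    for i, byte in enumerate(block):
        if byte & 0x80:
            stream |= 1 << i             # bit 0 stream of the block
    for pos in non_ascii_positions(stream):
        units[pos] = table[block[pos]]
    return units


greek = build_table("iso8859_7")
units = transcode_block(b"a\xe1b\xe2", greek)
```

The table lookup is applied only at the two positions flagged in the bit 0 stream; the ASCII bytes are carried over unchanged.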
     57
     58
     59
     60 