Timestamp:
Oct 16, 2012, 5:48:15 PM
Author:
nmedfort
Message:

More work; mostly edits

File:
1 edited

  • docs/Working/icXML/arch-errorhandling.tex

    r2455 r2470  
    2929One of the CSA's major responsibilities is transcoding an input text from some encoding format to near-output-ready UTF-16.
    3030During this process, white-space normalization rules are applied and multi-code-unit and surrogate characters are detected
    31 and validated. Bit streams marking the positions of the normalized new lines is a natural derivative of this process.
     31and validated.
     32A {\it line-feed bit stream}, which marks the positions of the normalized new line characters, is a natural derivative of
     33this process.
    3234Using an optimized population count algorithm, the line count can be summarized cheaply for each valid block of text.
    3335% The optimization delays the counting process ....
    3436Column position is more difficult to calculate.
    3537It is possible to scan backwards through the bit stream of new line characters to determine the distance (in code-units)
    36 between the position between which an error was detected and the last line feed. However, as some of these code-units
    37 are skipped over when tallying up the position, the CSA must generate a {\it skip mask} bit stream to represent those
    38 characters. This mask ORs together many relevant bit streams, such as all trailing multi-code-unit and surrogate
    39 characters, and any characters that were removed during the normalization process.
     38between the position at which an error was detected and the last line feed. However, this distance may exceed
     39the actual character position for the reasons discussed in (2).
     40To handle this, the CSA generates a {\it skip mask} bit stream by ORing together many relevant bit streams,
     41such as all trailing multi-code-unit and surrogate characters, and any characters that were removed during the
     42normalization process.
    4043When an error is detected, the sum of those skipped positions is subtracted from the distance to determine the actual
    41 column position.
     44column number.
    4245
    4346\begin{figure}[h]
     
    4750\caption{}
    4851\end{figure}
     52
     53The Markup Processor is a state-driven machine. As such, error detection within it is very similar to that of Xerces.
     54However, line/column tracking within it is a much more difficult problem. The Markup Processor parses the content stream,
     55which is a series of tagged UTF-16 strings. Each string is normalized in accordance with the XML specification. All symbol
     56data and unnecessary whitespace are eliminated from the stream.
     57This means it is impossible to directly assess the current location with only the content stream.
     58To calculate this, the Markup Processor borrows three additional pieces of information from the Parabix subsystem:
     59the line-feed bit stream, the skip mask, and a {\it deletion mask stream}, which is a bit stream that denotes every code-unit that
     60was suppressed from the raw data during the production of the content stream.
     61
     62
     63Armed with the cursor position in
     64the content stream,
     65
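
The CSA line/column calculation described in this changeset can be illustrated with a small sketch. It is not taken from icXML: it assumes each bit stream is modelled as a Python integer in which bit i corresponds to code-unit position i, that lines and columns are reported 1-based, and that the helper names (mask_of, popcount, line_and_column) are hypothetical. The real implementation presumably operates on SIMD blocks with an optimized population count rather than on arbitrary-precision integers.

```python
def mask_of(positions):
    """Build a bit stream (as an int) with a 1 bit at each given position."""
    bits = 0
    for p in positions:
        bits |= 1 << p
    return bits


def popcount(bits):
    """Count the 1 bits; stands in for an optimized population count."""
    return bin(bits).count("1")


def line_and_column(line_feeds, skip_mask, pos):
    """Return (line, column), both 1-based, for the code unit at `pos`.

    line_feeds: bit stream marking normalized line feeds (assumed layout).
    skip_mask:  bit stream marking code units that do not count as
                columns, e.g. trailing units of multi-code-unit characters.
    """
    below = (1 << pos) - 1                  # all positions before pos
    prior_feeds = line_feeds & below
    line = 1 + popcount(prior_feeds)

    # Scan backwards: index of the last line feed before pos, or -1.
    last_feed = prior_feeds.bit_length() - 1
    distance = pos - last_feed              # distance in code units

    # Positions strictly between the last line feed and pos.
    between = below & ~((1 << (last_feed + 1)) - 1)
    skipped = popcount(skip_mask & between)
    return line, distance - skipped


# "ab\ncXd": line feed at position 2, position 4 is a skipped code unit.
line_feeds = mask_of([2])
skip_mask = mask_of([4])
result = line_and_column(line_feeds, skip_mask, 5)
```

In this example the error position 5 lies three code units past the line feed, one of which is skipped, so the sketch reports line 2, column 2.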