Changeset 2455 for docs


Timestamp:
Oct 15, 2012, 5:20:02 PM
Author:
nmedfort
Message:

work on error handling section

Location:
docs/Working/icXML
Files:
2 edited

  • docs/Working/icXML/arch-errorhandling.tex

    r2449 r2455  
    44% Challenges / Line Col Tracker
    55
    6 Xerces outputs error messages in one of two ways: through the programmer API and as a thrown errors for fatal messages.
    7 ICXML emits errors in the similar manner---but how they determine the line/column number of the error, which is a necessary
    8 component of the error message, differs substantially.
     6Xerces outputs error messages in two ways: through the programmer API and as thrown objects for fatal errors.
     7As Xerces parses a file, it uses context-dependent logic to assess whether the next character is legal or not;
     8if not, the current state determines the type and severity of the error.
     9ICXML emits errors in a similar manner---but how it discovers them differs substantially.
    910Recall that in Figure \ref{fig:icxml-arch}, ICXML is divided into two sections: the Parabix subsystem and
    10 the markup processor.
    11 Within Parabix, all computations are performed in parallel at a block at a time. Errors are derived as artifacts of
    12 bit stream equations, with a 1-bit marking the position of an error in a block.
     11the markup processor. Each section has its own system for producing the error messages, geared towards the type
     12of processing handled by the module.
    1313
    14 
     14Within the Parabix subsystem, all computations are performed in parallel, a block at a time.
     15Errors are derived as artifacts of bit stream calculations, with a 1-bit marking the byte-position of an error within a block,
     16and the type of error is determined by the equation that discovered it.
     17The difficulty of error processing in this section is that in Xerces the line and column number must be given
     18with every error production. Two major issues exist because of this:
     19(1) line position adheres to XML white-space normalization rules; as such, some sequences of characters, e.g., a carriage return
     20followed by a line feed, are counted as a single new line character.
     21(2) column position is counted in characters, not bytes or code units;
     22thus multi-code-unit code-points and surrogate character pairs are all counted as a single column position.
     23Exacerbating these problems is the fact that typical XML documents are error-free, yet the calculation of the
     24line/column position is a constant overhead in Xerces that must be maintained in case an error occurs.
     25To reduce this overhead, ICXML pushes the bulk cost of the line/column calculation to the occurrence of the error and
     26performs the minimal amount of book-keeping necessary to facilitate the function.
     27ICXML leverages the byproducts of the Character Set Adapter (CSA) module and amalgamates the information
     28within the Line Column Tracker (LCT).
     29One of the CSA's major responsibilities is transcoding an input text from some encoding format to near-output-ready UTF-16.
     30During this process, white-space normalization rules are applied and multi-code-unit and surrogate characters are detected
     31and validated. Bit streams marking the positions of the normalized new lines are a natural derivative of this process.
     32Using an optimized population count algorithm, the line count can be summarized cheaply for each valid block of text.
     33% The optimization delays the counting process ....
     34Column position is more difficult to calculate.
     35It is possible to scan backwards through the bit stream of new line characters to determine the distance (in code-units)
     36between the position at which an error was detected and the last line feed. However, as some of these code-units
     37are skipped over when tallying up the position, the CSA must generate a {\it skip mask} bit stream to represent those
     38characters. This mask ORs together many relevant bit streams, such as all trailing multi-code-unit and surrogate
     39characters, and any characters that were removed during the normalization process.
     40When an error is detected, the sum of those skipped positions is subtracted from the distance to determine the actual
     41column position.
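
The line and column calculation described in the added text can be sketched as follows. This is an illustrative Python sketch, not ICXML's implementation: a block's bit streams are modelled as plain integers in which bit i corresponds to code-unit position i, and the names `popcount`, `line_and_column`, `newlines`, `skip_mask`, `error_pos`, and `lines_before` are hypothetical. It also assumes that the start of the error's line falls inside the same block, so no column carry between blocks is modelled.

```python
def popcount(x: int) -> int:
    """Number of 1-bits in a non-negative integer."""
    return bin(x).count("1")


def line_and_column(newlines: int, skip_mask: int, error_pos: int,
                    lines_before: int = 0) -> tuple[int, int]:
    """Return the 1-based (line, column) of the code unit at error_pos.

    newlines     -- bit i set if code unit i is a (normalized) new line
    skip_mask    -- bit i set if code unit i does not advance the column,
                    e.g. a trailing surrogate or a character removed
                    during normalization
    lines_before -- new lines already counted in earlier blocks
    """
    # Mask covering positions 0 .. error_pos - 1.
    below = (1 << error_pos) - 1

    # Line number: population count of the new lines before the error.
    prior_newlines = newlines & below
    line = lines_before + popcount(prior_newlines) + 1

    # First position after the last new line (0 if there is none).
    line_start = prior_newlines.bit_length()

    # Distance in code units, minus the positions marked by the skip mask.
    span = below & ~((1 << line_start) - 1)
    distance = error_pos - line_start
    column = distance - popcount(skip_mask & span) + 1
    return line, column


# UTF-16 code units: a b \n c d <high surrogate> <low surrogate> e <
# The error is reported at the final '<' (position 8).
newlines = 1 << 2    # the line feed at position 2
skip_mask = 1 << 6   # the trailing (low) surrogate at position 6
print(line_and_column(newlines, skip_mask, error_pos=8))
```

In the usage example the surrogate pair occupies two code units but only one column, so the error is reported at line 2, column 5. The skip mask is only consulted once an error has been found, which matches the stated goal of deferring the bulk of the cost to the occurrence of the error.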
    1542
    1643\begin{figure}[h]
  • docs/Working/icXML/icxml-main.tex

    r2453 r2455  
    125125\input{arch-overview.tex}
    126126
     127\input{arch-charactersetadapters.tex}
     128
    127129\input{parfilter.tex}
    128130