source: docs/Working/icXML/arch-errorhandling.tex @ 2470

Last change on this file since 2470 was 2470, checked in by nmedfort, 7 years ago

More work; mostly edits

\subsection{Error Handling}
\label{section:arch:errorhandling}

% Challenges / Line Col Tracker

Xerces outputs error messages in two ways: through the programmer API and as thrown objects for fatal errors.
As Xerces parses a file, it uses context-dependent logic to assess whether the next character is legal;
if not, the current state determines the type and severity of the error.
ICXML emits errors in a similar manner, but how it discovers them differs substantially.
Recall from Figure~\ref{fig:icxml-arch} that ICXML is divided into two sections: the Parabix subsystem and
the markup processor. Each section has its own system for producing error messages, geared towards the type
of processing handled by that section.

Within the Parabix subsystem, all computations are performed in parallel, a block at a time.
Errors are derived as artifacts of bit stream calculations: a 1-bit marks the byte position of an error within a block,
and the type of error is determined by the equation that discovered it.
The difficulty of error processing in this section is that Xerces requires the line and column number to be given
with every error it reports. This raises two major issues:
(1) line position adheres to XML white-space normalization rules; as such, some sequences of characters, e.g., a carriage return
followed by a line feed, are counted as a single new-line character.
(2) column position is counted in characters, not bytes or code units;
thus each multi-code-unit code point or surrogate pair is counted as a single column position.
Exacerbating these problems is the fact that typical XML documents are error-free, yet in Xerces the calculation of the
line/column position is a constant overhead that must be paid in case an error does occur.
To reduce this overhead, ICXML defers the bulk of the line/column calculation until an error occurs and
performs only the minimal book-keeping necessary to make that calculation possible.
ICXML leverages the byproducts of the Character Set Adapter (CSA) module and amalgamates the information
within the Line Column Tracker (LCT).
One of the CSA's major responsibilities is transcoding the input text from its source encoding to near-output-ready UTF-16.
During this process, white-space normalization rules are applied and multi-code-unit and surrogate characters are detected
and validated.
A {\it line-feed bit stream}, which marks the positions of the normalized new-line characters, is a natural derivative of
this process.
Using an optimized population count algorithm, the line count can be summarized cheaply for each valid block of text.
% The optimization delays the counting process ....
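
As a rough illustration of the population-count idea described above, the following C++ sketch counts line feeds over a sequence of blocks and derives the line number of an error from that count.
It is not ICXML's actual code: the 64-bit block width, the bit ordering (bit $i$ of a word corresponds to code unit $i$ of the block), and the names \texttt{BitBlock}, \texttt{pop\_count}, \texttt{count\_line\_feeds}, and \texttt{line\_of\_error} are all assumptions made for the example.
A portable \texttt{std::bitset} count stands in for whatever optimized population count the implementation uses.

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical representation: one 64-bit word per block, where bit i of a
// word is set when code unit i of that block is a normalized line feed.
using BitBlock = std::uint64_t;

// Number of set bits in one block (portable population count).
inline std::size_t pop_count(BitBlock b) {
    return std::bitset<64>(b).count();
}

// Total number of line feeds in a run of fully validated blocks.
std::size_t count_line_feeds(const std::vector<BitBlock>& line_feed_stream) {
    std::size_t lines = 0;
    for (BitBlock block : line_feed_stream) {
        lines += pop_count(block);
    }
    return lines;
}

// 1-based line number of an error at bit `error_pos` (0..63) of the block
// that follows `preceding_blocks`: one plus every line feed strictly before it.
std::size_t line_of_error(const std::vector<BitBlock>& preceding_blocks,
                          BitBlock error_block, unsigned error_pos) {
    // Keep only the bits below error_pos; avoid an undefined 64-bit shift.
    const BitBlock below = (error_pos == 0)
        ? BitBlock{0}
        : (error_block & (~BitBlock{0} >> (64 - error_pos)));
    return 1 + count_line_feeds(preceding_blocks) + pop_count(below);
}
```

In this sketch the per-block cost is a single population count, and the masking step is only needed for the one block in which an error was actually found.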
Column position is more difficult to calculate.
It is possible to scan backwards through the line-feed bit stream to determine the distance (in code units)
between the position at which an error was detected and the last line feed. However, this distance may exceed
the actual character position for the reasons discussed in (2).
To handle this, the CSA generates a {\it skip mask} bit stream by ORing together the relevant bit streams,
such as those marking the trailing code units of multi-code-unit and surrogate characters, and any characters that were removed during the
normalization process.
When an error is detected, the number of skipped positions is subtracted from the distance to determine the actual
column number.
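
The column calculation just described can be sketched as follows.
This is a minimal, bit-at-a-time C++ illustration written for clarity rather than speed; a real implementation would presumably work a word or a block at a time.
The stream layout (bit \texttt{pos \% 64} of word \texttt{pos / 64} describes code unit \texttt{pos}) and the names \texttt{BitStream}, \texttt{test\_bit}, and \texttt{column\_of\_error} are assumptions of the example, not names taken from ICXML.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical layout: bit (pos % 64) of word (pos / 64) describes code unit pos.
using BitStream = std::vector<std::uint64_t>;

inline bool test_bit(const BitStream& s, std::size_t pos) {
    return ((s[pos / 64] >> (pos % 64)) & 1u) != 0;
}

// 1-based column of the code unit at `error_pos`, counted in characters.
std::size_t column_of_error(const BitStream& line_feeds,
                            const BitStream& skip_mask,
                            std::size_t error_pos) {
    assert(skip_mask.size() == line_feeds.size());
    assert(error_pos < line_feeds.size() * 64);

    std::size_t distance = 1;  // the error position itself occupies a column
    std::size_t skipped = 0;   // code units that do not start a new character

    // Walk backwards until the previous line feed or the start of the data.
    for (std::size_t pos = error_pos; pos-- > 0;) {
        if (test_bit(line_feeds, pos)) break;
        ++distance;
        if (test_bit(skip_mask, pos)) ++skipped;
    }
    return distance - skipped;
}
```

For instance, with a line feed at position 2 and a skipped trailing surrogate at position 5, an error at position 6 lies 4 code units past the line feed but is reported at column 3.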

\begin{figure}[h]
{\bf TODO: An example of a skip mask, error mask, and the raw data and transcoded data for it.
Should a multi-byte character be used and/or some CRLFs to show the difficulties?}
\caption{}
\label{fig:error_mask}
\end{figure}

The Markup Processor is a state-driven machine. As such, error detection within it is very similar to that of Xerces.
However, line/column tracking within it is a much more difficult problem. The Markup Processor parses the content stream,
which is a series of tagged UTF-16 strings. Each string is normalized in accordance with the XML specification. All symbol
data and unnecessary whitespace are eliminated from the stream.
This means it is impossible to determine the current location directly from the content stream alone.
To calculate it, the Markup Processor borrows three additional pieces of information from the Parabix subsystem:
the line-feed bit stream, the skip mask, and a {\it deletion mask stream}, which is a bit stream that denotes every code unit that
was suppressed from the raw data during the production of the content stream.
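
One way a deletion mask could be used to recover a raw-data position is sketched below in C++.
The sketch assumes, purely for illustration, that every code unit not marked in the deletion mask corresponds to exactly one code unit of the content stream, so that the cursor's raw position is the position of the matching unmarked bit; the text above does not spell out this correspondence.
The names \texttt{BitStream} and \texttt{raw\_position} and the 64-bit word layout are likewise hypothetical.

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical layout: bit (pos % 64) of word (pos / 64) is set when raw
// code unit pos was deleted while producing the content stream.
using BitStream = std::vector<std::uint64_t>;

// Returns the raw-data index of the `cursor`-th (0-based) surviving code
// unit, or 64 * deletion_mask.size() if fewer code units survive.
std::size_t raw_position(const BitStream& deletion_mask, std::size_t cursor) {
    for (std::size_t w = 0; w < deletion_mask.size(); ++w) {
        const std::uint64_t kept_bits = ~deletion_mask[w];
        const std::size_t kept = std::bitset<64>(kept_bits).count();
        if (cursor >= kept) {  // the target lies in a later word
            cursor -= kept;
            continue;
        }
        // The target is in this word: find the cursor-th kept bit.
        for (std::size_t bit = 0; bit < 64; ++bit) {
            if ((kept_bits >> bit) & 1u) {
                if (cursor == 0) return w * 64 + bit;
                --cursor;
            }
        }
    }
    return deletion_mask.size() * 64;
}
```

Under these assumptions, the raw position obtained this way could then be combined with the line-feed bit stream and the skip mask to derive a line and column number.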


Armed with the cursor position in
the content stream,
