source: docs/Working/icXML/arch-errorhandling.tex @ 2471

Last change on this file since 2471 was 2471, checked in by nmedfort, 7 years ago

\subsection{Error Handling}
\label{section:arch:errorhandling}

% XML errors are rare but they do happen, especially with untrustworthy data sources.
Xerces outputs error messages in two ways: through the programmer API and as thrown objects for fatal errors.
As Xerces parses a file, it uses context-dependent logic to assess whether the next character is legal or not;
if not, the current state determines the type and severity of the error.
ICXML emits errors in a similar manner, but how it discovers them is substantially different.

Recall from Figure~\ref{fig:icxml-arch} that ICXML is divided into two sections: the \PS{} and
the \MP{}. Each section has its own system for producing error messages, geared towards the type
of processing handled by that section.

Within the \PS{}, all computations are performed in parallel, a block at a time.
Errors are derived as artifacts of bit stream calculations: a 1-bit marks the byte position of an error within a block,
and the type of error is determined by the equation that discovered it.
The difficulty of error processing in this section is that Xerces requires the line and column number to be given
with every error it reports. This raises two major issues:
(1) line position adheres to XML white-space normalization rules; as such, some sequences of characters, e.g., a carriage return
followed by a line feed, are counted as a single new-line character.
(2) column position is counted in characters, not bytes or code units;
thus multi-code-unit code points and surrogate pairs are each counted as a single column position.
Exacerbating these problems, typical XML documents are error-free, yet in Xerces the line/column calculation
is a constant overhead that must be paid in case an error does occur.
To reduce this overhead, ICXML defers the bulk of the cost of the line/column calculation until an error occurs and
performs only the minimal book-keeping necessary to make that calculation possible.
ICXML leverages the byproducts of the Character Set Adapter (CSA) module and amalgamates the information
within the Line Column Tracker (LCT).
One of the CSA's major responsibilities is transcoding input text from its source encoding to near-output-ready UTF-16.
During this process, white-space normalization rules are applied and multi-code-unit and surrogate characters are detected
and validated.
A {\it line-feed bit stream}, which marks the positions of the normalized new-line characters, is a natural derivative of
this process.
Using an optimized population count algorithm, the line count can be summarized cheaply for each valid block of text.
% The optimization delays the counting process ....
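
As a rough illustration of the line count, the following Python sketch models a bit stream as an unbounded integer in which bit $i$ corresponds to position $i$ of the text.
The function names are hypothetical, and the sketch counts over the whole stream in one step rather than block by block; neither detail is taken from ICXML.

```python
def line_feed_stream(text):
    """Bit stream with bit i set when text[i] is a (normalized) line feed."""
    stream = 0
    for i, ch in enumerate(text):
        if ch == "\n":
            stream |= 1 << i
    return stream


def popcount(bits):
    """Number of 1-bits in a non-negative integer."""
    return bin(bits).count("1")


def line_number(line_feeds, error_pos):
    """1-based line of error_pos: one plus the line feeds strictly before it."""
    before = line_feeds & ((1 << error_pos) - 1)  # keep bits 0 .. error_pos-1
    return 1 + popcount(before)


# Hypothetical input: three lines, with an error reported at index 9 ('c').
line_feeds = line_feed_stream("<a>\n<b>\n<c/>")
print(line_number(line_feeds, 9))  # 3
```
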
Column position is more difficult to calculate.
It is possible to scan backwards through the line-feed bit stream to determine the distance (in code units)
between the position at which an error was detected and the last line feed. However, this distance may exceed
the actual character position for the reasons discussed in (2).
To handle this, the CSA generates a {\it skip mask} bit stream by ORing together the relevant bit streams,
such as those marking the trailing code units of multi-code-unit and surrogate characters, and any characters that were removed during the
normalization process.
When an error is detected, the number of skipped positions within that distance is subtracted from it to determine the actual
column number.
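
The column calculation can be sketched in Python as follows, again treating each bit stream as an unbounded integer whose bit $i$ corresponds to code unit $i$.
This is a simplified model under stated assumptions, not ICXML's implementation: all names are hypothetical, positions index the UTF-16 code units before any deletion, and the skip mask covers only low surrogates and the carriage return of a CRLF pair.

```python
import struct


def code_units(text):
    """UTF-16 code units of text, as a list of integers."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def popcount(bits):
    """Number of 1-bits in a non-negative integer."""
    return bin(bits).count("1")


def build_streams(units):
    """Return (line_feeds, skip_mask) for a list of UTF-16 code units."""
    line_feeds = low_surrogates = removed_cr = 0
    for i, u in enumerate(units):
        if u == 0x0A:
            line_feeds |= 1 << i
            if i > 0 and units[i - 1] == 0x0D:
                removed_cr |= 1 << (i - 1)  # CR of a CRLF pair is normalized away
        elif 0xDC00 <= u <= 0xDFFF:
            low_surrogates |= 1 << i  # trailing half of a surrogate pair
    # The skip mask is the OR of every stream marking positions that
    # must not count as a column.
    return line_feeds, low_surrogates | removed_cr


def column_number(line_feeds, skip_mask, error_pos):
    """1-based column of error_pos, counted in characters."""
    before = line_feeds & ((1 << error_pos) - 1)
    line_start = before.bit_length()  # index just after the last line feed
    span = (1 << error_pos) - (1 << line_start)  # bits line_start .. error_pos-1
    distance = error_pos - line_start
    return distance - popcount(skip_mask & span) + 1


# Hypothetical input: a CRLF, then U+1D11E (two code units), 'b' and '<'.
units = code_units("a\r\n\U0001D11Eb<")
line_feeds, skip_mask = build_streams(units)
print(column_number(line_feeds, skip_mask, 6))  # 3
```

In this example the error at code unit 6 lies three code units past the line feed, but one of them is a low surrogate, so the reported column is 3 rather than 4.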

\begin{figure}[h]
{\bf TODO: An example of a skip mask, error mask, and the raw data and transcoded data for it.
Should a multi-byte character be used and/or some CRLFs to show the difficulties?}
\caption{}
\label{fig:error_mask}
\end{figure}

The \MP{} is a state-driven machine. As such, error detection within it is very similar to that of Xerces.
However, reporting the correct line/column is a much more difficult problem.
The \MP{} parses the content stream, which is a series of tagged UTF-16 strings.
Each string is normalized in accordance with the XML specification.
All symbol data and unnecessary whitespace are eliminated from the stream.
This means it is impossible to directly assess the current location using only the cursor position within the content stream.
To calculate the location, the \MP{} borrows three additional pieces of information from the \PS{}:
the line-feed bit stream, the skip mask, and a {\it deletion mask stream}, which is a bit stream denoting the (code-unit) position of every
datum that was suppressed from the source during the production of the content stream.
Armed with these, it is possible to calculate the actual line/column using
the same system as the \PS{}, advancing through the source until the number of undeleted positions
(the sum of the negated deletion mask stream) is equal to the cursor position.
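
One way to picture this mapping is the Python sketch below, which walks the source until the count of undeleted positions reaches the cursor and then applies the line and column arithmetic at that source position.
It is only a model under stated assumptions: the function names are hypothetical, bit streams are unbounded integers with bit $i$ for code unit $i$, the content stream is assumed to hold only the character data, and a linear scan stands in for whatever block-wise method ICXML actually uses.

```python
def popcount(bits):
    """Number of 1-bits in a non-negative integer."""
    return bin(bits).count("1")


def source_position(deletion_mask, cursor, length):
    """Source index of the code unit found at `cursor` in the content stream."""
    kept = 0  # running sum of the negated deletion mask
    for pos in range(length):
        if (deletion_mask >> pos) & 1:
            continue  # this position was suppressed from the content stream
        if kept == cursor:
            return pos
        kept += 1
    raise ValueError("cursor lies beyond the content stream")


def locate(line_feeds, skip_mask, deletion_mask, cursor, length):
    """Return the 1-based (line, column) in the source for a content cursor."""
    pos = source_position(deletion_mask, cursor, length)
    before = line_feeds & ((1 << pos) - 1)
    line_start = before.bit_length()  # index just after the last line feed
    span = (1 << pos) - (1 << line_start)  # bits line_start .. pos-1
    line = 1 + popcount(before)
    column = (pos - line_start) - popcount(skip_mask & span) + 1
    return line, column


# Hypothetical input: only the text "xy" is assumed to survive in the content stream.
source = "<r>\n<b>xy</b>"
line_feeds = 1 << source.index("\n")
deletion_mask = sum(1 << i for i, ch in enumerate(source) if ch not in "xy")
print(locate(line_feeds, 0, deletion_mask, 1, len(source)))  # (2, 5)
```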