\subsection{Error Handling}
\label{section:arch:errorhandling}

% XML errors are rare but they do happen, especially with untrustworthy data sources.
Xerces outputs error messages in two ways: through the programmer API and as thrown objects for fatal errors.
As Xerces parses a file, it uses context-dependent logic to assess whether the next character is legal;
if not, the current state determines the type and severity of the error.
\icXML{} emits errors in a similar manner, but how it discovers them is substantially different.
Recall from Figure~\ref{fig:icxml-arch} that \icXML{} is divided into two sections, the \PS{} and the \MP{},
each with its own system for detecting and producing error messages.

Within the \PS{}, all computations are performed in parallel, a block at a time.
Errors are derived as artifacts of \bitstream{} calculations: a 1-bit marks the byte position of an error within a block,
and the type of error is determined by the equation that discovered it.
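
As a minimal sketch of this idea (not \icXML{}'s actual code), suppose a block covers 64 bytes and each error
equation yields one 64-bit word in which bit $i$ marks byte $i$ of the block.
The error kinds \texttt{IllegalChar} and \texttt{BadNameStart}, the function \texttt{first\_error}, and the 64-bit
block width are all hypothetical choices made for illustration.

```cpp
#include <cstdint>

// Hypothetical error classes; each is assumed to have its own error bitstream.
enum class ErrorKind { None, IllegalChar, BadNameStart };

struct BlockError {
    ErrorKind kind;
    unsigned offset;  // byte offset within the block; meaningful only if kind != None
};

// Index of the lowest set bit. Precondition: x != 0.
unsigned lowest_set_bit(std::uint64_t x) {
    unsigned n = 0;
    while ((x & 1u) == 0) {
        x >>= 1;
        ++n;
    }
    return n;
}

// Reports the earliest error in one 64-byte block, given one word per error equation.
BlockError first_error(std::uint64_t illegal_char, std::uint64_t bad_name_start) {
    const std::uint64_t any = illegal_char | bad_name_start;
    if (any == 0) return {ErrorKind::None, 0};  // error-free block: nothing more to do
    const unsigned pos = lowest_set_bit(any);
    const std::uint64_t bit = std::uint64_t{1} << pos;
    // The stream that holds the bit tells us which equation found the error.
    const ErrorKind kind =
        (illegal_char & bit) ? ErrorKind::IllegalChar : ErrorKind::BadNameStart;
    return {kind, pos};
}
```

In this sketch an error-free block costs a single OR and a comparison, and the kind of error is recovered by
checking which stream holds the bit rather than by consulting any parser state.
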
The difficulty of error processing in this section is that Xerces requires a line and column number
with every error it produces. This raises two major issues:
(1) line position adheres to XML white-space normalization rules; as such, some sequences of characters, e.g., a carriage return
followed by a line feed, are counted as a single new line character.
(2) column position is counted in characters, not bytes or code units;
thus multi-code-unit code points and surrogate character pairs are each counted as a single column position.
Typical XML documents are error-free, yet the calculation of the
line/column position is a constant overhead in Xerces. % that must be maintained in the case that one occurs.
To reduce this overhead, \icXML{} defers the bulk of the line/column calculation until an error actually occurs and
performs only the minimal amount of book-keeping necessary to facilitate it.
\icXML{} leverages the byproducts of the Character Set Adapter (CSA) module and amalgamates the information
within the Line Column Tracker (LCT).
One of the CSA's major responsibilities is transcoding an input text. % from some encoding format to near-output-ready UTF-16.
During this process, white-space normalization rules are applied and multi-code-unit and surrogate characters are detected
and validated.
A {\it line-feed \bitstream{}}, which marks the positions of the normalized new line characters, is a natural derivative of
this process.
Using an optimized population count algorithm, the line count can be summarized cheaply for each valid block of text.
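
The line count can be sketched as follows, assuming the line-feed \bitstream{} is stored as 64-bit words with one
bit per code unit. The function name \texttt{count\_lines} and the storage layout are assumptions, and
\texttt{std::bitset::count} merely stands in for the optimized population count mentioned above.

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <vector>

// Each word covers 64 consecutive code units; a set bit marks a normalized new line.
// Returns the number of new lines in all blocks seen so far.
std::size_t count_lines(const std::vector<std::uint64_t>& line_feed_blocks) {
    std::size_t lines = 0;
    for (std::uint64_t block : line_feed_blocks) {
        lines += std::bitset<64>(block).count();  // population count of one block
    }
    return lines;
}
```

The per-block work is independent of where the new lines fall, which is what keeps the book-keeping cheap when
no error occurs.
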
% The optimization delays the counting process ....
Column position is more difficult to calculate.
It is possible to scan backwards through the \bitstream{} of new line characters to determine the distance (in code units)
between the position at which an error was detected and the last line feed. However, this distance may exceed
the actual character position for the reasons discussed in (2).
To handle this, the CSA generates a {\it skip mask} \bitstream{} by ORing together many relevant \bitstream{}s,
such as all trailing multi-code-unit and surrogate characters, and any characters that were removed during the
normalization process.
When an error is detected, the sum of those skipped positions is subtracted from the distance to determine the actual
column number.
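
A sketch of this column calculation is given below, under the assumption that both \bitstream{}s are stored as
64-bit words with bit $i$ of word $w$ covering code-unit position $64w + i$, and that columns are reported
starting from 1. The names \texttt{Bitstream}, \texttt{test\_bit}, and \texttt{column\_of} are hypothetical,
and the bit-at-a-time backward scan is chosen for clarity rather than speed.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using Bitstream = std::vector<std::uint64_t>;

// True if bit `pos` of the stream is set (bit i of word w covers position 64*w + i).
bool test_bit(const Bitstream& s, std::size_t pos) {
    return ((s[pos / 64] >> (pos % 64)) & 1u) != 0;
}

// 1-based column of the code unit at `error_pos`.
std::size_t column_of(const Bitstream& line_feed,
                      const Bitstream& skip_mask,
                      std::size_t error_pos) {
    std::size_t distance = 0;  // code units between the last line feed and the error
    std::size_t skipped = 0;   // how many of those are flagged in the skip mask
    std::size_t pos = error_pos;
    // Scan backwards until the previous position is a line feed or the input starts.
    while (pos > 0 && !test_bit(line_feed, pos - 1)) {
        --pos;
        ++distance;
        if (test_bit(skip_mask, pos)) ++skipped;
    }
    return distance - skipped + 1;
}
```

For instance, with a line feed at position 1, a trailing UTF-8 byte flagged at position 3, and an error at
position 5, the distance is 3, one position is skipped, and the reported column is 3.
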
%

The \MP{} is a state-driven machine. As such, error detection within it is very similar to that of Xerces.
However, reporting the correct line/column is a much more difficult problem.
The \MP{} parses the content stream, which is a series of tagged UTF-16 strings.
Each string is normalized in accordance with the XML specification.
All symbol data and unnecessary whitespace are eliminated from the stream;
thus it is impossible to derive the current location using only the content stream.
To calculate the location, the \MP{} borrows three additional pieces of information from the \PS{}:
the line-feed \bitstream{}, the skip mask, and a {\it deletion mask stream}, which is a \bitstream{} denoting the (code-unit) position of every
datum that was suppressed from the source during the production of the content stream.
Armed with these, the \MP{} can calculate the actual line/column using
the same system as the \PS{}, advancing through the source until the sum of the negated deletion mask stream is equal to the current position in the content stream.
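
The mapping from a content-stream position back to a source position might look like the sketch below.
It assumes, as a simplification, that every retained source code unit contributes exactly one unit to the
content stream, that the deletion mask is stored as 64-bit words, and that any padding bits past the end of the
source are set. The function name \texttt{source\_position} is hypothetical.

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the source position of the content unit with 0-based index `content_pos`,
// or 64 * deletion_mask.size() if the content stream is shorter than that.
std::size_t source_position(const std::vector<std::uint64_t>& deletion_mask,
                            std::size_t content_pos) {
    std::size_t kept = 0;  // running sum of the negated deletion mask
    for (std::size_t w = 0; w < deletion_mask.size(); ++w) {
        const std::uint64_t retained = ~deletion_mask[w];
        const std::size_t in_word = std::bitset<64>(retained).count();
        if (kept + in_word <= content_pos) {
            kept += in_word;  // the target lies beyond this word; skip it whole
            continue;
        }
        // The target is inside this word; locate the exact bit.
        for (unsigned b = 0; b < 64; ++b) {
            if ((retained >> b) & 1u) {
                if (kept == content_pos) return w * 64 + b;
                ++kept;
            }
        }
    }
    return deletion_mask.size() * 64;
}
```

The resulting source position can then be fed to the same line-feed and skip mask calculation that the \PS{} uses.
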