\section{The Parabix XML Parser}
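\begin{figure}
% Figure graphic not recoverable from this extract.
\caption{Parabix XML Parser Structure}
\label{parabix_arch}
\end{figure}
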
This section describes the implementation of the Parabix XML parser
for well-formedness checking. Figure~\ref{parabix_arch} shows its overall structure.
The input file is processed using 11 functions organized into 7 modules.
In the first module, {\tt Read\_Data}, the input file is loaded into the
data\_buffer. The data is then transposed into eight parallel basis
bit streams (basis\_bits) in the {\tt Transposition} module.
The basis\_bits are used by the {\tt U8\_Validation} module to validate
UTF-8 characters, and by the {\tt Classification} and {\tt Gen\_Scope} modules
to generate all the XML lexical item streams (lex) and scope streams (scope).
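
As a minimal sketch of the transposition and classification steps just described, the following C++ fragment transposes a short input into eight basis bit streams and derives a one-character lexical stream for the less-than character (0x3C). The function names, the 64-byte block size, and the convention that stream 0 holds the most significant bit are assumptions made for illustration; this is not the Parabix library code.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not a Parabix API): transpose up to 64 input bytes
// into eight basis bit streams. Assumed convention: stream k holds bit k of
// every byte, with k = 0 the most significant bit, and bit i of each stream
// corresponds to input position i.
std::array<uint64_t, 8> transpose(const std::string& data) {
    std::array<uint64_t, 8> basis{};
    for (std::size_t i = 0; i < data.size() && i < 64; ++i) {
        unsigned char byte = static_cast<unsigned char>(data[i]);
        for (int k = 0; k < 8; ++k) {
            uint64_t bit = (byte >> (7 - k)) & 1u;
            basis[k] |= bit << i;
        }
    }
    return basis;
}

// Hypothetical character class for the less-than character, 0x3C = 0011 1100.
// A position belongs to the class when every basis bit matches that pattern.
// Positions past the end of the input have all basis bits clear, so they
// never match.
uint64_t classify_lt(const std::array<uint64_t, 8>& b) {
    return ~b[0] & ~b[1] & b[2] & b[3] & b[4] & b[5] & ~b[6] & ~b[7];
}
```

Each bit of the result marks one input position, so a single sequence of bitwise operations classifies 64 characters at once in this sketch.
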
Scope streams are a simplified subset of lex streams in which the legal yet
insignificant cursors have been removed. Both the lex and scope streams
are supplied to the parsing module, which consists of three functions:
(1) {\tt Parse\_CtCDPI}, (2) {\tt Parse\_Ref}, and (3) {\tt Parse\_tag}.
These functions parse, respectively,
(1) comments, CDATA sections, and processing instructions;
(2) references; and
(3) start tags, end tags, and empty tags, as well as any related attributes.
Afterward, the information is gathered by the {\tt Name\_Validation} and
{\tt Err\_Check} functions, producing name check streams and error streams.
Name check streams are weak error streams that verify that each character used in a
name is valid according to the XML 1.0 specification.
These are then passed to the final {\tt Postprocessing} module.
Any errors that cannot be conveniently detected in bit space are
checked here. The final output reports any
well-formedness error and its position within the input file.
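
To illustrate how an error stream can be turned into reported positions, the following C++ sketch scans one 64-bit block of an error stream and lists the flagged input positions. The function name, the block size, and the convention that bit $i$ of a block corresponds to input position block\_base $+\ i$ are assumptions made for this example, not details taken from the Parabix postprocessing code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not a Parabix API): return the input positions
// flagged in one 64-bit block of an error stream. Assumed convention:
// bit i of error_bits corresponds to input position block_base + i.
std::vector<std::size_t> error_positions(uint64_t error_bits,
                                         std::size_t block_base) {
    std::vector<std::size_t> positions;
    while (error_bits != 0) {
        // Find the offset of the lowest set bit by shifting a copy.
        std::size_t offset = 0;
        uint64_t probe = error_bits;
        while ((probe & 1u) == 0) {
            probe >>= 1;
            ++offset;
        }
        positions.push_back(block_base + offset);
        // Clear the lowest set bit and continue with the next one.
        error_bits &= error_bits - 1;
    }
    return positions;
}
```

In this sketch a block without errors costs only a single comparison, since the loop runs once per set bit rather than once per input position.
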

Using this structure, all of the functions in the four shaded modules
consist entirely of parallel bit stream operations. Of these, the
{\tt Classification} function consists of XML character class definitions
that are generated using our character class compiler \textit{ccc}, while much of {\tt U8\_Validation}
similarly consists of UTF-8 byte class definitions that are also
generated by \textit{ccc}. The remainder of these functions are programmed
using our unbounded bit stream language following the logical
requirements of XML parsing. All the functions in the four shaded
modules are then compiled to low-level C/C++ code using our Pablo
compiler. This code is then linked with the general Transposition
code available in the Parabix runtime library, as well as the
hand-written postprocessing code that completes the well-formed