\section{The Parabix XML Parser}
\label{section:parser}

\begin{figure}[h]
\begin{center}
\includegraphics[width=1\textwidth]{plots/parabix_arch.pdf}
\end{center}
\caption{Parabix XML Parser Structure}
\label{parabix_arch}
\end{figure}

This section describes the implementation of the Parabix XML parser.
Figure~\ref{parabix_arch} shows its overall structure, configured for
well-formedness checking.
The input file is processed using 11 functions organized into 7 modules.
In the first module, {\tt Read\_Data}, the input file is loaded into the
data\_buffer. The data is then transposed into eight parallel basis
bit streams (basis\_bits) in the {\tt Transposition} module.
The basis\_bits are used in the {\tt U8\_Validation} module to validate
UTF-8 characters, and by the {\tt Classification} and {\tt Gen\_Scope} functions
to generate all the XML lexical item streams (lex) and scope streams (scope).
Scope streams are a simplified subset of lex streams in which the legal yet
insignificant cursors have been removed. Both the lex and scope streams
are supplied to the parsing module, which consists of three functions:
(1) {\tt Parse\_CtCDPI}, (2) {\tt Parse\_Ref}, and (3) {\tt Parse\_tag}.
These functions parse
(1) comments, CDATA sections, and processing instructions;
(2) references; and
(3) start tags, end tags, and empty tags, together with any related attributes.
Afterward, the information is gathered by the {\tt Name\_Validation} and
{\tt Err\_Check} functions, producing name check streams and error streams.
Name check streams are weak error streams that verify that each character used in a
name is valid according to the XML 1.0 specification.
These streams are then passed to the final {\tt Postprocessing} module.
Any errors that cannot be conveniently detected in bit space are
checked here. The final output reports any
well-formedness error and its position within the input file.

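The transposition step described above can be illustrated with a minimal Python sketch.
It is not the Parabix run-time library code. The function name {\tt transpose}, the
convention that stream 0 holds the most significant bit of each byte, and the use of
Python integers as unbounded bit streams (stream position $i$ is stored at integer
bit $i$) are all assumptions made for illustration.

```python
from typing import List


def transpose(data: bytes) -> List[int]:
    """Split a byte sequence into eight parallel bit streams.

    Hypothetical helper: stream k collects bit k of every input byte,
    where k = 0 is the most significant bit. Each stream is a Python
    integer in which integer bit i corresponds to input position i.
    """
    basis_bits = [0] * 8
    for pos, byte in enumerate(data):
        for k in range(8):
            # Extract bit k of this byte, counting from the most significant bit.
            if (byte >> (7 - k)) & 1:
                basis_bits[k] |= 1 << pos
    return basis_bits


if __name__ == "__main__":
    # Print each basis bit stream for a tiny input, one character per position.
    sample = b"<a>"
    for k, stream in enumerate(transpose(sample)):
        bits = "".join("1" if (stream >> i) & 1 else "0" for i in range(len(sample)))
        print(f"bit {k}: {bits}")
```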
Using this structure, all of the functions in the four shaded modules
consist entirely of parallel bit stream operations. Of these, the
{\tt Classification} function consists of XML character class definitions
that are generated using our character class compiler \textit{ccc}, while much of
{\tt U8\_Validation} similarly consists of UTF-8 byte class definitions that are also
generated by \textit{ccc}.  The remainder of these functions are programmed
using our unbounded bit stream language following the logical
requirements of XML parsing.  All the functions in the four shaded
modules are then compiled to low-level C/C++ code using our Pablo
compiler.  This code is then linked with the general Transposition
code available in the Parabix run-time library, as well as the
hand-written postprocessing code that completes the well-formedness
checking.
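
To make the idea of a character class definition over basis bit streams concrete, the
following Python sketch computes three class streams using only bitwise logic.
It is written by hand for illustration and is not output of \textit{ccc} or of the Pablo
compiler. The names {\tt transpose}, {\tt char\_classes} and {\tt positions}, the bit
ordering (stream 0 is the most significant bit), and the choice of classes are all
assumptions.

```python
from typing import Dict, List


def transpose(data: bytes) -> List[int]:
    """Hypothetical helper: stream k holds bit k (0 = most significant) of each byte."""
    streams = [0] * 8
    for pos, byte in enumerate(data):
        for k in range(8):
            if (byte >> (7 - k)) & 1:
                streams[k] |= 1 << pos
    return streams


def char_classes(data: bytes) -> Dict[str, int]:
    """Compute example character class streams with bitwise operations only."""
    mask = (1 << len(data)) - 1          # limits negation to the input length
    b = transpose(data)
    nb = [~x & mask for x in b]          # negated basis bit streams

    hi_0011 = nb[0] & nb[1] & b[2] & b[3]            # high nibble 0011 (0x3_)
    langle = hi_0011 & b[4] & b[5] & nb[6] & nb[7]   # '<' = 0x3C = 0011 1100
    rangle = hi_0011 & b[4] & b[5] & b[6] & nb[7]    # '>' = 0x3E = 0011 1110
    digit = hi_0011 & (nb[4] | (nb[5] & nb[6]))      # '0'..'9' = 0x30..0x39
    return {"langle": langle, "rangle": rangle, "digit": digit}


def positions(stream: int) -> List[int]:
    """List the input positions marked by 1 bits in a stream."""
    return [i for i in range(stream.bit_length()) if (stream >> i) & 1]


if __name__ == "__main__":
    classes = char_classes(b"<a1>")
    for name, stream in classes.items():
        print(name, positions(stream))
```

Each class stream is computed for every input position at once, and the shared
subexpression {\tt hi\_0011} is reused across the three definitions.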