% Source: docs/Working/icXML/background-xerces.tex, revision 2490
% (checked in by cameron; log message: ``Section 1 and 2 clean-ups.'')
\subsection{Xerces C++ Structure}
\label{background:xerces}

The Xerces C++ parser
% is a widely-used standards-conformant
% XML parser produced as open-source software
% by the Apache Software Foundation.
% It
features comprehensive support for a variety of character encodings
both commonplace (e.g., UTF-8, UTF-16) and rarely used (e.g., EBCDIC), support for
multiple XML vocabularies through the XML namespace
mechanism, as well as complete implementations
of structure and data validation through multiple grammars
declared using either legacy DTDs (document type
definitions) or modern XML schema facilities.
Xerces also supports several APIs for accessing
parser services, including event-based parsing
using either pull parsing or SAX/SAX2 push-style
parsing as well as a DOM tree-based parsing interface.

% What is the story behind the xerces-profile picture? should it contain one single file or several from our test suite?
% Our test suite does not have any grammars in it; ergo, processing those files will give a poor indication of the cost of using grammars
% Should we show a Valgrind summary of a few files in a linechart form?

Xerces, like all traditional parsers, processes XML documents sequentially, one byte at a time, from the
first to the last byte of input data. Each byte passes through several processing layers and is
classified and eventually validated within the context of the document state.
This introduces implicit dependencies between the various tasks within the application that make it
difficult to optimize for performance.
Because Xerces is a complex software system, no single feature dominates its overall parsing performance.
Figure~\ref{fig:xerces-profile} shows the
execution time profile of the top ten functions
in a typical run.
Even if it were possible, parallelizing any single one of these
functions in isolation would
produce only a small improvement in performance,
in accord with Amdahl's Law.  Systematic acceleration
of the Xerces parser should therefore be expected
to require a comprehensive restructuring
involving all aspects of the parser.
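
As a rough illustration of this bound, suppose that a single function
accounts for a fraction $p$ of total execution time and is accelerated
by a factor $s$.  Amdahl's Law then gives the overall speedup as
% Illustrative only: p and s are symbols introduced for this example,
% and the value p = 0.15 below is hypothetical, not a measurement.
\[
  S(s) = \frac{1}{(1 - p) + p/s},
  \qquad
  \lim_{s \to \infty} S(s) = \frac{1}{1 - p}.
\]
Taking $p = 0.15$ purely as a hypothetical value (it is not read from
Figure~\ref{fig:xerces-profile}), even an unbounded speedup of that one
function would yield an overall speedup of at most
$1/0.85 \approx 1.18$.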


% Figure \ref{fig:xerces-arch} shows the
% overall architecture of the Xerces C++ parser.
% In analyzing the structure of Xerces, it was found that
% there were a number of individual byte-at-a-time
% processing tasks.
%
% \begin{enumerate}
% \item Transcoding of source data to UTF-16
% \item Character validation.
% \item Line break normalization.
% \item Character classification.
% \item Line-column calculation.
% \item Escape insertion and replacement.
% \item Surrogate handling.
% \item Name processing.
% \item Markup parsing.
% \item Attribute validation.
% %\item Attribute checking.
% %\item xmlns attribute processing.
% \item Namespace processing.
% \item Grammars, content model and data type validation.
% \end{enumerate}
