source: docs/Working/icXML/background-xerces.tex @ 2471

Last change on this file since 2471 was 2429, checked in by nmedfort, 7 years ago

\subsection{Xerces C++ Structure}

The Xerces C++ parser is a widely used, standards-conformant
XML parser produced as open-source software
by the Apache Software Foundation.  It features
comprehensive support for a variety of character encodings,
both commonplace (e.g., UTF-8, UTF-16) and rarely used (e.g., EBCDIC), support for
multiple XML vocabularies through the XML namespace
mechanism, as well as complete implementations
of structure and data validation through multiple grammars
declared using either legacy DTDs (document type
definitions) or modern XML Schema facilities.
Xerces also supports several APIs for accessing
parser services, including event-based parsing
using either pull parsing or SAX push-style
parsing as well as a DOM tree-based parsing interface.

% What is the story behind the xerces-profile picture? should it contain one single file or several from our test suite?
% Our test suite does not have any grammars in it; ergo, processing those files will give a poor indication of the cost of using grammars
% Should we show a val-grind summary of a few files in a linechart form?

Xerces, like all traditional parsers, processes XML documents sequentially, one byte at a time, from the
first to the last byte of input data. Each byte passes through several processing layers and is
classified and eventually validated within the context of the document state.
This introduces implicit dependencies between the various tasks within the application that make it
difficult to optimize for performance.
Because Xerces is a complex software system, no single feature dominates the overall parsing performance.
Figure \ref{fig:xerces-profile} shows the
execution time profile of the top ten functions
in a typical run.
Even if it were possible, tackling any single one of these
functions for parallelization in isolation would
produce only a small improvement in performance,
in accordance with Amdahl's Law.  To obtain
systematic acceleration of the Xerces parser,
it should be expected that a comprehensive restructuring
involving all aspects of the parser is required.
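
As a rough illustration of the Amdahl's Law bound invoked here, let $p$ be the
fraction of total execution time spent in a single function and $s$ the speedup
achieved for that function alone. The value $p = 0.10$ used below is an assumed
figure chosen purely for illustration; it is not a measurement taken from
Figure \ref{fig:xerces-profile}. The overall speedup is
\[
  S(s) = \frac{1}{(1 - p) + p/s},
  \qquad
  \lim_{s \to \infty} S(s) = \frac{1}{1 - p}.
\]
With $p = 0.10$ and $s = 4$, this gives $S = 1/(0.90 + 0.025) \approx 1.08$,
and even an unbounded speedup of that one function cannot push $S$ beyond
$1/0.90 \approx 1.11$.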

% Figure \ref{fig:xerces-arch} shows the
% overall architecture of the Xerces C++ parser.
% In analyzing the structure of Xerces, it was found that
% there were a number of individual byte-at-a-time
% processing tasks.
%
% \begin{enumerate}
% \item Transcoding of source data to UTF-16
% \item Character validation.
% \item Line break normalization.
% \item Character classification.
% \item Line-column calculation.
% \item Escape insertion and replacement.
% \item Surrogate handling.
% \item Name processing.
% \item Markup parsing.
% \item Attribute validation.
% %\item Attribute checking.
% %\item xmlns attribute processing.
% \item Namespace processing.
% \item Grammars, content model and data type validation.
% \end{enumerate}