% Source: docs/Working/icXML/background-xerces.tex, revision 2522 (last changed by nmedfort)

\subsection{Xerces C++ Structure}
\label{background:xerces}

The Xerces C++ parser
% is a widely-used standards-conformant
% XML parser produced as open-source software
% by the Apache Software Foundation.
% It
features comprehensive support for a variety of character encodings,
both commonplace (e.g., UTF-8, UTF-16) and rarely used (e.g., EBCDIC), support for
multiple XML vocabularies through the XML namespace
mechanism, as well as complete implementations
of structure and data validation through multiple grammars
declared using either legacy DTDs (document type
definitions) or modern XML Schema facilities.
Xerces also supports several APIs for accessing
parser services, including event-based parsing
using either pull parsing or SAX/SAX2 push-style
parsing, as well as a DOM tree-based parsing interface.

% What is the story behind the xerces-profile picture? should it contain one single file or several from our test suite?
% Our test suite does not have any grammars in it; ergo, processing those files will give a poor indication of the cost of using grammars
% Should we show a val-grind summary of a few files in a linechart form?

Xerces, like all traditional parsers, processes XML documents sequentially, a byte at a time, from the
first to the last byte of input data. Each byte passes through several processing layers and is
classified and eventually validated within the context of the document state.
This introduces implicit dependencies between the various tasks within the application that make it
difficult to optimize for performance.
Because Xerces is a complex software system, no one feature dominates the overall parsing performance.
Figure~\ref{fig:xerces-profile} shows the execution time profile of the top ten functions in a typical run.
Even if it were possible to parallelize any one of these functions in isolation,
Amdahl's Law dictates that doing so would produce only a minute improvement in performance.
Unfortunately, early investigation into these functions found that they were already performing well in their given tasks
and that only trivial enhancements were possible.
Obtaining a systematic acceleration of Xerces
should therefore be expected to require a comprehensive restructuring
involving all aspects of the parser.

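To make the Amdahl's Law argument concrete, suppose (as an illustrative assumption, with the symbols $p$ and $s$ introduced here only for this example) that a single function accounting for a fraction $p$ of the execution time is accelerated by a factor $s$ while the rest of the parser is left unchanged. The overall speedup is then
\[
S(p, s) = \frac{1}{(1 - p) + p/s} .
\]
Taking the largest entry in Figure~\ref{fig:xerces-profile}, \verb|XMLUTF8Transcoder::transcodeFrom|, gives $p = 0.1329$, so that even in the limit $s \to \infty$ the speedup is bounded by
\[
\lim_{s \to \infty} S(0.1329, s) = \frac{1}{1 - 0.1329} \approx 1.15 ,
\]
that is, roughly a 15\% improvement at best. For the tenth entry ($p = 0.0320$) the same bound falls to about $1.03$.
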
% In order to obtain systematic acceleration of the Xerces parser,
% it should be expected that a comprehensive restructuring
% is required, involving all aspects of the parser.

\begin{figure}[h]
\begin{tabular}{r|l}
Time (\%) & Function Name \\
\hline
13.29   &       XMLUTF8Transcoder::transcodeFrom \\
7.45    &       IGXMLScanner::scanCharData \\
6.83    &       memcpy \\
5.83    &       XMLReader::getNCName \\
4.67    &       IGXMLScanner::buildAttList \\
4.54    &       RefHashTableOf\verb|<>|::findBucketElem \\
4.20    &       IGXMLScanner::scanStartTagNS \\
3.75    &       ElemStack::mapPrefixToURI \\
3.58    &       ReaderMgr::getNextChar \\
3.20    &       IGXMLScanner::basicAttrValueScan \\
\end{tabular}
\caption{Execution Time of Top 10 Xerces Functions}
\label{fig:xerces-profile}
\end{figure}

% Figure \ref{fig:xerces-arch} shows the
% overall architecture of the Xerces C++ parser.
% In analyzing the structure of Xerces, it was found that
% there were a number of individual byte-at-a-time
% processing tasks.
%
% \begin{enumerate}
% \item Transcoding of source data to UTF-16
% \item Character validation.
% \item Line break normalization.
% \item Character classification.
% \item Line-column calculation.
% \item Escape insertion and replacement.
% \item Surrogate handling.
% \item Name processing.
% \item Markup parsing.
% \item Attribute validation.
% %\item Attribute checking.
% %\item xmlns attribute processing.
% \item Namespace processing.
% \item Grammars, content model and data type validation.
% \end{enumerate}