# Changeset 1393 for docs/HPCA2012/02-background.tex

Ignore:
Timestamp:
Aug 30, 2011, 10:47:59 AM (8 years ago)
Message:

Minor bug fixes up to 04

File:
1 edited

### Legend:

Unmodified
 r1374 XML can represent virtually any type of information (i.e., content) in a descriptive fashion. XML markup encodes a description of an XML document's storage layout and logical structure. Because XML is intended to be human-readable, XML markup tags are often verbose by design \cite{TR:XML}. Since XML is intended to be human-readable, markup tags are often verbose by design \cite{TR:XML}. For example, Figure \ref{fig:sample_xml} provides a standard product list encapsulated within an XML document. All content is highlighted in bold. Anything that is not content is considered markup. \begin{figure}[h] {\scriptsize {\footnotesize \begin{center} \begin{tabbing} \subsection{XML Parsers} % However, textual data tends to consist of variable-length items in generally unpredictable patterns \cite{Cameron2010}. Traditional XML parsers process an XML document sequentially, a single byte-at-a-time, from the first to the last character in the source text. Each character is examined to distinguish between the XML-specific markup, such as an opening angle bracket \verb<', and the content held within the document. The character that the parser is currently processing is commonly referred to its {\it cursor position}. As the parser moves its cursor through the source document, it alternates between markup scanning / validation and content processing operations. In other words, traditional XML parsers operate as finite-state machines that use byte comparisons to transition between data and metadata states. Each state transition indicates the context in which to interpret the subsequent characters. Unfortunately, textual data tends to consist of variable-length items sequenced in generally unpredictable patterns; thus any character could be a state transition until deemed otherwise. A major disadvantage of the sequential byte-at-a-time approach to XML parsing is that processing an XML document requires at least one conditional branch per byte of source text. For example, Xerces-C, which is the most popular open-source C++ based XML parser and the foundation of the Apache XML project \cite{xerces}, uses a series of nested switch statements and state-dependent flag tests to control the parsing logic of the program. Our analysis, which we detail in Section \ref{section:XML-branches}, found that Xerces-C requires between 6 - 13 branches per byte of XML to support this form of control structure (depending on the ratio of markup vs. content). Cache utilization is also significantly reduced due to the manner in which markup and content must be scanned and buffered for future use. For instance, Xerces-C incurs $\sim$100 L1 cache misses per 1000 bytes of XML. % Branch mispredictions are known % to signficantly degrade XML parsing performance in proportion to the markup density of the source document % \cite{CameronHerdyLin2008} (i.e., the ratio of markup vs. content). Traditional XML parsers process an XML document sequentially, a single byte-at-a-time, from the first to the last character in the source text.  Each character is examined to distinguish between the XML-specific markup, such as an left angle bracket \verb<', and the content held within the document.  The character that the parser is currently interpreting is commonly referred to its {\it cursor}. As the parser moves its cursor through the document, it alternates between markup scanning, validation, and content processing operations.  In other words, traditional XML parsers operate as finite-state machines that use byte comparisons to transition between data and metadata states. Each state transition indicates the context for subsequent characters. Unfortunately, textual data tends to consist of variable-length items sequenced in generally unpredictable patterns; thus any character could be a state transition until deemed otherwise. A major disadvantage of the sequential byte-at-a-time approach to XML parsing is that processing an XML document requires at least one conditional branch per byte of source text.  For example, Xerces-C, which forms the foundation for widely deployed the Apache XML project \cite{xerces}, uses a series of nested switch statements and state-dependent flag tests to control the parsing logic of the program.  Our analysis, which we detail in Section \ref{section:XML-branches}, found that Xerces requires between 6 - 13 branches per byte of XML to support this form of control flow, depending on the fraction of markup in the overall document.  Cache utilization is also significantly reduced due to the manner in which markup and content must be scanned and buffered for future use.  For instance, Xerces incurs $\sim$100 L1 cache misses per 1000 bytes of XML.  In general, while microarchitectural improvements may help the parser tide over some of these challenges (e.g., cache misses), the fundamental data and control flow in the parsers are ill suited for commodity processors and experience significant overhead.