Ignore:
Timestamp:
Aug 30, 2011, 10:47:59 AM (8 years ago)
Author:
ashriram
Message:

Minor bug fixes up to 04

File:
1 edited

Legend:

Unmodified
Added
Removed
  • docs/HPCA2012/02-background.tex

    r1374 r1393  
    4343XML can represent virtually any type of information (i.e., content) in a descriptive fashion.
    4444XML markup encodes a description of an XML document's storage layout and logical structure.
    45 Because XML is intended to be human-readable, XML markup tags are often verbose by design \cite{TR:XML}.
     45Since XML is intended to be human-readable, markup tags are often verbose by design \cite{TR:XML}.
    4646For example, Figure \ref{fig:sample_xml} provides a standard product list encapsulated within an XML document.
    4747All content is highlighted in bold. Anything that is not content is considered markup.
     
    4949\begin{figure}[h]
    5050
    51 {\scriptsize
     51{\footnotesize
    5252\begin{center}
    5353\begin{tabbing}
     
    7171\subsection{XML Parsers}
    7272% However, textual data tends to consist of variable-length items in generally unpredictable patterns \cite{Cameron2010}.
    73 Traditional XML parsers process an XML document sequentially, a single byte-at-a-time,
    74 from the first to the last character in the source text.
    75 Each character is examined to distinguish between the XML-specific markup,
    76 such as an opening angle bracket `\verb`<`', and the content held within the document.
    77 The character that the parser is currently processing is commonly referred to its {\it cursor position}. As the
    78 parser moves its cursor through the source document, it alternates between markup scanning / validation and
    79 content processing operations.
    80 In other words,
    81 traditional XML parsers operate as finite-state machines that use byte comparisons to transition
    82 between data and metadata states. Each state transition indicates the context in which to interpret the
    83 subsequent characters. Unfortunately, textual data tends to consist of variable-length items sequenced in
    84 generally unpredictable patterns; thus any character could be a state transition until
    85 deemed otherwise.
    86 
    87 A major disadvantage of the sequential byte-at-a-time approach to XML parsing is that processing an XML document
    88 requires at least one conditional branch per byte of source text.
    89 For example, Xerces-C, which is the most popular open-source C++ based XML parser and the foundation of the
    90 Apache XML project \cite{xerces},
    91 uses a series of nested switch statements and state-dependent flag tests to control the parsing logic of the program.
    92 Our analysis, which we detail in Section \ref{section:XML-branches}, found that Xerces-C requires between 6 - 13 branches per byte
    93 of XML to support this form of control structure (depending on the ratio of markup vs. content).
    94 Cache utilization is also significantly reduced due to the manner in which markup and content must be scanned and buffered
    95 for future use.
    96 For instance, Xerces-C incurs $\sim$100 L1 cache misses per 1000 bytes of XML.
    97 % Branch mispredictions are known
    98 % to signficantly degrade XML parsing performance in proportion to the markup density of the source document
    99 % \cite{CameronHerdyLin2008} (i.e., the ratio of markup vs. content).
    100 
    101 
     73Traditional XML parsers process an XML document sequentially, a single
     74byte-at-a-time, from the first to the last character in the source
     75text.  Each character is examined to distinguish between the
     76XML-specific markup, such as an left angle bracket `\verb`<`', and the
     77content held within the document.  The character that the parser is
     78currently interpreting is commonly referred to its {\it cursor}. As
     79the parser moves its cursor through the document, it alternates
     80between markup scanning, validation, and content processing
     81operations.  In other words, traditional XML parsers operate as
     82finite-state machines that use byte comparisons to transition between
     83data and metadata states. Each state transition indicates the context
     84for subsequent characters. Unfortunately, textual data tends to
     85consist of variable-length items sequenced in generally unpredictable
     86patterns; thus any character could be a state transition until deemed
     87otherwise.
     88
     89A major disadvantage of the sequential byte-at-a-time approach to XML
     90parsing is that processing an XML document requires at least one
     91conditional branch per byte of source text.  For example, Xerces-C,
     92which forms the foundation for widely deployed the Apache XML project
     93\cite{xerces}, uses a series of nested switch statements and
     94state-dependent flag tests to control the parsing logic of the
     95program.  Our analysis, which we detail in Section
     96\ref{section:XML-branches}, found that Xerces requires between 6 - 13
     97branches per byte of XML to support this form of control flow,
     98depending on the fraction of markup in the overall document.  Cache
     99utilization is also significantly reduced due to the manner in which
     100markup and content must be scanned and buffered for future use.  For
     101instance, Xerces incurs $\sim$100 L1 cache misses per 1000 bytes of
     102XML.  In general, while microarchitectural improvements may help the
     103parser tide over some of these challenges (e.g., cache misses), the
     104fundamental data and control flow in the parsers are ill suited for
     105commodity processors and experience significant overhead.
    102106
    103107
Note: See TracChangeset for help on using the changeset viewer.