Oct 10, 2012, 3:59:17 PM (7 years ago)

some progress

1 edited


  • docs/Working/icXML/background-xerces.tex

    r2300 r2429  
    44XML parser produced as open-source software
    55by the Apache Software Foundation.  It features
    6 comprehensive support for XML character encodings
    7 both commonplace and rarely used, support for
     6comprehensive support for a variety of character encodings
     7both commonplace (e.g., UTF-8, UTF-16) and rarely used (e.g., EBCDIC), support for
    88multiple XML vocabularies through the XML namespace
    99mechanism, as well as complete implementations
    10 of structure and data validation through grammars
     10of structure and data validation through multiple grammars
    1111declared using either legacy DTDs (document type
    1212definitions) or modern XML schema facilities.
    1313Xerces also supports several APIs for accessing
    1414parser services, including event-based parsing
    15 using either pull parsing or SAX-style push
    16 parsing as well as tree-based parsing with
    17 a DOM-based interface.
     15using either pull parsing or SAX push-style
     16parsing as well as a DOM tree-based parsing interface.
    19 As a complex software system, there is no single Xerces
    20 feature that dominates in overall parsing performance.
     18% What is the story behind the xerces-profile picture? should it contain one single file or several from our test suite?
     19% Our test suite does not have any grammars in it; ergo, processing those files will give a poor indication of the cost of using grammars
     20% Should we show a val-grind summary of a few files in a linechart form?
     22Xerces, like all traditional parsers, processes XML documents sequentially, a byte at a time, from the
     23first to the last byte of input data. Each byte passes through several processing layers and is
     24classified and eventually validated within the context of the document state.
     25This introduces implicit dependencies between the various tasks within the application that make it
     26difficult to optimize for performance.
     27As a complex software system, no one feature dominates the overall parsing performance.
    2128Figure \ref{fig:xerces-profile} shows the
    2229execution time profile of the top ten functions
    23 in a typical run.  Even if it were possible,
    24 tackling any single one of these
     30in a typical run.
     31Even if it were possible, tackling any single one of these
    2532functions for parallelization in isolation would
    2633only produce a small improvement in performance
    2734in accord with Amdahl's Law.  In order to obtain
    28 systematic acceleration of the Xerces parser, then,
     35systematic acceleration of the Xerces parser,
    2936it should be expected that a comprehensive restructuring
    3037is required, involving all aspects of the parser.
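The Amdahl's Law argument in the paragraph above can be made concrete with a short worked bound. This is only a sketch: the symbols $p$ (fraction of run time spent in the function being accelerated) and $s$ (speedup achieved on that function) are introduced here for illustration, and the value $p = 0.10$ is an assumed figure, not one read from the profile in Figure \ref{fig:xerces-profile}.

```latex
% Amdahl's Law: overall speedup S when a fraction p of the run time
% is accelerated by a factor s (p and s are illustrative symbols).
\begin{equation}
  S(p, s) = \frac{1}{(1 - p) + \dfrac{p}{s}},
  \qquad
  \lim_{s \to \infty} S(p, s) = \frac{1}{1 - p}.
\end{equation}
% Hypothetical example: p = 0.10 is an assumed value, not a measurement.
% Even with unbounded speedup of that one function, the whole parser
% gains only about 11 percent.
\begin{equation}
  S \le \frac{1}{1 - 0.10} \approx 1.11 .
\end{equation}
```

Under this assumption, the bound only becomes large as $p$ approaches 1, which is consistent with the text's conclusion that restructuring must cover most of the parser's work rather than any single function.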
    32 Figure \ref{fig:xerces-arch} shows the
    33 overall architecture of the Xerces C++ parser.
    34 In analyzing the structure of Xerces, it was found that
    35 there were a number of individual byte-at-a-time
    36 processing tasks.
    37 \begin{enumerate}
    38 \item Transcoding to UTF-16
    39 \item Character validation.
    40 \item Line break normalization.
    41 \item Character classification.
    42 \item Line-column calculation.
    43 \item Escape insertion and replacement.
    44 \item Surrogate handling.
    45 \item Name processing.
    46 \item Markup parsing.
    47 \item Attribute checking.
    48 \item xmlns attribute processing.
    49 \item namespacing processing.
    50 \item Grammars, content model and data type validation.
    51 \end{enumerate}
     40% Figure \ref{fig:xerces-arch} shows the
     41% overall architecture of the Xerces C++ parser.
     42% In analyzing the structure of Xerces, it was found that
     43% there were a number of individual byte-at-a-time
     44% processing tasks.
     46% \begin{enumerate}
     47% \item Transcoding of source data to UTF-16
     48% \item Character validation.
     49% \item Line break normalization.
     50% \item Character classification.
     51% \item Line-column calculation.
     52% \item Escape insertion and replacement.
     53% \item Surrogate handling.
     54% \item Name processing.
     55% \item Markup parsing.
     56% \item Attribute validation.
     57% %\item Attribute checking.
     58% %\item xmlns attribute processing.
     59% \item Namespace processing.
     60% \item Grammars, content model and data type validation.
     61% \end{enumerate}