Changeset 888 for docs


Ignore:
Timestamp:
Feb 4, 2011, 4:39:15 PM (9 years ago)
Author:
cameron
Message:

Tighten section 3

Location:
docs/EuroPar2011
Files:
3 edited

Legend:

Unmodified
Added
Removed
  • docs/EuroPar2011/europar-cameron.aux

    r885 r888  
    2727\@writefile{lof}{\contentsline {figure}{\numberline {3}{\ignorespaces Parallel Scan Using Addition and Mask}}{5}}
    2828\newlabel{fig:scan2}{{3}{5}}
    29 \@writefile{toc}{\contentsline {section}{\numberline {3}Parsing and Error Streams}{6}}
     29\@writefile{toc}{\contentsline {section}{\numberline {3}XML Scanning and Parsing}{6}}
    3030\newlabel{sec:errorstream}{{3}{6}}
    31 \@writefile{lof}{\contentsline {figure}{\numberline {4}{\ignorespaces Grammar of Decimal Character References}}{6}}
    32 \newlabel{fig:decrefgrmr}{{4}{6}}
    33 \@writefile{toc}{\contentsline {subsection}{\numberline {3.1}Marker Stream Initialization}{6}}
    34 \@writefile{lof}{\contentsline {figure}{\numberline {5}{\ignorespaces Parsing Decimal References}}{7}}
    35 \newlabel{fig:decref}{{5}{7}}
    36 \@writefile{toc}{\contentsline {subsection}{\numberline {3.2}Parallel Parsing of Sequential Structures}{7}}
    37 \@writefile{lof}{\contentsline {figure}{\numberline {6}{\ignorespaces Grammar of XML Start Tags}}{7}}
    38 \newlabel{fig:stag-grmr}{{6}{7}}
    39 \@writefile{lof}{\contentsline {figure}{\numberline {7}{\ignorespaces Start Tag Parsing}}{8}}
    40 \newlabel{fig:stag-ex}{{7}{8}}
    41 \@writefile{toc}{\contentsline {subsection}{\numberline {3.3}Mask Formation with Bitstream Subtraction}{9}}
    42 \newlabel{sec:maskstream}{{3.3}{9}}
    43 \@writefile{lof}{\contentsline {figure}{\numberline {8}{\ignorespaces Mask Formation for Comments, CDATA and PI}}{9}}
    44 \newlabel{fig:CtCDPImask}{{8}{9}}
     31\@writefile{lof}{\contentsline {figure}{\numberline {4}{\ignorespaces XML Grammar: Decimal Character References and Start Tags}}{6}}
     32\newlabel{fig:xmlgrmr}{{4}{6}}
     33\@writefile{lof}{\contentsline {figure}{\numberline {5}{\ignorespaces Parsing Decimal References}}{6}}
     34\newlabel{fig:decref}{{5}{6}}
    4535\citation{TR:XML}
    46 \@writefile{toc}{\contentsline {subsection}{\numberline {3.4}Python Prototyping}{10}}
    47 \@writefile{toc}{\contentsline {section}{\numberline {4}XML Well-Formedness}{10}}
     36\@writefile{lof}{\contentsline {figure}{\numberline {6}{\ignorespaces Start Tag Parsing}}{8}}
     37\newlabel{fig:stag-ex}{{6}{8}}
     38\@writefile{toc}{\contentsline {section}{\numberline {4}XML Well-Formedness}{9}}
    4839\citation{DaiNiZhu2010}
    49 \@writefile{toc}{\contentsline {subsection}{\numberline {4.1}Error and Error-Check Bitstreams}{11}}
    50 \@writefile{toc}{\contentsline {subsection}{\numberline {4.2}Tag Matching}{12}}
    51 \@writefile{toc}{\contentsline {subsection}{\numberline {4.3}Document Structure}{12}}
    52 \@writefile{toc}{\contentsline {subsection}{\numberline {4.4}Summary}{12}}
    53 \@writefile{toc}{\contentsline {section}{\numberline {5}Compilation to Block-Based Processing}{12}}
    54 \newlabel{sec:compile}{{5}{12}}
     40\@writefile{toc}{\contentsline {section}{\numberline {5}Compilation to Block-Based Processing}{10}}
     41\newlabel{sec:compile}{{5}{10}}
    5542\citation{GML04}
    56 \@writefile{toc}{\contentsline {section}{\numberline {6}Performance Results}{13}}
    57 \@writefile{lot}{\contentsline {table}{\numberline {1}{\ignorespaces XML Document Characteristics}}{14}}
    58 \newlabel{XMLDocChars}{{1}{14}}
    59 \@writefile{toc}{\contentsline {section}{\numberline {7}Conclusion}{14}}
     43\@writefile{toc}{\contentsline {section}{\numberline {6}Performance Results}{11}}
     44\@writefile{lot}{\contentsline {table}{\numberline {1}{\ignorespaces XML Document Characteristics}}{12}}
     45\newlabel{XMLDocChars}{{1}{12}}
     46\@writefile{toc}{\contentsline {section}{\numberline {7}Conclusion}{12}}
    6047\bibstyle{plain}
    6148\bibdata{xmlperf}
     
    6350\bibcite{TR:XML}{2}
    6451\bibcite{Cameron2009}{3}
     52\@writefile{lot}{\contentsline {table}{\numberline {2}{\ignorespaces Parser Performance (Cycles Per Byte)}}{13}}
     53\newlabel{parsers-cpb}{{2}{13}}
    6554\bibcite{PPoPP08}{4}
    66 \@writefile{lot}{\contentsline {table}{\numberline {2}{\ignorespaces Parser Performance (Cycles Per Byte)}}{15}}
    67 \newlabel{parsers-cpb}{{2}{15}}
    6855\bibcite{CameronHerdyLin2008}{5}
    6956\bibcite{CameronLin2009}{6}
  • docs/EuroPar2011/europar-cameron.tex

    r885 r888  
    352352
    353353
    354 \section{Parsing and Error Streams}
     354\section{XML Scanning and Parsing}
    355355\label{sec:errorstream}
    356356
    357 Now consider how the parallel scanning primitive can
    358 be applied to the following problem: parse all occurrences
    359 of XML decimal character references according to the
    360 grammar of Figure \ref{fig:decrefgrmr} and identify any
    361 errors.  This grammar is simplified to omit other
    362 forms of XML reference including hexadecimal character
    363 references as well as general entity references.
    364 For the time being, we assume that our XML documents contain
    365 only the decimal references and no other use of the ``\verb:&:''
    366 character.
     357We now consider how the parallel scanning primitive can
     358be applied to the following problems in scanning and
     359parsing of XML structures:  (1) parallel scanning of XML decimal character references,
     360and (2) parallel parsing of XML start tags.
     361The grammar of these structures is shown in Figure \ref{fig:xmlgrmr}.
    367362
    368363\begin{figure}[tbh]
     
    370365\begin{tabular}{rcl}
    371366DecRef & ::=   &        '\verb:&#:' Digit+ '\verb:;:'  \\
    372 Digit  & ::=   &         \verb:[0-9]:
     367Digit  & ::=   &         \verb:[0-9]:\\
     368STag         &  ::=   &        '\verb:<:' Name (WS  Attribute)* WS? '\verb:>:'  \\
     369Attribute & ::=   &        Name WS? '=' WS? AttValue \\
     370AttValue  &           ::=   &        `\verb:":' \verb:[^<"]*: `\verb:":' $|$ ``\verb:':'' \verb:[^<']*: ``\verb:':'' \\
     371%DQuoted & ::= & \verb:[^<"]*:  \\
     372%SQuoted & ::= & \verb:[^<']*:
    373373\end{tabular}
    374374\end{center}
    375 \caption{Grammar of Decimal Character References}
    376 \label{fig:decrefgrmr}
     375\caption{XML Grammar: Decimal Character References and Start Tags}
     376\label{fig:xmlgrmr}
     377\end{figure}
     378
     379\begin{figure}[b]
     380\begin{center}
     381\begin{tabular}{l@{}lr}\\
     382\multicolumn{2}{l}{source data $\vartriangleright$}     
     383                                         & \verb`-&#978;-&9;--&#;--&#13!-`\\
     384$M_0$ &                                  & \verb`.1......1....1....1.....`\\
     385$M_1$ & $ = n(M_0)$                      & \verb`..1......1....1....1....`\\
     386$E_0$ & $ = M_1 \wedge \neg $\verb:[#]:  & \verb`.........1..............`\\
     387$M_2$ & $ = n(M_1 \wedge \neg  E_0)$     & \verb`...1...........1....1...`\\
     388$E_1$ & $ = M_2 \wedge \neg  D$          & \verb`...............1........`\\
     389$M_3$ & $ = s(M_2 \wedge \neg  E_1, D)$  & \verb`......1...............1.`\\
     390$E_2$ & $ = M_3 \wedge \neg  $\verb:[;]: & \verb`......................1.`\\
     391$M_4$ & $ = M_3 \wedge \neg  E_2$        & \verb`......1.................`\\
     392$E $  & $= E_0 \, | \, E_1 \, | \, E_2$  & \verb`.........1.....1......1.`
     393\end{tabular}
     394\end{center}
     395\caption{Parsing Decimal References}
     396\label{fig:decref}
    377397\end{figure}
    378398
     
    406426indicating the positions of all detected errors.
    407427
    408 \begin{figure}[tbh]
    409 \begin{center}
    410 \begin{tabular}{l@{}lr}\\
    411 \multicolumn{2}{l}{source data $\vartriangleright$}     
    412                                          & \verb`-&#978;-&9;--&#;--&#13!-`\\
    413 $M_0$ &                                  & \verb`.1......1....1....1.....`\\
    414 $M_1$ & $ = n(M_0)$                      & \verb`..1......1....1....1....`\\
    415 $E_0$ & $ = M_1 \wedge \neg $\verb:[#]:  & \verb`.........1..............`\\
    416 $M_2$ & $ = n(M_1 \wedge \neg  E_0)$     & \verb`...1...........1....1...`\\
    417 $E_1$ & $ = M_2 \wedge \neg  D$          & \verb`...............1........`\\
    418 $M_3$ & $ = s(M_2 \wedge \neg  E_1, D)$  & \verb`......1...............1.`\\
    419 $E_2$ & $ = M_3 \wedge \neg  $\verb:[;]: & \verb`......................1.`\\
    420 $M_4$ & $ = M_3 \wedge \neg  E_2$        & \verb`......1.................`\\
    421 $E $  & $= E_0 \, | \, E_1 \, | \, E_2$  & \verb`.........1.....1......1.`
    422 \end{tabular}
    423 \end{center}
    424 \caption{Parsing Decimal References}
    425 \label{fig:decref}
    426 \end{figure}
    427 
    428 
    429 \subsection{Marker Stream Initialization}
    430 
    431 How are marker bitstreams initialized?   In general,
     428
     429One question that may arise is: how are marker bitstreams initialized?   In general,
    432430this is an important problem, and dependent on the task at hand.   
    433431In the XML parsing context,
     
    439437end or empty element tag, and every remaining \verb:&:
    440438must be the initial character of a general entity
    441 or character reference. These assumptions permit easy creation of marker bitstreams for XML tags and XML entities.
    442 
    443 \subsection{Parallel Parsing of Sequential Structures}
    444 
    445 The critical and most interesting application of our parallel
    446 parsing techniques to XML is in parsing of the start, end and empty element
    447 tags that comprise the core of XML markup.   
    448 In particular, start tags have an iterative internal structure
    449 as shown in the grammar of Figure \ref{fig:stag-grmr}.  After an
    450 opening angle bracket and tag name, a tag may have multiple
    451 attribute-value pairs with values enclosed in single or double
    452 quotes.  Using the bitstream addition technique, our method
     439or character reference. These assumptions permit easy creation of
     440marker bitstreams for XML tags and XML references.
     441
     442The parsing of XML start tags is a richer problem, involving
     443sequential structure of attribute-value pairs as shown in Figure \ref{fig:xmlgrmr}.
     444Using the bitstream addition technique, our method
    453445is to start with the opening angle bracket of all tags as
    454446the initial marker bitstream for parsing the tags in parallel,
     
    465457$W$ for whitespace characters,
    466458and $Q$ for characters permitted within a double-quoted attribute value string. 
    467 
    468 \begin{figure}[tbh]
    469 \begin{center}
    470 \begin{tabular}{rcl}
    471 STag         &  ::=   &        '\verb:<:' Name (WS  Attribute)* WS? '\verb:>:'  \\
    472 Attribute & ::=   &        Name WS? '=' WS? AttValue \\
    473 AttValue  &           ::=   &        `\verb:":' \verb:[^<"]*: `\verb:":' $|$ ``\verb:':'' \verb:[^<']*: ``\verb:':'' \\
    474 %DQuoted & ::= & \verb:[^<"]*:  \\
    475 %SQuoted & ::= & \verb:[^<']*:
    476 \end{tabular}
    477 \end{center}
    478 \caption{Grammar of XML Start Tags}
    479 \label{fig:stag-grmr}
    480 \end{figure}
    481 
    482459
    483460\begin{figure*}[tbh]
     
    513490$M_{2,8} = s(M_{2,7}, W) \wedge \neg$\verb:[>]: & \verb`...............................................`
    514491\end{tabular}
    515 
    516 
    517 
    518492\end{center}
    519 
    520 
    521493\caption{Start Tag Parsing}
    522494\label{fig:stag-ex}
     
    524496
    525497
    526 The parsing process is then illustrated in the remaining rows of the
     498The parsing process is illustrated in the remaining rows of the
    527499figure.    Each successive row shows the set of parsing markers as they
    528500advance in parallel using bitwise logic and addition.
     
    593565%We can then use the scan function in a manner similar to how it was used in Figure \ref{fig:scan2} to scan through the entire name to %identify its end position.
    594566
    595 
    596 \subsection{Mask Formation with Bitstream Subtraction}
    597 %\subsection{Comment, CDATA and Processing Instructions}
    598 \label{sec:maskstream}
    599 
    600 For various purposes in parsing, it may be necessary to
    601 introduce {\em mask bitstreams}, streams that identify spans
    602 of positions that are to be selected or excluded from processing
    603 in some fashion.
    604 
    605 In the case of XML processing, one important use of mask
    606 bitstreams is to filter out those \verb:&: and \verb:<:
    607 characters that occur within comments,
    608 CDATA sections and processing instructions and hence
    609 do not indicate starting marker positions
    610 for references or tags, respectively. Each of these
    611 has a relatively simple structure comprising primarily
    612 specific opening and closing delimiters: \verb;<!--;
    613 and \verb:-->: for comments, \verb:<![CDATA[:
    614 and \verb:]]>: for CDATA sections and \verb:<?:
    615 and \verb:?>: for processing instructions.   Processing
    616 instructions also have a small amount of internal structure
    617 consisting of a name that identifies the target of the
    618 processing instruction followed by optional parameter text.
    619 
    620 The content of each of these items is relatively unconstrained
    621 and may contain what appears to be XML markup of other kinds.
    622 This makes it impossible to reliably parse all instances of these
    623 types of markup using parallel techniques. 
    624 However, we can
    625 still use bitstream addition for the sequential parsing of
    626 these items from the beginning of the file.  In this case,
    627 instead of initializing a marker bitstream using specific marker
    628 symbols found throughout the file, the marker bitstream is
    629 initialized with a single 1 bit at the file start position.
    630 
    631 Nevertheless, parsing of comments, CDATA sections and
    632 processing instructions generally proceeds quite quickly in a
    633 sequential fashion.   Parsing steps generally involve long
    634 scans to an opening delimiter of one of these constructs,
    635 followed by further long scans through the content of the
    636 comment, CDATA section or processing instruction.
    637 
    638 \begin{figure*}[tbh]
    639 \begin{center}
    640 \begin{tabular}{cr}\\
    641 source data $\vartriangleright$ & \verb`<!-- <<<< --> <?php 1<2 ?> <t> <![CDATA[ <demo/> ]]>.`\\
    642 $M_i$ & \verb`_1_____________1________________1____________________`\\
    643 $M_f$ & \verb`____________1____________1_________________________1_`\\
    644 $m = n(M_f) - m_i$ & \verb`_111111111111__11111111111______11111111111111111111_`\\
    645 $M_1 = n(\verb:[<]:) \wedge \neg m$ & \verb`____________________________1________________________`
    646 \end{tabular}
    647 \end{center}
    648 \caption{Mask Formation for Comments, CDATA and PI}
    649 \label{fig:CtCDPImask}
    650 \end{figure*}
    651 
    652 A single pass scan is made to identify these structures
    653 within the document.   Once complete, a mask bitstream
    654 of 1 bits is formed to identify the union of the interiors
    655 of these structures.   Figure \ref{fig:CtCDPImask}
    656 illustrates.   Here, we assume this pass is complete
    657 to produce the initial and final positions of these
    658 constructs as $M_i$ and $M_f$.
    659 Here, $M_i$ is actually the set of upshifted start positions
    660 for comments, CDATA sections and processing instructions.
    661 Note the upshifting is requires to properly match the
    662 delimiter $n(\verb`[<]`) \wedge \verb`[!?]`$.
    663 $M_f$ is the set of final positions determined by parsing.
    664 This then allows the mask bitstream $m$ to be computed by
    665 bitstream subtraction.   (To perform the subtraction,
    666 imaging reversing $M_i$ and $M_f$ to produce the
    667 internal little-endian representation.)   Note that this
    668 stream does not allow us to mask out the opening \verb:<:
    669 of these constructs themselves.   But we can mask
    670 out $n(\verb`[<]`)$ to produce $M_1$ as shown.  However,
    671 this is adequate as an initial marker bitstream for tag
    672 parsing, as the formation of $M_1 = n(\verb`[<]`)$ is the
    673 first step shown in Figure \ref{fig:stag-ex}.
    674 
    675 \subsection{Python Prototyping}
    676 
    677 Taking advantage of Python's built-in support for
    678 unbounded integers to represent arbitrary-size bitstreams,
    679 we have implemented a complete parsing prototype for XML
    680 using only bitstream addition, subtraction and bitwise logic.
    681 The \verb:ParallelScanThru: operation is a straightforward
    682 implementation of the scanning primitive $s$ as shown
    683 previously.   The operation $n$ is simply implemented using
    684 an upshift operation, while subtraction and bitwise logic
    685 are directly supported.
    686 
    687 We have also used a modified version of this prototype as the
    688 input language of an experimental bitstream compiler that
    689 we have developed.   This compiler implements the compilation to
    690 block-by-block processing presented in Section 5, following.
     567Following the pattern shown here, the remaining syntactic
     568features of XML markup can similarly be parsed with
     569bitstream based methods.   One complexity is that the
     570parsing of comments,
     571CDATA sections and processing instructions must be
     572performed first to determine those regions of text
     573within which ordinary XML markups are not parsed (i.e.,
     574within each of these types of construct.   This is handled
     575by first performance the parsing of these structures and
     576then forming a {\em mask bitstream}, that is, a stream that
     577identifies spans of text to be excluded from parsing
     578(comment and CDATA interiors, parameter text to processing instructions).
    691579
    692580
     
    711599
    712600
    713 \subsection{Error and Error-Check Bitstreams}
     601%\subsection{Error and Error-Check Bitstreams}
    714602
    715603Most of the requirements of XML well-formedness checking
     
    776664checking the well-formedness requirements.
    777665
    778 \subsection{Tag Matching}
     666%\subsection{Tag Matching}
    779667
    780668Except for empty element tags, XML tags come in pairs with
     
    789677and match tags using an iterative stack-based approach.
    790678
    791 \subsection{Document Structure}
     679%\subsection{Document Structure}
    792680
    793681An XML document consists of a single root element with recursively
     
    802690that is quite low for sufficiently sized files.
    803691
    804 \subsection{Summary}
     692%\subsection{Summary}
    805693
    806694Overall, parallel bitstream techniques are quite well-suited to
Note: See TracChangeset for help on using the changeset viewer.