Feb 4, 2011, 4:39:15 PM (8 years ago)

Tighten section 3

1 edited


  • docs/EuroPar2011/europar-cameron.tex

    r885 r888  
    354 \section{Parsing and Error Streams}
     354\section{XML Scanning and Parsing}
    357 Now consider how the parallel scanning primitive can
    358 be applied to the following problem: parse all occurrences
    359 of XML decimal character references according to the
    360 grammar of Figure \ref{fig:decrefgrmr} and identify any
    361 errors.  This grammar is simplified to omit other
    362 forms of XML reference including hexadecimal character
    363 references as well as general entity references.
    364 For the time being, we assume that our XML documents contain
    365 only the decimal references and no other use of the ``\verb:&:''
    366 character.
     357We now consider how the parallel scanning primitive can
     358be applied to the following problems in scanning and
     359parsing of XML structures:  (1) parallel scanning of XML decimal character references,
     360and (2) parallel parsing of XML start tags.
     361The grammar of these structures is shown in Figure \ref{fig:xmlgrmr}.
    371366DecRef & ::=   &        '\verb:&#:' Digit+ '\verb:;:'  \\
    372 Digit  & ::=   &         \verb:[0-9]:
     367Digit  & ::=   &         \verb:[0-9]:\\
     368STag         &  ::=   &        '\verb:<:' Name (WS  Attribute)* WS? '\verb:>:'  \\
     369Attribute & ::=   &        Name WS? '=' WS? AttValue \\
     370AttValue  &           ::=   &        `\verb:":' \verb:[^<"]*: `\verb:":' $|$ ``\verb:':'' \verb:[^<']*: ``\verb:':'' \\
     371%DQuoted & ::= & \verb:[^<"]*:  \\
     372%SQuoted & ::= & \verb:[^<']*:
    375 \caption{Grammar of Decimal Character References}
    376 \label{fig:decrefgrmr}
     375\caption{XML Grammar: Decimal Character References and Start Tags}
     382\multicolumn{2}{l}{source data $\vartriangleright$}     
     383                                         & \verb`-&#978;-&9;--&#;--&#13!-`\\
     384$M_0$ &                                  & \verb`.1......1....1....1.....`\\
     385$M_1$ & $ = n(M_0)$                      & \verb`..1......1....1....1....`\\
     386$E_0$ & $ = M_1 \wedge \neg $\verb:[#]:  & \verb`.........1..............`\\
     387$M_2$ & $ = n(M_1 \wedge \neg  E_0)$     & \verb`...1...........1....1...`\\
     388$E_1$ & $ = M_2 \wedge \neg  D$          & \verb`...............1........`\\
     389$M_3$ & $ = s(M_2 \wedge \neg  E_1, D)$  & \verb`......1...............1.`\\
     390$E_2$ & $ = M_3 \wedge \neg  $\verb:[;]: & \verb`......................1.`\\
     391$M_4$ & $ = M_3 \wedge \neg  E_2$        & \verb`......1.................`\\
     392$E $  & $= E_0 \, | \, E_1 \, | \, E_2$  & \verb`.........1.....1......1.`
     395\caption{Parsing Decimal References}
    406426indicating the positions of all detected errors.
    408 \begin{figure}[tbh]
    409 \begin{center}
    410 \begin{tabular}{l@{}lr}\\
    411 \multicolumn{2}{l}{source data $\vartriangleright$}     
    412                                          & \verb`-&#978;-&9;--&#;--&#13!-`\\
    413 $M_0$ &                                  & \verb`.1......1....1....1.....`\\
    414 $M_1$ & $ = n(M_0)$                      & \verb`..1......1....1....1....`\\
    415 $E_0$ & $ = M_1 \wedge \neg $\verb:[#]:  & \verb`.........1..............`\\
    416 $M_2$ & $ = n(M_1 \wedge \neg  E_0)$     & \verb`...1...........1....1...`\\
    417 $E_1$ & $ = M_2 \wedge \neg  D$          & \verb`...............1........`\\
    418 $M_3$ & $ = s(M_2 \wedge \neg  E_1, D)$  & \verb`......1...............1.`\\
    419 $E_2$ & $ = M_3 \wedge \neg  $\verb:[;]: & \verb`......................1.`\\
    420 $M_4$ & $ = M_3 \wedge \neg  E_2$        & \verb`......1.................`\\
    421 $E $  & $= E_0 \, | \, E_1 \, | \, E_2$  & \verb`.........1.....1......1.`
    422 \end{tabular}
    423 \end{center}
    424 \caption{Parsing Decimal References}
    425 \label{fig:decref}
    426 \end{figure}
    429 \subsection{Marker Stream Initialization}
    431 How are marker bitstreams initialized?   In general,
     429One question that may arise is: how are marker bitstreams initialized?   In general,
    432430this is an important problem, and dependent on the task at hand.   
    433431In the XML parsing context,
    439437end or empty element tag, and every remaining \verb:&:
    440438must be the initial character of a general entity
    441 or character reference. These assumptions permit easy creation of marker bitstreams for XML tags and XML entities.
    443 \subsection{Parallel Parsing of Sequential Structures}
    445 The critical and most interesting application of our parallel
    446 parsing techniques to XML is in parsing of the start, end and empty element
    447 tags that comprise the core of XML markup.   
    448 In particular, start tags have an iterative internal structure
    449 as shown in the grammar of Figure \ref{fig:stag-grmr}.  After an
    450 opening angle bracket and tag name, a tag may have multiple
    451 attribute-value pairs with values enclosed in single or double
    452 quotes.  Using the bitstream addition technique, our method
     439or character reference. These assumptions permit easy creation of
     440marker bitstreams for XML tags and XML references.
     442The parsing of XML start tags is a richer problem, involving
     443sequential structure of attribute-value pairs as shown in Figure \ref{fig:xmlgrmr}.
     444Using the bitstream addition technique, our method
    453445is to start with the opening angle bracket of all tags as
    454446the initial marker bitstream for parsing the tags in parallel,
    465457$W$ for whitespace characters,
    466458and $Q$ for characters permitted within a double-quoted attribute value string. 
    468 \begin{figure}[tbh]
    469 \begin{center}
    470 \begin{tabular}{rcl}
    471 STag         &  ::=   &        '\verb:<:' Name (WS  Attribute)* WS? '\verb:>:'  \\
    472 Attribute & ::=   &        Name WS? '=' WS? AttValue \\
    473 AttValue  &           ::=   &        `\verb:":' \verb:[^<"]*: `\verb:":' $|$ ``\verb:':'' \verb:[^<']*: ``\verb:':'' \\
    474 %DQuoted & ::= & \verb:[^<"]*:  \\
    475 %SQuoted & ::= & \verb:[^<']*:
    476 \end{tabular}
    477 \end{center}
    478 \caption{Grammar of XML Start Tags}
    479 \label{fig:stag-grmr}
    480 \end{figure}
    513490$M_{2,8} = s(M_{2,7}, W) \wedge \neg$\verb:[>]: & \verb`...............................................`
    521493\caption{Start Tag Parsing}
    526 The parsing process is then illustrated in the remaining rows of the
     498The parsing process is illustrated in the remaining rows of the
    527499figure.    Each successive row shows the set of parsing markers as they
    528500advance in parallel using bitwise logic and addition.
    593565%We can then use the scan function in a manner similar to how it was used in Figure \ref{fig:scan2} to scan through the entire name to %identify its end position.
    596 \subsection{Mask Formation with Bitstream Subtraction}
    597 %\subsection{Comment, CDATA and Processing Instructions}
    598 \label{sec:maskstream}
    600 For various purposes in parsing, it may be necessary to
    601 introduce {\em mask bitstreams}, streams that identify spans
    602 of positions that are to be selected or excluded from processing
    603 in some fashion.
    605 In the case of XML processing, one important use of mask
    606 bitstreams is to filter out those \verb:&: and \verb:<:
    607 characters that occur within comments,
    608 CDATA sections and processing instructions and hence
    609 do not indicate starting marker positions
    610 for references or tags, respectively. Each of these
    611 has a relatively simple structure comprising primarily
    612 specific opening and closing delimiters: \verb;<!--;
    613 and \verb:-->: for comments, \verb:<![CDATA[:
    614 and \verb:]]>: for CDATA sections and \verb:<?:
    615 and \verb:?>: for processing instructions.   Processing
    616 instructions also have a small amount of internal structure
    617 consisting of a name that identifies the target of the
    618 processing instruction followed by optional parameter text.
    620 The content of each of these items is relatively unconstrained
    621 and may contain what appears to be XML markup of other kinds.
    622 This makes it impossible to reliably parse all instances of these
    623 types of markup using parallel techniques. 
    624 However, we can
    625 still use bitstream addition for the sequential parsing of
    626 these items from the beginning of the file.  In this case,
    627 instead of initializing a marker bitstream using specific marker
    628 symbols found throughout the file, the marker bitstream is
    629 initialized with a single 1 bit at the file start position.
    631 Nevertheless, parsing of comments, CDATA sections and
    632 processing instructions generally proceeds quite quickly in a
    633 sequential fashion.   Parsing steps generally involve long
    634 scans to an opening delimiter of one of these constructs,
    635 followed by further long scans through the content of the
    636 comment, CDATA section or processing instruction.
    638 \begin{figure*}[tbh]
    639 \begin{center}
    640 \begin{tabular}{cr}\\
    641 source data $\vartriangleright$ & \verb`<!-- <<<< --> <?php 1<2 ?> <t> <![CDATA[ <demo/> ]]>.`\\
    642 $M_i$ & \verb`_1_____________1________________1____________________`\\
    643 $M_f$ & \verb`____________1____________1_________________________1_`\\
    644 $m = n(M_f) - m_i$ & \verb`_111111111111__11111111111______11111111111111111111_`\\
    645 $M_1 = n(\verb:[<]:) \wedge \neg m$ & \verb`____________________________1________________________`
    646 \end{tabular}
    647 \end{center}
    648 \caption{Mask Formation for Comments, CDATA and PI}
    649 \label{fig:CtCDPImask}
    650 \end{figure*}
    652 A single pass scan is made to identify these structures
    653 within the document.   Once complete, a mask bitstream
    654 of 1 bits is formed to identify the union of the interiors
    655 of these structures.   Figure \ref{fig:CtCDPImask}
    656 illustrates.   Here, we assume this pass is complete
    657 to produce the initial and final positions of these
    658 constructs as $M_i$ and $M_f$.
    659 Here, $M_i$ is actually the set of upshifted start positions
    660 for comments, CDATA sections and processing instructions.
    661 Note the upshifting is requires to properly match the
    662 delimiter $n(\verb`[<]`) \wedge \verb`[!?]`$.
    663 $M_f$ is the set of final positions determined by parsing.
    664 This then allows the mask bitstream $m$ to be computed by
    665 bitstream subtraction.   (To perform the subtraction,
    666 imaging reversing $M_i$ and $M_f$ to produce the
    667 internal little-endian representation.)   Note that this
    668 stream does not allow us to mask out the opening \verb:<:
    669 of these constructs themselves.   But we can mask
    670 out $n(\verb`[<]`)$ to produce $M_1$ as shown.  However,
    671 this is adequate as an initial marker bitstream for tag
    672 parsing, as the formation of $M_1 = n(\verb`[<]`)$ is the
    673 first step shown in Figure \ref{fig:stag-ex}.
    675 \subsection{Python Prototyping}
    677 Taking advantage of Python's built-in support for
    678 unbounded integers to represent arbitrary-size bitstreams,
    679 we have implemented a complete parsing prototype for XML
    680 using only bitstream addition, subtraction and bitwise logic.
    681 The \verb:ParallelScanThru: operation is a straightforward
    682 implementation of the scanning primitive $s$ as shown
    683 previously.   The operation $n$ is simply implemented using
    684 an upshift operation, while subtraction and bitwise logic
    685 are directly supported.
    687 We have also used a modified version of this prototype as the
    688 input language of an experimental bitstream compiler that
    689 we have developed.   This compiler implements the compilation to
    690 block-by-block processing presented in Section 5, following.
     567Following the pattern shown here, the remaining syntactic
     568features of XML markup can similarly be parsed with
     569bitstream based methods.   One complexity is that the
     570parsing of comments,
     571CDATA sections and processing instructions must be
     572performed first to determine those regions of text
     573within which ordinary XML markups are not parsed (i.e.,
     574within each of these types of construct.   This is handled
     575by first performance the parsing of these structures and
     576then forming a {\em mask bitstream}, that is, a stream that
     577identifies spans of text to be excluded from parsing
     578(comment and CDATA interiors, parameter text to processing instructions).
    713 \subsection{Error and Error-Check Bitstreams}
     601%\subsection{Error and Error-Check Bitstreams}
    715603Most of the requirements of XML well-formedness checking
    776664checking the well-formedness requirements.
    778 \subsection{Tag Matching}
     666%\subsection{Tag Matching}
    780668Except for empty element tags, XML tags come in pairs with
    789677and match tags using an iterative stack-based approach.
    791 \subsection{Document Structure}
     679%\subsection{Document Structure}
    793681An XML document consists of a single root element with recursively
    802690that is quite low for sufficiently sized files.
    804 \subsection{Summary}
    806694Overall, parallel bitstream techniques are quite well-suited to
Note: See TracChangeset for help on using the changeset viewer.