# Changeset 888

Ignore:
Timestamp:
Feb 4, 2011, 4:39:15 PM (9 years ago)
Message:

Tighten section 3

Location:
docs/EuroPar2011
Files:
3 edited

### Legend:

Unmodified
 r885 \section{Parsing and Error Streams} \section{XML Scanning and Parsing} \label{sec:errorstream} Now consider how the parallel scanning primitive can be applied to the following problem: parse all occurrences of XML decimal character references according to the grammar of Figure \ref{fig:decrefgrmr} and identify any errors.  This grammar is simplified to omit other forms of XML reference including hexadecimal character references as well as general entity references. For the time being, we assume that our XML documents contain only the decimal references and no other use of the \verb:&:'' character. We now consider how the parallel scanning primitive can be applied to the following problems in scanning and parsing of XML structures:  (1) parallel scanning of XML decimal character references, and (2) parallel parsing of XML start tags. The grammar of these structures is shown in Figure \ref{fig:xmlgrmr}. \begin{figure}[tbh] \begin{tabular}{rcl} DecRef & ::=   &        '\verb:&#:' Digit+ '\verb:;:'  \\ Digit  & ::=   &         \verb:[0-9]: Digit  & ::=   &         \verb:[0-9]:\\ STag         &  ::=   &        '\verb:<:' Name (WS  Attribute)* WS? '\verb:>:'  \\ Attribute & ::=   &        Name WS? '=' WS? AttValue \\ AttValue  &           ::=   &        \verb:":' \verb:[^<"]*: \verb:":' $|$ \verb:':'' \verb:[^<']*: \verb:':'' \\ %DQuoted & ::= & \verb:[^<"]*:  \\ %SQuoted & ::= & \verb:[^<']*: \end{tabular} \end{center} \caption{Grammar of Decimal Character References} \label{fig:decrefgrmr} \caption{XML Grammar: Decimal Character References and Start Tags} \label{fig:xmlgrmr} \end{figure} \begin{figure}[b] \begin{center} \begin{tabular}{l@{}lr}\\ \multicolumn{2}{l}{source data $\vartriangleright$} & \verb-ϒ-&9;--&#;-- !-\\ $M_0$ &                                  & \verb.1......1....1....1.....\\ $M_1$ & $= n(M_0)$                      & \verb..1......1....1....1....\\ $E_0$ & $= M_1 \wedge \neg$\verb:[#]:  & \verb.........1..............\\ $M_2$ & $= n(M_1 \wedge \neg E_0)$     & \verb...1...........1....1...\\ $E_1$ & $= M_2 \wedge \neg D$          & \verb...............1........\\ $M_3$ & $= s(M_2 \wedge \neg E_1, D)$  & \verb......1...............1.\\ $E_2$ & $= M_3 \wedge \neg$\verb:[;]: & \verb......................1.\\ $M_4$ & $= M_3 \wedge \neg E_2$        & \verb......1.................\\ $E$  & $= E_0 \, | \, E_1 \, | \, E_2$  & \verb.........1.....1......1. \end{tabular} \end{center} \caption{Parsing Decimal References} \label{fig:decref} \end{figure} indicating the positions of all detected errors. \begin{figure}[tbh] \begin{center} \begin{tabular}{l@{}lr}\\ \multicolumn{2}{l}{source data $\vartriangleright$} & \verb-ϒ-&9;--&#;-- !-\\ $M_0$ &                                  & \verb.1......1....1....1.....\\ $M_1$ & $= n(M_0)$                      & \verb..1......1....1....1....\\ $E_0$ & $= M_1 \wedge \neg$\verb:[#]:  & \verb.........1..............\\ $M_2$ & $= n(M_1 \wedge \neg E_0)$     & \verb...1...........1....1...\\ $E_1$ & $= M_2 \wedge \neg D$          & \verb...............1........\\ $M_3$ & $= s(M_2 \wedge \neg E_1, D)$  & \verb......1...............1.\\ $E_2$ & $= M_3 \wedge \neg$\verb:[;]: & \verb......................1.\\ $M_4$ & $= M_3 \wedge \neg E_2$        & \verb......1.................\\ $E$  & $= E_0 \, | \, E_1 \, | \, E_2$  & \verb.........1.....1......1. \end{tabular} \end{center} \caption{Parsing Decimal References} \label{fig:decref} \end{figure} \subsection{Marker Stream Initialization} How are marker bitstreams initialized?   In general, One question that may arise is: how are marker bitstreams initialized?   In general, this is an important problem, and dependent on the task at hand. In the XML parsing context, end or empty element tag, and every remaining \verb:&: must be the initial character of a general entity or character reference. These assumptions permit easy creation of marker bitstreams for XML tags and XML entities. \subsection{Parallel Parsing of Sequential Structures} The critical and most interesting application of our parallel parsing techniques to XML is in parsing of the start, end and empty element tags that comprise the core of XML markup. In particular, start tags have an iterative internal structure as shown in the grammar of Figure \ref{fig:stag-grmr}.  After an opening angle bracket and tag name, a tag may have multiple attribute-value pairs with values enclosed in single or double quotes.  Using the bitstream addition technique, our method or character reference. These assumptions permit easy creation of marker bitstreams for XML tags and XML references. The parsing of XML start tags is a richer problem, involving sequential structure of attribute-value pairs as shown in Figure \ref{fig:xmlgrmr}. Using the bitstream addition technique, our method is to start with the opening angle bracket of all tags as the initial marker bitstream for parsing the tags in parallel, $W$ for whitespace characters, and $Q$ for characters permitted within a double-quoted attribute value string. \begin{figure}[tbh] \begin{center} \begin{tabular}{rcl} STag         &  ::=   &        '\verb:<:' Name (WS  Attribute)* WS? '\verb:>:'  \\ Attribute & ::=   &        Name WS? '=' WS? AttValue \\ AttValue  &           ::=   &        \verb:":' \verb:[^<"]*: \verb:":' $|$ \verb:':'' \verb:[^<']*: \verb:':'' \\ %DQuoted & ::= & \verb:[^<"]*:  \\ %SQuoted & ::= & \verb:[^<']*: \end{tabular} \end{center} \caption{Grammar of XML Start Tags} \label{fig:stag-grmr} \end{figure} \begin{figure*}[tbh] $M_{2,8} = s(M_{2,7}, W) \wedge \neg$\verb:[>]: & \verb............................................... \end{tabular} \end{center} \caption{Start Tag Parsing} \label{fig:stag-ex} The parsing process is then illustrated in the remaining rows of the The parsing process is illustrated in the remaining rows of the figure.    Each successive row shows the set of parsing markers as they advance in parallel using bitwise logic and addition. %We can then use the scan function in a manner similar to how it was used in Figure \ref{fig:scan2} to scan through the entire name to %identify its end position. \subsection{Mask Formation with Bitstream Subtraction} %\subsection{Comment, CDATA and Processing Instructions} \label{sec:maskstream} For various purposes in parsing, it may be necessary to introduce {\em mask bitstreams}, streams that identify spans of positions that are to be selected or excluded from processing in some fashion. In the case of XML processing, one important use of mask bitstreams is to filter out those \verb:&: and \verb:<: characters that occur within comments, CDATA sections and processing instructions and hence do not indicate starting marker positions for references or tags, respectively. Each of these has a relatively simple structure comprising primarily specific opening and closing delimiters: \verb;: for comments, \verb:: for CDATA sections and \verb:: for processing instructions.   Processing instructions also have a small amount of internal structure consisting of a name that identifies the target of the processing instruction followed by optional parameter text. The content of each of these items is relatively unconstrained and may contain what appears to be XML markup of other kinds. This makes it impossible to reliably parse all instances of these types of markup using parallel techniques. However, we can still use bitstream addition for the sequential parsing of these items from the beginning of the file.  In this case, instead of initializing a marker bitstream using specific marker symbols found throughout the file, the marker bitstream is initialized with a single 1 bit at the file start position. Nevertheless, parsing of comments, CDATA sections and processing instructions generally proceeds quite quickly in a sequential fashion.   Parsing steps generally involve long scans to an opening delimiter of one of these constructs, followed by further long scans through the content of the comment, CDATA section or processing instruction. \begin{figure*}[tbh] \begin{center} \begin{tabular}{cr}\\ source data $\vartriangleright$ & \verb ]]>.\\ $M_i$ & \verb_1_____________1________________1____________________\\ $M_f$ & \verb____________1____________1_________________________1_\\ $m = n(M_f) - m_i$ & \verb_111111111111__11111111111______11111111111111111111_\\ $M_1 = n(\verb:[<]:) \wedge \neg m$ & \verb____________________________1________________________ \end{tabular} \end{center} \caption{Mask Formation for Comments, CDATA and PI} \label{fig:CtCDPImask} \end{figure*} A single pass scan is made to identify these structures within the document.   Once complete, a mask bitstream of 1 bits is formed to identify the union of the interiors of these structures.   Figure \ref{fig:CtCDPImask} illustrates.   Here, we assume this pass is complete to produce the initial and final positions of these constructs as $M_i$ and $M_f$. Here, $M_i$ is actually the set of upshifted start positions for comments, CDATA sections and processing instructions. Note the upshifting is requires to properly match the delimiter $n(\verb[<]) \wedge \verb[!?]$. $M_f$ is the set of final positions determined by parsing. This then allows the mask bitstream $m$ to be computed by bitstream subtraction.   (To perform the subtraction, imaging reversing $M_i$ and $M_f$ to produce the internal little-endian representation.)   Note that this stream does not allow us to mask out the opening \verb:<: of these constructs themselves.   But we can mask out $n(\verb[<])$ to produce $M_1$ as shown.  However, this is adequate as an initial marker bitstream for tag parsing, as the formation of $M_1 = n(\verb[<])$ is the first step shown in Figure \ref{fig:stag-ex}. \subsection{Python Prototyping} Taking advantage of Python's built-in support for unbounded integers to represent arbitrary-size bitstreams, we have implemented a complete parsing prototype for XML using only bitstream addition, subtraction and bitwise logic. The \verb:ParallelScanThru: operation is a straightforward implementation of the scanning primitive $s$ as shown previously.   The operation $n$ is simply implemented using an upshift operation, while subtraction and bitwise logic are directly supported. We have also used a modified version of this prototype as the input language of an experimental bitstream compiler that we have developed.   This compiler implements the compilation to block-by-block processing presented in Section 5, following. Following the pattern shown here, the remaining syntactic features of XML markup can similarly be parsed with bitstream based methods.   One complexity is that the parsing of comments, CDATA sections and processing instructions must be performed first to determine those regions of text within which ordinary XML markups are not parsed (i.e., within each of these types of construct.   This is handled by first performance the parsing of these structures and then forming a {\em mask bitstream}, that is, a stream that identifies spans of text to be excluded from parsing (comment and CDATA interiors, parameter text to processing instructions). \subsection{Error and Error-Check Bitstreams} %\subsection{Error and Error-Check Bitstreams} Most of the requirements of XML well-formedness checking checking the well-formedness requirements. \subsection{Tag Matching} %\subsection{Tag Matching} Except for empty element tags, XML tags come in pairs with and match tags using an iterative stack-based approach. \subsection{Document Structure} %\subsection{Document Structure} An XML document consists of a single root element with recursively that is quite low for sufficiently sized files. \subsection{Summary} %\subsection{Summary} Overall, parallel bitstream techniques are quite well-suited to