# Changeset 3041 for docs/Balisage13/Bal2013came0601/Bal2013came0601.html

Ignore:
Timestamp:
Apr 19, 2013, 12:39:07 AM (6 years ago)
Message:

Replaced verbatim with code tags, replaced math escape with appropriate tags, added unicode escapes, added personal blurbs.

File:
1 edited

Unmodified
Removed
• ## docs/Balisage13/Bal2013came0601/Bal2013came0601.html

 r3040

Graduate Student, School of Computing Science

Simon Fraser University

Chief Technology Officer

International Characters, Inc.

Introduction

Background

Xerces C++ Structure

The Xerces C++ parser parsing as well as a DOM tree-based parsing interface.

is required, involving all aspects of the parser.

The Parabix Framework

The Parabix (parallel bit stream) framework is a transformative approach to XML parsing (and other forms of text processing.) The key idea is to exploit the availability of wide Boolean-logic operations\footnote{â§, \âš and Â¬ denote the boolean AND, OR and NOT operators.} are used to classify the input bits into a set of character-class bit streams, which identify key characters (or groups of characters) with a $1$. characters (or groups of characters) with a 1. For example, one of the fundamental characters in XML is a left-angle bracket. A character is a< if and only if Â¬(b0 âš b1) â§ (b2 â§ A character is an '<' if and only if Â¬(b0 âš b1) â§ (b2 â§ b3) â§ (b4 â§ b5) â§ Â¬ (b6 âš b7) = 1. Similarly, a character is numeric [0-9] if and only if $Â¬(b0 âš b1) â§ (b2 â§ b3) â§ Â¬(b4 â§ (b5 âš b6)). Similarly, a character is numeric, [0-9] if and only if Â¬(b0 âš b1) â§ (b2 â§ b3) â§ Â¬(b4 â§ (b5 âš b6)). An important observation here is that ranges of characters may require fewer operations than individual characters and multiple classes can share the classification cost. Consider, for example, the XML source data stream shown in the first line of . The remaining lines of this figure show several parallel bit streams that are computed in Parabix-style tag openers into start tag marks and end tag marks depending on the character immediately following the opener (i.e., \verb:/:'') or not. The remaining three opener (i.e., "/") or not. The remaining three lines show streams that can be computed in subsequent parsing (using the technique attribute names and attribute values of tags. Two intuitions may help explain how the Parabix approach can lead to improved XML parsing performance. The first is that should provide substantial benefit. Previous studies have shown that the Parabix approach improves many aspects of XML processing, including transcoding \cite{Cameron2008}, character classification and validation, they lacked the functionality required by full XML parsers. Commercial XML processors support transcoding of multiple character sets and can parse and validate against multiple document vocabularies. Sequential vs. Parallel Paradigm Xercesâlike all traditional XML parsersâprocesses XML documents sequentially. Each character is examined to distinguish between the validation and content processing modes. In other words, Xerces belongs to an equivalent class applications termed FSM applications\footnote{ Herein FSM applications are considered software systems whose behaviour is defined by the inputs, Unfortunately, textual data tends to be unpredictable and any character could induce a state transition. Parabix-style XML parsers utilize a concept of layered processing. A block of source text is transformed into a set of lexical bitstreams, Architecture Overview icXML is more than an optimized version of Xerces. Many components were grouped, restructured and rearchitected with pipeline parallelism in mind. the user-defined DTD and schema grammar(s) before passing it to the end-user. In icXML functions are grouped into logical components. As shown in Figure \ref{fig:icxml-arch}, two major categories exist: (1) the Parabix Subsystem and (2) the Markup Processor. From here, two data-independent branches exist: the Symbol Resolver and Content Preparation Unit. A typical XML file contains few unique element and attribute namesâbut each of them will occur frequently. icXML stores these as distinct data structures, called symbols, each with their own global identifier (GID). the raw data to produce a sequence of GIDs, called the symbol stream. The final components of the Parabix Subsystem are the Content Preparation Unit and Content Stream Generator. The former takes the (transposed) basis bitstreams and selectively filters them, according to the filtered streams into the tagged UTF-16 content stream, discussed in Section \ref{section:arch:contentstream}. Combined, the symbol and content stream form icXML's compressed IR of the XML document. The Markup Processor~parses the IR to validate and produce the sequential output for the end user. However, preprocessing associated with each symbol greatly reduces the work of this stage. Character Set Adapters In Xerces, all input is transcoded into UTF-16 to simplify the parsing costs of Xerces itself and provide the end-consumer with a single encoding format. transcoding imposes at least a cost of buffer copying. In icXML, however, the concept of Character Set Adapters (CSAs) is used to minimize transcoding costs. Given a specified input encoding, a CSA is responsible for checking that is performed using the parallel bitstream representation of the source input. An important observation is that many character sets are an extension to the legacy 7-bit ASCII character set. This includes the serves to compute lexical item streams for all such ASCII-based character sets. A second observation is thatâregardless of which character set is usedâquite often all of the characters in a particular block of input will be within the ASCII range. UTF-16 form are each set to zero in this case. A third observation is that repeated transcoding of the names of XML elements, attributes and so on can be avoided by using a look-up mechanism. additional cost. The cost of individual character transcoding is avoided whenever a block of input is confined to the ASCII subset and for all but the first occurrence of any XML element or attribute name. Combined Parallel Filtering As just mentioned, UTF-8 to UTF-16 transcoding involves marking all but the last bytes of multi-byte UTF-8 sequences as positions for deletion. For example, the two Chinese characters \begin{CJK*}{UTF8}{gbsn}äœ å¥œ\end{CJK*} are represented as two three-byte UTF-8 sequences \verb'E4 BD A0' and \verb'E5 A5 BD' while the UTF-16 representation must be compressed down to the two code units \verb'4F60' and \verb'597D'. Chinese characters äœ å¥œ are represented as two three-byte UTF-8 sequences E4 BD A0 and E5 A5 BD while the UTF-16 representation must be compressed down to the two code units 4F60 and 597D. In the bit parallel representation, this corresponds to a reduction from six bit positions representing UTF-8 code units (bytes) for deletion. In this case, the portion of the mask corresponding to these input bytes is the bit sequence \verb'110110'. Using this approach, transcoding may then be 110110. Using this approach, transcoding may then be completed by applying parallel deletion and inverse transposition of the UTF-16 bitstreams\cite{Cameron2008}. Rather than immediately paying the costs of deletion and transposition just for transcoding, In essence, the deletion masks for transcoding and for line break normalization each represent a bitwise applied once. A further application of combined filtering is the processing of XML character and entity that this is not true, it is addressed in post-processing. The final step of combined filtering occurs during the process of reducing markup data to tag bytes Content Stream A relatively-unique concept for icXML is the use of a filtered content stream. Rather that parsing an XML document in its original format, the input is transformed through the parallel filtering algorithm, described in section \ref{sec:parfilter}. Combined with the symbol stream, the parser traverses the content stream to effectively reconstructs the input document in its output form. The initial 0 indicates an empty content string. The following \verb|>| The initial 0 indicates an empty content string. The following > indicates that a start tag without any attributes is the first element in this text and the first unused symbol, document, is the element name. directly jump to the end of every string without scanning for it. Following \verbfee'' is a \verb=, which marks the existence of an attribute. Following 'fee' is a =, which marks the existence of an attribute. Because all of the intra-element was performed in the Parabix Subsystem, this must be a legal attribute. Since attributes can only occur within start tags and must be accompanied by a textual value, the next symbol in the symbol stream must be the element name of a start tag, and the following one must be the name of the attribute and the string that follows the \verb= must be its value. However, the subsequent \verb= is not treated as an independent attribute because the parser has yet to read a \verb>, which marks the end of a start tag. Thus only one symbol is taken from the symbol stream and and the following one must be the name of the attribute and the string that follows the = must be its value. However, the subsequent = is not treated as an independent attribute because the parser has yet to read a >, which marks the end of a start tag. Thus only one symbol is taken from the symbol stream and it (along with the string value) is added to the element. Eventually the parser reaches a \verb/, which marks the existence of an end tag. Every end tag requires an Eventually the parser reaches a /, which marks the existence of an end tag. Every end tag requires an element name, which means they require a symbol. Inter-element validation whenever an empty tag is detected to ensure that the appropriate scope-nesting rules have been applied. Namespace Handling In XML, namespaces prevents naming conflicts when multiple vocabularies are used together. It is especially important when a vocabulary application-dependant meaning, such as when Namespaces are bound to uniform resource identifiers (URIs), which are strings used to identify specific names or resources. On line 1 of Figure \ref{fig:namespace1}, the \verb|xmlns| attribute instructs the XML On line 1 of Figure \ref{fig:namespace1}, the xmlns attribute instructs the XML processor to bind the prefix p to the URI 'pub.net' and the default (empty) prefix to book.org. Thus to the XML processor, the \verb|title| on line 2 and \verb|price| on line 4 both read as \verb|"book.org":title| and \verb|"book.org":price| respectively, whereas on line 3 and 5, \verb|p:name| and \verb|price| are seen as \verb|"pub.net":name| and \verb|"pub.net":price|. Even though the actual element name \verb|price|, due to namespace scoping rules they are viewed as two uniquely-named items prefix to book.org. Thus to the XML processor, the title on line 2 and price on line 4 both read as "book.org":title and "book.org":price respectively, whereas on line 3 and 5, p:name and price are seen as "pub.net":name and "pub.net":price. Even though the actual element name price, due to namespace scoping rules they are viewed as two uniquely-named items because the current vocabulary is determined by the namespace(s) that are in-scope. In both Xerces and icXML, every URI has a one-to-one mapping to a URI ID. These persist for the lifetime of the application through the use of a global URI pool. (2) those that repeatedly modify the namespaces in predictable patterns. For that reason, icXML contains an independent namespace stack and utilizes bit vectors to cheaply perform When a prefix is declared (e.g., \verb|xmlns:p="pub.net"|), a namespace binding is created that maps When a prefix is declared (e.g., xmlns:p="pub.net"), a namespace binding is created that maps the prefix (which are assigned Prefix IDs in the symbol resolution process) to the URI. Each unique namespace binding has a unique namespace id (NSID) and every prefix contains a bit vector marking every NSID that has ever been associated with it within the document. For example, in Table \ref{tbl:namespace1}, the prefix binding set of \verb|p| and \verb|xmlns| would be \verb|01| and \verb|11| respectively. prefix binding set of p and xmlns would be 01 and 11 respectively. To resolve the in-scope namespace binding for each prefix, a bit vector of the currently visible namespaces is maintained by the system. By ANDing the prefix bit vector with the currently visible namespaces, the in-scope A namespace binding table, similar to Table \ref{tbl:namespace1}, provides the actual URI ID. To ensure that scoping rules are adhered to, whenever a start tag is encountered, any modification to the currently visible namespaces is calculated and stored Error Handling Xerces outputs error messages in two ways: through the programmer API and as thrown objects for fatal errors. each with its own system for detecting and producing error messages. Within the Parabix Subsystem, all computations are performed in parallel, a block at a time. Errors are derived as artifacts of bitstream calculations, with a 1-bit marking the byte-position of an error within a block, column number. The Markup Processor is a state-driven machine. As such, error detection within it is very similar to Xerces. However, reporting the correct line/column is a much more difficult problem. Multithreading with Pipeline Parallelism As discussed in section \ref{background:xerces}, Xerces can be considered a FSM application. These are embarrassingly sequential.''\cite{Asanovic:EECS-2006-183} and notoriously difficult to parallelize. These are "embarrassingly sequential."\cite{Asanovic:EECS-2006-183} and notoriously difficult to parallelize. However, icXML is designed to organize processing into logical layers. In particular, layers within the Parabix Subsystem are designed to operate of modules. The most straightforward division of work in icXML is to separate the Parabix Subsystem and the Markup Processor into distinct logical layers into two separate stages. The resultant application, icXML-p, is a course-grained software-pipeline application. In this case, the Parabix Subsystem thread$T_1$reads 16k of XML input$I$at a time and produces the content, symbol and URI streams, then stores them in a pre-allocated shared data structure$S$. The Markup Processor thread$T_2$consumes$S$, performs well-formedness and grammar-based validation, In this case, the Parabix Subsystem thread T1 reads 16k of XML input I at a time and produces the content, symbol and URI streams, then stores them in a pre-allocated shared data structure S. The Markup Processor thread T2 consumes S, performs well-formedness and grammar-based validation, and the provides parsed XML data to the application through the Xerces API. The shared data structure is implemented using a ring buffer, In the examples of Figure \ref{threads_timeline1} and \ref{threads_timeline2}, the ring buffer has four entries. A lock-free mechanism is applied to ensure that each entry can only be read or written by one thread at the same time. In Figure \ref{threads_timeline1} the processing time of$T_1$is longer than$T_2$; thus$T_2$always waits for$T_1$to write to the shared memory. Figure \ref{threads_timeline2} illustrates the scenario in which$T_1$is faster and must wait for$T_2\$ to finish reading the shared data before it can reuse the memory space.

In Figure \ref{threads_timeline1} the processing time of T1 is longer than T2; thus T2 always waits for T1 to write to the shared memory. Figure \ref{threads_timeline2} illustrates the scenario in which T1 is faster and must wait for T2 to finish reading the shared data before it can reuse the memory space.

Overall, our design is intended to benefit a range of applications. Conceptually, we consider two design points. while the cost of XML parsing represents an overhead of 40%.

Our design is predicated on a goal of using the Parabix framework to achieve a 50% to 100% improvement in the parsing engine itself. In this case, the single-threaded icXML should achieve a 1.5x speedup over Xerces so that the total application cost reduces to 67% of the original. However, in \icXML-p, our ideal scenario gives us two well-balanced threads However, in icXML-p, our ideal scenario gives us two well-balanced threads each performing about 33% of the original work. In this case, Amdahl's law predicts that we could expect up to a 3x speedup at best.

At the other extreme of our design range, we consider an application in which core parsing cost is 40%.   Assuming the 2x speedup of in which core parsing cost is 40%.  Assuming the 2x speedup of the Parabix Subsystem over the corresponding Xerces core, single-threaded icXML delivers a 25% speedup.   However, the most significant icXML delivers a 25% speedup.  However, the most significant aspect of our two-stage multi-threaded design then becomes the ability to hide the entire latency of parsing within the serial time required by the application.   In this case, we achieve required by the application.  In this case, we achieve an overall speedup in processing time by 1.67x.

Although the structure of the Parabix Subsystem allows division of the work into several pipeline stages and has been demonstrated to be effective

Performance

We evaluate \xerces{}, icXML, \icXML-p against two benchmarking applications:

We evaluate Xerces-C++ 3.1.1, icXML, icXML-p against two benchmarking applications: the Xerces C++ SAXCount sample application, and a real world GML to SVG transformation application. 8 MB L3 cache) running the 64-bit version of Ubuntu 12.04 (Linux).

We analyzed the execution profiles of each XML parser using the performance counters found in the processor. to collect per core hardware events.

Xerces C++ SAXCount

Xerces comes with sample applications that demonstrate salient features of the parser. SAXCount is the simplest such application: and prints out the totals.

Table \ref{XMLDocChars} shows the document characteristics of the XML input files selected for the Xerces C++ SAXCount benchmark. The jaw.xml XML documents and consist entirely of single byte encoded ASCII characters.

A key predictor of the overall parsing performance of an XML file is markup density\footnote{ Markup Density: the ratio of markup bytes used to define the structure of the document vs. its file size.}. of markup densities.

Figure \ref{perf_SAX} compares the performance of Xerces, icXML and pipelined icXML in terms of CPU cycles per byte for the SAXCount application. The speedup for icXML over Xerces is 1.3x to 1.8x. With two threads on the multicore machine, \icXML-p can achieve speedup up to 2.7x. With two threads on the multicore machine, icXML-p can achieve speedup up to 2.7x. Xerces is substantially slowed by dense markup but icXML is less affected through a reduction in branches and the use of parallel-processing techniques. \icXML-p performs better as markup-density increases because the work performed by each stage is icXML-p performs better as markup-density increases because the work performed by each stage is well balanced in this application.

GML2SVG

Conclusion and Future Work

This paper is the first case study documenting the significant performance benefits that may be realized through the integration feasibility of these techniques.

The further development of icXML to move beyond 2-stage pipeline parallelism is ongoing, with realistic prospects for library should offer substantial benefits.

The example of XML parsing may be considered prototypical of finite-state machines applications which have sometimes been considered embarassingly sequential'' and so difficult to parallelize that nothing works.''  So the been considered "embarassingly sequential" and so difficult to parallelize that "nothing works."  So the case study presented here should be considered an important data point in making the case that parallelization can indeed be helpful across a broad array of application types.

To overcome the software engineering challenges in applying parallel bitstream technology to existing software systems,

Bibliography

[Leventhal and Lemoine 2009] Leventhal, Michael and

Note: See TracChangeset for help on using the changeset viewer.