# Changeset 3039 for docs/Balisage13/Bal2013came0601/Bal2013came0601.xml

Timestamp:
Apr 18, 2013, 7:02:21 PM
Message:

Initial translation. Special characters, figures, tables, bib, to go.

Regular Expression Compilation

The Parabix Framework

The Parabix (parallel bit stream) framework is a transformative approach to XML parsing (and other forms of text processing). The key idea is to exploit the availability of wide SIMD registers (e.g., 128-bit) in commodity processors to process long blocks of input data, using one register bit per input byte. To facilitate this, the input data is first transposed into a set of basis bit streams. Boolean-logic operations\footnote{$\land$, $\lor$ and $\lnot$ denote the Boolean AND, OR and NOT operators.} are then used to classify the input bits into a set of {\it character-class bit streams}, which identify key characters (or groups of characters) with a $1$. For example, one of the fundamental characters in XML is the left angle bracket. A character is a \verb|<| if and only if $\lnot(b_0 \lor b_1) \land (b_2 \land b_3) \land (b_4 \land b_5) \land \lnot(b_6 \lor b_7) = 1$. Similarly, a character is numeric, [0-9], if and only if $\lnot(b_0 \lor b_1) \land (b_2 \land b_3) \land \lnot(b_4 \land (b_5 \lor b_6)) = 1$. An important observation here is that ranges of characters may require fewer operations than individual characters, and multiple classes can share the classification cost. Consider, for example, the XML source data stream shown in the first line of Figure \ref{fig:parabix1}. The remaining lines of this figure show several parallel bit streams that are computed in Parabix-style parsing, with each bit of each stream in one-to-one correspondence with the source character code units of the input stream. For clarity, 1 bits are denoted with 1 in each stream and 0 bits are represented as underscores. The first bit stream shown is that for the opening angle brackets that represent tag openers in XML. The second and third streams show a partition of the tag openers into start tag marks and end tag marks, depending on whether the character immediately following the opener is a \verb|/| or not.
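The classification step above can be sketched in a few lines. This is an illustration only, not icXML source: Python integers model unbounded bit streams, with bit $i$ of each stream corresponding to input byte $i$, and $b_0$ holding the most significant bit of each byte. The function names (`transpose`, `left_angle`, `digit`) are invented for this sketch.

```python
def transpose(data: bytes):
    """Produce the eight basis bit streams for a block of input bytes."""
    basis = [0] * 8
    for i, byte in enumerate(data):
        for k in range(8):
            if byte & (0x80 >> k):      # b0 is the high-order bit
                basis[k] |= 1 << i
    return basis

def left_angle(b):
    """Mark each '<' (0x3C): ~(b0|b1) & (b2&b3) & (b4&b5) & ~(b6|b7)."""
    mask = (1 << 64) - 1                # confine ~ to a finite block
    return (~(b[0] | b[1]) & (b[2] & b[3]) &
            (b[4] & b[5]) & ~(b[6] | b[7])) & mask

def digit(b):
    """Mark each character in [0-9]: ~(b0|b1) & (b2&b3) & ~(b4&(b5|b6))."""
    mask = (1 << 64) - 1
    return (~(b[0] | b[1]) & (b[2] & b[3]) &
            ~(b[4] & (b[5] | b[6]))) & mask

src = b'<a x="19"/>'
b = transpose(src)
opens = left_angle(b)
digits = digit(b)
# Positions holding a 1 bit identify the classified characters.
assert [i for i in range(len(src)) if opens & (1 << i)] == [0]
assert [i for i in range(len(src)) if digits & (1 << i)] == [6, 7]
```

Note how the digit class costs no more than the single-character class: the range [0-9] is captured by one extra OR inside the final term.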
The remaining three lines show streams that can be computed in subsequent parsing (using the technique of \bitstream{} addition \cite{cameron-EuroPar2011}), namely streams marking the element names, attribute names and attribute values of tags. Two intuitions may help explain how the Parabix approach can lead to improved XML parsing performance. The first is that the use of the full register width offers a considerable information advantage over sequential byte-at-a-time parsing: sequential processing of bytes uses just 8 bits of each register, greatly limiting the processor resources that are effectively in use at any one time. The second is that byte-at-a-time scanning loops are often computing just a single bit of information per iteration: is the scan complete yet? Rather than computing these decision bits one at a time, an approach that computes many of them in parallel (e.g., 128 bytes at a time using 128-bit registers) should provide substantial benefit. Previous studies have shown that the Parabix approach improves many aspects of XML processing, including transcoding \cite{Cameron2008}, character classification and validation, tag parsing and well-formedness checking. The first Parabix parser used processor bit scan instructions to considerably accelerate sequential scanning loops for individual characters \cite{CameronHerdyLin2008}. Recent work has incorporated a method of parallel scanning using \bitstream{} addition \cite{cameron-EuroPar2011}, as well as combining SIMD methods with 4-stage pipeline parallelism to further improve throughput \cite{HPCA2012}. Although these research prototypes handled the full syntax of schema-less XML documents, they lacked the functionality required of full XML parsers. Commercial XML processors support transcoding of multiple character sets and can parse and validate against multiple document vocabularies.
Additionally, they provide API facilities beyond those found in research prototypes, including the widely used SAX, SAX2 and DOM interfaces.
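The parallel scanning technique mentioned above can be illustrated with the classic scan-through identity. This is a sketch under the same modeling assumptions as before (Python integers as unbounded bit streams); the helper names are invented here. Given a marker stream $m$ (1 bits at scan start positions) and a character-class stream $c$ (1 bits at characters to scan through), a single addition advances every marker past its run simultaneously.

```python
def scan_thru(m: int, c: int) -> int:
    """Advance each marker in m through the run of class-c characters
    that follows it; the carry from the addition does the scanning."""
    return (m + c) & ~c

def class_stream(data: bytes, cls) -> int:
    """Build a character-class bit stream for a predicate on bytes."""
    return sum(1 << i for i, ch in enumerate(data) if cls(ch))

src = b'<doc><item/>'
name_char = class_stream(src, lambda ch: chr(ch).isalnum())
openers = class_stream(src, lambda ch: ch == ord('<'))

# Advance every tag-opener cursor through the element name after it;
# both cursors move in one addition, regardless of name length.
cursors = scan_thru(openers << 1, name_char)
assert [i for i in range(16) if cursors & (1 << i)] == [4, 10]
```

Both cursors land on the first non-name character after their element name (the `>` at position 4 and the `/` at position 10) without any per-character loop.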
Sequential vs. Parallel Paradigm

Xerces—like all traditional XML parsers—processes XML documents sequentially. Each character is examined to distinguish between XML-specific markup, such as a left angle bracket \verb|<|, and the content held within the document. As the parser progresses through the document, it alternates between markup scanning, validation and content processing modes. In other words, Xerces belongs to an equivalence class of applications termed FSM applications\footnote{Herein, FSM applications are considered software systems whose behaviour is defined by the inputs, the current state and the events associated with transitions between states.}. Each state transition indicates the processing context of subsequent characters. Unfortunately, textual data tends to be unpredictable and any character could induce a state transition. Parabix-style XML parsers utilize a concept of layered processing. A block of source text is transformed into a set of lexical \bitstream{}s, which undergo a series of operations that can be grouped into logical layers, e.g., transposition, character classification and lexical analysis. Each layer is pipeline parallel and requires neither speculation nor pre-parsing stages \cite{HPCA2012}. To meet the API requirements of the document-ordered Xerces output, the results of the Parabix processing layers must be interleaved to produce the equivalent behaviour.
Architecture
Overview

\icXML{} is more than an optimized version of Xerces. Many components were grouped, restructured and rearchitected with pipeline parallelism in mind. In this section, we highlight the core differences between the two systems. As shown in Figure \ref{fig:xerces-arch}, Xerces comprises five main modules: the transcoder, reader, scanner, namespace binder and validator. The {\it Transcoder} converts source data into UTF-16 before Xerces parses it as XML; the majority of the character set encoding validation is performed as a byproduct of this process. The {\it Reader} is responsible for the streaming and buffering of all raw and transcoded (UTF-16) text. It tracks the current line/column position, performs line-break normalization and validates context-specific character set issues, such as tokenization of qualified names. The {\it Scanner} pulls data through the reader and constructs the intermediate representation (IR) of the document; it deals with all issues related to entity expansion, and validates the XML well-formedness constraints and any character set encoding issues that cannot be completely handled by the reader or transcoder (e.g., surrogate characters, validation and normalization of character references, etc.). The {\it Namespace Binder} is a core piece of the element stack. It handles namespace scoping issues between different XML vocabularies, allowing the scanner to select the correct schema grammar structures. The {\it Validator} takes the IR produced by the Scanner (and potentially annotated by the Namespace Binder) and assesses whether the final output matches the user-defined DTD and schema grammar(s) before passing it to the end user. In \icXML{}, functions are grouped into logical components. As shown in Figure \ref{fig:icxml-arch}, two major categories exist: (1) the \PS{} and (2) the \MP{}. All tasks in (1) use the Parabix Framework \cite{HPCA2012}, which represents data as a set of parallel \bitstream{}s.
The {\it Character Set Adapter}, discussed in Section \ref{arch:character-set-adapter}, mirrors Xerces's Transcoder duties; however, instead of producing UTF-16 it produces a set of lexical \bitstream{}s, similar to those shown in Figure \ref{fig:parabix1}. These lexical \bitstream{}s are later transformed into UTF-16 in the \CSG{}, after additional processing is performed. The first precursor to producing UTF-16 is the {\it Parallel Markup Parser} phase. It takes the lexical streams and produces a set of marker \bitstream{}s, in which a 1 bit identifies a significant position within the input data. One \bitstream{} is created for each critical piece of information, such as the beginning and ending of start tags, end tags, element names, attribute names, attribute values and content. Intra-element well-formedness validation is performed as an artifact of this process. Like Xerces, \icXML{} must provide the line and column position of each error. The {\it Line-Column Tracker} uses the lexical information to keep track of the document position(s) through the use of an optimized population count algorithm, described in Section \ref{section:arch:errorhandling}. From here, two data-independent branches exist: the Symbol Resolver and the Content Preparation Unit. A typical XML file contains few unique element and attribute names—but each of them will occur frequently. \icXML{} stores these as distinct data structures, called symbols, each with its own global identifier (GID). Using the symbol marker streams produced by the Parallel Markup Parser, the {\it Symbol Resolver} scans through the raw data to produce a sequence of GIDs, called the {\it symbol stream}. The final components of the \PS{} are the {\it Content Preparation Unit} and the {\it \CSG{}}.
The former takes the (transposed) basis \bitstream{}s and selectively filters them, according to the information provided by the Parallel Markup Parser, and the latter transforms the filtered streams into the tagged UTF-16 {\it content stream}, discussed in Section \ref{section:arch:contentstream}. Combined, the symbol and content streams form \icXML{}'s compressed IR of the XML document. The {\it \MP{}}~parses the IR to validate and produce the sequential output for the end user. The {\it Final WF checker} performs inter-element well-formedness validation that would be too costly to perform in bit space, such as ensuring that every start tag has a matching end tag. Xerces's namespace binding functionality is replaced by the {\it Namespace Processor}. Unlike in Xerces, it is a discrete phase that produces a series of URI identifiers (URI IDs), the {\it URI stream}, which are associated with each symbol occurrence. This is discussed in Section \ref{section:arch:namespacehandling}. Finally, the {\it Validation} layer implements Xerces's validator; however, the preprocessing associated with each symbol greatly reduces the work of this stage.
Character Set Adapters

In Xerces, all input is transcoded into UTF-16 to simplify the parsing costs of Xerces itself and to provide the end consumer with a single encoding format. In the important case of UTF-8 to UTF-16 transcoding, the transcoding costs can be significant, because of the need to decode and classify each byte of input, mapping variable-length UTF-8 byte sequences into 16-bit UTF-16 code units with bit manipulation operations. In other cases, transcoding may involve table look-up operations for each byte of input. In any case, transcoding imposes at least the cost of buffer copying. In \icXML{}, however, the concept of Character Set Adapters (CSAs) is used to minimize transcoding costs. Given a specified input encoding, a CSA is responsible for checking that input code units represent valid characters, mapping the characters of the encoding into the appropriate \bitstream{}s for XML parsing actions (i.e., producing the lexical item streams), as well as supporting ultimate transcoding requirements. All of this work is performed using the parallel \bitstream{} representation of the source input. An important observation is that many character sets are extensions of the legacy 7-bit ASCII character set. This includes the various ISO Latin character sets, UTF-8, UTF-16 and many others. Furthermore, all significant characters for parsing XML are confined to the ASCII repertoire. Thus, a single common set of lexical item calculations serves to compute lexical item streams for all such ASCII-based character sets. A second observation is that—regardless of which character set is used—quite often all of the characters in a particular block of input will be within the ASCII range. This is a very simple test to perform using the \bitstream{} representation: simply confirm that the bit 0 stream is zero for the entire block. For blocks satisfying this test, all logic dealing with non-ASCII characters can simply be skipped.
Transcoding to UTF-16 becomes trivial in this case, as the high eight \bitstream{}s of the UTF-16 form are each set to zero. A third observation is that repeated transcoding of the names of XML elements, attributes and so on can be avoided by using a look-up mechanism. That is, the first occurrence of each symbol is stored in a look-up table mapping the input encoding to a numeric symbol ID. Transcoding of the symbol is applied at this time. Subsequent look-up operations can avoid transcoding by simply retrieving the stored representation. As symbol look-up is required anyway to apply various XML validation rules, this achieves the effect of transcoding each occurrence without additional cost. Thus the cost of individual character transcoding is avoided whenever a block of input is confined to the ASCII subset, and for all but the first occurrence of any XML element or attribute name. Furthermore, when transcoding is required, the parallel \bitstream{} representation supports efficient transcoding operations. In the important case of UTF-8 to UTF-16 transcoding, the corresponding UTF-16 \bitstream{}s can be calculated in bit-parallel fashion based on the UTF-8 streams \cite{Cameron2008}, and all but the final bytes of multi-byte sequences can be marked for deletion as discussed in the following subsection. In other cases, transcoding within a block need only be applied to non-ASCII bytes, which are conveniently identified by iterating through the bit 0 stream using bit scan operations.
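The all-ASCII fast path can be sketched as follows. This is an illustration of the test described above, not icXML source: the bit 0 stream holds the high bit of every input byte, so if it is zero for a whole block, every byte is ASCII and the high byte of every UTF-16 code unit is simply zero. The function names are invented for this sketch.

```python
def bit0_stream(block: bytes) -> int:
    """The bit 0 (high-bit) basis stream of a block."""
    return sum(1 << i for i, ch in enumerate(block) if ch & 0x80)

def transcode_ascii_block(block: bytes) -> bytes:
    """UTF-8 -> UTF-16LE for a block that passes the all-ASCII test."""
    if bit0_stream(block) != 0:
        raise NotImplementedError("non-ASCII path not sketched here")
    # Each ASCII byte becomes a code unit with a zero high byte.
    out = bytearray()
    for ch in block:
        out += bytes((ch, 0))
    return bytes(out)

assert transcode_ascii_block(b'<a/>') == '<a/>'.encode('utf-16le')
```

In the real bit-stream setting the zero test is a single OR-reduction over the block's bit 0 register values; the per-byte loop here only serves to make the output checkable.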
Content Stream

A relatively unique concept of \icXML{} is the use of a filtered content stream. Rather than parsing an XML document in its original format, the input is transformed into one that is easier for the parser to iterate through and produce the sequential output from. The source data is transformed into the content stream through the parallel filtering algorithm, described in Section \ref{sec:parfilter}. Combined with the symbol stream, the parser traverses the content stream to effectively reconstruct the input document in its output form. The initial {\tt\it 0} indicates an empty content string. The following \verb|>| indicates that a start tag without any attributes is the first element in this text and that the first unused symbol, \verb|document|, is the element name. Succeeding that is the content string \verb|fee|, which is null-terminated in accordance with the Xerces API specification. Unlike Xerces, no memory-copy operations are required to produce these strings, which, as Figure~\ref{fig:xerces-profile} shows, account for 6.83% of Xerces's execution time. Additionally, it is cheap to locate the terminal character of each string: using the String End \bitstream{}, the \PS{} can calculate the offset of each null character in the content stream in parallel, which in turn means the parser can jump directly to the end of every string without scanning for it. Following \verb|fee| is a \verb|=|, which marks the existence of an attribute. Because all of the intra-element well-formedness validation was performed in the \PS{}, this must be a legal attribute. Since attributes can only occur within start tags and must be accompanied by a textual value, the next symbol in the symbol stream must be the element name of a start tag, the following one must be the name of the attribute, and the string that follows the \verb|=| must be its value. However, a subsequent \verb|=| is not treated as an attribute of a new element, because the parser has yet to read a \verb|>|, which marks the end of a start tag.
Thus only one symbol is taken from the symbol stream, and it (along with the string value) is added to the element. Eventually the parser reaches a \verb|/|, which marks the existence of an end tag. Every end tag requires an element name, which means it requires a symbol. Inter-element validation is performed whenever an end tag is detected, to ensure that the appropriate scope-nesting rules have been applied.
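Locating string terminators without scanning can be sketched with a String End bit stream and successive bit scans. This is illustrative only (the helper names are invented); it shows why the terminal character of every string is cheap to find once the terminator positions exist as a bit stream.

```python
def string_ends(content: bytes) -> int:
    """Bit stream with a 1 at the position of every null terminator."""
    return sum(1 << i for i, ch in enumerate(content) if ch == 0)

def end_offsets(ends: int):
    """Yield terminator positions using bit-scan (lowest set bit first)."""
    while ends:
        low = ends & -ends          # isolate the lowest 1 bit
        yield low.bit_length() - 1  # its position
        ends ^= low                 # clear it and continue

content = b'fee\x00fie\x00foe\x00'
assert list(end_offsets(string_ends(content))) == [3, 7, 11]
```

Each string's end is reached by a constant-time bit operation rather than a character-by-character scan of the string itself.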
Namespace Handling

In XML, namespaces prevent naming conflicts when multiple vocabularies are used together. They are especially important when a vocabulary has an application-dependent meaning, such as when XML or SVG documents are embedded within XHTML files. Namespaces are bound to uniform resource identifiers (URIs), which are strings used to identify specific names or resources. On line 1 of Figure \ref{fig:namespace1}, the \verb|xmlns| attribute instructs the XML processor to bind the prefix \verb|p| to the URI \verb|pub.net| and the default (empty) prefix to \verb|book.org|. Thus, to the XML processor, the \verb|title| on line 2 and the \verb|price| on line 4 read as \verb|"book.org":title| and \verb|"book.org":price| respectively, whereas on lines 3 and 5, \verb|p:name| and \verb|p:price| are seen as \verb|"pub.net":name| and \verb|"pub.net":price|. Even though the actual element name \verb|price| is the same, due to namespace scoping rules the two occurrences are viewed as uniquely-named items, because the current vocabulary is determined by the namespace(s) that are in scope. In both Xerces and \icXML{}, every URI has a one-to-one mapping to a URI ID. These persist for the lifetime of the application through the use of a global URI pool. Xerces maintains a stack of namespace scopes that is pushed (popped) every time a start tag (end tag) occurs in the document. Because a namespace declaration affects the entire element, it must be processed prior to grammar validation. This is a costly process, considering that a typical namespaced XML document comes in one of two forms: (1) those that declare a set of namespaces upfront and never change them, and (2) those that repeatedly modify the namespaces in predictable patterns.
For that reason, \icXML{} contains an independent namespace stack and utilizes bit vectors to perform scope resolution cheaply. When a prefix is declared (e.g., \verb|xmlns:p="pub.net"|), a namespace binding is created that maps the prefix (prefixes are assigned Prefix IDs in the symbol resolution process) to the URI. Each unique namespace binding has a unique namespace ID (NSID), and every prefix contains a bit vector marking every NSID that has ever been associated with it within the document. For example, in Table \ref{tbl:namespace1}, the prefix binding sets of \verb|p| and \verb|xmlns| would be \verb|01| and \verb|11| respectively. To resolve the in-scope namespace binding for each prefix, a bit vector of the currently visible namespaces is maintained by the system. By ANDing the prefix bit vector with the currently visible namespaces, the in-scope NSID can be found using a bit-scan intrinsic. A namespace binding table, similar to Table \ref{tbl:namespace1}, provides the actual URI ID. To ensure that scoping rules are adhered to, whenever a start tag is encountered, any modification to the currently visible namespaces is calculated and stored within a stack of bit vectors denoting the locally modified namespace bindings. When an end tag is found, the currently visible namespaces vector is XORed with the vector at the top of the stack. This allows any number of changes to be performed at each scope level in constant time.
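The bit-vector scheme above can be sketched as follows. This is a minimal model, not icXML's implementation: the class name, its methods, and the delta-computation details are invented here to illustrate the AND-plus-bit-scan resolution and the XOR-based scope restoration.

```python
class NamespaceTracker:
    """Illustrative bit-vector namespace scoping (one bit per NSID)."""
    def __init__(self):
        self.prefix_vec = {}     # prefix -> bit vector of its NSIDs
        self.binding_uri = []    # NSID -> URI
        self.visible = 0         # bit vector of in-scope NSIDs
        self.stack = []          # per-scope XOR deltas

    def _declare(self, prefix, uri, delta):
        nsid = len(self.binding_uri)
        self.binding_uri.append(uri)
        self.prefix_vec[prefix] = self.prefix_vec.get(prefix, 0) | (1 << nsid)
        # Turn off any visible binding of this prefix, turn on the new one.
        delta ^= self.visible & self.prefix_vec[prefix] & ~(1 << nsid)
        delta ^= 1 << nsid
        return delta

    def start_tag(self, decls):
        delta = 0
        for prefix, uri in decls:
            delta = self._declare(prefix, uri, delta)
        self.visible ^= delta    # apply the local modifications
        self.stack.append(delta)

    def end_tag(self):
        self.visible ^= self.stack.pop()   # constant-time scope restore

    def resolve(self, prefix):
        vec = self.prefix_vec[prefix] & self.visible   # AND ...
        return self.binding_uri[vec.bit_length() - 1]  # ... then bit-scan

ns = NamespaceTracker()
ns.start_tag([('p', 'pub.net'), ('', 'book.org')])
assert ns.resolve('p') == 'pub.net'
ns.start_tag([('p', 'pub2.net')])         # shadow p in an inner scope
assert ns.resolve('p') == 'pub2.net'
ns.end_tag()                              # leave the inner scope
assert ns.resolve('p') == 'pub.net'
```

However deeply a prefix is rebound, leaving a scope is a single XOR against the top of the stack.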
Error Handling

Xerces outputs error messages in two ways: through the programmer API and as thrown objects for fatal errors. As Xerces parses a file, it uses context-dependent logic to assess whether the next character is legal; if not, the current state determines the type and severity of the error. \icXML{} emits errors in a similar manner—but how it discovers them is substantially different. Recall from Figure \ref{fig:icxml-arch} that \icXML{} is divided into two sections: the \PS{} and the \MP{}, each with its own system for detecting and producing error messages. Within the \PS{}, all computations are performed in parallel, a block at a time. Errors are derived as artifacts of \bitstream{} calculations, with a 1 bit marking the byte position of an error within a block; the type of error is determined by the equation that discovered it. The difficulty of error processing in this section is that in Xerces the line and column number must be given with every error production. Two major issues arise from this: (1) line position adheres to XML line-break normalization rules; as such, some sequences of characters, e.g., a carriage return followed by a line feed, are counted as a single new line; (2) column position is counted in characters, not bytes or code units; thus multi-code-unit code points and surrogate pairs are each counted as a single column position. Note that typical XML documents are error-free, yet the calculation of the line/column position is a constant overhead in Xerces. To reduce this, \icXML{} pushes the bulk cost of the line/column calculation to the occurrence of the error and performs only the minimal amount of book-keeping necessary to facilitate it. \icXML{} leverages the byproducts of the Character Set Adapter (CSA) module and amalgamates the information within the Line Column Tracker (LCT). One of the CSA's major responsibilities is transcoding an input text.
During this process, whitespace normalization rules are applied and multi-code-unit and surrogate characters are detected and validated. A {\it line-feed \bitstream{}}, which marks the positions of the normalized new line characters, is a natural derivative of this process. Using an optimized population count algorithm, the line count can be summarized cheaply for each valid block of text. Column position is more difficult to calculate. It is possible to scan backwards through the \bitstream{} of new line characters to determine the distance (in code units) between the position at which an error was detected and the last line feed. However, this distance may exceed the actual character position, for the reasons discussed in (2). To handle this, the CSA generates a {\it skip mask} \bitstream{} by ORing together many relevant \bitstream{}s, such as those of all trailing multi-code-unit and surrogate characters, and of any characters that were removed during the normalization process. When an error is detected, the sum of those skipped positions is subtracted from the distance to determine the actual column number. The \MP{} is a state-driven machine. As such, error detection within it is very similar to Xerces. However, reporting the correct line/column is a much more difficult problem. The \MP{} parses the content stream, which is a series of tagged UTF-16 strings. Each string is normalized in accordance with the XML specification. All symbol data and unnecessary whitespace are eliminated from the stream; thus it is impossible to derive the current location using only the content stream. To calculate the location, the \MP{} borrows three additional pieces of information from the \PS{}: the line-feed \bitstream{}, the skip mask, and a {\it deletion mask stream}, which is a \bitstream{} denoting the (code-unit) position of every datum that was suppressed from the source during the production of the content stream.
Armed with these, it is possible to calculate the actual line/column using the same system as the \PS{}, by advancing through the negated deletion mask stream until its population count equals the current position in the content stream.
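The deferred line/column calculation can be sketched in terms of the two streams above. This is an illustrative model only (the function names and the exact bookkeeping are assumptions): the line number is a population count of the line-feed stream up to the error position, and the column is the code-unit distance back to the last line feed, minus any positions recorded in the skip mask.

```python
def popcount_below(stream: int, pos: int) -> int:
    """Count 1 bits of a stream strictly below position pos."""
    return bin(stream & ((1 << pos) - 1)).count('1')

def line_column(pos: int, linefeeds: int, skip_mask: int):
    """Recover (line, column) for an error at code-unit position pos."""
    line = popcount_below(linefeeds, pos) + 1
    prior = linefeeds & ((1 << pos) - 1)
    last_lf = prior.bit_length() - 1        # -1 if no earlier line feed
    distance = pos - last_lf                # code units since last newline
    skipped = (popcount_below(skip_mask, pos)
               - popcount_below(skip_mask, last_lf + 1))
    return line, distance - skipped

# "ab\n" followed by a two-byte UTF-8 character at positions 3-4: byte 4
# is a trailing byte, so it sits in the skip mask and adds no column.
linefeeds = 1 << 2
skip_mask = 1 << 4
assert line_column(5, linefeeds, skip_mask) == (2, 2)
```

The population counts are only evaluated when an error actually occurs, which is the point of deferring the calculation.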
Based on the character class compiler, we are currently investigating the construction of a regular expression compiler that can implement bit-stream based parallel regular-expression matching, similar to that described previously for parallel parsing by bitstream addition. This compiler works under the assumption that bitstream regular-expression definitions are deterministic; no backtracking is permitted with the parallel bit stream representation. In XML applications, this compiler is primarily intended to enforce regular-expression constraints on string datatype specifications found in XML Schema.
Multithreading with Pipeline Parallelism

As discussed in Section \ref{background:xerces}, Xerces can be considered an FSM application. These are ``embarrassingly sequential'' \cite{Asanovic:EECS-2006-183} and notoriously difficult to parallelize. However, \icXML{} is designed to organize processing into logical layers. In particular, layers within the \PS{} are designed to operate over significant segments of input data before passing their outputs on for subsequent processing. This fits well into the general model of pipeline parallelism, in which each thread is in charge of a single module or group of modules. The most straightforward division of work in \icXML{} is to separate the \PS{} and the \MP{} into two separate stages. The resultant application, {\it\icXMLp{}}, is a coarse-grained software-pipeline application. In this case, the \PS{} thread $T_1$ reads 16k of XML input $I$ at a time, produces the content, symbol and URI streams, and then stores them in a pre-allocated shared data structure $S$. The \MP{} thread $T_2$ consumes $S$, performs well-formedness and grammar-based validation, and then provides the parsed XML data to the application through the Xerces API. The shared data structure is implemented using a ring buffer, in which every entry contains an independent set of data streams. In the examples of Figures \ref{threads_timeline1} and \ref{threads_timeline2}, the ring buffer has four entries. A lock-free mechanism is applied to ensure that each entry can be read or written by only one thread at a time. In Figure \ref{threads_timeline1} the processing time of $T_1$ is longer than that of $T_2$; thus $T_2$ always waits for $T_1$ to write to the shared memory. Figure \ref{threads_timeline2} illustrates the opposite scenario, in which $T_1$ is faster and must wait for $T_2$ to finish reading the shared data before it can reuse the memory space. Overall, our design is intended to benefit a range of applications.
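The two-stage producer/consumer structure can be sketched as follows. This is illustrative only: a bounded blocking queue stands in for icXML's lock-free ring buffer, and the thread bodies are placeholders for the real \PS{} and \MP{} work.

```python
import queue
import threading

ring = queue.Queue(maxsize=4)   # four entries, as in the timeline figures
results = []

def parabix_subsystem(segments):
    """T1: produce per-segment stream sets into the shared buffer."""
    for seg in segments:
        ring.put(('streams', seg))   # stand-in for content/symbol/URI streams
    ring.put(None)                   # signal end of input

def markup_processor():
    """T2: consume entries in order; stand-in for validation + API events."""
    while True:
        entry = ring.get()
        if entry is None:
            break
        kind, seg = entry
        results.append(seg.upper())

segments = ['<a>', '<b/>', '</a>']
t1 = threading.Thread(target=parabix_subsystem, args=(segments,))
t2 = threading.Thread(target=markup_processor)
t1.start(); t2.start()
t1.join(); t2.join()
assert results == ['<A>', '<B/>', '</A>']
```

The bounded capacity reproduces the back-pressure behaviour of the figures: whichever thread is faster blocks on the buffer until the other catches up.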
Conceptually, we consider two design points. In the first, the parsing performed by the \PS{} dominates at 67% of the overall cost, with the cost of application processing (including the driver logic within the \MP{}) at 33%. The second is almost the opposite scenario: the cost of application processing dominates at 60%, while the cost of XML parsing represents an overhead of 40%. Our design is predicated on a goal of using the Parabix framework to achieve a 50% to 100% improvement in the parsing engine itself. Consider the best-case scenario, a 100% improvement of the \PS{} for the design point in which XML parsing dominates at 67% of the total application cost. In this case, the single-threaded \icXML{} should achieve a 1.5x speedup over Xerces, so that the total application cost reduces to 67% of the original. In \icXMLp{}, moreover, this ideal scenario gives us two well-balanced threads, each performing about 33% of the original work. In this case, Amdahl's law predicts that we could expect up to a 3x speedup at best. At the other extreme of our design range, we consider an application in which core parsing cost is 40%. Assuming a 2x speedup of the \PS{} over the corresponding Xerces core, single-threaded \icXML{} delivers a 25% speedup. However, the most significant aspect of our two-stage multi-threaded design then becomes the ability to hide the entire latency of parsing within the serial time required by the application. In this case, we achieve an overall speedup in processing time of 1.67x. Although the structure of the \PS{} allows division of the work into several pipeline stages, and this has been demonstrated to be effective for four pipeline stages in a research prototype \cite{HPCA2012}, our analysis here suggests that further pipelining of work within the \PS{} is not worthwhile if the cost of application logic is as little as 33% of the end-to-end cost using Xerces.
To achieve benefits of further parallelization with multi-core technology, there would need to be reductions in the cost of application logic that could match reductions in core parsing cost.
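The figures above follow directly from the cost fractions stated in the text; a short worked check (illustrative, with invented function names):

```python
def single_thread_speedup(parse_frac, parse_speedup):
    """Amdahl's law: only the parsing fraction is accelerated."""
    return 1 / ((1 - parse_frac) + parse_frac / parse_speedup)

def pipelined_speedup(parse_frac, parse_speedup):
    """Two-stage pipeline: parsing overlaps the application work,
    so total time is bounded by the longer of the two stages."""
    return 1 / max(1 - parse_frac, parse_frac / parse_speedup)

# Design point 1: parsing 67% of cost, 2x parsing speedup.
assert round(single_thread_speedup(0.67, 2.0), 1) == 1.5   # ~1.5x
assert round(pipelined_speedup(0.67, 2.0), 1) == 3.0       # ~3x

# Design point 2: parsing 40% of cost, 2x parsing speedup.
assert round(single_thread_speedup(0.40, 2.0), 2) == 1.25  # 25% speedup
assert round(pipelined_speedup(0.40, 2.0), 2) == 1.67      # latency hidden
```
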
Unbounded Bit Stream Compilation
Performance

We evaluate \xerces{}, \icXML{} and \icXMLp{} against two benchmarking applications: the Xerces-C++ SAXCount sample application, and a real-world GML-to-SVG transformation application. We investigated XML parser performance using an Intel Core i7 quad-core (Sandy Bridge) processor (3.40GHz, 4 physical cores, 8 threads (2 per core), 32+32 kB (per core) L1 cache, 256 kB (per core) L2 cache, 8 MB L3 cache) running the 64-bit version of Ubuntu 12.04 (Linux). We analyzed the execution profiles of each XML parser using the performance counters found in the processor. We chose several key hardware events that provide insight into the profile of each application and indicate whether the processor is doing useful work: processor cycles, branch instructions, branch mispredictions, and cache misses. The Performance Application Programming Interface (PAPI) Version 5.5.0 \cite{papi} toolkit was installed on the test system to facilitate the collection of hardware performance monitoring statistics. In addition, we used the Linux perf \cite{perf} utility to collect per-core hardware events.
Xerces C++ SAXCount

Xerces comes with sample applications that demonstrate salient features of the parser. SAXCount is the simplest such application: it counts the elements, attributes and characters of a given XML file using the (event-based) SAX API and prints out the totals. Table \ref{XMLDocChars} shows the document characteristics of the XML input files selected for the Xerces-C++ SAXCount benchmark. The jaw.xml file represents document-oriented XML inputs and contains the three-byte and four-byte UTF-8 sequences required for the UTF-8 encoding of Japanese characters. The remaining data files are data-oriented XML documents and consist entirely of single-byte-encoded ASCII characters. A key predictor of the overall parsing performance of an XML file is markup density\footnote{Markup density: the ratio of markup bytes used to define the structure of the document to its file size.}. This metric has substantial influence on the performance of traditional recursive-descent XML parsers because it directly corresponds to the number of state transitions that occur when parsing a document. We use a mixture of document-oriented and data-oriented XML files to analyze performance over a spectrum of markup densities. Figure \ref{perf_SAX} compares the performance of Xerces, \icXML{} and pipelined \icXML{} in terms of CPU cycles per byte for the SAXCount application. The speedup for \icXML{} over Xerces is 1.3x to 1.8x. With two threads on the multicore machine, \icXMLp{} can achieve a speedup of up to 2.7x. Xerces is substantially slowed by dense markup, but \icXML{} is less affected through a reduction in branches and the use of parallel-processing techniques. \icXMLp{} performs better as markup density increases because the work performed by each stage is well balanced in this application.
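The markup-density metric defined in the footnote can be computed with a simple sketch. This is an illustration under a simplifying assumption: markup bytes are taken to be everything from a \verb|<| through the matching \verb|>|, ignoring entity references, CDATA sections and the like.

```python
def markup_density(doc: bytes) -> float:
    """Fraction of bytes spent on markup ('<' ... '>') vs. file size."""
    markup = 0
    in_markup = False
    for ch in doc:
        if ch == ord('<'):
            in_markup = True
        if in_markup:
            markup += 1
        if ch == ord('>'):
            in_markup = False
    return markup / len(doc)

# 7 markup bytes ('<a>' and '</a>') out of 11 total.
assert markup_density(b'<a>xxxx</a>') == 7 / 11
```

A data-oriented file with short content runs scores high on this metric, while a document-oriented file with long text passages scores low.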
GML2SVG
The catalog of XML bit streams presented earlier consists of a set of abstract, unbounded bit streams, each in one-to-one correspondence with the input bytes of a text file. Determining how these bit streams are implemented using fixed-width SIMD registers, and possibly processed in fixed-length buffers that represent some multiple of the register width, is a source of considerable programming complexity. The general goal of our compilation strategy in this case is to allow operations to be programmed in terms of unbounded bit streams and then automatically reduced to efficient low-level code, with the application of a systematic code generation strategy for handling block and buffer boundary crossings. This work is currently in progress.
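The boundary-crossing bookkeeping such a compiler must generate can be sketched for the simplest case, a shift by one position. This is an illustration only: an Advance on the unbounded stream becomes a per-block shift plus a carry bit threaded across block boundaries, which is exactly the kind of mechanical code the strategy aims to automate. The names and the 64-bit block width are assumptions of the sketch.

```python
W = 64
MASK = (1 << W) - 1

def advance_blocks(blocks):
    """Shift a stream, stored as W-bit blocks, forward by one position."""
    out, carry = [], 0
    for blk in blocks:
        out.append(((blk << 1) | carry) & MASK)
        carry = blk >> (W - 1)          # the high bit crosses the boundary
    return out

# A 1 in the top bit of block 0 must reappear in the bottom bit of block 1.
blocks = [1 << (W - 1), 0]
assert advance_blocks(blocks) == [0, 1]
```

Operations such as scan-through additionally require the addition carry, not just the shift carry, to cross block boundaries, which is why hand-writing this logic for every stream equation is error-prone.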
Conclusion and Future Work This paper is the first case study documenting the significant performance benefits that may be realized through the integration of parallel \bitstream{} technology into existing widely-used software libraries. In the case of the Xerces-C++ XML parser, the combined integration of SIMD and multicore parallelism was shown to be capable of producing dramatic increases in throughput and reductions in branch mispredictions and cache misses. The modified parser, going under the name \icXML{}, is designed to provide the full functionality of the original Xerces library with complete compatibility of APIs.  Although substantial re-engineering was required to realize the performance potential of parallel technologies, this is an important case study demonstrating the general feasibility of these techniques. The further development of \icXML{} to move beyond 2-stage pipeline parallelism is ongoing, with realistic prospects for four reasonably balanced stages within the library.  For applications such as GML2SVG, which are dominated by time spent on XML parsing, such a multistage pipelined parsing library should offer substantial benefits. The example of XML parsing may be considered prototypical of finite-state machine applications, which have sometimes been considered ``embarrassingly sequential'' and so difficult to parallelize that ``nothing works.''  The case study presented here should thus be considered an important data point in making the case that parallelization can indeed be helpful across a broad array of application types. To overcome the software engineering challenges in applying parallel \bitstream{} technology to existing software systems, it is clear that better library and tool support is needed.
The techniques used in the implementation of \icXML{} and documented in this paper could well be generalized for applications in other contexts and automated through the creation of compiler technology specifically supporting parallel \bitstream{} programming.
Conclusion Parallel bit stream technology offers the opportunity to dramatically speed up the core XML processing components used to implement virtually any XML API. Character validation and transcoding, whitespace processing, and parsing up to and including the full validation of tag syntax can be handled fully in parallel using bit stream methods. Bit streams marking the positions of all element names, attribute names and attribute values can also be produced, followed by fast bit scan operations to generate position and length values. Beyond bit streams, byte-oriented SIMD processing of names and numerals can also accelerate performance beyond sequential byte-at-a-time methods. Advances in processor architecture are likely to further amplify the performance of parallel bit stream technology over traditional byte-at-a-time processing over the next decade. Improvements to SIMD register width, register complement and operation format can all result in further gains. New SIMD instruction set features such as inductive doubling support, parallel extract and deposit instructions, bit interleaving and scatter/gather capabilities should also result in significant speed-ups. Leveraging the intraregister parallelism of parallel bit stream technology within SIMD registers to take advantage of intrachip parallelism on multicore processors should accelerate processing further. Technology transfer using a patent-based open-source business model is a further goal of our work, with a view to widespread deployment of parallel bit stream technology in XML processing stacks implementing a variety of APIs. The feasibility of substantial performance improvement in replacing technology implementing existing APIs has been demonstrated even in complex software architectures involving delivery of performance benefits across the JNI boundary.
We are seeking to accelerate these deployment efforts both through the development of compiler technology to reliably apply these methods across a variety of architectures and through the identification of interested collaborators using open-source or commercial models.
Bibliography