# Changeset 3039

Timestamp:
Apr 18, 2013, 7:02:21 PM
Message:

Initial translation. Special characters, figures, tables, bib, to go.

Location:
docs/Balisage13/Bal2013came0601
Files:
2 edited

• docs/Balisage13/Bal2013came0601/Bal2013came0601.html


Parallel Bit Stream Technology as a Foundation for XML Parsing Performance

Abstract

By first transforming the octets (bytes) of XML texts into eight parallel bit streams, the SIMD features of commodity processors can be exploited for parallel processing of blocks of 128 input bytes at a time. Established transcoding and parsing techniques are reviewed followed by new techniques including parsing with bitstream addition. Further opportunities are discussed in light of expected advances in CPU architecture and compiler technology. Implications for various APIs and information models are presented, as well as opportunities for collaborative open-source development.

Prior research on the acceleration of XML processing using SIMD and multi-core parallelism has led to a number of interesting research prototypes. This work investigates the extent to which the techniques underlying these prototypes could result in systematic performance benefits when fully integrated into a commercial XML parser. The widely used Xerces-C++ parser of the Apache Software Foundation was chosen as the foundation for the study. A systematic restructuring of the parser was undertaken, while maintaining the existing API for application programmers. Using SIMD techniques alone, an increase in parsing speed of at least 50% was observed in a range of applications. When coupled with pipeline parallelism on dual core processors, improvements of 2x and beyond were realized.

Introduction
A Catalog of Parallel Bit Streams for XML
Introduction
Background
Introduction
Basis Bit Streams
General Streams
Xerces C++ Structure
The Parabix Framework
Architecture
Error Flag Streams
Overview
Combined Parallel Filtering
Content Stream
Namespace Handling
Error Handling
Lexical Item Streams
UTF-8 Byte Classification, Scope and Validation Streams
Performance
UTF-8 Byte Classification Streams
UTF-8 Scope Streams
UTF-8 Validation Streams
XML Character Validation Streams
UTF-8 to UTF-16 Transcoding
UTF-8 Indexed UTF-16 Streams
Xerces C++ SAXCount
GML2SVG
Control Character Streams
XML Character Validation
XML 1.0 End-of-line Handling
Call Out Streams
Comment, Processing Instruction and CDATA Section Call Out Streams
Reference Call Out Streams
Tag Call Out Streams
SIMD Beyond Bitstreams: Names and Numbers
Name Lookup
Numeric Processing
APIs and Parallel Bit Streams
The ILAX Streaming API
Efficient XML in Java Using Array Set Models
Saxon-B TinyTree Example
Compiler Technology
Character Class Compiler
Regular Expression Compilation
Unbounded Bit Stream Compilation
Conclusion
Acknowledgments
Conclusion and Future Work

Introduction

While particular XML applications may benefit from special-purpose hardware such as XML chips [Leventhal and Lemoine 2009] or appliances [Salz, Achilles and Maze 2009], the bulk of the world's XML processing workload will continue to be handled by XML software stacks on commodity processors. Exploiting the SIMD capabilities of such processors, such as the SSE instructions of x86 chips, parallel bit stream technology offers the potential of dramatic improvement over byte-at-a-time processing for a variety of XML processing tasks. Character set issues such as Unicode validation and transcoding [Cameron 2007], normalization of line breaks and white space and XML character validation can be handled fully in parallel using this representation. Lexical item streams, such as the bit stream marking the positions of opening angle brackets, can also be formed in parallel. Bit-scan instructions of commodity processors may then be used on lexical item streams to implement rapid single-instruction scanning across variable-length multi-byte text blocks as in the Parabix XML parser [Cameron, Herdy and Lin 2008]. Overall, these techniques may be combined to yield end-to-end performance that may be 1.5X to 15X faster than alternatives [Herdy, Burggraf and Cameron 2008].

Continued research in parallel bit stream techniques as well as more conventional application of SIMD techniques in XML processing offers further prospects for improvement of core XML components as well as for tackling performance-critical tasks further up the stack. A newly prototyped technique for parallel tag parsing using bitstream addition is expected to improve parsing performance even beyond that achieved using sequential bit scans. Several techniques for improved symbol table performance are being investigated, including parallel hash value calculation and length-based sorting using the cheap length determination afforded by bit scans. To deliver the benefits of parallel bit stream technology to the Java world, we are developing Array Set Model (ASM) representations of XML Infoset and other XML information models for efficient transmission across the JNI boundary.

Amplifying these software advances, continuing hardware advances in commodity processors increase the relative advantage of parallel bit stream techniques over traditional byte-at-a-time processors. For example, the Intel Core architecture improved SSE processing to give superscalar execution of bitwise logic operations (3 instructions per cycle vs. 1 in Pentium 4). Upcoming 256-bit AVX technology extends the register set and replaces destructive two-operand instructions with a nondestructive three-operand form. General-purpose programming on graphics processing units (GPGPU) such as the upcoming 512-bit Larrabee processor may also be useful for XML applications using parallel bit streams. New instruction set architectures may also offer dramatic improvements in core algorithms. Using the relatively simple extensions to support the principle of inductive doubling, a 3X improvement in several core parallel bit stream algorithms may be achieved [Cameron and Lin 2009]. Other possibilities include direct implementation of parallel extract and parallel deposit (pex/pdep) instructions [Hilewitz and Lee 2006], and bit-level interleave operations as in Larrabee, each of which would have important application to parallel bit stream processing.

Further prospects for XML performance improvement arise from leveraging the intraregister parallelism of parallel bit stream technology to exploit the interchip parallelism of multicore computing. Parallel bit stream techniques can support multicore parallelism in both data partitioning and task partitioning models. For example, the data-section partitioning approach of Wu, Zhang, Yu and Li may be used to partition blocks for speculative parallel parsing on separate cores followed by a postprocessing step to join partial S-trees [Wu et al. 2008].

In our view, the established and expected performance advantages of parallel bit stream technology over traditional byte-at-a-time processing are so compelling that parallel bit stream technology should ultimately form the foundation of every high-performance XML software stack. We envision a common high-performance XML kernel that may be customized to a variety of processor architectures and that supports a wide range of existing and new XML APIs. Widespread deployment of this technology should greatly benefit the XML community in addressing both the deserved and undeserved criticism of XML on performance grounds. A further benefit of improved performance is a substantial greening of XML technologies.

To complement our research program investigating fundamental algorithms and issues in high-performance XML processing, our work also involves development of open source software implementing these algorithms, with a goal of full conformance to relevant specifications. From the research perspective, this approach is valuable in ensuring that the full complexity of required XML processing is addressed in reporting and assessing processing results. However, our goal is also to use this open source software as a basis of technology transfer. A Simon Fraser University spin-off company, called International Characters, Inc., has been created to commercialize the results of this work using a patent-based open source model.

To date, we have not yet been successful in establishing a broader community of participation with our open source code base. Within open-source communities, there is often a general antipathy towards software patents; this may limit engagement with our technology, even though it has been dedicated for free use in open source.

A further complication is the inherent difficulty of SIMD programming in general, and parallel bit stream programming in particular. Considerable work is required with each new algorithmic technique being investigated as well as in retargeting our techniques for each new development in SIMD and multicore processor technologies. To address these concerns, we have increasingly shifted the emphasis of our research program towards compiler technology capable of generating parallel bit stream code from higher-level specifications.

A Catalog of Parallel Bit Streams for XML

Introduction

In this section, we introduce the fundamental concepts of parallel bit stream technology and present a comprehensive catalog of parallel bit streams for use in XML processing. In presenting this catalog, the focus is on the specification of the bit streams as data streams in one-to-one correspondence with the character code units of an input XML stream. The goal is to define these bit streams in the abstract without initially considering memory layouts, register widths or other issues related to particular target architectures. In cataloging these techniques, we also hope to convey a sense of the breadth of applications of parallel bit stream technology to XML processing tasks.

Basis Bit Streams

Given a byte-oriented text stream represented in UTF-8, for example, we define a transform representation of this text consisting of a set of eight parallel bit streams for the individual bits of each byte. Thus, the Bit0 stream is the stream of bits consisting of bit 0 of each byte in the input byte stream, Bit1 is the bit stream consisting of bit 1 of each byte in the input stream and so on. The set of streams Bit0 through Bit7 are known as the basis streams of the parallel bit stream representation. The following table shows an example XML character stream together with its representation as a set of 8 basis streams.
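The transposition to basis streams can be sketched sequentially as follows, using a 64-bit word per stream in place of a 128-bit SIMD register; the bit numbering (Bit0 as the most significant bit of each byte) is assumed from the representation described above, and this byte-at-a-time loop only models the semantics that the SIMD algorithms achieve in parallel:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <string>

// Extract the eight basis bit streams from a block of up to 64 bytes.
// basis[k] holds Bitk of every input byte; bit i of basis[k] is Bitk of
// byte i. Bit0 is taken as the most significant bit of each byte.
std::array<uint64_t, 8> transpose(const std::string& block) {
    std::array<uint64_t, 8> basis{};
    for (size_t i = 0; i < block.size() && i < 64; ++i) {
        uint8_t byte = static_cast<uint8_t>(block[i]);
        for (int k = 0; k < 8; ++k) {
            if (byte & (0x80u >> k))   // test Bitk of this byte
                basis[k] |= (1ULL << i);
        }
    }
    return basis;
}
```

For the input "<a>" (0x3C, 0x61, 0x3E), Bit2 is set in all three bytes, so the Bit2 stream has its low three bits set.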

Table I

XML Character Stream Transposition.

Depending on the features of a particular processor architecture, there are a number of algorithms for transposition to parallel bit stream form. Several of these algorithms employ a three-stage structure. In the first stage, the input byte stream is divided into a pair of half-length streams consisting of four bits for each byte, for example, one stream for the high nybble of each byte and another for the low nybble of each byte. In the second stage, these streams of four bits per byte are each divided into streams consisting of two bits per original byte, for example streams for the Bit0/Bit1, Bit2/Bit3, Bit4/Bit5, and Bit6/Bit7 pairs. In the final stage, the streams are further subdivided into the individual bit streams.

Using SIMD capabilities, this process is quite efficient, with an amortized cost of 1.1 CPU cycles per input byte on Intel Core 2 with SSE, or 0.6 CPU cycles per input byte on Power PC G4 with Altivec. With future advances in processor technology, this transposition overhead is expected to be reduced, possibly taking advantage of upcoming parallel extract (pex) instructions on Intel technology. In the ideal, only 24 instructions are needed to transform a block of 128 input bytes using 128-bit SSE registers using the inductive doubling instruction set architecture, representing an overhead of less than 0.2 instructions per input byte.

General Streams

This section describes bit streams which support basic processing operations.

DelMask (deletion mask) streams mark character code unit positions for deletion. Since the deletion operation is dependency free across many stages of XML processing, it is possible to simply mark and record deletion positions as deletion mask streams for future processing. A single invocation of a SIMD based parallel deletion algorithm can then perform the deletion of positions accumulated across a number of stages through a bitwise ORing of deletion masks. For example, deletion arises in the replacement of predefined entities with a single character, such as in the replacement of the &amp; entity with the & character. Deletion also arises in XML end-of-line handling, and CDATA section delimiter processing. Several algorithms to delete bits at positions marked by DelMask are possible [Cameron 2008].
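The semantics of deletion under a DelMask can be modeled sequentially as below; a SIMD parallel deletion algorithm produces the same result without the byte-at-a-time loop, so this is a sketch of the contract rather than the Parabix implementation:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Apply a deletion mask to a block of up to 64 bytes: bit i of delMask
// set means byte i is deleted from the output.
std::string apply_delmask(const std::string& data, uint64_t delMask) {
    std::string out;
    for (size_t i = 0; i < data.size() && i < 64; ++i)
        if (!((delMask >> i) & 1))
            out += data[i];
    return out;
}
```

For example, replacing the entity in "a&amp;b" with the & character leaves positions 2 through 5 ("amp;") marked for deletion, giving the mask 0x3C and the output "a&b".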

The following table provides an example of generating a DelMask in the context of bit stream based parsing of well-formed character references and predefined entities. The result is the generation of a DelMask stream.

Table II

Error Flag Streams

Error flag streams indicate the character code unit positions of syntactical errors. XML processing examples which benefit from the marking of error positions include UTF-8 character sequence validation and XML parsing [Cameron 2008].

The following table provides an example of using bit streams to parse character references and predefined entities which fail to meet the XML 1.0 well-formedness constraints. The result is the generation of an error flag stream that marks the positions of malformed decimal and hexadecimal character references respectively.

Table III

Error Flag Stream Generation

Lexical Item Streams

Lexical item streams differ from traditional streams of tokens in that they are bit streams that mark the positions of tokens, whitespace or delimiters. Additional bit streams, such as the reference streams and call out streams, are subsequently constructed based on the information held within the set of lexical item streams. Differentiation between the actual tokens that may occur at a particular point (e.g., the different XML tokens that begin with '<') may be performed using multicharacter recognizers on the byte stream representation [Cameron, Herdy and Lin 2008].

A key role of lexical item streams in XML parsing is to facilitate fast scanning operations. For example, a left angle bracket lexical item stream may be formed to identify those character code unit positions at which a '<' character occurs. Hardware register bit scan operations may then be used by the XML parser on the left angle bracket stream to efficiently identify the position of the next '<'. Based on the capabilities of current commodity processors, a single register bit scan operation may effectively scan up to 64 byte positions with a single instruction.
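A minimal sketch of this scanning pattern over a 64-byte block follows; the lexical item stream is built with a plain loop here (a bit stream implementation derives it from the basis streams), while the scan itself uses a count-trailing-zeros operation that compiles to a single hardware bit scan instruction:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Build a left angle bracket lexical item stream for up to 64 bytes.
uint64_t angle_stream(const std::string& s) {
    uint64_t m = 0;
    for (size_t i = 0; i < s.size() && i < 64; ++i)
        if (s[i] == '<') m |= 1ULL << i;
    return m;
}

// Return the position of the next '<' at or after `from`, or -1.
// __builtin_ctzll maps to a bit scan forward instruction (GCC/Clang).
int next_angle(uint64_t stream, int from) {
    if (from >= 64) return -1;
    uint64_t rest = stream & (~0ULL << from);  // clear positions before `from`
    if (rest == 0) return -1;
    return __builtin_ctzll(rest);
}
```

For the input "<a><b>", the stream has bits 0 and 3 set, so successive scans yield positions 0 and 3 and then report exhaustion.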

Overall, the construction of the full set of lexical item stream computations requires approximately 1.0 CPU cycles per byte when implemented for 128 positions at a time using 128-bit SSE registers on Intel Core2 processors [Cameron, Herdy and Lin 2008]. The following table defines the core lexical item streams defined by the Parabix XML parser.

Table IV

Lexical item stream descriptions.

The following illustrates a number of the lexical item streams.

Table V

Lexical Item Streams

UTF-8 Byte Classification, Scope and Validation Streams

An XML parser must accept the UTF-8 encoding of Unicode [XML 1.0]. It is a fatal error if an XML document determined to be in UTF-8 contains byte sequences that are not legal in that encoding. UTF-8 byte classification, scope, XML character validation and error flag bit streams are defined to validate UTF-8 byte sequences and support transcoding to UTF-16.

UTF-8 Byte Classification Streams

UTF-8 byte classification bit streams classify UTF-8 bytes based on their role in forming single and multibyte sequences. The u8Prefix and u8Suffix bit streams identify bytes that represent, respectively, prefix or suffix bytes of multibyte UTF-8 sequences. The u8UniByte bit stream identifies those bytes that may be considered single-byte sequences. The u8Prefix2, u8Prefix3, and u8Prefix4 streams refine u8Prefix, indicating prefixes of two-, three- or four-byte sequences respectively.
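The byte classes can be captured as below, again with one 64-bit word per stream; a bit stream implementation computes the same streams with boolean operations over the basis streams rather than a byte loop:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

struct U8Class {
    uint64_t u8UniByte = 0, u8Prefix = 0, u8Suffix = 0;
    uint64_t u8Prefix2 = 0, u8Prefix3 = 0, u8Prefix4 = 0;
};

// Classify each byte of a block (up to 64 bytes) by its UTF-8 role.
U8Class classify(const std::string& s) {
    U8Class c;
    for (size_t i = 0; i < s.size() && i < 64; ++i) {
        uint8_t b = static_cast<uint8_t>(s[i]);
        uint64_t bit = 1ULL << i;
        if (b < 0x80)          c.u8UniByte |= bit;   // 0xxxxxxx
        else if (b < 0xC0)     c.u8Suffix  |= bit;   // 10xxxxxx
        else {
            c.u8Prefix |= bit;
            if (b < 0xE0)      c.u8Prefix2 |= bit;   // 110xxxxx
            else if (b < 0xF0) c.u8Prefix3 |= bit;   // 1110xxxx
            else               c.u8Prefix4 |= bit;   // 11110xxx
        }
    }
    return c;
}
```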

UTF-8 Scope Streams

Scope streams represent expectations established by UTF-8 prefix bytes. For example, the u8Scope22 bit stream represents the positions at which the second byte of a two-byte sequence is expected based on the occurrence of a two-byte prefix in the immediately preceding position. The u8Scope32, u8Scope33, u8Scope42, u8Scope43, and u8Scope44 streams complete the set of UTF-8 scope streams.
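Given the prefix classification streams, the scope streams follow from simple shifts: a shift by one marks "the next position" (the stream-wise Advance operation; within a single 64-bit block it is just a left shift). This is a sketch of that derivation:

```cpp
#include <cstdint>

struct U8Scopes {
    uint64_t u8Scope22, u8Scope32, u8Scope33,
             u8Scope42, u8Scope43, u8Scope44;
};

// Derive all UTF-8 scope streams from the prefix classification streams.
// Each shift moves the expectation one position past the prefix byte.
U8Scopes scopes(uint64_t u8Prefix2, uint64_t u8Prefix3, uint64_t u8Prefix4) {
    U8Scopes s;
    s.u8Scope22 = u8Prefix2 << 1;   // 2nd byte of a 2-byte sequence
    s.u8Scope32 = u8Prefix3 << 1;   // 2nd byte of a 3-byte sequence
    s.u8Scope33 = s.u8Scope32 << 1; // 3rd byte of a 3-byte sequence
    s.u8Scope42 = u8Prefix4 << 1;   // 2nd byte of a 4-byte sequence
    s.u8Scope43 = s.u8Scope42 << 1; // 3rd byte of a 4-byte sequence
    s.u8Scope44 = s.u8Scope43 << 1; // 4th byte of a 4-byte sequence
    return s;
}
```

A full implementation must also carry the high bit of each shift across block boundaries; that bookkeeping is omitted here.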

The following example demonstrates the UTF-8 character encoding validation process using parallel bit stream techniques. The result of this validation process is an error flag stream identifying the positions at which errors occur.

Table VI

UTF-8 Scope Streams

UTF-8 Validation Streams

Proper formation of UTF-8 byte sequences requires that the correct number of suffix bytes always follow a UTF-8 prefix byte, and that certain illegal byte combinations are ruled out. For example, sequences beginning with the prefix bytes 0xF5 through 0xFF are illegal as they would represent code point values above 0x10FFFF. In addition, there are constraints on the first suffix byte following certain special prefixes, namely that a suffix following the prefix 0xE0 must fall in the range 0xA0-0xBF, a suffix following the prefix 0xED must fall in the range 0x80-0x9F, a suffix following the prefix 0xF0 must fall in the range 0x90-0xBF and a suffix following the prefix 0xF4 must fall in the range 0x80-0x8F. The task of ensuring that each of these constraints holds is known as UTF-8 validation. The bit streams xE0, xED, xF0, xF4, xA0_xBF, x80_x9F, x90_xBF, and x80_x8F are constructed to flag the aforementioned UTF-8 validation errors. The result of UTF-8 validation is a UTF-8 error flag bit stream constructed as the ORing of a series of UTF-8 validation tests.

XML Character Validation Streams

The UTF-8 character sequences 0xEF 0xBF 0xBE and 0xEF 0xBF 0xBF correspond to the Unicode code points 0xFFFE and 0xFFFF respectively. In XML 1.0, 0xFFFE and 0xFFFF represent characters outside the legal XML character ranges. As such, bit streams which mark the 0xEF, 0xBF, and 0xBE bytes are constructed to flag illegal UTF-8 character sequences.

UTF-8 to UTF-16 Transcoding

UTF-8 is often preferred for storage and data exchange; although it is suitable for processing, it is significantly more complex to process than UTF-16 [Unicode]. As such, XML documents are typically encoded in UTF-8 for serialization and transport, and subsequently transcoded to UTF-16 for processing with programming languages such as Java and C#. Following the parallel bit stream methods developed for u8u16, a high-performance standalone UTF-8 to UTF-16 transcoder [Cameron 2008], transcoding to UTF-16 may be achieved by computing a series of 16 bit streams, one for each of the individual bits of a UTF-16 code unit.

The bit streams for UTF-16 are conveniently divided into groups: the eight streams u16Hi0, u16Hi1, ..., u16Hi7 for the high byte of each UTF-16 code unit and the eight streams u16Lo0, u16Lo1, ..., u16Lo7 for the low byte. Upon conversion of the parallel bit stream data back to byte streams, eight sequential byte streams U16Hi0, U16Hi1, ..., U16Hi7 are used for the high byte of each UTF-16 code unit, while U16Lo0, U16Lo1, ..., U16Lo7 are used for the corresponding low byte. Interleaving these streams then produces the full UTF-16 doublebyte stream.
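The final interleaving step can be sketched as below, combining the recovered high- and low-byte streams into UTF-16 code units (shown here as 16-bit values rather than a serialized byte order):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Interleave the high-byte and low-byte streams of a UTF-16 stream into
// full 16-bit code units.
std::vector<uint16_t> interleave(const std::vector<uint8_t>& hi,
                                 const std::vector<uint8_t>& lo) {
    std::vector<uint16_t> out;
    for (size_t i = 0; i < hi.size() && i < lo.size(); ++i)
        out.push_back(static_cast<uint16_t>(hi[i]) << 8 | lo[i]);
    return out;
}
```

In SIMD form this corresponds to byte merge (unpack) operations over whole registers rather than an element loop.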

UTF-8 Indexed UTF-16 Streams

UTF-16 bit streams are initially defined in UTF-8 indexed form, that is, with sets of bits in one-to-one correspondence with UTF-8 bytes. However, only one set of UTF-16 bits is required for encoding two- or three-byte UTF-8 sequences and only two sets are required for surrogate pairs corresponding to four-byte UTF-8 sequences. The u8LastByte (the union of u8UniByte, u8Scope22, u8Scope33, and u8Scope44) and u8Scope42 streams mark the positions at which the correct UTF-16 bits are computed. The bit sets at other positions must be deleted to compress the streams to the UTF-16 indexed form.

Control Character Streams

The control character bit streams mark ASCII control characters in the range 0x00-0x1F. Additional control character bit streams mark the tab, carriage return, line feed, and space characters. In addition, a bit stream marking carriage return line feed combinations is also constructed. Presently, control character bit streams support the operations of XML 1.0 character validation and XML end-of-line handling.

XML Character Validation

Legal characters in XML are the tab, carriage return, and line feed characters, together with all Unicode characters excluding the surrogate blocks and the code points 0xFFFE and 0xFFFF [XML 1.0]. The x00_x1F bit stream is constructed and used in combination with the additional control character bit streams to flag the positions of illegal control characters.
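The control character portion of this check reduces to one boolean operation once the streams exist: the illegal positions are x00_x1F with the three legal control characters masked off. A sketch, with the streams built by a byte loop for illustration:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Flag illegal control characters in a block of up to 64 bytes:
// everything in 0x00-0x1F except tab, carriage return, and line feed.
uint64_t illegal_control(const std::string& s) {
    uint64_t x00_x1F = 0, legal = 0;
    for (size_t i = 0; i < s.size() && i < 64; ++i) {
        uint8_t b = static_cast<uint8_t>(s[i]);
        if (b < 0x20) x00_x1F |= 1ULL << i;
        if (b == '\t' || b == '\r' || b == '\n') legal |= 1ULL << i;
    }
    return x00_x1F & ~legal;   // the actual validation step
}
```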

XML 1.0 End-of-line Handling

In XML 1.0 the two-character sequence CR LF (carriage return, line feed) as well as any CR character not followed by a LF character must be converted to a single LF character [XML 1.0].

By defining carriage return, line feed, and carriage return line feed bit streams, denoted CR, LF and CRLF respectively, end-of-line normalization processing can be performed in parallel using only a small number of logical and shift operations.

The following example demonstrates the generation of the CRLF deletion mask. In this example, the positions of all CR characters followed by LF characters are marked for deletion. Isolated carriage returns are then replaced with LF characters. Completion of this process satisfies the XML 1.0 end-of-line handling requirements. For clarity, this example encodes input data carriage returns as C characters, whereas line feed characters are shown as L characters.

Table VII

XML 1.0 End-of-line Handling
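The end-of-line normalization just described can be sketched over a 64-byte block as follows; the byte loops merely build and apply the streams, while the normalization logic itself is the pair of shift-and-mask operations in the middle:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// XML 1.0 end-of-line handling for a block of up to 64 bytes:
// CR LF pairs lose the CR; isolated CRs become LF.
std::string normalize_eol(const std::string& s) {
    uint64_t CR = 0, LF = 0;
    for (size_t i = 0; i < s.size() && i < 64; ++i) {
        if (s[i] == '\r') CR |= 1ULL << i;
        if (s[i] == '\n') LF |= 1ULL << i;
    }
    uint64_t crlfDel = CR & (LF >> 1);  // CR immediately before LF: delete
    uint64_t crIso   = CR & ~crlfDel;   // isolated CR: rewrite as LF
    std::string out;
    for (size_t i = 0; i < s.size() && i < 64; ++i) {
        if ((crlfDel >> i) & 1) continue;
        out += ((crIso >> i) & 1) ? '\n' : s[i];
    }
    return out;
}
```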

Call Out Streams

Call out bit streams mark the extents of XML markup structures such as comments, processing instructions and CDATA sections as well as physical structures such as character and entity references and general references. Call out streams are also formed for logical markup structures such as start tags, end tags and empty element tags.

Comment, Processing Instruction and CDATA Section Call Out Streams

The comment, processing instruction and CDATA section call out streams, Ct_Span, PI_Span and CD_Span respectively, define sections of an XML document which contain markup that is not interpreted by an XML processor. As such, the union of the Ct_Span, PI_Span and CD_Span streams defines the regions of non-interpretable markup. The stream formed by this union is termed the CtCDPI_Mask.

The following table provides an example of constructing the CtCDPI_Mask.

Table VIII

With all non-interpretable markup removed, several phases of parallel bit stream based SIMD operations may follow, operating on up to 128 byte positions at a time on current commodity processors and assured of XML markup relevancy. For example, once the extents of comments, processing instructions and CDATA sections have been identified, XML names may be identified and length-sorted for efficient symbol table construction.

As an aside, comments and CDATA sections must first be validated to ensure that comments do not contain "--" sequences and that CDATA sections do not contain illegal "]]>" sequences prior to ignorable markup stream generation.

Reference Call Out Streams

The reference call out streams are the GenRefs, DecRefs, and HexRefs streams. This subset of the call out streams marks the extents of all but the closing semicolon of general and character references.

Predefined character references (&lt;, &gt;, &amp;, &apos;, &quot;) and numeric character references (&#nnnn;, &#xhhhh;) must be replaced by a single character [XML 1.0]. As previously shown, this subset of call out streams enables the construction of a DelMask for references.

Tag Call Out Streams

Whereas sequential bit scans over lexical item streams form the basis of XML parsing in the current Parabix parser, a new method of parallel parsing has been developed and prototyped using the concept of bitstream addition. Fundamental to this method is the concept of a cursor stream, a bit stream marking the positions of multiple parallel parses currently in process.
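The core cursor-advance step of bitstream addition can be modeled with one operation, commonly written as ScanThru (the name is assumed here; the source does not give the operator names). Adding the cursor stream to a stream marking the spans to be scanned carries each cursor past its contiguous run of span bits in a single addition, advancing all in-flight parses at once:

```cpp
#include <cstdint>

// Advance every cursor bit through its contiguous run of span bits.
// For a cursor at the start of a run in `span`, the addition carries
// through the run and the mask clears the span, leaving a single bit
// just past the run: ScanThru(c, m) = (c + m) & ~m.
uint64_t scan_thru(uint64_t cursors, uint64_t span) {
    return (cursors + span) & ~span;
}
```

For example, with name-character spans at positions 1-2 and 6-7 (span = 0x42 + 0x84... i.e. 0xC6) and cursors at the two name starts (0x42), a single call advances both cursors to positions 3 and 8 simultaneously.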

The results of parallel parsing using the bit stream addition technique produces a set of tag call out bit streams. These streams mark the extents of each start tag, end tag and empty element tag. Within tags, additional streams mark start and end positions for tag names, as well as attribute names and values. An error flag stream marks the positions of any syntactic errors encountered during parsing.

The set of tag call out streams consists of the ElemNames, AttNames, AttVals, Tags, EmptyTagEnds and EndTags bit streams. The following example demonstrates the bit stream output produced from parallel parsing using bit stream addition.

Table IX

Tag Call Out Streams

SIMD Beyond Bitstreams: Names and Numbers

Whereas the fundamental innovation of our work is the use of SIMD technology in implementing parallel bit streams for XML, there are also important ways in which more traditional byte-oriented SIMD operations can be useful in accelerating other aspects of XML processing.

Name Lookup

Efficient symbol table mechanisms for looking up element and attribute names are important for almost all XML processing applications. It is also an important technique merely for assessing well-formedness of an XML document; rather than validating the character-by-character composition of each occurrence of an XML name as it is encountered, it is more efficient to validate all but the first occurrence by first determining whether the name already exists in a table of prevalidated names.

The first symbol table mechanism deployed in the Parabix parser simply used the hashmaps of the C++ standard template library, without deploying any SIMD technology. However, with the overhead of character validation, transcoding and parsing dramatically reduced by parallel bit stream technology, we found that symbol lookups then accounted for about half of the remaining execution time in a statistics gathering application [Cameron, Herdy and Lin 2008]. Thus, symbol table processing was identified as a major target for further performance improvement.

Our first effort to improve symbol table performance was to employ the splash tables with cuckoo hashing as described by Ross [Ross 2006], using SIMD technology for parallel bucket processing. Although this technique did turn out to have the advantage of virtually constant-time performance even for very large vocabularies, it was not particularly helpful for the relatively small vocabularies typically found in XML document processing.

However, a second approach has been found to be quite useful, taking advantage of parallel bit streams for cheap determination of symbol length. In essence, the length of a name can be determined very cheaply using a single bit scan operation. This then makes it possible to use length-sorted symbol table processing, as follows. First, the occurrences of all names are stored in arrays indexed by length. Then the length-sorted arrays may each be inserted into the symbol table in turn. The advantage of this is that a separate loop may be written for each length. Length sorting makes for very efficient name processing. For example, hash value computations and name comparisons can be made by loading multibyte values and performing appropriate shifting and masking operations, without the need for a byte-at-a-time loop. In initial experiments, this length-sorting approach was found to reduce symbol lookup cost by a factor of two.
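The length-sorted insertion structure can be sketched as below. The per-length loop is only modeled here; in the real scheme each bin would be processed by a loop specialized to its length, hashing and comparing via multibyte loads rather than byte at a time, and the lengths themselves come from bit scans rather than `size()`:

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Length-sorted symbol table construction: bin name occurrences by
// length, then insert each bin in turn, assigning ids in first-seen
// order within the length-sorted traversal.
std::map<std::string, int>
build_symbol_table(const std::vector<std::string>& names) {
    std::vector<std::vector<std::string>> byLength(64);
    for (const auto& n : names)
        if (n.size() < 64) byLength[n.size()].push_back(n);
    std::map<std::string, int> table;
    int nextId = 0;
    for (const auto& bin : byLength)       // one (specializable) pass per length
        for (const auto& n : bin)
            if (table.emplace(n, nextId).second) ++nextId;
    return table;
}
```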

Current research includes the application of SIMD technology to further enhance the performance of length-sorted lookup. We have identified a promising technique for parallel processing of multiple name occurrences using a parallel trie lookup technique. Given an array of occurrences of names of a particular length, the first one, two or four bytes of each name are gathered and stored in a linear array. SIMD techniques are then used to compare these prefixes with the possible prefixes for the current position within the trie. In general, a very small number of possibilities exist for each trie node, allowing for fast linear search through all possibilities. Typically, the parallelism is expected to exceed the number of possibilities to search through at each node. With length-sorting to separate the top-level trie into many small subtries, we expect only a single step of symbol lookup to be needed in most practical instances.

The gather step of this algorithm is actually a common technique in SIMD processing. Instruction set support for gather operations is a likely future direction for SIMD technology.

Numeric Processing

Many XML applications involve numeric data fields as attribute values or element content. Although most current XML APIs uniformly return information to applications in the form of character strings, it is reasonable to consider direct API support for numeric conversions within a high-performance XML engine. With string to numeric conversion such a common need, why leave it to application programmers?

High-performance string to numeric conversion using SIMD operations also can considerably outperform the byte-at-a-time loops that most application programmers or libraries might employ. A first step is reduction of ASCII bytes to corresponding decimal nybbles using a SIMD packing operation. Then an inductive doubling algorithm using SIMD operations may be employed. First, 16 sets of adjacent nybble values in the range 0-9 can be combined in just a few SIMD operations to 16 byte values in the range 0-99. Then 8 sets of byte values may similarly be combined with further SIMD processing to produce doublebyte values in the range 0-9999. Further combination of doublebyte values into 32-bit integers and so on can also be performed using SIMD operations.
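The pairwise combination scheme can be modeled sequentially as follows; each level does what one SIMD pass does across a whole register, combining adjacent values with 10*hi + lo while the scale squares at each level. Leading-zero padding to a power-of-two digit count is an assumption made for this sketch:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Inductive-doubling decimal conversion: digits -> pairs (0-99) ->
// quads (0-9999) -> ..., halving the element count at each level.
uint64_t convert_decimal(const std::string& digits) {
    if (digits.empty()) return 0;
    std::vector<uint64_t> v;
    for (char c : digits) v.push_back(static_cast<uint64_t>(c - '0'));
    // Pad with leading zeros to a power-of-two count (sketch assumption).
    while (v.size() & (v.size() - 1)) v.insert(v.begin(), 0);
    uint64_t scale = 10;
    while (v.size() > 1) {
        std::vector<uint64_t> next;
        for (size_t i = 0; i < v.size(); i += 2)
            next.push_back(v[i] * scale + v[i + 1]);  // combine adjacent pair
        v = next;
        scale *= scale;   // 10, 100, 10000, ...
    }
    return v[0];
}
```

The SIMD version performs each level with packed multiply-add operations over 16 or more elements at once; this loop only fixes the arithmetic.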

Using appropriate gather operations to bring numeric strings into appropriate array structures, an XML engine could offer high-performance numeric conversion services to XML application programmers. We expect this to be an important direction for our future work, particularly in support of APIs that focus on direct conversion of XML data into business objects.

APIs and Parallel Bit Streams

The ILAX Streaming API

The In-Line API for XML (ILAX) is the base API provided with the Parabix parser. It is intended for low-level extensions compiled right into the engine, with minimum possible overhead. It is similar to streaming event-based APIs such as SAX, but implemented by inline substitution rather than using callbacks. In essence, an extension programmer provides method bodies for event-processing methods declared internal to the Parabix parsing engine, compiling the event processing code directly with the core code of the engine.

Although ILAX can be used directly for application programming, its primary use is for implementing engine extensions that support higher-level APIs. For example, the implementation of C or C++ based streaming APIs based on the Expat [Expat] or general SAX models can be quite directly implemented. C/C++ DOM or other tree-based APIs can also be fairly directly implemented. However, delivering Parabix performance to Java-based XML applications is challenging due to the considerable overhead of crossing the Java Native Interface (JNI) boundary. This issue is addressed with the Array Set Model (ASM) concept discussed in the following section.

With the recent development of parallel parsing using bitstream addition, it is likely that the underlying ILAX interface of Parabix will change. In essence, ILAX suffers the drawback of all event-based interfaces: they are fundamentally sequential in nature. As research continues, we expect efficient parallel methods building on parallel bit stream foundations to move up the stack of XML processing requirements. Artificially imposing sequential processing is thus expected to constrain further advances in XML performance.

Efficient XML in Java Using Array Set Models

In our GML-to-SVG case study, we identified the lack of high-performance XML processing solutions for Java to be of particular interest. Java byte code does not provide access to the SIMD capabilities of the underlying machine architecture. Java just-in-time compilers might be capable of using some SIMD facilities, but there is no real prospect of conventional compiler technology translating byte-at-a-time algorithms into parallel bit stream code. So the primary vehicle for delivering high-performance XML processing is to call native parallel bit stream code written in C through JNI capabilities.

However, each JNI call is expensive, so it is desirable to minimize the number of calls and get as much work done during each call as possible. This militates against direct implementation of streaming APIs in Java through one-to-one mappings to an underlying streaming API in C. Instead, we have concentrated on gathering information on the C side into data structures that can then be passed to the Java side. However, using either C pointer-based structures or C++ objects is problematic because these are difficult to interpret on the Java side and are not amenable to Java's automatic storage management system. Similarly, Java objects cannot be conveniently created on the C side. It is possible, however, to transfer arrays of simple data values (bytes or integers) between C and Java, so that makes a reasonable focus for bulk data communication between C and Java.

Array Set Models are array-based representations of information representing an XML document in accord with XML InfoSet [XML Infoset] or other XML data models relevant to particular APIs. As well as providing a mechanism for efficient bulk data communication across the JNI boundary, ASMs potentially have a number of other benefits in high-performance XML processing.

• Prefetching. Commodity processors commonly support hardware and/or software prefetching to ensure that data is available in a processor cache when it is needed. In general, prefetching is most effective in conjunction with the continuous sequential memory access patterns associated with array processing.

• DMA. Some processing environments provide Direct Memory Access (DMA) controllers for block data movement in parallel with computation. For example, the Cell Broadband Engine uses DMA controllers to move the data to and from the local stores of the synergistic processing units. Arrays of contiguous data elements are well suited to bulk data movement using DMA.

• SIMD. Single Instruction Multiple Data (SIMD) capabilities of modern processor instruction sets allow simultaneous application of particular instructions to sets of elements from parallel arrays. For effective use of SIMD capabilities, an SoA (Structure of Arrays) model is preferable to an AoS (Array of Structures) model.

• Multicore processors. Array-oriented processing can enable the effective distribution of work to the individual cores of a multicore system in two distinct ways. First, provided that sequential dependencies can be minimized or eliminated, large arrays can be divided into separate segments to be processed in parallel on each core. Second, pipeline parallelism can be used to implement efficient multipass processing with each pass consisting of a processing kernel with array-based input and array-based output.

• Streaming buffers for large XML documents. In the event that an XML document is larger than can be reasonably represented entirely within processor memory, a buffer-based streaming model can be applied to work through a document using sliding windows over arrays of elements stored in document order.

Saxon-B TinyTree Example

As a first example of the ASM concept, current work includes a proof-of-concept to deliver a high-performance replacement for building the TinyTree data structure used in Saxon-B 6.5.5, an open-source XSLT 2.0 processor written in Java [Saxon]. Although XSLT stylesheets may be cached for performance, the caching of source XML documents is typically not possible. A new TinyTree object to represent the XML source document is thus commonly constructed with each new query so that the overall performance of simple queries on large source XML documents is highly dependent on TinyTree build time. Indeed, in a study of Saxon-SA, the commercial version of Saxon, query time was shown to be dominated by TinyTree build time [Kay 2008]. Similar performance results are demonstrable for the Saxon-B XSLT processor as well.

The Saxon-B processor studied is a pure Java solution, converting a SAX (Simple API for XML) event stream into the TinyTree Java object using the efficient Ælfred XML parser [Ælfred]. The TinyTree structure is itself an array-based mapping well suited to the ASM concept. It consists of six parallel arrays of integers indexed on node number and containing one entry for each node in the source document, with the exception of attribute and namespace nodes [Saxon]. Four of the arrays respectively provide node kind, name code, depth, and next sibling information for each node, while the two others are overloaded for different purposes based on node kind value. For example, in the context of a text node, one of the overloaded arrays holds the text buffer offset value whereas the other holds the text buffer length value. Attributes and namespaces are represented using similar parallel arrays of values. The stored TinyTree values are primarily primitive Java types; however, object types such as Java Strings and Java StringBuffers are also used to hold attribute values and comment values respectively.

In addition to the TinyTree object, Saxon-B maintains a NamePool object which represents a collection of XML name triplets. Each triplet is composed of a Namespace URI, a Namespace prefix and a local name and encoded as an integer value known as a namecode. Namecodes permit efficient name search and look-up using integer comparison. Namecodes may also be subsequently decoded to recover namespace and local name information.
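As an illustration of the namecode idea, the sketch below packs three small indices into a single integer so that name equality reduces to one integer comparison and decoding is a pair of shifts. The field widths (8/8/16 bits) and function names here are hypothetical, chosen only for the example; Saxon's actual namecode layout differs.

```cpp
#include <cstdint>

// Hypothetical namecode packing, for illustration only: prefix index in the
// high byte, URI index next, local-name index in the low 16 bits. Saxon's
// real NamePool uses different field widths, but the principle is the same:
// name comparison and look-up become integer operations.
constexpr uint32_t make_namecode(uint32_t prefixId, uint32_t uriId, uint32_t localId) {
    return (prefixId << 24) | (uriId << 16) | localId;
}
constexpr uint32_t prefix_of(uint32_t namecode) { return namecode >> 24; }
constexpr uint32_t uri_of(uint32_t namecode)    { return (namecode >> 16) & 0xFF; }
constexpr uint32_t local_of(uint32_t namecode)  { return namecode & 0xFFFF; }
```

Decoding a namecode back into its triplet, as the text describes, is then just a matter of extracting the fields again.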

Using the Parabix ILAX interface, a high-performance reimplementation of the TinyTree and NamePool data structures was built to compare with the Saxon-B implementation. In fact, two functionally equivalent versions of the ASM Java class were constructed. An initial version was based on a set of primitive Java arrays constructed and allocated in the Java heap space via JNI New<PrimitiveType>Array method calls. In this version, the JVM garbage collector is aware of all memory allocated in the native code. However, in this approach, large array copy operations limited overall performance to approximately a 2X gain over the Saxon-B build time.

To further address the performance penalty imposed by copying large array values, a second version of the ASM Java object was constructed based on natively backed Direct Memory Byte Buffers [Hitchens 2002]. In this version the JVM garbage collector is unaware of any native memory resources backing the Direct Memory Byte Buffers. Large JNI-based copy operations are avoided; however, system memory must be explicitly deallocated via a Java native method call. Using this approach, our preliminary results show an approximate total 2.5X gain over the Saxon-B build time.

Compiler Technology

An important focus of our recent work is the development of compiler technology to automatically generate the low-level SIMD code necessary to implement bit stream processing given suitable high-level specifications. This has several potential benefits. First, it can eliminate the tedious and error-prone programming of bit stream operations in terms of register-at-a-time SIMD operations. Second, compilation technology can automatically employ a variety of performance improvement techniques that are difficult to apply manually. These include algorithms for instruction scheduling and register allocation as well as optimization techniques such as common subexpression elimination and register rematerialization, among others. Third, compiler technology makes it easier to make changes to the low-level code for reasons of perfective or adaptive maintenance.

Beyond these reasons, compiler technology also offers the opportunity for retargeting the generation of code to accommodate different processor architectures and API requirements. Strategies for efficient parallel bit stream code can vary considerably depending on processor resources such as the number of registers available, the particular instruction set architecture supported, the size of L1 and L2 data caches, the number of available cores and so on. Separate implementation of custom code for each processor architecture would thus likely be prohibitively expensive, prone to errors and inconsistencies, and difficult to maintain. Using compilation technology, however, the idea is to implement a variety of processor-specific back-ends all using a common front end based on parallel bit streams.

Character Class Compiler

The first compiler component that we have implemented is a character class compiler, capable of generating all the bit stream logic necessary to produce a set of lexical item streams, each corresponding to some particular set of characters to be recognized. By taking advantage of common patterns between characters within classes, and special optimization logic for recognizing character-class ranges, our existing compiler is able to generate well-optimized code for complex sets of character classes involving numbers of special characters as well as characters within specific sets of ranges.

Regular Expression Compilation

Based on the character class compiler, we are currently investigating the construction of a regular expression compiler that can implement bit-stream based parallel regular-expression matching similar to that described previously for parallel parsing by bitstream addition. This compiler works with the assumption that bitstream regular-expression definitions are deterministic; no backtracking is permitted with the parallel bit stream representation. In XML applications, this compiler is primarily intended to enforce regular-expression constraints on string datatype specifications found in XML schemas.

Unbounded Bit Stream Compilation

The Catalog of XML Bit Streams presented earlier consists of a set of abstract, unbounded bit streams, each in one-to-one correspondence with input bytes of a text file. Determining how these bit streams are implemented using fixed-width SIMD registers, and possibly processed in fixed-length buffers that represent some multiple of the register width, is a source of considerable programming complexity. The general goal of our compilation strategy in this case is to allow operations to be programmed in terms of unbounded bit streams and then automatically reduced to efficient low-level code through the application of a systematic code generation strategy for handling block and buffer boundary crossings. This work is currently in progress.

Conclusion

Parallel bit stream technology offers the opportunity to dramatically speed up the core XML processing components used to implement virtually any XML API. Character validation and transcoding, whitespace processing, and parsing up to and including the full validation of tag syntax can be handled fully in parallel using bit stream methods. Bit streams to mark the positions of all element names, attribute names and attribute values can also be produced, followed by fast bit scan operations to generate position and length values. Beyond bit streams, byte-oriented SIMD processing of names and numerals can also accelerate performance beyond sequential byte-at-a-time methods.

Advances in processor architecture are likely to further amplify the performance of parallel bit stream technology over traditional byte-at-a-time processing over the next decade. Improvements to SIMD register width, register complement and operation format can all result in further gains. New SIMD instruction set features such as inductive doubling support, parallel extract and deposit instructions, bit interleaving and scatter/gather capabilities should also result in significant speed-ups. Leveraging the intraregister parallelism of parallel bit stream technology within SIMD registers to take advantage of intrachip parallelism on multicore processors should accelerate processing further.

Technology transfer using a patent-based open-source business model is a further goal of our work with a view to widespread deployment of parallel bit stream technology in XML processing stacks implementing a variety of APIs. The feasibility of substantial performance improvement in replacement of technology implementing existing APIs has been demonstrated even in complex software architectures involving delivery of performance benefits across the JNI boundary. We are seeking to accelerate these deployment efforts both through the development of compiler technology to reliably apply these methods to a variety of architectures as well as to identify interested collaborators using open-source or commercial models.

Acknowledgments

This work is supported in part by research grants and scholarships from the Natural Sciences and Engineering Research Council of Canada, the Mathematics of Information Technology and Complex Systems Network and the British Columbia Innovation Council.

We thank our colleague Dan Lin (Linda) for her work in high-performance symbol table processing.

Background

Xerces C++ Structure

The Xerces C++ parser features comprehensive support for a variety of character encodings both commonplace (e.g., UTF-8, UTF-16) and rarely used (e.g., EBCDIC), support for multiple XML vocabularies through the XML namespace mechanism, as well as complete implementations of structure and data validation through multiple grammars declared using either legacy DTDs (document type definitions) or modern XML Schema facilities. Xerces also supports several APIs for accessing parser services, including event-based parsing using either pull parsing or SAX/SAX2 push-style parsing as well as a DOM tree-based parsing interface.

Xerces, like all traditional parsers, processes XML documents sequentially, a byte at a time, from the first to the last byte of input data. Each byte passes through several processing layers and is classified and eventually validated within the context of the document state. This introduces implicit dependencies between the various tasks within the application that make it difficult to optimize for performance. As a complex software system, no one feature dominates the overall parsing performance. Figure \ref{fig:xerces-profile} shows the execution time profile of the top ten functions in a typical run. Even if it were possible, Amdahl's Law dictates that tackling any one of these functions for parallelization in isolation would only produce a minute improvement in performance. Unfortunately, early investigation into these functions found that incorporating speculation-free thread-level parallelization was impossible; they were already performing well in their given tasks, and thus only trivial enhancements were attainable. To obtain a systematic acceleration of Xerces, a comprehensive restructuring involving all aspects of the parser should be expected.

The Parabix Framework

The Parabix (parallel bit stream) framework is a transformative approach to XML parsing (and other forms of text processing). The key idea is to exploit the availability of wide SIMD registers (e.g., 128-bit) in commodity processors to represent data from long blocks of input data by using one register bit per single input byte. To facilitate this, the input data is first transposed into a set of basis bit streams. Boolean-logic operations\footnote{$\land$, $\lor$ and $\lnot$ denote the boolean AND, OR and NOT operators.} are used to classify the input bits into a set of {\it character-class bit streams}, which identify key characters (or groups of characters) with a $1$. For example, one of the fundamental characters in XML is a left-angle bracket. A character is a < if and only if $\lnot(b_0 \lor b_1) \land (b_2 \land b_3) \land (b_4 \land b_5) \land \lnot (b_6 \lor b_7) = 1$. Similarly, a character is numeric, [0-9], if and only if $\lnot(b_0 \lor b_1) \land (b_2 \land b_3) \land \lnot(b_4 \land (b_5 \lor b_6))$. An important observation here is that ranges of characters may require fewer operations than individual characters and multiple classes can share the classification cost.
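The character-class equations above can be exercised directly in software. The sketch below uses one 64-bit word per basis bit stream (covering a 64-byte block) in place of a 128-bit SIMD register; the transposition loop is deliberately naive, and the helper names are illustrative rather than actual Parabix code.

```cpp
#include <cstddef>
#include <cstdint>

// One 64-bit word per basis bit stream: bit j of stream i holds bit i
// (most-significant first) of input byte j, for a 64-byte block.
struct BasisBits { uint64_t b[8]; };

// Naive transposition for illustration; Parabix performs this step with
// SIMD pack operations.
BasisBits transpose(const char *block, std::size_t len) {
    BasisBits s{};
    for (std::size_t j = 0; j < len && j < 64; ++j)
        for (int i = 0; i < 8; ++i)
            if ((static_cast<unsigned char>(block[j]) >> (7 - i)) & 1)
                s.b[i] |= 1ULL << j;
    return s;
}

// '<' is 0x3C = 00111100: the first character-class equation from the text.
uint64_t lt_stream(const BasisBits &s) {
    return ~(s.b[0] | s.b[1]) & (s.b[2] & s.b[3])
         & (s.b[4] & s.b[5]) & ~(s.b[6] | s.b[7]);
}

// [0-9] spans 0x30-0x39: the range equation, cheaper than ten comparisons.
uint64_t digit_stream(const BasisBits &s) {
    return ~(s.b[0] | s.b[1]) & (s.b[2] & s.b[3])
         & ~(s.b[4] & (s.b[5] | s.b[6]));
}
```

Note how the two classes share the subexpressions over $b_0$ through $b_3$, illustrating the shared classification cost mentioned above.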

Consider, for example, the XML source data stream shown in the first line of . The remaining lines of this figure show several parallel bit streams that are computed in Parabix-style parsing, with each bit of each stream in one-to-one correspondence to the source character code units of the input stream. For clarity, 1 bits are denoted with 1 in each stream and 0 bits are represented as underscores. The first bit stream shown is that for the opening angle brackets that represent tag openers in XML. The second and third streams show a partition of the tag openers into start tag marks and end tag marks, depending on whether or not the character immediately following the opener is a slash ("/"). The remaining three lines show streams that can be computed in subsequent parsing (using the technique of \bitstream{} addition \cite{cameron-EuroPar2011}), namely streams marking the element names, attribute names and attribute values of tags.

Two intuitions may help explain how the Parabix approach can lead to improved XML parsing performance. The first is that the use of the full register width offers a considerable information advantage over sequential byte-at-a-time parsing. That is, sequential processing of bytes uses just 8 bits of each register, greatly limiting the processor resources that are effectively being used at any one time. The second is that byte-at-a-time scanning loops are often just computing a single bit of information per iteration: is the scan complete yet? Rather than computing these individual decision bits, an approach that computes many of them in parallel (e.g., 128 bytes at a time using 128-bit registers) should provide substantial benefit.

Previous studies have shown that the Parabix approach improves many aspects of XML processing, including transcoding \cite{Cameron2008}, character classification and validation, tag parsing and well-formedness checking. The first Parabix parser used processor bit scan instructions to considerably accelerate sequential scanning loops for individual characters \cite{CameronHerdyLin2008}. Recent work has incorporated a method of parallel scanning using \bitstream{} addition \cite{cameron-EuroPar2011}, as well as combining SIMD methods with 4-stage pipeline parallelism to further improve throughput \cite{HPCA2012}. Although these research prototypes handled the full syntax of schema-less XML documents, they lacked the functionality required by full XML parsers.
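The parallel scanning idea enabled by bitstream addition can be sketched very compactly: a single addition lets carry propagation advance every cursor bit through its run of class characters at once. The following is an illustration of the ScanThru-style operation from the bitstream addition work, shown on a 64-bit word; the function name is illustrative.

```cpp
#include <cstdint>

// Parallel scanning by bitstream addition. Bit j corresponds to input
// position j. Each cursor bit lying within a run of scan_class bits is
// carried past the end of its run by the addition; the mask then clears
// the interior of the runs. Cursors outside the class are left in place.
uint64_t scan_thru(uint64_t cursors, uint64_t scan_class) {
    return (cursors + scan_class) & ~scan_class;
}
```

For instance, with a cursor on the first character of each element name and a class bit set for every name character, one addition advances all cursors to the character following their respective names simultaneously.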

Commercial XML processors support transcoding of multiple character sets and can parse and validate against multiple document vocabularies. Additionally, they provide API facilities beyond those found in research prototypes, including the widely used SAX, SAX2 and DOM interfaces.

Xercesâlike all traditional XML parsersâprocesses XML documents sequentially. Each character is examined to distinguish between the XML-specific markup, such as a left angle bracket <, and the content held within the document. As the parser progresses through the document, it alternates between markup scanning, validation and content processing modes.

In other words, Xerces belongs to a class of applications termed FSM applications\footnote{Herein FSM applications are considered software systems whose behaviour is defined by their inputs, current state and the events associated with state transitions.}. Each state transition indicates the processing context of subsequent characters. Unfortunately, textual data tends to be unpredictable and any character could induce a state transition.

Parabix-style XML parsers utilize a concept of layered processing. A block of source text is transformed into a set of lexical \bitstream{}s, which undergo a series of operations that can be grouped into logical layers, e.g., transposition, character classification, and lexical analysis. Each layer is pipeline parallel and requires neither speculation nor pre-parsing stages \cite{HPCA2012}. To meet the API requirements of the document-ordered Xerces output, the results of the Parabix processing layers must be interleaved to produce the equivalent behaviour.

Architecture

Overview

\icXML{} is more than an optimized version of Xerces. Many components were grouped, restructured and rearchitected with pipeline parallelism in mind. In this section, we highlight the core differences between the two systems. As shown in Figure \ref{fig:xerces-arch}, Xerces is comprised of five main modules: the transcoder, reader, scanner, namespace binder, and validator. The {\it Transcoder} converts source data into UTF-16 before Xerces parses it as XML; the majority of the character set encoding validation is performed as a byproduct of this process. The {\it Reader} is responsible for the streaming and buffering of all raw and transcoded (UTF-16) text. It tracks the current line/column position, performs line-break normalization and validates context-specific character set issues, such as tokenization of qualified-names. The {\it Scanner} pulls data through the reader and constructs the intermediate representation (IR) of the document; it deals with all issues related to entity expansion, validates the XML well-formedness constraints and any character set encoding issues that cannot be completely handled by the reader or transcoder (e.g., surrogate characters, validation and normalization of character references, etc.) The {\it Namespace Binder} is a core piece of the element stack. It handles namespace scoping issues between different XML vocabularies. This allows the scanner to properly select the correct schema grammar structures. The {\it Validator} takes the IR produced by the Scanner (and potentially annotated by the Namespace Binder) and assesses whether the final output matches the user-defined DTD and schema grammar(s) before passing it to the end-user.

In \icXML{} functions are grouped into logical components. As shown in Figure \ref{fig:icxml-arch}, two major categories exist: (1) the \PS{} and (2) the \MP{}. All tasks in (1) use the Parabix Framework \cite{HPCA2012}, which represents data as a set of parallel \bitstream{}s. The {\it Character Set Adapter}, discussed in Section \ref{arch:character-set-adapter}, mirrors Xerces's Transcoder duties; however, instead of producing UTF-16 it produces a set of lexical \bitstream{}s, similar to those shown in Figure \ref{fig:parabix1}. These lexical \bitstream{}s are later transformed into UTF-16 in the \CSG{}, after additional processing is performed. The first precursor to producing UTF-16 is the {\it Parallel Markup Parser} phase. It takes the lexical streams and produces a set of marker \bitstream{}s in which a 1-bit identifies significant positions within the input data. One \bitstream{} is created for each critical piece of information, such as the beginning and ending of start tags, end tags, element names, attribute names, attribute values and content. Intra-element well-formedness validation is performed as an artifact of this process. Like Xerces, \icXML{} must provide the Line and Column position of each error. The {\it Line-Column Tracker} uses the lexical information to keep track of the document position(s) through the use of an optimized population count algorithm, described in Section \ref{section:arch:errorhandling}. From here, two data-independent branches exist: the Symbol Resolver and Content Preparation Unit.

A typical XML file contains few unique element and attribute names, but each of them will occur frequently. \icXML{} stores these as distinct data structures, called symbols, each with its own global identifier (GID). Using the symbol marker streams produced by the Parallel Markup Parser, the {\it Symbol Resolver} scans through the raw data to produce a sequence of GIDs, called the {\it symbol stream}.

The final components of the \PS{} are the {\it Content Preparation Unit} and {\it \CSG{}}. The former takes the (transposed) basis \bitstream{}s and selectively filters them, according to the information provided by the Parallel Markup Parser, and the latter transforms the filtered streams into the tagged UTF-16 {\it content stream}, discussed in Section \ref{section:arch:contentstream}.

Combined, the symbol and content stream form \icXML{}'s compressed IR of the XML document. The {\it \MP{}}~parses the IR to validate and produce the sequential output for the end user. The {\it Final WF checker} performs inter-element well-formedness validation that would be too costly to perform in bit space, such as ensuring every start tag has a matching end tag. Xerces's namespace binding functionality is replaced by the {\it Namespace Processor}. Unlike Xerces, it is a discrete phase that produces a series of URI identifiers (URI IDs), the {\it URI stream}, which are associated with each symbol occurrence. This is discussed in Section \ref{section:arch:namespacehandling}. Finally, the {\it Validation} layer implements Xerces's validator. However, preprocessing associated with each symbol greatly reduces the work of this stage.

In Xerces, all input is transcoded into UTF-16 to simplify the parsing costs of Xerces itself and provide the end-consumer with a single encoding format. In the important case of UTF-8 to UTF-16 transcoding, the transcoding costs can be significant, because of the need to decode and classify each byte of input, mapping variable-length UTF-8 byte sequences into 16-bit UTF-16 code units with bit manipulation operations. In other cases, transcoding may involve table look-up operations for each byte of input.  In any case, transcoding imposes at least a cost of buffer copying.

In \icXML{}, however,  the concept of Character Set Adapters (CSAs) is used to minimize transcoding costs. Given a specified input encoding, a CSA is responsible for checking that input code units represent valid characters, mapping the characters of the encoding into the appropriate \bitstream{}s for XML parsing actions (i.e., producing the lexical item streams), as well as supporting ultimate transcoding requirements.   All of this work is performed using the parallel \bitstream{} representation of the source input.

An important observation is that many character sets are an extension to the legacy 7-bit ASCII character set.  This includes the various ISO Latin character sets, UTF-8, UTF-16 and many others. Furthermore, all significant characters for parsing XML are confined to the ASCII repertoire.   Thus, a single common set of lexical item calculations serves to compute lexical item streams for all such ASCII-based character sets.

A second observation is that, regardless of which character set is used, quite often all of the characters in a particular block of input will be within the ASCII range. This is a very simple test to perform using the \bitstream{} representation, simply confirming that the bit 0 stream is zero for the entire block. For blocks satisfying this test, all logic dealing with non-ASCII characters can simply be skipped. Transcoding to UTF-16 becomes trivial as the high eight \bitstream{}s of the UTF-16 form are each set to zero in this case.
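The all-ASCII block test is inexpensive in either representation. In bit space it is simply a test that basis bit stream 0 is zero; the byte-space equivalent below ORs the block together and tests the high bit once. This is a hedged sketch of the test, not icXML code.

```cpp
#include <cstddef>

// A block is pure ASCII iff no byte has its high bit set, i.e. iff basis
// bit stream 0 of the transposed form is all zero. In byte space we can
// accumulate the whole block with OR and test the high bit once.
bool block_is_ascii(const unsigned char *block, std::size_t len) {
    unsigned char acc = 0;
    for (std::size_t i = 0; i < len; ++i)
        acc |= block[i];
    return (acc & 0x80) == 0;
}
```

When this test succeeds, all multi-byte decoding logic for the block can be skipped, which is the fast path described above.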

A third observation is that repeated transcoding of the names of XML elements, attributes and so on can be avoided by using a look-up mechanism. That is, the first occurrence of each symbol is stored in a look-up table mapping the input encoding to a numeric symbol ID. Transcoding of the symbol is applied at this time. Subsequent look-up operations can avoid transcoding by simply retrieving the stored representation. As symbol look-up is required to apply various XML validation rules, this achieves the effect of transcoding each occurrence without additional cost.

The cost of individual character transcoding is avoided whenever a block of input is confined to the ASCII subset and for all but the first occurrence of any XML element or attribute name. Furthermore, when transcoding is required, the parallel \bitstream{} representation supports efficient transcoding operations. In the important case of UTF-8 to UTF-16 transcoding, the corresponding UTF-16 \bitstream{}s can be calculated in bit parallel fashion based on UTF-8 streams \cite{Cameron2008}, and all but the final bytes of multi-byte sequences can be marked for deletion as discussed in the following subsection. In other cases, transcoding within a block only need be applied for non-ASCII bytes, which are conveniently identified by iterating through the bit 0 stream using bit scan operations.

Combined Parallel Filtering

As just mentioned, UTF-8 to UTF-16 transcoding involves marking all but the last bytes of multi-byte UTF-8 sequences as positions for deletion. For example, the two Chinese characters 你好 are represented as two three-byte UTF-8 sequences \verb'E4 BD A0' and \verb'E5 A5 BD' while the UTF-16 representation must be compressed down to the two code units \verb'4F60' and \verb'597D'. In the bit parallel representation, this corresponds to a reduction from six bit positions representing UTF-8 code units (bytes) down to just two bit positions representing UTF-16 code units (double bytes). This compression may be achieved by arranging to calculate the correct UTF-16 bits at the final position of each sequence and creating a deletion mask to mark the first two bytes of each 3-byte sequence for deletion. In this case, the portion of the mask corresponding to these input bytes is the bit sequence \verb'110110'. Using this approach, transcoding may then be completed by applying parallel deletion and inverse transposition of the UTF-16 \bitstream{}s \cite{Cameron2008}.

Rather than immediately paying the costs of deletion and transposition just for transcoding, however, \icXML{} defers these steps so that the deletion masks for several stages of processing may be combined. In particular, this includes core XML requirements to normalize line breaks and to replace character reference and entity references by their corresponding text.   In the case of line break normalization, all forms of line breaks, including bare carriage returns (CR), line feeds (LF) and CR-LF combinations must be normalized to a single LF character in each case.   In \icXML{}, this is achieved by first marking CR positions, performing two bit parallel operations to transform the marked CRs into LFs, and then marking for deletion any LF that is found immediately after the marked CR as shown by the Pablo source code in Figure \ref{fig:LBnormalization}.
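The line-break logic described above can be sketched in bit space as follows, using 64-bit words for the CR and LF streams (bit j corresponds to input position j). This is an illustration of the technique, not the actual Pablo-generated code, and the names are invented for the example.

```cpp
#include <cstdint>

struct LineBreakFix {
    uint64_t lf_positions;  // positions holding an LF after normalization
    uint64_t delete_mask;   // positions to remove in the combined deletion pass
};

// Normalize line breaks in bit space: every CR becomes an LF, and the LF of
// each CR-LF pair is marked for deletion, so CR, LF and CR-LF all normalize
// to a single LF. The shift moves each CR mark onto the following position.
LineBreakFix normalize_linebreaks(uint64_t cr, uint64_t lf) {
    uint64_t lf_after_cr = (cr << 1) & lf;      // the LF of each CR-LF pair
    return { (lf | cr) & ~lf_after_cr,          // bare CRs now count as LFs
             lf_after_cr };                     // deleted via combined filter
}
```

For the input "a\r\nb" (CR at position 1, LF at position 2), the CR position becomes the surviving LF and the original LF is marked for deletion, yielding "a\nb" after the deletion pass.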

In essence, the deletion masks for transcoding and for line break normalization each represent a bitwise filter; these filters can be combined using bitwise-or so that the parallel deletion algorithm need only be applied once.

A further application of combined filtering is the processing of XML character and entity references. Consider, for example, the references &amp; or &lt;, which must be replaced in XML processing with the single & and < characters, respectively. The approach in \icXML{} is to mark all but the first character position of each reference for deletion, leaving a single character position unmodified. Thus, for the references &amp; or &lt; the masks 01111 and 0111 are formed and combined into the overall deletion mask. After the deletion and inverse transposition operations are finally applied, a post-processing step inserts the proper character at these positions. One note about this process is that it is speculative; references are assumed to generally be replaced by a single UTF-16 code unit. In the case that this is not true, it is addressed in post-processing.

The final step of combined filtering occurs during the process of reducing markup data to tag bytes preceding each significant XML transition, as described in section~\ref{section:arch:contentstream}. Overall, \icXML{} avoids separate buffer copying operations for each of these filtering steps, paying the cost of parallel deletion and inverse transposition only once. Currently, \icXML{} employs the parallel-prefix compress algorithm of Steele~\cite{HackersDelight}, whose performance is independent of the number of positions deleted. Future versions of \icXML{} are expected to take advantage of the parallel extract operation~\cite{HilewitzLee2006} that Intel is now providing in its Haswell architecture.
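For reference, the parallel-prefix compress can be sketched on a single 64-bit word as below (a transcription of the algorithm as presented in Hacker's Delight, applied per bit-stream word; the mask of kept positions is the complement of the combined deletion mask):

```cpp
#include <cassert>
#include <cstdint>

// Parallel-prefix compress (Steele's algorithm, as presented in Hacker's
// Delight): gather the bits of x selected by mask m into the low-order end,
// in order. Runs in a fixed six iterations regardless of how many positions
// are deleted.
uint64_t compress64(uint64_t x, uint64_t m) {
    x &= m;                      // clear bits not selected by the mask
    uint64_t mk = ~m << 1;       // count 0s to the right of each bit
    for (int i = 0; i < 6; i++) {
        uint64_t mp = mk ^ (mk << 1);    // parallel suffix of mk
        mp ^= mp << 2;  mp ^= mp << 4;
        mp ^= mp << 8;  mp ^= mp << 16;  mp ^= mp << 32;
        uint64_t mv = mp & m;            // bits to move by 2^i
        m = (m ^ mv) | (mv >> (1u << i));
        uint64_t t = x & mv;
        x = (x ^ t) | (t >> (1u << i));  // compress x by 2^i
        mk &= ~mp;
    }
    return x;
}
```

Functionally, \verb'compress64' is equivalent to the BMI2 PEXT instruction referred to above.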

Content Stream

A relatively unique concept for \icXML{} is the use of a filtered content stream. Rather than parsing an XML document in its original format, the input is transformed into one that is easier for the parser to iterate through and produce the sequential output. The source data is transformed into the content stream through the parallel filtering algorithm described in section \ref{sec:parfilter}.

Combined with the symbol stream, the parser traverses the content stream to effectively reconstruct the input document in its output form. The initial {\tt\it 0} indicates an empty content string. The following \verb|>| indicates that a start tag without any attributes is the first element in this text and that the first unused symbol, \verb|document|, is the element name. Succeeding that is the content string \verb|fee|, which is null-terminated in accordance with the Xerces API specification. Unlike Xerces, no memory-copy operations are required to produce these strings, which as Figure~\ref{fig:xerces-profile} shows account for 6.83% of Xerces's execution time. Additionally, it is cheap to locate the terminal character of each string: using the String End \bitstream{}, the \PS{} can calculate the offset of each null character in the content stream in parallel, which in turn means the parser can jump directly to the end of every string without scanning for it.
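The enumeration of string terminators can be sketched as follows (a hypothetical helper under an assumed packing of the String End \bitstream{} into 64-bit words; not the Xerces or icXML API):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Sketch: recover the offset of every null terminator in a content-stream
// block from its String End bit stream, using one count-trailing-zeros
// bit scan per set bit rather than a byte-at-a-time scan.
std::vector<unsigned> stringEnds(const uint64_t *stream, unsigned nWords) {
    std::vector<unsigned> offsets;
    for (unsigned w = 0; w < nWords; ++w)
        for (uint64_t bits = stream[w]; bits != 0; bits &= bits - 1)
            offsets.push_back(w * 64 + (unsigned)__builtin_ctzll(bits));
    return offsets;
}
```

(\verb'__builtin_ctzll' is the GCC/Clang count-trailing-zeros intrinsic.)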

Following \verb|fee| is a \verb|=|, which marks the existence of an attribute. Because all of the intra-element well-formedness validation was performed in the \PS{}, this must be a legal attribute. Since attributes can only occur within start tags and must be accompanied by a textual value, the next symbol in the symbol stream must be the element name of a start tag; the following one must be the name of the attribute, and the string that follows the \verb|=| must be its value. However, the subsequent \verb|=| is not treated as an independent attribute because the parser has yet to read a \verb|>|, which marks the end of a start tag. Thus only one symbol is taken from the symbol stream and it (along with the string value) is added to the element. Eventually the parser reaches a \verb|/|, which marks the existence of an end tag. Every end tag requires an element name, which means it requires a symbol. Inter-element validation is performed whenever an end tag is detected, to ensure that the appropriate scope-nesting rules have been applied.

Namespace Handling

In XML, namespaces prevent naming conflicts when multiple vocabularies are used together. They are especially important when a vocabulary has an application-dependent meaning, such as when XML or SVG documents are embedded within XHTML files. Namespaces are bound to uniform resource identifiers (URIs), which are strings used to identify specific names or resources. On line 1 of Figure \ref{fig:namespace1}, the \verb|xmlns| attribute instructs the XML processor to bind the prefix \verb|p| to the URI \verb|'pub.net'| and the default (empty) prefix to \verb|'book.org'|. Thus, to the XML processor, the \verb|title| on line 2 and \verb|price| on line 4 read as \verb|"book.org":title| and \verb|"book.org":price| respectively, whereas on lines 3 and 5, \verb|p:name| and \verb|p:price| are seen as \verb|"pub.net":name| and \verb|"pub.net":price|. Even though the actual element name is \verb|price| in both cases, due to namespace scoping rules they are viewed as two uniquely-named items because the current vocabulary is determined by the namespace(s) that are in-scope.

In both Xerces and \icXML{}, every URI has a one-to-one mapping to a URI ID. These persist for the lifetime of the application through the use of a global URI pool. Xerces maintains a stack of namespace scopes that is pushed (popped) every time a start tag (end tag) occurs in the document. Because a namespace declaration affects the entire element, it must be processed prior to grammar validation. This is a costly process considering that a typical namespaced XML document only comes in one of two forms: (1) those that declare a set of namespaces upfront and never change them, and (2) those that repeatedly modify the namespaces in predictable patterns.

For that reason, \icXML{} contains an independent namespace stack and utilizes bit vectors to cheaply perform namespace scope resolution. When a prefix is declared (e.g., \verb|xmlns:p="pub.net"|), a namespace binding is created that maps the prefix (prefixes are assigned Prefix IDs in the symbol resolution process) to the URI. Each unique namespace binding has a unique namespace ID (NSID) and every prefix contains a bit vector marking every NSID that has ever been associated with it within the document. For example, in Table \ref{tbl:namespace1}, the prefix binding sets of \verb|p| and \verb|xmlns| would be \verb|01| and \verb|11| respectively. To resolve the in-scope namespace binding for each prefix, a bit vector of the currently visible namespaces is maintained by the system. By ANDing the prefix bit vector with the currently visible namespaces, the in-scope NSID can be found using a bit-scan intrinsic. A namespace binding table, similar to Table \ref{tbl:namespace1}, provides the actual URI ID.

To ensure that scoping rules are adhered to, whenever a start tag is encountered, any modification to the currently visible namespaces is calculated and stored within a stack of bit vectors denoting the locally modified namespace bindings. When an end tag is found, the currently visible namespace vector is XORed with the vector at the top of the stack. This allows any number of changes to be performed at each scope-level in constant time.
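A minimal sketch of this scheme (hypothetical class and field names, and 64-bit vectors that cap a document at 64 distinct bindings; not icXML's actual implementation):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical sketch of bit-vector namespace resolution. Each namespace
// binding is assigned an NSID, used as a bit position.
struct NamespaceContext {
    std::vector<uint64_t> prefixBindings; // per prefix: every NSID ever bound
    uint64_t visible = 0;                 // NSIDs currently in scope
    std::vector<uint64_t> scopeStack;     // per scope: locally changed NSIDs

    // AND the prefix's vector with the visible set, then bit-scan for the
    // in-scope NSID (-1 if the prefix is unbound).
    int resolve(unsigned prefixId) const {
        uint64_t inScope = prefixBindings[prefixId] & visible;
        return inScope ? __builtin_ctzll(inScope) : -1;
    }
    // Start tag: XOR in the local changes and remember them.
    void enterScope(uint64_t changes) {
        visible ^= changes;
        scopeStack.push_back(changes);
    }
    // End tag: one XOR undoes any number of local changes in constant time.
    void leaveScope() {
        visible ^= scopeStack.back();
        scopeStack.pop_back();
    }
};
```

Rebinding a prefix within a scope is expressed as a change vector with both the old and the new NSID set, so the XOR simultaneously hides one binding and reveals the other.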

Error Handling

Xerces outputs error messages in two ways: through the programmer API and as thrown objects for fatal errors. As Xerces parses a file, it uses context-dependent logic to assess whether the next character is legal; if not, the current state determines the type and severity of the error. \icXML{} emits errors in a similar manner, but how it discovers them is substantially different. Recall that in Figure \ref{fig:icxml-arch}, \icXML{} is divided into two sections: the \PS{} and \MP{}, each with its own system for detecting and producing error messages.

Within the \PS{}, all computations are performed in parallel, a block at a time. Errors are derived as artifacts of \bitstream{} calculations, with a 1-bit marking the byte-position of an error within a block, and the type of error is determined by the equation that discovered it. The difficulty of error processing in this section is that in Xerces the line and column number must be given with every error production. Two major issues arise from this: (1) line counting adheres to XML whitespace-normalization rules, so some sequences of characters, e.g., a carriage return followed by a line feed, are counted as a single newline character; (2) column position is counted in characters, not bytes or code units, so multi-code-unit code points and surrogate character pairs are each counted as a single column position. Note that typical XML documents are error-free, but the calculation of the line/column position is a constant overhead in Xerces. To reduce this, \icXML{} pushes the bulk cost of the line/column calculation to the occurrence of the error and performs the minimal amount of book-keeping necessary to facilitate it. \icXML{} leverages the byproducts of the Character Set Adapter (CSA) module and amalgamates the information within the Line Column Tracker (LCT). One of the CSA's major responsibilities is transcoding an input text. During this process, whitespace normalization rules are applied and multi-code-unit and surrogate characters are detected and validated. A {\it line-feed \bitstream{}}, which marks the positions of the normalized newline characters, is a natural derivative of this process. Using an optimized population count algorithm, the line count can be summarized cheaply for each valid block of text. Column position is more difficult to calculate.
It is possible to scan backwards through the \bitstream{} of newline characters to determine the distance (in code units) between the position at which the error was detected and the last line feed. However, this distance may exceed the actual character position for the reasons discussed in (2). To handle this, the CSA generates a {\it skip mask} \bitstream{} by ORing together many relevant \bitstream{}s, such as those marking all trailing multi-code-unit and surrogate bytes, and any characters that were removed during the normalization process. When an error is detected, the sum of those skipped positions is subtracted from the distance to determine the actual column number.
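The deferred line/column computation can be sketched as follows (a sketch under assumed conventions, with hypothetical names and a single 64-code-unit block, not icXML's actual code): the line number is a population count of line feeds before the error position, and the column is the distance from the last line feed minus the positions flagged in the skip mask, reported 1-based.

```cpp
#include <cassert>
#include <cstdint>

struct Position { unsigned line, column; };

// lineFeeds and skipMask are bit streams (bit k = code unit k); lineBase is
// the line number at the start of the block; errPos < 64 is the error bit.
Position locate(uint64_t lineFeeds, uint64_t skipMask,
                unsigned lineBase, unsigned errPos) {
    uint64_t before   = (1ull << errPos) - 1;       // positions < errPos
    uint64_t lfBefore = lineFeeds & before;
    unsigned line = lineBase + (unsigned)__builtin_popcountll(lfBefore);
    unsigned lineStart =                            // just past the last LF
        lfBefore ? 64u - (unsigned)__builtin_clzll(lfBefore) : 0u;
    uint64_t sinceLF = before & ~((lineStart ? 1ull << lineStart : 1ull) - 1);
    unsigned skipped = (unsigned)__builtin_popcountll(skipMask & sinceLF);
    return { line, errPos - lineStart - skipped + 1 };
}
```

For example, with a line feed at position 2 and a two-code-unit character whose trailing unit (position 5) is in the skip mask, an error at position 6 reports line 2, column 3.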

The \MP{} is a state-driven machine. As such, error detection within it is very similar to Xerces. However, reporting the correct line/column is a much more difficult problem. The \MP{} parses the content stream, which is a series of tagged UTF-16 strings. Each string is normalized in accordance with the XML specification. All symbol data and unnecessary whitespace are eliminated from the stream; thus it is impossible to derive the current location using only the content stream. To calculate the location, the \MP{} borrows three additional pieces of information from the \PS{}: the line-feed \bitstream{}, the skip mask, and a {\it deletion mask stream}, which is a \bitstream{} denoting the (code-unit) position of every datum that was suppressed from the source during the production of the content stream. Armed with these, it is possible to calculate the actual line/column using the same system as the \PS{}, scanning until the sum of the negated deletion mask stream is equal to the current position.

As discussed in section \ref{background:xerces}, Xerces can be considered an FSM application. These are ``embarrassingly sequential''~\cite{Asanovic:EECS-2006-183} and notoriously difficult to parallelize. However, \icXML{} is designed to organize processing into logical layers. In particular, layers within the \PS{} are designed to operate over significant segments of input data before passing their outputs on for subsequent processing. This fits well into the general model of pipeline parallelism, in which each thread is in charge of a single module or group of modules.

The most straightforward division of work in \icXML{} is to separate the \PS{} and the \MP{} into two distinct pipeline stages. The resultant application, {\it\icXMLp{}}, is a coarse-grained software-pipeline application. In this case, the \PS{} thread $T_1$ reads 16k of XML input $I$ at a time and produces the content, symbol and URI streams, then stores them in a pre-allocated shared data structure $S$. The \MP{} thread $T_2$ consumes $S$, performs well-formedness and grammar-based validation, and then provides parsed XML data to the application through the Xerces API. The shared data structure is implemented using a ring buffer, where every entry contains an independent set of data streams. In the examples of Figures \ref{threads_timeline1} and \ref{threads_timeline2}, the ring buffer has four entries. A lock-free mechanism is applied to ensure that each entry can be read or written by only one thread at a time. In Figure \ref{threads_timeline1} the processing time of $T_1$ is longer than that of $T_2$; thus $T_2$ always waits for $T_1$ to write to the shared memory. Figure \ref{threads_timeline2} illustrates the opposite scenario, in which $T_1$ is faster and must wait for $T_2$ to finish reading the shared data before it can reuse the memory space.
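Such a coupling can be sketched as a lock-free single-producer/single-consumer ring buffer (hypothetical types and naming, not \icXMLp{}'s actual implementation):

```cpp
#include <array>
#include <atomic>
#include <cassert>

// Sketch: N-entry SPSC ring buffer coupling the Parabix Subsystem thread
// (producer) and the Markup Processor thread (consumer). Each slot would
// hold one input segment's content, symbol and URI streams.
template <typename Streams, unsigned N = 4>
class RingBuffer {
    std::array<Streams, N> slots;
    std::atomic<unsigned> written{0}, read{0};
public:
    Streams *beginWrite() {              // T1: wait until a slot is free
        while (written.load(std::memory_order_acquire) -
               read.load(std::memory_order_acquire) == N) { /* spin */ }
        return &slots[written.load(std::memory_order_relaxed) % N];
    }
    void endWrite() { written.fetch_add(1, std::memory_order_release); }
    Streams *beginRead() {               // T2: wait until a slot is filled
        while (written.load(std::memory_order_acquire) ==
               read.load(std::memory_order_acquire)) { /* spin */ }
        return &slots[read.load(std::memory_order_relaxed) % N];
    }
    void endRead() { read.fetch_add(1, std::memory_order_release); }
};
```

Because each side only ever advances its own counter, a single atomic increment on each end suffices; no locks are taken on the data path.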

Overall, our design is intended to benefit a range of applications. Conceptually, we consider two design points. In the first, the parsing performed by the \PS{} dominates at 67% of the overall cost, with the cost of application processing (including the driver logic within the \MP{}) at 33%. The second is almost the opposite scenario: the cost of application processing dominates at 60%, while the cost of XML parsing represents an overhead of 40%.

Our design is predicated on a goal of using the Parabix framework to achieve a 50% to 100% improvement in the parsing engine itself. In the best-case scenario, consider a 100% improvement of the \PS{} at the design point in which XML parsing dominates at 67% of the total application cost. In this case, single-threaded \icXML{} should achieve a 1.5x speedup over Xerces, reducing the total application cost to 67% of the original. In \icXMLp{}, however, our ideal scenario gives two well-balanced threads, each performing about 33% of the original work. In this case, Amdahl's law predicts that we could expect up to a 3x speedup at best.

At the other extreme of our design range, we consider an application in which core parsing cost is 40%. Assuming the 2x speedup of the \PS{} over the corresponding Xerces core, single-threaded \icXML{} delivers a 25% speedup. However, the most significant aspect of our two-stage multi-threaded design then becomes the ability to hide the entire latency of parsing within the serial time required by the application. In this case, we achieve an overall speedup in processing time of 1.67x.
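These figures follow from simple Amdahl-style arithmetic over the stated cost splits; as a check:

```latex
% Design point 1: parsing 67%, application logic 33%; parser made 2x faster:
\[ S_{\mathrm{single}} = \frac{1}{0.33 + 0.67/2} \approx 1.5\times, \qquad
   S_{\mathrm{pipeline}} \le \frac{1}{0.33} \approx 3\times . \]
% Design point 2: parsing 40%, application 60%:
\[ S_{\mathrm{single}} = \frac{1}{0.60 + 0.40/2} = 1.25\times, \qquad
   S_{\mathrm{overlap}} = \frac{1}{0.60} \approx 1.67\times . \]
```

The pipelined bound is set by the slower of the two balanced stages; the overlap bound assumes parsing latency is fully hidden behind the application's 60% serial time.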

Although the structure of the \PS{} allows division of the work into several pipeline stages and has been demonstrated to be effective for four pipeline stages in a research prototype \cite{HPCA2012}, our analysis here suggests that further pipelining of work within the \PS{} is not worthwhile if the cost of application logic is as little as 33% of the end-to-end cost using Xerces. To achieve benefits of further parallelization with multi-core technology, there would need to be reductions in the cost of application logic that could match reductions in core parsing cost.

Performance

We evaluate \xerces{}, \icXML{} and \icXMLp{} using two benchmarking applications: the Xerces C++ SAXCount sample application, and a real-world GML to SVG transformation application. We investigated XML parser performance using an Intel Core i7 quad-core (Sandy Bridge) processor (3.40GHz, 4 physical cores, 8 threads (2 per core), 32+32 kB (per core) L1 cache, 256 kB (per core) L2 cache, 8 MB L3 cache) running the 64-bit version of Ubuntu 12.04 (Linux).

We analyzed the execution profiles of each XML parser using the performance counters found in the processor. We chose several key hardware events that provide insight into the profile of each application and indicate whether the processor is doing useful work. The set of events included in our study is: processor cycles, branch instructions, branch mispredictions, and cache misses. The Performance Application Programming Interface (PAPI) Version 5.5.0 \cite{papi} toolkit was installed on the test system to facilitate the collection of hardware performance monitoring statistics. In addition, we used the Linux perf \cite{perf} utility to collect per-core hardware events.

Xerces C++ SAXCount

Xerces comes with sample applications that demonstrate salient features of the parser. SAXCount is the simplest such application: it counts the elements, attributes and characters of a given XML file using the (event-based) SAX API and prints out the totals.

Table \ref{XMLDocChars} shows the document characteristics of the XML input files selected for the Xerces C++ SAXCount benchmark. The jaw.xml file represents document-oriented XML inputs and contains the three-byte and four-byte UTF-8 sequences required for the UTF-8 encoding of Japanese characters. The remaining data files are data-oriented XML documents and consist entirely of single-byte encoded ASCII characters.

A key predictor of the overall parsing performance of an XML file is markup density\footnote{ Markup Density: the ratio of markup bytes used to define the structure of the document vs. its file size.}. This metric has substantial influence on the performance of traditional recursive descent XML parsers because it directly corresponds to the number of state transitions that occur when parsing a document. We use a mixture of document-oriented and data-oriented XML files to analyze performance over a spectrum of markup densities.

Figure \ref{perf_SAX} compares the performance of Xerces, \icXML{} and pipelined \icXML{} in terms of CPU cycles per byte for the SAXCount application. The speedup for \icXML{} over Xerces is 1.3x to 1.8x. With two threads on the multicore machine, \icXMLp{} can achieve a speedup of up to 2.7x. Xerces is substantially slowed by dense markup, but \icXML{} is less affected through a reduction in branches and the use of parallel-processing techniques. \icXMLp{} performs better as markup density increases because the work performed by each stage is well balanced in this application.

GML2SVG

Conclusion and Future Work

This paper is the first case study documenting the significant performance benefits that may be realized through the integration of parallel \bitstream{} technology into existing widely-used software libraries. In the case of the Xerces-C++ XML parser, the combined integration of SIMD and multicore parallelism was shown capable of producing dramatic increases in throughput and reductions in branch mispredictions and cache misses. The modified parser, going under the name \icXML{}, is designed to provide the full functionality of the original Xerces library with complete compatibility of APIs. Although substantial re-engineering was required to realize the performance potential of parallel technologies, this is an important case study demonstrating the general feasibility of these techniques.

The further development of \icXML{} to move beyond 2-stage pipeline parallelism is ongoing, with realistic prospects for four reasonably balanced stages within the library.  For applications such as GML2SVG which are dominated by time spent on XML parsing, such a multistage pipelined parsing library should offer substantial benefits.

The example of XML parsing may be considered prototypical of finite-state machine applications, which have sometimes been considered ``embarrassingly sequential'' and so difficult to parallelize that ``nothing works.'' So the case study presented here should be considered an important data point in making the case that parallelization can indeed be helpful across a broad array of application types.

To overcome the software engineering challenges in applying parallel \bitstream{} technology to existing software systems, it is clear that better library and tool support is needed. The techniques used in the implementation of \icXML{} and documented in this paper could well be generalized for applications in other contexts and automated through the creation of compiler technology specifically supporting parallel \bitstream{} programming.

Bibliography

[Leventhal and Lemoine 2009] Leventhal, Michael and Eric Lemoine 2009. The XML chip at 6 years. Proceedings of International Symposium on Processing XML Efficiently 2009, Montréal.

[Salz, Achilles and Maze 2009] Salz, Richard, Heather Achilles, and David Maze. 2009. Hardware and software trade-offs in the IBM DataPower XML XG4 processor card. Proceedings of International Symposium on Processing XML Efficiently 2009, Montréal.

[Cameron 2007] Cameron, Robert D. 2007. A Case Study in SIMD Text Processing with Parallel Bit Streams: UTF-8 to UTF-16 Transcoding. Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming 2008, Salt Lake City, Utah. On the Web at http://research.ihost.com/ppopp08/.

[Cameron, Herdy and Lin 2008] Cameron, Robert D., Kenneth S Herdy, and Dan Lin. 2008. High Performance XML Parsing Using Parallel Bit Stream Technology. Proceedings of CASCON 2008, Toronto.

[Herdy, Burggraf and Cameron 2008] Herdy, Kenneth S., Robert D. Cameron and David S. Burggraf. 2008. High Performance GML to SVG Transformation for the Visual Presentation of Geographic Data in Web-Based Mapping Systems. Proceedings of SVG Open 2008, Nuremberg. On the Web at http://www.svgopen.org/2008/papers/74-HighPerformance_GML_to_SVG_Transformation_for_the_Visual_Presentation_of_Geographic_Data_in_WebBased_Mapping_Systems/.

[Ross 2006] Ross, Kenneth A. 2006. Efficient hash probes on modern processors. Proceedings of ICDE 2006, Atlanta. On the Web at www.cs.columbia.edu/~kar/pubsk/icde2007.pdf.

[Cameron and Lin 2009] Cameron, Robert D. and Dan Lin. 2009. Architectural Support for SWAR Text Processing with Parallel Bit Streams: The Inductive Doubling Principle. Proceedings of ASPLOS 2009, Washington, DC.

[Wu et al. 2008] Wu, Yu, Qi Zhang, Zhiqiang Yu and Jianhui Li. 2008. A Hybrid Parallel Processing for XML Parsing and Schema Validation. Proceedings of Balisage 2008, Montréal. On the Web at http://www.balisage.net/Proceedings/vol1/html/Wu01/BalisageVol1-Wu01.html.

[Cameron 2008] Cameron, Robert D. 2007. u8u16 - A High-Speed UTF-8 to UTF-16 Transcoder Using Parallel Bit Streams. Technical Report 2007-18, School of Computing Science, Simon Fraser University, June 21, 2007.

[XML 1.0] Extensible Markup Language (XML) 1.0 (Fifth Edition) W3C Recommendation 26 November 2008. On the Web at http://www.w3.org/TR/REC-xml/.

[Unicode] The Unicode Consortium. 2009. On the Web at http://unicode.org/.

[Hilewitz and Lee 2006] Hilewitz, Y. and Ruby B. Lee. 2006. Fast Bit Compression and Expansion with Parallel Extract and Parallel Deposit Instructions. Proceedings of the IEEE 17th International Conference on Application-Specific Systems, Architectures and Processors (ASAP), pp. 65-72, September 11-13, 2006.

[XML Infoset] XML Information Set (Second Edition) W3C Recommendation 4 February 2004. On the Web at http://www.w3.org/TR/xml-infoset/.

[Saxon] SAXON The XSLT and XQuery Processor. On the Web at http://saxon.sourceforge.net/.

[Kay 2008] Kay, Michael Y. 2008. Ten Reasons Why Saxon XQuery is Fast, IEEE Data Engineering Bulletin, December 2008.

[Ãlfred] The Ãlfred XML Parser. On the Web at

[Ãlfred] The Ãlfred XML Parser. On the Web at http://saxon.sourceforge.net/aelfred.html.

[Hitchens 2002] Hitchens, Ron. Java NIO. O'Reilly, 2002.

[Expat] The Expat XML Parser. http://expat.sourceforge.net/.

Input Data < t a g / >
ASCII 00111100 01110100 01100001 01100111 00101111 00111110
Bit0 0 0 0 0 0 0
Bit1 0 1 1 1 0 0
Bit2 1 1 1 1 1 1
Bit3 1 1 0 0 0 1
Bit4 1 0 0 0 1 1
Bit5 1 1 0 1 1 1
Bit6 0 0 0 1 1 1
Bit7 0 0 1 1 1 0
Input Data &gt; &#13; &#x0a;
GenRefs _11______________
DecRefs _______11________
HexRefs ______________11_
ErrorFlag _________________
Input Data &gt; &#, &#x;
GenRefs _11___________
DecRefs ______________
HexRefs ______________
ErrorFlag _______1____1_
LAngle Marks the position of any left angle bracket character.
RAngle Marks the position of any right angle bracket character.
LBracket Marks the position of any left square bracket character.
RBracket Marks the position of any right square bracket character.
Exclam Marks the position of any exclamation mark character.
QMark Marks the position of any question mark character.
Hyphen Marks the position of any hyphen character.
Equals Marks the position of any equal sign character.
SQuote Marks the position of any single quote character.
DQuote Marks the position of any double quote character.
Slash Marks the position of any forward slash character.
NameScan Marks the position of any XML name character.
WS Marks the position of any XML 1.0 whitespace character.
PI_start Marks the position of the start of any processing instruction at the '?' character position.
PI_end Marks the position of any end of any processing instruction at the '>' character position.
CtCD_start Marks the position of the start of any comment or CDATA section at the '!' character position.
EndTag_start Marks the position of any end tag at the '/' character position.
CD_end Marks the position of the end of any CDATA section at the '>' character position.
DoubleHyphen Marks the position of any double hyphen character.
RefStart Marks the position of any ampersand character.
Hash Marks the position of any hash character.
x Marks the position of any 'x' character.
Digit Marks the position of any digit.
Hex Marks the position of any hexadecimal character.
Semicolon Marks the position of any semicolon character.
Input Data <tag><tag> text &lt; &#x3e; </tag></tag>
LAngle 1____1______________________1_____1_____
RAngle ____1____1_______________________1_____1
WS __________1____1____1______1____________
RefStart ________________1____1__________________
Hex __1____1____1___________11_____1_____1__
Semicolon ___________________1______1_____________
Slash _____________________________1_____1____
Input Data A Text in Farsi: (Farsi characters; UTF-8 bytes garbled in extraction)
Low Nybbles 10458409E061239A099838187910968A9509399
u8Unibyte 11111111111111111__________1______1____
u8Prefix _________________1_1_1_1_1__1_1_1__1_1_
u8Suffix __________________1_1_1_1_1__1_1_1__1_1
u8Prefix2 _________________1_1_1_1_1__1_1_1__1_1_
u8Scope22 __________________1_1_1_1_1__1_1_1__1_1
ErrorFlag _______________________________________
Input Data first line C second line CL third line L one more C nothing left
CR -----------1-------------1------------------------1-------------
LF --------------------------1------------1------------------------
Input Data <?php?> <!-- example --> <![CDATA[ shift: a<<1 ]]>
CD_Span ___________________________1111111111111111111111_
Ct_Span ___________111111111111___________________________
PI_Span _11111____________________________________________
ErrorFlag __________________________________________________
Input Data <root><t1>text</t1><t2 a1='foo' a2 = 'fie'>more</t2><tag3 att3='b'/></root>
ElemNames _1111__11___________11_______________________________1111__________________
AttNames _______________________11_______11________________________1111_____________
AttrVals __________________________11111______11111_____________________111_________
EmptyTagEnds ___________________________________________________________________1_______
EndTags _______________111______________________________111__________________11111_
Start/EmptyTags _1111__11___________1111111111111111111111___________11111111111111________
ErrorFlag ___________________________________________________________________________
• ## docs/Balisage13/Bal2013came0601/Bal2013came0601.xml

 r3038
The Parabix Framework

The Parabix (parallel bit stream) framework is a transformative approach to XML parsing (and other forms of text processing). The key idea is to exploit the availability of wide SIMD registers (e.g., 128-bit) in commodity processors to represent data from long blocks of input data by using one register bit per single input byte. To facilitate this, the input data is first transposed into a set of basis bit streams.

Boolean-logic operations\footnote{$\land$, $\lor$ and $\lnot$ denote the Boolean AND, OR and NOT operators.} are then used to classify the input bits into a set of {\it character-class bit streams}, which identify key characters (or groups of characters) with a $1$. For example, one of the fundamental characters in XML is a left-angle bracket. A character is a \verb:<: if and only if $\lnot(b_0 \lor b_1) \land (b_2 \land b_3) \land (b_4 \land b_5) \land \lnot (b_6 \lor b_7) = 1$. Similarly, a character is numeric, [0-9], if and only if $\lnot(b_0 \lor b_1) \land (b_2 \land b_3) \land \lnot(b_4 \land (b_5 \lor b_6)) = 1$. An important observation here is that ranges of characters may require fewer operations than individual characters, and multiple classes can share the classification cost.

Consider, for example, the XML source data stream shown in the first line of the accompanying figure. The remaining lines of the figure show several parallel bit streams that are computed in Parabix-style parsing, with each bit of each stream in one-to-one correspondence to the source character code units of the input stream. For clarity, 1 bits are denoted with 1 in each stream and 0 bits are represented as underscores. The first bit stream shown is that for the opening angle brackets that represent tag openers in XML. The second and third streams show a partition of the tag openers into start tag marks and end tag marks, depending on whether the character immediately following the opener is a \verb:/: or not.
The remaining three lines show streams that can be computed in subsequent parsing (using the technique of \bitstream{} addition \cite{cameron-EuroPar2011}), namely streams marking the element names, attribute names and attribute values of tags.

Two intuitions may help explain how the Parabix approach can lead to improved XML parsing performance. The first is that the use of the full register width offers a considerable information advantage over sequential byte-at-a-time parsing. That is, sequential processing of bytes uses just 8 bits of each register, greatly limiting the processor resources that are effectively being used at any one time. The second is that byte-at-a-time scanning loops are often just computing a single bit of information per iteration: is the scan complete yet? Rather than computing these individual decision-bits, an approach that computes many of them in parallel (e.g., 128 bytes at a time using 128-bit registers) should provide substantial benefit.

Previous studies have shown that the Parabix approach improves many aspects of XML processing, including transcoding \cite{Cameron2008}, character classification and validation, tag parsing and well-formedness checking. The first Parabix parser used processor bit scan instructions to considerably accelerate sequential scanning loops for individual characters \cite{CameronHerdyLin2008}. Recent work has incorporated a method of parallel scanning using \bitstream{} addition \cite{cameron-EuroPar2011}, as well as combining SIMD methods with 4-stage pipeline parallelism to further improve throughput \cite{HPCA2012}. Although these research prototypes handled the full syntax of schema-less XML documents, they lacked the functionality required by full XML parsers. Commercial XML processors support transcoding of multiple character sets and can parse and validate against multiple document vocabularies.
Additionally, they provide API facilities beyond those found in research prototypes, including the widely used SAX, SAX2 and DOM interfaces.
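The transposition and character-class steps described above can be sketched in serial reference form (a sketch, not the SIMD implementation; the layout with plane 0 holding each byte's most significant bit is an assumption of this sketch):

```cpp
#include <cassert>
#include <cstdint>

// Transpose one block of up to 64 bytes into 8 basis bit streams:
// bit k of plane j holds bit j (MSB first) of byte k.
struct BasisBits { uint64_t b[8]; };

BasisBits transpose(const unsigned char *block, unsigned len) {
    BasisBits s = {};
    for (unsigned k = 0; k < len; ++k)
        for (int j = 0; j < 8; ++j)
            if (block[k] & (0x80u >> j))
                s.b[j] |= 1ull << k;
    return s;
}

// Character class for '<' (0x3C = 00111100):
// not(b0|b1) and (b2&b3) and (b4&b5) and not(b6|b7).
uint64_t classify_langle(const BasisBits &s) {
    return ~(s.b[0] | s.b[1]) & (s.b[2] & s.b[3])
         &  (s.b[4] & s.b[5]) & ~(s.b[6] | s.b[7]);
}
```

For the input \verb|<a><b>|, the resulting left-angle stream has 1 bits at positions 0 and 3.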
Sequential vs. Parallel Paradigm

Xerces—like all traditional XML parsers—processes XML documents sequentially. Each character is examined to distinguish between the XML-specific markup, such as a left angle bracket \verb:<:, and the content held within the document. As the parser progresses through the document, it alternates between markup scanning, validation and content processing modes. In other words, Xerces belongs to an equivalent class of applications termed FSM applications\footnote{Herein, FSM applications are considered software systems whose behaviour is defined by the inputs, current state and the events associated with transitions of states.}. Each state transition indicates the processing context of subsequent characters. Unfortunately, textual data tends to be unpredictable and any character could induce a state transition.

Parabix-style XML parsers utilize a concept of layered processing. A block of source text is transformed into a set of lexical \bitstream{}s, which undergo a series of operations that can be grouped into logical layers, e.g., transposition, character classification, and lexical analysis. Each layer is pipeline parallel and requires neither speculation nor pre-parsing stages \cite{HPCA2012}. To meet the API requirements of the document-ordered Xerces output, the results of the Parabix processing layers must be interleaved to produce the equivalent behaviour.
Architecture
Overview \icXML{} is more than an optimized version of Xerces. Many components were grouped, restructured and rearchitected with pipeline parallelism in mind. In this section, we highlight the core differences between the two systems. As shown in Figure \ref{fig:xerces-arch}, Xerces is comprised of five main modules: the transcoder, reader, scanner, namespace binder, and validator. The {\it Transcoder} converts source data into UTF-16 before Xerces parses it as XML; the majority of the character set encoding validation is performed as a byproduct of this process. The {\it Reader} is responsible for the streaming and buffering of all raw and transcoded (UTF-16) text. It tracks the current line/column position, performs line-break normalization and validates context-specific character set issues, such as tokenization of qualified names. The {\it Scanner} pulls data through the reader and constructs the intermediate representation (IR) of the document; it deals with all issues related to entity expansion, validates the XML well-formedness constraints and any character set encoding issues that cannot be completely handled by the reader or transcoder (e.g., surrogate characters, validation and normalization of character references, etc.). The {\it Namespace Binder} is a core piece of the element stack. It handles namespace scoping issues between different XML vocabularies. This allows the scanner to select the correct schema grammar structures. The {\it Validator} takes the IR produced by the Scanner (and potentially annotated by the Namespace Binder) and assesses whether the final output matches the user-defined DTD and schema grammar(s) before passing it to the end user. In \icXML{}, functions are grouped into logical components. As shown in Figure \ref{fig:icxml-arch}, two major categories exist: (1) the \PS{} and (2) the \MP{}. All tasks in (1) use the Parabix Framework \cite{HPCA2012}, which represents data as a set of parallel \bitstream{}s. 
The {\it Character Set Adapter}, discussed in Section \ref{arch:character-set-adapter}, mirrors Xerces's Transcoder duties; however, instead of producing UTF-16 it produces a set of lexical \bitstream{}s, similar to those shown in Figure \ref{fig:parabix1}. These lexical \bitstream{}s are later transformed into UTF-16 in the \CSG{}, after additional processing is performed. The first precursor to producing UTF-16 is the {\it Parallel Markup Parser} phase. It takes the lexical streams and produces a set of marker \bitstream{}s in which a 1-bit identifies significant positions within the input data. One \bitstream{} is created for each critical piece of information, such as the beginning and ending of start tags, end tags, element names, attribute names, attribute values and content. Intra-element well-formedness validation is performed as an artifact of this process. Like Xerces, \icXML{} must provide the line and column position of each error. The {\it Line-Column Tracker} uses the lexical information to keep track of the document position(s) through the use of an optimized population count algorithm, described in Section \ref{section:arch:errorhandling}. From here, two data-independent branches exist: the Symbol Resolver and the Content Preparation Unit. A typical XML file contains few unique element and attribute names—but each of them will occur frequently. \icXML{} stores these as distinct data structures, called symbols, each with their own global identifier (GID). Using the symbol marker streams produced by the Parallel Markup Parser, the {\it Symbol Resolver} scans through the raw data to produce a sequence of GIDs, called the {\it symbol stream}. The final components of the \PS{} are the {\it Content Preparation Unit} and the {\it \CSG{}}. 
The former takes the (transposed) basis \bitstream{}s and selectively filters them, according to the information provided by the Parallel Markup Parser, and the latter transforms the filtered streams into the tagged UTF-16 {\it content stream}, discussed in Section \ref{section:arch:contentstream}. Combined, the symbol and content streams form \icXML{}'s compressed IR of the XML document. The {\it \MP{}}~parses the IR to validate and produce the sequential output for the end user. The {\it Final WF checker} performs inter-element well-formedness validation that would be too costly to perform in bit space, such as ensuring every start tag has a matching end tag. Xerces's namespace binding functionality is replaced by the {\it Namespace Processor}. Unlike Xerces, it is a discrete phase that produces a series of URI identifiers (URI IDs), the {\it URI stream}, which are associated with each symbol occurrence. This is discussed in Section \ref{section:arch:namespacehandling}. Finally, the {\it Validation} layer implements Xerces's validator. However, preprocessing associated with each symbol greatly reduces the work of this stage.
Character Set Adapters In Xerces, all input is transcoded into UTF-16 to simplify the parsing costs of Xerces itself and provide the end-consumer with a single encoding format. In the important case of UTF-8 to UTF-16 transcoding, the transcoding costs can be significant, because of the need to decode and classify each byte of input, mapping variable-length UTF-8 byte sequences into 16-bit UTF-16 code units with bit manipulation operations. In other cases, transcoding may involve table look-up operations for each byte of input.  In any case, transcoding imposes at least the cost of buffer copying. In \icXML{}, however,  the concept of Character Set Adapters (CSAs) is used to minimize transcoding costs. Given a specified input encoding, a CSA is responsible for checking that input code units represent valid characters, mapping the characters of the encoding into the appropriate \bitstream{}s for XML parsing actions (i.e., producing the lexical item streams), as well as supporting ultimate transcoding requirements.   All of this work is performed using the parallel \bitstream{} representation of the source input. An important observation is that many character sets are an extension to the legacy 7-bit ASCII character set.  This includes the various ISO Latin character sets, UTF-8, UTF-16 and many others. Furthermore, all significant characters for parsing XML are confined to the ASCII repertoire.   Thus, a single common set of lexical item calculations serves to compute lexical item streams for all such ASCII-based character sets. A second observation is that—regardless of which character set is used—quite often all of the characters in a particular block of input will be within the ASCII range. This test is very simple to perform using the \bitstream{} representation: confirm that the bit 0 stream is zero for the entire block.   For blocks satisfying this test, all logic dealing with non-ASCII characters can simply be skipped. 
Transcoding to UTF-16 becomes trivial, as the high eight \bitstream{}s of the UTF-16 form are each set to zero in this case. A third observation is that repeated transcoding of the names of XML elements, attributes and so on can be avoided by using a look-up mechanism. That is, the first occurrence of each symbol is stored in a look-up table mapping the input encoding to a numeric symbol ID.   Transcoding of the symbol is applied at this time.  Subsequent look-up operations can avoid transcoding by simply retrieving the stored representation. As symbol look-up is required to apply various XML validation rules, this achieves the effect of transcoding each occurrence without additional cost. The cost of individual character transcoding is avoided whenever a block of input is confined to the ASCII subset and for all but the first occurrence of any XML element or attribute name. Furthermore, when transcoding is required, the parallel \bitstream{} representation supports efficient transcoding operations. In the important case of UTF-8 to UTF-16 transcoding, the corresponding UTF-16 \bitstream{}s can be calculated in bit parallel fashion based on the UTF-8 streams \cite{Cameron2008}, and all but the final bytes of multi-byte sequences can be marked for deletion as discussed in the following subsection. In other cases, transcoding within a block need only be applied for non-ASCII bytes, which are conveniently identified by iterating through the bit 0 stream using bit scan operations.
Content Stream A relatively unique concept for \icXML{} is the use of a filtered content stream. Rather than parsing an XML document in its original format, the input is transformed into one that is easier for the parser to iterate through and produce the sequential output. The source data is transformed through the parallel filtering algorithm, described in section \ref{sec:parfilter}. Combined with the symbol stream, the parser traverses the content stream to effectively reconstruct the input document in its output form. The initial {\tt\it 0} indicates an empty content string. The following \verb|>| indicates that a start tag without any attributes is the first element in this text and the first unused symbol, document, is the element name. Succeeding that is the content string fee, which is null-terminated in accordance with the Xerces API specification. Unlike Xerces, no memory-copy operations are required to produce these strings, which, as Figure~\ref{fig:xerces-profile} shows, accounts for 6.83% of Xerces's execution time. Additionally, it is cheap to locate the terminal character of each string: using the String End \bitstream{}, the \PS{} can effectively calculate the offset of each null character in the content stream in parallel, which in turn means the parser can directly jump to the end of every string without scanning for it. Following \verb|fee| is a \verb|=|, which marks the existence of an attribute. Because all of the intra-element validation was performed in the \PS{}, this must be a legal attribute. Since attributes can only occur within start tags and must be accompanied by a textual value, the next symbol in the symbol stream must be the element name of a start tag, the following one must be the name of the attribute, and the string that follows the \verb|=| must be its value. However, the subsequent \verb|=| is not treated as an independent attribute because the parser has yet to read a \verb|>|, which marks the end of a start tag. 
Thus only one symbol is taken from the symbol stream and it (along with the string value) is added to the element. Eventually the parser reaches a \verb|/|, which marks the existence of an end tag. Every end tag requires an element name, which means it requires a symbol. Inter-element validation is performed whenever an end tag is detected to ensure that the appropriate scope-nesting rules have been applied.
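Locating every string terminator via the String End \bitstream{}, as described above, amounts to repeated bit scans rather than character-by-character scanning. A minimal sketch (64-bit words stand in for the real block size):

```cpp
#include <cstdint>
#include <vector>

// str_end marks the position of each null terminator within a 64-position
// block of the content stream. Each iteration finds one terminator with a
// bit scan and clears it, so the parser can jump string-to-string without
// examining the characters in between.
std::vector<int> string_ends(uint64_t str_end) {
    std::vector<int> offsets;
    while (str_end) {
        offsets.push_back(__builtin_ctzll(str_end));
        str_end &= str_end - 1;  // clear the lowest set bit
    }
    return offsets;
}
```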
Namespace Handling In XML, namespaces prevent naming conflicts when multiple vocabularies are used together. This is especially important when a vocabulary has an application-dependent meaning, such as when XML or SVG documents are embedded within XHTML files. Namespaces are bound to uniform resource identifiers (URIs), which are strings used to identify specific names or resources. On line 1 of Figure \ref{fig:namespace1}, the \verb|xmlns| attribute instructs the XML processor to bind the prefix p to the URI 'pub.net' and the default (empty) prefix to book.org. Thus to the XML processor, the \verb|title| on line 2 and \verb|price| on line 4 read as \verb|"book.org":title| and \verb|"book.org":price| respectively, whereas on lines 3 and 5, \verb|p:name| and \verb|price| are seen as \verb|"pub.net":name| and \verb|"pub.net":price|. Even though the actual element name is \verb|price| in both cases, due to namespace scoping rules they are viewed as two uniquely named items because the current vocabulary is determined by the namespace(s) that are in scope. In both Xerces and \icXML{}, every URI has a one-to-one mapping to a URI ID. These persist for the lifetime of the application through the use of a global URI pool. Xerces maintains a stack of namespace scopes that is pushed (popped) every time a start tag (end tag) occurs in the document. Because a namespace declaration affects the entire element, it must be processed prior to grammar validation. This is a costly process considering that a typical namespaced XML document only comes in one of two forms: (1) those that declare a set of namespaces upfront and never change them, and (2) those that repeatedly modify the namespaces in predictable patterns. 
For that reason, \icXML{} contains an independent namespace stack and utilizes bit vectors to cheaply perform namespace scope resolution. When a prefix is declared (e.g., \verb|xmlns:p="pub.net"|), a namespace binding is created that maps the prefix (prefixes are assigned Prefix IDs in the symbol resolution process) to the URI. Each unique namespace binding has a unique namespace ID (NSID) and every prefix contains a bit vector marking every NSID that has ever been associated with it within the document. For example, in Table \ref{tbl:namespace1}, the prefix binding sets of \verb|p| and \verb|xmlns| would be \verb|01| and \verb|11| respectively. To resolve the in-scope namespace binding for each prefix, a bit vector of the currently visible namespaces is maintained by the system. By ANDing the prefix bit vector with the currently visible namespaces, the in-scope NSID can be found using a bit-scan intrinsic. A namespace binding table, similar to Table \ref{tbl:namespace1}, provides the actual URI ID. To ensure that scoping rules are adhered to, whenever a start tag is encountered, any modification to the currently visible namespaces is calculated and stored within a stack of bit vectors denoting the locally modified namespace bindings. When an end tag is found, the currently visible namespaces are XORed with the vector at the top of the stack. This allows any number of changes to be performed at each scope level in constant time.
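The AND/bit-scan resolution and XOR scope-exit steps described above can be sketched directly (a simplified model: 64-bit vectors over NSIDs, with names chosen here for illustration):

```cpp
#include <cstdint>

// prefix_bindings: bit vector of every NSID ever bound to this prefix.
// visible: bit vector of the currently visible namespace bindings.
// ANDing the two and bit-scanning yields the in-scope NSID.
int in_scope_nsid(uint64_t prefix_bindings, uint64_t visible) {
    uint64_t hit = prefix_bindings & visible;
    return hit ? __builtin_ctzll(hit) : -1;  // -1: prefix not bound in scope
}

// On an end tag, the locally modified bindings recorded at the matching
// start tag are XORed back out, undoing any number of changes at once.
uint64_t leave_scope(uint64_t visible, uint64_t local_changes) {
    return visible ^ local_changes;
}
```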
Error Handling Xerces outputs error messages in two ways: through the programmer API and as thrown objects for fatal errors. As Xerces parses a file, it uses context-dependent logic to assess whether the next character is legal; if not, the current state determines the type and severity of the error. \icXML{} emits errors in a similar manner—but how it discovers them is substantially different. Recall that in Figure \ref{fig:icxml-arch}, \icXML{} is divided into two sections: the \PS{} and the \MP{}, each with its own system for detecting and producing error messages. Within the \PS{}, all computations are performed in parallel, a block at a time. Errors are derived as artifacts of \bitstream{} calculations, with a 1-bit marking the byte position of an error within a block, and the type of error is determined by the equation that discovered it. The difficulty of error processing in this section is that in Xerces the line and column number must be given with every error production. Two major issues exist because of this: (1) line position adheres to XML line-break normalization rules; as such, some sequences of characters, e.g., a carriage return followed by a line feed, are counted as a single new line character. (2) Column position is counted in characters, not bytes or code units; thus multi-code-unit code points and surrogate character pairs are all counted as a single column position. Note that typical XML documents are error-free but the calculation of the line/column position is a constant overhead in Xerces. To reduce this, \icXML{} pushes the bulk cost of the line/column calculation to the occurrence of the error and performs the minimal amount of book-keeping necessary to facilitate it. \icXML{} leverages the byproducts of the Character Set Adapter (CSA) module and amalgamates the information within the Line Column Tracker (LCT). One of the CSA's major responsibilities is transcoding an input text. 
During this process, white-space normalization rules are applied and multi-code-unit and surrogate characters are detected and validated. A {\it line-feed \bitstream{}}, which marks the positions of the normalized newline characters, is a natural derivative of this process. Using an optimized population count algorithm, the line count can be summarized cheaply for each valid block of text. Column position is more difficult to calculate. It is possible to scan backwards through the \bitstream{} of newline characters to determine the distance (in code units) between the position at which an error was detected and the last line feed. However, this distance may exceed the actual character position for the reasons discussed in (2). To handle this, the CSA generates a {\it skip mask} \bitstream{} by ORing together many relevant \bitstream{}s, such as all trailing multi-code-unit and surrogate characters, and any characters that were removed during the normalization process. When an error is detected, the sum of those skipped positions is subtracted from the distance to determine the actual column number. The \MP{} is a state-driven machine. As such, error detection within it is very similar to Xerces. However, reporting the correct line/column is a much more difficult problem. The \MP{} parses the content stream, which is a series of tagged UTF-16 strings. Each string is normalized in accordance with the XML specification. All symbol data and unnecessary whitespace are eliminated from the stream; thus it is impossible to derive the current location using only the content stream. To calculate the location, the \MP{} borrows three additional pieces of information from the \PS{}: the line-feed, skip mask, and a {\it deletion mask stream}, which is a \bitstream{} denoting the (code-unit) position of every datum that was suppressed from the source during the production of the content stream. 
Armed with these, it is possible to calculate the actual line/column using the same system as the \PS{} until the sum of the negated deletion mask stream is equal to the current position.
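A scalar sketch of the line/column calculation just described, under simplifying assumptions: 64-bit words stand in for the real block size, lines are counted from 1, and the column is reported as a 0-based character offset within its line.

```cpp
#include <cstdint>

// linefeeds marks normalized newline positions within the block; skip is
// the skip mask (trailing bytes of multi-code-unit sequences, surrogate
// pairs, characters removed by normalization); err_pos is the code-unit
// position of the detected error (0..63).

// Line number: one population count over the newlines preceding the error.
int line_of(uint64_t linefeeds, int err_pos) {
    uint64_t before = err_pos ? ((1ULL << err_pos) - 1) : 0;
    return 1 + __builtin_popcountll(linefeeds & before);
}

// Column: code-unit distance from the last line feed, minus the skipped
// positions in that span, giving a character (not code-unit) offset.
int column_of(uint64_t linefeeds, uint64_t skip, int err_pos) {
    uint64_t before = err_pos ? ((1ULL << err_pos) - 1) : 0;
    uint64_t lf_before = linefeeds & before;
    // position just after the last line feed preceding the error (0 if none)
    int start = lf_before ? 64 - __builtin_clzll(lf_before) : 0;
    uint64_t span = before & ~(start ? ((1ULL << start) - 1) : 0);
    return (err_pos - start) - __builtin_popcountll(skip & span);
}
```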
Based on the character class compiler, we are currently investigating the construction of a regular expression compiler that can implement bit-stream based parallel regular-expression matching, similar to that described previously for parallel parsing by \bitstream{} addition. This compiler works with the assumption that bitstream regular-expression definitions are deterministic; no backtracking is permitted with the parallel bit stream representation. In XML applications, this compiler is primarily intended to enforce regular-expression constraints on string datatype specifications found in XML schema.
Multithreading with Pipeline Parallelism As discussed in section \ref{background:xerces}, Xerces can be considered a FSM application. These are ``embarrassingly sequential''~\cite{Asanovic:EECS-2006-183} and notoriously difficult to parallelize. However, \icXML{} is designed to organize processing into logical layers. In particular, layers within the \PS{} are designed to operate over significant segments of input data before passing their outputs on for subsequent processing.  This fits well into the general model of pipeline parallelism, in which each thread is in charge of a single module or group of modules. The most straightforward division of work in \icXML{} is to separate the \PS{} and the \MP{} into two separate stages. The resultant application, {\it\icXMLp{}}, is a coarse-grained software-pipeline application. In this case, the \PS{} thread $T_1$ reads 16k of XML input $I$ at a time and produces the content, symbol and URI streams, then stores them in a pre-allocated shared data structure $S$. The \MP{} thread $T_2$ consumes $S$, performs well-formedness and grammar-based validation, and then provides the parsed XML data to the application through the Xerces API. The shared data structure is implemented using a ring buffer, where every entry contains an independent set of data streams. In the examples of Figures \ref{threads_timeline1} and \ref{threads_timeline2}, the ring buffer has four entries. A lock-free mechanism is applied to ensure that each entry can only be read or written by one thread at a time. In Figure \ref{threads_timeline1} the processing time of $T_1$ is longer than that of $T_2$; thus $T_2$ always waits for $T_1$ to write to the shared memory. Figure \ref{threads_timeline2} illustrates the scenario in which $T_1$ is faster and must wait for $T_2$ to finish reading the shared data before it can reuse the memory space. Overall, our design is intended to benefit a range of applications. 
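The lock-free ring buffer between $T_1$ and $T_2$ can be sketched as a standard single-producer/single-consumer queue (a minimal model under stated assumptions: the entry type and capacity are illustrative, not \icXML{}'s actual structures). The head index is advanced only by the producer and the tail only by the consumer, so no locks are required and each entry is read or written by one thread at a time.

```cpp
#include <atomic>
#include <cstddef>

// Single-producer/single-consumer ring buffer. push fails when full
// (producer T1 must wait); pop fails when empty (consumer T2 must wait).
template <typename T, size_t N>
class PipelineRing {
    T slots[N];
    std::atomic<size_t> head{0}, tail{0};
public:
    bool push(const T& item) {                 // producer: Parabix Subsystem
        size_t h = head.load(std::memory_order_relaxed);
        if (h - tail.load(std::memory_order_acquire) == N) return false;
        slots[h % N] = item;
        head.store(h + 1, std::memory_order_release);
        return true;
    }
    bool pop(T& item) {                        // consumer: Markup Processor
        size_t t = tail.load(std::memory_order_relaxed);
        if (head.load(std::memory_order_acquire) == t) return false;
        item = slots[t % N];
        tail.store(t + 1, std::memory_order_release);
        return true;
    }
};
```

The acquire/release pairing guarantees that a consumer observing the advanced head index also observes the stream data written into the corresponding slot.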
Conceptually, we consider two design points. In the first, the parsing performed by the \PS{} dominates at 67% of the overall cost, with the cost of application processing (including the driver logic within the \MP{}) at 33%. The second is almost the opposite scenario: the cost of application processing dominates at 60%, while the cost of XML parsing represents an overhead of 40%. Our design is predicated on a goal of using the Parabix framework to achieve a 50% to 100% improvement in the parsing engine itself. The best case scenario is a 100% improvement of the \PS{} for the design point in which XML parsing dominates at 67% of the total application cost. In this case, the single-threaded \icXML{} should achieve a 1.5x speedup over Xerces, so that the total application cost reduces to 67% of the original. However, in \icXMLp{}, our ideal scenario gives us two well-balanced threads each performing about 33% of the original work. In this case, Amdahl's law predicts that we could expect up to a 3x speedup at best. At the other extreme of our design range, we consider an application in which core parsing cost is 40%.   Assuming the 2x speedup of the \PS{} over the corresponding Xerces core, single-threaded \icXML{} delivers a 25% speedup.   However, the most significant aspect of our two-stage multi-threaded design then becomes the ability to hide the entire latency of parsing within the serial time required by the application.   In this case, we achieve an overall speedup in processing time of 1.67x. Although the structure of the \PS{} allows division of the work into several pipeline stages and has been demonstrated to be effective for four pipeline stages in a research prototype \cite{HPCA2012}, our analysis here suggests that further pipelining of work within the \PS{} is not worthwhile if the cost of application logic is as little as 33% of the end-to-end cost using Xerces.  
To achieve benefits of further parallelization with multi-core technology, there would need to be reductions in the cost of application logic that could match reductions in core parsing cost.
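The speedup arithmetic for the two design points above follows directly from Amdahl's law; a brief worked summary using the percentages stated in the text:

```latex
% Amdahl's law: parsing fraction p accelerated by factor s
S_{\text{single}} = \frac{1}{(1-p) + p/s}

% Design point 1: p = 0.67, s = 2 (100% improvement of the Parabix Subsystem)
S_{\text{single}} = \frac{1}{0.33 + 0.67/2} = \frac{1}{0.665} \approx 1.5\times

% Pipelined: two balanced stages of about 0.33 each;
% throughput is limited by the slower stage
S_{\text{pipe}} \le \frac{1}{0.33} \approx 3\times

% Design point 2: p = 0.40, s = 2
S_{\text{single}} = \frac{1}{0.60 + 0.40/2} = \frac{1}{0.80} = 1.25\times

% Pipelined: parsing latency fully hidden behind the 0.60 of application time
S_{\text{pipe}} = \frac{1}{0.60} \approx 1.67\times
```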
Unbounded Bit Stream Compilation
Performance We evaluate \xerces{}, \icXML{} and \icXMLp{} using two benchmarking applications: the Xerces C++ SAXCount sample application, and a real-world GML-to-SVG transformation application. We investigated XML parser performance using an Intel Core i7 quad-core (Sandy Bridge) processor (3.40GHz, 4 physical cores, 8 threads (2 per core), 32+32 kB (per core) L1 cache, 256 kB (per core) L2 cache, 8 MB L3 cache) running the 64-bit version of Ubuntu 12.04 (Linux). We analyzed the execution profiles of each XML parser using the performance counters found in the processor. We chose several key hardware events that provide insight into the profile of each application and indicate whether the processor is doing useful work. The events included in our study are: processor cycles, branch instructions, branch mispredictions, and cache misses. The Performance Application Programming Interface (PAPI) Version 5.5.0 \cite{papi} toolkit was installed on the test system to facilitate the collection of hardware performance monitoring statistics. In addition, we used the Linux perf \cite{perf} utility to collect per-core hardware events.
Xerces C++ SAXCount Xerces comes with sample applications that demonstrate salient features of the parser. SAXCount is the simplest such application: it counts the elements, attributes and characters of a given XML file using the (event-based) SAX API and prints out the totals. Table \ref{XMLDocChars} shows the document characteristics of the XML input files selected for the Xerces C++ SAXCount benchmark. The jaw.xml file represents document-oriented XML inputs and contains the three-byte and four-byte UTF-8 sequences required for the UTF-8 encoding of Japanese characters. The remaining data files are data-oriented XML documents and consist entirely of single-byte encoded ASCII characters. A key predictor of the overall parsing performance of an XML file is markup density\footnote{ Markup Density: the ratio of markup bytes used to define the structure of the document vs. its file size.}. This metric has substantial influence on the performance of traditional recursive descent XML parsers because it directly corresponds to the number of state transitions that occur when parsing a document. We use a mixture of document-oriented and data-oriented XML files to analyze performance over a spectrum of markup densities. Figure \ref{perf_SAX} compares the performance of Xerces, \icXML{} and pipelined \icXML{} in terms of CPU cycles per byte for the SAXCount application. The speedup for \icXML{} over Xerces is 1.3x to 1.8x. With two threads on the multicore machine, \icXMLp{} can achieve a speedup of up to 2.7x. Xerces is substantially slowed by dense markup but \icXML{} is less affected through a reduction in branches and the use of parallel-processing techniques. \icXMLp{} performs better as markup density increases because the work performed by each stage is well balanced in this application.
GML2SVG
The Catalog of XML Bit Streams presented earlier consists of a set of abstract, unbounded bit streams, each in one-to-one correspondence with the input bytes of a text file. Determining how these bit streams are implemented using fixed-width SIMD registers, and possibly processed in fixed-length buffers that represent some multiple of the register width, is a source of considerable programming complexity. The general goal of our compilation strategy in this case is to allow operations to be programmed in terms of unbounded bit streams and then automatically reduced to efficient low-level code with the application of a systematic code generation strategy for handling block and buffer boundary crossings. This is work currently in progress.
Conclusion and Future Work This paper is the first case study documenting the significant performance benefits that may be realized through the integration of parallel \bitstream{} technology into existing widely-used software libraries. In the case of the Xerces-C++ XML parser, the combined integration of SIMD and multicore parallelism was shown to be capable of producing dramatic increases in throughput and reductions in branch mispredictions and cache misses. The modified parser, going under the name \icXML{}, is designed to provide the full functionality of the original Xerces library with complete compatibility of APIs.  Although substantial re-engineering was required to realize the performance potential of parallel technologies, this is an important case study demonstrating the general feasibility of these techniques. The further development of \icXML{} to move beyond 2-stage pipeline parallelism is ongoing, with realistic prospects for four reasonably balanced stages within the library.  For applications such as GML2SVG which are dominated by time spent on XML parsing, such a multistage pipelined parsing library should offer substantial benefits. The example of XML parsing may be considered prototypical of finite-state machine applications, which have sometimes been considered ``embarrassingly sequential'' and so difficult to parallelize that ``nothing works.''  So the case study presented here should be considered an important data point in making the case that parallelization can indeed be helpful across a broad array of application types. To overcome the software engineering challenges in applying parallel \bitstream{} technology to existing software systems, it is clear that better library and tool support is needed. 
The techniques used in the implementation of \icXML{} and documented in this paper could well be generalized for applications in other contexts and automated through the creation of compiler technology specifically supporting parallel \bitstream{} programming.
Conclusion Parallel bit stream technology offers the opportunity to dramatically speed up the core XML processing components used to implement virtually any XML API. Character validation and transcoding, whitespace processing, and parsing up to and including the full validation of tag syntax can be handled fully in parallel using bit stream methods. Bit streams to mark the positions of all element names, attribute names and attribute values can also be produced, followed by fast bit scan operations to generate position and length values. Beyond bit streams, byte-oriented SIMD processing of names and numerals can also accelerate performance beyond sequential byte-at-a-time methods. Advances in processor architecture are likely to further amplify the performance of parallel bit stream technology over traditional byte-at-a-time processing over the next decade. Improvements to SIMD register width, register complement and operation format can all result in further gains. New SIMD instruction set features such as inductive doubling support, parallel extract and deposit instructions, bit interleaving and scatter/gather capabilities should also result in significant speed-ups. Leveraging the intraregister parallelism of parallel bit stream technology within SIMD registers to take advantage of intrachip parallelism on multicore processors should accelerate processing further. Technology transfer using a patent-based open-source business model is a further goal of our work, with a view to widespread deployment of parallel bit stream technology in XML processing stacks implementing a variety of APIs. The feasibility of substantial performance improvement in replacing technology implementing existing APIs has been demonstrated even in complex software architectures involving delivery of performance benefits across the JNI boundary. 
We are seeking to accelerate these deployment efforts both through the development of compiler technology to reliably apply these methods to a variety of architectures and through identifying interested collaborators using open-source or commercial models.
Bibliography