# Changeset 3040

Timestamp:
Apr 18, 2013, 8:20:23 PM

Location:
docs/Balisage13
Files:
1 deleted
2 edited

To facilitate this, the input data is first transposed into a set of basis bit streams: each input byte is represented by 8 basis bit streams, b0 … b7. Boolean-logic operations\footnote{∧, ∨ and ¬ denote the boolean AND, OR and NOT operators.} are used to classify the input bits into a set of character-class bit streams, which identify key characters (or groups of characters) with a 1. For example, one of the fundamental characters in XML is the left angle bracket. A character is a '<' if and only if ¬(b0 ∨ b1) ∧ (b2 ∧ b3) ∧ (b4 ∧ b5) ∧ ¬(b6 ∨ b7) = 1. Similarly, a character is numeric ([0-9]) if and only if ¬(b0 ∨ b1) ∧ (b2 ∧ b3) ∧ ¬(b4 ∧ (b5 ∨ b6)) = 1.
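As a concrete illustration of these equations (not icXML's actual implementation, which operates on SIMD registers), the following sketch uses Python integers as unbounded bitstreams: bit i of each stream corresponds to byte i of the input, and b[0] holds each byte's most significant bit.

```python
def basis_bits(data: bytes):
    """Transpose input bytes into 8 basis bit streams b[0..7];
    b[0] is the stream of most-significant bits."""
    b = [0] * 8
    for i, byte in enumerate(data):
        for k in range(8):
            if byte & (0x80 >> k):
                b[k] |= 1 << i
    return b

def char_class_lt(b):
    """'<' (0x3C): not(b0|b1) & (b2&b3) & (b4&b5) & not(b6|b7)."""
    return ~(b[0] | b[1]) & (b[2] & b[3]) & (b[4] & b[5]) & ~(b[6] | b[7])

def char_class_digit(b):
    """[0-9]: not(b0|b1) & (b2&b3) & not(b4 & (b5|b6))."""
    return ~(b[0] | b[1]) & (b[2] & b[3]) & ~(b[4] & (b[5] | b[6]))

b = basis_bits(b"<a>1")
print(bin(char_class_lt(b)))     # marks the '<' at byte 0
print(bin(char_class_digit(b)))  # marks the '1' at byte 3
```

Each classification is a handful of boolean operations applied to whole blocks of bytes at once, which is where the SIMD benefit comes from.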
An important observation here is that ranges of characters may require fewer operations than individual characters. [Table: the transposed basis-bit representation b0 … b7 of example input bytes, together with derived streams; the greyed lines show streams that can be computed in subsequent parsing (using the technique of bitstream addition \cite{cameron-EuroPar2011}), namely streams marking the element names, attribute names and attribute values of tags.] Earlier Parabix parsers relied on sequential scanning loops for individual characters \cite{CameronHerdyLin2008}. Recent work has incorporated a method of parallel scanning using bitstream addition \cite{cameron-EuroPar2011}, as well as combining SIMD methods with 4-stage pipeline parallelism to further improve throughput \cite{HPCA2012}.
Parabix-style XML parsers utilize a concept of layered processing. A block of source text is transformed into a set of lexical bitstreams, which undergo a series of operations that can be grouped into logical layers, e.g., transposition, character classification, and lexical analysis. icXML is more than an optimized version of Xerces. Many components were grouped, restructured and rearchitected with pipeline parallelism in mind. In this section, we highlight the core differences between the two systems. As shown in Figure \ref{fig:xerces-arch}, Xerces comprises five main modules: the transcoder, reader, scanner, namespace binder, and validator. The Transcoder converts source data into UTF-16 before Xerces parses it as XML; the majority of the character-set encoding validation is performed as a byproduct of this process. The Reader is responsible for the streaming and buffering of all raw and transcoded (UTF-16) text. It tracks the current line/column position, performs line-break normalization and validates context-specific character-set issues, such as tokenization of qualified names. The Scanner pulls data through the reader and constructs the intermediate representation (IR) of the document; it deals with all issues related to entity expansion, and validates the XML well-formedness constraints and any character-set encoding issues that cannot be completely handled by the reader or transcoder (e.g., surrogate characters, validation and normalization of character references, etc.).
The Namespace Binder is a core piece of the element stack. It handles namespace scoping issues between different XML vocabularies. This allows the scanner to properly select the correct schema grammar structures. The Validator takes the IR produced by the Scanner (and potentially annotated by the Namespace Binder) and assesses whether the final output matches the user-defined DTD and schema grammar(s) before passing it to the end user. In icXML, functions are grouped into logical components. As shown in Figure \ref{fig:icxml-arch}, two major categories exist: (1) the Parabix Subsystem and (2) the Markup Processor. All tasks in (1) use the Parabix Framework \cite{HPCA2012}, which represents data as a set of parallel bitstreams. The Character Set Adapter, discussed in Section \ref{arch:character-set-adapter}, mirrors Xerces's Transcoder duties; however, instead of producing UTF-16 it produces a set of lexical bitstreams, similar to those shown in Figure \ref{fig:parabix1}. These lexical bitstreams are later transformed into UTF-16 in the Content Stream Generator, after additional processing is performed. The first precursor to producing UTF-16 is the Parallel Markup Parser phase. It takes the lexical streams and produces a set of marker bitstreams in which a 1-bit identifies significant positions within the input data.
One bitstream is created for each critical piece of information, such as the beginning and ending of start tags, end tags, element names, attribute names, attribute values and content. Intra-element well-formedness validation is performed as an artifact of this process. Like Xerces, icXML must provide the line and column position of each error. The Line-Column Tracker uses the lexical information to keep track of the document position(s) through the use of an optimized population count algorithm, described in Section \ref{section:arch:errorhandling}. From here, two data-independent branches exist: the Symbol Resolver and the Content Preparation Unit. A typical XML file contains few unique element and attribute names, but each of them will occur frequently. icXML stores these as distinct data structures, called symbols, each with their own global identifier (GID).
Using the symbol marker streams produced by the Parallel Markup Parser, the Symbol Resolver scans through the raw data to produce a sequence of GIDs, called the symbol stream. The final components of the Parabix Subsystem are the Content Preparation Unit and the Content Stream Generator. The former takes the (transposed) basis bitstreams and selectively filters them, according to the information provided by the Parallel Markup Parser; the latter transforms the filtered streams into the tagged UTF-16 content stream, discussed in Section \ref{section:arch:contentstream}. Combined, the symbol and content streams form icXML's compressed IR of the XML document. The Markup Processor parses the IR to validate and produce the sequential output for the end user. The Final WF Checker performs inter-element well-formedness validation that would be too costly to perform in bit space, such as ensuring every start tag has a matching end tag. Xerces's namespace-binding functionality is replaced by the Namespace Processor. Unlike in Xerces, it is a discrete phase that produces a series of URI identifiers (URI IDs), the URI stream, which are associated with each symbol occurrence. This is discussed in Section \ref{section:arch:namespacehandling}. Finally, the Validation layer implements Xerces's validator.
However, preprocessing associated with each symbol greatly reduces the work of this stage.

\begin{figure}
\begin{center}
\includegraphics[height=0.6\textheight,width=0.5\textwidth]{plots/icxml.pdf}
\end{center}
\caption{icXML Architecture}
\label{fig:icxml-arch}
\end{figure}

In icXML, however, the concept of Character Set Adapters (CSAs) is used to minimize transcoding costs. Given a specified input encoding, a CSA is responsible for checking that input code units represent valid characters, mapping the characters of the encoding into the appropriate bitstreams for XML parsing actions (i.e., producing the lexical item streams), as well as supporting ultimate transcoding requirements. All of this work is performed using the parallel bitstream representation of the source input. A second observation is that, regardless of which character set is used, quite often all of the characters in a particular block of input will be within the ASCII range. This is a very simple test to perform using the bitstream representation: simply confirm that the bit 0 stream is zero for the entire block. For blocks satisfying this test, all logic dealing with non-ASCII characters can simply be skipped. Transcoding to UTF-16 becomes trivial, as the high eight bitstreams of the UTF-16 form are each set to zero in this case. The cost of individual character transcoding is avoided whenever a block of input is confined to the ASCII subset, and for all but the first occurrence of any XML element or attribute name.
Furthermore, when transcoding is required, the parallel bitstream representation supports efficient transcoding operations. In the important case of UTF-8 to UTF-16 transcoding, the corresponding UTF-16 bitstreams can be calculated in bit-parallel fashion based on the UTF-8 streams \cite{Cameron2008}, and all but the final bytes of multi-byte sequences can be marked for deletion. Using this approach, transcoding may then be completed by applying parallel deletion and inverse transposition of the UTF-16 bitstreams \cite{Cameron2008}.

\begin{center}
\begin{tabular}{r l}
Markup Identifiers & \verb|_________1______________1_________1______1_1____________1_________| \\
Deletion Mask & \verb|_11111111_____1111111111_1____1111_11_______11111111_____111111111| \\
Undeleted Data & \verb|0________>fee0__________=_fie0____=__foe0>0/________fum0/_________| \\
\end{tabular}
\end{center}

Rather than immediately paying the costs of deletion and transposition just for transcoding, however, icXML defers these steps so that the deletion masks for several stages of processing may be combined. In particular, this includes core XML requirements: carriage returns (CR), line feeds (LF) and CR-LF combinations must be normalized to a single LF character in each case. In icXML, this is achieved by first marking CR positions and then performing two bit-parallel operations to transform the marked positions into LFs. A further opportunity arises with the predefined character references \verb|&amp;| and \verb|&lt;|, which must be replaced in XML processing with the single & and < characters, respectively.
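The deletion-marking step for multi-byte sequences can be sketched as follows. This is an illustrative helper, not icXML code; it exploits the fact that a UTF-8 byte is a non-final byte of a multi-byte sequence exactly when the byte after it is a continuation byte (10xxxxxx).

```python
def utf8_nonfinal_mask(data: bytes) -> int:
    """Return a deletion mask (bit i = byte i, LSB first) marking all
    but the final byte of each multi-byte UTF-8 sequence."""
    cont = 0
    for i, byte in enumerate(data):
        if byte & 0xC0 == 0x80:   # continuation byte 10xxxxxx
            cont |= 1 << i
    # byte i is deletable iff byte i+1 is a continuation byte
    return cont >> 1
```

For "aéb" (bytes 61 C3 A9 62) the mask marks only the C3 prefix byte. In bitstream terms, the continuation stream is simply b0 ∧ ¬b1, so the whole mask costs two boolean operations and a one-position shift per block.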
The approach in icXML is to mark all but the first character position of each reference for deletion, leaving a single character position unmodified. A final example is the process of reducing markup data to tag bytes preceding each significant XML transition, as described in Section~\ref{section:arch:contentstream}. Overall, icXML avoids separate buffer-copying operations for each of these filtering steps, paying the cost of parallel deletion and inverse transposition only once. Currently, icXML employs the parallel-prefix compress algorithm of Steele~\cite{HackersDelight}; its performance is independent of the number of positions deleted. Future versions of icXML are expected to take advantage of the parallel extract operation~\cite{HilewitzLee2006} that Intel is now providing in its Haswell architecture.

### Content Stream

A relatively unique concept for icXML is the use of a filtered content stream. Rather than parsing an XML document in its original format, the input is transformed, through the parallel filtering algorithm described in Section \ref{sec:parfilter}, into one that is easier for the parser to iterate through and produce the sequential output from. Combined with the symbol stream, the parser traverses the content stream to effectively reconstruct the input document in its output form. The initial 0 indicates an empty content string. The following \verb|>| indicates that a start tag without any attributes is the first element in this text, and the first unused symbol, document, is the element name. … accounts for 6.83% of Xerces's execution time.
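The compress step itself can be sketched as a direct transcription of the Steele parallel-prefix compress from Hacker's Delight, shown here for a 64-bit block (icXML's real implementation operates on wider SIMD registers, and its mask selects bits to keep, i.e., the complement of a deletion mask). The loop runs exactly log2(64) = 6 rounds regardless of the mask, which is why performance is independent of the number of positions deleted.

```python
W = 64
MASK = (1 << W) - 1

def compress64(x: int, m: int) -> int:
    """Move the bits of x selected by mask m to the low end, preserving
    their order, in O(log W) steps (parallel-prefix compress)."""
    x &= m
    mk = (~m << 1) & MASK            # seed: count of 0s to the right
    for i in range(6):               # log2(W) rounds
        mp = mk
        for j in range(6):           # parallel prefix (xor-scan)
            mp ^= (mp << (1 << j)) & MASK
        mv = mp & m                  # bits to move this round
        m = (m ^ mv) | (mv >> (1 << i))
        t = x & mv
        x = (x ^ t) | (t >> (1 << i))
        mk &= ~mp
    return x
```

For example, `compress64(0b1010, 0b1010)` yields `0b11`: the two selected bits are packed to the low end.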
Additionally, it is cheap to locate the terminal character of each string: using the String End bitstream, the Parabix Subsystem can effectively calculate the offset of each null character in the content stream in parallel, which in turn means the parser can directly jump to the end of every string without scanning for it. Following \verb|fee| is a \verb|=|, which marks the existence of an attribute. Because all of the intra-element well-formedness validation was performed in the Parabix Subsystem, this must be a legal attribute. Since attributes can only occur within start tags and must be accompanied by a textual value, the next symbol in the symbol stream must be the element name of a start tag. In both Xerces and icXML, every URI has a one-to-one mapping to a URI ID. These persist for the lifetime of the application through the use of a global URI pool. Xerces maintains a stack of namespace scopes that is pushed (popped) every time a start tag (end tag) occurs. For that reason, icXML contains an independent namespace stack and utilizes bit vectors to cheaply perform namespace scope operations. As Xerces parses a file, it uses context-dependent logic to assess whether the next character is legal; if not, the current state determines the type and severity of the error. icXML emits errors in a similar manner, but how it discovers them is substantially different.
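Conceptually, the offset calculation looks like the following sketch (`string_end_offsets` is a hypothetical helper, not an icXML API): each null position is read directly off the String End bitstream with bit tricks, with no byte-by-byte scan of the text itself.

```python
def string_end_offsets(end_stream: int):
    """Given a String End bitstream (bit i set where content byte i is
    a null terminator), return every terminator's offset."""
    offsets = []
    while end_stream:
        low = end_stream & -end_stream        # isolate lowest 1-bit
        offsets.append(low.bit_length() - 1)  # its position
        end_stream ^= low                     # clear it and continue
    return offsets

content = b"fee\x00fie\x00foe\x00"
ends = sum(1 << i for i, byte in enumerate(content) if byte == 0)
print(string_end_offsets(ends))  # [3, 7, 11]
```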
Recall that in Figure \ref{fig:icxml-arch}, icXML is divided into two sections, the Parabix Subsystem and the Markup Processor, each with its own system for detecting and producing error messages. Within the Parabix Subsystem, all computations are performed in parallel, a block at a time. Errors are derived as artifacts of bitstream calculations, with a 1-bit marking the byte position of an error within a block, and the type of error is determined by the equation that discovered it. The difficulty of error processing in this section is that, as in Xerces, the line and column number must be given with each error message. Note that typical XML documents are error-free, yet the calculation of the line/column position is a constant overhead in Xerces. To reduce this, icXML pushes the bulk cost of the line/column calculation to the occurrence of the error and performs the minimal amount of bookkeeping necessary to facilitate it. icXML leverages the byproducts of the Character Set Adapter (CSA) module and amalgamates the information within the Line Column Tracker (LCT). One of the CSA's major responsibilities is transcoding an input text. During this process, whitespace normalization rules are applied and multi-code-unit and surrogate characters are detected and validated. A line-feed bitstream, which marks the positions of the normalized newline characters, is a natural derivative of this process.
Using an optimized population count algorithm, the line count can be summarized cheaply for each valid block of text. Column position is more difficult to calculate. It is possible to scan backwards through the bitstream of newline characters to determine the distance (in code units) between the position at which an error was detected and the last line feed. However, this distance may exceed the actual character position, for the reasons discussed in (2). To handle this, the CSA generates a skip mask bitstream by ORing together many relevant bitstreams, such as those marking all trailing multi-code-unit and surrogate characters, and any characters that were removed during the normalization process. The Markup Processor is a state-driven machine. As such, error detection within it is very similar to Xerces. However, reporting the correct line/column is a much more difficult problem. The Markup Processor parses the content stream, which is a series of tagged UTF-16 strings. Each string is normalized in accordance with the XML specification. All symbol data and unnecessary whitespace are eliminated from the stream; thus it is impossible to derive the current location using only the content stream.
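The bookkeeping can be sketched as follows. This is a conceptual model, not icXML code: the line number is a population count of line feeds before the error, and the column is the code-unit distance from the last line feed minus the skip-mask bits in between (so multi-code-unit characters count once).

```python
def popcount(x: int) -> int:
    return bin(x).count("1")

def line_col(lf_stream: int, skip_mask: int, err_pos: int):
    """Derive a 1-based (line, column) for code-unit position err_pos
    from a line-feed bitstream and a skip-mask bitstream (bit i of
    each stream corresponds to code unit i of the input)."""
    below = (1 << err_pos) - 1
    prior_lfs = lf_stream & below
    line = popcount(prior_lfs) + 1
    last_lf = prior_lfs.bit_length() - 1        # -1 if no prior LF
    span = below ^ ((1 << (last_lf + 1)) - 1)   # units after the last LF
    col = (err_pos - last_lf) - popcount(skip_mask & span)
    return line, col

# "ab\ncd": an error at byte 4 ('d') is line 2, column 2
print(line_col(0b00100, 0, 4))  # (2, 2)
```

With a skip mask marking the trailing byte of a two-byte character, the reported column counts that character as a single position.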
To calculate the location, the Markup Processor borrows three additional pieces of information from the Parabix Subsystem: the line-feed stream, the skip mask stream, and a deletion mask stream, which is a bitstream denoting the (code-unit) position of every datum that was suppressed from the source during the production of the content stream. Armed with these, it is possible to calculate the actual line/column using the same system as the Parabix Subsystem, advancing until the sum of the negated deletion mask stream is equal to the current position. As discussed in Section \ref{background:xerces}, Xerces can be considered an FSM application. These are ``embarrassingly sequential''\cite{Asanovic:EECS-2006-183} and notoriously difficult to parallelize. However, icXML is designed to organize processing into logical layers. In particular, layers within the Parabix Subsystem are designed to operate over significant segments of input data before passing their outputs on for subsequent processing. This fits well into the general model of pipeline parallelism. The most straightforward division of work in icXML is to separate the Parabix Subsystem and the Markup Processor into two distinct pipeline stages.
The resultant application, icXML-p, is a coarse-grained software-pipeline application. In this case, the Parabix Subsystem thread $T_1$ reads 16k of XML input $I$ at a time and produces the content, symbol and URI streams, then stores them in a pre-allocated shared data structure $S$. The Markup Processor thread $T_2$ consumes $S$, performs well-formedness and grammar-based validation, and provides the parsed XML data to the application through the Xerces API. The shared data structure is implemented using a ring buffer. Overall, our design is intended to benefit a range of applications; conceptually, we consider two design points. In the first, the parsing performed by the Parabix Subsystem dominates at 67% of the overall cost, with the cost of application processing (including the driver logic within the Markup Processor) at 33%. The second is almost the opposite scenario: the cost of application processing dominates at 60%, while the cost of XML parsing represents an overhead of 40%. We anticipate that the Parabix framework can achieve a 50% to 100% improvement in the parsing engine itself. In a best-case scenario, consider a 100% improvement of the Parabix Subsystem for the design point in which XML parsing dominates at 67% of the total application cost. In this case, single-threaded icXML should achieve a 1.5x speedup over Xerces, so that the total application cost reduces to 67% of the original. In icXML-p, however, our ideal scenario gives us two well-balanced threads, each performing about 33% of the original work.
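The arithmetic behind these design points is simple enough to check directly; the sketch below (not icXML code) treats the pipelined case as throughput-bound by the slower of the two overlapped stages:

```python
def speedup(parse_frac, app_frac, parse_speedup, pipelined=False):
    """Speedup over a baseline with total cost 1.0, split between
    parsing and application work, given a parser-core speedup."""
    parse = parse_frac / parse_speedup
    total = max(parse, app_frac) if pipelined else parse + app_frac
    return 1.0 / total

# Design point 1: parsing 67%, application 33%, 2x parser speedup
print(round(speedup(0.67, 0.33, 2.0), 2))        # ~1.5x single-threaded
print(round(speedup(0.67, 0.33, 2.0, True), 2))  # ~3x, two balanced stages
# Design point 2: parsing 40%, application 60%
print(round(speedup(0.40, 0.60, 2.0), 2))        # 1.25x single-threaded
print(round(speedup(0.40, 0.60, 2.0, True), 2))  # parsing latency hidden
```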
In this case, Amdahl's law predicts that we could expect up to a 3x speedup at best. At the other extreme of our design range, we consider an application in which the core parsing cost is 40%. Assuming the 2x speedup of the Parabix Subsystem over the corresponding Xerces core, single-threaded icXML delivers a 25% speedup. However, the most significant aspect of our two-stage multi-threaded design then becomes the ability to hide the entire latency of parsing within the serial time of application processing. Although the structure of the Parabix Subsystem allows division of the work into several pipeline stages, and this has been demonstrated to be effective for four pipeline stages in a research prototype \cite{HPCA2012}, our analysis here suggests that further pipelining of work within the Parabix Subsystem is not worthwhile if the cost of application logic is as little as 33% of the end-to-end cost using Xerces. To achieve the benefits of further parallelization with multi-core technology, the cost of application processing would also need to be reduced.

### Performance

We evaluate Xerces, icXML and icXML-p against two benchmarking applications: the Xerces-C++ SAXCount sample application, and a real-world GML-to-SVG transformation application. Figure \ref{perf_SAX} compares the performance of Xerces, icXML and pipelined icXML in terms of CPU cycles per byte for the SAXCount application. The speedup for icXML over Xerces is 1.3x to 1.8x.
With two threads on the multicore machine, icXML-p can achieve a speedup of up to 2.7x. Xerces is substantially slowed by dense markup, but icXML is less affected through a reduction in branches and the use of parallel-processing techniques. icXML-p performs better as markup density increases because the work performed by each stage is well balanced in this application. This paper is the first case study documenting the significant performance benefits that may be realized through the integration of parallel bitstream technology into existing widely-used software libraries. In the case of the Xerces-C++ XML parser, the combined integration of SIMD and multicore parallelism was shown capable of producing dramatic increases in throughput and reductions in branch mispredictions and cache misses. The modified parser, going under the name icXML, is designed to provide the full functionality of the original Xerces library with complete compatibility of APIs. The further development of icXML to move beyond 2-stage pipeline parallelism is ongoing, with realistic prospects for four reasonably balanced stages within the library. To overcome the software-engineering challenges in applying parallel bitstream technology to existing software systems, it is clear that better library and tool support is needed.
The techniques used in the implementation of icXML and documented in this paper could well be generalized for applications in other contexts, and automated through the creation of compiler technology specifically supporting parallel bitstream programming.