# Changeset 2872 for docs

Timestamp:
Jan 30, 2013, 6:03:41 PM
Message:

edits

Location:
docs/Working/icXML
Files:
14 edited

• ## docs/Working/icXML/abstract.tex

 r2869 Prior research on the acceleration of XML processing using SIMD and multi-core parallelism has led to a number of interesting research prototypes.  This work investigates the extent to which the techniques underlying
• ## docs/Working/icXML/arch-charactersetadapters.tex

 r2866 In the important case of UTF-8 to UTF-16 transcoding, the transcoding costs can be significant, because of the need to decode and classify each byte of input, mapping variable-length UTF-8 byte sequences into 16-bit UTF-16 code units with bit manipulation operations. In other cases, transcoding may involve table look-up operations for each byte of input. In any case, transcoding imposes at least a cost of buffer copying. Given a specified input encoding, a CSA is responsible for checking that input code units represent valid characters, mapping the characters of the encoding into the appropriate \bitstream{}s for XML parsing actions (i.e., producing the lexical item streams), as well as supporting ultimate transcoding requirements. All of this work is performed using the parallel \bitstream{} representation of the source input. A second observation is that---regardless of which character set is used---quite often all of the characters in a particular block of input will be within the ASCII range. This is a very simple test to perform using the \bitstream{} representation, simply confirming that the bit 0 stream is zero for the entire block. For blocks satisfying this test, all logic dealing with non-ASCII characters can simply be skipped. Transcoding to UTF-16 becomes trivial as the high eight \bitstream{}s of the UTF-16 form are each set to zero in this case.
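The ASCII-block test described here can be sketched on Python integers standing in for bit streams; the transposition helper below is a hypothetical illustration (bit stream 0 is taken to be the high bit of each byte, as in Parabix), not the actual icXML code:

```python
def basis_bits(block: bytes):
    """Transpose a block of bytes into 8 parallel bit streams.
    Stream 0 holds the most significant bit of each byte; bit i
    of each stream corresponds to byte position i."""
    streams = [0] * 8
    for pos, byte in enumerate(block):
        for b in range(8):
            if byte & (0x80 >> b):
                streams[b] |= 1 << pos
    return streams

def block_is_ascii(block: bytes) -> bool:
    # A block is pure ASCII iff bit stream 0 (the high bit of
    # every byte) is zero across the entire block.
    return basis_bits(block)[0] == 0
```

For a block passing this test, all non-ASCII handling can be bypassed and the high eight UTF-16 bit streams set to zero directly.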
A third observation is that repeated transcoding of the names of XML elements, attributes and so on can be avoided by using a look-up mechanism. That is, the first occurrence of each symbol is stored in a look-up table mapping the input encoding to a numeric symbol ID. Transcoding of the symbol is applied at this time. Subsequent look-up operations can avoid transcoding by simply retrieving the stored representation. As symbol look-up is required to apply various XML validation rules, this achieves the effect of transcoding each occurrence without additional cost. The cost of individual character transcoding is avoided whenever a block of input is confined to the ASCII subset and for all but the first occurrence of any XML element or attribute name. Furthermore, when transcoding is required, the parallel \bitstream{} representation supports efficient transcoding operations. In the important case of UTF-8 to UTF-16 transcoding, the corresponding UTF-16 \bitstream{}s can be calculated in bit parallel fashion based on UTF-8 streams \cite{Cameron2008}, and all but the final bytes of multi-byte sequences can be marked for deletion as discussed in the following subsection. In other cases, transcoding within a block only need be applied for non-ASCII
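The look-up mechanism can be illustrated with a minimal sketch (the class and method names here are hypothetical, not icXML's actual symbol resolver):

```python
class SymbolTable:
    """Sketch of the symbol look-up mechanism: each distinct name
    is transcoded exactly once, on first occurrence; later
    occurrences return the cached symbol ID without transcoding."""
    def __init__(self):
        self._ids = {}     # raw encoded bytes -> symbol ID
        self._utf16 = []   # symbol ID -> transcoded UTF-16 form

    def lookup(self, raw: bytes, encoding: str = "utf-8") -> int:
        sid = self._ids.get(raw)
        if sid is None:
            sid = len(self._utf16)
            self._ids[raw] = sid
            # transcoding is applied only on first occurrence
            self._utf16.append(raw.decode(encoding).encode("utf-16-le"))
        return sid

    def name(self, sid: int) -> bytes:
        # retrieve the stored (already transcoded) representation
        return self._utf16[sid]
```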
• ## docs/Working/icXML/arch-contentstream.tex

 r2531 accounts for $6.83\%$ of Xerces's execution time. Additionally, it is cheap to locate the terminal character of each string: using the String End \bitstream{}, the \PS{} can effectively calculate the offset of each null character in the content stream in parallel, which in turn means the parser can directly jump to the end of every string without scanning for it.
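A small sketch of how the null-character offsets might be recovered from a String End bitstream without scanning the content itself (Python integers model the bit stream; an illustration only, not the \PS{} implementation):

```python
def string_end_offsets(string_end: int):
    """Extract the position of every 1-bit in the String End
    stream by repeatedly isolating and clearing the lowest set
    bit, so each terminal character is located without reading
    the content stream."""
    offsets = []
    s = string_end
    while s:
        low = s & -s                      # isolate lowest set bit
        offsets.append(low.bit_length() - 1)
        s ^= low                          # clear it and continue
    return offsets
```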
• ## docs/Working/icXML/arch-errorhandling.tex

 r2866 Within the \PS{}, all computations are performed in parallel, a block at a time. Errors are derived as artifacts of \bitstream{} calculations, with a 1-bit marking the byte-position of an error within a block, and the type of error is determined by the equation that discovered it. The difficulty of error processing in this section is that in Xerces the line and column number must be given During this process, white-space normalization rules are applied and multi-code-unit and surrogate characters are detected and validated. A {\it line-feed \bitstream{}}, which marks the positions of the normalized newline characters, is a natural derivative of this process. Using an optimized population count algorithm, the line count can be summarized cheaply for each valid block of text. % The optimization delays the counting process .... Column position is more difficult to calculate. It is possible to scan backwards through the \bitstream{} of newline characters to determine the distance (in code-units) between the position at which an error was detected and the last line feed. However, this distance may exceed the actual character position for the reasons discussed in (2). To handle this, the CSA generates a {\it skip mask} \bitstream{} by ORing together many relevant \bitstream{}s, such as all trailing multi-code-unit and surrogate characters, and any characters that were removed during the normalization process.
column number. % \begin{figure}[ht] % {\bf TODO: An example of a skip mask, error mask, and the raw data and transcoded data for it. % Should a multi-byte character be used and/or some CRLFs to show the difficulties?} % \label{fig:error_mask} % \caption{} % \end{figure} The \MP{} is a state-driven machine. As such, error detection within it is very similar to Xerces. Thus it is impossible to derive the current location using only the content stream. To calculate the location, the \MP{} borrows three additional pieces of information from the \PS{}: the line-feed, skip mask, and a {\it deletion mask stream}, which is a \bitstream{} denoting the (code-unit) position of every datum that was suppressed from the source during the production of the content stream. Armed with these, it is possible to calculate the actual line/column using the same system as the \PS{} until the sum of the negated deletion mask stream is equal to the current position.
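The line/column reconstruction can be sketched as follows, with Python integers modelling the line-feed and skip-mask streams; this is a simplified illustration of the scheme, not the actual icXML code:

```python
def line_column(lf_stream: int, skip_mask: int, err_pos: int):
    """Derive line/column for an error position: count line feeds
    before the error with a population count, then measure the
    code-unit distance back to the last line feed, subtracting
    positions flagged in the skip mask (trailing code units of
    multi-code-unit characters, removed characters, etc.)."""
    before = (1 << err_pos) - 1                  # positions < err_pos
    line = bin(lf_stream & before).count("1") + 1
    prior_lfs = lf_stream & before
    last_lf = prior_lfs.bit_length() - 1         # -1 if no prior LF
    if last_lf >= 0:
        span = before & ~((1 << (last_lf + 1)) - 1)
    else:
        span = before
    skipped = bin(skip_mask & span).count("1")   # popcount of skips
    column = (err_pos - (last_lf + 1)) - skipped + 1
    return line, column
```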
• ## docs/Working/icXML/arch-namespace.tex

 r2522 To resolve the in-scope namespace binding for each prefix, a bit vector of the currently visible namespaces is maintained by the system. By ANDing the prefix bit vector with the currently visible namespaces, the in-scope NSID can be found using a bit-scan intrinsic. A namespace binding table, similar to Table \ref{tbl:namespace1}, provides the actual URI ID.
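A minimal sketch of the AND-plus-bit-scan resolution step (Python integers stand in for the bit vectors, and the bit scan is emulated; the real implementation would use a hardware bit-scan intrinsic):

```python
def in_scope_nsid(prefix_vector: int, visible: int) -> int:
    """AND the prefix's binding bit vector with the currently
    visible namespace vector, then bit-scan for the surviving
    binding.  Returns -1 if the prefix has no in-scope binding."""
    hits = prefix_vector & visible
    if hits == 0:
        return -1
    # emulate bit-scan-forward: index of the lowest set bit
    return (hits & -hits).bit_length() - 1
```

The resulting NSID then indexes the namespace binding table to obtain the URI ID.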
• ## docs/Working/icXML/arch-overview.tex

 r2871 In \icXML{} functions are grouped into logical components. As shown in Figure \ref{fig:icxml-arch}, two major categories exist: (1) the \PS{} and (2) the \MP{}. All tasks in (1) use the Parabix Framework \cite{HPCA2012}, which represents data as a set of parallel \bitstream{}s. The {\it Character Set Adapter}, discussed in Section \ref{arch:character-set-adapter}, mirrors Xerces's Transcoder duties; however instead of producing UTF-16 it produces a set of lexical \bitstream{}s, similar to those shown in Figure \ref{fig:parabix1}. These lexical \bitstream{}s are later transformed into UTF-16 in the \CSG{}, after additional processing is performed. The first precursor to producing UTF-16 is the {\it Parallel Markup Parser} phase. It takes the lexical streams and produces a set of marker \bitstream{}s in which a 1-bit identifies significant positions within the input data. One \bitstream{} for each of the critical pieces of information is created, such as the beginning and ending of start tags, end tags, element names, attribute names, attribute values and content. Intra-element well-formedness validation is performed as an artifact of this process. The {\it Line-Column Tracker} uses the lexical information to keep track of the document position(s) through the use of an optimized population count algorithm, described in Section \ref{section:arch:errorhandling}.
From here, two data-independent branches exist: the Symbol Resolver and Content Preparation Unit. A typical XML file contains few unique element and attribute names---but each of them will occur frequently. the raw data to produce a sequence of GIDs, called the {\it symbol stream}. The final components of the \PS{} are the {\it Content Preparation Unit} and {\it \CSG{}}. The former takes the (transposed) basis \bitstream{}s and selectively filters them, according to the information provided by the Parallel Markup Parser, and the latter transforms the filtered streams into the tagged UTF-16 {\it content stream}, discussed in Section \ref{section:arch:contentstream}. Combined, the symbol and content stream form \icXML{}'s compressed IR of the XML document. The {\it \MP{}}~parses the IR to validate and produce the sequential output for the end user. The {\it Final WF checker} performs inter-element well-formedness validation that would be too costly to perform in bit space, such as ensuring every start tag has a matching end tag. Xerces's namespace binding functionality is replaced by the {\it Namespace Processor}. Unlike Xerces, it is a discrete phase that produces a series of URI identifiers (URI IDs), the {\it URI stream}, which are \label{fig:icxml-arch} \end{figure} % Probably not the right area but should we discuss issues with Xerces design that we tried to correct?
% - over-reliance on hash tables when domain knowledge dictated none would be needed % - constant buffering of text to ensure that every QName/NCName and content was contained within a single string % - abundant use of heap allocated memory % - text conversions done in multiple areas % - poor cache utilization; attempted to improve by using smaller layers of tasks in bulk % As the previous section alluded, the greatest difference between sequential parsing methods % and the Parabix parsing model is how data is processed. % Consider Figure \ref{fig:parabix1} again. In it, the start tags are located independent of the end % tags. In order to produce Xerces-equivalent output, icXML must emit the start and end tag % events in sequential order, with all attribute data associated with the correct tag. % % % % The Parabix framework, however, does not allow for this (and would be hindered performance wise if % forced to.) % Thus our first question was, ``how can we take full advantage % of Parabix whilst producing Xerces-equivalent output?'' Our answer came by analyzing what Xerces produced % when given an input text. % % By analyzing Xerces internal data structures and its produced output, two major observations were obvious: % (1) input data is transcoded into UTF-16 to ensure that there is a single standard character type, both % internally (within the grammar structures and hash tables) and externally (for the end user). % (2) all elements and attributes (both qualified and unqualified) are associated with a unique element % declaration or attribute definition within a specific grammar structure. Xerces emits the appropriate % grammar reference in place of the element or attribute string. %   From Xerces to icXML % %   - Philosophy:  Maximizing Bit Stream Processing % %   - Character Set Adapters vs.
Transcoding %   - Bitstreams 1: Charset Validation and Transcoding equations %   - Bitstreams 2: Parabix style parsing and validation % %   - Bitstreams 3: Parallel filtering and normalization %           - LB normalization %           - reference compression -> single code unit speculation %           - parallel string termination % %   - Bitstreams 4: Symbol processing % %   - From bit streams to doublebyte streams: the content buffer % %   - Namespace Processing: A Bitset approach.
• ## docs/Working/icXML/background-fundemental-differences.tex

 r2866 \subsection {Sequential vs. Parallel Paradigm} % Sequential: bytes through layers Xerces---like all traditional XML parsers---processes XML documents sequentially. Each character is examined to distinguish between the validation and content processing modes. In other words, Xerces belongs to an equivalence class of applications termed FSM applications\footnote{Herein FSM applications are considered software systems whose behaviour is defined by the inputs, current state and the events associated with transitions of states.}. Each state transition indicates the processing context of subsequent characters. Unfortunately, textual data tends to be unpredictable and any character could induce a state transition. % Unfortunately, textual data tends to consist of variable-length strings sequenced in % unpredictable patterns. % Each character must be examined in sequence because any character could be a state transition until deemed otherwise. % Parallel: blocks/segments/buffers through layers Parabix-style XML parsers utilize a concept of layered processing. A block of source text is transformed into a set of lexical \bitstream{}s, which undergo a series of operations that can be grouped into logical layers, e.g., transposition, character classification, and lexical analysis. Each layer is pipeline parallel and requires neither speculation nor pre-parsing stages \cite{HPCA2012}. % In adapting to the requirements of the Xerces sequential parsing API, % however, the resultant parallel bit streams may be out-of-order \wrt{} the source document. % Hence they must be amalgamated and iterated through to produce sequential output. To meet the API requirements of the document-ordered Xerces output, the results of the Parabix processing layers must be interleaved to produce the equivalent behaviour.
• ## docs/Working/icXML/background-parabix.tex

 r2866 lines show streams that can be computed in subsequent parsing (using the technique of \bitstream{} addition \cite{cameron-EuroPar2011}), namely streams marking the element names, attribute names and attribute values of tags. sequential scanning loops for individual characters \cite{CameronHerdyLin2008}. Recent work has incorporated a method of parallel scanning using \bitstream{} addition \cite{cameron-EuroPar2011}, as well as combining SIMD methods with 4-stage pipeline parallelism to further improve throughput \cite{HPCA2012}. Commercial XML processors support transcoding of multiple character sets and can parse and validate against multiple document vocabularies. Additionally, they provide API facilities beyond those found in research prototypes, including the widely used SAX, SAX2 and DOM interfaces.
• ## docs/Working/icXML/background-xerces.tex

 r2866 Figure \ref{fig:xerces-profile} shows the execution time profile of the top ten functions in a typical run. Even if it were possible, Amdahl's Law dictates that tackling any one of these functions for parallelization in isolation would only produce a minute improvement in performance. Unfortunately, early investigation into these functions found that incorporating speculation-free thread-level parallelization was impossible and they were already performing well in their given tasks; \label {fig:xerces-profile} \end{figure} % Figure \ref{fig:xerces-arch} shows the % overall architecture of the Xerces C++ parser. % In analyzing the structure of Xerces, it was found that % there were a number of individual byte-at-a-time % processing tasks. % % \begin{enumerate} % \item Transcoding of source data to UTF-16 % \item Character validation. % \item Line break normalization. % \item Character classification. % \item Line-column calculation. % \item Escape insertion and replacement. % \item Surrogate handling. % \item Name processing. % \item Markup parsing. % \item Attribute validation. % %\item Attribute checking. % %\item xmlns attribute processing. % \item Namespace processing. % \item Grammars, content model and data type validation. % \end{enumerate}
• ## docs/Working/icXML/conclusion.tex

 r2869 This paper is the first case study documenting the significant performance benefits that may be realized through the integration of parallel \bitstream{} technology into existing widely-used software libraries. In the case of the Xerces-C++ XML parser, the combined integration of SIMD and multicore parallelism was able to provide the full functionality of the original Xerces library with complete compatibility of APIs.  Although substantial re-engineering was required to realize the performance potential of parallel technologies, this is an important case study demonstrating the general To overcome the software engineering challenges in applying parallel \bitstream{} technology to existing software systems, it is clear that better library and tool support is needed. The techniques used in the implementation of \icXML{} and applications in other contexts and automated through the creation of compiler technology specifically supporting parallel \bitstream{} programming.
• ## docs/Working/icXML/icxml-main.tex

 r2871 \def \MP {Markup Processor} \def \wrt {with respect to} \def \bitstream{bitstream} \title{\icXML{}:  Accelerating a Commercial XML Parser Using SIMD and Multicore Technologies}
• ## docs/Working/icXML/multithread.tex

 r2871 %\section{Leveraging SIMD Parallelism for Multicore: Pipeline Parallelism} % As discussed in section \ref{background:xerces}, Xerces can be considered a complex finite-state machine % Finite-state machine belongs to the hardest application class to parallelize and process efficiently % among all presented in Berkeley study reports \cite{Asanovic:EECS-2006-183}. % However, \icXML{} reconstructs Xerces and provides logical layers between modules, % which naturally enables pipeline parallel processing. As discussed in section \ref{background:xerces}, Xerces can be considered a FSM application. These are ``embarrassingly sequential''\cite{Asanovic:EECS-2006-183} and notoriously difficult to parallelize. However, \icXML{} is designed to organize processing into logical layers. In particular, layers within the \PS{} are designed to operate The most straightforward division of work in \icXML{} is to separate the \PS{} and the \MP{} into two separate stages. The resultant application, {\it\icXMLp{}}, is a coarse-grained software-pipeline application. In this case, the \PS{} thread $T_1$ reads 16k of XML input $I$ at a time and produces the \end{figure} % In our pipeline model, each thread is in charge of one module or one group of modules. % A straight forward division is to take advantage of the layer between \PS{} and \MP{}. % In this case, the first thread $T_1$ will read 16k of XML input $I$ at a time % and process all the modules in \PS{} to generate % content buffer, symbol array, URI array, and store them to a pre-allocated shared data structure $S$. % The second thread $T_2$ consumes the data provided by the first thread and % goes through all the modules in Markup Processor and writes output $O$.
% The shared data structure is implemented using a ring buffer, % where each entry consists of all the arrays shared between the two threads with size of 160k. % In the example of Figure \ref{threads_timeline1} and \ref{threads_timeline2}, the ring buffer has four entries. % A lock-free mechanism is applied to ensure that each entry can only be read or written by one thread at the same time. % In Figure \ref{threads_timeline1}, the processing time of the first thread is longer, % thus the second thread always waits for the first thread to finish processing one chunk of input % and write to the shared memory. % Figure \ref{threads_timeline2} illustrates a different situation where the second thread is slower % and the first thread has to wait for the second thread to finish reading the shared data before it can reuse the memory space. Overall, our design is intended to benefit a range of applications. Conceptually, we consider two design points. the \PS{} over the corresponding Xerces core, single-threaded \icXML{} delivers a 25\% speedup.   However, the most significant aspect of our two-stage multi-threaded design then becomes the ability to hide the entire latency of parsing within the serial time required by the application.   In this case, we achieve the \PS{} is not worthwhile if the cost of application logic is as little as 33\% of the end-to-end cost using Xerces.  To achieve benefits of further parallelization with multi-core technology, there would need to be reductions in the cost of application logic that could match reductions in core parsing cost. % \begin{figure} % \includegraphics[width=0.45\textwidth]{plots/threads_timeline1.pdf} % \caption{} % \label{threads_timeline1} % \end{figure} % % \begin{figure} % \includegraphics[width=0.45\textwidth]{plots/threads_timeline2.pdf} % \caption{} % \label{threads_timeline2} % \end{figure}
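The two-stage pipeline described in the comments above (one thread parsing chunks into a shared buffer, a second consuming them in order) can be sketched with a standard-library bounded queue standing in for the ring buffer; this illustrates the pipeline shape only, not icXML's lock-free implementation:

```python
import queue
import threading

def run_pipeline(chunks, parse, process):
    """Coarse-grained two-stage pipeline sketch: thread T1 plays
    the Parallel Stream role (parse each input chunk into an
    intermediate representation), thread T2 plays the Markup
    Processor role (consume the IR in document order).  The
    bounded queue models a four-entry ring buffer: whichever
    stage is faster blocks waiting for the other."""
    buf = queue.Queue(maxsize=4)
    out = []

    def t1():
        for c in chunks:
            buf.put(parse(c))      # blocks when all entries are full
        buf.put(None)              # end-of-input sentinel

    def t2():
        while (ir := buf.get()) is not None:
            out.append(process(ir))

    producer = threading.Thread(target=t1)
    consumer = threading.Thread(target=t2)
    producer.start(); consumer.start()
    producer.join(); consumer.join()
    return out
```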
• ## docs/Working/icXML/parfilter.tex

 r2866 As just mentioned, UTF-8 to UTF-16 transcoding involves marking all but the last bytes of multi-byte UTF-8 sequences as positions for deletion.   For example, the two Chinese characters \begin{CJK*}{UTF8}{gbsn}你好\end{CJK*} from six bit positions representing UTF-8 code units (bytes) down to just two bit positions representing UTF-16 code units (double bytes).   This compression may be achieved by arranging to calculate the correct UTF-16 bits at the final position of each sequence and creating a deletion \verb'110110'.  Using this approach, transcoding may then be completed by applying parallel deletion and inverse transposition of the UTF-16 \bitstream{}s\cite{Cameron2008}. \begin{figure*}[tbh] CRLF = pablo.Advance(lex.CR) & lex.LF callouts.delmask |= CRLF # Adjust LF streams for line/column tracker lex.LF |= lex.CR lex.LF ^= CRLF combined into the overall deletion mask.   After the deletion and inverse transposition operations are finally applied, a post-processing step inserts the proper character at these positions.   One note about this process is that it is speculative; references are assumed to generally be replaced by a single UTF-16 code unit.   In the case that this is not true, it is addressed in post-processing. The final step of combined filtering occurs during
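The pablo equations in the figure can be restated as runnable Python, with integers modelling the CR and LF bitstreams under the assumption that bit i of an integer represents position i (Advance becomes a left shift):

```python
def normalize_crlf(cr: int, lf: int, delmask: int):
    """Runnable restatement of the figure's pablo equations.
    CRLF marks each LF immediately following a CR; that LF is
    deleted, and the LF stream is adjusted so that every line
    ending (CR, LF, or CRLF) is marked exactly once for the
    line/column tracker."""
    advance = lambda s: s << 1
    crlf = advance(cr) & lf    # CRLF = pablo.Advance(lex.CR) & lex.LF
    delmask |= crlf            # callouts.delmask |= CRLF
    lf |= cr                   # lex.LF |= lex.CR
    lf ^= crlf                 # lex.LF ^= CRLF
    return lf, delmask
```

For input "a\r\nb\rc\nd" (CRs at positions 1 and 4, LFs at 2 and 6), the normalized line-end stream marks positions 1, 4 and 6, and the deletion mask marks position 2, the LF of the CRLF pair.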
• ## docs/Working/icXML/performance.tex

 r2871 CPU cycles per byte for the SAXCount application. The speedup for \icXML{} over Xerces is 1.3x to 1.8x. With two threads on the multicore machine, \icXMLp{} can achieve speedup up to 2.7x. Xerces is substantially slowed by dense markup but \icXML{} is less affected through the use of parallel-processing techniques and fewer branches.  Figure \ref{branchmiss_GML2SVG} shows the corresponding improvement in branching behaviour, with a dramatic reduction in branch misses per kB. It is also interesting to note that \icXMLp{} goes even further. In essence, in using pipeline parallelism to split the instruction stream onto separate cores, the branch target buffers on each core are less overloaded and able to increase the successful branch prediction rate. and data-cache performance with the improvements in instruction-cache behaviour the most dramatic.   Single-threaded \icXML{} shows substantially improved performance over Xerces on both measures. Although \icXMLp{} is slightly worse \wrt{} data-cache performance, this is more than offset by a further dramatic reduction in instruction-cache miss rate. Again partitioning the instruction stream through the pipeline parallelism model has significant benefit. \begin{figure}