Changeset 2520

Ignore:
Timestamp:
Oct 20, 2012, 12:04:06 PM (7 years ago)
Message:

Parallel filtering

Location:
docs/Working/icXML
Files:
2 edited

Legend:

Unmodified
 r2429 \label{sec:parfilter} Several stages of standard XML processing include compressions that filter out data from certain input byte positions in various ways. These include the compression of multibyte sequences in UTF-8 to UTF-16 transcoding, the compression of CR-LF sequences to single LF characters in line break normalization and the compression of character and entity references of the form \verb:": or \verb:<: to the single \verb:":' and \verb:<:' characters, respectively. Within icXML, a further compression step reduces all markup data within the input document to a single tag byte preceding each significant XML transition as described in section \ref{sec:contentbuffer}. As just mentioned, UTF-8 to UTF-16 transcoding involves marking all but the last bytes of multibyte UTF-8 sequences as positions for deletion.   For example, the two Chinese characters \begin{CJK*}{UTF8}{gbsn}äœ å¥œ\end{CJK*} are represented as two three-byte UTF-8 sequences \verb'E4 BD A0' and \verb'E5 A5 BD' while the UTF-16 representation must be compressed down to the two code units \verb'4F60' and \verb'597D'. In the bit parallel representation, this corresponds to a reduction from six bit positions representing UTF-8 code units (bytes) down to just two bit positions representing UTF-16 code units (doublebytes).   This compression may be achieved by arranging to calculate the correct UTF-16 bits at the final position of each sequence and creating a deletion mask to mark the first two bytes of each 3-byte sequence for deletion.   In this case, the portion of the mask corresponding to these input bytes is the bit sequence \verb'110110'.  Using this approach, transcoding may then be completed by applying parallel deletion and inverse transposition of the UTF-16 bit streams\cite{Cameron2008}. \begin{figure}[htbp] \begin{tabular}{ll} source text & \begin{CJK*}{UTF8}{gbsn}äœ å¥œ\end{CJK*}\&\#33; \\ UTF8  & E4 BD A0 E5 A5 BD 26 23 33 33 3B 0D 0A\\ UTF16 & 4F60 597D 0026 0023 0033 0033 003B 000D 000A\\ CRLF to LF& 4F60 597D 0026 0023 0033 0033 003B 000A\\ Ref Replace & 4F60 597D 0021 000A\\ \end{tabular} \caption{} \label{fig:compression1} Rather than immediately paying the costs of deletion and transposition just for transcoding, however, \icXML{} defers these steps so that the deletion masks for several stages of processing may be combined. In particular, this includes core XML requirements to normalize line breaks and to replace character reference and entity references by their corresponding text.   In the case of line break normalization, all forms of line breaks, including bare carriage returns (CR), line feeds (LF) and CR-LF combinations must be normalized to a single LF character in each case.   In \icXML{}, this is achieved by first marking CR positions, performing two bit parallel operations to transform the marked CRs into LFs, and then marking for deletion any LF that is found immediately after the marked CR as shown by the Pablo source code in Figure \ref{fig:LBnormalization}. \begin{figure} \begin{verbatim} # XML 1.0 line-break normalization rules. if lex.CR: # Modify CR (#x0D) to LF (#x0A) u16lo.bit_5 ^= lex.CR u16lo.bit_6 ^= lex.CR u16lo.bit_7 ^= lex.CR CRLF = pablo.Advance(lex.CR) & lex.LF callouts.delmask |= CRLF # Adjust LF streams for newline/column tracker lex.LF |= lex.CR lex.LF ^= CRLF \end{verbatim} \caption{Line Break Normalization Logic}\label{fig:LBnormalization} \end{figure} In essence, the deletion masks for transcoding and for line break normalization each represent a bitwise filter; these filters can be combined using bitwise-or so that the parallel deletion algorithm need only be applied once. \begin{figure}[htbp] \begin{tabular}{ll} source text & \begin{CJK*}{UTF8}{gbsn}äœ å¥œ\end{CJK*}\&\#33; \\ UTF8  & E4 BD A0 E5 A5 BD 26 23 33 33 3B 0D 0A\\ delmask for UTF16 & 1101100000000\\ delmask for CR& 0000000000010\\ delmask for Ref & 0000000111100\\ delmask & 1101100111110 \end{tabular} \caption{} \label{fig:compression2} \end{figure} A further application of combined filtering is the processing of XML character and entity references.   Consider, for example, the references \verb:&: or \verb:<:. which must be replaced in XML processing with the single \verb:&:' and \verb:<:' characters, respectively. The approach in \icXML{} is to mark all but the first character positions of each reference for deletion, leaving a single character position unmodified.  Thus, for the references \verb:&: or \verb:<: the masks \verb:01111: and \verb:011111: are formed and combined into the overall deletion mask.   After the deletion and inverse transposition operations are finally applied, a postprocessing step inserts the proper character at these positions.   One note about this process is that it is speculative; references are assumed to generally be replaced by a single UTF-16 code unit.   In the case, that this is not true, it is addressed in postprocessing. In Xerces and other sequential XML processors, each of these compression steps is typically implemented as a byte-at-a-time string-buffer to string-buffer operation.  This involves substantial copying and shifting of data at each step. Figure \ref{fig:compression1} illustrates.   Line 1 shows the input code units in UTF-8, with two 3-byte sequences for the Chinese characters ni3 hao3 represented in hexadecimal form.   In Line 2, the UTF-16 code units are shown, with a reduction in the total number of code units by 4, although each code unit is  now a 16-bit quantity.  This compression is carried out by the UTF-8 to UTF-16 transcoder in Xerces. Line 3 then shows the reduction of the two CR-LF sequences to simple LF values. % This is the line break normalization that takes place within the Reader??? Line 4 then shows the further compression by replacement of the character and general entity references by their corresponding character values. In icXML, separate buffer copying operations for each of these steps is avoided by taking advantage of a parallel deletion algorithm to do compression.   The basic concept is to mark each position to be deleted using a 1-bit and then implement deletion using parallel block-at-a-time operations, following for example, the parallel compress algorithm\cite{HackersDelight}. Figure \ref{fig:compression2} shows three deletion masks corresponding to the three compressions of Figure \ref{fig:compression1}. In the fourth line of this figure, the bitwise-or of these three masks is computed.   This mask is then used to perform the deletions in a single application of our parallel deletion algorithm.  By combining the three steps, we reduce the cost of three separate buffer transformations to that of a single deletion operation.  The performance of the parallel deletion algorithm The final step of combined filtering occurs during the process of reducing markup data to tag bytes preceding each significant XML transition as described in section \ref{sec:contentbuffer}.  Overall, \icXML{} avoids separate buffer copying operations for each of the these filtering steps, paying the cost of parallel deletion and inverse transposition only once. Currently, \icXML{} employs the parallel-prefix compress algorithm of Steele\cite{HackersDelight}  Performance is independent of the number of positions deleted.  As a further note, Intel's recent Haswell architecture has introduced a parallel extract operation\cite{} that can directly perform parallel deletion. further note, future versions of \icXML{} are expected to take advantage of the parallel extract operation\cite{} that Intel is now providing in its Haswell architecture \cite{}.