source: docs/Working/icXML/parfilter.tex @ 2449

Last change on this file since 2449 was 2429, checked in by nmedfort, 7 years ago

some progress

File size: 3.4 KB
Line 
1\subsection{Combined Parallel Filtering}
2\label{sec:parfilter}
3
4Several stages of standard XML processing include compressions
5that filter out data from certain input byte positions
6in various ways. These include the compression of multibyte
7sequences in UTF-8 to UTF-16 transcoding, the compression
8of CR-LF sequences to single LF characters in line break
9normalization and the compression of character and entity
10references of the form \verb:": or \verb:<: to
11the single `\verb:":' and `\verb:<:' characters, respectively.
12Within icXML, a further compression step reduces all
13markup data within the input document to a single tag
14byte preceding each significant XML transition as described
15in section \ref{sec:contentbuffer}.
16
17\begin{figure}[htbp]
18\begin{tabular}{ll}
19source text & \begin{CJK*}{UTF8}{gbsn}䜠奜\end{CJK*}\&\#33; \\
20UTF8  & E4 BD A0 E5 A5 BD 26 23 33 33 3B 0D 0A\\
21UTF16 & 4F60 597D 0026 0023 0033 0033 003B 000D 000A\\
22CRLF to LF& 4F60 597D 0026 0023 0033 0033 003B 000A\\
23Ref Replace & 4F60 597D 0021 000A\\
24\end{tabular}
25\caption{}
26\label{fig:compression1}
27\end{figure}
28
29\begin{figure}[htbp]
30\begin{tabular}{ll}
31source text & \begin{CJK*}{UTF8}{gbsn}䜠奜\end{CJK*}\&\#33; \\
32UTF8  & E4 BD A0 E5 A5 BD 26 23 33 33 3B 0D 0A\\
33delmask for UTF16 & 1101100000000\\
34delmask for CR& 0000000000010\\
35delmask for Ref & 0000000111100\\
36delmask & 1101100111110
37\end{tabular}
38\caption{}
39\label{fig:compression2}
40\end{figure}
41
42In Xerces and other sequential XML processors, each of these compression
43steps is typically implemented as a byte-at-a-time string-buffer
44to string-buffer operation.  This involves substantial
45copying and shifting of data at each step. Figure \ref{fig:compression1}
46illustrates.   Line 1 shows the input code units in UTF-8,
47with two 3-byte sequences for the Chinese characters ni3 hao3
48represented in hexadecimal form.   In Line 2, the UTF-16
49code units are shown, with a reduction in the total number of
50code units by 4, although each code unit is  now a 16-bit
51quantity.  This compression is carried out by the UTF-8 to UTF-16
52transcoder in Xerces. Line 3 then shows the reduction of the
53two CR-LF sequences to simple LF values. 
54% This is the line break normalization that takes place within the Reader???
55Line 4 then shows the further compression by replacement of
56the character and general entity references by their corresponding
57character values.
58
59In icXML, separate buffer copying operations for each of these
60steps is avoided by taking advantage of a parallel deletion
61algorithm to do compression.   The basic concept is to mark
62each position to be deleted using a 1-bit and then implement
63deletion using parallel block-at-a-time operations, following
64for example, the parallel compress algorithm\cite{HackersDelight}.
65Figure \ref{fig:compression2} shows three deletion masks corresponding
66to the three compressions of Figure \ref{fig:compression1}.
67In the fourth line of this figure, the bitwise-or of these
68three masks is computed.   This mask is then used to perform
69the deletions in a single application of our parallel deletion
70algorithm.  By combining the three steps, we reduce the cost
71of three separate buffer transformations to that of a single deletion
72operation.  The performance of the parallel deletion algorithm
73is independent of the number of positions deleted.  As a
74further note, Intel's recent Haswell architecture has
75introduced a parallel extract operation\cite{} that can directly
76perform parallel deletion.
77
78
79
80
81
Note: See TracBrowser for help on using the repository browser.