source: docs/Working/icXML/parfilter.tex @ 3633

Last change on this file since 3633 was 2872, checked in by nmedfort, 7 years ago

edits

File size: 5.7 KB
Line 
1\subsection{Combined Parallel Filtering}
2\label{sec:parfilter}
3
4As just mentioned, UTF-8 to UTF-16 transcoding involves marking
5all but the last bytes of multi-byte UTF-8 sequences as
6positions for deletion.   For example, the two
7Chinese characters \begin{CJK*}{UTF8}{gbsn}䜠奜\end{CJK*}
8are represented as two three-byte UTF-8 sequences \verb'E4 BD A0'
9and \verb'E5 A5 BD' while the UTF-16 representation must be
10compressed down to the two code units \verb'4F60' and \verb'597D'.
11In the bit parallel representation, this corresponds to a reduction
12from six bit positions representing UTF-8 code units (bytes)
13down to just two bit positions representing UTF-16 code units
14(double bytes).   This compression may be achieved by
15arranging to calculate the correct UTF-16 bits at the
16final position of each sequence and creating a deletion
17mask to mark the first two bytes of each 3-byte sequence
18for deletion.   In this case, the portion of the mask
19corresponding to these input bytes is the bit sequence
20\verb'110110'.  Using this approach, transcoding may then be
21completed by applying parallel deletion and inverse transposition of the
22UTF-16 \bitstream{}s\cite{Cameron2008}.
23
24\begin{figure*}[tbh]
25\begin{center}
26\begin{tabular}{rr}\\
27Source Data & \verb`<document>fee<element a1='fie' a2 = 'foe'></element>fum</document>`\\
28% Tag Openers & \verb`1____________1____________________________1____________1__________`\\
29% Start Tag Marks & \verb`_1____________1___________________________________________________`\\
30% End Tag Marks & \verb`___________________________________________1____________1_________`\\
31% Empty Tag Marks & \verb`__________________________________________________________________`\\
32% Element Names & \verb`_11111111_____1111111_____________________________________________`\\
33% Attribute Names & \verb`______________________11_______11_________________________________`\\
34% Attribute Values & \verb`__________________________111________111__________________________`\\
35String Ends & \verb`1____________1_______________1__________1_1____________1__________`\\
36Markup Identifiers & \verb`_________1______________1_________1______1_1____________1_________`\\
37Deletion Mask & \verb`_11111111_____1111111111_1____1111_11_______11111111_____111111111`\\
38Undeleted Data & \verb``{\tt\it 0}\verb`________>fee`{\tt\it 0}\verb`__________=_fie`{\tt\it 0}\verb`____=__foe`{\tt\it 0}\verb`>`{\tt\it 0}\verb`/________fum`{\tt\it 0}\verb`/_________`
39\end{tabular}
40\end{center}
41\caption{XML Source Data and Derived Parallel Bit Streams}
42\label{fig:parabix2}
43\end{figure*}
44
45Rather than immediately paying the
46costs of deletion and transposition just for transcoding,
47however, \icXML{} defers these steps so that the deletion
48masks for several stages of processing may be combined.
49In particular, this includes core XML requirements
50to normalize line breaks and to replace character
51reference and entity references by their corresponding
52text.   In the case of line break normalization,
53all forms of line breaks, including bare carriage
54returns (CR), line feeds (LF) and CR-LF combinations
55must be normalized to a single LF character in
56each case.   In \icXML{}, this is achieved by
57first marking CR positions, performing two
58bit parallel operations to transform the marked
59CRs into LFs, and then marking for deletion any
60LF that is found immediately after the marked CR
61as shown by the Pablo source code in Figure \ref{fig:LBnormalization}.
62\begin{figure}
63\begin{verbatim}
64# XML 1.0 line-break normalization rules.
65if lex.CR:
66# Modify CR (#x0D) to LF (#x0A)
67  u16lo.bit_5 ^= lex.CR
68  u16lo.bit_6 ^= lex.CR
69  u16lo.bit_7 ^= lex.CR
70  CRLF = pablo.Advance(lex.CR) & lex.LF
71  callouts.delmask |= CRLF
72# Adjust LF streams for line/column tracker
73  lex.LF |= lex.CR
74  lex.LF ^= CRLF
75\end{verbatim}
76\caption{Line Break Normalization Logic}\label{fig:LBnormalization}
77\end{figure}
78In essence, the deletion masks for transcoding and
79for line break normalization each represent a bitwise
80filter; these filters can be combined using bitwise-or
81so that the parallel deletion algorithm need only be
82applied once.
83
84A further application of combined filtering
85is the processing of XML character and entity
86references.   Consider, for example, the references \verb:&amp;: or \verb:&#x3C;:.
87which must be replaced in XML processing with 
88the single `\verb:&:' and `\verb:<:' characters, respectively.
89The approach in \icXML{} is to mark all but the first character
90positions of each reference for deletion, leaving a
91single character position unmodified.  Thus, for the
92references \verb:&amp;: or \verb:&#x3C;: the
93masks \verb:01111: and \verb:011111: are formed and
94combined into the overall deletion mask.   After the
95deletion and inverse transposition operations are finally
96applied, a post-processing step inserts the proper character
97at these positions.   One note about this process is
98that it is speculative; references are assumed to generally
99be replaced by a single UTF-16 code unit.   In the case,
100that this is not true, it is addressed in post-processing.
101
102The final step of combined filtering occurs during
103the process of reducing markup data to tag bytes
104preceding each significant XML transition as described
105in section~\ref{section:arch:contentstream}.  Overall, \icXML{}
106avoids separate buffer copying operations for each of the
107these filtering steps, paying the cost of parallel
108deletion and inverse transposition only once. 
109Currently, \icXML{} employs the parallel-prefix compress algorithm
110of Steele~\cite{HackersDelight}  Performance
111is independent of the number of positions deleted.
112Future versions of \icXML{} are expected to
113take advantage of the parallel extract operation~\cite{HilewitzLee2006}
114that Intel is now providing in its Haswell architecture.
Note: See TracBrowser for help on using the repository browser.