source: docs/Working/icXML/parfilter.tex @ 2529

Last change on this file since 2529 was 2523, checked in by lindanl, 7 years ago

minor changes

File size: 4.3 KB
1\subsection{Combined Parallel Filtering}
4As just mentioned, UTF-8 to UTF-16 transcoding involves marking
5all but the last bytes of multibyte UTF-8 sequences as
6positions for deletion.   For example, the two
7Chinese characters \begin{CJK*}{UTF8}{gbsn}䜠奜\end{CJK*}
8are represented as two three-byte UTF-8 sequences \verb'E4 BD A0'
9and \verb'E5 A5 BD' while the UTF-16 representation must be
10compressed down to the two code units \verb'4F60' and \verb'597D'.
11In the bit parallel representation, this corresponds to a reduction
12from six bit positions representing UTF-8 code units (bytes)
13down to just two bit positions representing UTF-16 code units
14(doublebytes).   This compression may be achieved by
15arranging to calculate the correct UTF-16 bits at the
16final position of each sequence and creating a deletion
17mask to mark the first two bytes of each 3-byte sequence
18for deletion.   In this case, the portion of the mask
19corresponding to these input bytes is the bit sequence
20\verb'110110'.  Using this approach, transcoding may then be
21completed by applying parallel deletion and inverse transposition of the
22UTF-16 bit streams\cite{Cameron2008}.
24Rather than immediately paying the
25costs of deletion and transposition just for transcoding,
26however, \icXML{} defers these steps so that the deletion
27masks for several stages of processing may be combined.
28In particular, this includes core XML requirements
29to normalize line breaks and to replace character
30reference and entity references by their corresponding
31text.   In the case of line break normalization,
32all forms of line breaks, including bare carriage
33returns (CR), line feeds (LF) and CR-LF combinations
34must be normalized to a single LF character in
35each case.   In \icXML{}, this is achieved by
36first marking CR positions, performing two
37bit parallel operations to transform the marked
38CRs into LFs, and then marking for deletion any
39LF that is found immediately after the marked CR
40as shown by the Pablo source code in Figure \ref{fig:LBnormalization}.
43# XML 1.0 line-break normalization rules.
44if lex.CR:
45# Modify CR (#x0D) to LF (#x0A)
46  u16lo.bit_5 ^= lex.CR
47  u16lo.bit_6 ^= lex.CR
48  u16lo.bit_7 ^= lex.CR
49  CRLF = pablo.Advance(lex.CR) & lex.LF
50  callouts.delmask |= CRLF
51# Adjust LF streams for newline/column tracker
52  lex.LF |= lex.CR
53  lex.LF ^= CRLF
55\caption{Line Break Normalization Logic}\label{fig:LBnormalization}
57In essence, the deletion masks for transcoding and
58for line break normalization each represent a bitwise
59filter; these filters can be combined using bitwise-or
60so that the parallel deletion algorithm need only be
61applied once.
63A further application of combined filtering
64is the processing of XML character and entity
65references.   Consider, for example, the references \verb:&: or \verb:<:.
66which must be replaced in XML processing with 
67the single `\verb:&:' and `\verb:<:' characters, respectively.
68The approach in \icXML{} is to mark all but the first character
69positions of each reference for deletion, leaving a
70single character position unmodified.  Thus, for the
71references \verb:&amp;: or \verb:&#x3C;: the
72masks \verb:01111: and \verb:011111: are formed and
73combined into the overall deletion mask.   After the
74deletion and inverse transposition operations are finally
75applied, a postprocessing step inserts the proper character
76at these positions.   One note about this process is
77that it is speculative; references are assumed to generally
78be replaced by a single UTF-16 code unit.   In the case,
79that this is not true, it is addressed in postprocessing.
81The final step of combined filtering occurs during
82the process of reducing markup data to tag bytes
83preceding each significant XML transition as described
84in section \ref{sec:contentbuffer}.  Overall, \icXML{}
85avoids separate buffer copying operations for each of the
86these filtering steps, paying the cost of parallel
87deletion and inverse transposition only once. 
88Currently, \icXML{} employs the parallel-prefix compress algorithm
89of Steele\cite{HackersDelight}  Performance
90is independent of the number of positions deleted.  As a
91further note, future versions of \icXML{} are expected to
92take advantage of the parallel extract operation\cite{HilewitzLee2006}
93that Intel is now providing in its Haswell architecture.
Note: See TracBrowser for help on using the repository browser.