Changeset 2520

Oct 20, 2012, 12:04:06 PM (7 years ago)

Parallel filtering

2 edited


  • docs/Working/icXML/arch-charactersetadapters.tex

    r2517 r2520  
    1212In \icXML{}, however,  the concept of Character Set Adapters (CSAs) is used to minimize transcoding costs.
    13 Given a specified input encoding, a Character Set Adapter is responsible for validating that
     13Given a specified input encoding, a Character Set Adapter is responsible for checking that
    1414input code units represent valid characters, mapping the characters of the incoding into
    1515the appropriate bit streams for XML parsing actions (i.e., producing the lexical item
    3535A third observation is that repeated transcoding of the names of XML
    3636elements, attributes and so on can be avoided by using a lookup mechanism.
    37 That is, the first occurrence of each symbol is store in a lookup
     37That is, the first occurrence of each symbol is stored in a lookup
    3838table mapping the input encoding to a numeric symbol ID.   Transcoding
    3939of the symbol is applied at this time.  Subsequent lookup operations
    5050case of UTF-8 to UTF-16 transcoding, the corresponding UTF-16 bit streams
    5151can be calculated in bit parallel fashion based on UTF-8 streams \cite{Cameron2008},
    52 and all but the final bytes of multibyte sequences can be marked for deletion.
     52and all but the final bytes of multibyte sequences can be marked for deletion as
     53discussed in the following subsection.
    5354In other cases, transcoding within a block only need be applied for non-ASCII
    5455bytes, which are conveniently identified by iterating through the bit 0 stream
  • docs/Working/icXML/parfilter.tex

    r2429 r2520  
    4 Several stages of standard XML processing include compressions
    5 that filter out data from certain input byte positions
    6 in various ways. These include the compression of multibyte
    7 sequences in UTF-8 to UTF-16 transcoding, the compression
    8 of CR-LF sequences to single LF characters in line break
    9 normalization and the compression of character and entity
    10 references of the form \verb:": or \verb:<: to
    11 the single `\verb:":' and `\verb:<:' characters, respectively.
    12 Within icXML, a further compression step reduces all
    13 markup data within the input document to a single tag
    14 byte preceding each significant XML transition as described
    15 in section \ref{sec:contentbuffer}.
     4As just mentioned, UTF-8 to UTF-16 transcoding involves marking
     5all but the last bytes of multibyte UTF-8 sequences as
     6positions for deletion.   For example, the two
     7Chinese characters \begin{CJK*}{UTF8}{gbsn}䜠奜\end{CJK*}
     8are represented as two three-byte UTF-8 sequences \verb'E4 BD A0'
     9and \verb'E5 A5 BD' while the UTF-16 representation must be
     10compressed down to the two code units \verb'4F60' and \verb'597D'.
     11In the bit parallel representation, this corresponds to a reduction
     12from six bit positions representing UTF-8 code units (bytes)
     13down to just two bit positions representing UTF-16 code units
     14(doublebytes).   This compression may be achieved by
     15arranging to calculate the correct UTF-16 bits at the
     16final position of each sequence and creating a deletion
     17mask to mark the first two bytes of each 3-byte sequence
     18for deletion.   In this case, the portion of the mask
     19corresponding to these input bytes is the bit sequence
     20\verb'110110'.  Using this approach, transcoding may then be
     21completed by applying parallel deletion and inverse transposition of the
     22UTF-16 bit streams\cite{Cameron2008}.
    17 \begin{figure}[htbp]
    18 \begin{tabular}{ll}
    19 source text & \begin{CJK*}{UTF8}{gbsn}䜠奜\end{CJK*}\&\#33; \\
    20 UTF8  & E4 BD A0 E5 A5 BD 26 23 33 33 3B 0D 0A\\
    21 UTF16 & 4F60 597D 0026 0023 0033 0033 003B 000D 000A\\
    22 CRLF to LF& 4F60 597D 0026 0023 0033 0033 003B 000A\\
    23 Ref Replace & 4F60 597D 0021 000A\\
    24 \end{tabular}
    25 \caption{}
    26 \label{fig:compression1}
     24Rather than immediately paying the
     25costs of deletion and transposition just for transcoding,
     26however, \icXML{} defers these steps so that the deletion
     27masks for several stages of processing may be combined.
     28In particular, this includes core XML requirements
     29to normalize line breaks and to replace character
     30reference and entity references by their corresponding
     31text.   In the case of line break normalization,
     32all forms of line breaks, including bare carriage
     33returns (CR), line feeds (LF) and CR-LF combinations
     34must be normalized to a single LF character in
     35each case.   In \icXML{}, this is achieved by
     36first marking CR positions, performing two
     37bit parallel operations to transform the marked
     38CRs into LFs, and then marking for deletion any
     39LF that is found immediately after the marked CR
     40as shown by the Pablo source code in Figure \ref{fig:LBnormalization}.
     43# XML 1.0 line-break normalization rules.
     44if lex.CR:
     45# Modify CR (#x0D) to LF (#x0A)
     46  u16lo.bit_5 ^= lex.CR
     47  u16lo.bit_6 ^= lex.CR
     48  u16lo.bit_7 ^= lex.CR
     49  CRLF = pablo.Advance(lex.CR) & lex.LF
     50  callouts.delmask |= CRLF
     51# Adjust LF streams for newline/column tracker
     52  lex.LF |= lex.CR
     53  lex.LF ^= CRLF
     55\caption{Line Break Normalization Logic}\label{fig:LBnormalization}
     57In essence, the deletion masks for transcoding and
     58for line break normalization each represent a bitwise
     59filter; these filters can be combined using bitwise-or
     60so that the parallel deletion algorithm need only be
     61applied once.
    29 \begin{figure}[htbp]
    30 \begin{tabular}{ll}
    31 source text & \begin{CJK*}{UTF8}{gbsn}䜠奜\end{CJK*}\&\#33; \\
    32 UTF8  & E4 BD A0 E5 A5 BD 26 23 33 33 3B 0D 0A\\
    33 delmask for UTF16 & 1101100000000\\
    34 delmask for CR& 0000000000010\\
    35 delmask for Ref & 0000000111100\\
    36 delmask & 1101100111110
    37 \end{tabular}
    38 \caption{}
    39 \label{fig:compression2}
    40 \end{figure}
     63A further application of combined filtering
     64is the processing of XML character and entity
     65references.   Consider, for example, the references \verb:&amp;: or \verb:&#x3C;:.
     66which must be replaced in XML processing with 
     67the single `\verb:&:' and `\verb:<:' characters, respectively.
     68The approach in \icXML{} is to mark all but the first character
     69positions of each reference for deletion, leaving a
     70single character position unmodified.  Thus, for the
     71references \verb:&amp;: or \verb:&#x3C;: the
     72masks \verb:01111: and \verb:011111: are formed and
     73combined into the overall deletion mask.   After the
     74deletion and inverse transposition operations are finally
     75applied, a postprocessing step inserts the proper character
     76at these positions.   One note about this process is
     77that it is speculative; references are assumed to generally
     78be replaced by a single UTF-16 code unit.   In the case,
     79that this is not true, it is addressed in postprocessing.
    42 In Xerces and other sequential XML processors, each of these compression
    43 steps is typically implemented as a byte-at-a-time string-buffer
    44 to string-buffer operation.  This involves substantial
    45 copying and shifting of data at each step. Figure \ref{fig:compression1}
    46 illustrates.   Line 1 shows the input code units in UTF-8,
    47 with two 3-byte sequences for the Chinese characters ni3 hao3
    48 represented in hexadecimal form.   In Line 2, the UTF-16
    49 code units are shown, with a reduction in the total number of
    50 code units by 4, although each code unit is  now a 16-bit
    51 quantity.  This compression is carried out by the UTF-8 to UTF-16
    52 transcoder in Xerces. Line 3 then shows the reduction of the
    53 two CR-LF sequences to simple LF values. 
    54 % This is the line break normalization that takes place within the Reader???
    55 Line 4 then shows the further compression by replacement of
    56 the character and general entity references by their corresponding
    57 character values.
    59 In icXML, separate buffer copying operations for each of these
    60 steps is avoided by taking advantage of a parallel deletion
    61 algorithm to do compression.   The basic concept is to mark
    62 each position to be deleted using a 1-bit and then implement
    63 deletion using parallel block-at-a-time operations, following
    64 for example, the parallel compress algorithm\cite{HackersDelight}.
    65 Figure \ref{fig:compression2} shows three deletion masks corresponding
    66 to the three compressions of Figure \ref{fig:compression1}.
    67 In the fourth line of this figure, the bitwise-or of these
    68 three masks is computed.   This mask is then used to perform
    69 the deletions in a single application of our parallel deletion
    70 algorithm.  By combining the three steps, we reduce the cost
    71 of three separate buffer transformations to that of a single deletion
    72 operation.  The performance of the parallel deletion algorithm
     81The final step of combined filtering occurs during
     82the process of reducing markup data to tag bytes
     83preceding each significant XML transition as described
     84in section \ref{sec:contentbuffer}.  Overall, \icXML{}
     85avoids separate buffer copying operations for each of the
     86these filtering steps, paying the cost of parallel
     87deletion and inverse transposition only once. 
     88Currently, \icXML{} employs the parallel-prefix compress algorithm
     89of Steele\cite{HackersDelight}  Performance
    7390is independent of the number of positions deleted.  As a
    74 further note, Intel's recent Haswell architecture has
    75 introduced a parallel extract operation\cite{} that can directly
    76 perform parallel deletion.
     91further note, future versions of \icXML{} are expected to
     92take advantage of the parallel extract operation\cite{}
     93that Intel is now providing in its Haswell architecture \cite{}.
Note: See TracChangeset for help on using the changeset viewer.