Ignore:
Timestamp:
Feb 5, 2011, 12:01:48 PM (9 years ago)
Author:
cameron
Message:

Changes/shortening by Fred.

File:
1 edited

Legend:

Unmodified
Added
Removed
  • docs/EuroPar2011/europar-cameron.tex

    r888 r893  
     1
    12
    23%%%%%%%%%%%%%%%%%%%%%%% file typeinst.tex %%%%%%%%%%%%%%%%%%%%%%%%%
     
    5051%
    5152\author{Robert D. Cameron
    52 \thanks{}
     53%\thanks{}
    5354\and Ehsan Amiri \and Kenneth S. Herdy \and Dan Lin \and Thomas C. Shermer \and Fred P. Popowich}
    5455
     
    8586processors to yield a dramatic speed-up over
    8687traditional alternatives employing byte-at-a-time parsing.
    87 \keywords{SIMD text processing, parallel bitstream technology, XML, parsing}
     88%\keywords{SIMD text processing, parallel bitstream technology, XML, parsing}
     89\keywords{SIMD text processing, parallel bitstreams, XML, parsing}
    8890\end{abstract}
    8991
     
    113115Nevertheless, there have been some notable works such as that of
    114116Scarpazza in applying the multicore and SIMD capabilities of the
    115 Cell/BE processor to regular expression matching \cite{Scarpazza:2009}
     117Cell/BE processor to regular expression matching \cite{Scarpazza:2009}.
    116118Intel has also signalled the importance of accelerated string
    117119processing to its customers through the introduction of new string processing
     
    121123
    122124Our research has been exploring a promising alternative approach, however, based on
    123 the concept of {\em parallel bit streams} \cite{Cameron2009,PPoPP08,CameronHerdyLin2008}.   
     125the concept of {\em parallel bitstreams} \cite{Cameron2009,PPoPP08,CameronHerdyLin2008}.   
    124126In this approach, byte streams
    125 are first sliced into eight basis bit streams, one for each
     127are first sliced into eight basis bitstreams, one for each
    126128bit position within the byte.  Bit stream $i$ thus comprises
    127129the $i$th bit of each byte.   Using 128-bit SIMD registers, then,
    128 bitwise logic operations on these basis bit streams allows
     130bitwise logic operations on these basis bitstreams allows
    129131byte classification operations to be carried out in parallel
    130132for 128 bytes at a time.  For example, consider a character class
    131 bit stream \verb:[<]: using a 1 bit to mark the position of
     133bitstream \verb:[<]: using a 1 bit to mark the position of
    132134opening angle brackets in a byte stream.  This stream may
    133 computed as logical combination of the basis bit streams using
     135computed as logical combination of the basis bitstreams using
    134136only seven bitwise logical operations per 128 bytes.
    135137
    136138Based on this approach, our prior work has shown how parallel
    137 bit streams may be used to accelerate XML parsing by further
     139bitstreams may be used to accelerate XML parsing by further
    138140taking advantage of processor {\em bit scan} instructions, commonly
    139 found within commodity processors\cite{CameronHerdyLin2008}.
     141found within commodity processors \cite{CameronHerdyLin2008}.
    140142On current Intel or AMD processors, for example, these instructions
    141143allow one to determine the position of the first 1 bit in a group of 64
     
    144146parsing in statistics gathering \cite{CameronHerdyLin2008} as well
    145147as GML to SVG conversion \cite{Herdy2008}.
     148Other efforts to accelerate XML parsing include the use of custom
     149XML chips \cite{Leventhal2009}, FPGAs \cite{DaiNiZhu2010}, and
     150multithread/multicore speedups based on fast preparsing \cite{ZhangPanChiu09}.
    146151
    147152In this paper, we further increase the parallelism in our methods
     
    159164addition on 64-bit words allows all such instances to be processed at once.
    160165
    161 
    162 Other efforts to accelerate XML parsing include the use of custom
    163 XML chips \cite{Leventhal2009}, FPGAs \cite{DaiNiZhu2010}, and
    164 multithread/multicore speedups based on fast preparsing \cite{ZhangPanChiu09}.
    165 
    166 The remainder of this paper is organized as follows.
    167 Section 2 reviews the basics of parallel bitstream technology
    168 and introduces our new parallel scanning primitive.
    169 Section 3 illustrates how this primitive may be used
    170 in the lexical processing of XML references including the
    171 parallel identification of errors.   Section 4 goes on
    172 to consider the more complex task of XML tag processing
    173 that goes beyond mere tokenization.
    174 Building on these methods, Section 5 describes how to
    175 construct a
    176 complete solution to the problem of XML parsing and
    177 well-formedness checking, in order
    178 to gauge the applicability and power of the techniques.
    179 Section \ref{sec:compile} then considers
    180 the translation of high-level operations on unbounded bitstreams into
    181 equivalent low-level code using SIMD intrinsics in the C programming
    182 language.
    183 Performance results are presented in section 7, comparing
    184 the results of our generated implementations in comparison with
    185 a number of existing alternatives.
    186 The paper concludes with
    187 comments on the current status of the work and directions for
    188 further research.
     166% The remainder of this paper is organized as follows.
     167% Section 2 reviews the basics of parallel bitstream technology
     168% and introduces our new parallel scanning primitive.
     169% Section 3 illustrates how this primitive may be used
     170% in the lexical processing of XML references including the
     171% parallel identification of errors.   Section 4 goes on
     172% to consider the more complex task of XML tag processing
     173% that goes beyond mere tokenization.
     174% Building on these methods, Section 5 describes how to
     175% construct a
     176% complete solution to the problem of XML parsing and
     177% well-formedness checking, in order
     178% to gauge the applicability and power of the techniques.
     179% Section \ref{sec:compile} then considers
     180% the translation of high-level operations on unbounded bitstreams into
     181% equivalent low-level code using SIMD intrinsics in the C programming
     182% language.
     183% Performance results are presented in section 7, comparing
     184% the results of our generated implementations in comparison with
     185% a number of existing alternatives.
     186% The paper concludes with
     187% comments on the current status of the work and directions for
     188% further research.
    189189
    190190\section{The Parallel Bitstream Method}\label{sec:parabit}
     
    219219of each character in the source data stream;
    220220thus each $B_k$ is dependent on the encoding of the source characters (ASCII, UTF-8, UTF-16, etc.).
    221 Given these basis bit streams, it is then possible to combine them
     221Given these basis bitstreams, it is then possible to combine them
    222222using bitwise logic in order to compute character-class
    223 bit streams, that is, streams that identify the positions at which characters belonging
     223bitstreams, that is, streams that identify the positions at which characters belonging
    224224to a particular class occur.  For example, the character class bitstream
    225225$D=$\verb:[0-9]: marks with $1$s the positions at which decimal digits
     
    235235
    236236
    237 Transposition of source data to basis bit streams and calculation
     237Transposition of source data to basis bitstreams and calculation
    238238of character-class streams in this way is an overhead on parallel bit
    239239stream applications, in general.   However, using the SIMD
     
    319319off any bits from the digit bitstream; these can never
    320320be marker positions resulting from a scan.
     321The addition and masking technique allows matching of
     322the regular expression \verb:[0-9]*: for any reasonable
     323(conflict-free) set of initial markers specified in $M_0$.
    321324
    322325\begin{figure}[tbh]
     
    335338\end{figure}
    336339
    337 The addition and masking technique allows matching of
    338 the regular expression \verb:[0-9]*: for any reasonable
    339 (conflict-free) set of initial markers specified in $M_0$.
    340 A conflict occurs when a span from one marker would run
    341 into another marker position.   However, such conflicts
    342 do not occur with the normal methods of marker bitstream
    343 formation, in which unique syntactic features of
    344 the input stream are used to specify the initial marker
    345 positions.
     340% The addition and masking technique allows matching of
     341% the regular expression \verb:[0-9]*: for any reasonable
     342% (conflict-free) set of initial markers specified in $M_0$.
     343% A conflict occurs when a span from one marker would run
     344% into another marker position.   However, such conflicts
     345% do not occur with the normal methods of marker bitstream
     346% formation, in which unique syntactic features of
     347% the input stream are used to specify the initial marker
     348% positions.
    346349
    347350In the remainder of this paper, the notation $s(M, C)$
     
    583586In this section, we consider the full application of the parsing techniques
    584587of the previous section to the problem of XML well-formedness checking \cite{TR:XML}.
    585 This application is useful as a well-defined and commercially significant
    586 example to assess the overall applicability of parallel bit stream techniques. 
    587 To what extent can the well-formedness requirements of XML be
    588 completely discharged using parallel bitstream techniques?
    589 Are those techniques worthwhile in every instance, or
    590 do better alternatives exist for certain requirements?
    591 For those requirements that cannot be fully implemented
    592 using parallel bitstream technology alone, what
    593 preprocessing support can be offered by parallel bit stream
    594 technology to the discharge of these requirements in other ways?
    595 We address each of these questions in this section,
    596 and look not only at the question of well-formedness, but also at
     588% This application is useful as a well-defined and commercially significant
     589% example to assess the overall applicability of parallel bitstream techniques.
     590% To what extent can the well-formedness requirements of XML be
     591% completely discharged using parallel bitstream techniques?
     592% Are those techniques worthwhile in every instance, or
     593% do better alternatives exist for certain requirements?
     594% For those requirements that cannot be fully implemented
     595% using parallel bitstream technology alone, what
     596% preprocessing support can be offered by parallel bitstream
     597% technology to the discharge of these requirements in other ways?
     598% We address each of these questions in this section,
     599% and look not only at the question of well-formedness, but also at
     600We look not only at the question of well-formedness, but also at
    597601the identification of error positions in documents that
    598602are not well-formed.
     
    609613decimal character references in Figure \ref{fig:decref}
    610614are error bitstreams.  One bits mark definite errors and zero bits mark the
    611 absence of error according to the requirement.   
     615absence of an error.   
     616% absence of error according to the requirement.   
    612617Thus the complete absence of errors according to the
    613618requirements listed may be determined by forming the
     
    624629
    625630In typical documents, most of these error-check streams will be quite sparse
    626 or even zero.   Many of the error conditions could
     631% or even zero.   Many of the error conditions could
     632or even zero.   Many error conditions could
    627633actually be fully implemented using bitstream techniques,
    628634but at the cost of a number of additional logical and shift
    629 operations.   In general, however, the conditions are
     635operations.   In general, the conditions are
    630636easier and more efficient to check one-at-a-time using
    631637multibyte comparisons on the original source data stream.
     
    660666\cite{DaiNiZhu2010}.
    661667
    662 In general, the use of error-check bitstreams is a straightforward,
    663 convenient and reasonably efficient mechanism for
    664 checking the well-formedness requirements.
     668% In general, the use of error-check bitstreams is a straightforward,
     669% convenient and reasonably efficient mechanism for
     670% checking the well-formedness requirements.
    665671
    666672%\subsection{Tag Matching}
     
    706712C implementation needs to process an input stream in blocks of size equal to the
    707713SIMD register width of the processor it runs on.
    708 So, to convert Python code into C, the key question becomes how
    709 to transfer information from one block to the next one.
     714To convert Python code into C, the key question becomes how
     715to transfer information from one block to the next.
    710716The answer lies in the use of {\em carry bits}, the collection of carries resulting from bitwise additions.
    711717 
    712718In fact, in the methods we have outlined, all the
    713 the information flow between blocks for parallel bit stream
     719the information flow between blocks for parallel bitstream
    714720calculations can be modeled using carry bits.   The parallel
    715721scanning primitive uses only addition and bitwise logic.
     
    724730case equivalent to the carry generated by the additon.
    725731The only other information flow requirement in the
    726 calculation of parallel bit streams occurs with the
     732calculation of parallel bitstreams occurs with the
    727733bitstream subtractions that are used to calculate span streams.
    728734In this case, the information flow is based on borrows
     
    730736
    731737Properly determining, initializing and inserting carry bits
    732 into a block-by-block implementation of parallel bit stream
     738into a block-by-block implementation of parallel bitstream
    733739code is a task too tedious for manual implementation.
    734740We have thus developed compiler technology to automatically
    735741insert declarations, initializations and carry save/restore
    736742operations into appropriate locations when translating
    737 Python operations on unbounded bit streams into the
     743Python operations on unbounded bitstreams into the
    738744equivalent low-level C code implemented on a block-by-block
    739745bases.  Our current compiler toolkit is capable of inserting
     
    802808
    803809The remaining rows of Table \ref{parsers-cpb} show performance
    804 of parallel bit stream implementations.  The first row shows
     810of parallel bitstream implementations.  The first row shows
    805811the performance of our Parabix 1 implementation using
    806812bit scan instructions.   While showing a substantial speed-up
Note: See TracChangeset for help on using the changeset viewer.