Changeset 893 for docs/EuroPar2011/europar-cameron.tex

Ignore:
Timestamp:
Feb 5, 2011, 12:01:48 PM (9 years ago)
Message:

Changes/shortening by Fred.

File:
1 edited

Legend:

Unmodified
 r888 %%%%%%%%%%%%%%%%%%%%%%% file typeinst.tex %%%%%%%%%%%%%%%%%%%%%%%%% % \author{Robert D. Cameron \thanks{} %\thanks{} \and Ehsan Amiri \and Kenneth S. Herdy \and Dan Lin \and Thomas C. Shermer \and Fred P. Popowich} processors to yield a dramatic speed-up over traditional alternatives employing byte-at-a-time parsing. \keywords{SIMD text processing, parallel bitstream technology, XML, parsing} %\keywords{SIMD text processing, parallel bitstream technology, XML, parsing} \keywords{SIMD text processing, parallel bitstreams, XML, parsing} \end{abstract} Nevertheless, there have been some notable works such as that of Scarpazza in applying the multicore and SIMD capabilities of the Cell/BE processor to regular expression matching \cite{Scarpazza:2009} Cell/BE processor to regular expression matching \cite{Scarpazza:2009}. Intel has also signalled the importance of accelerated string processing to its customers through the introduction of new string processing Our research has been exploring a promising alternative approach, however, based on the concept of {\em parallel bit streams} \cite{Cameron2009,PPoPP08,CameronHerdyLin2008}. the concept of {\em parallel bitstreams} \cite{Cameron2009,PPoPP08,CameronHerdyLin2008}. In this approach, byte streams are first sliced into eight basis bit streams, one for each are first sliced into eight basis bitstreams, one for each bit position within the byte.  Bit stream $i$ thus comprises the $i$th bit of each byte.   Using 128-bit SIMD registers, then, bitwise logic operations on these basis bit streams allows bitwise logic operations on these basis bitstreams allows byte classification operations to be carried out in parallel for 128 bytes at a time.  For example, consider a character class bit stream \verb:[<]: using a 1 bit to mark the position of bitstream \verb:[<]: using a 1 bit to mark the position of opening angle brackets in a byte stream.  This stream may computed as logical combination of the basis bit streams using computed as logical combination of the basis bitstreams using only seven bitwise logical operations per 128 bytes. Based on this approach, our prior work has shown how parallel bit streams may be used to accelerate XML parsing by further bitstreams may be used to accelerate XML parsing by further taking advantage of processor {\em bit scan} instructions, commonly found within commodity processors\cite{CameronHerdyLin2008}. found within commodity processors \cite{CameronHerdyLin2008}. On current Intel or AMD processors, for example, these instructions allow one to determine the position of the first 1 bit in a group of 64 parsing in statistics gathering \cite{CameronHerdyLin2008} as well as GML to SVG conversion \cite{Herdy2008}. Other efforts to accelerate XML parsing include the use of custom XML chips \cite{Leventhal2009}, FPGAs \cite{DaiNiZhu2010}, and multithread/multicore speedups based on fast preparsing \cite{ZhangPanChiu09}. In this paper, we further increase the parallelism in our methods addition on 64-bit words allows all such instances to be processed at once. Other efforts to accelerate XML parsing include the use of custom XML chips \cite{Leventhal2009}, FPGAs \cite{DaiNiZhu2010}, and multithread/multicore speedups based on fast preparsing \cite{ZhangPanChiu09}. The remainder of this paper is organized as follows. Section 2 reviews the basics of parallel bitstream technology and introduces our new parallel scanning primitive. Section 3 illustrates how this primitive may be used in the lexical processing of XML references including the parallel identification of errors.   Section 4 goes on to consider the more complex task of XML tag processing that goes beyond mere tokenization. Building on these methods, Section 5 describes how to construct a complete solution to the problem of XML parsing and well-formedness checking, in order to gauge the applicability and power of the techniques. Section \ref{sec:compile} then considers the translation of high-level operations on unbounded bitstreams into equivalent low-level code using SIMD intrinsics in the C programming language. Performance results are presented in section 7, comparing the results of our generated implementations in comparison with a number of existing alternatives. The paper concludes with comments on the current status of the work and directions for further research. % The remainder of this paper is organized as follows. % Section 2 reviews the basics of parallel bitstream technology % and introduces our new parallel scanning primitive. % Section 3 illustrates how this primitive may be used % in the lexical processing of XML references including the % parallel identification of errors.   Section 4 goes on % to consider the more complex task of XML tag processing % that goes beyond mere tokenization. % Building on these methods, Section 5 describes how to % construct a % complete solution to the problem of XML parsing and % well-formedness checking, in order % to gauge the applicability and power of the techniques. % Section \ref{sec:compile} then considers % the translation of high-level operations on unbounded bitstreams into % equivalent low-level code using SIMD intrinsics in the C programming % language. % Performance results are presented in section 7, comparing % the results of our generated implementations in comparison with % a number of existing alternatives. % The paper concludes with % comments on the current status of the work and directions for % further research. \section{The Parallel Bitstream Method}\label{sec:parabit} of each character in the source data stream; thus each $B_k$ is dependent on the encoding of the source characters (ASCII, UTF-8, UTF-16, etc.). Given these basis bit streams, it is then possible to combine them Given these basis bitstreams, it is then possible to combine them using bitwise logic in order to compute character-class bit streams, that is, streams that identify the positions at which characters belonging bitstreams, that is, streams that identify the positions at which characters belonging to a particular class occur.  For example, the character class bitstream $D=$\verb:[0-9]: marks with $1$s the positions at which decimal digits Transposition of source data to basis bit streams and calculation Transposition of source data to basis bitstreams and calculation of character-class streams in this way is an overhead on parallel bit stream applications, in general.   However, using the SIMD off any bits from the digit bitstream; these can never be marker positions resulting from a scan. The addition and masking technique allows matching of the regular expression \verb:[0-9]*: for any reasonable (conflict-free) set of initial markers specified in $M_0$. \begin{figure}[tbh] \end{figure} The addition and masking technique allows matching of the regular expression \verb:[0-9]*: for any reasonable (conflict-free) set of initial markers specified in $M_0$. A conflict occurs when a span from one marker would run into another marker position.   However, such conflicts do not occur with the normal methods of marker bitstream formation, in which unique syntactic features of the input stream are used to specify the initial marker positions. % The addition and masking technique allows matching of % the regular expression \verb:[0-9]*: for any reasonable % (conflict-free) set of initial markers specified in $M_0$. % A conflict occurs when a span from one marker would run % into another marker position.   However, such conflicts % do not occur with the normal methods of marker bitstream % formation, in which unique syntactic features of % the input stream are used to specify the initial marker % positions. In the remainder of this paper, the notation $s(M, C)$ In this section, we consider the full application of the parsing techniques of the previous section to the problem of XML well-formedness checking \cite{TR:XML}. This application is useful as a well-defined and commercially significant example to assess the overall applicability of parallel bit stream techniques. To what extent can the well-formedness requirements of XML be completely discharged using parallel bitstream techniques? Are those techniques worthwhile in every instance, or do better alternatives exist for certain requirements? For those requirements that cannot be fully implemented using parallel bitstream technology alone, what preprocessing support can be offered by parallel bit stream technology to the discharge of these requirements in other ways? We address each of these questions in this section, and look not only at the question of well-formedness, but also at % This application is useful as a well-defined and commercially significant % example to assess the overall applicability of parallel bitstream techniques. % To what extent can the well-formedness requirements of XML be % completely discharged using parallel bitstream techniques? % Are those techniques worthwhile in every instance, or % do better alternatives exist for certain requirements? % For those requirements that cannot be fully implemented % using parallel bitstream technology alone, what % preprocessing support can be offered by parallel bitstream % technology to the discharge of these requirements in other ways? % We address each of these questions in this section, % and look not only at the question of well-formedness, but also at We look not only at the question of well-formedness, but also at the identification of error positions in documents that are not well-formed. decimal character references in Figure \ref{fig:decref} are error bitstreams.  One bits mark definite errors and zero bits mark the absence of error according to the requirement. absence of an error. % absence of error according to the requirement. Thus the complete absence of errors according to the requirements listed may be determined by forming the In typical documents, most of these error-check streams will be quite sparse or even zero.   Many of the error conditions could % or even zero.   Many of the error conditions could or even zero.   Many error conditions could actually be fully implemented using bitstream techniques, but at the cost of a number of additional logical and shift operations.   In general, however, the conditions are operations.   In general, the conditions are easier and more efficient to check one-at-a-time using multibyte comparisons on the original source data stream. \cite{DaiNiZhu2010}. In general, the use of error-check bitstreams is a straightforward, convenient and reasonably efficient mechanism for checking the well-formedness requirements. % In general, the use of error-check bitstreams is a straightforward, % convenient and reasonably efficient mechanism for % checking the well-formedness requirements. %\subsection{Tag Matching} C implementation needs to process an input stream in blocks of size equal to the SIMD register width of the processor it runs on. So, to convert Python code into C, the key question becomes how to transfer information from one block to the next one. To convert Python code into C, the key question becomes how to transfer information from one block to the next. The answer lies in the use of {\em carry bits}, the collection of carries resulting from bitwise additions. In fact, in the methods we have outlined, all the the information flow between blocks for parallel bit stream the information flow between blocks for parallel bitstream calculations can be modeled using carry bits.   The parallel scanning primitive uses only addition and bitwise logic. case equivalent to the carry generated by the additon. The only other information flow requirement in the calculation of parallel bit streams occurs with the calculation of parallel bitstreams occurs with the bitstream subtractions that are used to calculate span streams. In this case, the information flow is based on borrows Properly determining, initializing and inserting carry bits into a block-by-block implementation of parallel bit stream into a block-by-block implementation of parallel bitstream code is a task too tedious for manual implementation. We have thus developed compiler technology to automatically insert declarations, initializations and carry save/restore operations into appropriate locations when translating Python operations on unbounded bit streams into the Python operations on unbounded bitstreams into the equivalent low-level C code implemented on a block-by-block bases.  Our current compiler toolkit is capable of inserting The remaining rows of Table \ref{parsers-cpb} show performance of parallel bit stream implementations.  The first row shows of parallel bitstream implementations.  The first row shows the performance of our Parabix 1 implementation using bit scan instructions.   While showing a substantial speed-up