 Timestamp:
 Sep 14, 2013, 7:50:36 AM (6 years ago)
 Location:
 docs/Working/re
 Files:

 1 added
 1 edited
Legend:
 Unmodified
 Added
 Removed

docs/Working/re/ppoppre.tex
r3473 r3475 163 163 This approach relies on a bitwise data parallel view of text 164 164 streams as well as a surprising use of addition to match 165 runs of characters in a sin gle step. The closest previous165 runs of characters in a sin`gle step. The closest previous 166 166 work is that underlying bitparallel XML parsing using 128bit SSE2 SIMD 167 167 technology together with a parallel scanning primitive also … … 317 317 \section{BlockataTime Processing}\label{sec:blockwise} 318 318 319 319 The unbounded stream model of the previous section must of course 320 be translated an implementation that proceeds blockatatime for 321 realistic application. In this, we primarily rely on the Pablo 322 compiler of the Parabix toolchain \cite{lin2012parabix}. Given input 323 statements expressed as arbitrarylength bitstream equations, Pablo 324 produces blockatatime C++ code that initializes and maintains all the necessary 325 carry bits for each of the additions and shifts involved in the 326 bitstream calculations. 327 328 In the present work, our principal contribution to the blockatatime 329 model is the technique of longstream addition described below. 330 Otherwise, we were able to use Pablo directly in compiling our 331 SSE2 and AVX2 implementations. Our GPU implementation required 332 some scripting to modify the output of the Pablo compiler for our 333 purpose. 320 334 321 335 \paragraph*{LongStream Addition.} The maximum word size for … … 336 350 \item \verb#simd<64>::eq(X, 1)# : comparison of the 64bit fields 337 351 of \verb:x: each with the constant value 1 (all bits 1), producing 338 an $f$bit mask value 352 an $f$bit mask value, 339 353 \item \verb#hsimd<64>::mask(X)# : gathering the high bit of each 64bit 340 354 field into a single compressed $f$bit mask value, and … … 360 374 361 375 \item Compute an $f$bit mask of all fields of {\tt R} that will overflow with 362 an incoming carry bit. 376 an incoming carry bit. This is the {\em bubble mask}. 363 377 \[\text{\tt b} = \text{\tt simd<64>::eq(R, 1)}\] 364 378 365 379 \item Determine an $f$bit mask identifying the fields of {\tt R} that need to be 366 incremented to produce the final sum. This is just MatchStar. 380 incremented to produce the final sum. Here we find a new application of 381 MatchStar! 367 382 \[\text{\tt i} = \text{\tt MatchStar(c*2, b)}\] 368 383 369 \item Computed the final result {\tt Z}. 384 This is the key step. The mask {\tt c} of outgoing carries must be 385 shifted one position ({\tt c*2}) so that each outgoing carry bit becomes associated 386 with the next digit. At the incoming position, the carry will 387 increment the 64bit digit. However, if this digit is all ones (as 388 signalled by the corresponding bit of bubble mask {\tt b}, then the addition 389 will generate another carry. In fact, if there is a sequence of 390 digits that are all ones, then the carry must bubble through 391 each of them. This is just MatchStar! 392 393 \item Compute the final result {\tt Z}. 370 394 \[\text{\tt Z} = \text{\tt simd<64>::add(R, simd<64>::spread(i))}\] 371 395 372 396 \end{enumerate} 373 374 Step 5 is the key step to determine which fieldvalues of {\tt R} 375 must be incremented due to incoming carries. Note that {\tt c} 376 is the bit mask of outgoing carries, so the multiplication by 2 377 is necessary to move each carry to its incoming position. 378 At the incoming position, the carry will require an increment 379 to the corresponding field in {\tt R}, as well as all further 380 fields that are reachable by carry propagation through a 381 consecutive run of ones. 382 383 Figure \ref{fig:longadd} shows the operation of longstream addition 384 through an example. 385 386 387 388 389 397 \begin{figure} 398 \begin{center} 399 \begin{tabular}{crrrrrrrr}\cline{29} 400 {\tt X} & {\tt 19} & {\tt 31} & {\tt BA} & {\tt 4C} & {\tt 3D} & {\tt 45} & {\tt 21} & {\tt F1} \\ \cline{29} 401 {\tt Y} & {\tt 22} & {\tt 12} & {\tt 45} & {\tt B3} & {\tt E2} & {\tt 16} & {\tt 17} & {\tt 36} \\ \cline{29} 402 {\tt R} & {\tt 3B} & {\tt 43} & {\tt FF} & {\tt FF} & {\tt 1F} & {\tt 5B} & {\tt 38} & {\tt 27} \\ \cline{29} 403 {\tt x} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} \\ \cline{29} 404 {\tt y} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} \\ \cline{29} 405 {\tt r} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} \\ \cline{29} 406 {\tt c} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 1} \\ \cline{29} 407 {\tt c*2} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} \\ \cline{29} 408 {\tt b} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 0} & {\tt 0} \\ \cline{29} 409 {\tt i} & {\tt 0} & {\tt 1} & {\tt 1} & {\tt 1} & {\tt 0} & {\tt 0} & {\tt 1} & {\tt 0} \\ \cline{29} 410 {\tt Z} & {\tt 3B} & {\tt 44} & {\tt 0} & {\tt 0} & {\tt 1F} & {\tt 5B} & {\tt 39} & {\tt 27} \\ \cline{29} 411 \end{tabular} 412 \end{center} 413 \caption{Long Stream Addition}\label{fig:longadd} 414 \end{figure} 415 416 Figure \ref{fig:longadd} illustrates the process. In the figure, 417 we illustrate the process with 8bit fields rather than 64bit fields 418 and show all field values in hexadecimal notation. Note that 419 two of the individual 8bit additions produce carries, while two 420 others produce {\tt FF} values that generate bubble bits. The 421 net result is that four of the original 8bit sums must be 422 incremented to produce the long stream result. 423 424 A slight extension to the process produces a longstream full adder 425 that can be used in chained addition. In this case, the 426 adder must take an additional carryin bit 427 {\tt p} and produce a carryout bit {\tt q}. 428 This may be accomplished by incorporating {\tt p} 429 in calculating the increment mask in the low bit position, 430 and then extracting the carryout {\tt q} from the high bit position. 431 \[\text{\tt i} = \text{\tt MatchStar(c*2+p, b)}\] 432 \[\text{\tt q} = \text{\tt i >> f}\] 433 \raggedbottom 390 434 \section{Analytical Comparison with DFA and NFA Implementations}\label{sec:analysis} 391 435 … … 411 455 \begin{figure*}[tbh] 412 456 \begin{center} 413 \begin{ code}414 static IDISA_ALWAYS_INLINEvoid add_ci_co(bitblock256_t x, bitblock256_t y, carry_t carry_in, carry_t & carry_out, bitblock256_t & sum) {457 \begin{verbatim} 458 void add_ci_co(bitblock256_t x, bitblock256_t y, carry_t carry_in, carry_t & carry_out, bitblock256_t & sum) { 415 459 bitblock256_t all_ones = simd256<1>::constant<1>(); 416 460 bitblock256_t gen = simd_and(x, y); … … 428 472 } 429 473 430 \end{ code}474 \end{verbatim} 431 475 432 476 \end{center}
Note: See TracChangeset
for help on using the changeset viewer.