Changeset 224
 Timestamp:
 Dec 1, 2008, 8:50:35 AM (11 years ago)
 File:

 1 edited
Legend:
 Unmodified
 Added
 Removed

docs/ASPLOS09/asplos094cameron.tex
r223 r224 1 1 % 2 2 % 3 % Template for LaTeX Class/Style File 3 % asplos094cameron.tex 4 % Robert D. Cameron and Dan Lin 4 5 % 5 % Name: sigplanconftemplate.tex 6 % Purpose: A template for sigplanconf.cls, which is a LaTeX 2e class 7 % file for SIGPLAN conference proceedings. 8 % 9 % Author: Paul C. Anagnostopoulos 10 % Windfall Software 11 % 978 3712316 12 % paul@windfall.com 13 % 14 % Created: 15 February 2005 6 % Based on sigplanconftemplate.tex (20050215), by Paul C. Anagnostopoulos 15 7 % 16 8 % … … 33 25 \preprintfooter{short description of paper} % 'preprint' option specified. 34 26 35 \title{Architectural Support for S IMDText Processing with Parallel Bit27 \title{Architectural Support for SWAR Text Processing with Parallel Bit 36 28 Streams: The Inductive Doubling Principle} 37 29 %\subtitle{Subtitle Text, if any} … … 39 31 \authorinfo{Robert D. Cameron \and Dan Lin} 40 32 {School of Computing Science, Simon Fraser University} 41 {\{cameron, lindanl\}@ sfu.ca}33 {\{cameron, lindanl\}@cs.sfu.ca} 42 34 43 35 … … 45 37 46 38 \begin{abstract} 47 Parallel bit stream algorithms exploit the SIMD capabilities of commodity 39 Parallel bit stream algorithms exploit the SWAR (SIMD within a 40 register) capabilities of commodity 48 41 processors in highperformance text processing applications such as 49 42 UTF8 to UTF16 transcoding, XML parsing, string search and regular expression 50 43 matching. Direct architectural support for these algorithms in future SIMD 51 44 instruction sets could further increase performance as well as simplifying the 52 programming task. A set of simple S IMDinstruction set extensions are proposed45 programming task. A set of simple SWAR instruction set extensions are proposed 53 46 for this purpose based on the principle of systematic support for inductive 54 47 doubling as an algorithmic technique. These extensions are shown to … … 75 68 76 69 In the landscape of parallel computing research, finding ways to 77 exploit intrachip (multicore) and intraregister (S IMD) parallelism70 exploit intrachip (multicore) and intraregister (SWAR) parallelism 78 71 for text processing and other nonnumeric applications is particularly 79 72 challenging. Indeed, in documenting this landscape, a widely cited Berkeley … … 420 413 \begin{center} 421 414 \begin{verbatim} 422 int pop1(unsigned x) { 423 x = x  ((x >> 1) & 0x55555555); 424 x = (x & 0x33333333) + ((x >> 2) & 0x33333333); 425 x = (x + (x >> 4)) & 0x0F0F0F0F; 426 x = x + (x >> 8); 427 x = x + (x >> 16); 428 return x & 0x0000003F; 429 } 415 c = (x & 0x55555555) + ((x >> 1) & 0x55555555); 416 c = (c & 0x33333333) + ((c >> 2) & 0x33333333); 417 c = (c & 0x0F0F0F0F) + ((c >> 4) & 0x0F0F0F0F); 418 c = (c & 0x00FF00FF) + ((c >> 8) & 0x00FF00FF); 419 c = (c & 0x0000FFFF) + ((c >>16) & 0x0000FFFF); 430 420 \end{verbatim} 431 421 \end{center} … … 437 427 \begin{center} 438 428 \begin{verbatim} 439 int pop4x32(BitBlock x) {440 429 c = simd<2>::add<l,h>(x, x); 441 430 c = simd<4>::add<l,h>(c, c); … … 443 432 c = simd<16>::add<l,h>(c, c); 444 433 c = simd<32>::add<l,h>(c, c); 445 return c;446 }447 434 \end{verbatim} 448 435 \end{center} … … 451 438 \end{figure} 452 439 453 Counting the number of 1 bits in particular words or fields is known as 454 population count. It is important enough for such operations as 440 As an initial example to illustrate the principle of inductive doubling 441 in practice, consider the problem of {\em population count}: determining 442 the number of one bits within a particular bit field. It is important 443 enough for such operations as 455 444 calculating Hamming distance to be included as a builtin instruction 456 445 on some processors. For example, the SPU of the Cell Broadband Engine 457 has a SIMD population count instruction \verb:si_cntb: for simultaneously determining the 446 has a SIMD population count instruction \verb:si_cntb: for simultaneously 447 determining the 458 448 number of 1 bits within each byte of a 16byte register. 459 460 449 In text processing with parallel bit streams, population count has direct 461 450 application to keeping track of line numbers for error reporting, for example. 462 In the case of SSE or Altivec implementations with 128bit SIMD registers, 463 blocks of characters are processed 128 at a time. To keep track 464 of line numbers a bitblock of 128 bits identifying the positions of newline 465 characters is formed and loaded into a register. When processing for the 466 block is finished, the line count may be efficiently updated by adding the 467 population count of the register, rather than examining each position 468 sequentially and selectively incrementing. 469 470 Figure \ref{HDpop} shows an optimized divideandconquer 471 population count implementation requiring 21 instructions 472 for 32bit integers, as documented by Warren \cite{HackersDelight}. 473 It effectively implements an inductive doubling strategy with shifts 474 and masking. Adaptation of the code for SIMD computation of 475 4 simultaneous population counts of 32bit fields within 476 128 bit registers is straightforward. 477 478 However, Figure \ref{inductivepopcount} shows a SIMD implementation 479 using a mere 5 instructions based on the inductive doubling 480 instruction set architecture proposed herein. The \verb#simd<2>::add<l,h># operation 481 extracts the individual bits of each 2bit 482 field and adds them together to initialize the count register 483 \verb:c: with 64 2bit counts. The operation \verb#simd<4>::add<l,h># is 484 then applied to sum pairs of 2bit counts to produce 32 4bit sums. 485 In three more operations, the required population counts within 4 32bit fields 486 are computed. 487 488 This initial example illustrates the elegance and efficiency of 489 the direct support for inductive doubling algorithms with the proposed 490 instruction set architecture. Note that the details of masking and shifting 491 are completely subsumed in the architectural support. 451 Given a bit block identifying the positions of newline characters within 452 a block of characters being processed, the population count of the 453 bit block can be used to efficiently and conveniently be used to update 454 the line number upon completion of block processing. 455 456 Figure \ref{HDpop} presents a traditional divideandconquer 457 implementation for a 32bit integer {\tt x} adapted from 458 Warren \cite{HackersDelight}, while 459 Figure \ref{inductivepopcount} shows the corresponding SWAR 460 implementation for a vector of 32bit fields using the inductive doubling 461 instruction set architecture. Each implementation employs 462 five steps of inductive doubling to produce population counts 463 within 32 bit fields. The traditional implementation employs 464 explicit masking and shifting operations, while these 465 operations are implicit within the semantics of the inductive 466 doubling instructions used in Figure \ref{inductivepopcount}. 467 In each implementation, the first step determines the 468 the population counts within 2bit fields 469 by adding the high bit of each such field to the low bit 470 to produce a set of 2bit counts in {\tt c}. 471 In the second step, the counts within 4bit fields of {\tt c} are determined 472 by adding the counts of the corresponding high and low 2bit subfields. 473 Continuing in this fashion, 474 the final population counts within 32bit fields are determined in five steps. 475 476 With the substitution of longer mask constants replicated for four 477 32bit fields, the implementation of Figure \ref{HDpop} can be 478 easily adapted to SWAR processing using 128bit registers. 479 Such an implementation requires 10 operations to load or generate 480 mask constants, 10 bitwiseand operations, 5 shifts and 5 adds for a 481 total of 30 operations to complete the task in comparison 482 to a mere However, Warren further refines the 483 implementation to an optimized version requiring only 20 operations, 484 5 of which are required to generate mask constants. At the cost 485 of register pressure, it is possible that these constants 486 could be kept preloaded. In any case, the IDISA implementation 487 offers a 3X to 4X improvement in instruction count as well as 488 a reduction in register pressure. 489 490 492 491 493 492
Note: See TracChangeset
for help on using the changeset viewer.