Changeset 224 for docs

Dec 1, 2008, 8:50:35 AM (11 years ago)

Population count section; initiate SWAR and ISIDA terms

1 edited


  • docs/ASPLOS09/asplos094-cameron.tex

    r223 r224  
    3 %               Template for LaTeX Class/Style File
     3% asplos094-cameron.tex
     4% Robert D. Cameron and Dan Lin
    5 % Name:         sigplanconf-template.tex
    6 % Purpose:      A template for sigplanconf.cls, which is a LaTeX 2e class
    7 %               file for SIGPLAN conference proceedings.
    8 %
    9 % Author:       Paul C. Anagnostopoulos
    10 %               Windfall Software
    11 %               978 371-2316
    12 %     
    13 %
    14 % Created:      15 February 2005
     6% Based on sigplanconf-template.tex (2005-02-15), by Paul C. Anagnostopoulos
    3325\preprintfooter{short description of paper}   % 'preprint' option specified.
    35 \title{Architectural Support for SIMD Text Processing with Parallel Bit
     27\title{Architectural Support for SWAR Text Processing with Parallel Bit
    3628Streams: The Inductive Doubling Principle}
    3729%\subtitle{Subtitle Text, if any}
    3931\authorinfo{Robert D. Cameron \and Dan Lin}
    4032           {School of Computing Science, Simon Fraser University}
    41            {\{cameron, lindanl\}}
     33           {\{cameron, lindanl\}}
    47 Parallel bit stream algorithms exploit the SIMD capabilities of commodity
     39Parallel bit stream algorithms exploit the SWAR (SIMD within a
     40register) capabilities of commodity
    4841processors in high-performance text processing applications such as
    4942UTF-8 to UTF-16 transcoding, XML parsing, string search and regular expression
    5043matching.   Direct architectural support for these algorithms in future SIMD
    5144instruction sets could further increase performance as well as simplifying the
    52 programming task. A set of simple SIMD instruction set extensions are proposed
     45programming task. A set of simple SWAR instruction set extensions are proposed
    5346for this purpose based on the principle of systematic support for  inductive
    5447doubling as an algorithmic technique.  These extensions are shown to
    7669In the landscape of parallel computing research, finding ways to
    77 exploit intrachip (multicore) and intraregister (SIMD) parallelism
     70exploit intrachip (multicore) and intraregister (SWAR) parallelism
    7871for text processing and other non-numeric applications is particularly
    7972challenging.  Indeed, in documenting this landscape, a widely cited Berkeley
    422 int pop1(unsigned x) {
    423    x = x - ((x >> 1) & 0x55555555);
    424    x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
    425    x = (x + (x >> 4)) & 0x0F0F0F0F;
    426    x = x + (x >> 8);
    427    x = x + (x >> 16);
    428    return x & 0x0000003F;
    429 }
     415   c = (x & 0x55555555) + ((x >> 1) & 0x55555555);
     416   c = (c & 0x33333333) + ((c >> 2) & 0x33333333);
     417   c = (c & 0x0F0F0F0F) + ((c >> 4) & 0x0F0F0F0F);
     418   c = (c & 0x00FF00FF) + ((c >> 8) & 0x00FF00FF);
     419   c = (c & 0x0000FFFF) + ((c >>16) & 0x0000FFFF);
    439 int pop4x32(BitBlock x) {
    440429   c = simd<2>::add<l,h>(x, x);
    441430   c = simd<4>::add<l,h>(c, c);
    443432   c = simd<16>::add<l,h>(c, c);
    444433   c = simd<32>::add<l,h>(c, c);
    445    return c;
    446 }
    453 Counting the number of 1 bits in particular words or fields is known as
    454 population count.  It is important enough for such operations as
     440As an initial example to illustrate the principle of inductive doubling
     441in practice, consider the problem of {\em population count}: determining
     442the number of one bits within a particular bit field.  It is important
     443enough for such operations as
    455444calculating Hamming distance to be included as a built-in instruction
    456445on some processors.  For example, the SPU of the Cell Broadband Engine
    457 has a SIMD population count instruction \verb:si_cntb: for simultaneously determining the
     446has a SIMD population count instruction \verb:si_cntb: for simultaneously
     447determining the
    458448number of 1 bits within each byte of a 16-byte register.
    460449In text processing with parallel bit streams, population count has direct
    461450application to keeping track of line numbers for error reporting, for example.
    462 In the case of SSE or Altivec implementations with 128-bit SIMD registers,
    463 blocks of characters are processed 128 at a time.  To keep track
    464 of line numbers a bitblock of 128 bits identifying the positions of newline
    465 characters is formed and loaded into a register.  When processing for the
    466 block is finished, the line count may be efficiently updated by adding the
    467 population count of the register, rather than examining each position
    468 sequentially and selectively incrementing.
    470 Figure \ref{HD-pop} shows an optimized divide-and-conquer
    471 population count implementation requiring 21 instructions
    472 for 32-bit integers, as documented by Warren \cite{HackersDelight}.
    473 It effectively implements an inductive doubling strategy with shifts
    474 and masking.  Adaptation of the code for SIMD computation of
    475 4 simultaneous population counts of 32-bit fields within
    476 128 bit registers is straightforward.
    478 However, Figure \ref{inductivepopcount} shows a SIMD implementation
    479 using a mere 5 instructions based on the inductive doubling
    480 instruction set architecture proposed herein.  The \verb#simd<2>::add<l,h># operation
    481 extracts the individual bits of each 2-bit
    482 field and adds them together to initialize the count register
    483 \verb:c: with 64 2-bit counts.  The operation \verb#simd<4>::add<l,h># is
    484 then applied to sum pairs of 2-bit counts to produce 32 4-bit sums.
    485 In three more operations, the required population counts within 4 32-bit fields
    486 are computed.
    488 This initial example illustrates the elegance and efficiency of
    489 the direct support for inductive doubling algorithms with the proposed
    490 instruction set architecture.   Note that the details of masking and shifting
    491 are completely subsumed in the architectural support.
     451Given a bit block identifying the positions of newline characters within
     452a block of characters being processed, the population count of the
     453bit block can be used to efficiently and conveniently be used to update
     454the line number upon completion of block processing.
     456Figure \ref{HD-pop} presents a traditional divide-and-conquer
     457implementation for a 32-bit integer {\tt x} adapted from
     458Warren \cite{HackersDelight}, while
     459Figure \ref{inductivepopcount} shows the corresponding SWAR
     460implementation for a vector of 32-bit fields using the inductive doubling
     461instruction set architecture.  Each implementation employs
     462five steps of inductive doubling to produce population counts
     463within 32 bit fields.  The traditional implementation employs
     464explicit masking and shifting operations, while these
     465operations are implicit within the semantics of the inductive
     466doubling instructions used in Figure \ref{inductivepopcount}.
     467In each implementation, the first step determines the
     468the population counts within 2-bit fields
     469by adding the high bit of each such field to the low bit
     470to produce a set of 2-bit counts in {\tt c}.
     471In the second step, the counts within 4-bit fields of {\tt c} are determined
     472by adding the counts of the corresponding high and low 2-bit subfields.
     473Continuing in this fashion,
     474the final population counts within 32-bit fields are determined in five steps.
     476With the substitution of longer mask constants replicated for four
     47732-bit fields, the implementation of Figure \ref{HD-pop} can be
     478easily adapted to SWAR processing using 128-bit registers.
     479Such an implementation requires 10 operations to load or generate
     480mask constants, 10 bitwise-and operations, 5 shifts and 5 adds for a
     481total of 30 operations to complete the task in comparison
     482to a mere   However, Warren further refines the
     483implementation to an optimized version requiring only 20 operations,
     4845 of which are required to generate mask constants.  At the cost
     485of register pressure, it is possible that these constants
     486could be kept preloaded.  In any case, the IDISA implementation
     487offers a 3X to 4X improvement in instruction count as well as
     488a reduction in register pressure.
Note: See TracChangeset for help on using the changeset viewer.