Feb 13, 2018, 5:05:51 PM (14 months ago)

Modified PDEP kernel

1 edited


  • icGREP/icgrep-devel/icgrep/kernels/pdep_kernel.h

    r5857 r5870  
    99#include <llvm/IR/Value.h>
    1010#include <string>
    11 namespace IDISA { class IDISA_Builder; }
    13 What this kernel does:
    15 Given a swizzled input stream set and a PDEP marker stream, apply a PDEP operation to each of the input streams in
    16 the input stream set. The PDEPed result streams are returned in a swizzled output stream set.
     14Conceptually, given an unbounded input stream set of k streams and a marker stream, this kernel uses the
     15Parallel Bits Deposit (PDEP) instruction to copy the input items from the i-th input stream to the i-th
     16output stream the positions indicated by the marker bits. All other output items are set to zero. E.g.,
    18 The length of the input stream set (in bits) must be greater than or equal to the total popcount of the PDEP marker
    19 stream, otherwise the PDEP operation will run out of source bits before the entire PDEP stream has been processed.
     18 SOURCE >  abcdefgh i0000000 00000000 00000000
     19 MARKER >  ...1.1.1 .....11. ..1...1. ...1.1..
     20 OUTPUT >  ...a.b.c .....de. ..f...g. ...h.i..
    21 How it works:
     22The complicating factor of this Kernel is that it assumes the input streams are *swizzled*. I.e., it
     23"divides" each block of the marker stream into k elements, M_1 ... M_k, and applies the PDEP operation
     24using M_i to the each of the k elements in the i-th input (swizzled) stream.
    23 You should know how the PDEP operation works before continuing (Wikipedia has a pretty good explanation.)
    25 The swizzled configuration of the input streams mean that the first blockWidth/mSwizzleFactor bits of each (unswizzled) input
    26 stream are contained in the first BitBlock of the first input StreamSetBlock. The second BitBlock contains the next
    27 blockWidth/mSwizzleFactor bits for each input stream, and so on. The key observation underpinning the action of the PDEP kernel is that we apply the PDEP operation
    28 using blockWidth/mSwizzleFactor bits of an input stream as the source bits. Since the first BitBlock (i.e. swizzle) contains blockWidth/mSwizzleFactor
    29 bits from each of the input streams, we can begin processing the input streams in the input stream set by applying the first blockWidth/mSwizzleFactor
    30 bits of the PDEP marker stream to each of the swizzle fields in the first BitBlock.
     28 STREAM 0  abcde...  fg......  hijklm..  nopqrst.     SWIZZLE 0  abcde...  uvwxy...  OPQRS...  89abc...
     29 STREAM 1  uvwxy...  zA......  BCDEFG..  HIJKLMN.     SWIZZLE 1  fg......  zA......  TU......  de......
     30 STREAM 2  OPQRS...  TU......  VWXYZ0..  1234567.     SWIZZLE 2  hijklm..  BCDEFG..  VWXYZ0..  fghijk..
     31 STREAM 3  89abc...  de......  fghijk..  lmnopqr.     SWIZZLE 3  nopqrst.  HIJKLMN.  1234567.  lmnopqr.
    32 We continue using the first blockWidth/mSwizzleFactor bits of each input stream until we have completely consumed them. This occurs
    33 when the combined popcount of the PDEP masks we've used up to this point > blockWidth/mSwizzleFactor. Once we've exhausted the first
    34 BitBlock (i.e. swizzle), we move on to the next one. This pattern continues until we've consumed
    35 the entire PDEP marker stream. Note that it's possible for the kernel to consume the entire PDEP marker
    36 stream without consuming the entirety of the first BitBlock in the first BitStreamBlock, if the PDEP marker stream has a low popcount
    37 (i.e. > blockWidth/mSwizzleFactor).
    39 There is actually a slight complication that was glossed over in the description above. Consider the following scenario: we've consumed
    40 half of a blockWidth/mSwizzleFactor segment, and we're now starting the PDEP loop again. However, this time the PDEP marker stream segment is
    41 0xffffffff. That is, the popcount is 64. That means we'll consume 64 bits from the source bit stream, but the current segment only contains 64/2 =
    42 32 bits. To get around this issue, we "look ahead" to the next segment, whether that next segment is the next BitBlock in the current StreamSetBlock
    43 or the first BitBlock in the next StreamSetBlock. Regardless of where we find the segment, we combine the current segment and the next segement in
    44 such a way that we're guarenteed to have 64 source bits to pass to the PDEP operation. The logic responsible for creating this "combined" value
    45 can be found immediately after the opening brace of the outer for loop in the definition of the generateDoBlockMethod function:
     34NOTE: this kernel does *NOT* unswizzle the output. This will eventually be the responsibility of the
     35pipeline to ensure it is done when needed.
    47 Value * current_blk_idx = kb->CreateSub(kb->CreateUDiv(updatedProcessedBits, blockWidth), base_block_idx);  // blk index == stream set block index
    49 // kb->CreateUDiv(updatedProcessedBits, blockWidth) gives us the absolute block idx of the current block.
    50 // However, getAdjustedStreamBlockPtr (used later) calculates an offseted block for us based on the processed item count
    51 // of sourceStreamSet. We want to get the index of the block we're currently processing relative to the
    52 // "base block" calculated by getAdjustedInputStreamPtr. That's why we subtract base_block_idx from
    53 // kb->CreateUDiv(updatedProcessedBits, blockWidth)
    55 Value * current_swizzle_idx = kb->CreateUDiv(kb->CreateURem(updatedProcessedBits, blockWidth), pdepWidth);
    57 // updatedProcessedBits % blockWidth is how many bits of the current block we've processed.
    58 // Divide that by pdepWidth to get which BitBlock/swizzle we're currently processing.
    60 Value * next_block_idx = kb->CreateSub(kb->CreateUDiv(kb->CreateAdd(pdepWidth, updatedProcessedBits), blockWidth), base_block_idx);
    61 Value * next_swizzle_idx = kb->CreateUDiv(kb->CreateURem(kb->CreateAdd(pdepWidth, updatedProcessedBits), blockWidth), pdepWidth);
    62 // Q: Why add pdepWidth (i.e. 64) and not 256?
    63 // A: Although it is true that each BitBlock/swizzle contains 256 bits, each swizzle only contains 64 bits from each of the streams
    64 // it is composed of. Each 64 bit field is consumed in parallel, at the same rate. Therefore, once we've consumed "64 bits"
    65 // we've actually consumed 64*4 bits, and it's time to move to the next one.
    6839namespace kernel {
    69 class PDEPkernel : public MultiBlockKernel {
     41class PDEPkernel final : public MultiBlockKernel {
    71     PDEPkernel(const std::unique_ptr<kernel::KernelBuilder> & kb, unsigned streamCount, unsigned swizzleFactor, unsigned PDEP_width = 64, std::string name = "PDEPdel");
     43    PDEPkernel(const std::unique_ptr<kernel::KernelBuilder> & b, const unsigned swizzleFactor = 4, std::string name = "PDEP");
    7244    bool isCachable() const override { return true; }
    7345    bool hasSignature() const override { return false; }
     47    void generateMultiBlockLogic(const std::unique_ptr<KernelBuilder> & b, llvm::Value * const numOfStrides) final;
    7549    const unsigned mSwizzleFactor;
    76     const unsigned mPDEPWidth;
    77     void generateMultiBlockLogic(const std::unique_ptr<KernelBuilder> & kb, llvm::Value * const numOfStrides) override;
    78     std::vector<llvm::Value *> get_PDEP_masks(const std::unique_ptr<KernelBuilder> & kb, llvm::Value * PDEP_ms_blk,
    79                                               const unsigned mask_width);
    80     std::vector<llvm::Value *> get_block_popcounts(const std::unique_ptr<KernelBuilder> & kb, llvm::Value * blk,
    81                                                    const unsigned field_width);
Note: See TracChangeset for help on using the changeset viewer.