source: icGREP/icgrep-devel/icgrep/kernels/pdep_kernel.h @ 5697

Last change on this file since 5697 was 5627, checked in by cameron, 2 years ago

PDEP kernels from Adam with pdep_width_less_1 fix

File size: 5.6 KB
2 *  Copyright (c) 2017 International Characters.
3 *  This software is licensed to the public under the Open Software License 3.0.
4 */
5#ifndef PDEP_KERNEL_H
6#define PDEP_KERNEL_H
8#include "kernel.h"
9#include <llvm/IR/Value.h>
10namespace IDISA { class IDISA_Builder; }
12What this kernel does:
14Given a swizzled input stream set and a PDEP marker stream, apply a PDEP operation to each of the input streams in
15the input stream set. The PDEPed result streams are returned in a swizzled output stream set.
17The length of the input stream set (in bits) must be greater than or equal to the total popcount of the PDEP marker
18stream, otherwise the PDEP operation will run out of source bits before the entire PDEP stream has been processed.
20How it works:
22You should know how the PDEP operation works before continuing (Wikipedia has a pretty good explanation.)
24The swizzled configuration of the input streams mean that the first blockWidth/mSwizzleFactor bits of each (unswizzled) input
25stream are contained in the first BitBlock of the first input StreamSetBlock. The second BitBlock contains the next
26blockWidth/mSwizzleFactor bits for each input stream, and so on. The key observation underpinning the action of the PDEP kernel is that we apply the PDEP operation
27using blockWidth/mSwizzleFactor bits of an input stream as the source bits. Since the first BitBlock (i.e. swizzle) contains blockWidth/mSwizzleFactor
28bits from each of the input streams, we can begin processing the input streams in the input stream set by applying the first blockWidth/mSwizzleFactor
29bits of the PDEP marker stream to each of the swizzle fields in the first BitBlock.
31We continue using the first blockWidth/mSwizzleFactor bits of each input stream until we have completely consumed them. This occurs
32when the combined popcount of the PDEP masks we've used up to this point > blockWidth/mSwizzleFactor. Once we've exhausted the first
33BitBlock (i.e. swizzle), we move on to the next one. This pattern continues until we've consumed
34the entire PDEP marker stream. Note that it's possible for the kernel to consume the entire PDEP marker
35stream without consuming the entirety of the first BitBlock in the first BitStreamBlock, if the PDEP marker stream has a low popcount
36(i.e. > blockWidth/mSwizzleFactor).
38There is actually a slight complication that was glossed over in the description above. Consider the following scenario: we've consumed
39half of a blockWidth/mSwizzleFactor segment, and we're now starting the PDEP loop again. However, this time the PDEP marker stream segment is
400xffffffff. That is, the popcount is 64. That means we'll consume 64 bits from the source bit stream, but the current segment only contains 64/2 =
4132 bits. To get around this issue, we "look ahead" to the next segment, whether that next segment is the next BitBlock in the current StreamSetBlock
42or the first BitBlock in the next StreamSetBlock. Regardless of where we find the segment, we combine the current segment and the next segement in
43such a way that we're guarenteed to have 64 source bits to pass to the PDEP operation. The logic responsible for creating this "combined" value
44can be found immediately after the opening brace of the outer for loop in the definition of the generateDoBlockMethod function:
46Value * current_blk_idx = kb->CreateSub(kb->CreateUDiv(updatedProcessedBits, blockWidth), base_block_idx);  // blk index == stream set block index
48// kb->CreateUDiv(updatedProcessedBits, blockWidth) gives us the absolute block idx of the current block.
49// However, getAdjustedStreamBlockPtr (used later) calculates an offseted block for us based on the processed item count
50// of sourceStreamSet. We want to get the index of the block we're currently processing relative to the
51// "base block" calculated by getAdjustedInputStreamPtr. That's why we subtract base_block_idx from
52// kb->CreateUDiv(updatedProcessedBits, blockWidth)
54Value * current_swizzle_idx = kb->CreateUDiv(kb->CreateURem(updatedProcessedBits, blockWidth), pdepWidth);
56// updatedProcessedBits % blockWidth is how many bits of the current block we've processed.
57// Divide that by pdepWidth to get which BitBlock/swizzle we're currently processing.
59Value * next_block_idx = kb->CreateSub(kb->CreateUDiv(kb->CreateAdd(pdepWidth, updatedProcessedBits), blockWidth), base_block_idx);
60Value * next_swizzle_idx = kb->CreateUDiv(kb->CreateURem(kb->CreateAdd(pdepWidth, updatedProcessedBits), blockWidth), pdepWidth);
61// Q: Why add pdepWidth (i.e. 64) and not 256?
62// A: Although it is true that each BitBlock/swizzle contains 256 bits, each swizzle only contains 64 bits from each of the streams
63// it is composed of. Each 64 bit field is consumed in parallel, at the same rate. Therefore, once we've consumed "64 bits"
64// we've actually consumed 64*4 bits, and it's time to move to the next one.
67namespace kernel {
68class PDEPkernel : public MultiBlockKernel {
70    PDEPkernel(const std::unique_ptr<kernel::KernelBuilder> & kb, unsigned streamCount, unsigned swizzleFactor, unsigned PDEP_width = 64);
71    bool isCachable() const override { return true; }
72    bool hasSignature() const override { return false; }
74    const unsigned mSwizzleFactor;
75    const unsigned mPDEPWidth;
76    void generateMultiBlockLogic(const std::unique_ptr<KernelBuilder> & kb) override;
77    std::vector<llvm::Value *> get_PDEP_masks(const std::unique_ptr<KernelBuilder> & kb, llvm::Value * PDEP_ms_blk,
78                                              const unsigned mask_width);
79    std::vector<llvm::Value *> get_block_popcounts(const std::unique_ptr<KernelBuilder> & kb, llvm::Value * blk,
80                                                   const unsigned field_width);
Note: See TracBrowser for help on using the repository browser.