Version 4 (modified by cameron, 3 years ago) (diff)

--

## Parallel Deletion

### Basic Problem

Given a stream of data elements and a corresponding deletion mask bit stream, the deletion problem is to compute a compressed bit stream consisting of only the nondeleted elements.

For example, consider the following bit stream and deletion mask:

```abcdefghijklmnopqrstuvwxyz
...1....11..1.1111.1......
```

The compressed bit stream resulting from deletion of positions marked by 1 bits is the following:

```abcefghklnsuvwxyz
```

### Problem Variations

1. Apply a single deletion mask to a single bit stream.
2. Apply a single deletion mask to each individual bit stream in a set of N parallel bit streams.
3. Apply a deletion mask to a stream of larger units, for example a stream of bytes.

In the event that N bit streams are processed, it may be possible to reduce the overall cost by sharing preprocessing overhead.

## Parallel Deletion Methods

### Intra-Block Deletion vs. Block-by-Block Processing

When implementing parallel deletion operations on a data stream, we first perform deletions within each block (intra-block parallel deletion) and then arrange to join together the results of adjacent blocks.

For example, working with 128-bit SSE registers, we may delete 7 positions in block A, and 23 positions in block B. Given these results, we typically then want to join the partial blocks together to form a continuous data stream.

One method is to always maintain a partial block of pending results. When a new partial block is produced, we transfer as many positions as possible into the pending block. If we fill it, then we emit the full block and move the remaining bits from the new block to become the new pending block.

### Intra-Field Deletion

When the ultimate goal is to produce streams of bytes, doublebytes, or other units, then it is often beneficial to limit the bit stream deletion work to smaller field sizes. For example, suppose that a stream of doublebytes (16-bit units) is to be produced after application of parallel deletion to a the corresponding 16 parallel bit streams. Working with 128-bit registers, each register will ultimately hold at most 8 doublebytes. Then we may proceed as follows:

1. Perform parallel deletion within 8-bit fields, so that nondeleted bits are leftmostin each such field.
2. Transpose to double byte form, producing a set of 16 register values in which the nondeleted double bytes are leftmost in each register.
3. Sequentially write the register values to the output stream, advancing the output stream pointer by the count of nondeleted elements each time.

### Intra-Block Parallel Deletion

There are several different methods that can be used to implement parallel deletion within SIMD registers. Depending on the problem variation, it is often beneficial to combine methods.

1. Parallel prefix deletion.
2. SIMD deletion by left-result induction.
3. SIMD deletion by central result induction.
4. Deletion using PEXT bit manipulation instructions.
5. Deletion by permutation vector computation.