wiki:CarryQueue

Carry Queue

A Carry Queue is queue of N carry variables held within a single integer value of at least N bits.

The main goal of the Carry Queue concept is to minimize the cost of recording carries for a set of variables when compiling unbounded bit stream processing code into block-by-block low-level implementation code. Rather than one separate variable that is saved/restored per carry variable, the carry queue only requires the save and restore of a single integer.

The second advantage of the carry queue concept is to optimize loop control for while loops over unbounded bit streams. If all the carry variables required within the loop are held in the same carry queue, then the loop condition requiring a logical-or combination of these variables can be easily implemented as a mask selection from the carry queue (or just using the carry queue value if no other variables than loop variables are stored within this carry queue).

The third advantage of the carry queue is that a single ADC can be used to both store one carry value and load the next value, replacing a LAHF/SAHF pair of operations.

The technique is this: Let accum be the carry queue variable that we want to hold all the carries. It is initialized to 0. When a carry value is needed from a previous iteration, we arrange that the carry bit is at the high end of the carry accumulator. Then the operation accum = adc(accum, accum) loads the value of this bit into the carry flag and also shifts the remaining values in the carry queue to the left one position, making the next carry variable ready for loading into the carry flag.

However, the adc operation above also stores the current value of the carry flag into the low end of the carry queue at the same time. In essence, we have a combine enqueue/dequeue operation.

This FIFO processing of carry variables requires that each iteration of block-by-block bitstream processing performs the same number of enqueue/dequeue operations in the same order.

If the number of carry variables N is less than the register width R, then a shift operation accum <<= R-N is required between each loop iteration to position the carry variables from the just completed iteration for the next iteration.


A note on implementation

The implementation of carry queue has two steps. First we need to modify the current C macro for addition. We need to pick a register the contains the carry queue and use adc on that register rather than lahf, sahf operations. This will affect the template program not the compiler.

Second step is in the compiler, where we need to take care of two things. First, when there are more than 64 carries, we will have more than one carry queue. The compiler is responsible to make sure that the right carry queue is loaded at the right time. The second thing that the compiler should take care of is a shift in carry queue that might be needed to be done at the end of processing of each block.

The code for handling this issue in the compiler might have dependency on the target language. For example in C we may need to call a macro or function that has inline assembly while in assembly we may do something slightly different. Also it has minor interaction with Factor Out pass of the compiler. If a code for loading/storing carry queues is needed to be put inside an if-then-else statement, this code must be put inside both branches of the statement. So one may prefer to put this pass after the Factor Out pass so that workload of Factor Out does not become larger, but this is not necessary.

Carry Queue and While Loop Unrolling

These new notes (August 18, 2010) elaborate an optimized implementation for while loops, involving a single unrolling.

For each block of data, there may be carries for variables within the while loop. Thus, if there are any such carries, the loop body must be executed at least once, even if the controlling bitstream is all zero for the block.

However, once the loop is executed, the input carries have all been accounted for. Subsequent iterations should use input carry values of zero for all add with carry operations. Output carries for each iteration must still be accumulated and or'd together for each iteration. However, the zeroing of input carries means that add-with-carry operations can be replaced by normal add operations for the second and subsequent operations. These are faster. Furthermore, the carry queue is needed for accumulating output carries only.

This suggests that it may be beneficial to unroll the loop in a way such that the first iteration only uses add-with-carry operation and subsequent iterations use the straight add operation.

Enqueue/Dequeue?

The enqueue/dequeue technique is difficult to implement in C-based code, because there is no way to preserve carry flag values through C code. But the following approach may work with a peephole optimization strategy.

Consider two unbounded stream additions with intervering code:

  v = t + u
  w = ...
  z = x + y

A dequeue/enqueue inbetween the pair would be ideal, provided that the dequeued carry is preserved across the calculation of w.

  v = adc(t, u)
  carry_queue = adc(carry_queue, carry_queue) /*enqueue/dequeue*/
  w = ...
  z = adc(x,y)

But the intervening code is almost certain to clobber the carry needed for the calculation of z.

Last modified 9 years ago Last modified on Aug 18, 2010, 7:59:55 AM