Index: /docs/ASPLOS09/asplos094-cameron.tex
===================================================================
--- /docs/ASPLOS09/asplos094-cameron.tex (revision 243)
+++ /docs/ASPLOS09/asplos094-cameron.tex (revision 244)
@@ -1424,6 +1424,8 @@
Only one of these values may be 1; both are 0 if
no modifier is specified.
-In addition, the SIDU supplies decoded field width information
-to both the SOFU and to the SIMD Instruction Execute Unit (SIEU).
+The SIDU also supplies decoded field width signals $w_k$,
+one for each field width $2^k$, to both the SOFU and the
+SIMD Instruction Execute Unit (SIEU). Exactly one of the
+field width signals has the value 1.
The SIDU also supplies decoded SIMD opcode information to SIEU and
a decoded $A$-bit register address for the destination register to
@@ -1448,16 +1450,8 @@
$n=2^k$ for $0 \leq k \leq K$. This involves
additional circuitry for field widths not supported
-in existing processors. For inductive doubling
-algorithms in support of parallel bit streams,
-the principal need is for additional circuitry to
-support 2-bit and 4-bit field widths. This circuity
-is generally less complicated than that for larger
-fields. Support for circuitry at these width
-has other applications as well. For example,
-DNA sequences are frequently represented using
-packed sequences of 2-bit codes for the four possible
-nucleotides\cite{}, while the need for accurate financial
-calculation has seen a resurgence of the 4-bit
-packed BCD format for decimal floating point \cite{}.
+in existing processors. In our evaluation model,
+IDISA-A adds support for 2-bit, 4-bit and 128-bit
+field widths in comparison with the RefA architecture,
+while IDISA-B similarly extends RefB.
When execution of the SWAR instruction is
@@ -1469,5 +1463,47 @@
\subsection{Operand Fetch Unit Logic}
-Discussion of gate-level implementation.
+The SOFU is responsible for implementing the half-operand
+modification logic for each of up to two input operands fetched
+from SRF. For each operand, this logic is implemented
+using the decoded half-operand modifier signals $h$ and $l$,
+the decoded field width signals $w_k$ and the 128-bit operand
+value $r$ fetched from SRF to produce a modified 128-bit operand
+value $s$ following the requirements of equations (4), (5) and
+(6) above. Those equations must be applied for each possible
+modifier and each field width to determine the possible values $s[i]$
+for each bit position $i$. For example, consider bit
+position 41, whose binary 7-bit address is $0101001$.
+Considering the address bits left to right, each 1 bit
+corresponds to a field width for which this bit lies in the
+low $n/2$ bits (widths 2, 16, 64), while each 0 bit corresponds to a field
+width for which this bit lies in the high $n/2$ bits (widths 4, 8, 32, 128).
+When the high half-operand modifier signal $h$ is 1,
+this bit receives the value of the corresponding high-half bit
+of its field of width 2, 16 or 64, whose selected address bit is 0, namely $r[40]$,
+$r[33]$ or $r[9]$, respectively. Otherwise, this bit receives the value $r[41]$
+when no half-operand modifier is specified, or when a low half-operand modifier is combined
+with a field width signal $w_2$, $w_{16}$ or $w_{64}$ (subscripted here by field width).
+The overall logic for determining this bit value is thus given as follows.
+\begin{eqnarray*}
+s[41] & = & h \wedge (w_2 \wedge r[40] \vee w_{16} \wedge r[33] \vee w_{64} \wedge r[9]) \\
+& & \vee \neg h \wedge (\neg l \vee w_2 \vee w_{16} \vee w_{64}) \wedge r[41]
+\end{eqnarray*}
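+
+As a sketch of how this example generalizes (the set $N(i)$ is notation
+introduced here for illustration only), let $N(i)$ denote the set of field widths $n$
+for which the address bit of weight $n/2$ in position $i$ is 1, so that
+$N(41) = \{2, 16, 64\}$. Assuming every position follows the same pattern
+as bit 41, the logic would take the following form.
+\begin{eqnarray*}
+s[i] & = & h \wedge \bigvee_{n \in N(i)} (w_n \wedge r[i - n/2]) \\
+& & \vee \neg h \wedge (\neg l \vee \bigvee_{n \in N(i)} w_n) \wedge r[i]
+\end{eqnarray*}
+Substituting $i = 41$ gives $r[41-1]$, $r[41-8]$ and $r[41-32]$, that is, $r[40]$, $r[33]$ and $r[9]$, as above.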
+
+Similar logic is determined for each of the 128 bit positions.
+For each of the 7 field widths, 64 bits are in the low $n/2$ bits,
+resulting in 448 2-input and gates for the $w_k \wedge r[i]$ terms.
+For the 120 bit positions that have two or more such terms, or gates are needed to
+combine them; these positions account for 441 of the 448 terms, so $441 - 120 = 321$ 2-input or gates are required. Another
+127 2-input and gates combine these values with the $h$ signal.
+In the case of a low half-operand modifier, the or gates combining $w_k$
+signals can share circuitry. For each bit position $i=2^k+j$ with $0 \leq j < 2^k$, one
+additional or gate is required beyond that for position $j$.
+Thus 127 2-input or gates are required. Another 256 2-input and gates
+are required for combination with the $\neg h$ and $r[i]$ terms. The terms for
+the low and high half-operand modifiers are then combined with an
+additional 127 2-input or gates. Thus, the circuit complexity
+for the combinational logic implementation of half-operand
+modifiers within the SOFU is $448 + 321 + 127 + 127 + 256 + 127 = 1406$ 2-input gates per operand,
+or 2812 gates in total.
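+
+Two boundary cases may help in checking these counts; the equations below are
+a sketch that assumes the pattern of the $s[41]$ example applies to every position.
+Position 0 (address $0000000$) lies in the high $n/2$ bits for every field width,
+so it has no $w_k \wedge r[i]$ term, while position 8 (address $0001000$) has exactly one, for width 16.
+\begin{eqnarray*}
+s[0] & = & \neg h \wedge \neg l \wedge r[0] \\
+s[8] & = & h \wedge w_{16} \wedge r[0] \vee \neg h \wedge (\neg l \vee w_{16}) \wedge r[8]
+\end{eqnarray*}
+Position 0 is therefore the one position excluded from each count of 127, and the
+seven positions whose address contains a single 1 bit need no or gate to combine
+their $w_k \wedge r[i]$ terms, which leaves the 120 positions noted above.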