Index: docs/ASPLOS09/asplos094-cameron.tex
===================================================================
--- docs/ASPLOS09/asplos094-cameron.tex (revision 257)
+++ docs/ASPLOS09/asplos094-cameron.tex (revision 258)
@@ -73,5 +73,5 @@
to parallelize, concluding that nothing seems to help. Indeed,
the study even speculates that applications in this area may simply be
-``embarrasingly sequential,'' easy to tackle for traditional
+``embarrassingly sequential,'' easy to tackle for traditional
sequential processing approaches suitable for uniprocessors,
but perhaps fundamentally unsuited to parallel methods.
@@ -104,5 +104,5 @@
required a mere 15 cycles per byte, while a range of other technologies
required from 25 to 200 cycles per byte \cite{Herdy}.
-Ongoing work is further applying the parallel bit stream methds to
+Ongoing work is further applying the parallel bit stream methods to
parallel hash value computation and parallel regular expression matching
for the purpose of validating XML datatype declarations in
@@ -409,5 +409,5 @@
primary concern for performance improvement of these applications.
-The additional circuity complexity to realize IDISA-A and
+The additional circuit complexity to realize IDISA-A and
IDISA-B designs over their reference models will be
addressed in the penultimate section. That discussion
@@ -663,5 +663,5 @@
by the \verb#simd<16>::pack# with 4 additional operations to
prepare operands. Essentially, the RefB implementation
-uses single permuite instructions to implement the equivalent of
+uses single permute instructions to implement the equivalent of
\verb#simd<16>::pack(s0, s1)# and \verb#simd<16>::pack(s0, s1)#.
The RefA implementation also requires 3 logic operations to implement
@@ -674,10 +674,10 @@
Using IDISA, it is possible to design
a transposition algorithm that is both easier to understand and requires
-many fewer operations than the the bytepack algorithm described above.
+many fewer operations than the the BytePack algorithm described above.
We call it the inductive halving algorithm for serial to parallel
transposition, because it proceeds by reducing byte streams to
two sets of nybble streams in a first stage, dividing the nybble
-streams into streams of bitpairs in a second stage and finally
-dividing the bitpair streams into bit streams in the third stage.
+streams into streams of bit-pairs in a second stage and finally
+dividing the bit-pair streams into bit streams in the third stage.
Figure \ref{halvingstep} shows one step in stage 1 of the inductive
@@ -685,5 +685,5 @@
The \verb#simd<8>::pack# operation extracts the high nybble of each byte
from the input registers, while the \verb#simd<8>::pack# operation extracts
-the low nybble of each byte. As in the bytepack algorithm, this step is
+the low nybble of each byte. As in the BytePack algorithm, this step is
applied 4 times in stage 1, for a total of 8 operations.
@@ -794,5 +794,5 @@
transform, to separately compute the high and low byte
streams of each UTF-16 code unit. Those two byte streams
-are subsequentely merged to form the final result.
+are subsequently merged to form the final result.
Algorithms for performing the inverse transform mirror those
@@ -847,7 +847,7 @@
operations. For each of the four pairs of consecutive
even/odd bit streams (bit0/bit1, bit2/bit3, bit4/bit5, bit6/bit7),
-two consecutive registers of bitpair data are produced.
+two consecutive registers of bit-pair data are produced.
In stage 2, \verb#simd<2>::mergeh# and \verb#simd<2>::mergel#
-are then applied to merge to bitpair streams to produce streams
+are then applied to merge to bit-pair streams to produce streams
of nybbles for the high and low nybble of each byte. Finally,
the nybble streams are merged in stage 3 to produce the
@@ -965,5 +965,5 @@
At each inductive doubling level, it requires 4 operations to compute the
-required deletion infomation and one operation per bit stream to perform deletion.
+required deletion information and one operation per bit stream to perform deletion.
Note that, if deletion is to be applied to a set of eight parallel bit streams,
the computed deletion information is used for each stream without recomputation,
@@ -1083,7 +1083,7 @@
\begin{center}\small
\begin{verbatim}
-t1=simd<8>:const(10)
-t2=simd<16>:const(100)
-t3=simd<32>:const(10000)
+t1=simd<8>:constant(10)
+t2=simd<16>:constant(100)
+t3=simd<32>:constant(10000)
b=simd<8>::add(simd<8>::mult(d,t1), d)
b=simd<16>::add(simd<16>::mult(b,t2), b)
@@ -1131,8 +1131,8 @@
Further applications of IDISA can often be found
by searching for algorithms employing the magic masks
-s \verb:0x55555555:, \verb:0x33333333:, and so on.
-Examples include the bitslice implementation of AES \cite{DBLP:conf/cans/RebeiroSD06}
+ \verb:0x55555555:, \verb:0x33333333:, and so on.
+Examples include the bit-slice implementation of AES \cite{DBLP:conf/cans/RebeiroSD06}
and integer contraction and dilation for quadtrees and
-octtrees\cite{Stocco95} and Morton-ordered arrays \cite{Raman08}.
+octrees\cite{Stocco95} and Morton-ordered arrays \cite{Raman08}.
Pixel packing from 32 bit fields into a 5:5:5 representation
is a further application of parallel bit deletion.
@@ -1203,5 +1203,5 @@
be realized through $h$ inductive double steps
in a similar fashion.
-Thus, IDISA esssentially offers systematic support
+Thus, IDISA essentially offers systematic support
for horizontal operations entirely through the
use of \verb:: half-operand modifier
@@ -1373,5 +1373,5 @@
are required for combination with the $\neg h$ and $r[i]$ terms. The terms for
the low and high half-operand modifiers are then combined with an
-additional 127 2-input or gates. Thus, the circuity complexity
+additional 127 2-input or gates. Thus, the circuitry complexity
for the combinational logic implementation of half-operand
modifiers within the SOFU is 1279 2-input gates per operand,
@@ -1402,5 +1402,5 @@
Addition for 4-bit fields in a 128-bit SWAR processor may be implemented
as a modification to that for 8-bit fields by incorporating logic to
-disable carry propagation at the 16 midfield boundaries. For 2-bit
+disable carry propagation at the 16 mid-field boundaries. For 2-bit
fields, disabling carry propagation at 32 additional boundaries suffices,
although it may be simpler to directly implement the simple logic
@@ -1465,5 +1465,5 @@
as well as half-operand modifiers and pack and merge operations
to support efficient transition between successive power-of-two
-field widths. The principle innovation is the notion of
+field widths. The principal innovation is the notion of
half-operand modifiers that eliminate the cost associated
with the explicit mask and shift operations required for
@@ -1580,5 +1580,5 @@
\bibitem{DBLP:conf/cans/RebeiroSD06}
Chester Rebeiro, A.~David Selvakumar, and A.~S.~L. Devi.
-\newblock Bitslice implementation of aes.
+\newblock Bitslice implementation of AES.
\newblock In David Pointcheval, Yi~Mu, and Kefei Chen, editors, {\em CANS},
volume 4301 of {\em Lecture Notes in Computer Science}, pages 203--212.