\documentclass{article}
\usepackage{fullpage}
\begin{document}
\title{Parallel Bitstream-Based Length Sorting}
\author{Robert D. Cameron}
\date{May 22, 2011}
\maketitle
\section{Length-Group Bitstreams}
Given streams $S$ and $E$ that mark the starts and ends of Names.
We can partition $E$ into a length-sorted set of streams.
First we can partition $E$ into the stream marking
ends of Names of length 1, $E_1$, and the stream
of ends of Names of greater than length 1, $E_{>1}$, as follows.
Here $n(S)$ is the bitstream ``next'' operation, shifting the
stream forward one position.
\begin{eqnarray*}
S_1 & = & n(S)\\
E_1 & = & S_1 \wedge E\\
E_{>1} & = & E \wedge \neg E_1\\
\end{eqnarray*}
Following this, the stream that marks the ends of Names of length 2, $E_2$,
as well as the stream $E_{>2}$ for longer Names
can be similarly determined.
\begin{eqnarray*}
S_2 & = & n(S_1)\\
E_2 & = & S_2 \wedge E_{>1}\\
E_{>2} & = & E_{>1} \wedge \neg E_2\\
\end{eqnarray*}
It may then be desirable to group the Names of length 3 and 4 together.
A stream $E_{3,4}$ marking the ends of such names can be determined
with a slight variation of the pattern.
\begin{eqnarray*}
S_{3,4} & = & n(n(S_1 | S_2))\\
E_{3,4} & = & S_{3,4} \wedge E_{>2}\\
E_{>4} & = & E_{>2} \wedge \neg E_{3,4}\\
\end{eqnarray*}
Here the stream $S_{3,4}$ is the stream consisting of all positions
3 or 4 positions past a Name start position.
Following this pattern streams for Names of length 5 through 8 are
similarly determined.
\begin{eqnarray*}
S_{5,8} & = & n(n(n(n(S_1 | S_2 | S_{3,4}))))\\
E_{5,8} & = & S_{5,8} \wedge E_{>4}\\
E_{>8} & = & E_{>4} \wedge \neg E_{5,8}\\
\end{eqnarray*}
\section{Parallel Hashing}
Given length groups identified as in the previous section,
it is possible to perform parallel hash value calculation for
names in each group as follows.
The first step is to create hash streams.
Let $B_0, B_1, \ldots, B_7$ be the {\em basis} bitstreams
determined by character stream transposition. That is,
the $k$th basis bitstream $B_k$ consists of the $k$th bit
of each character in the source data stream.
Then we may compute hash streams by some bitwise logical combination
of basis bit streams.
For example, the following two hash streams may be useful.
\begin{eqnarray*}
h_a & = & B_2 \oplus B_4 \oplus B_6 \\
h_b & = & B_3 \oplus B_5 \oplus B_7 \\
\end{eqnarray*}
\section{Follow-up}
Once the length sorting above has completed, then we can process
names using bitscan techniques.
\begin{enumerate}
\item Sequentially scan $E_1$ to process all names of length 1.
\item Sequentially scan $E_2$ to process all names of length 2.
\item Sequentially scan $E_{3,4}$ to process all names of length 3 or 4. The
actual length may be determined by a bit test of the $S$ stream.
\item Sequentially scan $E_{5,8}$ to process all names of length 5 through 8. The
actual length may be determined by a reverse bitscan from each $E_{5,8}$
to the corresponding name start position in $S$.
\item Sequentially scan $E_>8$ to process all names of length greater than 8.
This will require both a reverse bit scan to find the name start and
a forward scan to find the actual length.
\end{enumerate}
Of course, this scheme shows just one partitioning into length groups; others
are possible. Using the log-length strategy, a partition of $E_{>8}$ into
$E_{9,16}$ and $E_{>16}$ would probably be worthwhile with XML.
\section{Expected Benefits}
What is the expected benefit of this technique versus sequentially determining
lengths? This avoids allocation of length-based position arrays, the memory
operations of insertion of positions into those arrays, bounds-checking on those
arrays and maintaining an auxiliary array of lengths of each array.
A further benefit may be from the programming side: the length sorting can
be implemented using our existing Pablo language, avoiding low-level C programming.
\section{DIV2 grouping}
Given streams $S$ and $E$ that mark the starts and ends of Names.
We can partition $E$ into length-sorted groups using div2-length strategy: $G_i = n(E_{i*2-1})|E_{i*2}$
First we calculate $T$, which marks the two positions after each name.
Then we can get $G_1$ as follows.
\begin{eqnarray*}
T & = & E | n(E)\\
S_2 & = & n(n(S))\\
G_1 & = & S_2 \wedge T \\
\end{eqnarray*}
$G_i (i>1)$ can be similarly determined as follows.
\begin{eqnarray*}
S_{i*2} & = & n(n(S_{(i-1)*2} \wedge \neg T)) \\
G_i & = & S_{i*2} \wedge T \\
\end{eqnarray*}
Once the length sorting above has completed, then we can process
names using bitscan techniques.
For each $G_i (i>0)$:
\begin{enumerate}
\item Sequentially scan $G_i$ to find positions $P_i = \{p_{i,0},p_{i,1},...,p_{i,n},\}$.
\item For each $k$ in $\{1,2...,n\}$, calculate hash value for name $N[p_{i,k-i*2}]...N[p_{i,k-1}]$
\item Look up hash table. If the name exist, process next name.
Otherwise, if $N[p_{i,k-1}]$ is a delimiter (name length is odd), lookup name $N[p_{i,k-i*2}]...N[p_{i,k-2}] (i>1)$ or name $N[p_{1,k-2}]$.
If it exists, get its GID $g$ and insert $N[p_{i,k-i*2}]...N[p_{i,k-1}]$ with GID $g$.
If it dosen't exist, insert this name with a new GID and insert $N[p_{i,k-i*2}]...N[p_{i,k-1}]$ with the same GID.
If $N[p_{i,k-1}]$ is not a delimiter (name length is even), insert $N[p_{i,k-i*2}]...N[p_{i,k-1}]$ with a new GID.
\end{enumerate}
% \section{Grouping Strategies}
% \subsection{Length Grouping}
% \subsubsection{ID grouping}
%
% \subsubsection{LOG grouping}
%
% \subsubsection{DIV grouping}
%
% \subsection{Length + hash Grouping}
%
% \section{Sorting Algorithms}
% \subsection{Sequential Sorting}
%
% \subsection{Parallel Sorting}
%
% \section{Group Processing Strategies}
%
% \subsection{Linear SIMD Search}
%
% \subsection{Trie}
%
% \subsection{Hash}
\end{document}