source: docs/Working/parlensort.tex @ 3635

Last change on this file since 3635 was 1549, checked in by cameron, 8 years ago

Update parlensort doc

File size: 5.7 KB
Line 
1\documentclass{article}
2\usepackage{fullpage}
3
4\begin{document}
5\title{Parallel Bitstream-Based Length Sorting}
6\author{Robert D. Cameron}
7\date{May 22, 2011}
8
9\maketitle
10
11\section{Length-Group Bitstreams}
12
13Given streams $S$ and $E$ that mark the starts and ends of Names.
14
15We can partition $E$ into a length-sorted set of streams.
16
17First we can partition $E$ into the stream marking
18ends of Names of length 1, $E_1$, and the stream
19of ends of Names of greater than length 1, $E_{>1}$, as follows.
20Here  $n(S)$ is the bitstream ``next'' operation, shifting the
21stream forward one position.
22
23\begin{eqnarray*}
24S_1 & = & n(S)\\
25E_1 & = & S_1 \wedge E\\
26E_{>1} & = & E \wedge \neg E_1\\
27\end{eqnarray*}
28
29Following this, the stream that marks the ends of Names of length 2, $E_2$,
30as well as the stream $E_{>2}$ for longer Names
31can be similarly determined.
32
33\begin{eqnarray*}
34S_2 & = & n(S_1)\\
35E_2 & = & S_2 \wedge E_{>1}\\
36E_{>2} & = & E_{>1} \wedge \neg E_2\\
37\end{eqnarray*}
38
39It may then be desirable to group the Names of length 3 and 4 together.
40A stream $E_{3,4}$ marking the ends of such names can be determined
41with a slight variation of the pattern.
42
43\begin{eqnarray*}
44S_{3,4} & = & n(n(S_1 | S_2))\\
45E_{3,4} & = & S_{3,4} \wedge E_{>2}\\
46E_{>4} & = & E_{>2} \wedge \neg E_{3,4}\\
47\end{eqnarray*}
48
49Here the stream $S_{3,4}$ is the stream consisting of all positions
503 or 4 positions past a Name start position. 
51
52Following this pattern streams for Names of length 5 through 8 are
53similarly determined.
54
55\begin{eqnarray*}
56S_{5,8} & = & n(n(n(n(S_1 | S_2 | S_{3,4}))))\\
57E_{5,8} & = & S_{5,8} \wedge E_{>4}\\
58E_{>8} & = & E_{>4} \wedge \neg E_{5,8}\\
59\end{eqnarray*}
60
61
62\section{Parallel Hashing}
63
64Given length groups identified as in the previous section,
65it is possible to perform parallel hash value calculation for
66names in each group as follows.
67
68The first step is to create hash streams.
69Let $B_0, B_1, \ldots, B_7$ be the {\em basis} bitstreams
70determined by character stream transposition.   That is,
71the $k$th basis bitstream $B_k$ consists of the $k$th bit
72of each character in the source data stream.
73Then we may compute hash streams by some bitwise logical combination
74of basis bit streams.
75For example, the following two hash streams may be useful.
76
77\begin{eqnarray*}
78h_a & = & B_2 \oplus B_4 \oplus B_6 \\
79h_b & = & B_3 \oplus B_5 \oplus B_7 \\
80\end{eqnarray*}
81
82
83
84
85
86\section{Follow-up}
87
88Once the length sorting above has completed, then we can process
89names using bitscan techniques.
90
91\begin{enumerate}
92\item Sequentially scan $E_1$ to process all names of length 1.
93\item Sequentially scan $E_2$ to process all names of length 2.
94\item Sequentially scan $E_{3,4}$ to process all names of length 3 or 4.  The
95       actual length may be determined by a bit test of the $S$ stream.
96\item Sequentially scan $E_{5,8}$ to process all names of length 5 through 8.  The
97       actual length may be determined by a reverse bitscan from each $E_{5,8}$
98       to the corresponding name start position in $S$.
99\item Sequentially scan $E_>8$ to process all names of length greater than 8.
100       This will require both a reverse bit scan to find the name start and
101a forward scan to find the actual length.
102\end{enumerate}
103
104Of course, this scheme shows just one partitioning into length groups; others
105are possible.  Using the log-length strategy, a partition of $E_{>8}$ into
106$E_{9,16}$ and $E_{>16}$ would probably be worthwhile with XML.
107
108\section{Expected Benefits}
109
110What is the expected benefit of this technique versus sequentially determining
111lengths?   This avoids allocation of length-based position arrays, the memory
112operations of insertion of positions into those arrays, bounds-checking on those
113arrays and maintaining an auxiliary array of lengths of each array.
114
115A further benefit may be from the programming side: the length sorting can
116be implemented using our existing Pablo language, avoiding low-level C programming.
117
118\section{DIV2 grouping}
119Given streams $S$ and $E$ that mark the starts and ends of Names.
120We can partition $E$ into length-sorted groups using div2-length strategy: $G_i = n(E_{i*2-1})|E_{i*2}$
121
122First we calculate $T$, which marks the two positions after each name.
123Then we can get $G_1$ as follows.
124\begin{eqnarray*}
125T & = & E | n(E)\\
126S_2 & = & n(n(S))\\
127G_1 & = & S_2 \wedge T \\
128\end{eqnarray*}
129
130$G_i (i>1)$ can be similarly determined as follows.
131
132\begin{eqnarray*}
133S_{i*2} & = & n(n(S_{(i-1)*2} \wedge \neg T)) \\
134G_i & = & S_{i*2} \wedge T \\
135\end{eqnarray*}
136
137Once the length sorting above has completed, then we can process
138names using bitscan techniques.
139
140For each $G_i (i>0)$:
141\begin{enumerate}
142\item Sequentially scan $G_i$ to find positions $P_i = \{p_{i,0},p_{i,1},...,p_{i,n},\}$.
143\item For each $k$ in $\{1,2...,n\}$, calculate hash value for name $N[p_{i,k-i*2}]...N[p_{i,k-1}]$
144\item Look up hash table. If the name exist, process next name.
145Otherwise, if $N[p_{i,k-1}]$ is a delimiter (name length is odd), lookup name $N[p_{i,k-i*2}]...N[p_{i,k-2}] (i>1)$ or name $N[p_{1,k-2}]$.
146If it exists, get its GID $g$ and insert $N[p_{i,k-i*2}]...N[p_{i,k-1}]$ with GID $g$.
147If it dosen't exist, insert this name with a new GID and insert $N[p_{i,k-i*2}]...N[p_{i,k-1}]$ with the same GID.
148If $N[p_{i,k-1}]$ is not a delimiter (name length is even), insert $N[p_{i,k-i*2}]...N[p_{i,k-1}]$ with a new GID.
149\end{enumerate}
150
151% \section{Grouping Strategies}
152% \subsection{Length Grouping}
153% \subsubsection{ID grouping}
154%
155% \subsubsection{LOG grouping}
156%
157% \subsubsection{DIV grouping}
158%
159% \subsection{Length + hash Grouping}
160%
161% \section{Sorting Algorithms}
162% \subsection{Sequential Sorting}
163%
164% \subsection{Parallel Sorting}
165%
166% \section{Group Processing Strategies}
167%
168% \subsection{Linear SIMD Search}
169%
170% \subsection{Trie}
171%
172% \subsection{Hash}
173
174\end{document}
175
Note: See TracBrowser for help on using the repository browser.