source: docs/Working/parlensort.tex @ 1302

Last change on this file since 1302 was 1302, checked in by lindanl, 8 years ago

Create a directory for HPCA

File size: 4.9 KB
Line 
1\documentclass{article}
2\usepackage{fullpage}
3
4\begin{document}
5\title{Parallel Bitstream-Based Length Sorting}
6\author{Robert D. Cameron}
7\date{May 22, 2011}
8
9\maketitle
10
11\section{The Idea}
12
13Given streams $S$ and $E$ that mark the starts and ends of Names.
14
15We can partition $E$ into a length-sorted set of streams.
16
17First we can partition $E$ into the stream marking
18ends of Names of length 1, $E_1$, and the stream
19of ends of Names of greater than length 1, $E_{>1}$, as follows.
20
21\begin{eqnarray*}
22S_1 & = & n(S)\\
23E_1 & = & S_1 \wedge E\\
24E_{>1} & = & E \wedge \neg E_1\\
25\end{eqnarray*}
26
27Following this, the stream that marks the ends of Names of length 2, $E_2$,
28as well as the stream $E_{>2}$ for longer Names
29can be similarly determined.
30
31\begin{eqnarray*}
32S_2 & = & n(S_1)\\
33E_2 & = & S_2 \wedge E_{>1}\\
34E_{>2} & = & E_{>1} \wedge \neg E_2\\
35\end{eqnarray*}
36
37It may then be desirable to group the Names of length 3 and 4 together.
38A stream $E_{3,4}$ marking the ends of such names can be determined
39with a slight variation of the pattern.
40
41\begin{eqnarray*}
42S_{3,4} & = & n(n(S_1 | S_2))\\
43E_{3,4} & = & S_{3,4} \wedge E_{>2}\\
44E_{>4} & = & E_{>2} \wedge \neg E_{3,4}\\
45\end{eqnarray*}
46
47Here the stream $S_{3,4}$ is the stream consisting of all positions
483 or 4 positions past a Name start position. 
49
50Following this pattern streams for Names of length 5 through 8 are
51similarly determined.
52
53\begin{eqnarray*}
54S_{5,8} & = & n(n(n(n(S_1 | S_2 | S_{3,4}))))\\
55E_{5,8} & = & S_{5,8} \wedge E_{>4}\\
56E_{>8} & = & E_{>4} \wedge \neg E_{5,8}\\
57\end{eqnarray*}
58
59\section{Follow-up}
60
61Once the length sorting above has completed, then we can process
62names using bitscan techniques.
63
64\begin{enumerate}
65\item Sequentially scan $E_1$ to process all names of length 1.
66\item Sequentially scan $E_2$ to process all names of length 2.
67\item Sequentially scan $E_{3,4}$ to process all names of length 3 or 4.  The
68       actual length may be determined by a bit test of the $S$ stream.
69\item Sequentially scan $E_{5,8}$ to process all names of length 5 through 8.  The
70       actual length may be determined by a reverse bitscan from each $E_{5,8}$
71       to the corresponding name start position in $S$.
72\item Sequentially scan $E_>8$ to process all names of length greater than 8.
73       This will require both a reverse bit scan to find the name start and
74a forward scan to find the actual length.
75\end{enumerate}
76
77Of course, this scheme shows just one partitioning into length groups; others
78are possible.  Using the log-length strategy, a partition of $E_{>8}$ into
79$E_{9,16}$ and $E_{>16}$ would probably be worthwhile with XML.
80
81\section{Expected Benefits}
82
83What is the expected benefit of this technique versus sequentially determining
84lengths?   This avoids allocation of length-based position arrays, the memory
85operations of insertion of positions into those arrays, bounds-checking on those
86arrays and maintaining an auxiliary array of lengths of the each array.
87
88A further benefit may be from the programming side: the length sorting can
89be implemented using our existing Pablo language, avoiding low-level C programming.
90
91\section{DIV2 grouping}
92Given streams $S$ and $E$ that mark the starts and ends of Names.
93We can partition $E$ into length-sorted groups using div2-length strategy: $G_i = n(E_{i*2-1})|E_{i*2}$
94
95First we calculate $T$, which marks the two positions after each name.
96Then we can get $G_1$ as follows.
97\begin{eqnarray*}
98T & = & E | n(E)\\
99S_2 & = & n(n(S))\\
100G_1 & = & S_2 \wedge T \\
101\end{eqnarray*}
102
103$G_i (i>1)$ can be similarly determined as follows.
104
105\begin{eqnarray*}
106S_{i*2} & = & n(n(S_{(i-1)*2} \wedge \neg T)) \\
107G_i & = & S_{i*2} \wedge T \\
108\end{eqnarray*}
109
110Once the length sorting above has completed, then we can process
111names using bitscan techniques.
112
113For each $G_i (i>0)$:
114\begin{enumerate}
115\item Sequentially scan $G_i$ to find positions $P_i = \{p_{i,0},p_{i,1},...,p_{i,n},\}$.
116\item For each $k$ in $\{1,2...,n\}$, calculate hash value for name $N[p_{i,k-i*2}]...N[p_{i,k-1}]$
117\item Look up hash table. If the name exist, process next name.
118Otherwise, if $N[p_{i,k-1}]$ is a delimiter (name length is odd), lookup name $N[p_{i,k-i*2}]...N[p_{i,k-2}] (i>1)$ or name $N[p_{1,k-2}]$.
119If it exists, get its GID $g$ and insert $N[p_{i,k-i*2}]...N[p_{i,k-1}]$ with GID $g$.
120If it dosen't exist, insert this name with a new GID and insert $N[p_{i,k-i*2}]...N[p_{i,k-1}]$ with the same GID.
121If $N[p_{i,k-1}]$ is not a delimiter (name length is even), insert $N[p_{i,k-i*2}]...N[p_{i,k-1}]$ with a new GID.
122\end{enumerate}
123
124% \section{Grouping Strategies}
125% \subsection{Length Grouping}
126% \subsubsection{ID grouping}
127%
128% \subsubsection{LOG grouping}
129%
130% \subsubsection{DIV grouping}
131%
132% \subsection{Length + hash Grouping}
133%
134% \section{Sorting Algorithms}
135% \subsection{Sequential Sorting}
136%
137% \subsection{Parallel Sorting}
138%
139% \section{Group Processing Strategies}
140%
141% \subsection{Linear SIMD Search}
142%
143% \subsection{Trie}
144%
145% \subsection{Hash}
146
147\end{document}
148
Note: See TracBrowser for help on using the repository browser.