source: docs/Working/parlensort.tex @ 1186

Last change on this file since 1186 was 1186, checked in by cameron, 8 years ago

Notes on parallel bitstream-based length sorting.

File size: 2.9 KB
Line 
1\documentclass{article}
2\usepackage{fullpage}
3
4\begin{document}
5\title{Parallel Bitstream-Based Length Sorting}
6\author{Robert D. Cameron}
7\date{May 22, 2011}
8
9\maketitle
10
11\section{The Idea}
12
13Given streams $S$ and $E$ that mark the starts and ends of Names.
14
15We can partition $E$ into a length-sorted set of streams.
16
17First we can partition $E$ into the stream marking
18ends of Names of length 1, $E_1$, and the stream
19of ends of Names of greater than length 1, $E_{>1}$, as follows.
20
21\begin{eqnarray*}
22S_1 & = & n(s)\\
23E_1 & = & S_1 \wedge E\\
24E_{>1} & = & E \wedge \neg E_1\\
25\end{eqnarray*}
26
27Following this, the stream that marks the ends of Names of length 2, $E_2$,
28as well as the stream $E_{>2}$ for longer Names
29can be similarly determined.
30
31\begin{eqnarray*}
32S_2 & = & n(S_1)\\
33E_2 & = & S_2 \wedge E_{>1}\\
34E_{>2} & = & E_{>1} \wedge \neg E_2\\
35\end{eqnarray*}
36
37It may then be desirable to group the Names of length 3 and 4 together.
38A stream $E_{3,4}$ marking the ends of such names can be determined
39with a slight variation of the pattern.
40
41\begin{eqnarray*}
42S_{3,4} & = & n(n(S_1 | S_2))\\
43E_{3,4} & = & S_{3,4} \wedge E_{>2}\\
44E_{>4} & = & E_{>2} \wedge \neg E_{3,4}\\
45\end{eqnarray*}
46
47Here the stream $S_{3,4}$ is the stream consisting of all positions
483 or 4 positions past a Name start position. 
49
50Following this pattern streams for Names of length 5 through 8 are
51similarly determined.
52
53\begin{eqnarray*}
54S_{5,8} & = & n(n(n(n(S_1 | S_2 | S_{3,4}))))\\
55E_{5,8} & = & S_{5,8} \wedge E_{>4}\\
56E_{>8} & = & E_{>4} \wedge \neg E_{5,8}\\
57\end{eqnarray*}
58
59\section{Follow-up}
60
61Once the length sorting above has completed, then we can process
62names using bitscan techniques.
63
64\begin{enumerate}
65\item Sequentially scan $E_1$ to process all names of length 1.
66\item Sequentially scan $E_2$ to process all names of length 2.
67\item Sequentially scan $E_{3,4}$ to process all names of length 3 or 4.  The
68       actual length may be determined by a bit test of the $S$ stream.
69\item Sequentially scan $E_{5,8}$ to process all names of length 5 through 8.  The
70       actual length may be determined by a reverse bitscan from each $E_{5,8}$
71       to the corresponding name start position in $S$.
72\item Sequentially scan $E_>8$ to process all names of length greater than 8.
73       This will require both a reverse bit scan to find the name start and
74a forward scan to find the actual length.
75\end{enumerate}
76
77\section{Expected Benefits}
78
79What is the expected benefit of this technique versus sequentially determining
80lengths?   This avoids allocation of length-based position arrays, the memory
81operations of insertion of positions into those arrays, bounds-checking on those
82arrays and maintaining an auxiliary array of lengths of the each array.
83
84A further benefit may be from the programming side: the length sorting can
85be implemented using our existing Pablo language, avoiding low-level C programming.
86
87\end{document}
88
Note: See TracBrowser for help on using the repository browser.