source: docs/Working/parlensort.tex @ 1187

Last change on this file since 1187 was 1187, checked in by cameron, 8 years ago

Added notes on parallel bitstream-based length sorting.

File size: 3.1 KB
Line 
1\documentclass{article}
2\usepackage{fullpage}
3
4\begin{document}
5\title{Parallel Bitstream-Based Length Sorting}
6\author{Robert D. Cameron}
7\date{May 22, 2011}
8
9\maketitle
10
11\section{The Idea}
12
13Given streams $S$ and $E$ that mark the starts and ends of Names.
14
15We can partition $E$ into a length-sorted set of streams.
16
17First we can partition $E$ into the stream marking
18ends of Names of length 1, $E_1$, and the stream
19of ends of Names of greater than length 1, $E_{>1}$, as follows.
20
21\begin{eqnarray*}
22S_1 & = & n(S)\\
23E_1 & = & S_1 \wedge E\\
24E_{>1} & = & E \wedge \neg E_1\\
25\end{eqnarray*}
26
27Following this, the stream that marks the ends of Names of length 2, $E_2$,
28as well as the stream $E_{>2}$ for longer Names
29can be similarly determined.
30
31\begin{eqnarray*}
32S_2 & = & n(S_1)\\
33E_2 & = & S_2 \wedge E_{>1}\\
34E_{>2} & = & E_{>1} \wedge \neg E_2\\
35\end{eqnarray*}
36
37It may then be desirable to group the Names of length 3 and 4 together.
38A stream $E_{3,4}$ marking the ends of such names can be determined
39with a slight variation of the pattern.
40
41\begin{eqnarray*}
42S_{3,4} & = & n(n(S_1 | S_2))\\
43E_{3,4} & = & S_{3,4} \wedge E_{>2}\\
44E_{>4} & = & E_{>2} \wedge \neg E_{3,4}\\
45\end{eqnarray*}
46
47Here the stream $S_{3,4}$ is the stream consisting of all positions
483 or 4 positions past a Name start position. 
49
50Following this pattern streams for Names of length 5 through 8 are
51similarly determined.
52
53\begin{eqnarray*}
54S_{5,8} & = & n(n(n(n(S_1 | S_2 | S_{3,4}))))\\
55E_{5,8} & = & S_{5,8} \wedge E_{>4}\\
56E_{>8} & = & E_{>4} \wedge \neg E_{5,8}\\
57\end{eqnarray*}
58
59\section{Follow-up}
60
61Once the length sorting above has completed, then we can process
62names using bitscan techniques.
63
64\begin{enumerate}
65\item Sequentially scan $E_1$ to process all names of length 1.
66\item Sequentially scan $E_2$ to process all names of length 2.
67\item Sequentially scan $E_{3,4}$ to process all names of length 3 or 4.  The
68       actual length may be determined by a bit test of the $S$ stream.
69\item Sequentially scan $E_{5,8}$ to process all names of length 5 through 8.  The
70       actual length may be determined by a reverse bitscan from each $E_{5,8}$
71       to the corresponding name start position in $S$.
72\item Sequentially scan $E_>8$ to process all names of length greater than 8.
73       This will require both a reverse bit scan to find the name start and
74a forward scan to find the actual length.
75\end{enumerate}
76
77Of course, this scheme shows just one partitioning into length groups; others
78are possible.  Using the log-length strategy, a partition of $E_{>8}$ into
79$E_{9,16}$ and $E_{>16}$ would probably be worthwhile with XML.
80
81\section{Expected Benefits}
82
83What is the expected benefit of this technique versus sequentially determining
84lengths?   This avoids allocation of length-based position arrays, the memory
85operations of insertion of positions into those arrays, bounds-checking on those
86arrays and maintaining an auxiliary array of lengths of the each array.
87
88A further benefit may be from the programming side: the length sorting can
89be implemented using our existing Pablo language, avoiding low-level C programming.
90
91\end{document}
92
Note: See TracBrowser for help on using the repository browser.