1 | \documentclass{article} |
---|
2 | \usepackage{fullpage} |
---|
3 | |
---|
4 | \begin{document} |
---|
5 | \title{Parallel Bitstream-Based Length Sorting} |
---|
6 | \author{Robert D. Cameron} |
---|
7 | \date{May 22, 2011} |
---|
8 | |
---|
9 | \maketitle |
---|
10 | |
---|
11 | \section{The Idea} |
---|
12 | |
---|
13 | Given streams $S$ and $E$ that mark the starts and ends of Names. |
---|
14 | |
---|
15 | We can partition $E$ into a length-sorted set of streams. |
---|
16 | |
---|
17 | First we can partition $E$ into the stream marking |
---|
18 | ends of Names of length 1, $E_1$, and the stream |
---|
19 | of ends of Names of greater than length 1, $E_{>1}$, as follows. |
---|
20 | |
---|
21 | \begin{eqnarray*} |
---|
22 | S_1 & = & n(S)\\ |
---|
23 | E_1 & = & S_1 \wedge E\\ |
---|
24 | E_{>1} & = & E \wedge \neg E_1\\ |
---|
25 | \end{eqnarray*} |
---|
26 | |
---|
27 | Following this, the stream that marks the ends of Names of length 2, $E_2$, |
---|
28 | as well as the stream $E_{>2}$ for longer Names |
---|
29 | can be similarly determined. |
---|
30 | |
---|
31 | \begin{eqnarray*} |
---|
32 | S_2 & = & n(S_1)\\ |
---|
33 | E_2 & = & S_2 \wedge E_{>1}\\ |
---|
34 | E_{>2} & = & E_{>1} \wedge \neg E_2\\ |
---|
35 | \end{eqnarray*} |
---|
36 | |
---|
37 | It may then be desirable to group the Names of length 3 and 4 together. |
---|
38 | A stream $E_{3,4}$ marking the ends of such names can be determined |
---|
39 | with a slight variation of the pattern. |
---|
40 | |
---|
41 | \begin{eqnarray*} |
---|
42 | S_{3,4} & = & n(n(S_1 | S_2))\\ |
---|
43 | E_{3,4} & = & S_{3,4} \wedge E_{>2}\\ |
---|
44 | E_{>4} & = & E_{>2} \wedge \neg E_{3,4}\\ |
---|
45 | \end{eqnarray*} |
---|
46 | |
---|
47 | Here the stream $S_{3,4}$ is the stream consisting of all positions |
---|
48 | 3 or 4 positions past a Name start position. |
---|
49 | |
---|
50 | Following this pattern streams for Names of length 5 through 8 are |
---|
51 | similarly determined. |
---|
52 | |
---|
53 | \begin{eqnarray*} |
---|
54 | S_{5,8} & = & n(n(n(n(S_1 | S_2 | S_{3,4}))))\\ |
---|
55 | E_{5,8} & = & S_{5,8} \wedge E_{>4}\\ |
---|
56 | E_{>8} & = & E_{>4} \wedge \neg E_{5,8}\\ |
---|
57 | \end{eqnarray*} |
---|
58 | |
---|
59 | \section{Follow-up} |
---|
60 | |
---|
61 | Once the length sorting above has completed, then we can process |
---|
62 | names using bitscan techniques. |
---|
63 | |
---|
64 | \begin{enumerate} |
---|
65 | \item Sequentially scan $E_1$ to process all names of length 1. |
---|
66 | \item Sequentially scan $E_2$ to process all names of length 2. |
---|
67 | \item Sequentially scan $E_{3,4}$ to process all names of length 3 or 4. The |
---|
68 | actual length may be determined by a bit test of the $S$ stream. |
---|
69 | \item Sequentially scan $E_{5,8}$ to process all names of length 5 through 8. The |
---|
70 | actual length may be determined by a reverse bitscan from each $E_{5,8}$ |
---|
71 | to the corresponding name start position in $S$. |
---|
72 | \item Sequentially scan $E_>8$ to process all names of length greater than 8. |
---|
73 | This will require both a reverse bit scan to find the name start and |
---|
74 | a forward scan to find the actual length. |
---|
75 | \end{enumerate} |
---|
76 | |
---|
77 | Of course, this scheme shows just one partitioning into length groups; others |
---|
78 | are possible. Using the log-length strategy, a partition of $E_{>8}$ into |
---|
79 | $E_{9,16}$ and $E_{>16}$ would probably be worthwhile with XML. |
---|
80 | |
---|
81 | \section{Expected Benefits} |
---|
82 | |
---|
83 | What is the expected benefit of this technique versus sequentially determining |
---|
84 | lengths? This avoids allocation of length-based position arrays, the memory |
---|
85 | operations of insertion of positions into those arrays, bounds-checking on those |
---|
86 | arrays and maintaining an auxiliary array of lengths of the each array. |
---|
87 | |
---|
88 | A further benefit may be from the programming side: the length sorting can |
---|
89 | be implemented using our existing Pablo language, avoiding low-level C programming. |
---|
90 | |
---|
91 | \end{document} |
---|
92 | |
---|