1 | \documentclass{article} |
---|
2 | \usepackage{fullpage} |
---|
3 | |
---|
4 | \begin{document} |
---|
5 | \title{Parallel Bitstream-Based Length Sorting} |
---|
6 | \author{Robert D. Cameron} |
---|
7 | \date{May 22, 2011} |
---|
8 | |
---|
9 | \maketitle |
---|
10 | |
---|
11 | \section{Length-Group Bitstreams} |
---|
12 | |
---|
13 | Given streams $S$ and $E$ that mark the starts and ends of Names. |
---|
14 | |
---|
15 | We can partition $E$ into a length-sorted set of streams. |
---|
16 | |
---|
17 | First we can partition $E$ into the stream marking |
---|
18 | ends of Names of length 1, $E_1$, and the stream |
---|
19 | of ends of Names of greater than length 1, $E_{>1}$, as follows. |
---|
20 | Here $n(S)$ is the bitstream ``next'' operation, shifting the |
---|
21 | stream forward one position. |
---|
22 | |
---|
23 | \begin{eqnarray*} |
---|
24 | S_1 & = & n(S)\\ |
---|
25 | E_1 & = & S_1 \wedge E\\ |
---|
26 | E_{>1} & = & E \wedge \neg E_1\\ |
---|
27 | \end{eqnarray*} |
---|
28 | |
---|
29 | Following this, the stream that marks the ends of Names of length 2, $E_2$, |
---|
30 | as well as the stream $E_{>2}$ for longer Names |
---|
31 | can be similarly determined. |
---|
32 | |
---|
33 | \begin{eqnarray*} |
---|
34 | S_2 & = & n(S_1)\\ |
---|
35 | E_2 & = & S_2 \wedge E_{>1}\\ |
---|
36 | E_{>2} & = & E_{>1} \wedge \neg E_2\\ |
---|
37 | \end{eqnarray*} |
---|
38 | |
---|
39 | It may then be desirable to group the Names of length 3 and 4 together. |
---|
40 | A stream $E_{3,4}$ marking the ends of such names can be determined |
---|
41 | with a slight variation of the pattern. |
---|
42 | |
---|
43 | \begin{eqnarray*} |
---|
44 | S_{3,4} & = & n(n(S_1 | S_2))\\ |
---|
45 | E_{3,4} & = & S_{3,4} \wedge E_{>2}\\ |
---|
46 | E_{>4} & = & E_{>2} \wedge \neg E_{3,4}\\ |
---|
47 | \end{eqnarray*} |
---|
48 | |
---|
49 | Here the stream $S_{3,4}$ is the stream consisting of all positions |
---|
50 | 3 or 4 positions past a Name start position. |
---|
51 | |
---|
52 | Following this pattern streams for Names of length 5 through 8 are |
---|
53 | similarly determined. |
---|
54 | |
---|
55 | \begin{eqnarray*} |
---|
56 | S_{5,8} & = & n(n(n(n(S_1 | S_2 | S_{3,4}))))\\ |
---|
57 | E_{5,8} & = & S_{5,8} \wedge E_{>4}\\ |
---|
58 | E_{>8} & = & E_{>4} \wedge \neg E_{5,8}\\ |
---|
59 | \end{eqnarray*} |
---|
60 | |
---|
61 | |
---|
62 | \section{Parallel Hashing} |
---|
63 | |
---|
64 | Given length groups identified as in the previous section, |
---|
65 | it is possible to perform parallel hash value calculation for |
---|
66 | names in each group as follows. |
---|
67 | |
---|
68 | The first step is to create hash streams. |
---|
69 | Let $B_0, B_1, \ldots, B_7$ be the {\em basis} bitstreams |
---|
70 | determined by character stream transposition. That is, |
---|
71 | the $k$th basis bitstream $B_k$ consists of the $k$th bit |
---|
72 | of each character in the source data stream. |
---|
73 | Then we may compute hash streams by some bitwise logical combination |
---|
74 | of basis bit streams. |
---|
75 | For example, the following two hash streams may be useful. |
---|
76 | |
---|
77 | \begin{eqnarray*} |
---|
78 | h_a & = & B_2 \oplus B_4 \oplus B_6 \\ |
---|
79 | h_b & = & B_3 \oplus B_5 \oplus B_7 \\ |
---|
80 | \end{eqnarray*} |
---|
81 | |
---|
82 | |
---|
83 | |
---|
84 | |
---|
85 | |
---|
86 | \section{Follow-up} |
---|
87 | |
---|
88 | Once the length sorting above has completed, then we can process |
---|
89 | names using bitscan techniques. |
---|
90 | |
---|
91 | \begin{enumerate} |
---|
92 | \item Sequentially scan $E_1$ to process all names of length 1. |
---|
93 | \item Sequentially scan $E_2$ to process all names of length 2. |
---|
94 | \item Sequentially scan $E_{3,4}$ to process all names of length 3 or 4. The |
---|
95 | actual length may be determined by a bit test of the $S$ stream. |
---|
96 | \item Sequentially scan $E_{5,8}$ to process all names of length 5 through 8. The |
---|
97 | actual length may be determined by a reverse bitscan from each $E_{5,8}$ |
---|
98 | to the corresponding name start position in $S$. |
---|
99 | \item Sequentially scan $E_>8$ to process all names of length greater than 8. |
---|
100 | This will require both a reverse bit scan to find the name start and |
---|
101 | a forward scan to find the actual length. |
---|
102 | \end{enumerate} |
---|
103 | |
---|
104 | Of course, this scheme shows just one partitioning into length groups; others |
---|
105 | are possible. Using the log-length strategy, a partition of $E_{>8}$ into |
---|
106 | $E_{9,16}$ and $E_{>16}$ would probably be worthwhile with XML. |
---|
107 | |
---|
108 | \section{Expected Benefits} |
---|
109 | |
---|
110 | What is the expected benefit of this technique versus sequentially determining |
---|
111 | lengths? This avoids allocation of length-based position arrays, the memory |
---|
112 | operations of insertion of positions into those arrays, bounds-checking on those |
---|
113 | arrays and maintaining an auxiliary array of lengths of each array. |
---|
114 | |
---|
115 | A further benefit may be from the programming side: the length sorting can |
---|
116 | be implemented using our existing Pablo language, avoiding low-level C programming. |
---|
117 | |
---|
118 | \section{DIV2 grouping} |
---|
119 | Given streams $S$ and $E$ that mark the starts and ends of Names. |
---|
120 | We can partition $E$ into length-sorted groups using div2-length strategy: $G_i = n(E_{i*2-1})|E_{i*2}$ |
---|
121 | |
---|
122 | First we calculate $T$, which marks the two positions after each name. |
---|
123 | Then we can get $G_1$ as follows. |
---|
124 | \begin{eqnarray*} |
---|
125 | T & = & E | n(E)\\ |
---|
126 | S_2 & = & n(n(S))\\ |
---|
127 | G_1 & = & S_2 \wedge T \\ |
---|
128 | \end{eqnarray*} |
---|
129 | |
---|
130 | $G_i (i>1)$ can be similarly determined as follows. |
---|
131 | |
---|
132 | \begin{eqnarray*} |
---|
133 | S_{i*2} & = & n(n(S_{(i-1)*2} \wedge \neg T)) \\ |
---|
134 | G_i & = & S_{i*2} \wedge T \\ |
---|
135 | \end{eqnarray*} |
---|
136 | |
---|
137 | Once the length sorting above has completed, then we can process |
---|
138 | names using bitscan techniques. |
---|
139 | |
---|
140 | For each $G_i (i>0)$: |
---|
141 | \begin{enumerate} |
---|
142 | \item Sequentially scan $G_i$ to find positions $P_i = \{p_{i,0},p_{i,1},...,p_{i,n},\}$. |
---|
143 | \item For each $k$ in $\{1,2...,n\}$, calculate hash value for name $N[p_{i,k-i*2}]...N[p_{i,k-1}]$ |
---|
144 | \item Look up hash table. If the name exist, process next name. |
---|
145 | Otherwise, if $N[p_{i,k-1}]$ is a delimiter (name length is odd), lookup name $N[p_{i,k-i*2}]...N[p_{i,k-2}] (i>1)$ or name $N[p_{1,k-2}]$. |
---|
146 | If it exists, get its GID $g$ and insert $N[p_{i,k-i*2}]...N[p_{i,k-1}]$ with GID $g$. |
---|
147 | If it dosen't exist, insert this name with a new GID and insert $N[p_{i,k-i*2}]...N[p_{i,k-1}]$ with the same GID. |
---|
148 | If $N[p_{i,k-1}]$ is not a delimiter (name length is even), insert $N[p_{i,k-i*2}]...N[p_{i,k-1}]$ with a new GID. |
---|
149 | \end{enumerate} |
---|
150 | |
---|
151 | % \section{Grouping Strategies} |
---|
152 | % \subsection{Length Grouping} |
---|
153 | % \subsubsection{ID grouping} |
---|
154 | % |
---|
155 | % \subsubsection{LOG grouping} |
---|
156 | % |
---|
157 | % \subsubsection{DIV grouping} |
---|
158 | % |
---|
159 | % \subsection{Length + hash Grouping} |
---|
160 | % |
---|
161 | % \section{Sorting Algorithms} |
---|
162 | % \subsection{Sequential Sorting} |
---|
163 | % |
---|
164 | % \subsection{Parallel Sorting} |
---|
165 | % |
---|
166 | % \section{Group Processing Strategies} |
---|
167 | % |
---|
168 | % \subsection{Linear SIMD Search} |
---|
169 | % |
---|
170 | % \subsection{Trie} |
---|
171 | % |
---|
172 | % \subsection{Hash} |
---|
173 | |
---|
174 | \end{document} |
---|
175 | |
---|