[1186] | 1 | \documentclass{article} |
---|
| 2 | \usepackage{fullpage} |
---|
| 3 | |
---|
| 4 | \begin{document} |
---|
| 5 | \title{Parallel Bitstream-Based Length Sorting} |
---|
| 6 | \author{Robert D. Cameron} |
---|
| 7 | \date{May 22, 2011} |
---|
| 8 | |
---|
| 9 | \maketitle |
---|
| 10 | |
---|
[1549] | 11 | \section{Length-Group Bitstreams} |
---|
[1186] | 12 | |
---|
| 13 | Given streams $S$ and $E$ that mark the starts and ends of Names. |
---|
| 14 | |
---|
| 15 | We can partition $E$ into a length-sorted set of streams. |
---|
| 16 | |
---|
| 17 | First we can partition $E$ into the stream marking |
---|
| 18 | ends of Names of length 1, $E_1$, and the stream |
---|
| 19 | of ends of Names of greater than length 1, $E_{>1}$, as follows. |
---|
[1549] | 20 | Here $n(S)$ is the bitstream ``next'' operation, shifting the |
---|
| 21 | stream forward one position. |
---|
[1186] | 22 | |
---|
| 23 | \begin{eqnarray*} |
---|
[1187] | 24 | S_1 & = & n(S)\\ |
---|
[1186] | 25 | E_1 & = & S_1 \wedge E\\ |
---|
| 26 | E_{>1} & = & E \wedge \neg E_1\\ |
---|
| 27 | \end{eqnarray*} |
---|
| 28 | |
---|
| 29 | Following this, the stream that marks the ends of Names of length 2, $E_2$, |
---|
| 30 | as well as the stream $E_{>2}$ for longer Names |
---|
| 31 | can be similarly determined. |
---|
| 32 | |
---|
| 33 | \begin{eqnarray*} |
---|
| 34 | S_2 & = & n(S_1)\\ |
---|
| 35 | E_2 & = & S_2 \wedge E_{>1}\\ |
---|
| 36 | E_{>2} & = & E_{>1} \wedge \neg E_2\\ |
---|
| 37 | \end{eqnarray*} |
---|
| 38 | |
---|
| 39 | It may then be desirable to group the Names of length 3 and 4 together. |
---|
| 40 | A stream $E_{3,4}$ marking the ends of such names can be determined |
---|
| 41 | with a slight variation of the pattern. |
---|
| 42 | |
---|
| 43 | \begin{eqnarray*} |
---|
| 44 | S_{3,4} & = & n(n(S_1 | S_2))\\ |
---|
| 45 | E_{3,4} & = & S_{3,4} \wedge E_{>2}\\ |
---|
| 46 | E_{>4} & = & E_{>2} \wedge \neg E_{3,4}\\ |
---|
| 47 | \end{eqnarray*} |
---|
| 48 | |
---|
| 49 | Here the stream $S_{3,4}$ is the stream consisting of all positions |
---|
| 50 | 3 or 4 positions past a Name start position. |
---|
| 51 | |
---|
| 52 | Following this pattern streams for Names of length 5 through 8 are |
---|
| 53 | similarly determined. |
---|
| 54 | |
---|
| 55 | \begin{eqnarray*} |
---|
| 56 | S_{5,8} & = & n(n(n(n(S_1 | S_2 | S_{3,4}))))\\ |
---|
| 57 | E_{5,8} & = & S_{5,8} \wedge E_{>4}\\ |
---|
| 58 | E_{>8} & = & E_{>4} \wedge \neg E_{5,8}\\ |
---|
| 59 | \end{eqnarray*} |
---|
| 60 | |
---|
[1549] | 61 | |
---|
| 62 | \section{Parallel Hashing} |
---|
| 63 | |
---|
| 64 | Given length groups identified as in the previous section, |
---|
| 65 | it is possible to perform parallel hash value calculation for |
---|
| 66 | names in each group as follows. |
---|
| 67 | |
---|
| 68 | The first step is to create hash streams. |
---|
| 69 | Let $B_0, B_1, \ldots, B_7$ be the {\em basis} bitstreams |
---|
| 70 | determined by character stream transposition. That is, |
---|
| 71 | the $k$th basis bitstream $B_k$ consists of the $k$th bit |
---|
| 72 | of each character in the source data stream. |
---|
| 73 | Then we may compute hash streams by some bitwise logical combination |
---|
| 74 | of basis bit streams. |
---|
| 75 | For example, the following two hash streams may be useful. |
---|
| 76 | |
---|
| 77 | \begin{eqnarray*} |
---|
| 78 | h_a & = & B_2 \oplus B_4 \oplus B_6 \\ |
---|
| 79 | h_b & = & B_3 \oplus B_5 \oplus B_7 \\ |
---|
| 80 | \end{eqnarray*} |
---|
| 81 | |
---|
| 82 | |
---|
| 83 | |
---|
| 84 | |
---|
| 85 | |
---|
[1186] | 86 | \section{Follow-up} |
---|
| 87 | |
---|
| 88 | Once the length sorting above has completed, then we can process |
---|
| 89 | names using bitscan techniques. |
---|
| 90 | |
---|
| 91 | \begin{enumerate} |
---|
| 92 | \item Sequentially scan $E_1$ to process all names of length 1. |
---|
| 93 | \item Sequentially scan $E_2$ to process all names of length 2. |
---|
| 94 | \item Sequentially scan $E_{3,4}$ to process all names of length 3 or 4. The |
---|
| 95 | actual length may be determined by a bit test of the $S$ stream. |
---|
| 96 | \item Sequentially scan $E_{5,8}$ to process all names of length 5 through 8. The |
---|
| 97 | actual length may be determined by a reverse bitscan from each $E_{5,8}$ |
---|
| 98 | to the corresponding name start position in $S$. |
---|
| 99 | \item Sequentially scan $E_>8$ to process all names of length greater than 8. |
---|
| 100 | This will require both a reverse bit scan to find the name start and |
---|
| 101 | a forward scan to find the actual length. |
---|
| 102 | \end{enumerate} |
---|
| 103 | |
---|
[1187] | 104 | Of course, this scheme shows just one partitioning into length groups; others |
---|
| 105 | are possible. Using the log-length strategy, a partition of $E_{>8}$ into |
---|
| 106 | $E_{9,16}$ and $E_{>16}$ would probably be worthwhile with XML. |
---|
| 107 | |
---|
[1186] | 108 | \section{Expected Benefits} |
---|
| 109 | |
---|
| 110 | What is the expected benefit of this technique versus sequentially determining |
---|
| 111 | lengths? This avoids allocation of length-based position arrays, the memory |
---|
| 112 | operations of insertion of positions into those arrays, bounds-checking on those |
---|
[1549] | 113 | arrays and maintaining an auxiliary array of lengths of each array. |
---|
[1186] | 114 | |
---|
| 115 | A further benefit may be from the programming side: the length sorting can |
---|
| 116 | be implemented using our existing Pablo language, avoiding low-level C programming. |
---|
| 117 | |
---|
[1302] | 118 | \section{DIV2 grouping} |
---|
| 119 | Given streams $S$ and $E$ that mark the starts and ends of Names. |
---|
| 120 | We can partition $E$ into length-sorted groups using div2-length strategy: $G_i = n(E_{i*2-1})|E_{i*2}$ |
---|
| 121 | |
---|
| 122 | First we calculate $T$, which marks the two positions after each name. |
---|
| 123 | Then we can get $G_1$ as follows. |
---|
| 124 | \begin{eqnarray*} |
---|
| 125 | T & = & E | n(E)\\ |
---|
| 126 | S_2 & = & n(n(S))\\ |
---|
| 127 | G_1 & = & S_2 \wedge T \\ |
---|
| 128 | \end{eqnarray*} |
---|
| 129 | |
---|
| 130 | $G_i (i>1)$ can be similarly determined as follows. |
---|
| 131 | |
---|
| 132 | \begin{eqnarray*} |
---|
| 133 | S_{i*2} & = & n(n(S_{(i-1)*2} \wedge \neg T)) \\ |
---|
| 134 | G_i & = & S_{i*2} \wedge T \\ |
---|
| 135 | \end{eqnarray*} |
---|
| 136 | |
---|
| 137 | Once the length sorting above has completed, then we can process |
---|
| 138 | names using bitscan techniques. |
---|
| 139 | |
---|
| 140 | For each $G_i (i>0)$: |
---|
| 141 | \begin{enumerate} |
---|
| 142 | \item Sequentially scan $G_i$ to find positions $P_i = \{p_{i,0},p_{i,1},...,p_{i,n},\}$. |
---|
| 143 | \item For each $k$ in $\{1,2...,n\}$, calculate hash value for name $N[p_{i,k-i*2}]...N[p_{i,k-1}]$ |
---|
| 144 | \item Look up hash table. If the name exist, process next name. |
---|
| 145 | Otherwise, if $N[p_{i,k-1}]$ is a delimiter (name length is odd), lookup name $N[p_{i,k-i*2}]...N[p_{i,k-2}] (i>1)$ or name $N[p_{1,k-2}]$. |
---|
| 146 | If it exists, get its GID $g$ and insert $N[p_{i,k-i*2}]...N[p_{i,k-1}]$ with GID $g$. |
---|
| 147 | If it dosen't exist, insert this name with a new GID and insert $N[p_{i,k-i*2}]...N[p_{i,k-1}]$ with the same GID. |
---|
| 148 | If $N[p_{i,k-1}]$ is not a delimiter (name length is even), insert $N[p_{i,k-i*2}]...N[p_{i,k-1}]$ with a new GID. |
---|
| 149 | \end{enumerate} |
---|
| 150 | |
---|
| 151 | % \section{Grouping Strategies} |
---|
| 152 | % \subsection{Length Grouping} |
---|
| 153 | % \subsubsection{ID grouping} |
---|
| 154 | % |
---|
| 155 | % \subsubsection{LOG grouping} |
---|
| 156 | % |
---|
| 157 | % \subsubsection{DIV grouping} |
---|
| 158 | % |
---|
| 159 | % \subsection{Length + hash Grouping} |
---|
| 160 | % |
---|
| 161 | % \section{Sorting Algorithms} |
---|
| 162 | % \subsection{Sequential Sorting} |
---|
| 163 | % |
---|
| 164 | % \subsection{Parallel Sorting} |
---|
| 165 | % |
---|
| 166 | % \section{Group Processing Strategies} |
---|
| 167 | % |
---|
| 168 | % \subsection{Linear SIMD Search} |
---|
| 169 | % |
---|
| 170 | % \subsection{Trie} |
---|
| 171 | % |
---|
| 172 | % \subsection{Hash} |
---|
| 173 | |
---|
[1186] | 174 | \end{document} |
---|
| 175 | |
---|