\section{SIMD Scalability}\label{sec:AVX2}

Although commodity processors have provided 128-bit SIMD operations for
more than a decade, the extension to 256-bit integer SIMD operations
has only recently taken place, with the availability of AVX2
instructions in Intel Haswell architecture chips as of mid-2013.
This provides an excellent opportunity to assess the scalability
of the bitwise data-parallel approach to regular expression matching.

For the most part, adapting the Parabix tool chain to the new AVX2
instructions was straightforward.  This mostly involved regenerating
library functions using the new AVX2 intrinsics.  There were minor
issues in the core transposition algorithm because the doublebyte-to-byte
pack instructions are confined to independent operations within two
128-bit lanes.

\paragraph*{AVX2 256-Bit Addition} Bitstream addition
at the 256-bit block size was implemented using the
long-stream addition technique.  The AVX2 instruction set directly
supports the \verb#hsimd<64>::mask(X)# operation using
the \verb#_mm256_movemask_pd# intrinsic, extracting
the required 4-bit mask directly from the 256-bit vector.
The \verb#hsimd<64>::spread(X)# operation is slightly more complex,
requiring a short sequence of instructions to convert the computed
4-bit increment mask back into a vector of four 64-bit values.

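To make these steps concrete, the following minimal sketch assembles
them in plain C with AVX2 intrinsics.  It is an illustrative assumption
of how the pieces fit together, not the Parabix library code; the
function and variable names are invented for exposition.
{\small
\begin{verbatim}
#include <immintrin.h>
#include <stdint.h>

/* Illustrative 256-bit long-stream addition
   over four 64-bit fields; *cin is the
   running carry (0 or 1) between blocks. */
static inline __m256i add256(__m256i a,
                             __m256i b,
                             uint64_t *cin) {
  __m256i ones = _mm256_set1_epi64x(-1);
  __m256i sgn = _mm256_set1_epi64x(INT64_MIN);
  __m256i r = _mm256_add_epi64(a, b);
  /* field carries: r < a, unsigned, via
     sign-bias and signed comparison */
  __m256i cv = _mm256_cmpgt_epi64(
      _mm256_xor_si256(a, sgn),
      _mm256_xor_si256(r, sgn));
  /* "bubble" fields of all ones */
  __m256i bv = _mm256_cmpeq_epi64(r, ones);
  /* hsimd<64>::mask via movemask */
  uint64_t c = (uint64_t)
      _mm256_movemask_pd(_mm256_castsi256_pd(cv));
  uint64_t bub = (uint64_t)
      _mm256_movemask_pd(_mm256_castsi256_pd(bv));
  /* carries propagate through bubbles
     (MatchStar on the 4-bit masks) */
  uint64_t m = (c << 1) | *cin;
  uint64_t i = (((m & bub) + bub) ^ bub) | m;
  *cin = (i >> 4) & 1; /* block carry out */
  /* hsimd<64>::spread: move increment
     bit i into 64-bit field i */
  __m256i incr = _mm256_and_si256(
      _mm256_srlv_epi64(
          _mm256_set1_epi64x((long long)i),
          _mm256_set_epi64x(3, 2, 1, 0)),
      _mm256_set1_epi64x(1));
  return _mm256_add_epi64(r, incr);
}
\end{verbatim}
}
Here the AVX2 variable shift \verb#_mm256_srlv_epi64# performs the
spread; a 16-entry lookup table indexed by the increment mask would
serve equally well.
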
We also compiled new versions of the {\tt gre2p} and {\tt nrgrep} programs
using the {\tt -march=core-avx2} flag in case the compiler is able
to vectorize some of the code.

\begin{figure}
\begin{center}
\begin{tikzpicture}
\begin{axis}[
xtick=data,
ylabel=AVX2 Instruction Reduction,
xticklabels={@,Date,Email,URI,Hex,StarHeight},
tick label style={font=\tiny},
enlarge x limits=0.15,
%enlarge y limits={0.15, upper},
ymin=0,
legend style={at={(0.5,-0.15)},
anchor=north,legend columns=-1},
ybar,
bar width=7pt,
cycle list = {black,black!70,black!40,black!10}
]
\addplot+[]
file {data/sse2-avx2-instr-red-bitstreams.dat};
\addplot+[fill,text=black]
file {data/sse2-avx2-instr-red-nrgrep112.dat};
\addplot+[fill,text=black]
file {data/sse2-avx2-instr-red-gre2p.dat};

\legend{bitstreams,nrgrep,gre2p}
\end{axis}
\end{tikzpicture}
\end{center}
\caption{AVX2/SSE2 Instruction Reduction}\label{fig:AVXInstrReduction}
\end{figure}


Figure \ref{fig:AVXInstrReduction} shows the reduction in instruction
count achieved for each of the applications.  Working at a block
size of 256 bytes rather than 128 bytes, the bitstreams implementation
scaled very well, with the instruction count reduced by more than a
factor of two in every case except StarHeight.
Although a factor of two would seem an outside limit, we attribute the
additional gain to greater instruction efficiency.
AVX2 instructions use a nondestructive three-operand
form instead of the destructive two-operand form of SSE2.
In the two-operand form, binary instructions must always use
one of the source registers as the destination register.  As a
result, the SSE2 object code contains many data movement operations
that are unnecessary with the AVX2 set.

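As a small illustration (assumed for exposition, not taken from the
generated object code), consider an AND whose operands both remain
live afterwards: SSE2 must first copy one operand into the destination
register, while AVX2 simply names a third register.
{\small
\begin{verbatim}
#include <immintrin.h>

/* Both a and b stay live after the AND, so
   SSE2 needs movdqa t,a then pand t,b; AVX2
   emits a single vpand t,a,b instead. */
__m128i f128(__m128i a, __m128i b) {
  __m128i t = _mm_and_si128(a, b);
  return _mm_add_epi64(t, _mm_xor_si128(a, b));
}
__m256i f256(__m256i a, __m256i b) {
  __m256i t = _mm256_and_si256(a, b);
  return _mm256_add_epi64(t,
      _mm256_xor_si256(a, b));
}
\end{verbatim}
}
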
As expected, there was no observable reduction in instruction
count for the recompiled {\tt gre2p} and {\tt nrgrep} applications.

\begin{figure}
\begin{center}
\begin{tikzpicture}
\begin{axis}[
xtick=data,
ylabel=AVX2 Speedup,
xticklabels={@,Date,Email,URI,Hex,StarHeight},
tick label style={font=\tiny},
enlarge x limits=0.15,
%enlarge y limits={0.15, upper},
ymin=0,
legend style={at={(0.5,-0.15)},
anchor=north,legend columns=-1},
ybar,
bar width=7pt,
cycle list = {black,black!70,black!40,black!10}
]
\addplot+[]
file {data/sse2-avx2-speedup-bitstreams.dat};
\addplot+[fill,text=black]
file {data/sse2-avx2-speedup-nrgrep112.dat};
\addplot+[fill,text=black]
file {data/sse2-avx2-speedup-gre2p.dat};

\legend{bitstreams,nrgrep,gre2p}
\end{axis}
\end{tikzpicture}
\end{center}
\caption{AVX2/SSE2 Speedup}\label{fig:AVXSpeedup}
\end{figure}

As shown in Figure \ref{fig:AVXSpeedup}, the reduction in
instruction count was reflected in a significant speedup
of the bitstreams implementation in all cases except
StarHeight.  However, the speedup was
considerably less than expected.
The bitstreams code on AVX2 suffered a considerable
reduction in instructions per cycle compared to the SSE2
implementation, likely indicating
that our grep implementation has become memory-bound.
The performance of StarHeight deserves particular
comment, as an actual slowdown was observed.  When moving
to 256 positions at a time, the controlling while loops may
require more iterations than when working 128 positions at a time,
because the iteration must continue as long as there are any
pending markers in the block.
Nevertheless, the overall results on our AVX2 machine were quite encouraging,
demonstrating very good scalability of the bitwise data-parallel approach.
Significantly, the @ regular expression is matched at 0.63 cycles/byte
using our AVX2 implementation, indicating a considerable reduction
in the overhead cost of the Parabix transform.

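The shape of the loop in question can be sketched as follows.  This is
a hypothetical illustration in plain C: the block type and the
single-step \verb#advance# are stand-ins for the generated Parabix
code, not the actual implementation.
{\small
\begin{verbatim}
#include <stdint.h>

typedef struct { uint64_t w[4]; } block256;

static int any(block256 x) {
  return (x.w[0] | x.w[1]
        | x.w[2] | x.w[3]) != 0;
}

/* Stand-in for one marker-advance step. */
static block256 advance(block256 m,
                        block256 cc) {
  block256 r;
  for (int i = 0; i < 4; i++)
    r.w[i] = (m.w[i] << 1) & cc.w[i];
  return r;
}

/* The controlling while loop runs until no
   marker anywhere in the block is pending;
   a 256-bit block covers twice the span of
   a 128-bit one, so the test may take more
   iterations to clear. */
static block256 close_loop(block256 pending,
                           block256 cc) {
  block256 matched = pending;
  while (any(pending)) {
    pending = advance(pending, cc);
    for (int i = 0; i < 4; i++)
      matched.w[i] |= pending.w[i];
  }
  return matched;
}
\end{verbatim}
}
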
\begin{table}
\begin{center}
\begin{tabular}{|c|c|c|c|} \hline
\multirow{2}{*}{Expression} & \multicolumn{3}{c|}{Bitstream/AVX2 grep Speedup} \\ \cline{2-4}
& vs. nrgrep & vs. gre2p & vs. GNU grep -e\\ \hline \hline
@ & 3.5X & 34X & 1.6X\\ \hline
Date & 0.76X & 13X & 48X\\ \hline
Email & 9.5X & 28X & 12X\\ \hline
URI & 6.6X & 27X & 518X\\ \hline
Hex & 8.1X & 105X & 267X\\ \hline
StarHeight & 1.9X & 7.6X & 97X\\ \hline
\end{tabular}
\end{center}
\caption{Bitstream Speedup vs. Comparators}\label{Xfactors}
\end{table}

\begin{figure}
\begin{center}
\begin{tikzpicture}
\begin{axis}[
xtick=data,
ylabel=Running Time (ms per megabyte),
xticklabels={@,Date,Email,URI,Hex,StarHeight},
tick label style={font=\tiny},
enlarge x limits=0.15,
%enlarge y limits={0.15, upper},
ymin=0,
legend style={at={(0.5,-0.15)},
anchor=north,legend columns=-1},
ybar,
bar width=7pt,
cycle list = {black,black!70,black!40,black!10}
]
\addplot+[]
file {data/ssetime.dat};
\addplot+[fill,text=black]
file {data/avxtime.dat};
\addplot+[fill,text=black]
file {data/gputime.dat};

\legend{SSE2,AVX2,GPU}
\end{axis}
\end{tikzpicture}
\end{center}
\caption{Running Time}\label{fig:SSE-AVX-GPU}
\end{figure}

Table \ref{Xfactors} presents the final performance results, giving
the speedup factors achieved by the bitstreams/AVX2 implementation
versus nrgrep and gre2p.  We have also added a comparison with GNU grep
(version 2.16),
as it is well known and sometimes used as a basis for comparisons.

\section{GPU Implementation}\label{sec:GPU}

To further assess the scalability of our regular expression matching
using bit-parallel data streams, we implemented a GPU version
in OpenCL.
We arranged for 64 work groups, each having 64 threads.
The work group size and the number of work groups were chosen
to provide the best occupancy as calculated by the AMD APP Profiler.
Input files are divided in data-parallel fashion among
the 64 work groups.  Each work group carries out the regular
expression matching operations 4096 bytes at a time using SIMT
processing.  Although the GPU
does not directly support the mask and spread operations required
by our long-stream addition model,
we are able to simulate them using shared memory.
Each thread maintains
its own carry and bubble values in shared memory and performs
synchronized updates with the other threads using a six-step
parallel-prefix style process.  Others have implemented
long-stream addition on the GPU using similar techniques,
as noted previously.

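A minimal C sketch of that scan follows, written as a scalar rehearsal
of the per-thread logic rather than the OpenCL kernel itself; all
names are illustrative.  Each lane's carry value plays the role of a
``generate'' bit and its bubble value the role of a ``propagate'' bit,
exactly as in carry-lookahead addition.
{\small
\begin{verbatim}
#include <stdint.h>
#include <string.h>

#define LANES 64 /* one lane per work-item */

/* gen[t]: lane t produced a carry out.
   prop[t]: lane t is all ones (a bubble).
   in[t]: set iff a carry reaches lane t. */
void carry_scan(const uint8_t gen[LANES],
                const uint8_t prop[LANES],
                uint8_t in[LANES]) {
  uint8_t g[LANES], p[LANES];
  uint8_t ng[LANES], np[LANES];
  memcpy(g, gen, LANES);
  memcpy(p, prop, LANES);
  for (int d = 1; d < LANES; d <<= 1) {
    for (int t = 0; t < LANES; t++) {
      uint8_t gq = (t >= d) ? g[t - d] : 0;
      uint8_t pq = (t >= d) ? p[t - d] : 1;
      /* carry generated here, or relayed
         through an all-ones span */
      ng[t] = g[t] | (gq & p[t]);
      np[t] = p[t] & pq;
    }
    /* in the kernel: barrier() between
       rounds of shared-memory updates */
    memcpy(g, ng, LANES);
    memcpy(p, np, LANES);
  }
  in[0] = 0; /* no carry enters lane 0 */
  for (int t = 1; t < LANES; t++)
    in[t] = g[t - 1];
}
\end{verbatim}
}
With 64 lanes, the outer loop performs $\log_2 64 = 6$ combining
rounds, matching the six-step process described above.
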
We performed our tests on an AMD A10-6800K APU machine.
On AMD Fusion systems, the input buffer is allocated in
pinned memory to take advantage of the zero-copy memory regions:
data can be read directly into this region by the CPU
and also accessed by the GPU for further processing.  Therefore,
the expensive data transfer time needed by traditional
discrete GPUs is avoided, and we compare only the kernel execution
time with our SSE2 and AVX2 implementations, as shown in Figure
\ref{fig:SSE-AVX-GPU}.  The GPU version gives up to a 55\% performance
improvement over the SSE2 version and up to a 40\% performance
improvement over the AVX2 version.  However, because of the
implementation complexity of the triply-nested while loop for
the StarHeight expression, that case has been omitted.

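A sketch of the host-side arrangement is shown below, assuming an
already-created context and queue; \verb#CL_MEM_ALLOC_HOST_PTR# is the
standard OpenCL request for runtime-allocated (and, on Fusion APUs,
zero-copy) memory.  The helper name and error handling are
illustrative only.
{\small
\begin{verbatim}
#include <CL/cl.h>
#include <stdio.h>

/* Illustrative zero-copy input buffer:
   the CPU writes through a mapped pointer
   and the GPU reads the same region. */
cl_mem make_input(cl_context ctx,
                  cl_command_queue q,
                  FILE *f, size_t n) {
  cl_int err;
  cl_mem buf = clCreateBuffer(ctx,
      CL_MEM_READ_ONLY |
      CL_MEM_ALLOC_HOST_PTR,
      n, NULL, &err);
  void *host = clEnqueueMapBuffer(q, buf,
      CL_TRUE, CL_MAP_WRITE, 0, n,
      0, NULL, NULL, &err);
  size_t got = fread(host, 1, n, f);
  (void)got; /* sketch: no error handling */
  clEnqueueUnmapMemObject(q, buf, host,
                          0, NULL, NULL);
  return buf;
}
\end{verbatim}
}
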
Although the GPU processes 64 work groups of 4096 bytes each
at a time, rather than 128 bytes at a time with SSE2 or 256 bytes
at a time with AVX2, the performance
improvement is less than 60\%.  The first reason is hardware
limitations: our kernel occupancy is limited by register usage,
and not all of the work groups can be scheduled at the same time.
The second reason is that the long-stream addition implemented
on the GPU is more expensive than the implementations on SSE2 or AVX2.
Another important reason is control flow.  When a possible
match is found in one thread, the rest of the threads in the
same work group have to execute the same instructions for
further processing rather than jump to the next block with a
simple IF test.  Therefore, the performance on different
regular expressions depends on the number of
long-stream addition operations and the total number of matches
in a given input.  Perhaps surprisingly, the overhead of the Parabix
transformation was not a dominant factor, coming in at 0.08 ms/MB.