source: docs/Working/re/avx2.tex @ 3880

Last change on this file since 3880 was 3880, checked in by cameron, 5 years ago

Various formatting items

File size: 6.1 KB
Line 
1\section{SIMD Scalability}\label{sec:AVX2}
2
3
4Although commodity processors have provided 128-bit SIMD operations for
5more than a decade, the extension to 256-bit integer SIMD operations
6has just recently taken place with the availability of AVX2
7instructions in Intel Haswell architecture chips as of mid 2013.
8This provides an excellent opportunity to assess the scalability
9of the bitwise data-parallel approach to regular expression matching.
10
11For the most part, adapting the Parabix tool chain to the new AVX2
12instructions was straightforward.   This mostly involved regenerating
13library functions using the new AVX2 intrinsics.   There were minor
14issues in the core transposition algorithm because the doublebyte-to-byte
15pack instructions are confined to independent operations within two
16128-bit lanes. 
17
18
19 \begin{figure}[tbh]
20
21\begin{center} \small
22\begin{verbatim}
23bitblock_t spread(uint64_t bits) {
24  uint64_t s = 0x0000200040008001 * bits;
25  uint64_t t = s & 0x0001000100010001;
26  return _mm256_cvtepu16_epi64(t);
27}
28\end{verbatim}
29\end{center}
30\caption{AVX2 256-bit Spread}
31\label{fig:AVX2spread}
32
33\end{figure}
34
35\paragraph*{AVX2 256-Bit Addition} Bitstream addition at the 256-bit block size was implemented using the
36long-stream addition technique.   The AVX2 instruction set directly
37supports the \verb#hsimd<64>::mask(X)# operation using
38the \verb#_mm256_movemask_pd#  intrinsic, extracting
39the required 4-bit mask directly from the 256-bit vector.
40The \verb#hsimd<64>::spread(X)# is slightly more
41problematic, requiring a short sequence of instructions
42to convert the computed 4-bit increment mask back
43into a vector of 4 64-bit values.   One method is to
44use the AVX2 broadcast instruction to make 4 copies
45of the mask to be spread, followed by appropriate
46bit manipulation.   Another uses multiplication to
47first spread to 16-bit fields as shown in Figure \ref{fig:AVX2spread}.
48
49We also compiled new versions of the {\tt egrep} and {\tt nrgrep} programs
50using the {\tt -march=core-avx2} flag in case the compiler is able
51to vectorize some of the code.
52
53\begin{figure}
54\begin{center}
55\begin{tikzpicture}
56\begin{axis}[
57xtick=data,
58ylabel=AVX2 Instruction Reduction,
59xticklabels={@,Date,Email,URIorEmail,HexBytes,StarHeight},
60tick label style={font=\tiny},
61enlarge x limits=0.15,
62%enlarge y limits={0.15, upper},
63ymin=0,
64legend style={at={(0.5,-0.15)},
65anchor=north,legend columns=-1},
66ybar,
67bar width=7pt,
68]
69\addplot
70file {data/sse2-avx2-instr-red-bitstreams.dat};
71\addplot
72file {data/sse2-avx2-instr-red-nrgrep112.dat};
73\addplot
74file {data/sse2-avx2-instr-red-gre2p.dat};
75
76\legend{bitstreams,nrgrep,gre2p,Annot}
77\end{axis}
78\end{tikzpicture}
79\end{center}
80\caption{Instruction Reduction}\label{fig:AVXInstrReduction}
81\end{figure}
82
83
84Figure \ref{fig:AVXInstrReduction} shows the reduction in instruction
85count achieved for each of the applications.   Working at a block
86size of 256 bytes at a time rather than 128 bytes at a time,
87the bitstreams implementation scaled very well with reductions in
88instruction count over a factor of two in every case except for StarHeight.   
89Although a factor
90of two would seem an outside limit, we attribute the change to
91greater instruction efficiency. 
92AVX2 instructions use a
93non destructive three-operand
94form instead of the destructive two-operand form of SSE2.
95In the two-operand form, binary instructions must always use
96one of the source registers as a destination register.   As a
97result the SSE2 object code generates many data movement operations
98that are unnecessary with the AVX2 set.
99
100As expected, there was no observable reduction in instruction
101count with the recompiled grep and nrgrep applications.
102
103
104
105\begin{figure}
106\begin{center}
107\begin{tikzpicture}
108\begin{axis}[
109xtick=data,
110ylabel=AVX2 Speedup,
111xticklabels={@,Date,Email,URIorEmail,HexBytes,StarHeight},
112tick label style={font=\tiny},
113enlarge x limits=0.15,
114%enlarge y limits={0.15, upper},
115ymin=0,
116legend style={at={(0.5,-0.15)},
117anchor=north,legend columns=-1},
118ybar,
119bar width=7pt,
120]
121\addplot
122file {data/sse2-avx2-speedup-bitstreams.dat};
123\addplot
124file {data/sse2-avx2-speedup-nrgrep112.dat};
125\addplot
126file {data/sse2-avx2-speedup-gre2p.dat};
127
128\legend{bitstreams,nrgrep,gre2p,Annot}
129\end{axis}
130\end{tikzpicture}
131\end{center}
132\caption{AVX Speedup}\label{fig:AVXSpeedup}
133\end{figure}
134
135As shown in Figure \ref{fig:AVXSpeedup} the reduction in
136instruction count was reflected in a significant speedup
137in the bitstreams implementation in all cases except
138StarHeight.  However, the speedup was
139considerably less than expected. 
140The bitstreams code  on AVX2 has suffered from a considerable
141reduction in instructions per cycle compared to the SSE2
142implementation, likely indicating
143that our grep implementation has become memory-bound.
144However, the performance of StarHeight deserves particular
145comment, with an actual slowdown observed.   When moving
146to 256 positions at a time, the controlling while loops may
147require more iterations than working 128 positions at a time,
148because the iteration must continue as long as there are any
149pending markers in the block.   
150Nevertheless, the overall results on our AVX2 machine were quite encouraging,
151demonstrating very good scalability of the bitwise data-parallel approach.
152Significantly, the @ regular expression is matched at 0.63 cycles/byte
153using our AVX2 implementation indicating a significant reduction
154in the overhead cost of the Parabix transform.
155
156Table \ref{Xfactors} shows the final performance results
157showing the speedup factors achieved by the Parabix/AVX2 implementation
158vs nrgrep and gre2p.  We have also added comparison with GNU grep
159(version 2.16),
160as it is well known and sometimes used as a basis for comparisons.
161\begin{table}
162\begin{center}
163\begin{tabular}{|c|c|c|c|} \hline
164\multirow{2}{*}{Expression} & \multicolumn{3}{c|}{Bitstream grep Speedup} \\ \cline{2-4}
165& vs. nrgrep & vs. gre2p & vs. GNU grep -e\\ \hline \hline
166At & 3.5X & 34X & 1.6X\\ \hline
167Date & 0.76X & 13X & 48X\\ \hline
168Email & 9.5X & 28X & 12X\\ \hline
169URIorEmail & 6.6X & 27X & 518X\\ \hline
170HexBytes & 8.1X & 105X & 267X\\ \hline
171StarHeight & 1.9X & 7.6X & 97X\\ \hline
172\end{tabular}
173\end{center}
174\caption{Speedups Obtained}\label{Xfactors}
175\end{table}
Note: See TracBrowser for help on using the repository browser.