source: docs/Working/re/avx2.tex @ 3894

Last change on this file since 3894 was 3893, checked in by linmengl, 5 years ago

bar charts black-and-white printer friendly

File size: 5.7 KB
Line 
1\section{SIMD Scalability}\label{sec:AVX2}
2
3
4Although commodity processors have provided 128-bit SIMD operations for
5more than a decade, the extension to 256-bit integer SIMD operations
6has just recently taken place with the availability of AVX2
7instructions in Intel Haswell architecture chips as of mid 2013.
8This provides an excellent opportunity to assess the scalability
9of the bitwise data-parallel approach to regular expression matching.
10
11For the most part, adapting the Parabix tool chain to the new AVX2
12instructions was straightforward.   This mostly involved regenerating
13library functions using the new AVX2 intrinsics.   There were minor
14issues in the core transposition algorithm because the doublebyte-to-byte
15pack instructions are confined to independent operations within two
16128-bit lanes. 
17
18
19\paragraph*{AVX2 256-Bit Addition} Bitstream addition
20at the 256-bit block size was implemented using the
21long-stream addition technique.   The AVX2 instruction set directly
22supports the \verb#hsimd<64>::mask(X)# operation using
23the \verb#_mm256_movemask_pd#  intrinsic, extracting
24the required 4-bit mask directly from the 256-bit vector.
25The \verb#hsimd<64>::spread(X)# is slightly more complex, requiring a short sequence of instructions
26to convert the computed 4-bit increment mask back
27into a vector of 4 64-bit values.
28
29We also compiled new versions of the {\tt egrep} and {\tt nrgrep} programs
30using the {\tt -march=core-avx2} flag in case the compiler is able
31to vectorize some of the code.
32
33\begin{figure}
34\begin{center}
35\begin{tikzpicture}
36\begin{axis}[
37xtick=data,
38ylabel=AVX2 Instruction Reduction,
39xticklabels={@,Date,Email,URI,Hex,StarHeight},
40tick label style={font=\tiny},
41enlarge x limits=0.15,
42%enlarge y limits={0.15, upper},
43ymin=0,
44legend style={at={(0.5,-0.15)},
45anchor=north,legend columns=-1},
46ybar,
47bar width=7pt,
48cycle list = {black,black!70,black!40,black!10}
49]
50\addplot+[]
51file {data/sse2-avx2-instr-red-bitstreams.dat};
52\addplot+[fill,text=black]
53file {data/sse2-avx2-instr-red-nrgrep112.dat};
54\addplot+[fill,,text=black]
55file {data/sse2-avx2-instr-red-gre2p.dat};
56
57\legend{bitstreams,nrgrep,gre2p,Annot}
58\end{axis}
59\end{tikzpicture}
60\end{center}
61\caption{Instruction Reduction}\label{fig:AVXInstrReduction}
62\end{figure}
63
64
65Figure \ref{fig:AVXInstrReduction} shows the reduction in instruction
66count achieved for each of the applications.   Working at a block
67size of 256 bytes at a time rather than 128 bytes at a time,
68the bitstreams implementation scaled very well with reductions in
69instruction count over a factor of two in every case except for StarHeight.   
70Although a factor
71of two would seem an outside limit, we attribute the change to
72greater instruction efficiency. 
73AVX2 instructions use a
74non destructive three-operand
75form instead of the destructive two-operand form of SSE2.
76In the two-operand form, binary instructions must always use
77one of the source registers as a destination register.   As a
78result the SSE2 object code generates many data movement operations
79that are unnecessary with the AVX2 set.
80
81As expected, there was no observable reduction in instruction
82count with the recompiled grep and nrgrep applications.
83
84
85
86\begin{figure}
87\begin{center}
88\begin{tikzpicture}
89\begin{axis}[
90xtick=data,
91ylabel=AVX2 Speedup,
92xticklabels={@,Date,Email,URI,Hex,StarHeight},
93tick label style={font=\tiny},
94enlarge x limits=0.15,
95%enlarge y limits={0.15, upper},
96ymin=0,
97legend style={at={(0.5,-0.15)},
98anchor=north,legend columns=-1},
99ybar,
100bar width=7pt,
101cycle list = {black,black!70,black!40,black!10}
102]
103\addplot+[]
104file {data/sse2-avx2-speedup-bitstreams.dat};
105\addplot+[fill,text=black]
106file {data/sse2-avx2-speedup-nrgrep112.dat};
107\addplot+[fill,,text=black]
108file {data/sse2-avx2-speedup-gre2p.dat};
109
110\legend{bitstreams,nrgrep,gre2p,Annot}
111\end{axis}
112\end{tikzpicture}
113\end{center}
114\caption{AVX Speedup}\label{fig:AVXSpeedup}
115\end{figure}
116
117As shown in Figure \ref{fig:AVXSpeedup} the reduction in
118instruction count was reflected in a significant speedup
119in the bitstreams implementation in all cases except
120StarHeight.  However, the speedup was
121considerably less than expected. 
122The bitstreams code  on AVX2 has suffered from a considerable
123reduction in instructions per cycle compared to the SSE2
124implementation, likely indicating
125that our grep implementation has become memory-bound.
126However, the performance of StarHeight deserves particular
127comment, with an actual slowdown observed.   When moving
128to 256 positions at a time, the controlling while loops may
129require more iterations than working 128 positions at a time,
130because the iteration must continue as long as there are any
131pending markers in the block.   
132Nevertheless, the overall results on our AVX2 machine were quite encouraging,
133demonstrating very good scalability of the bitwise data-parallel approach.
134Significantly, the @ regular expression is matched at 0.63 cycles/byte
135using our AVX2 implementation indicating a significant reduction
136in the overhead cost of the Parabix transform.
137
138Table \ref{Xfactors} shows the final performance results
139showing the speedup factors achieved by the Parabix/AVX2 implementation
140vs nrgrep and gre2p.  We have also added comparison with GNU grep
141(version 2.16),
142as it is well known and sometimes used as a basis for comparisons.
143\begin{table}
144\begin{center}
145\begin{tabular}{|c|c|c|c|} \hline
146\multirow{2}{*}{Expression} & \multicolumn{3}{c|}{Bitstream grep Speedup} \\ \cline{2-4}
147& vs. nrgrep & vs. gre2p & vs. GNU grep -e\\ \hline \hline
148At & 3.5X & 34X & 1.6X\\ \hline
149Date & 0.76X & 13X & 48X\\ \hline
150Email & 9.5X & 28X & 12X\\ \hline
151URIorEmail & 6.6X & 27X & 518X\\ \hline
152Hex & 8.1X & 105X & 267X\\ \hline
153StarHeight & 1.9X & 7.6X & 97X\\ \hline
154\end{tabular}
155\end{center}
156\caption{Speedups Obtained}\label{Xfactors}
157\end{table}
Note: See TracBrowser for help on using the repository browser.