source: docs/Working/re/avx2.tex @ 3878

Last change on this file since 3878 was 3875, checked in by cameron, 5 years ago

GNU grep version 2.16

File size: 6.1 KB
[3498]1\section{SIMD Scalability}\label{sec:AVX2}
[3504]4Although commodity processors have provided 128-bit SIMD operations for
[3500]5more than a decade, the extension to 256-bit integer SIMD operations
6has just recently taken place with the availability of AVX2
7instructions in Intel Haswell architecture chips as of mid 2013.
8This provides an excellent opportunity to assess the scalability
9of the bitwise data-parallel approach to regular expression matching.
[3500]11For the most part, adapting the Parabix tool chain to the new AVX2
12instructions was straightforward.   This mostly involved regenerating
13library functions using the new AVX2 intrinsics.   There were minor
14issues in the core transposition algorithm because the doublebyte-to-byte
[3504]15pack instructions are confined to independent operations within two
[3500]16128-bit lanes. 
[3625]19\paragraph*{AVX2 256-Bit Addition}
20 \begin{figure}[tbh]
22\begin{center} \small
[3625]24bitblock_t spread(uint64_t bits) {
25  uint64_t s = 0x0000200040008001 * bits;
26  uint64_t t = s & 0x0001000100010001;
27  return _mm256_cvtepu16_epi64(t);
[3625]31\caption{AVX2 256-bit Spread}
36Bitstream addition at the 256-bit block size was implemented using the
[3625]37long-stream addition technique.   The AVX2 instruction set directly
38supports the \verb#hsimd<64>::mask(X)# operation using
39the \verb#_mm256_movemask_pd#  intrinsic, extracting
40the required 4-bit mask directly from the 256-bit vector.
41The \verb#hsimd<64>::spread(X)# is slightly more
42problematic, requiring a short sequence of instructions
43to convert the computed 4-bit increment mask back
44into a vector of 4 64-bit values.   One method is to
45use the AVX2 broadcast instruction to make 4 copies
46of the mask to be spread, followed by appropriate
47bit manipulation.   Another uses multiplication to
48first spread to 16-bit fields as shown in Figure \ref{fig:AVX2spread}.
[3625]50We also compiled new versions of the {\tt egrep} and {\tt nrgrep} programs
[3500]51using the {\tt -march=core-avx2} flag in case the compiler is able
52to vectorize some of the code.
[3515]59ylabel=AVX2 Instruction Reduction,
[3498]61tick label style={font=\tiny},
62enlarge x limits=0.15,
[3617]63%enlarge y limits={0.15, upper},
65legend style={at={(0.5,-0.15)},
66anchor=north,legend columns=-1},
68bar width=7pt,
[3659]71file {data/sse2-avx2-instr-red-bitstreams.dat};
[3659]73file {data/sse2-avx2-instr-red-nrgrep112.dat};
[3659]75file {data/sse2-avx2-instr-red-gre2p.dat};
[3500]81\caption{Instruction Reduction}\label{fig:AVXInstrReduction}
85Figure \ref{fig:AVXInstrReduction} shows the reduction in instruction
86count achieved for each of the applications.   Working at a block
87size of 256 bytes at a time rather than 128 bytes at a time,
[3868]88the bitstreams implementation scaled very well with reductions in
89instruction count over a factor of two in every case except for StarHeight.   
90Although a factor
[3500]91of two would seem an outside limit, we attribute the change to
92greater instruction efficiency. 
93AVX2 instructions use a
94non destructive three-operand
95form instead of the destructive two-operand form of SSE2.
[3504]96In the two-operand form, binary instructions must always use
[3500]97one of the source registers as a destination register.   As a
98result the SSE2 object code generates many data movement operations
99that are unnecessary with the AVX2 set.
101As expected, there was no observable reduction in instruction
102count with the recompiled grep and nrgrep applications.
[3515]111ylabel=AVX2 Speedup,
[3498]113tick label style={font=\tiny},
114enlarge x limits=0.15,
[3617]115%enlarge y limits={0.15, upper},
117legend style={at={(0.5,-0.15)},
118anchor=north,legend columns=-1},
120bar width=7pt,
[3659]123file {data/sse2-avx2-speedup-bitstreams.dat};
[3659]125file {data/sse2-avx2-speedup-nrgrep112.dat};
[3659]127file {data/sse2-avx2-speedup-gre2p.dat};
[3500]133\caption{AVX Speedup}\label{fig:AVXSpeedup}
[3500]136As shown in Figure \ref{fig:AVXSpeedup} the reduction in
[3870]137instruction count was reflected in a significant speedup
[3868]138in the bitstreams implementation in all cases except
[3870]139StarHeight.  However, the speedup was
[3642]140considerably less than expected. 
141The bitstreams code  on AVX2 has suffered from a considerable
142reduction in instructions per cycle compared to the SSE2
[3868]143implementation, likely indicating
[3642]144that our grep implementation has become memory-bound.
[3868]145However, the performance of StarHeight deserves particular
146comment, with an actual slowdown observed.   When moving
147to 256 positions at a time, the controlling while loops may
148require more iterations than working 128 positions at a time,
149because the iteration must continue as long as there are any
150pending markers in the block.   
[3642]151Nevertheless, the overall results on our AVX2 machine were quite encouraging,
[3500]152demonstrating very good scalability of the bitwise data-parallel approach.
[3862]153Significantly, the @ regular expression is matched at 0.63 cycles/byte
154using our AVX2 implementation indicating a significant reduction
155in the overhead cost of the Parabix transform.
[3870]157Table \ref{Xfactors} shows the final performance results
158showing the speedup factors achieved by the Parabix/AVX2 implementation
[3875]159vs nrgrep and gre2p.  We have also added comparison with GNU grep
160(version 2.16),
[3874]161as it is well known and sometimes used as a basis for comparisons.
[3873]164\begin{tabular}{|c|c|c|c|} \hline
165\multirow{2}{*}{Expression} & \multicolumn{3}{c|}{Bitstream grep Speedup} \\ \cline{2-4}
[3874]166& vs. nrgrep & vs. gre2p & vs. GNU grep -e\\ \hline \hline
[3873]167At & 3.5X & 34X & 1.6X\\ \hline
[3874]168Date & 0.76X & 13X & 48X\\ \hline
[3873]169Email & 9.5X & 28X & 12X\\ \hline
170URIorEmail & 6.6X & 27X & 518X\\ \hline
171HexBytes & 8.1X & 105X & 267X\\ \hline
172StarHeight & 1.9X & 7.6X & 97X\\ \hline
175\caption{Speedups Obtained}\label{Xfactors}
Note: See TracBrowser for help on using the repository browser.