source: docs/Working/re/avx2.tex @ 3509

Last change on this file since 3509 was 3509, checked in by cameron, 6 years ago

updates for SSE2 study

File size: 5.7 KB
Line 
1\section{SIMD Scalability}\label{sec:AVX2}
2
3
4Although commodity processors have provided 128-bit SIMD operations for
5more than a decade, the extension to 256-bit integer SIMD operations
6has just recently taken place with the availability of AVX2
7instructions in Intel Haswell architecture chips as of mid 2013.
8This provides an excellent opportunity to assess the scalability
9of the bitwise data-parallel approach to regular expression matching.
10
11For the most part, adapting the Parabix tool chain to the new AVX2
12instructions was straightforward.   This mostly involved regenerating
13library functions using the new AVX2 intrinsics.   There were minor
14issues in the core transposition algorithm because the doublebyte-to-byte
15pack instructions are confined to independent operations within two
16128-bit lanes. 
17
18
19\subsection{AVX Stream Addition}
20 \begin{figure*}[tbh]
21\begin{center}
22\begin{verbatim}
23void add_ci_co(bitblock_t x, bitblock_t y, carry_t carry_in, carry_t & carry_out, bitblock_t & sum) {
24  bitblock_t all_ones = simd256<1>::constant<1>();
25  bitblock_t gen = simd_and(x, y);
26  bitblock_t prop = simd_xor(x, y);
27  bitblock_t partial_sum = simd256<64>::add(x, y);
28  bitblock_t carry = simd_or(gen, simd_andc(prop, partial_sum));
29  bitblock_t bubble = simd256<64>::eq(partial_sum, all_ones);
30  uint64_t carry_mask = hsimd256<64>::signmask(carry) * 2 + convert(carry_in);
31  uint64_t bubble_mask = hsimd256<64>::signmask(bubble);
32  uint64_t carry_scan_thru_bubbles = (carry_mask + bubble_mask) &~ bubble_mask;
33  uint64_t increments = carry_scan_thru_bubbles | (carry_scan_thru_bubbles - carry_mask);
34  carry_out = convert(increments >> 4);
35  uint64_t spread = 0x0000200040008001 * increments & 0x0001000100010001;
36  sum = simd256<64>::add(partial_sum, _mm256_cvtepu16_epi64(avx_select_lo128(convert(spread))));
37}
38
39\end{verbatim}
40
41\end{center}
42\caption{AVX2 256-bit Addition}
43\label{fig:AVX2add}
44
45\end{figure*}
46
47Bitstream addition at the 256-bit block size was implemented using the
48long-stream addition technique.   Figure \ref{fig:AVX2add} shows our
49implementation.  Spreading bits from the calculated increments mask
50was achieved somewhat awkwardly with a 64-bit multiply to spread
51into 16-bit fields followed by SIMD zero extend of the 16-bit fields
52to 64-bits each.
53
54We also compiled new versions of the {\tt grep} and {\tt nrgrep} programs
55using the {\tt -march=core-avx2} flag in case the compiler is able
56to vectorize some of the code.
57
58\begin{figure}
59\begin{center}
60\begin{tikzpicture}
61\begin{axis}[
62xtick=data,
63ylabel=AVX Instruction Reduction,
64xticklabels={@,Date,Email,URIorEmail,xquote},
65tick label style={font=\tiny},
66enlarge x limits=0.15,
67enlarge y limits={0.15, upper},
68ymin=0,
69legend style={at={(0.5,-0.15)},
70anchor=north,legend columns=-1},
71ybar,
72bar width=7pt,
73]
74\addplot[fill=black]
75file {data/avxinstructions1.dat};
76\addplot[fill=gray]
77file {data/avxinstructions2.dat};
78\addplot[fill=white]
79file {data/avxinstructions3.dat};
80
81\legend{Bitstreams,NRGrep,Grep,Annot}
82\end{axis}
83\end{tikzpicture}
84\end{center}
85\caption{Instruction Reduction}\label{fig:AVXInstrReduction}
86\end{figure}
87
88
89Figure \ref{fig:AVXInstrReduction} shows the reduction in instruction
90count achieved for each of the applications.   Working at a block
91size of 256 bytes at a time rather than 128 bytes at a time,
92the bitstreams implementation scaled dramatically well with reductions in
93instruction count over a factor of two in each case.   Although a factor
94of two would seem an outside limit, we attribute the change to
95greater instruction efficiency. 
96AVX2 instructions use a
97non destructive three-operand
98form instead of the destructive two-operand form of SSE2.
99In the two-operand form, binary instructions must always use
100one of the source registers as a destination register.   As a
101result the SSE2 object code generates many data movement operations
102that are unnecessary with the AVX2 set.
103
104As expected, there was no observable reduction in instruction
105count with the recompiled grep and nrgrep applications.
106
107
108
109\begin{figure}
110\begin{center}
111\begin{tikzpicture}
112\begin{axis}[
113xtick=data,
114ylabel=AVX Speedup,
115xticklabels={@,Date,Email,URIorEmail,xquote},
116tick label style={font=\tiny},
117enlarge x limits=0.15,
118enlarge y limits={0.15, upper},
119ymin=0,
120legend style={at={(0.5,-0.15)},
121anchor=north,legend columns=-1},
122ybar,
123bar width=7pt,
124]
125\addplot[fill=black]
126file {data/avxcycles1.dat};
127\addplot[fill=gray]
128file {data/avxcycles2.dat};
129\addplot[fill=white]
130file {data/avxcycles3.dat};
131
132\legend{Bitstreams,NRGrep,Grep,Annot}
133\end{axis}
134\end{tikzpicture}
135\end{center}
136\caption{AVX Speedup}\label{fig:AVXSpeedup}
137\end{figure}
138
139As shown in Figure \ref{fig:AVXSpeedup} the reduction in
140instruction count was reflected in a considerable speed-up
141in the bitstreams implementation.  However, the speed-up was
142considerably less than expected.  As shown in \label{fig:AVXIPC}
143the AVX2 version has lost some of the superscalar efficiency
144of the SSE2 code.   This is a performance debugging issue
145that we have yet to resolve.
146
147
148\begin{figure}
149\begin{center}
150\begin{tikzpicture}
151\begin{axis}[
152xtick=data,
153ylabel=Change in Instructions per Cycle,
154xticklabels={@,Date,Email,URIorEmail,xquote},
155tick label style={font=\tiny},
156enlarge x limits=0.15,
157enlarge y limits={0.15, upper},
158ymin=0,
159legend style={at={(0.5,-0.15)},
160anchor=north,legend columns=-1},
161ybar,
162bar width=7pt,
163]
164\addplot[fill=black]
165file {data/avxipc1.dat};
166\addplot[fill=gray]
167file {data/avxipc2.dat};
168\addplot[fill=white]
169file {data/avxipc3.dat};
170
171
172
173\legend{Bitstreams,NRGrep,Grep,Annot}
174\end{axis}
175\end{tikzpicture}
176\end{center}
177\caption{Change in Instructions Per Cycle With AVX}\label{fig:AVXIPC}
178\end{figure}
179
180Overall, the results on our AVX2 machine were quite good,
181demonstrating very good scalability of the bitwise data-parallel approach.
182
183
Note: See TracBrowser for help on using the repository browser.