source: docs/PACT14/sse2.tex @ 4561

Last change on this file since 4561 was 3896, checked in by cameron, 5 years ago

Clean-ups

File size: 9.5 KB
Line 
1\section{SSE2 Implementation}\label{sec:SSE2}
2
3\begin{table*}[htbp]
4\begin{center}
5{
6\footnotesize
7\begin{tabular}{|l|l|}
8\hline
9Name            & Expression    \\ \hline   
10@               & \verb`@`              \\ \hline     
11Date            & \verb`([0-9][0-9]?)/([0-9][0-9]?)/([0-9][0-9]([0-9][0-9])?)`          \\ \hline     
12Email           & \verb`([^ @]+)@([^ @]+)`              \\ \hline
13URI     & \verb`(([a-zA-Z][a-zA-Z0-9]*)://|mailto:)([^ /]+)(/[^ ]*)?|([^ @]+)@([^ @]+)`         \\ \hline     
14Hex             & \verb`[ ](0x)?([a-fA-F0-9][a-fA-F0-9])+[.:,?! ]`              \\ \hline
15StarHeight              & \verb`[A-Z]((([a-zA-Z]*a[a-zA-Z]*[ ])*[a-zA-Z]*e[a-zA-Z]*[ ])*[a-zA-Z]*s[a-zA-Z]*[ ])*[.?!]`          \\ \hline
16\end{tabular}
17}
18\end{center}
19\caption{Regular Expressions}
20\label{RegularExpressions}
21\end{table*}
22
23
24\paragraph*{Implementation Notes} Our regular expression compiler directly uses the Parabix tool chain
25to compile regular expression into SSE2-based implementations.
26Our compiler essentially scripts three other compilers to perform
27this work: the Parabix character class compiler to determine basic
28bit stream equations for each of the character classes encountered
29in a regular expression, the Pablo bitstream equation compiler which
30converts equations to block-at-a-time C++ code for 128-bit SIMD, and
31gcc 4.8.2 to generate the binaries.   The Pablo output is combined
32with a {\tt grep\_template.cpp} file that arranges to read input
33files, break them into segments, and print or count matches
34as they are encountered.
35
36\paragraph*{Comparative Implementations} We evaluate our bitwise data parallel implementation versus several
37alternatives.   
38We report data for two of these:
39gre2p
40%GNU grep version 2.10
41and nrgrep version 1.12.
42The gre2p program is a grep version implemented using the recently developed
43RE2 regular expression library, using a systematic DFA-based approach
44(as well as some NFA fallback techniques) \cite{cox2010RE2}.
45The NFA class is represented by nrgrep, one of the
46strongest competitors in regular expression matching performance.
47We also considered GNU grep 2.10, agrep 3.41 as an alternative NFA-based implementation
48and pcregrep 8.12 as a backtracking implementation, but do not
49report data for them. 
50GNU grep is a popular open-source implementation that is claimed to be
51primarily DFA-based with heuristics for important special cases.   
52The agrep implementation does not support
53some of the common regular expression syntax features and is limited to
54patterns of at most 32 characters.   As a backtracking implementation,
55pcregrep supports more regular expression features, but is not
56competitive in performance in any example we tested.
57
58We performed our SSE2 performance study using an Intel Core i5-4570 (Haswell) processor
59(3.2 GHz, 4 physical cores, 32+32 kB (per core) L1 cache, 256 kB (per core) L2 cache, 6 MB L3 cache)
60running the 64-bit version of Ubuntu 12.04 (Linux).
61
62Our performance evaluation focuses on the running time of
63the regular expression matching process itself, excluding the
64preprocessing time for regular expression compilation.   
65However,
66the overhead of the Parabix transform to bit stream form is
67included in our reported results.
68
69
70\paragraph*{Test Expressions} Each grep implementation was evaluated
71against the six regular expressions shown
72in Table \ref{RegularExpressions}
73@ matches the at-sign character.
74This expression demonstrates the overhead involved
75in matching the simplest possible regular expression, a single character.
76Date, Email, and URI provide examples of commonly used regular expression.
77This set of expressions were modified from the \textit{Benchmark of Regex
78Libraries}.
79Hex matches delimited byte strings in hexadecimal notation, and
80enforces the constraint that the number of hex digits is even. This
81expression illustrates the performance
82of a repetition operator implemented using
83a while loop in our system.  StarHeight is an artificial expression designed to further
84stress while loop implementation with 4 levels of Kleene closure.
85All tests were run on a version
86of a \textit{Linux 3Dfx howto} 
87file of
8839,421,555 bytes. 
89
90
91\paragraph*{Results}
92\begin{figure}
93\begin{center}
94\begin{tikzpicture}
95\begin{axis}[
96xtick=data,
97ylabel=Cycles per Byte,
98xticklabels={@,Date,Email,URI,Hex,StarHeight},
99tick label style={font=\tiny},
100enlarge x limits=0.15,
101%enlarge y limits={0.15, upper},
102ymin=0,
103legend style={at={(0.5,-0.15)},
104anchor=north,legend columns=-1},
105ymax=44,
106ybar,
107bar width=7pt,
108cycle list = {black,black!70,black!40,black!10}
109%visualization depends on=y \as \rawy,
110]
111\addplot+[]
112file {data/cycles-bitstreams.dat};
113\addplot+[fill,text=black]
114file {data/cycles-nrgrep112.dat};
115\addplot+[fill,,text=black]
116file {data/cycles-gre2p.dat};
117 
118\legend{bitstreams,nrgrep,gre2p,Annot}
119\end{axis}
120\end{tikzpicture}
121\end{center}
122\caption{Cycles per Byte}\label{fig:SSECyclesPerByte}
123\end{figure}
124
125Figure~\ref{fig:SSECyclesPerByte} compares
126each of the grep implementations,
127with relative performance reported in CPU cycles per byte.
128
129The performance in matching the @ regular expression establishes
130the base line cost for regular expression processing.   All programs
131report 15,788 matching lines of the 1,086,077 lines in the file.
132The Parabix SSE2 implementation is clearly the fastest in this case with
133a cost of 0.95 CPU cycles per byte.    The bulk of this represents
134the overhead of the Parabix transform, the bitwise logic to
135calculate the single {\tt [@]} character class stream is relatively
136trivial.   It is interesting to note that
137this example does not represent a baseline cost for either
138nrgrep or gre2p, each of these benefit from character skipping
139optimizations in their implementations.
140
141Our results for the matching the Date expression to find the
142668 lines containing dates show an increase
143from 0.95 to 1.22 cycles per byte, corresponding to the additional
144logic for the regular expression matching steps according to our
145algorithm.    For this relatively simple expression, however, nrgrep
146outperforms our implementation by taking significant advantage
147of character skipping.   Each time that nrgrep encounters a character
148that cannot appear in a date it jumps six character positions rather
149than searching every character in the input text.  gre2p also shows
150a significant benefit from the character skipping optimization.
151
152The results for the Email expression illustrate the relative
153advantage of the bitstreams method when the expression to be matched
154does not permit character skipping in the NFA- or DFA-based
155implementations.   In this example, our implementation outperforms
156nrgrep by a factor of 7X, and gre2p by 23X.   There are 15,057
157lines matching the Email regex.
158
159The URI expression illustrates the performance of the
160grep programs with additional regular expression complexity.
161All three implementations require more time than for the
162Email expression, with similar but slightly lower performance ratios maintained.
163In this example, the performance advantage of the bitstreams implementation
164drops to about 4.5X over nrgrep and 19X over gre2p.
16532557 lines are matched by the URI regex.
166
167The results for Hex expression illustrate the bitstreams performance
168in the case of a Kleene-+ operator compiled to a while loop.
169Performance is nevertheless quite good;
170our implementation uses just 1.6 cycles per byte to find the
171130,243 matching lines.    The gre2p program performs
172quite poorly here, off the chart at 114 cycles per byte.
173This is lower than the bitstreams implementation
174by about 70X.   In this example, nrgrep maintains its relative performance
175to the bitstreams implementation, about 5.5X slower.
176
177A more complex triply-nested repetition structure is required by
178the bitstreams implementation of the StarHeight expression. 
179In this case, there is a noticeable drop off in the
180performance advantage of the bitstreams implementation over
181the nrgrep and gre2p.   Nevertheless a 2X
182advantage over nrgrep is still observed.
183
184\begin{figure}
185\begin{center}
186\begin{tikzpicture}
187\begin{axis}[
188xtick=data,
189ylabel=Instructions per Cycle,
190xticklabels={@,Date,Email,URI,Hex,StarHeight},
191tick label style={font=\tiny},
192enlarge x limits=0.15,
193%enlarge y limits={0.15, upper},
194ymin=0,
195legend style={at={(0.5,-0.15)},
196anchor=north,legend columns=-1},
197ybar,
198bar width=7pt,
199cycle list = {black,black!70,black!40,black!10}
200]
201\addplot+[]
202file {data/ipc-bitstreams.dat};
203\addplot+[fill,text=black]
204file {data/ipc-nrgrep112.dat};
205\addplot+[fill,text=black]
206file {data/ipc-gre2p.dat};
207
208\legend{bitstreams,nrgrep,gre2p,Annot}
209\end{axis}
210\end{tikzpicture}
211\end{center}
212\caption{Instructions per Cycle}\label{fig:SSEInstructionsPerCycle}
213\end{figure}
214
215Figure \ref{fig:SSEInstructionsPerCycle} shows the efficiency of
216processor resource usage
217achieved by the three programs on each of the test expression in
218terms of instructions per cycle (IPC).
219For the first four expressions, in particular,
220the bitstreams implementation uses the processor resources quite efficiently,
221avoiding penalties due to cache misses and branch mispredictions.   
222However, with the while loop
223structures in processing the Hex and StarHeight expressions, branch
224mispredictions increase considerably and there is a noticeable
225drop-off in IPC.   The gre2p program suffers from significant penalties
226for the smaller expressions, but otherwise achieves a good IPC rate.
227On the other hand, nrgrep IPC drops off with expression complexity,
228suffering from significant penalties due to 
229mispredictions in the character-skipping logic and cache misses in
230table lookups.
231
232Overall, the bitstreams SSE2 implementation significantly outperformed
233both nrgrep and gre2p. In addition, the performance of bitstreams
234generally scales well with regular expression complexity, although
235nested Kleene closures are an issue.
236
Note: See TracBrowser for help on using the repository browser.