source: docs/Working/re/sse2.tex @ 3870

Last change on this file since 3870 was 3870, checked in by cameron, 5 years ago

Formatting, PACT publication template

File size: 10.1 KB
Line 
1\section{SSE2 Implementation and Evaluation}\label{sec:SSE2}
2
3\begin{table*}[htbp]
4\begin{center}
5{
6\footnotesize
7\begin{tabular}{|l|l|}
8\hline
9Name            & Expression    \\ \hline   
10@               & \verb`@`              \\ \hline     
11Date            & \verb`([0-9][0-9]?)/([0-9][0-9]?)/([0-9][0-9]([0-9][0-9])?)`          \\ \hline     
12Email           & \verb`([^ @]+)@([^ @]+)`              \\ \hline
13URIOrEmail      & \verb`(([a-zA-Z][a-zA-Z0-9]*)://|mailto:)([^ /]+)(/[^ ]*)?|([^ @]+)@([^ @]+)`         \\ \hline     
14HexBytes                & \verb`[ ](0x)?([a-fA-F0-9][a-fA-F0-9])+[.:,?! ]`              \\ \hline
15StarHeight              & \verb`[A-Z]((([a-zA-Z]*a[a-zA-Z]*[ ])*[a-zA-Z]*e[a-zA-Z]*[ ])*[a-zA-Z]*s[a-zA-Z]*[ ])*[.?!]`          \\ \hline
16\end{tabular}
17}
18\end{center}
19\caption{Regular Expressions}
20\label{RegularExpressions}
21\end{table*}
22
23
24\paragraph{Implementation Notes.}
25Our regular expression compiler directly uses the Parabix tool chain
26to compile regular expression into SSE2-based implementations.
27Our compiler essentially scripts three other compilers to perform
28this work: the Parabix character class compiler to determine basic
29bit stream equations for each of the character classes encountered
30in a regular expression, the Pablo bitstream equation compiler which
31converts equations to block-at-a-time C++ code for 128-bit SIMD, and
32gcc 4.8.2 to generate the binaries.   The Pablo output is combined
33with a {\tt grep\_template.cpp} file that arranges to read input
34files, break them into segments, and print or count matches
35as they are encountered.
36
37\paragraph{Comparative Implementations.}
38We evaluate our bitwise data parallel implementation versus several
39alternatives.   
40We report data for two of these:
41gre2p
42%GNU grep version 2.10
43and nrgrep version 1.12.
44The gre2p program is a grep version implemented using the recently developed
45RE2 regular expression library, using a systematic DFA-based approach
46(as well as some NFA fallback techniques) \cite{cox2010RE2}.
47The NFA class is represented by nrgrep, one of the
48strongest competitors in regular expression matching performance.
49We also considered GNU grep 2.10, agrep 3.41 as an alternative NFA-based implementation
50and pcregrep 8.12 as a backtracking implementation, but do not
51report data for them. 
52GNU grep is a popular open-source implementation that is claimed to be
53primarily DFA-based with heuristics for important special cases.   
54The agrep implementation does not support
55some of the common regular expression syntax feature and is limited to
56patterns of at most 32 characters.   As a backtracking implementation,
57pcregrep supports more regular expression features, but is not
58competitive in performance in any example we tested.
59
60We performed our SSE2 performance study using an Intel Core i5-4570 (Haswell) processor
61(3.2 GHz, 4 physical cores, 32+32 kB (per core) L1 cache, 256 kB (per core) L2 cache, 6 MB L3 cache)
62running the 64-bit version of Ubuntu 12.04 (Linux).
63
64Our performance evaluation focuses on the running time of
65the regular expression matching process itself, excluding the
66preprocessing time for regular expression compilation.   
67However,
68the overhead of the Parabix transform to bit stream form is
69included in our reported results.
70
71
72\paragraph{Test Expressions.}
73Each grep implementation was evaluated
74against the five regular expressions shown
75in Table \ref{RegularExpressions}
76@ matches the at-sign character.
77This expression demonstrates the overhead involved
78in matching the simplest possible regular expression, a single character.
79Date, Email, and URIOrEmail provide examples of commonly used regular expression.
80This set of expressions were drawn from the \textit{Benchmark of Regex
81Libraries} (http://lh3lh3.users.sourceforge.net/reb.shtml).
82HexBytes matches delimited byte strings in hexadecimal notation, and
83enforces the constraint that the number of hex digits is even. This
84expression illustrates the performance
85of a repetition operator implemented using
86a while loop.
87All tests were run on a version
88of a \textit{Linux 3Dfx howto} 
89file of
9039,421,555 bytes. 
91
92
93\paragraph{Results.}
94\begin{figure}
95\begin{center}
96\begin{tikzpicture}
97\begin{axis}[
98xtick=data,
99ylabel=Cycles per Byte,
100xticklabels={@,Date,Email,URIorEmail,HexBytes,StarHeight},
101tick label style={font=\tiny},
102enlarge x limits=0.15,
103%enlarge y limits={0.15, upper},
104ymin=0,
105legend style={at={(0.5,-0.15)},
106anchor=north,legend columns=-1},
107ymax=42,
108ybar,
109bar width=7pt,
110%visualization depends on=y \as \rawy,
111]
112\addplot
113file {data/cycles-bitstreams.dat};
114\addplot
115file {data/cycles-nrgrep112.dat};
116\addplot
117file {data/cycles-gre2p.dat};
118 
119\legend{bitstreams,nrgrep,gre2p,Annot}
120\end{axis}
121\end{tikzpicture}
122\end{center}
123\caption{Cycles per Byte}\label{fig:SSECyclesPerByte}
124\end{figure}
125
126Figure~\ref{fig:SSECyclesPerByte} compares
127each of the grep implementations,
128with relative performance reported in CPU cycles per byte.
129
130The performance in matching the @ regular expression establishes
131the base line cost for regular expression processing.   All programs
132report 15,788 matching lines of the 1,086,077 lines in the file.
133The Parabix SSE2 implementation is clearly the fastest in this case with
134a cost of 0.95 CPU cycles per byte.    The bulk of this represents
135the overhead of the Parabix transform, the bitwise logic to
136calculate the single {\tt [@]} character class stream is relatively
137trivial.   It is interesting to note that
138this example does not represent a baseline cost for either
139nrgrep or gre2p, each of these benefit from character skipping
140optimizations in their implementations.
141
142Our results for the matching the Date expression to find the
143668 lines containing dates show an increase
144from 0.95 to 1.22 cycles per byte, corresponding to the additional
145logic for the regular expression matching steps according to our
146algorithm.    For this relatively simple expression, however, nrgrep
147outperforms our implementation by taking significant advantage
148of character skipping.   Each time that nrgrep encounters a character
149that cannot appear in a date it jumps six character positions rather
150than searching every character in the input text.  gre2p also shows
151a significant benefit from the character skipping optimization.
152
153The results for the Email expression illustrate the relative
154advantage of the Parabix method when the expression to be matched
155does not permit character skipping in the NFA- or DFA-based
156implementations.   In this example, our implementation outperforms
157nrgrep by a factor of 7X, and gre2p by 23X.   There are 15,057
158lines matching the Email regex.
159
160The URIorEmail expression illustrates the performance of the
161grep programs with additional regular expression complexity.
162As expressions get larger, the number of steps required by
163the Parabix implementation increases, so the performance
164advantage drops to about 4.5X over nrgrep and 19X over gre2p.
16532557 lines are matched by the URIorEmail regex.
166
167The results for HexBytes expression illustrate Parabix performance
168involving a Kleene-+ operator compiled to a while loop. 
169Our implementation uses just 1.6 cycles per byte to find the
170130,243 matching lines.    The gre2p program performs
171quite poorly here, slower than the Parabix implementation
172by about 70X, while nrgrep is about 5.5X slower.
173
174StarHeight is an artificial expression created to stress the Parabix
175implementation with a triply-nested while loop.   The performance
176does drop off, maintaining only a 2X advantage over nrgrep.
177
178
179
180
181\begin{figure}
182\begin{center}
183\begin{tikzpicture}
184\begin{axis}[
185xtick=data,
186ylabel=Instructions per Byte,
187xticklabels={@,Date,Email,URIorEmail,HexBytes,StarHeight},
188tick label style={font=\tiny},
189enlarge x limits=0.15,
190%enlarge y limits={0.15, upper},
191ymin=0,
192legend style={at={(0.5,-0.15)},
193anchor=north,legend columns=-1},
194ymax=110,
195ybar,
196bar width=7pt,
197]
198\addplot
199file {data/instructions-bitstreams.dat};
200\addplot
201file {data/instructions-nrgrep112.dat};
202\addplot
203file {data/instructions-gre2p.dat};
204 
205\legend{bitstreams,nrgrep,gre2p,Annot}
206\end{axis}
207\end{tikzpicture}
208\end{center}
209\caption{Instructions per Byte}\label{fig:SSEInstructionsPerByte}
210\end{figure}
211
212Figure~\ref{fig:SSEInstructionsPerByte} illustrates
213relative performance reported in instructions per byte.
214Figure~\ref{fig:SSEInstructionsPerByte} 
215mirrors Figure~\ref{fig:SSECyclesPerByte}. The bitstreams
216implementation demonstrates stable performance
217characteristics across each of the test expressions. The
218gre2p implementation also shows
219relatively stable characteristics, whereas,
220nrgrep exhibits greater variability.
221 
222\begin{figure}
223\begin{center}
224\begin{tikzpicture}
225\begin{axis}[
226xtick=data,
227ylabel=Instructions per Cycle,
228xticklabels={@,Date,Email,URIorEmail,HexBytes,StarHeight},
229tick label style={font=\tiny},
230enlarge x limits=0.15,
231%enlarge y limits={0.15, upper},
232ymin=0,
233legend style={at={(0.5,-0.15)},
234anchor=north,legend columns=-1},
235ybar,
236bar width=7pt,
237]
238\addplot
239file {data/ipc-bitstreams.dat};
240\addplot
241file {data/ipc-nrgrep112.dat};
242\addplot
243file {data/ipc-gre2p.dat};
244
245\legend{bitstreams,nrgrep,gre2p,Annot}
246\end{axis}
247\end{tikzpicture}
248\end{center}
249\caption{Instructions per Cycle}\label{fig:SSEInstructionsPerCycle}
250\end{figure}
251
252\begin{figure}
253\begin{center}
254\begin{tikzpicture}
255\begin{axis}[
256xtick=data,
257ylabel=Branch Misses per Instruction,
258xticklabels={@,Date,Email,URIorEmail,HexBytes,StarHeight},
259tick label style={font=\tiny},
260enlarge x limits=0.15,
261%enlarge y limits={0.15, upper},
262ymin=0,
263legend style={at={(0.5,-0.15)},
264anchor=north,legend columns=-1},
265ybar,
266bar width=7pt,
267]
268\addplot
269file {data/branch-misses-bitstreams.dat};
270\addplot
271file {data/branch-misses-nrgrep112.dat};
272\addplot
273file {data/branch-misses-gre2p.dat};
274
275\legend{bitstreams,nrgrep,gre2p,Annot}
276\end{axis}
277\end{tikzpicture}
278\end{center}
279\caption{Branch Misses per Instruction}\label{fig:SSEBranchMisses}
280\end{figure}
281Figure \ref{fig:SSEBranchMisses} reports branch misses per kilobyte.
282The bitstreams and gre2p implementations remain consistent.
283Significant variation is seen with nrgrep.
284
285Overall, the bitstreams implementation significantly outperformed
286both nrgrep and gre2p. In addition, the performance of bitstreams
287scaled smoothly with regular expression complexity. As Section~\ref{sec:analysis}
288describes, we anticipate that the performance of bitstreams
289will to continue to scale smoothly with
290regular expression complexity.
291
292
Note: See TracBrowser for help on using the repository browser.