source: docs/Working/re/sse2.tex @ 3658

Last change on this file since 3658 was 3658, checked in by cameron, 5 years ago

Redo all the sse2 data based on the cs-amoeba-n5 results

File size: 7.9 KB
Line 
1\section{SSE2 Implementation and Evaluation}\label{sec:SSE2}
2
3\begin{table*}[htbp]
4\begin{center}
5{
6\footnotesize
7\begin{tabular}{|l|l|}
8\hline
9Name            & Expression    \\ \hline   
10@               & \verb`@`              \\ \hline     
11Date            & \verb`([0-9][0-9]?)/([0-9][0-9]?)/([0-9][0-9]([0-9][0-9])?)`          \\ \hline     
12Email           & \verb`([^ @]+)@([^ @]+)`              \\ \hline
13URIOrEmail      & \verb`([a-zA-Z][a-zA-Z0-9]*)://([^ /]+)(/[^ ]*)?|([^ @]+)@([^ @]+)`           \\ \hline     
14HexBytes                & \verb`(^|[ ])0x([a-fA-F0-9][a-fA-F0-9])+[.:,?!]?($|[ ])`              \\ \hline
15\end{tabular}
16}
17\end{center}
18\caption{Regular Expressions}
19\label{RegularExpressions}
20\end{table*}
21
22
23\paragraph{Implementation Notes.}
24Our regular expression compiler directly uses the Parabix tool chain
25to compile regular expression into SSE2-based implementations.
26Our compiler essentially scripts three other compilers to perform
27this work: the Parabix character class compiler to determine basic
28bit stream equations for each of the character classes encountered
29in a regular expression, the Pablo bitstream equation compiler which
30converts equations to block-at-a-time C++ code for 128-bit SIMD, and
31gcc 4.6.3 to generate the binaries.   The Pablo output is combined
32with a {\tt grep\_template.cpp} file that arranges to read input
33files, break them into segments, and print out or count matches
34as they are encountered.
35
36\paragraph{Comparative Implementations.}
37We evaluate our bitwise data parallel implementation versus several
38alternatives.   
39We report data for two of these:
40gre2p
41%GNU grep version 2.10
42and nrgrep version 1.12.
43The gre2p program is a grep version implemented using the recently developed
44RE2 regular expression library, using a systematic DFA-based approach
45(as well as some NFA fallback techniques) \cite{cox2010RE2}.
46The NFA class is represented by nrgrep, one of the
47strongest competitors in regular expression matching performance.
48We also considered GNU grep 2.10, agrep 3.41 as and alternative NFA-based implementation
49and pcregrep 8.12 as a backtracking implementation, but do not
50report data for them. 
51GNU grep is a popular open-source implementation that is claimed to be
52primarily DFA-based with heuristics for important special cases.   
53The agrep implementation does not support
54some of the common regular expression syntax feature and is limited to
55patterns of at most 32 characters.   As a backtracking implementation,
56pcregrep supports more regular expression features, but is not
57competitive in performance in any example we tested.
58
59We performed our SSE2 performance study using an Intel Core i7 quad-core (Sandy Bridge) processor (3.40GHz, 4 physical cores, 8 threads (2 per core), 32+32 kB (per core) L1 cache, 256 kB (per core) L2 cache, 8 MB L3 cache) running the 64-bit version of Ubuntu 12.04 (Linux).
60
61
62\paragraph{Test expressions.}
63Each grep implementation is tested with the five regular expressions
64in Table \ref{RegularExpressions}.  @ simply matches the "@" character.  It demonstrates the overhead involved in matching the simplest regular expression.  Date, Email, and URIOrEmail provide examples of common uses for regular expression matching.
65HexBytes matches delimited. They are taken from Benchmark of Regex
66Libraries [http://lh3lh3.users.sourceforge.net/reb.shtml].
67HexBytes matches delimited byte strings in hexadecimal notation,
68enforcing the constraint that the number of hex digits is even.   This
69expression shows performance of a repetition operator implemented with
70a while loop.
71All tests are run on a concatenated version of the linux howto which is 39,422,105 bytes in length. 
72
73
74\paragraph{Results.}
75\begin{figure}
76\begin{center}
77\begin{tikzpicture}
78\begin{axis}[
79xtick=data,
80ylabel=Cycles per Byte,
81xticklabels={@,Date,Email,URIorEmail,HexBytes},
82tick label style={font=\tiny},
83enlarge x limits=0.15,
84%enlarge y limits={0.15, upper},
85ymin=0,
86legend style={at={(0.5,-0.15)},
87anchor=north,legend columns=-1},
88ymax=42,
89ybar,
90bar width=7pt,
91%visualization depends on=y \as \rawy,
92]
93\addplot
94file {data/cycles-bitstreams.dat};
95\addplot
96file {data/cycles-nrgrep112.dat};
97\addplot
98file {data/cycles-gre2p.dat};
99 
100\legend{bitstreams,nrgrep,gre2p,Annot}
101\end{axis}
102\end{tikzpicture}
103\end{center}
104\caption{Cycles per Byte}\label{fig:SSECyclesPerByte}
105\end{figure}
106
107Figure \ref{fig:SSECyclesPerByte} shows CPU cycles per input byte for
108each expression on each implementation.
109Both the bitstreams and the nrgrep implementation performed
110much better than the DFA-based gre2p, with the bitstreams
111implementation
112consistently an order of magnitude faster.
113
114For the Date expression, nrgrep outperforms our bitstreams
115implementation here.  In this case, nrgrep takes advantage of skipping:
116every time it encounters a character that can't appear in a date, it
117can skip past six characters. 
118For the more compicated expressions, it loses this advantage.
119
120The bitstreams implementation has fairly consistent performance.  As
121the regular expression complexity increases, the cycle count increases
122slowly.  The largest difference in cycles per byte for the bitstreams
123implementation is a ratio of 2 to 1.  The same cannot be said of gre2p
124or nrgrep, with gre2p exhibiting a 4 to 1 range over these expressions
125and
126nrgrep exhibiting a 10 to 1 range.
127
128\begin{figure}
129\begin{center}
130\begin{tikzpicture}
131\begin{axis}[
132xtick=data,
133ylabel=Instructions per Byte,
134xticklabels={@,Date,Email,URIorEmail,HexBytes},
135tick label style={font=\tiny},
136enlarge x limits=0.15,
137%enlarge y limits={0.15, upper},
138ymin=0,
139legend style={at={(0.5,-0.15)},
140anchor=north,legend columns=-1},
141ymax=110,
142ybar,
143bar width=7pt,
144]
145\addplot
146file {data/instructions-bitstreams.dat};
147\addplot
148file {data/instructions-nrgrep112.dat};
149\addplot
150file {data/instructions-gre2p.dat};
151 
152\legend{bitstreams,nrgrep,gre2p,Annot}
153\end{axis}
154\end{tikzpicture}
155\end{center}
156\caption{Instructions per Byte}\label{fig:SSEInstructionsPerByte}
157\end{figure}
158
159Figure \ref{fig:SSEInstructionsPerByte} shows instructions per byte.
160The relative values mirror cycles per byte.  The bitstreams
161implementation continues to show consistent performance.  This is
162especially noticeable in Figure \ref{fig:SSEInstructionsPerCycle},
163which shows instructions per cycle.  Both the bitstreams and
164gre2p implementations show relatively stable IPC characteristics,
165while nrgrep exhibits considerably more variation.
166 
167
168\begin{figure}
169\begin{center}
170\begin{tikzpicture}
171\begin{axis}[
172xtick=data,
173ylabel=Instructions per Cycle,
174xticklabels={@,Date,Email,URIorEmail,HexBytes},
175tick label style={font=\tiny},
176enlarge x limits=0.15,
177%enlarge y limits={0.15, upper},
178ymin=0,
179legend style={at={(0.5,-0.15)},
180anchor=north,legend columns=-1},
181ybar,
182bar width=7pt,
183]
184\addplot
185file {data/ipc-bitstreams.dat};
186\addplot
187file {data/ipc-nrgrep112.dat};
188\addplot
189file {data/ipc-gre2p.dat};
190
191\legend{bitstreams,nrgrep,gre2p,Annot}
192\end{axis}
193\end{tikzpicture}
194\end{center}
195\caption{Instructions per Cycle}\label{fig:SSEInstructionsPerCycle}
196\end{figure}
197
198\begin{figure}
199\begin{center}
200\begin{tikzpicture}
201\begin{axis}[
202xtick=data,
203ylabel=Branch Misses per Instruction,
204xticklabels={@,Date,Email,URIorEmail,HexBytes},
205tick label style={font=\tiny},
206enlarge x limits=0.15,
207%enlarge y limits={0.15, upper},
208ymin=0,
209legend style={at={(0.5,-0.15)},
210anchor=north,legend columns=-1},
211ybar,
212bar width=7pt,
213]
214\addplot
215file {data/branch-misses-bitstreams.dat};
216\addplot
217file {data/branch-misses-nrgrep112.dat};
218\addplot
219file {data/branch-misses-gre2p.dat};
220
221\legend{bitstreams,nrgrep,gre2p,Annot}
222\end{axis}
223\end{tikzpicture}
224\end{center}
225\caption{Branch Misses per Instruction}\label{fig:SSEBranchMisses}
226\end{figure}
227Figure \ref{fig:SSEBranchMisses} shows the branch misses per kilobyte.
228The bitstreams and gre2p implementations remains consistent here.
229Significant variation is seen with nrgrep.
230
231Overall, our performance is considerably better than both nrgrep and gre2p for the more complicated expressions that were tested.  Also, our performance scales smoothly with regular expression complexity so it can be expected to remain better for complicated expressions in general.
232
233
Note: See TracBrowser for help on using the repository browser.