source: docs/Working/re/sse2.tex @ 3880

Last change on this file since 3880 was 3880, checked in by cameron, 5 years ago

Various formatting items

File size: 10.2 KB
Line 
1\section{SSE2 Implementation and Evaluation}\label{sec:SSE2}
2
3\begin{table*}[htbp]
4\begin{center}
5{
6\footnotesize
7\begin{tabular}{|l|l|}
8\hline
9Name            & Expression    \\ \hline   
10@               & \verb`@`              \\ \hline     
11Date            & \verb`([0-9][0-9]?)/([0-9][0-9]?)/([0-9][0-9]([0-9][0-9])?)`          \\ \hline     
12Email           & \verb`([^ @]+)@([^ @]+)`              \\ \hline
13URIOrEmail      & \verb`(([a-zA-Z][a-zA-Z0-9]*)://|mailto:)([^ /]+)(/[^ ]*)?|([^ @]+)@([^ @]+)`         \\ \hline     
14HexBytes                & \verb`[ ](0x)?([a-fA-F0-9][a-fA-F0-9])+[.:,?! ]`              \\ \hline
15StarHeight              & \verb`[A-Z]((([a-zA-Z]*a[a-zA-Z]*[ ])*[a-zA-Z]*e[a-zA-Z]*[ ])*[a-zA-Z]*s[a-zA-Z]*[ ])*[.?!]`          \\ \hline
16\end{tabular}
17}
18\end{center}
19\caption{Regular Expressions}
20\label{RegularExpressions}
21\end{table*}
22
23
24\paragraph*{Implementation Notes} Our regular expression compiler directly uses the Parabix tool chain
25to compile regular expression into SSE2-based implementations.
26Our compiler essentially scripts three other compilers to perform
27this work: the Parabix character class compiler to determine basic
28bit stream equations for each of the character classes encountered
29in a regular expression, the Pablo bitstream equation compiler which
30converts equations to block-at-a-time C++ code for 128-bit SIMD, and
31gcc 4.8.2 to generate the binaries.   The Pablo output is combined
32with a {\tt grep\_template.cpp} file that arranges to read input
33files, break them into segments, and print or count matches
34as they are encountered.
35
36\paragraph*{Comparative Implementations} We evaluate our bitwise data parallel implementation versus several
37alternatives.   
38We report data for two of these:
39gre2p
40%GNU grep version 2.10
41and nrgrep version 1.12.
42The gre2p program is a grep version implemented using the recently developed
43RE2 regular expression library, using a systematic DFA-based approach
44(as well as some NFA fallback techniques) \cite{cox2010RE2}.
45The NFA class is represented by nrgrep, one of the
46strongest competitors in regular expression matching performance.
47We also considered GNU grep 2.10, agrep 3.41 as an alternative NFA-based implementation
48and pcregrep 8.12 as a backtracking implementation, but do not
49report data for them. 
50GNU grep is a popular open-source implementation that is claimed to be
51primarily DFA-based with heuristics for important special cases.   
52The agrep implementation does not support
53some of the common regular expression syntax feature and is limited to
54patterns of at most 32 characters.   As a backtracking implementation,
55pcregrep supports more regular expression features, but is not
56competitive in performance in any example we tested.
57
58We performed our SSE2 performance study using an Intel Core i5-4570 (Haswell) processor
59(3.2 GHz, 4 physical cores, 32+32 kB (per core) L1 cache, 256 kB (per core) L2 cache, 6 MB L3 cache)
60running the 64-bit version of Ubuntu 12.04 (Linux).
61
62Our performance evaluation focuses on the running time of
63the regular expression matching process itself, excluding the
64preprocessing time for regular expression compilation.   
65However,
66the overhead of the Parabix transform to bit stream form is
67included in our reported results.
68
69
70\paragraph*{Test Expressions} Each grep implementation was evaluated
71against the five regular expressions shown
72in Table \ref{RegularExpressions}
73@ matches the at-sign character.
74This expression demonstrates the overhead involved
75in matching the simplest possible regular expression, a single character.
76Date, Email, and URIOrEmail provide examples of commonly used regular expression.
77This set of expressions were modified from the \textit{Benchmark of Regex
78Libraries} (http://lh3lh3.users.sourceforge.net/reb.shtml).
79HexBytes matches delimited byte strings in hexadecimal notation, and
80enforces the constraint that the number of hex digits is even. This
81expression illustrates the performance
82of a repetition operator implemented using
83a while loop.  StarHeight is an expression designed to further
84stress our while loop implementation with 4 levels of Kleene closure.
85All tests were run on a version
86of a \textit{Linux 3Dfx howto} 
87file of
8839,421,555 bytes. 
89
90
91\paragraph*{Results}
92\begin{figure}
93\begin{center}
94\begin{tikzpicture}
95\begin{axis}[
96xtick=data,
97ylabel=Cycles per Byte,
98xticklabels={@,Date,Email,URIorEmail,HexBytes,StarHeight},
99tick label style={font=\tiny},
100enlarge x limits=0.15,
101%enlarge y limits={0.15, upper},
102ymin=0,
103legend style={at={(0.5,-0.15)},
104anchor=north,legend columns=-1},
105ymax=42,
106ybar,
107bar width=7pt,
108%visualization depends on=y \as \rawy,
109]
110\addplot
111file {data/cycles-bitstreams.dat};
112\addplot
113file {data/cycles-nrgrep112.dat};
114\addplot
115file {data/cycles-gre2p.dat};
116 
117\legend{bitstreams,nrgrep,gre2p,Annot}
118\end{axis}
119\end{tikzpicture}
120\end{center}
121\caption{Cycles per Byte}\label{fig:SSECyclesPerByte}
122\end{figure}
123
124Figure~\ref{fig:SSECyclesPerByte} compares
125each of the grep implementations,
126with relative performance reported in CPU cycles per byte.
127
128The performance in matching the @ regular expression establishes
129the base line cost for regular expression processing.   All programs
130report 15,788 matching lines of the 1,086,077 lines in the file.
131The Parabix SSE2 implementation is clearly the fastest in this case with
132a cost of 0.95 CPU cycles per byte.    The bulk of this represents
133the overhead of the Parabix transform, the bitwise logic to
134calculate the single {\tt [@]} character class stream is relatively
135trivial.   It is interesting to note that
136this example does not represent a baseline cost for either
137nrgrep or gre2p, each of these benefit from character skipping
138optimizations in their implementations.
139
140Our results for the matching the Date expression to find the
141668 lines containing dates show an increase
142from 0.95 to 1.22 cycles per byte, corresponding to the additional
143logic for the regular expression matching steps according to our
144algorithm.    For this relatively simple expression, however, nrgrep
145outperforms our implementation by taking significant advantage
146of character skipping.   Each time that nrgrep encounters a character
147that cannot appear in a date it jumps six character positions rather
148than searching every character in the input text.  gre2p also shows
149a significant benefit from the character skipping optimization.
150
151The results for the Email expression illustrate the relative
152advantage of the Parabix method when the expression to be matched
153does not permit character skipping in the NFA- or DFA-based
154implementations.   In this example, our implementation outperforms
155nrgrep by a factor of 7X, and gre2p by 23X.   There are 15,057
156lines matching the Email regex.
157
158The URIorEmail expression illustrates the performance of the
159grep programs with additional regular expression complexity.
160As expressions get larger, the number of steps required by
161the Parabix implementation increases, so the performance
162advantage drops to about 4.5X over nrgrep and 19X over gre2p.
16332557 lines are matched by the URIorEmail regex.
164
165The results for HexBytes expression illustrate Parabix performance
166involving a Kleene-+ operator compiled to a while loop. 
167Our implementation uses just 1.6 cycles per byte to find the
168130,243 matching lines.    The gre2p program performs
169quite poorly here, slower than the Parabix implementation
170by about 70X, while nrgrep is about 5.5X slower.
171
172StarHeight is an artificial expression created to stress the Parabix
173implementation with a triply-nested while loop.   The performance
174does drop off, maintaining only a 2X advantage over nrgrep.
175
176
177
178
179\begin{figure}
180\begin{center}
181\begin{tikzpicture}
182\begin{axis}[
183xtick=data,
184ylabel=Instructions per Byte,
185xticklabels={@,Date,Email,URIorEmail,HexBytes,StarHeight},
186tick label style={font=\tiny},
187enlarge x limits=0.15,
188%enlarge y limits={0.15, upper},
189ymin=0,
190legend style={at={(0.5,-0.15)},
191anchor=north,legend columns=-1},
192ymax=110,
193ybar,
194bar width=7pt,
195]
196\addplot
197file {data/instructions-bitstreams.dat};
198\addplot
199file {data/instructions-nrgrep112.dat};
200\addplot
201file {data/instructions-gre2p.dat};
202 
203\legend{bitstreams,nrgrep,gre2p,Annot}
204\end{axis}
205\end{tikzpicture}
206\end{center}
207\caption{Instructions per Byte}\label{fig:SSEInstructionsPerByte}
208\end{figure}
209
210Figure~\ref{fig:SSEInstructionsPerByte} illustrates
211relative performance reported in instructions per byte.
212Figure~\ref{fig:SSEInstructionsPerByte} 
213mirrors Figure~\ref{fig:SSECyclesPerByte}. The bitstreams
214implementation demonstrates stable performance
215characteristics across each of the test expressions. The
216gre2p implementation also shows
217relatively stable characteristics, whereas,
218nrgrep exhibits greater variability.
219 
220\begin{figure}
221\begin{center}
222\begin{tikzpicture}
223\begin{axis}[
224xtick=data,
225ylabel=Instructions per Cycle,
226xticklabels={@,Date,Email,URIorEmail,HexBytes,StarHeight},
227tick label style={font=\tiny},
228enlarge x limits=0.15,
229%enlarge y limits={0.15, upper},
230ymin=0,
231legend style={at={(0.5,-0.15)},
232anchor=north,legend columns=-1},
233ybar,
234bar width=7pt,
235]
236\addplot
237file {data/ipc-bitstreams.dat};
238\addplot
239file {data/ipc-nrgrep112.dat};
240\addplot
241file {data/ipc-gre2p.dat};
242
243\legend{bitstreams,nrgrep,gre2p,Annot}
244\end{axis}
245\end{tikzpicture}
246\end{center}
247\caption{Instructions per Cycle}\label{fig:SSEInstructionsPerCycle}
248\end{figure}
249
250\begin{figure}
251\begin{center}
252\begin{tikzpicture}
253\begin{axis}[
254xtick=data,
255ylabel=Branch Misses per Instruction,
256xticklabels={@,Date,Email,URIorEmail,HexBytes,StarHeight},
257tick label style={font=\tiny},
258enlarge x limits=0.15,
259%enlarge y limits={0.15, upper},
260ymin=0,
261legend style={at={(0.5,-0.15)},
262anchor=north,legend columns=-1},
263ybar,
264bar width=7pt,
265]
266\addplot
267file {data/branch-misses-bitstreams.dat};
268\addplot
269file {data/branch-misses-nrgrep112.dat};
270\addplot
271file {data/branch-misses-gre2p.dat};
272
273\legend{bitstreams,nrgrep,gre2p,Annot}
274\end{axis}
275\end{tikzpicture}
276\end{center}
277\caption{Branch Misses per Instruction}\label{fig:SSEBranchMisses}
278\end{figure}
279Figure \ref{fig:SSEBranchMisses} reports branch misses per kilobyte.
280The bitstreams and gre2p implementations remain consistent.
281Significant variation is seen with nrgrep.
282
283Overall, the bitstreams implementation significantly outperformed
284both nrgrep and gre2p. In addition, the performance of bitstreams
285scaled smoothly with regular expression complexity. As Section~\ref{sec:analysis}
286describes, we anticipate that the performance of bitstreams
287will to continue to scale smoothly with
288regular expression complexity.
289
290
Note: See TracBrowser for help on using the repository browser.