source: docs/Working/re/sse2.tex @ 3893

Last change on this file since 3893 was 3893, checked in by linmengl, 5 years ago

bar charts black-and-white printer friendly

File size: 9.3 KB
Line 
1\section{SSE2 Implementation}\label{sec:SSE2}
2
3\begin{table*}[htbp]
4\begin{center}
5{
6\footnotesize
7\begin{tabular}{|l|l|}
8\hline
9Name            & Expression    \\ \hline   
10@               & \verb`@`              \\ \hline     
11Date            & \verb`([0-9][0-9]?)/([0-9][0-9]?)/([0-9][0-9]([0-9][0-9])?)`          \\ \hline     
12Email           & \verb`([^ @]+)@([^ @]+)`              \\ \hline
13URI     & \verb`(([a-zA-Z][a-zA-Z0-9]*)://|mailto:)([^ /]+)(/[^ ]*)?|([^ @]+)@([^ @]+)`         \\ \hline     
14Hex             & \verb`[ ](0x)?([a-fA-F0-9][a-fA-F0-9])+[.:,?! ]`              \\ \hline
15StarHeight              & \verb`[A-Z]((([a-zA-Z]*a[a-zA-Z]*[ ])*[a-zA-Z]*e[a-zA-Z]*[ ])*[a-zA-Z]*s[a-zA-Z]*[ ])*[.?!]`          \\ \hline
16\end{tabular}
17}
18\end{center}
19\caption{Regular Expressions}
20\label{RegularExpressions}
21\end{table*}
22
23
24\paragraph*{Implementation Notes} Our regular expression compiler directly uses the Parabix tool chain
25to compile regular expression into SSE2-based implementations.
26Our compiler essentially scripts three other compilers to perform
27this work: the Parabix character class compiler to determine basic
28bit stream equations for each of the character classes encountered
29in a regular expression, the Pablo bitstream equation compiler which
30converts equations to block-at-a-time C++ code for 128-bit SIMD, and
31gcc 4.8.2 to generate the binaries.   The Pablo output is combined
32with a {\tt grep\_template.cpp} file that arranges to read input
33files, break them into segments, and print or count matches
34as they are encountered.
35
36\paragraph*{Comparative Implementations} We evaluate our bitwise data parallel implementation versus several
37alternatives.   
38We report data for two of these:
39gre2p
40%GNU grep version 2.10
41and nrgrep version 1.12.
42The gre2p program is a grep version implemented using the recently developed
43RE2 regular expression library, using a systematic DFA-based approach
44(as well as some NFA fallback techniques) \cite{cox2010RE2}.
45The NFA class is represented by nrgrep, one of the
46strongest competitors in regular expression matching performance.
47We also considered GNU grep 2.10, agrep 3.41 as an alternative NFA-based implementation
48and pcregrep 8.12 as a backtracking implementation, but do not
49report data for them. 
50GNU grep is a popular open-source implementation that is claimed to be
51primarily DFA-based with heuristics for important special cases.   
52The agrep implementation does not support
53some of the common regular expression syntax feature and is limited to
54patterns of at most 32 characters.   As a backtracking implementation,
55pcregrep supports more regular expression features, but is not
56competitive in performance in any example we tested.
57
58We performed our SSE2 performance study using an Intel Core i5-4570 (Haswell) processor
59(3.2 GHz, 4 physical cores, 32+32 kB (per core) L1 cache, 256 kB (per core) L2 cache, 6 MB L3 cache)
60running the 64-bit version of Ubuntu 12.04 (Linux).
61
62Our performance evaluation focuses on the running time of
63the regular expression matching process itself, excluding the
64preprocessing time for regular expression compilation.   
65However,
66the overhead of the Parabix transform to bit stream form is
67included in our reported results.
68
69
70\paragraph*{Test Expressions} Each grep implementation was evaluated
71against the five regular expressions shown
72in Table \ref{RegularExpressions}
73@ matches the at-sign character.
74This expression demonstrates the overhead involved
75in matching the simplest possible regular expression, a single character.
76Date, Email, and URI provide examples of commonly used regular expression.
77This set of expressions were modified from the \textit{Benchmark of Regex
78Libraries}.
79Hex matches delimited byte strings in hexadecimal notation, and
80enforces the constraint that the number of hex digits is even. This
81expression illustrates the performance
82of a repetition operator implemented using
83a while loop in our system.  StarHeight is an artificial expression designed to further
84stress while loop implementation with 4 levels of Kleene closure.
85All tests were run on a version
86of a \textit{Linux 3Dfx howto} 
87file of
8839,421,555 bytes. 
89
90
91\paragraph*{Results}
92\begin{figure}
93\begin{center}
94\begin{tikzpicture}
95\begin{axis}[
96xtick=data,
97ylabel=Cycles per Byte,
98xticklabels={@,Date,Email,URI,Hex,StarHeight},
99tick label style={font=\tiny},
100enlarge x limits=0.15,
101%enlarge y limits={0.15, upper},
102ymin=0,
103legend style={at={(0.5,-0.15)},
104anchor=north,legend columns=-1},
105ymax=42,
106ybar,
107bar width=7pt,
108cycle list = {black,black!70,black!40,black!10}
109%visualization depends on=y \as \rawy,
110]
111\addplot+[]
112file {data/cycles-bitstreams.dat};
113\addplot+[fill,text=black]
114file {data/cycles-nrgrep112.dat};
115\addplot+[fill,,text=black]
116file {data/cycles-gre2p.dat};
117 
118\legend{bitstreams,nrgrep,gre2p,Annot}
119\end{axis}
120\end{tikzpicture}
121\end{center}
122\caption{Cycles per Byte}\label{fig:SSECyclesPerByte}
123\end{figure}
124
125Figure~\ref{fig:SSECyclesPerByte} compares
126each of the grep implementations,
127with relative performance reported in CPU cycles per byte.
128
129The performance in matching the @ regular expression establishes
130the base line cost for regular expression processing.   All programs
131report 15,788 matching lines of the 1,086,077 lines in the file.
132The Parabix SSE2 implementation is clearly the fastest in this case with
133a cost of 0.95 CPU cycles per byte.    The bulk of this represents
134the overhead of the Parabix transform, the bitwise logic to
135calculate the single {\tt [@]} character class stream is relatively
136trivial.   It is interesting to note that
137this example does not represent a baseline cost for either
138nrgrep or gre2p, each of these benefit from character skipping
139optimizations in their implementations.
140
141Our results for the matching the Date expression to find the
142668 lines containing dates show an increase
143from 0.95 to 1.22 cycles per byte, corresponding to the additional
144logic for the regular expression matching steps according to our
145algorithm.    For this relatively simple expression, however, nrgrep
146outperforms our implementation by taking significant advantage
147of character skipping.   Each time that nrgrep encounters a character
148that cannot appear in a date it jumps six character positions rather
149than searching every character in the input text.  gre2p also shows
150a significant benefit from the character skipping optimization.
151
152The results for the Email expression illustrate the relative
153advantage of the Parabix method when the expression to be matched
154does not permit character skipping in the NFA- or DFA-based
155implementations.   In this example, our implementation outperforms
156nrgrep by a factor of 7X, and gre2p by 23X.   There are 15,057
157lines matching the Email regex.
158
159The URI expression illustrates the performance of the
160grep programs with additional regular expression complexity.
161As expressions get larger, the number of steps required by
162the Parabix implementation increases, so the performance
163advantage drops to about 4.5X over nrgrep and 19X over gre2p.
16432557 lines are matched by the URI regex.
165
166The results for Hex expression illustrate the bitstreams performance
167in the case of a Kleene-+ operator compiled to a while loop.
168Performance is nevertheless quite good;
169our implementation uses just 1.6 cycles per byte to find the
170130,243 matching lines.    The gre2p program performs
171quite poorly here, slower than the Parabix implementation
172by about 70X, while nrgrep is about 5.5X slower.
173
174A more complex triply-nested repetition structure is required by
175the bitstreams implementation of the StarHeight expression. 
176In this case, there is a noticeable drop off in the
177performance advantage of the bitstreams implementation over
178the nrgrep and gre2p.   Nevertheless a 2X
179advantage over nrgrep is maintained.
180
181\begin{figure}
182\begin{center}
183\begin{tikzpicture}
184\begin{axis}[
185xtick=data,
186ylabel=Instructions per Cycle,
187xticklabels={@,Date,Email,URI,Hex,StarHeight},
188tick label style={font=\tiny},
189enlarge x limits=0.15,
190%enlarge y limits={0.15, upper},
191ymin=0,
192legend style={at={(0.5,-0.15)},
193anchor=north,legend columns=-1},
194ybar,
195bar width=7pt,
196cycle list = {black,black!70,black!40,black!10}
197]
198\addplot+[]
199file {data/ipc-bitstreams.dat};
200\addplot+[fill,text=black]
201file {data/ipc-nrgrep112.dat};
202\addplot+[fill,text=black]
203file {data/ipc-gre2p.dat};
204
205\legend{bitstreams,nrgrep,gre2p,Annot}
206\end{axis}
207\end{tikzpicture}
208\end{center}
209\caption{Instructions per Cycle}\label{fig:SSEInstructionsPerCycle}
210\end{figure}
211
212Figure \ref{fig:SSEInstructionsPerCycle} shows the efficiency of
213processor resource usage
214achieved by the three programs on each of the test expression in
215terms of instructions per cycle (IPC).
216For the first four expressions, in particular,
217the bitstreams implementation uses the processor resources quite efficiently,
218avoiding penalties due to cache misses and branch mispredictions.   
219However, with the while loop
220structures in processing the Hex and StarHeight expressions, branch
221mispredictions increase considerably and there is a noticeable
222drop-off in IPC.   The gre2p program suffers from significant penalties
223for the smaller expressions, but otherwise achieves a good IPC rate.
224On the other hand, nrgrep IPC drops off with expression complexity,
225suffering from significant penalties due to 
226mispredictions in the character-skipping logic and cache misses in
227table lookups.
228
229Overall, the bitstreams SSE2 implementation significantly outperformed
230both nrgrep and gre2p. In addition, the performance of bitstreams
231generally scales well with regular expression complexity, although
232nested Kleene closures are an issue.
233
Note: See TracBrowser for help on using the repository browser.