source: docs/Working/re/sse2.tex @ 3876

Last change on this file since 3876 was 3876, checked in by cameron, 5 years ago

Delete extra period for paragraph{} items

File size: 10.0 KB
RevLine 
[3509]1\section{SSE2 Implementation and Evaluation}\label{sec:SSE2}
[3495]2
[3642]3\begin{table*}[htbp]
4\begin{center}
5{
6\footnotesize
7\begin{tabular}{|l|l|}
8\hline
9Name            & Expression    \\ \hline   
10@               & \verb`@`              \\ \hline     
11Date            & \verb`([0-9][0-9]?)/([0-9][0-9]?)/([0-9][0-9]([0-9][0-9])?)`          \\ \hline     
12Email           & \verb`([^ @]+)@([^ @]+)`              \\ \hline
[3870]13URIOrEmail      & \verb`(([a-zA-Z][a-zA-Z0-9]*)://|mailto:)([^ /]+)(/[^ ]*)?|([^ @]+)@([^ @]+)`         \\ \hline     
14HexBytes                & \verb`[ ](0x)?([a-fA-F0-9][a-fA-F0-9])+[.:,?! ]`              \\ \hline
15StarHeight              & \verb`[A-Z]((([a-zA-Z]*a[a-zA-Z]*[ ])*[a-zA-Z]*e[a-zA-Z]*[ ])*[a-zA-Z]*s[a-zA-Z]*[ ])*[.?!]`          \\ \hline
[3642]16\end{tabular}
17}
18\end{center}
19\caption{Regular Expressions}
20\label{RegularExpressions}
21\end{table*}
22
23
[3876]24\paragraph{Implementation Notes}
[3509]25Our regular expression compiler directly uses the Parabix tool chain
26to compile regular expression into SSE2-based implementations.
27Our compiler essentially scripts three other compilers to perform
28this work: the Parabix character class compiler to determine basic
29bit stream equations for each of the character classes encountered
30in a regular expression, the Pablo bitstream equation compiler which
31converts equations to block-at-a-time C++ code for 128-bit SIMD, and
[3868]32gcc 4.8.2 to generate the binaries.   The Pablo output is combined
[3510]33with a {\tt grep\_template.cpp} file that arranges to read input
[3666]34files, break them into segments, and print or count matches
[3509]35as they are encountered.
[3495]36
[3876]37\paragraph{Comparative Implementations}
[3642]38We evaluate our bitwise data parallel implementation versus several
39alternatives.   
40We report data for two of these:
41gre2p
42%GNU grep version 2.10
43and nrgrep version 1.12.
44The gre2p program is a grep version implemented using the recently developed
45RE2 regular expression library, using a systematic DFA-based approach
46(as well as some NFA fallback techniques) \cite{cox2010RE2}.
[3509]47The NFA class is represented by nrgrep, one of the
48strongest competitors in regular expression matching performance.
[3870]49We also considered GNU grep 2.10, agrep 3.41 as an alternative NFA-based implementation
[3509]50and pcregrep 8.12 as a backtracking implementation, but do not
[3642]51report data for them. 
52GNU grep is a popular open-source implementation that is claimed to be
[3647]53primarily DFA-based with heuristics for important special cases.   
[3642]54The agrep implementation does not support
[3509]55some of the common regular expression syntax feature and is limited to
56patterns of at most 32 characters.   As a backtracking implementation,
57pcregrep supports more regular expression features, but is not
58competitive in performance in any example we tested.
[3495]59
[3662]60We performed our SSE2 performance study using an Intel Core i5-4570 (Haswell) processor
61(3.2 GHz, 4 physical cores, 32+32 kB (per core) L1 cache, 256 kB (per core) L2 cache, 6 MB L3 cache)
62running the 64-bit version of Ubuntu 12.04 (Linux).
[3495]63
[3862]64Our performance evaluation focuses on the running time of
65the regular expression matching process itself, excluding the
66preprocessing time for regular expression compilation.   
67However,
68the overhead of the Parabix transform to bit stream form is
69included in our reported results.
[3495]70
[3862]71
[3876]72\paragraph{Test Expressions}
[3666]73Each grep implementation was evaluated
74against the five regular expressions shown
75in Table \ref{RegularExpressions}
[3862]76@ matches the at-sign character.
[3666]77This expression demonstrates the overhead involved
78in matching the simplest possible regular expression, a single character.
79Date, Email, and URIOrEmail provide examples of commonly used regular expression.
80This set of expressions were drawn from the \textit{Benchmark of Regex
81Libraries} (http://lh3lh3.users.sourceforge.net/reb.shtml).
82HexBytes matches delimited byte strings in hexadecimal notation, and
83enforces the constraint that the number of hex digits is even. This
84expression illustrates the performance
85of a repetition operator implemented using
[3637]86a while loop.
[3666]87All tests were run on a version
[3868]88of a \textit{Linux 3Dfx howto} 
89file of
9039,421,555 bytes. 
[3495]91
92
[3876]93\paragraph{Results}
[3495]94\begin{figure}
95\begin{center}
96\begin{tikzpicture}
97\begin{axis}[
98xtick=data,
99ylabel=Cycles per Byte,
[3868]100xticklabels={@,Date,Email,URIorEmail,HexBytes,StarHeight},
[3495]101tick label style={font=\tiny},
102enlarge x limits=0.15,
[3617]103%enlarge y limits={0.15, upper},
[3498]104ymin=0,
[3495]105legend style={at={(0.5,-0.15)},
106anchor=north,legend columns=-1},
[3644]107ymax=42,
[3495]108ybar,
109bar width=7pt,
[3617]110%visualization depends on=y \as \rawy,
[3495]111]
[3510]112\addplot
[3658]113file {data/cycles-bitstreams.dat};
[3510]114\addplot
[3658]115file {data/cycles-nrgrep112.dat};
[3510]116\addplot
[3658]117file {data/cycles-gre2p.dat};
[3495]118 
[3647]119\legend{bitstreams,nrgrep,gre2p,Annot}
[3495]120\end{axis}
121\end{tikzpicture}
122\end{center}
[3503]123\caption{Cycles per Byte}\label{fig:SSECyclesPerByte}
[3495]124\end{figure}
[3503]125
[3666]126Figure~\ref{fig:SSECyclesPerByte} compares
127each of the grep implementations,
128with relative performance reported in CPU cycles per byte.
[3509]129
[3862]130The performance in matching the @ regular expression establishes
131the base line cost for regular expression processing.   All programs
132report 15,788 matching lines of the 1,086,077 lines in the file.
133The Parabix SSE2 implementation is clearly the fastest in this case with
134a cost of 0.95 CPU cycles per byte.    The bulk of this represents
135the overhead of the Parabix transform, the bitwise logic to
136calculate the single {\tt [@]} character class stream is relatively
137trivial.   It is interesting to note that
138this example does not represent a baseline cost for either
139nrgrep or gre2p, each of these benefit from character skipping
140optimizations in their implementations.
[3509]141
[3862]142Our results for the matching the Date expression to find the
143668 lines containing dates show an increase
144from 0.95 to 1.22 cycles per byte, corresponding to the additional
145logic for the regular expression matching steps according to our
146algorithm.    For this relatively simple expression, however, nrgrep
147outperforms our implementation by taking significant advantage
148of character skipping.   Each time that nrgrep encounters a character
149that cannot appear in a date it jumps six character positions rather
150than searching every character in the input text.  gre2p also shows
151a significant benefit from the character skipping optimization.
152
153The results for the Email expression illustrate the relative
154advantage of the Parabix method when the expression to be matched
155does not permit character skipping in the NFA- or DFA-based
156implementations.   In this example, our implementation outperforms
157nrgrep by a factor of 7X, and gre2p by 23X.   There are 15,057
158lines matching the Email regex.
159
[3868]160The URIorEmail expression illustrates the performance of the
161grep programs with additional regular expression complexity.
162As expressions get larger, the number of steps required by
163the Parabix implementation increases, so the performance
164advantage drops to about 4.5X over nrgrep and 19X over gre2p.
16532557 lines are matched by the URIorEmail regex.
[3862]166
167The results for HexBytes expression illustrate Parabix performance
168involving a Kleene-+ operator compiled to a while loop. 
[3868]169Our implementation uses just 1.6 cycles per byte to find the
170130,243 matching lines.    The gre2p program performs
171quite poorly here, slower than the Parabix implementation
172by about 70X, while nrgrep is about 5.5X slower.
[3862]173
[3868]174StarHeight is an artificial expression created to stress the Parabix
175implementation with a triply-nested while loop.   The performance
176does drop off, maintaining only a 2X advantage over nrgrep.
[3862]177
178
179
180
[3495]181\begin{figure}
182\begin{center}
183\begin{tikzpicture}
184\begin{axis}[
185xtick=data,
186ylabel=Instructions per Byte,
[3868]187xticklabels={@,Date,Email,URIorEmail,HexBytes,StarHeight},
[3495]188tick label style={font=\tiny},
[3498]189enlarge x limits=0.15,
[3617]190%enlarge y limits={0.15, upper},
[3498]191ymin=0,
[3495]192legend style={at={(0.5,-0.15)},
193anchor=north,legend columns=-1},
[3647]194ymax=110,
[3495]195ybar,
196bar width=7pt,
197]
[3510]198\addplot
[3658]199file {data/instructions-bitstreams.dat};
[3510]200\addplot
[3658]201file {data/instructions-nrgrep112.dat};
[3510]202\addplot
[3658]203file {data/instructions-gre2p.dat};
[3495]204 
[3647]205\legend{bitstreams,nrgrep,gre2p,Annot}
[3495]206\end{axis}
207\end{tikzpicture}
208\end{center}
[3503]209\caption{Instructions per Byte}\label{fig:SSEInstructionsPerByte}
[3495]210\end{figure}
211
[3666]212Figure~\ref{fig:SSEInstructionsPerByte} illustrates
213relative performance reported in instructions per byte.
214Figure~\ref{fig:SSEInstructionsPerByte} 
215mirrors Figure~\ref{fig:SSECyclesPerByte}. The bitstreams
216implementation demonstrates stable performance
217characteristics across each of the test expressions. The
218gre2p implementation also shows
219relatively stable characteristics, whereas,
220nrgrep exhibits greater variability.
[3509]221 
[3495]222\begin{figure}
223\begin{center}
224\begin{tikzpicture}
225\begin{axis}[
226xtick=data,
227ylabel=Instructions per Cycle,
[3868]228xticklabels={@,Date,Email,URIorEmail,HexBytes,StarHeight},
[3495]229tick label style={font=\tiny},
[3498]230enlarge x limits=0.15,
[3617]231%enlarge y limits={0.15, upper},
[3498]232ymin=0,
[3495]233legend style={at={(0.5,-0.15)},
234anchor=north,legend columns=-1},
235ybar,
236bar width=7pt,
237]
[3510]238\addplot
[3658]239file {data/ipc-bitstreams.dat};
[3510]240\addplot
[3658]241file {data/ipc-nrgrep112.dat};
[3510]242\addplot
[3658]243file {data/ipc-gre2p.dat};
[3495]244
[3647]245\legend{bitstreams,nrgrep,gre2p,Annot}
[3495]246\end{axis}
247\end{tikzpicture}
248\end{center}
[3503]249\caption{Instructions per Cycle}\label{fig:SSEInstructionsPerCycle}
[3495]250\end{figure}
251
252\begin{figure}
253\begin{center}
254\begin{tikzpicture}
255\begin{axis}[
256xtick=data,
[3510]257ylabel=Branch Misses per Instruction,
[3868]258xticklabels={@,Date,Email,URIorEmail,HexBytes,StarHeight},
[3495]259tick label style={font=\tiny},
[3498]260enlarge x limits=0.15,
[3617]261%enlarge y limits={0.15, upper},
[3498]262ymin=0,
[3495]263legend style={at={(0.5,-0.15)},
264anchor=north,legend columns=-1},
265ybar,
266bar width=7pt,
267]
[3510]268\addplot
[3658]269file {data/branch-misses-bitstreams.dat};
[3510]270\addplot
[3658]271file {data/branch-misses-nrgrep112.dat};
[3510]272\addplot
[3658]273file {data/branch-misses-gre2p.dat};
[3495]274
[3647]275\legend{bitstreams,nrgrep,gre2p,Annot}
[3495]276\end{axis}
277\end{tikzpicture}
278\end{center}
[3510]279\caption{Branch Misses per Instruction}\label{fig:SSEBranchMisses}
[3495]280\end{figure}
[3666]281Figure \ref{fig:SSEBranchMisses} reports branch misses per kilobyte.
282The bitstreams and gre2p implementations remain consistent.
[3647]283Significant variation is seen with nrgrep.
[3495]284
[3666]285Overall, the bitstreams implementation significantly outperformed
286both nrgrep and gre2p. In addition, the performance of bitstreams
287scaled smoothly with regular expression complexity. As Section~\ref{sec:analysis}
288describes, we anticipate that the performance of bitstreams
289will to continue to scale smoothly with
290regular expression complexity.
[3509]291
292
Note: See TracBrowser for help on using the repository browser.