source: docs/Working/re/sse2.tex @ 3889

Last change on this file since 3889 was 3889, checked in by cameron, 5 years ago

Various cleanups, drop instructions/byte, branch miss figures

File size: 9.0 KB
[3883]1\section{SSE2 Implementation}\label{sec:SSE2}
9Name            & Expression    \\ \hline   
10@               & \verb`@`              \\ \hline     
11Date            & \verb`([0-9][0-9]?)/([0-9][0-9]?)/([0-9][0-9]([0-9][0-9])?)`          \\ \hline     
12Email           & \verb`([^ @]+)@([^ @]+)`              \\ \hline
[3889]13URI     & \verb`(([a-zA-Z][a-zA-Z0-9]*)://|mailto:)([^ /]+)(/[^ ]*)?|([^ @]+)@([^ @]+)`         \\ \hline     
14Hex             & \verb`[ ](0x)?([a-fA-F0-9][a-fA-F0-9])+[.:,?! ]`              \\ \hline
[3870]15StarHeight              & \verb`[A-Z]((([a-zA-Z]*a[a-zA-Z]*[ ])*[a-zA-Z]*e[a-zA-Z]*[ ])*[a-zA-Z]*s[a-zA-Z]*[ ])*[.?!]`          \\ \hline
19\caption{Regular Expressions}
[3880]24\paragraph*{Implementation Notes} Our regular expression compiler directly uses the Parabix tool chain
[3509]25to compile regular expression into SSE2-based implementations.
26Our compiler essentially scripts three other compilers to perform
27this work: the Parabix character class compiler to determine basic
28bit stream equations for each of the character classes encountered
29in a regular expression, the Pablo bitstream equation compiler which
30converts equations to block-at-a-time C++ code for 128-bit SIMD, and
[3868]31gcc 4.8.2 to generate the binaries.   The Pablo output is combined
[3510]32with a {\tt grep\_template.cpp} file that arranges to read input
[3666]33files, break them into segments, and print or count matches
[3509]34as they are encountered.
[3880]36\paragraph*{Comparative Implementations} We evaluate our bitwise data parallel implementation versus several
38We report data for two of these:
40%GNU grep version 2.10
41and nrgrep version 1.12.
42The gre2p program is a grep version implemented using the recently developed
43RE2 regular expression library, using a systematic DFA-based approach
44(as well as some NFA fallback techniques) \cite{cox2010RE2}.
[3509]45The NFA class is represented by nrgrep, one of the
46strongest competitors in regular expression matching performance.
[3870]47We also considered GNU grep 2.10, agrep 3.41 as an alternative NFA-based implementation
[3509]48and pcregrep 8.12 as a backtracking implementation, but do not
[3642]49report data for them. 
50GNU grep is a popular open-source implementation that is claimed to be
[3647]51primarily DFA-based with heuristics for important special cases.   
[3642]52The agrep implementation does not support
[3509]53some of the common regular expression syntax feature and is limited to
54patterns of at most 32 characters.   As a backtracking implementation,
55pcregrep supports more regular expression features, but is not
56competitive in performance in any example we tested.
[3662]58We performed our SSE2 performance study using an Intel Core i5-4570 (Haswell) processor
59(3.2 GHz, 4 physical cores, 32+32 kB (per core) L1 cache, 256 kB (per core) L2 cache, 6 MB L3 cache)
60running the 64-bit version of Ubuntu 12.04 (Linux).
[3862]62Our performance evaluation focuses on the running time of
63the regular expression matching process itself, excluding the
64preprocessing time for regular expression compilation.   
66the overhead of the Parabix transform to bit stream form is
67included in our reported results.
[3880]70\paragraph*{Test Expressions} Each grep implementation was evaluated
[3666]71against the five regular expressions shown
72in Table \ref{RegularExpressions}
[3862]73@ matches the at-sign character.
[3666]74This expression demonstrates the overhead involved
75in matching the simplest possible regular expression, a single character.
[3889]76Date, Email, and URI provide examples of commonly used regular expression.
[3880]77This set of expressions were modified from the \textit{Benchmark of Regex
79Hex matches delimited byte strings in hexadecimal notation, and
[3666]80enforces the constraint that the number of hex digits is even. This
81expression illustrates the performance
82of a repetition operator implemented using
[3889]83a while loop in our system.  StarHeight is an artificial expression designed to further
84stress while loop implementation with 4 levels of Kleene closure.
[3666]85All tests were run on a version
[3868]86of a \textit{Linux 3Dfx howto} 
87file of
8839,421,555 bytes. 
97ylabel=Cycles per Byte,
[3495]99tick label style={font=\tiny},
100enlarge x limits=0.15,
[3617]101%enlarge y limits={0.15, upper},
[3495]103legend style={at={(0.5,-0.15)},
104anchor=north,legend columns=-1},
107bar width=7pt,
[3617]108%visualization depends on=y \as \rawy,
[3658]111file {data/cycles-bitstreams.dat};
[3658]113file {data/cycles-nrgrep112.dat};
[3658]115file {data/cycles-gre2p.dat};
[3503]121\caption{Cycles per Byte}\label{fig:SSECyclesPerByte}
[3666]124Figure~\ref{fig:SSECyclesPerByte} compares
125each of the grep implementations,
126with relative performance reported in CPU cycles per byte.
[3862]128The performance in matching the @ regular expression establishes
129the base line cost for regular expression processing.   All programs
130report 15,788 matching lines of the 1,086,077 lines in the file.
131The Parabix SSE2 implementation is clearly the fastest in this case with
132a cost of 0.95 CPU cycles per byte.    The bulk of this represents
133the overhead of the Parabix transform, the bitwise logic to
134calculate the single {\tt [@]} character class stream is relatively
135trivial.   It is interesting to note that
136this example does not represent a baseline cost for either
137nrgrep or gre2p, each of these benefit from character skipping
138optimizations in their implementations.
[3862]140Our results for the matching the Date expression to find the
141668 lines containing dates show an increase
142from 0.95 to 1.22 cycles per byte, corresponding to the additional
143logic for the regular expression matching steps according to our
144algorithm.    For this relatively simple expression, however, nrgrep
145outperforms our implementation by taking significant advantage
146of character skipping.   Each time that nrgrep encounters a character
147that cannot appear in a date it jumps six character positions rather
148than searching every character in the input text.  gre2p also shows
149a significant benefit from the character skipping optimization.
151The results for the Email expression illustrate the relative
152advantage of the Parabix method when the expression to be matched
153does not permit character skipping in the NFA- or DFA-based
154implementations.   In this example, our implementation outperforms
155nrgrep by a factor of 7X, and gre2p by 23X.   There are 15,057
156lines matching the Email regex.
[3889]158The URI expression illustrates the performance of the
[3868]159grep programs with additional regular expression complexity.
160As expressions get larger, the number of steps required by
161the Parabix implementation increases, so the performance
162advantage drops to about 4.5X over nrgrep and 19X over gre2p.
[3889]16332557 lines are matched by the URI regex.
[3889]165The results for Hex expression illustrate the bitstreams performance
166in the case of a Kleene-+ operator compiled to a while loop.
167Performance is nevertheless quite good;
168our implementation uses just 1.6 cycles per byte to find the
[3868]169130,243 matching lines.    The gre2p program performs
170quite poorly here, slower than the Parabix implementation
171by about 70X, while nrgrep is about 5.5X slower.
[3889]173A more complex triply-nested repetition structure is required by
174the bitstreams implementation of the StarHeight expression. 
175In this case, there is a noticeable drop off in the
176performance advantage of the bitstreams implementation over
177the nrgrep and gre2p.   Nevertheless a 2X
178advantage over nrgrep is maintained.
185ylabel=Instructions per Cycle,
[3495]187tick label style={font=\tiny},
[3498]188enlarge x limits=0.15,
[3617]189%enlarge y limits={0.15, upper},
[3495]191legend style={at={(0.5,-0.15)},
192anchor=north,legend columns=-1},
194bar width=7pt,
[3658]197file {data/ipc-bitstreams.dat};
[3658]199file {data/ipc-nrgrep112.dat};
[3658]201file {data/ipc-gre2p.dat};
[3503]207\caption{Instructions per Cycle}\label{fig:SSEInstructionsPerCycle}
[3889]210Figure \ref{fig:SSEInstructionsPerCycle} shows the efficiency of
211processor utilization
212achieved by the three programs on each of the test expression in
213terms of instructions per cycle (IPC).
214For the first four expressions, in particular,
215the bitstreams implementation uses the processor resources quite efficiently,
216avoiding penalties due to cache misses and branch mispredictions.   
217However, with the while loop
218structures in processing the Hex and StarHeight expressions, branch
219mispredictions increase considerably and there is a noticeable
220drop-off in IPC.  Both the gre2p and nrgrep suffer from significant
221penalties due to 
222mispredictions in the character-skipping logic and cache misses in
223table lookups.  The performance of nrgrep, in particular drops off
224with the growth in regulare expression complexity.
[3666]226Overall, the bitstreams implementation significantly outperformed
227both nrgrep and gre2p. In addition, the performance of bitstreams
[3889]228scales well with regular expression complexity.
Note: See TracBrowser for help on using the repository browser.