source: docs/Working/re/sse2.tex @ 3883

Last change on this file since 3883 was 3883, checked in by cameron, 5 years ago

Reference cleanups; use GPU instead of GPGPU

File size: 10.2 KB
[3883]1\section{SSE2 Implementation}\label{sec:SSE2}
9Name            & Expression    \\ \hline   
10@               & \verb`@`              \\ \hline     
11Date            & \verb`([0-9][0-9]?)/([0-9][0-9]?)/([0-9][0-9]([0-9][0-9])?)`          \\ \hline     
12Email           & \verb`([^ @]+)@([^ @]+)`              \\ \hline
[3870]13URIOrEmail      & \verb`(([a-zA-Z][a-zA-Z0-9]*)://|mailto:)([^ /]+)(/[^ ]*)?|([^ @]+)@([^ @]+)`         \\ \hline     
14HexBytes                & \verb`[ ](0x)?([a-fA-F0-9][a-fA-F0-9])+[.:,?! ]`              \\ \hline
15StarHeight              & \verb`[A-Z]((([a-zA-Z]*a[a-zA-Z]*[ ])*[a-zA-Z]*e[a-zA-Z]*[ ])*[a-zA-Z]*s[a-zA-Z]*[ ])*[.?!]`          \\ \hline
19\caption{Regular Expressions}
[3880]24\paragraph*{Implementation Notes} Our regular expression compiler directly uses the Parabix tool chain
[3509]25to compile regular expression into SSE2-based implementations.
26Our compiler essentially scripts three other compilers to perform
27this work: the Parabix character class compiler to determine basic
28bit stream equations for each of the character classes encountered
29in a regular expression, the Pablo bitstream equation compiler which
30converts equations to block-at-a-time C++ code for 128-bit SIMD, and
[3868]31gcc 4.8.2 to generate the binaries.   The Pablo output is combined
[3510]32with a {\tt grep\_template.cpp} file that arranges to read input
[3666]33files, break them into segments, and print or count matches
[3509]34as they are encountered.
[3880]36\paragraph*{Comparative Implementations} We evaluate our bitwise data parallel implementation versus several
38We report data for two of these:
40%GNU grep version 2.10
41and nrgrep version 1.12.
42The gre2p program is a grep version implemented using the recently developed
43RE2 regular expression library, using a systematic DFA-based approach
44(as well as some NFA fallback techniques) \cite{cox2010RE2}.
[3509]45The NFA class is represented by nrgrep, one of the
46strongest competitors in regular expression matching performance.
[3870]47We also considered GNU grep 2.10, agrep 3.41 as an alternative NFA-based implementation
[3509]48and pcregrep 8.12 as a backtracking implementation, but do not
[3642]49report data for them. 
50GNU grep is a popular open-source implementation that is claimed to be
[3647]51primarily DFA-based with heuristics for important special cases.   
[3642]52The agrep implementation does not support
[3509]53some of the common regular expression syntax feature and is limited to
54patterns of at most 32 characters.   As a backtracking implementation,
55pcregrep supports more regular expression features, but is not
56competitive in performance in any example we tested.
[3662]58We performed our SSE2 performance study using an Intel Core i5-4570 (Haswell) processor
59(3.2 GHz, 4 physical cores, 32+32 kB (per core) L1 cache, 256 kB (per core) L2 cache, 6 MB L3 cache)
60running the 64-bit version of Ubuntu 12.04 (Linux).
[3862]62Our performance evaluation focuses on the running time of
63the regular expression matching process itself, excluding the
64preprocessing time for regular expression compilation.   
66the overhead of the Parabix transform to bit stream form is
67included in our reported results.
[3880]70\paragraph*{Test Expressions} Each grep implementation was evaluated
[3666]71against the five regular expressions shown
72in Table \ref{RegularExpressions}
[3862]73@ matches the at-sign character.
[3666]74This expression demonstrates the overhead involved
75in matching the simplest possible regular expression, a single character.
76Date, Email, and URIOrEmail provide examples of commonly used regular expression.
[3880]77This set of expressions were modified from the \textit{Benchmark of Regex
[3666]78Libraries} (
79HexBytes matches delimited byte strings in hexadecimal notation, and
80enforces the constraint that the number of hex digits is even. This
81expression illustrates the performance
82of a repetition operator implemented using
[3880]83a while loop.  StarHeight is an expression designed to further
84stress our while loop implementation with 4 levels of Kleene closure.
[3666]85All tests were run on a version
[3868]86of a \textit{Linux 3Dfx howto} 
87file of
8839,421,555 bytes. 
97ylabel=Cycles per Byte,
[3495]99tick label style={font=\tiny},
100enlarge x limits=0.15,
[3617]101%enlarge y limits={0.15, upper},
[3495]103legend style={at={(0.5,-0.15)},
104anchor=north,legend columns=-1},
107bar width=7pt,
[3617]108%visualization depends on=y \as \rawy,
[3658]111file {data/cycles-bitstreams.dat};
[3658]113file {data/cycles-nrgrep112.dat};
[3658]115file {data/cycles-gre2p.dat};
[3503]121\caption{Cycles per Byte}\label{fig:SSECyclesPerByte}
[3666]124Figure~\ref{fig:SSECyclesPerByte} compares
125each of the grep implementations,
126with relative performance reported in CPU cycles per byte.
[3862]128The performance in matching the @ regular expression establishes
129the base line cost for regular expression processing.   All programs
130report 15,788 matching lines of the 1,086,077 lines in the file.
131The Parabix SSE2 implementation is clearly the fastest in this case with
132a cost of 0.95 CPU cycles per byte.    The bulk of this represents
133the overhead of the Parabix transform, the bitwise logic to
134calculate the single {\tt [@]} character class stream is relatively
135trivial.   It is interesting to note that
136this example does not represent a baseline cost for either
137nrgrep or gre2p, each of these benefit from character skipping
138optimizations in their implementations.
[3862]140Our results for the matching the Date expression to find the
141668 lines containing dates show an increase
142from 0.95 to 1.22 cycles per byte, corresponding to the additional
143logic for the regular expression matching steps according to our
144algorithm.    For this relatively simple expression, however, nrgrep
145outperforms our implementation by taking significant advantage
146of character skipping.   Each time that nrgrep encounters a character
147that cannot appear in a date it jumps six character positions rather
148than searching every character in the input text.  gre2p also shows
149a significant benefit from the character skipping optimization.
151The results for the Email expression illustrate the relative
152advantage of the Parabix method when the expression to be matched
153does not permit character skipping in the NFA- or DFA-based
154implementations.   In this example, our implementation outperforms
155nrgrep by a factor of 7X, and gre2p by 23X.   There are 15,057
156lines matching the Email regex.
[3868]158The URIorEmail expression illustrates the performance of the
159grep programs with additional regular expression complexity.
160As expressions get larger, the number of steps required by
161the Parabix implementation increases, so the performance
162advantage drops to about 4.5X over nrgrep and 19X over gre2p.
16332557 lines are matched by the URIorEmail regex.
165The results for HexBytes expression illustrate Parabix performance
166involving a Kleene-+ operator compiled to a while loop. 
[3868]167Our implementation uses just 1.6 cycles per byte to find the
168130,243 matching lines.    The gre2p program performs
169quite poorly here, slower than the Parabix implementation
170by about 70X, while nrgrep is about 5.5X slower.
[3868]172StarHeight is an artificial expression created to stress the Parabix
173implementation with a triply-nested while loop.   The performance
174does drop off, maintaining only a 2X advantage over nrgrep.
184ylabel=Instructions per Byte,
[3495]186tick label style={font=\tiny},
[3498]187enlarge x limits=0.15,
[3617]188%enlarge y limits={0.15, upper},
[3495]190legend style={at={(0.5,-0.15)},
191anchor=north,legend columns=-1},
194bar width=7pt,
[3658]197file {data/instructions-bitstreams.dat};
[3658]199file {data/instructions-nrgrep112.dat};
[3658]201file {data/instructions-gre2p.dat};
[3503]207\caption{Instructions per Byte}\label{fig:SSEInstructionsPerByte}
[3666]210Figure~\ref{fig:SSEInstructionsPerByte} illustrates
211relative performance reported in instructions per byte.
213mirrors Figure~\ref{fig:SSECyclesPerByte}. The bitstreams
214implementation demonstrates stable performance
215characteristics across each of the test expressions. The
216gre2p implementation also shows
217relatively stable characteristics, whereas,
218nrgrep exhibits greater variability.
225ylabel=Instructions per Cycle,
[3495]227tick label style={font=\tiny},
[3498]228enlarge x limits=0.15,
[3617]229%enlarge y limits={0.15, upper},
[3495]231legend style={at={(0.5,-0.15)},
232anchor=north,legend columns=-1},
234bar width=7pt,
[3658]237file {data/ipc-bitstreams.dat};
[3658]239file {data/ipc-nrgrep112.dat};
[3658]241file {data/ipc-gre2p.dat};
[3503]247\caption{Instructions per Cycle}\label{fig:SSEInstructionsPerCycle}
[3510]255ylabel=Branch Misses per Instruction,
[3495]257tick label style={font=\tiny},
[3498]258enlarge x limits=0.15,
[3617]259%enlarge y limits={0.15, upper},
[3495]261legend style={at={(0.5,-0.15)},
262anchor=north,legend columns=-1},
264bar width=7pt,
[3658]267file {data/branch-misses-bitstreams.dat};
[3658]269file {data/branch-misses-nrgrep112.dat};
[3658]271file {data/branch-misses-gre2p.dat};
[3510]277\caption{Branch Misses per Instruction}\label{fig:SSEBranchMisses}
[3666]279Figure \ref{fig:SSEBranchMisses} reports branch misses per kilobyte.
280The bitstreams and gre2p implementations remain consistent.
[3647]281Significant variation is seen with nrgrep.
[3666]283Overall, the bitstreams implementation significantly outperformed
284both nrgrep and gre2p. In addition, the performance of bitstreams
285scaled smoothly with regular expression complexity. As Section~\ref{sec:analysis}
286describes, we anticipate that the performance of bitstreams
287will to continue to scale smoothly with
288regular expression complexity.
Note: See TracBrowser for help on using the repository browser.