# Changeset 3889 for docs/Working/re

Timestamp: Jun 24, 2014, 4:00:15 AM
Message: Various cleanups, drop instructions/byte, branch miss figures
Location: docs/Working/re
Files: 6 edited
• ## docs/Working/re/analysis.tex

 r3879 The bitstream method starts with a preprocessing step: the compilation of the regular expression using the algorithm presented above as well as the compilers of the Parabix toolchain. Compilation is an offline process whose time is not counted in our performance measures, as each of these is a research tool that has neither been optimized nor integrated. This leads to a bias in our results, as our timings for nrgrep and gre2p include the time taken for preprocessing. We minimize the bias by performing our tests with reasonably large inputs, so that the text-scanning costs dominate the preprocessing costs. We assume that the length $m$ of regular expressions is typically less than 100 bytes and that data files are typically over 10 MB. Provided that a well-engineered implementation of our regular expression compilation algorithm together with the compilers of the Parabix tool chain requires no more than $10000m$ cycles, the overhead of compilation will not substantially increase the running time. As our regular expression algorithm is $O(m)$, and the other steps of the Parabix tool chain require $O(m \log m)$ time, we expect that such performance is well within reason.
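As a quick sanity check on this bound (a sketch using only the figures quoted above, $m \le 100$ bytes and files over 10 MB):

```latex
% Compilation cost under the stated bound:
%   10000 m \le 10000 \times 100 = 10^6 \text{ cycles.}
% Scanning a 10 MB file at even one cycle per byte costs
% about 10^7 cycles, so compilation adds at most roughly
% 10\% to the total running time, and proportionally less
% for larger inputs.
\[
10000\,m \;\le\; 10000 \times 100 \;=\; 10^{6} \text{ cycles}
\;\ll\; 10^{7} \text{ cycles of scanning.}
\]
```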
It is important to note that our algorithms construct neither NFAs nor DFAs and so are not subject to the exponential-time behaviour of NFA-to-DFA transformation. For simplicity, we will first assume that the input regular expressions
• ## docs/Working/re/avx2.tex

 r3880 xtick=data, ylabel=AVX2 Instruction Reduction, xticklabels={@,Date,Email,URI,Hex,StarHeight}, tick label style={font=\tiny}, enlarge x limits=0.15,
xtick=data, ylabel=AVX2 Speedup, xticklabels={@,Date,Email,URI,Hex,StarHeight}, tick label style={font=\tiny}, enlarge x limits=0.15,
Email & 9.5X & 28X & 12X\\ \hline URI & 6.6X & 27X & 518X\\ \hline Hex & 8.1X & 105X & 267X\\ \hline StarHeight & 1.9X & 7.6X & 97X\\ \hline \end{tabular}
• ## docs/Working/re/conclusion.tex

 r3883 of 5X or more. \paragraph*{Ongoing and Future Work} Based on the techniques presented here a fully integrated grep version with a dynamic code generator implemented with LLVM is being developed by another team working with the Parabix technology (Dale Denis, Nick Sumner and Rob Cameron). An initial version is available at {\tt http://parabix.costar.sfu.ca/icGREP}. With icgrep-0.8, total compile-time overhead to translate our test expressions into executable x86 code ranges from 0.002 seconds to 0.008 seconds for our test cases. Although this represents a tolerable overhead of 0.64 cycles/byte for our 40 MB test file, we expect that a substantial reduction of this overhead is feasible. Further work on the compilation algorithms includes extending them to use MatchStar in Kleene-* repetitions beyond those of single characters (bytes). Each such extension would
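The MatchStar operation mentioned above can be illustrated with a small sketch, modeling bitstreams as arbitrary-precision Python integers where bit $i$ corresponds to byte position $i$ (an illustrative model, not the Parabix implementation):

```python
def match_star(m: int, c: int) -> int:
    """MatchStar(M, C): from each marker in M, match zero or more
    characters of class C.  Adding C into the marked positions lets a
    single carry sweep through an entire run of C bits at once; the
    XOR recovers the positions the carry passed through, and OR-ing M
    back in preserves the zero-repetition case."""
    return (((m & c) + c) ^ c) | m

# Example: a marker at position 1, with class-C characters at
# positions 1 and 2, yields markers at positions 1, 2, and 3
# (the marker may stop before, within, or just past the run).
print(bin(match_star(0b0010, 0b0110)))  # 0b1110
```

The key point is that the carry-propagating addition handles an arbitrarily long run of class-C characters in a single bitwise step, with no per-character loop.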
• ## docs/Working/re/pact051-cameron.tex

 r3885 areas for future work. \section{Regular Expressions and Grep}\label{sec:grep} We follow common POSIX notation for regular expressions. \end{itemize} Here, the \verb#hsimd<64>::mask(X)# and \verb#simd<64>::spread(X)# model the minimum communication requirements between the parallel processing units. expression matching as shown herein, it seems reasonable to expect such instructions to become available. Alternatively, it may be worthwhile to simply ensure that the \verb#hmask# and \verb#spread# operations are efficiently supported. \ref{fig:SSE-AVX-GPU}. The GPU version gives up to 55\% performance improvement over the SSE version and up to 40\% performance improvement over the AVX version. However, because of implementation complexities of the triply-nested while loop for the StarHeight expression, it has been omitted. Although we intended to process 64 work groups with 4096 bytes each at a time rather than 128 bytes at a time on SSE or 256 bytes at a time on AVX, the performance xtick=data, ylabel=Running Time (ms per megabyte), xticklabels={@,Date,Email,URI,Hex,StarHeight}, tick label style={font=\tiny}, enlarge x limits=0.15,
• ## docs/Working/re/sse2.tex

 r3883 Date            & \verb([0-9][0-9]?)/([0-9][0-9]?)/([0-9][0-9]([0-9][0-9])?)          \\ \hline Email           & \verb([^ @]+)@([^ @]+)              \\ \hline URI     & \verb(([a-zA-Z][a-zA-Z0-9]*)://|mailto:)([^ /]+)(/[^ ]*)?|([^ @]+)@([^ @]+)         \\ \hline Hex             & \verb[ ](0x)?([a-fA-F0-9][a-fA-F0-9])+[.:,?! ]              \\ \hline StarHeight              & \verb[A-Z]((([a-zA-Z]*a[a-zA-Z]*[ ])*[a-zA-Z]*e[a-zA-Z]*[ ])*[a-zA-Z]*s[a-zA-Z]*[ ])*[.?!]          \\ \hline \end{tabular} This expression demonstrates the overhead involved in matching the simplest possible regular expression, a single character. Date, Email, and URI provide examples of commonly used regular expressions. This set of expressions was modified from the \textit{Benchmark of Regex Libraries}. Hex matches delimited byte strings in hexadecimal notation, and enforces the constraint that the number of hex digits is even. This expression illustrates the performance of a repetition operator implemented using a while loop in our system. StarHeight is an artificial expression designed to further stress the while loop implementation with 4 levels of Kleene closure.
All tests were run on a version of a \textit{Linux 3Dfx howto} xtick=data, ylabel=Cycles per Byte, xticklabels={@,Date,Email,URI,Hex,StarHeight}, tick label style={font=\tiny}, enlarge x limits=0.15, lines matching the Email regex. The URI expression illustrates the performance of the grep programs with additional regular expression complexity. As expressions get larger, the number of steps required by the Parabix implementation increases, so the performance advantage drops to about 4.5X over nrgrep and 19X over gre2p. 32557 lines are matched by the URI regex. The results for the Hex expression illustrate the bitstreams performance in the case of a Kleene-+ operator compiled to a while loop. Performance is nevertheless quite good; our implementation uses just 1.6 cycles per byte to find the 130,243 matching lines. The gre2p program performs quite poorly here, slower than the Parabix implementation by about 70X, while nrgrep is about 5.5X slower. StarHeight is an artificial expression created to stress the Parabix implementation with a triply-nested while loop. The performance does drop off, maintaining only a 2X advantage over nrgrep.
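The while-loop compilation of a Kleene-+ repetition described above can be sketched as a fixpoint iteration over bitstreams (again modeling bitstreams as Python integers; the helper names here are illustrative, not Parabix's actual API):

```python
def scan_thru(m: int, c: int) -> int:
    # ScanThru(M, C): advance each marker through a contiguous run of
    # class-C positions in one carry-propagating step, landing just
    # past the run (useful when only the maximal match is needed).
    return (m + c) & ~c

def kleene_plus(match_once, m: int) -> int:
    # Compile X+ as a while loop: apply the matcher for one occurrence
    # of X repeatedly, accumulating new marker positions until no new
    # positions appear (a fixpoint).
    result = match_once(m)
    frontier = result
    while frontier:
        advanced = match_once(frontier)
        frontier = advanced & ~result   # only genuinely new positions
        result |= advanced
    return result

# Example: X matches a single digit; digits occupy positions 1-4.
digits = 0b11110
match_digit = lambda m: (m & digits) << 1
print(bin(kleene_plus(match_digit, 0b00010)))  # 0b111100
print(bin(scan_thru(0b00010, digits)))         # 0b100000
```

For a single-character class, ScanThru reaches the end of the run without any loop; the while loop is what supports repetitions of larger sub-expressions, and each extra nesting level (as in StarHeight) adds another such loop.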
\begin{figure}
\begin{center}
\begin{tikzpicture}
\begin{axis}[
  xtick=data, ylabel=Instructions per Byte,
  xticklabels={@,Date,Email,URIorEmail,HexBytes,StarHeight},
  tick label style={font=\tiny}, enlarge x limits=0.15,
  %enlarge y limits={0.15, upper},
  ymin=0, legend style={at={(0.5,-0.15)}, anchor=north,legend columns=-1},
  ymax=110, ybar, bar width=7pt, ]
\addplot file {data/instructions-bitstreams.dat};
\addplot file {data/instructions-nrgrep112.dat};
\addplot file {data/instructions-gre2p.dat};
\legend{bitstreams,nrgrep,gre2p,Annot}
\end{axis}
\end{tikzpicture}
\end{center}
\caption{Instructions per Byte}\label{fig:SSEInstructionsPerByte}
\end{figure}
Figure~\ref{fig:SSEInstructionsPerByte} illustrates relative performance reported in instructions per byte and mirrors Figure~\ref{fig:SSECyclesPerByte}. The bitstreams implementation demonstrates stable performance characteristics across each of the test expressions. The gre2p implementation also shows relatively stable characteristics, whereas nrgrep exhibits greater variability. A more complex triply-nested repetition structure is required by the bitstreams implementation of the StarHeight expression. In this case, there is a noticeable drop-off in the performance advantage of the bitstreams implementation over nrgrep and gre2p. Nevertheless, a 2X advantage over nrgrep is maintained.
\begin{figure} \begin{center} xtick=data, ylabel=Instructions per Cycle, xticklabels={@,Date,Email,URI,Hex,StarHeight}, tick label style={font=\tiny}, enlarge x limits=0.15, \end{figure}
\begin{figure}
\begin{center}
\begin{tikzpicture}
\begin{axis}[
  xtick=data, ylabel=Branch Misses per Instruction,
  xticklabels={@,Date,Email,URIorEmail,HexBytes,StarHeight},
  tick label style={font=\tiny}, enlarge x limits=0.15,
  %enlarge y limits={0.15, upper},
  ymin=0, legend style={at={(0.5,-0.15)}, anchor=north,legend columns=-1},
  ybar, bar width=7pt, ]
\addplot file {data/branch-misses-bitstreams.dat};
\addplot file {data/branch-misses-nrgrep112.dat};
\addplot file {data/branch-misses-gre2p.dat};
\legend{bitstreams,nrgrep,gre2p,Annot}
\end{axis}
\end{tikzpicture}
\end{center}
\caption{Branch Misses per Instruction}\label{fig:SSEBranchMisses}
\end{figure}
Figure \ref{fig:SSEBranchMisses} reports branch misses per instruction. The bitstreams and gre2p implementations remain consistent. Significant variation is seen with nrgrep. Figure \ref{fig:SSEInstructionsPerCycle} shows the efficiency of processor utilization achieved by the three programs on each of the test expressions in terms of instructions per cycle (IPC). For the first four expressions, in particular, the bitstreams implementation uses the processor resources quite efficiently, avoiding penalties due to cache misses and branch mispredictions. However, with the while loop structures in processing the Hex and StarHeight expressions, branch mispredictions increase considerably and there is a noticeable drop-off in IPC. Both gre2p and nrgrep suffer from significant penalties due to mispredictions in the character-skipping logic and cache misses in table lookups. The performance of nrgrep, in particular, drops off with the growth in regular expression complexity. Overall, the bitstreams implementation significantly outperformed both nrgrep and gre2p.
In addition, the performance of bitstreams scaled smoothly with regular expression complexity. As Section~\ref{sec:analysis} describes, we anticipate that the performance of bitstreams scales well with regular expression complexity.