Changeset 3889 for docs


Ignore:
Timestamp:
Jun 24, 2014, 4:00:15 AM (4 years ago)
Author:
cameron
Message:

Various cleanups, drop instructions/byte, branch miss figures

Location:
docs/Working/re
Files:
6 edited

Legend:

Unmodified
Added
Removed
  • docs/Working/re/analysis.tex

    r3879 r3889  
    1010
    1111The bitstream method starts with a preprocessing step: the compilation
    12 of the
    13 regular expression using the Parabix toolchain.
     12of the regular expression using the algorithm presented above as
     13well as the compilers of the Parabix toolchain.
    1414Compilation is an offline process whose time is not counted in our
    15 performance
    16 measures, as Parabix is a experimental research compiler that is not
    17 optimized.
    18 This leads to a bias in our results, as our timings for nrgrep and grep
     15performance measures, as each of these are research tools that have
     16neither been optimized nor integrated.   
     17 This leads to a bias in our results, as our timings for nrgrep and gre2p
    1918include the time taken for preprocessing.
    20 We have attempted to minimize the bias by performing our tests with large
    21 inputs, so that the text-scanning costs dominate the preprocessing costs.
    22 We furthermore believe that, if a special-purpose optimized compiler for
    23 regular expressions were built, that its inclusion in bitstreams grep
    24 would not
    25 substantially increase the running time, particularly for large input
    26 texts--the compilation involved is straightforward.
     19We minimize the bias by performing our tests with
     20reasonably large inputs, so that the text-scanning costs dominate the preprocessing
     21costs.   We assume that the length $m$ of regular expressions is typically less than
     22100 bytes and that data files are typically over 10 MB.   
     23Provided that a well-engineered implementation of our regular
     24expression compilation algorithm together with the compilers of
     25the Parabix tool chain requires no more than $10000m$ cycles,
     26the overhead of compilation will not substantially increase the
     27running time.   As our regular expression algorithm is $O(m)$,
     28and the other steps of the Parabix tool chain require $O(m \log m)$
     29time, we expect that such performance is well within reason.
     30It is important to note that our algorithms construct neither
     31NFAs nor DFAs and so are no subject to the exponential-time
     32behaviour of NFA-to-DFA transformation.
    2733
    2834For simplicity, we will first assume that the input regular expressions
  • docs/Working/re/avx2.tex

    r3880 r3889  
    5757xtick=data,
    5858ylabel=AVX2 Instruction Reduction,
    59 xticklabels={@,Date,Email,URIorEmail,HexBytes,StarHeight},
     59xticklabels={@,Date,Email,URI,Hex,StarHeight},
    6060tick label style={font=\tiny},
    6161enlarge x limits=0.15,
     
    109109xtick=data,
    110110ylabel=AVX2 Speedup,
    111 xticklabels={@,Date,Email,URIorEmail,HexBytes,StarHeight},
     111xticklabels={@,Date,Email,URI,Hex,StarHeight},
    112112tick label style={font=\tiny},
    113113enlarge x limits=0.15,
     
    168168Email & 9.5X & 28X & 12X\\ \hline
    169169URIorEmail & 6.6X & 27X & 518X\\ \hline
    170 HexBytes & 8.1X & 105X & 267X\\ \hline
     170Hex & 8.1X & 105X & 267X\\ \hline
    171171StarHeight & 1.9X & 7.6X & 97X\\ \hline
    172172\end{tabular}
  • docs/Working/re/conclusion.tex

    r3883 r3889  
    7373of 5X or more.
    7474
    75 \paragraph*{Ongoing and Future Work} Based on the techniques presented here a fully integrated
     75\paragraph*{Ongoing and Future Work} Based on the techniques presented
     76here a fully integrated
    7677grep version with a dynamic code generator implemented with LLVM
    7778is being developed by another team working with the Parabix
    7879technology (Dale Denis, Nick Sumner and Rob Cameron).
    79 An initial version is available at
    80 {\tt http://parabix.costar.sfu.ca/icGREP}.
     80An initial version is available at {\tt
     81  http://parabix.costar.sfu.ca/icGREP}.
     82With icgrep-0.8, total compile-time overhead to translate our
     83test expressions into executable x86 code ranges from 0.002
     84seconds to 0.008 seconds for our test cases.   
     85Although this represents a tolerable overhead
     86of 0.64 cycles/byte for our 40 MB test file,
     87we expect that a substantial reduction of this
     88overhead is feasible.
    8189
    82 Further work includes the extending the algorithms
     90Further work on the compilation algorithms includes the extending the algorithms
    8391to use MatchStar in Kleene-* repetitions beyond those
    8492of single characters (bytes).   Each such extension would
  • docs/Working/re/pact051-cameron.tex

    r3885 r3889  
    242242areas for future work.
    243243
    244 \section{Regular Expression Notation and Grep}\label{sec:grep}
     244\section{Regular Expressions and Grep}\label{sec:grep}
    245245
    246246We follow common POSIX notation for regular expressions.
     
    523523\end{itemize}
    524524
    525 In this model, the \verb#hsimd<64>::mask(X)# and
     525Here, the \verb#hsimd<64>::mask(X)# and
    526526\verb#simd<64>::spread(X)# model the minimum
    527527communication requirements between the parallel processing units
     
    627627expression matching as shown herein, it seems reasonable to expect
    628628such instructions to become available.    Alternatively, it may
    629 be worthwhile to simply ensure that the \verb#hsimd<64>::mask(X)# and
    630 \verb#simd<64>::spread(X)# operations are efficiently supported.
     629be worthwhile to simply ensure that the \verb#hmask# and
     630\verb#spread# operations are efficiently supported.
    631631
    632632
     
    671671\ref{fig:SSE-AVX-GPU}. The GPU version gives up to 55\% performance
    672672improvement over SSE version and up to 40\% performance
    673 improvement over AVX version. Although we intended to process
     673improvement over AVX version.   However, because of
     674implementation complexities of the triply-nested while loop for
     675the StarHeight expression, it has been omitted.
     676
     677Although we intended to process
    67467864 work groups with 4096 bytes each at a time rather than 128 bytes
    675679at a time on SSE or 256 bytes at a time on AVX, the performance
     
    696700xtick=data,
    697701ylabel=Running Time (ms per megabyte),
    698 xticklabels={@,Date,Email,URIorEmail,HexBytes,StarHeight},
     702xticklabels={@,Date,Email,URI,Hex,StarHeight},
    699703tick label style={font=\tiny},
    700704enlarge x limits=0.15,
  • docs/Working/re/sse2.tex

    r3883 r3889  
    1111Date            & \verb`([0-9][0-9]?)/([0-9][0-9]?)/([0-9][0-9]([0-9][0-9])?)`          \\ \hline     
    1212Email           & \verb`([^ @]+)@([^ @]+)`              \\ \hline
    13 URIOrEmail      & \verb`(([a-zA-Z][a-zA-Z0-9]*)://|mailto:)([^ /]+)(/[^ ]*)?|([^ @]+)@([^ @]+)`         \\ \hline     
    14 HexBytes                & \verb`[ ](0x)?([a-fA-F0-9][a-fA-F0-9])+[.:,?! ]`              \\ \hline
     13URI     & \verb`(([a-zA-Z][a-zA-Z0-9]*)://|mailto:)([^ /]+)(/[^ ]*)?|([^ @]+)@([^ @]+)`         \\ \hline     
     14Hex             & \verb`[ ](0x)?([a-fA-F0-9][a-fA-F0-9])+[.:,?! ]`              \\ \hline
    1515StarHeight              & \verb`[A-Z]((([a-zA-Z]*a[a-zA-Z]*[ ])*[a-zA-Z]*e[a-zA-Z]*[ ])*[a-zA-Z]*s[a-zA-Z]*[ ])*[.?!]`          \\ \hline
    1616\end{tabular}
     
    7474This expression demonstrates the overhead involved
    7575in matching the simplest possible regular expression, a single character.
    76 Date, Email, and URIOrEmail provide examples of commonly used regular expression.
     76Date, Email, and URI provide examples of commonly used regular expression.
    7777This set of expressions were modified from the \textit{Benchmark of Regex
    78 Libraries} (http://lh3lh3.users.sourceforge.net/reb.shtml).
    79 HexBytes matches delimited byte strings in hexadecimal notation, and
     78Libraries}.
     79Hex matches delimited byte strings in hexadecimal notation, and
    8080enforces the constraint that the number of hex digits is even. This
    8181expression illustrates the performance
    8282of a repetition operator implemented using
    83 a while loop.  StarHeight is an expression designed to further
    84 stress our while loop implementation with 4 levels of Kleene closure.
     83a while loop in our system.  StarHeight is an artificial expression designed to further
     84stress while loop implementation with 4 levels of Kleene closure.
    8585All tests were run on a version
    8686of a \textit{Linux 3Dfx howto}
     
    9696xtick=data,
    9797ylabel=Cycles per Byte,
    98 xticklabels={@,Date,Email,URIorEmail,HexBytes,StarHeight},
     98xticklabels={@,Date,Email,URI,Hex,StarHeight},
    9999tick label style={font=\tiny},
    100100enlarge x limits=0.15,
     
    156156lines matching the Email regex.
    157157
    158 The URIorEmail expression illustrates the performance of the
     158The URI expression illustrates the performance of the
    159159grep programs with additional regular expression complexity.
    160160As expressions get larger, the number of steps required by
    161161the Parabix implementation increases, so the performance
    162162advantage drops to about 4.5X over nrgrep and 19X over gre2p.
    163 32557 lines are matched by the URIorEmail regex.
    164 
    165 The results for HexBytes expression illustrate Parabix performance
    166 involving a Kleene-+ operator compiled to a while loop. 
    167 Our implementation uses just 1.6 cycles per byte to find the
     16332557 lines are matched by the URI regex.
     164
     165The results for Hex expression illustrate the bitstreams performance
     166in the case of a Kleene-+ operator compiled to a while loop.
     167Performance is nevertheless quite good;
     168our implementation uses just 1.6 cycles per byte to find the
    168169130,243 matching lines.    The gre2p program performs
    169170quite poorly here, slower than the Parabix implementation
    170171by about 70X, while nrgrep is about 5.5X slower.
    171172
    172 StarHeight is an artificial expression created to stress the Parabix
    173 implementation with a triply-nested while loop.   The performance
    174 does drop off, maintaining only a 2X advantage over nrgrep.
    175 
    176 
    177 
    178 
    179 \begin{figure}
    180 \begin{center}
    181 \begin{tikzpicture}
    182 \begin{axis}[
    183 xtick=data,
    184 ylabel=Instructions per Byte,
    185 xticklabels={@,Date,Email,URIorEmail,HexBytes,StarHeight},
    186 tick label style={font=\tiny},
    187 enlarge x limits=0.15,
    188 %enlarge y limits={0.15, upper},
    189 ymin=0,
    190 legend style={at={(0.5,-0.15)},
    191 anchor=north,legend columns=-1},
    192 ymax=110,
    193 ybar,
    194 bar width=7pt,
    195 ]
    196 \addplot
    197 file {data/instructions-bitstreams.dat};
    198 \addplot
    199 file {data/instructions-nrgrep112.dat};
    200 \addplot
    201 file {data/instructions-gre2p.dat};
    202  
    203 \legend{bitstreams,nrgrep,gre2p,Annot}
    204 \end{axis}
    205 \end{tikzpicture}
    206 \end{center}
    207 \caption{Instructions per Byte}\label{fig:SSEInstructionsPerByte}
    208 \end{figure}
    209 
    210 Figure~\ref{fig:SSEInstructionsPerByte} illustrates
    211 relative performance reported in instructions per byte.
    212 Figure~\ref{fig:SSEInstructionsPerByte}
    213 mirrors Figure~\ref{fig:SSECyclesPerByte}. The bitstreams
    214 implementation demonstrates stable performance
    215 characteristics across each of the test expressions. The
    216 gre2p implementation also shows
    217 relatively stable characteristics, whereas,
    218 nrgrep exhibits greater variability.
    219  
     173A more complex triply-nested repetition structure is required by
     174the bitstreams implementation of the StarHeight expression. 
     175In this case, there is a noticeable drop off in the
     176performance advantage of the bitstreams implementation over
     177the nrgrep and gre2p.   Nevertheless a 2X
     178advantage over nrgrep is maintained.
     179
    220180\begin{figure}
    221181\begin{center}
     
    224184xtick=data,
    225185ylabel=Instructions per Cycle,
    226 xticklabels={@,Date,Email,URIorEmail,HexBytes,StarHeight},
     186xticklabels={@,Date,Email,URI,Hex,StarHeight},
    227187tick label style={font=\tiny},
    228188enlarge x limits=0.15,
     
    248208\end{figure}
    249209
    250 \begin{figure}
    251 \begin{center}
    252 \begin{tikzpicture}
    253 \begin{axis}[
    254 xtick=data,
    255 ylabel=Branch Misses per Instruction,
    256 xticklabels={@,Date,Email,URIorEmail,HexBytes,StarHeight},
    257 tick label style={font=\tiny},
    258 enlarge x limits=0.15,
    259 %enlarge y limits={0.15, upper},
    260 ymin=0,
    261 legend style={at={(0.5,-0.15)},
    262 anchor=north,legend columns=-1},
    263 ybar,
    264 bar width=7pt,
    265 ]
    266 \addplot
    267 file {data/branch-misses-bitstreams.dat};
    268 \addplot
    269 file {data/branch-misses-nrgrep112.dat};
    270 \addplot
    271 file {data/branch-misses-gre2p.dat};
    272 
    273 \legend{bitstreams,nrgrep,gre2p,Annot}
    274 \end{axis}
    275 \end{tikzpicture}
    276 \end{center}
    277 \caption{Branch Misses per Instruction}\label{fig:SSEBranchMisses}
    278 \end{figure}
    279 Figure \ref{fig:SSEBranchMisses} reports branch misses per kilobyte.
    280 The bitstreams and gre2p implementations remain consistent.
    281 Significant variation is seen with nrgrep.
     210Figure \ref{fig:SSEInstructionsPerCycle} shows the efficiency of
     211processor utilization
     212achieved by the three programs on each of the test expression in
     213terms of instructions per cycle (IPC).
     214For the first four expressions, in particular,
     215the bitstreams implementation uses the processor resources quite efficiently,
     216avoiding penalties due to cache misses and branch mispredictions.   
     217However, with the while loop
     218structures in processing the Hex and StarHeight expressions, branch
     219mispredictions increase considerably and there is a noticeable
     220drop-off in IPC.  Both the gre2p and nrgrep suffer from significant
     221penalties due to 
     222mispredictions in the character-skipping logic and cache misses in
     223table lookups.  The performance of nrgrep, in particular drops off
     224with the growth in regulare expression complexity.
    282225
    283226Overall, the bitstreams implementation significantly outperformed
    284227both nrgrep and gre2p. In addition, the performance of bitstreams
    285 scaled smoothly with regular expression complexity. As Section~\ref{sec:analysis}
    286 describes, we anticipate that the performance of bitstreams
    287 will to continue to scale smoothly with
    288 regular expression complexity.
    289 
    290 
     228scales well with regular expression complexity.
     229
     230
Note: See TracChangeset for help on using the changeset viewer.