Changeset 4782 for docs/Working


Ignore:
Timestamp:
Sep 21, 2015, 12:42:53 PM (4 years ago)
Author:
cameron
Message:

Figure placement according to Springer

Location:
docs/Working/icGrep
Files:
6 edited

Legend:

Unmodified
Added
Removed
  • docs/Working/icGrep/Paper88.tex

    r4780 r4782  
    3030
    3131\title{Bitwise Data Parallelism with LLVM: The icGrep Case Study}
    32 \author{Robert D. Cameron \inst{1 (}\Envelope\inst{)}
    33 \and Nigel Medforth \inst{1}
    34 \and Dan Lin \inst{1}
    35 \and Dale Denis \inst{1}
    36 \and William N. Sumner \inst{1}
     32\author{Robert D. Cameron\inst{1 (}\Envelope\inst{)}
     33\and Nigel Medforth\inst{1}
     34\and Dan Lin\inst{1}
     35\and Dale Denis\inst{1}
     36\and William N. Sumner\inst{1}
    3737}
    3838\institute{School of Computing Science, Simon Fraser University}
     
    5959\input{background.tex}
    6060
    61 %\input{paradigm.tex}
    62 
    6361\input{unicode-re.tex}
    6462
  • docs/Working/icGrep/architecture.tex

    r4501 r4782  
    11\section{Architecture}\label{sec:architecture}
     2
    23\paragraph{Regular Expression Preprocessing.}
     4As shown in Fig.~\ref{fig:compiler},
     5compilation in \icGrep{} comprises three logical layers: \RegularExpression{},
     6\Pablo{} and the LLVM layer, each with their own intermediate representation
     7(IR), transformation and compilation modules.
     8As we traverse the layers, the IR becomes more complex as it begins to mirror the final machine code.
     9The layering enables further optimization based on information available at each stage.
     10The \REParser{} validates and transforms the input \RegularExpression{} into an abstract syntax tree (AST).
     11Successive \RegularExpression{} Transformations exploit domain
     12knowledge to optimize the regular expressions.
     13The aforementioned \texttt{toUTF8} transformation also applies during this phase to generate code unit classes.
    314
    415\input{fig-compiler}
    516
    6 As shown in Figure \ref{fig:compiler},
    7 compilation in \icGrep{} comprises three logical layers: \RegularExpression{},
    8 \Pablo{} and the LLVM layer, each with their own intermediate representation
    9 (IR), transformation and compilation modules.
    10 %
    11 As we traverse the layers, the IR becomes more complex as it begins to mirror the final machine code.
    12 %
    13 The layering enables further optimization based on information available at each stage.
    14 %
    15 The \REParser{} validates and transforms the input \RegularExpression{} into an abstract syntax tree (AST).
    16 %
    17 %The AST is a minimalistic representation that, unlike traditional \RegularExpression{}, is not converted into a NFA or DFA for further processing.
    18 %
    19 %Instead, \icGrep{} passes the AST into the transformation module, which includes a set of \RegularExpression{} specific optimization passes.
    20 %
    21 Successive \RegularExpression{} Transformations exploit domain
    22 knowledge to optimize the regular expressions.
    23 %
    24 %An initial \emph{Nullable} pass, determines whether the \RegularExpression{}
    25 %contains prefixes or suffixes that may be removed or
    26 %modified whilst matching the same lines of text as the original expression.
    27 %
    28 %For example, ``\verb|a*bc+|'' is equivalent to ``\verb|bc|'' because the Kleene Star (Plus) operator matches zero (one) or more instances of a
    29 %specific character.
    30 %
    31 The aforementioned \texttt{toUTF8} transformation also applies during this phase to generate code unit classes.
    32 %The \emph{toUTF8} pass converts the Unicode character classes in the input \RegularExpression{} into equivalent expression(s) that represent sequences
    33 %of 8-bit code units necessary to identify occurrences of the class.
    34 %
    35 %Since some characters have multiple logically equivalent representations, such as \textcolor{red}{\textbf{????}}, this may produce nested sequences or alternations.
    36 %
    37 %This is described in more detail in \S\ref{sec:Unicode:toUTF8}.
    38 %
    39 %A final \emph{Simplification} pass flattens nested structures into their simplest legal form.
    40 %
    41 %For example, ``\verb`a(b((c|d)|e))`'' becomes ``\verb`ab(c|d|e)`'' and ``\verb`([0-9]{3,5}){3,5}`'' becomes ``\verb`[0-9]{9,25}`''.
    42 %
    43 
    44 %% DISCUSS ANALYSIS MODULE?
    45 
    4617
    4718The next layer transforms this AST into the instructions in the \Pablo{} IR.
    48 %
    49 %has two compilers: the \CodeUnitCompiler{} and \RegularExpressionCompiler{}, both of which produce \Pablo{} IR.
    50 %
    5119Recall that the \Pablo{} layer assumes a transposed view of the input data.
    52 %
    5320The \emph{\RegularExpressionCompiler{}} first transforms all input code unit classes,
    5421analogous to non-Unicode character classes,
     
    6330These optimizations exploit redundancies that are harder to recognize in
    6431the \RegularExpression{} AST itself.
    65 %
    66 %The \emph{\CodeUnitCompiler{}} transforms the input code unit classes,
    67 %either extracted from the \RegularExpression{} or produced by the \emph{toUTF8} transformation,
    68 %into a series of bit stream equations.
    69 %
    70 %The \emph{\RegularExpressionCompiler{}}
    71 %assumes that these have been calculated and
    72 %transforms the \RegularExpression{} AST into
    73 %a sequence of instructions.
    74 %\Pablo{} instructions that use the results of these equations.
    75 %
    76 %For instance, it converts alternations into a sequence of calculations that are merged with \verb|OR|s.
    77 %
    78 %The results of these passes are combined and transformed through a series of typical optimization passes, including dead code elimination
    79 %(DCE), common subexpression elimination (CSE), and constant folding.
    80 %
    81 %These optimizations exploit redundancies that are harder to recognize in the \RegularExpression{} AST itself.
    82 %These are necessary at this stage because the \RegularExpression{} AST may include common subsequences that are costly to recognize in
    83 %that form.
    84 %
    85 %Similarly, to keep the \CodeUnitCompiler{} a linear time function, it may introduce redundant IR instructions as it applies traditional Boolean
    86 %algebra transformations, such as de Morgan's law, to the computed streams.
    87 %
    88 %An intended side-effect of these passes is that they eliminate the need to analyze the data-dependencies inherent in the carry-bit logic,
    89 %which is necessary for some \Pablo{} instructions but problematic for optimizers to reason about non-conservatively.
    90 %
    9132
    9233The \PabloCompiler{} then directly converts the \Pablo{} IR into LLVM IR.
    93 %
    94 %This is a relatively straightforward conversion:
    95 %
    96 %the only complexities it introduces is the generation of Phi nodes, linking of statically-compiled functions, and assignment of carry variables.
    97 %
    9834The LLVM Compiler framework provides flexible APIs for compilation and linking.
    9935Using these, \icGrep{} dynamically generates a match function for identifying
     
    10137
    10238\paragraph{Dynamic Grep Engine.}
    103 \input{fig-executor}
    10439Figure~\ref{fig:execution} shows the structure of the \icGrep{} matching engine.
    10540The input data is transposed into 8 parallel bit streams through the Transposition module.
     
    11146The Dynamic Matcher returns one bitstream that marks all the positions that fully match the compiled regular expression.
    11247Finally, a Match Scanner scans through the returned bitstream to select the matching lines and generate the normal grep output.
     48
     49\input{fig-executor}
    11350
    11451We can also apply a pipeline parallelism strategy to further speed up the process of \icGrep{}.
  • docs/Working/icGrep/background.tex

    r4781 r4782  
    11\section{Background} \label{sec:background}
    22
    3 %\subsection{Unicode Regular Expression Requirements}
    43\paragraph{Unicode Regular Expressions.}
    54Traditional regular expression syntax is oriented towards
     
    4140denotes the class of all non-ASCII lower case letters.
    4241
    43 %\subsection{Bitwise Data Parallel Regular Expression Matching}
    4442\paragraph{Bitwise Data Parallel Matching.}
    4543Regular expression search using bitwise data parallelism has been
     
    8987equation is applied until no new match positions result.
    9088
     89For example,
     90Figure \ref{fig:bitwisematch} shows how the regular expression {\tt d[a-z]*ed} is matched
     91against some input text using bitwise methods.  In this diagram we use periods to denote
     920 bits so that the 1 bits stand out.
     93In the first step the character class stream
     94{\tt [d]} is matched and the results shifted one position (Advance) to produce marker bitstream $M_1$.
     95Five matches indicated by marker bits are now in play simultaneously. 
     96The next step applies the  MatchStar operation to find all the matches that may then be
     97reached with the Kleene-* repetition
     98{\tt [a-z]*} ($M_2$).   This produces pending matches at many positions.  However, there
     99is no need to consider these matches one at a time using lazy or greedy matching strategies.
     100Rather, the full marker stream $M_3$ of remaining possibilities after matching {\tt [e]} is easily
     101computed using bitwise logic and shift.
     102The final step produces marker stream $M_4$ indicating the single position
     103at which the entire regular expression is matched.
     104
    91105\begin{figure}
    92106\vspace{-0.5em}
     
    105119\end{figure}
    106120
    107 For example,
    108 Figure \ref{fig:bitwisematch} shows how the regular expression {\tt d[a-z]*ed} is matched
    109 against some input text using bitwise methods.  In this diagram we use periods to denote
    110 0 bits so that the 1 bits stand out.
    111 In the first step the character class stream
    112 {\tt [d]} is matched and the results shifted one position (Advance) to produce marker bitstream $M_1$.
    113 Five matches indicated by marker bits are now in play simultaneously. 
    114 The next step applies the  MatchStar operation to find all the matches that may then be
    115 reached with the Kleene-* repetition
    116 {\tt [a-z]*} ($M_2$).   This produces pending matches at many positions.  However, there
    117 is no need to consider these matches one at a time using lazy or greedy matching strategies.
    118 Rather, the full marker stream $M_3$ of remaining possibilities after matching {\tt [e]} is easily
    119 computed using bitwise logic and shift.
    120 The final step produces marker stream $M_4$ indicating the single position
    121 at which the entire regular expression is matched.
    122 
    123121The Parabix toolchain \cite{lin2012parabix} provides a set of compilers
    124122and run-time libraries that target the SIMD instructions of commodity
     
    130128acceleration for traditional ASCII regular expression matching tasks,
    131129often 5$\times$ or better \cite{cameron2014bitwise}.
    132 
    133 
    134 % \comment{
    135 %
    136 % \subsection{LLVM}
    137 %
    138 % The LLVM compiler infrastructure is a set of modular compiler components and tools
    139 % organized around a powerful generic intermediate representation (LLVM IR) that is
    140 % agnostic with respect to source language and code-generation targets.   
    141 % Beginning as an MSc research project at the University of Illinois \cite{lattner2004llvm},
    142 % LLVM is now an open-source codebase supported by a broad community of
    143 % researchers, developers and commercial organizations.   
    144 %
    145 % LLVM features a flexible multi-stage compilation structure that can be organized in
    146 % passes in many ways, including fully dynamic just-in-time compilation.   Of particular
    147 % importance to the \icGrep{} project, the LLVM IR supports arbitrarily-sized vectors of
    148 % arbitrary-width integers, well suited to code generation
    149 % targetting the SIMD integer instructions of commodity processors.
    150 %
    151 %
    152 %
    153 % \subsection{Parabix Regular Expression Matching}
    154 % }
  • docs/Working/icGrep/evaluation.tex

    r4781 r4782  
    44
    55In this section, we report on the evaluation of \icGrep{} performance, looking at three aspects.   
    6 %
    76First, we discuss some performance aspects of \icGrep{} internal methods, looking at the impact of optimizations discussed previously.
    8 %
    97Then we move on to a systematic performance study of \icGrep{} with named Unicode property searches in comparison to two contemporary competitors,
    108namely, pcre2grep released in January 2015 and ugrep of the ICU 54.1 software distribution.
    11 %
    129Finally, we examine complex expressions and the impact of multithreading \icGrep{} on an
    1310Intel i7-2600 (3.4GHz) and i7-4700MQ (2.4GHz) processor.
     
    3431with many long tags. 
    3532
    36 %The {\tt -disable-log2-bounded-repetition} flag allows these
    37 %effectiveness of the special techniques for bounded repetition of
    38 %byte classes to be assessed.   A slowdown of 30\% was observed
    39 %with the searches using the regular expression
    40 %\verb:(^|[ ])[a-zA-Z]{11,33}([.!? ]|$):, for example.
    41 
    4233In order to short-circuit processing when no remaining matches
    4334are possible in a block, our regular expression compiler periodically inserts
     
    4637generated code, the number of pattern elements between each if-test %non-nullable
    4738can be selected with the {\tt -if-insertion-gap=} option.   
    48 %
    4939The default value in \icGrep{} is 3; setting the gap to 100 effectively
    5040turns off if-insertion. 
    51 %
    5241Eliminating if-insertion sometimes improves performance by avoiding the extra if tests and branch mispredictions.
    53 %
    5442For patterns with long strings, however, there can be a substantial slowdown.
    5543
    56 %; searching for a pattern of length 40 slows down by more
    57 %than 50\% without the if-statement short-circuiting. %%% I think we'd need to show this always true to make this claim.
    58 
    59 % \comment{
    60 % Additionally, \icGrep{} provides options that allow
    61 % various internal representations to be printed out.   
    62 % %
    63 % These can aid in understanding and/or debugging performance issues.
    64 % For example, the option
    65 % {\tt -print-REs} shows the parsed regular expression as it goes
    66 % through various transformations.   The internal \Pablo{} code generated
    67 % may be displayed with {\tt -print-pablo}.  This can be quite useful in
    68 % helping understand the match process.   It also possible to print out the
    69 % generated LLVM IR code ({\tt -dump-generated-IR}), but this may be
    70 % less useful as it includes many
    71 % details of low-level carry-handling that obscures the core logic.
    72 % }
    7344
    7445The precompiled calculations of the various Unicode properties
     
    10475expressions were removed because they involved properties not supported by pcre2grep.
    10576In the end 246 test expressions were constructed in this process.
     77
     78We selected a set of Wikimedia XML files in several major languages representing
     79most of the world's major language families as a test corpus.
     80For each program under test, we performed searches for each regular expression against each XML document.
     81Performance is reported in CPU cycles per byte on an Intel i7-2600 machine.   
     82The results are presented in Fig.~\ref{fig:property_test}.
     83They were ranked by the percentage of matching lines found in the XML document and grouped in 5\% increments. 
     84When comparing the three programs, \icGrep{} exhibits dramatically better performance, particularly when searching for rare items.
     85The performance of both pcre2grep and ugrep improves (has a reduction in CPU cycles per byte) as the percentage of matching lines increases.
     86This occurs because each match allows them to bypass processing the rest of the line.
     87On the other hand, \icGrep{} shows a slight drop-off in performance with the number of matches found.   
     88This is primarily due to property classes that include large numbers of codepoints.   
     89These classes require more bitstream equations for calculation and also have a greater probability of matching.   
     90Nevertheless, the performance of \icGrep{} in matching the defined property expressions is stable and well ahead of the competitors in all cases.
    10691
    10792\begin{figure}
     
    137122\end{figure}
    138123
    139 We selected a set of Wikimedia XML files in several major languages representing
    140 most of the world's major language families as a test corpus.
    141 %
    142 For each program under test, we performed searches for each regular expression against each XML document.
    143 %
    144 Performance is reported in CPU cycles per byte on an Intel i7-2600 machine.   
    145 %
    146 The results are presented in Fig.~\ref{fig:property_test}.
    147 %
    148 They were ranked by the percentage of matching lines found in the XML document and grouped in 5\% increments. 
    149 %
    150 When comparing the three programs, \icGrep{} exhibits dramatically better performance, particularly when searching for rare items.
    151 %
    152 The performance of both pcre2grep and ugrep improves (has a reduction in CPU cycles per byte) as the percentage of matching lines increases.
    153 %
    154 This occurs because each match allows them to bypass processing the rest of the line.
    155 %
    156 On the other hand, \icGrep{} shows a slight drop-off in performance with the number of matches found.   
    157 %
    158 This is primarily due to property classes that include large numbers of codepoints.   
    159 %
    160 These classes require more bitstream equations for calculation and also have a greater probability of matching.   
    161 %
    162 Nevertheless, the performance of \icGrep{} in matching the defined property expressions is stable and well ahead of the competitors in all cases.
    163 
    164 
    165124\subsection{Complex Expressions}
     125
     126This study evaluates the comparative performance of the matching engines on a
     127series of more complex expressions, shown in Table \ref{table:regularexpr}.
     128The first two are alphanumeric (\AN{}) expressions, differing only in that the first
     129one is anchored to match the entire line.
     130The third searches for lines consisting of text in Arabic script.
     131The fourth expression is a published currency expression taken from
     132Stewart and Uckelman~\cite{stewart2013unicode}.
     133An expression matching runs of six or more Cyrillic script characters enclosed
     134in initial/opening and final/ending punctuation is fifth in the list.
     135The final expression is an email expression that allows internationalized names.
    166136
    167137\begin{table}\centering % requires booktabs
     
    189159\end{table}
    190160
    191 This study evaluates the comparative performance of the matching engines on a
    192 series of more complex expressions, shown in Table \ref{table:regularexpr}.
    193 %
    194 The first two are alphanumeric (\AN{}) expressions, differing only in that the first
    195 one is anchored to match the entire line.
    196 %
    197 The third searches for lines consisting of text in Arabic script.
    198 %
    199 The fourth expression is a published currency expression taken from
    200 Stewart and Uckelman~\cite{stewart2013unicode}.
    201 %
    202 An expression matching runs of six or more Cyrillic script characters enclosed
    203 in initial/opening and final/ending punctuation is fifth in the list.
    204 %
    205 The final expression is an email expression that allows internationalized names.
    206 %
    207161In Table \ref{table:complexexpr}, we show the performance results obtained
    208162from an Intel i7-2600 using generic 64-bit binaries for each engine.
    209163We limit the SIMD ISA within \icGrep{} to SSE2 which is available
    210164on all Intel/AMD 64-bit machines.
    211 %
    212165In each case, we report seconds taken per GB of input averaged over 10
    213166runs each on our Wikimedia document collection.
     
    235188\end{table}
    236189
    237 
    238190The most striking aspect of Table \ref{table:complexexpr} is that both ugrep and pcregrep
    239191show dramatic slowdowns with ambiguities in regular expressions.
    240 %
    241192This is most clearly illustrated in the different performance figures
    242193for the two Alphanumeric test expressions but is also evident in the
    243194Arabic, Currency and Email expressions.   
    244 %
    245195Contrastingly, \icGrep{} maintains consistently fast performance in all test scenarios. 
    246 %
    247196The multithreaded \icGrep{} shows speedup in every case but balancing
    248197of the workload across multiple cores is clearly an area for further work.
    249 %
    250198Nevertheless, our three-thread system shows up to a 40\% speedup. %  over the single threaded version
    251199
    252 
    253 
    254 %
    255200Table \ref{table:relperf} shows the speedups obtained with \icGrep{}
    256201on a newer Intel i7-4700MQ machine, considering three SIMD ISA alternatives
     
    258203All speedups are relative to the base single-threaded SSE2 performance on this machine,
    259204which is given in seconds per GB in the first column.
    260 %
    261205The SSE2 results are again using the generic binaries compiled for compatibility
    262206with all 64-bit processors.   
    263 %
    264207The AVX1 results are for Intel AVX instructions
    265208in 128-bit mode.  The main advantage of AVX1 over SSE2 is its support for 3-operand form,
     
    294237but some mixed results due to the limitations of 256 bit addition.   Combining
    295238the AVX2 ISA with multithreading gives and average overall 61\% speedup compared to base.
    296 
    297 
    298 
    299 
    300 
    301 
    302 
    303 
    304 
    305 % \subsection{Single vs. Multithreaded Performance}
    306 %
    307 %
    308 % \begin{figure}
    309 % \begin{center}
    310 % \pgfplotstableread[col sep = comma]{data/icgrep-scatter-mt.csv}\base
    311 % \pgfplotstableread[col sep = comma]{data/icgrep-mt-scatter-mt.csv}\mt
    312 % \pgfplotstableread[col sep = comma]{data/icgrep-mt3-scatter-mt.csv}\mtt
    313 % \pgfplotstableread[col sep = comma]{data/icgrep-flat-scatter-mt.csv}\flat
    314 % \begin{tikzpicture}
    315 % \begin{axis}[
    316 % grid=both,
    317 % x tick label style={ /pgf/number format/1000 sep=},
    318 % ylabel={Seconds Per GB ($1000^3$)},
    319 % xlabel={Percentage of Matching Lines},
    320 % minor y tick num={1},
    321 % ymin=0,ymax=3,
    322 % xmax=100,
    323 % height=0.5\textwidth,
    324 % legend style={at={(1.05,0.5)},
    325 % anchor=west,legend columns=1,
    326 % align=left,draw=none,column sep=2ex}
    327 % ]
    328 % \addplot+[sharp plot, no markers,line width=2pt,color=blue!60,solid] table {\base};
    329 % \addplot+[sharp plot, no markers,line width=2pt,color=red!60,solid] table {\mt};
    330 % \addplot+[sharp plot, no markers,line width=2pt,color=brown!60,solid] table {\mtt};
    331 % %\addplot+[no markers,line width=2pt,color=green!60,solid] table {\flat};
    332 % \legend{icGrep (Base),icGrep (MT2),icGrep (MT3), icGrep (Flat)}
    333 % \end{axis}
    334 %
    335 %
    336 % \end{tikzpicture}
    337 % \end{center}
    338 % \caption{Multithreading performance}\label{fig:performance_test}
    339 % \end{figure}
  • docs/Working/icGrep/introduction.tex

    r4605 r4782  
    11\section{Introduction}
    2 
    32
    43Although well-established technical standards exist for
     
    8584The remainder of this paper is organized as follows.   Section \ref{sec:background}
    8685presents background material dealing with Unicode regular expressions,
    87 %LLVM,
    8886the Parabix framework and regular expression matching techniques
    8987using bitwise data parallelism.   
    90 %Section \ref{sec:seqpar} expands on previous work on bitwise data
    91 %parallelism by more fully characterizing the paradigm and documenting
    92 %important techniques.
    9388Section \ref{sec:Unicode} addresses
    9489the issues and performance challenges associated with meeting Unicode
  • docs/Working/icGrep/unicode-re.tex

    r4781 r4782  
    1111over UTF-8 data streams is to translate Unicode regular expressions
    1212over codepoints to corresponding regular expressions over
    13 sequences of UTF-8 bytes or \emph{code units}.   The {\tt toUTF8} transformation
    14 performs this as a regular expression transformation, transforming
     13sequences of UTF-8 bytes or \emph{code units}.   The {\tt toUTF8} function
     14does the work, transforming
    1515input expressions such as `\verb:\u{244}[\u{2030}-\u{2137}]:'
    16 to the corresponding UTF-8 regular expression consisting of the series of sequences and alternations shown below:
     16to the corresponding UTF-8 representation:
    1717\newline
    1818{
     
    121121stream  to ``fill in the gaps'' in the UnicodeClass($U$) bitstream so that the
    122122 MatchStar addition can move through a contiguous sequence of one
    123  bits.  In this way, matching of an arbitrary Unicode character class
    124  $U$
     123 bits.  In this way, matching of an arbitrary Unicode character class $U$
    125124can be implemented using ${\mbox{MatchStar}(m, \mbox{UnicodeClass}(U) \vee \mbox{NonFinal})}$.
    126125
     
    137136contain long runs of text confined to one or a few ranges.
    138137
    139 %\subsection{Character Class Intersection and Difference}
    140 %\subsection{Unicode Case-Insensitive Matching}
Note: See TracChangeset for help on using the changeset viewer.