# Changeset 4782

Timestamp:
Sep 21, 2015, 12:42:53 PM
Message:

Figure placement according to Springer

Location:
docs/Working/icGrep
Files:
6 edited

• ## docs/Working/icGrep/Paper88.tex

 r4780 \title{Bitwise Data Parallelism with LLVM: The icGrep Case Study}
\author{Robert D. Cameron\inst{1 (}\Envelope\inst{)} \and Nigel Medforth\inst{1} \and Dan Lin\inst{1} \and Dale Denis\inst{1} \and William N. Sumner\inst{1}}
\institute{School of Computing Science, Simon Fraser University}
\input{background.tex}
%\input{paradigm.tex}
\input{unicode-re.tex}
• ## docs/Working/icGrep/architecture.tex

 r4501 \section{Architecture}\label{sec:architecture}

\paragraph{Regular Expression Preprocessing.}
\input{fig-compiler}
As shown in Fig.~\ref{fig:compiler}, compilation in \icGrep{} comprises three logical layers: the \RegularExpression{}, \Pablo{} and LLVM layers, each with its own intermediate representation (IR), transformation and compilation modules.
As we traverse the layers, the IR becomes more complex as it begins to mirror the final machine code.
The layering enables further optimization based on the information available at each stage.
The \REParser{} validates and transforms the input \RegularExpression{} into an abstract syntax tree (AST).
Successive \RegularExpression{} transformations exploit domain knowledge to optimize the regular expressions.
The aforementioned \texttt{toUTF8} transformation also applies during this phase to generate code unit classes.

The next layer transforms this AST into instructions in the \Pablo{} IR.
Recall that the \Pablo{} layer assumes a transposed view of the input data.
The \emph{\RegularExpressionCompiler{}} first transforms all input code unit classes, analogous to non-Unicode character classes, into bit stream equations; the results are then combined and transformed through a series of typical optimization passes, including dead code elimination (DCE), common subexpression elimination (CSE) and constant folding.
These optimizations exploit redundancies that are harder to recognize in the \RegularExpression{} AST itself.
The \PabloCompiler{} then directly converts the \Pablo{} IR into LLVM IR.
The LLVM compiler framework provides flexible APIs for compilation and linking; using these, \icGrep{} dynamically generates a match function for identifying occurrences of the regular expression in the input text.

\paragraph{Dynamic Grep Engine.}
\input{fig-executor}
Figure~\ref{fig:execution} shows the structure of the \icGrep{} matching engine.
The input data is transposed into 8 parallel bit streams through the Transposition module.
The Dynamic Matcher returns one bitstream that marks all the positions that fully match the compiled regular expression.
Finally, a Match Scanner scans through the returned bitstream to select the matching lines and generate the normal grep output.
We can also apply a pipeline parallelism strategy to further speed up \icGrep{}.
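As an illustration of what the Transposition module produces, the following Python sketch transposes a byte stream into 8 parallel bit streams, one per bit position. Plain integers stand in for the SIMD registers, and the high-to-low bit numbering is an assumption for this sketch, not taken from the paper:

```python
def transpose(data: bytes):
    """Transpose a byte stream into 8 parallel bit streams.

    Stream k holds, at text position i, bit (7 - k) of byte i,
    so stream 0 carries the most significant bit of every byte.
    """
    streams = [0] * 8
    for i, byte in enumerate(data):
        for k in range(8):
            if (byte >> (7 - k)) & 1:
                streams[k] |= 1 << i
    return streams

# 'a' = 0x61, 'b' = 0x62, 'c' = 0x63: streams 1 and 2 come out
# all ones, since every byte has bits 6 and 5 set.
streams = transpose(b"abc")
```

A real implementation performs this transposition with a short logarithmic sequence of SIMD pack operations rather than a per-bit loop; the loop form is only for clarity.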
• ## docs/Working/icGrep/background.tex

 r4781 \section{Background} \label{sec:background}

\paragraph{Unicode Regular Expressions.}
Traditional regular expression syntax is oriented towards

denotes the class of all non-ASCII lower case letters.

\paragraph{Bitwise Data Parallel Matching.}
Regular expression search using bitwise data parallelism has been

equation is applied until no new match positions result.

For example, Figure \ref{fig:bitwisematch} shows how the regular expression {\tt d[a-z]*ed} is matched against some input text using bitwise methods.
In this diagram we use periods to denote 0 bits so that the 1 bits stand out.
In the first step the character class stream {\tt [d]} is matched and the results shifted one position (Advance) to produce marker bitstream $M_1$.
Five matches indicated by marker bits are now in play simultaneously.
The next step applies the MatchStar operation to find all the matches that may then be reached with the Kleene-* repetition {\tt [a-z]*} ($M_2$).
This produces pending matches at many positions.
However, there is no need to consider these matches one at a time using lazy or greedy matching strategies; rather, the full marker stream $M_3$ of remaining possibilities after matching {\tt [e]} is easily computed using bitwise logic and shift.
The final step produces marker stream $M_4$ indicating the single position at which the entire regular expression is matched.

\begin{figure}
\vspace{-0.5em}
\end{figure}

The Parabix toolchain \cite{lin2012parabix} provides a set of compilers and run-time libraries that target the SIMD instructions of commodity

acceleration for traditional ASCII regular expression matching tasks, often 5$\times$ or better \cite{cameron2014bitwise}.
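The marker-stream steps described above can be sketched in Python over plain integers (bit $i$ = text position $i$). The MatchStar formula is the standard one from the bitwise data parallelism literature; the sample text is invented for this sketch and is not the input shown in the paper's figure:

```python
def bits(text, pred):
    """Character-class bitstream: bit i is set iff pred(text[i])."""
    return sum(1 << i for i, ch in enumerate(text) if pred(ch))

def advance(m):
    """Advance: shift every marker one position forward."""
    return m << 1

def match_star(m, c):
    """MatchStar(M, C) = (((M & C) + C) ^ C) | M.

    One long addition carries every marker inside a run of class C
    to the position just past the run, realizing C* for all markers
    simultaneously.
    """
    return (((m & c) + c) ^ c) | m

text = "dazed dead dot"
d  = bits(text, lambda ch: ch == 'd')
az = bits(text, lambda ch: 'a' <= ch <= 'z')
e  = bits(text, lambda ch: ch == 'e')

m1 = advance(d)          # markers after matching 'd'
m2 = match_star(m1, az)  # ... after '[a-z]*'
m3 = advance(m2 & e)     # ... after 'e'
m4 = advance(m3 & d)     # ... after the final 'd'

ends = [i for i in range(len(text) + 1) if (m4 >> i) & 1]
# only "dazed" matches d[a-z]*ed, so the single marker ends at position 5
```

Note that all candidate matches are carried forward in parallel; no backtracking over the five initial `d` markers is ever needed.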

 r4781 In this section, we report on the evaluation of \icGrep{} performance, looking at three aspects.
First, we discuss some performance aspects of \icGrep{}'s internal methods, looking at the impact of the optimizations discussed previously.
Then we move on to a systematic performance study of \icGrep{} with named Unicode property searches in comparison to two contemporary competitors, namely pcre2grep, released in January 2015, and ugrep of the ICU 54.1 software distribution.
Finally, we examine complex expressions and the impact of multithreading \icGrep{} on Intel i7-2600 (3.4GHz) and i7-4700MQ (2.4GHz) processors.

with many long tags.

In order to short-circuit processing when no remaining matches are possible in a block, our regular expression compiler periodically inserts if-statements into the generated code; the number of pattern elements between each if-test can be selected with the {\tt -if-insertion-gap=} option.
The default value in \icGrep{} is 3; setting the gap to 100 effectively turns off if-insertion.
Eliminating if-insertion sometimes improves performance by avoiding the extra if-tests and branch mispredictions.
For patterns with long strings, however, there can be a substantial slowdown.

The precompiled calculations of the various Unicode properties

expressions were removed because they involved properties not supported by pcre2grep.
In the end, 246 test expressions were constructed in this process.
We selected a set of Wikimedia XML files in several major languages, representing most of the world's major language families, as a test corpus.
For each program under test, we performed searches for each regular expression against each XML document.
Performance is reported in CPU cycles per byte on an Intel i7-2600 machine.
The results are presented in Fig.~\ref{fig:property_test}.
They are ranked by the percentage of matching lines found in the XML document and grouped in 5\% increments.
When comparing the three programs, \icGrep{} exhibits dramatically better performance, particularly when searching for rare items.
The performance of both pcre2grep and ugrep improves (i.e., CPU cycles per byte decrease) as the percentage of matching lines increases.
This occurs because each match allows them to bypass processing the rest of the line.
On the other hand, \icGrep{} shows a slight drop-off in performance as the number of matches found increases.
This is primarily due to property classes that include large numbers of codepoints.
These classes require more bitstream equations to calculate and also have a greater probability of matching.
Nevertheless, the performance of \icGrep{} in matching the defined property expressions is stable and well ahead of the competitors in all cases.
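The if-insertion short-circuit described above can be modeled with a small Python sketch. This is a hypothetical model of the generated code, not icGrep's actual LLVM output; the pattern is represented as a list of (operation, class-bitstream) pairs, and the MatchStar formula is the standard bitwise one:

```python
def match_block(markers, elements, if_insertion_gap=3):
    """Apply compiled pattern elements to one block's marker stream.

    Every `if_insertion_gap` elements, test whether any marker bits
    survive; if none do, no match can complete in this block, so the
    generated code can exit early (the if-insertion short-circuit).
    """
    m = markers
    for i, (op, c) in enumerate(elements):
        if op == "star":                  # Kleene-* over class c
            m = (((m & c) + c) ^ c) | m
        else:                             # match one character of class c
            m = (m & c) << 1
        if (i + 1) % if_insertion_gap == 0 and m == 0:
            return 0                      # short-circuit: no matches remain
    return m

# d[a-z]*ed over the 5-character text "dazed" (bit i = position i)
d, az, e = 0b10001, 0b11111, 0b01000
elements = [("one", d), ("star", az), ("one", e), ("one", d)]
final = match_block(0b11111, elements)    # all positions live initially
```

Setting `if_insertion_gap=100` in this model mirrors the effect of the `-if-insertion-gap=100` option: the zero-test is never reached, so every element is always evaluated.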
\begin{figure}
\end{figure}

\subsection{Complex Expressions}

This study evaluates the comparative performance of the matching engines on a series of more complex expressions, shown in Table \ref{table:regularexpr}.
The first two are alphanumeric (\AN{}) expressions, differing only in that the first one is anchored to match the entire line.
The third searches for lines consisting of text in Arabic script.
The fourth expression is a published currency expression taken from Stewart and Uckelman~\cite{stewart2013unicode}.
An expression matching runs of six or more Cyrillic script characters enclosed in initial/opening and final/ending punctuation is fifth in the list.
The final expression is an email expression that allows internationalized names.

\begin{table}\centering
% requires booktabs
\end{table}

In Table \ref{table:complexexpr}, we show the performance results obtained on an Intel i7-2600 using generic 64-bit binaries for each engine.
We limit the SIMD ISA within \icGrep{} to SSE2, which is available on all Intel/AMD 64-bit machines.
In each case, we report seconds taken per GB of input, averaged over 10 runs each on our Wikimedia document collection.
\end{table}

The most striking aspect of Table \ref{table:complexexpr} is that both ugrep and pcregrep show dramatic slowdowns with ambiguities in regular expressions.
This is most clearly illustrated in the different performance figures for the two Alphanumeric test expressions, but is also evident in the Arabic, Currency and Email expressions.
In contrast, \icGrep{} maintains consistently fast performance in all test scenarios.
The multithreaded \icGrep{} shows a speedup in every case, but balancing the workload across multiple cores is clearly an area for further work.
Nevertheless, our three-thread system shows up to a 40\% speedup.
Table \ref{table:relperf} shows the speedups obtained with \icGrep{} on a newer Intel i7-4700MQ machine, considering three SIMD ISA alternatives.
All speedups are relative to the base single-threaded SSE2 performance on this machine, which is given in seconds per GB in the first column.
The SSE2 results are again using the generic binaries compiled for compatibility with all 64-bit processors.
The AVX1 results are for Intel AVX instructions in 128-bit mode.
The main advantage of AVX1 over SSE2 is its support for 3-operand form; the AVX2 results are somewhat mixed due to the limitations of 256-bit addition.
Combining the AVX2 ISA with multithreading gives an average overall 61\% speedup compared to the base.

• ## docs/Working/icGrep/introduction.tex

 r4605 \section{Introduction}
Although well-established technical standards exist for

The remainder of this paper is organized as follows.
Section \ref{sec:background} presents background material dealing with Unicode regular expressions, the Parabix framework and regular expression matching techniques using bitwise data parallelism.
Section \ref{sec:Unicode} addresses the issues and performance challenges associated with meeting Unicode

• ## docs/Working/icGrep/unicode-re.tex

 r4781 over UTF-8 data streams is to translate Unicode regular expressions over codepoints to corresponding regular expressions over sequences of UTF-8 bytes or \emph{code units}.
The {\tt toUTF8} transformation performs this as a regular expression transformation, transforming input expressions such as \verb:\u{244}[\u{2030}-\u{2137}]: to the corresponding UTF-8 representation:
\newline

{ stream to ``fill in the gaps'' in the $\mbox{UnicodeClass}(U)$ bitstream so that the MatchStar addition can move through a contiguous sequence of one bits.
In this way, matching of an arbitrary Unicode character class $U$ can be implemented using ${\mbox{MatchStar}(m, \mbox{UnicodeClass}(U) \vee \mbox{NonFinal})}$.

contain long runs of text confined to one or a few ranges.
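The codepoint-to-code-unit translation behind {\tt toUTF8} can be made concrete with a hand-written UTF-8 encoder. This sketch covers only the 2- and 3-byte forms needed for the \verb:\u{244}[\u{2030}-\u{2137}]: example; the function name is illustrative:

```python
def utf8(cp):
    """UTF-8 encode a codepoint by hand (2- and 3-byte forms only)."""
    if cp < 0x80:                        # 1-byte form: 0xxxxxxx
        return [cp]
    if cp < 0x800:                       # 2-byte form: 110xxxxx 10xxxxxx
        return [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)]
    # 3-byte form: 1110xxxx 10xxxxxx 10xxxxxx
    return [0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F)]

assert utf8(0x244) == [0xC9, 0x84]           # a fixed 2-byte sequence
assert utf8(0x2030) == [0xE2, 0x80, 0xB0]    # both range endpoints share
assert utf8(0x2137) == [0xE2, 0x84, 0xB7]    # the lead byte 0xE2
```

Because the endpoints of the range \verb:[\u{2030}-\u{2137}]: share the lead byte 0xE2 while their continuation bytes differ, the class translates into an alternation of three-byte sequences whose trailing positions are byte-range classes, which is exactly the kind of code unit class the later compilation stages consume.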