Changeset 4466 for docs


Ignore:
Timestamp:
Feb 6, 2015, 9:37:37 AM (4 years ago)
Author:
cameron
Message:

Evaluation of optimization; diagram tweaks

Location:
docs/Working/icGrep
Files:
4 edited

Legend:

Unmodified
Added
Removed
  • docs/Working/icGrep/architecture.tex

    r4446 r4466  
    11\section{Architecture}\label{sec:architecture}
    2 
    3 \include{fig-compiler}
    4 
    52\subsection{Regular Expression Preprocessing}
    63
     4\input{fig-compiler}
     5
     6As show in Figure \ref{fig:compiler},
     7icGrep is composed of three logical layers: \RegularExpression{}, \Pablo{} and the LLVM layer, each with their own intermediate representation
     8(IR), transformation and compilation modules.
     9%
     10As we traverse the layers, the IR becomes significantly more complex as it begins to mirror the final machine code.
     11%
     12The \REParser{} validates and transforms the input \RegularExpression{} into an abstract syntax tree (AST).
     13%
     14The AST is a minimalistic representation that, unlike traditional \RegularExpression{}, is not converted into a NFA or DFA for further processing.
     15%
     16Instead, icGrep passes the AST into the transformation module, which includes a set of \RegularExpression{} specific optimization passes.
     17%
     18The initial \emph{Nullable} pass, determines whether the \RegularExpression{} contains any prefixes or suffixes that may be removed or
     19modified whilst still providing the same number of matches as the original expression.
     20%
     21For example, ``\verb|a*bc+|'' is equivalent to ``\verb|bc|'' because the Kleene Star (Plus) operator matches zero (one) or more instances of a
     22specific character.
     23%
     24The \emph{toUTF8} pass converts the characters in the input \RegularExpression{} into the equivalent expression(s) that represent the sequences
     25of 8-bit code units necessary to identify the presence of a particular character.
     26%
     27Since some characters have multiple logically equivalent representations, such as \textcolor{red}{\textbf{????}}, this may produce nested sequences or alternations.
     28%
     29This is described in more detail in \S\ref{sec:Unicode:toUTF8}.
     30%
     31To alleviate this, the final \emph{Simplification} pass flattens nested sequences and alternations into their simplest legal form.
     32%
     33For example, ``\verb`a(b((c|d)|e))`'' would become ``\verb`ab(c|d|e)`'' and ``\verb`([0-9]{3,5}){3,5}`'', ``\verb`[0-9]{9,25}`''.
     34%
     35
     36%% DISCUSS ANALYSIS MODULE?
     37
     38
     39The \RegularExpression{} layer has two compilers: the \CodeUnit{} and \RegularExpressionCompiler{}, both of which produce \Pablo{} IR.
     40%
     41Recall that the \Pablo{} layer assumes a transposed view of the input data.
     42%
     43The \emph{\CodeUnitCompiler{}} transforms the input code unit classes, either extracted from the \RegularExpression{} or produced by the
     44\emph{toUTF8} transformation, into a series of bit stream equations.
     45%
     46The \emph{\RegularExpressionCompiler{}} assumes that these have been calculated and transforms the \RegularExpression{} AST into
     47a sequence of instructions.
     48%
     49For instance, it would convert any alternations into a sequence of calculations that are merged with \verb|OR|s.
     50%
     51The results of these passes are combined and transformed through a series of typical optimization passes, including dead code elimination
     52(DCE), common subexpression elimination (CSE), and constant folding.
     53%
     54These are necessary at this stage because the \RegularExpression{} AST may include common subsequences that are costly to recognize in
     55that form.
     56%
     57Similarly, to keep the \CodeUnitCompiler{} a linear time function, it may introduce redundant IR instructions as it applies traditional Boolean
     58algebra transformations, such as de Morgan's law, to the computed streams.
     59%
     60An intended side-effect of these passes is that they eliminate the need to analyze the data-dependencies inherent in the carry-bit logic,
     61which is necessary for some \Pablo{} instructions but problematic for optimizers to reason about non-conservatively.
     62%
     63The \PabloCompiler{} then converts the \Pablo{} IR into LLVM IR.
     64%
     65This is a relatively straightforward conversion:
     66%
     67the only complexities it introduces is the generation of Phi nodes, linking of statically-compiled functions, and assignment of carry variables.
     68%
     69It produces the dynamically-generated match function used by the icGrep.
     70
    771\subsection{Dynamic Grep Engine}
     72\input{fig-executor}
     73
     74
     75As shown in Figure \ref{fig:execution}, icGrep takes the input data and transposed it into 8 parallel bit streams through S2P module.
     76The required streams, e.g. line break stream, can then be generated using the 8 basis bits streams.
     77The JIT function retrieves the 8 basis bits and the required streams from their memory addresses and starts the matching process.
     78Named Property Library that includes all the predefined Unicode categories is installed into JIT function and can be called during the matching process.
     79JIT function returns one bitstream that marks all the matching positions.
     80A match scanner will scan through this bitstream and calculate the total counts or write the context of each match position.
     81
     82We can also apply a pipeline parallelism strategy to further speed up the process of icGrep.
     83S2P and Required Streams Generator can be process in a separate thread and start even before the dynamic compilation starts.
     84The output of S2P and Required Streams Generator, that is the 8 basis bits streams and the required streams,
     85needs to be stored in a shared memory space so that the JIT function can read from it.
     86To be more efficient of memory space usage, we only allocate limit amount of space for the shared data.
     87When each chunk of the shared space is filled up with the bitstream data,
     88the thread will start writing to the first chunk if it is released by JIT function.
     89Otherwise, it will wait for JIT function until it finishes processing that chunk.
     90Therefore, the performance is depended on the slowest thread.
     91In the case that the cost of transposition and required stream generation is more than the matching process,
     92we can further divide up the work and assign two threads for S2P and Required Streams Generator.
     93
     94
  • docs/Working/icGrep/evaluation.tex

    r4465 r4466  
    1717of command-line options.   This makes it relatively straightforward
    1818to report on certain performance aspects of ICgrep, while others require
    19 special builds.
    20 
    21 
     19special builds. 
    2220
    2321For example, the command-line switch {\tt -disable-matchstar} can be used
     
    3230file.   But when search for XML tags using the regular expression
    3331\verb:<[^!?][^>]*>:, a slowdown of more than 2X may be found in files
    34 with many long tags.
     32with many long tags. 
    3533
     34The {\tt -disable-log2-bounded-repetition} flag allows these
     35effectiveness of the special techniques for bounded repetition of
     36byte classes to be assessed.   A slowdown of 30\% was observed
     37with the searches using the regular expression
     38\verb:(^|[ ])[a-zA-Z]{11,33}([.!? ]|$):, for example.
    3639
     40To assess the effectiveness of inserting if-statements, the
     41number of non-nullable pattern elements between the if-tests
     42can be set with the {\tt -if-insertion-gap=} option.   The
     43default value in icGrep is 3, setting the gap to 100 effectively
     44turns of if-insertion.   Eliminating if-insertion sometimes improves
     45performance by avoiding the extra if tests and branch mispredications.
     46For patterns with long strings, however, there can be a substantial
     47slowdown; searching for a pattern of length 40 slows down by more
     48than 50\% without the if-statement short-circuiting.
    3749
    38 In order to better understand the search process, icGrep allows
    39 various internal representations to be printed out.   For example, the option
     50ICgrep also provides options that allow
     51various internal representations to be printed out.   These
     52can aid in understanding and/or debugging performance issues.
     53For example, the option
    4054{\tt -print-REs} show the parsed regular expression as it goes
    4155through various transformations.   The internal Pablo code generated
    4256may be displayed with {\tt -print-pablo}.  This can be quite useful in
    4357helping understand the match process.   It also possible to print out the
    44 generated LLVM IR code ({\tt -dump-generated-IR}), but this includes many
     58generated LLVM IR code ({\tt -dump-generated-IR}), but this may be
     59less useful as it includes many
    4560details of low-level carry-handling that obscures the core logic.
    4661
    47 
    48 
    49 
    50 
    51 
    5262\subsection{Single vs. Multithreaded Performance}
  • docs/Working/icGrep/fig-compiler.tex

    r4461 r4466  
    77\def\PabloCompiler{\Pablo{} Compiler}
    88
    9 \begin{figure}[h] \label{fig:compiler}
     9\begin{figure}[tbh]
    1010\begin{center}
    1111% Define block styles
    1212%\tikzstyle{decision} = [diamond, shape aspect=2, rotate=30, draw, text width=4.5em, text badly centered, inner sep=0pt, thick]
    13 \tikzstyle{block} = [rectangle, draw, text width=15em, text centered, minimum height=1.75em, thick, font=\ttfamily\bfseries, node distance=3.5em]
     13\tikzstyle{block} = [rectangle, draw, text width=12em, text centered, minimum height=1.75em, thick, font=\ttfamily\bfseries, node distance=3.5em]
    1414\tikzstyle{line} = [draw, ->, line width=1.4pt]
    15 \tikzstyle{seperator} = [draw, line width=0.125em, dashed]
     15\tikzstyle{separator} = [draw, line width=0.125em, dashed]
    1616\tikzset{block/.append style={execute at begin node=\footnotesize}}   
    1717\begin{tikzpicture}[node distance=3cm, auto, >=stealth]
     
    1919    % Place nodes
    2020    \node [draw=none] (RE) {\RegularExpression{}};
    21     \node [block, right of=RE, node distance=12em] (PropertyExtraction) {Property Extraction};
    2221    \node [block, below of=RE] (REParser) {\REParser{}};
    2322    \node [block, below of=REParser] (RETransform) {\RegularExpression{} Transformations};   
     
    3231    % Draw edges
    3332    \path [line] (RE) -- (REParser);
    34     \path [line] (RE) -- (PropertyExtraction);
     33    %\path [line] (RE) -- (PropertyExtraction);
    3534    \path [line] (REParser) -- (RETransform);
    3635    \path [line] (RETransform) -| (CUCompiler);
     
    4241    \path [line] (LLVMCompiler) -- (Matcher);
    4342   
    44     % Draw seperators
    45     \coordinate[right of=REParser, node distance=20em] (SR);
    46     \coordinate[left of=REParser, node distance=20em] (SL);
    47     \path [seperator] (SL) -- (REParser);
    48     \path [seperator] (REParser) -- (SR);
     43    % Draw separators
     44    \coordinate[right of=REParser, node distance=15em] (SR);
     45    \coordinate[left of=REParser, node distance=15em] (SL);
     46    \path [separator] (SL) -- (REParser);
     47    \path [separator] (REParser) -- (SR);
    4948   
    50     \coordinate[left of=Point, node distance=20em] (PL);
    51     \coordinate[right of=Point, node distance=20em] (PR);
    52     \path [seperator] (PL) -- (CUCompiler);
    53     \path [seperator] (CUCompiler) -- (RECompiler);
    54     \path [seperator] (RECompiler) -- (PR);
     49    \coordinate[left of=Point, node distance=15em] (PL);
     50    \coordinate[right of=Point, node distance=15em] (PR);
     51    %\path [separator] (PL) -- (CUCompiler);
     52    \path [separator] (CUCompiler) -- (RECompiler);
     53    %\path [separator] (RECompiler) -- (PR);
    5554
    56     \coordinate[right of=PabloCompiler, node distance=20em] (LR);
    57     \coordinate[left of=PabloCompiler, node distance=20em] (LL);
    58     \path [seperator] (LL) -- (PabloCompiler);
    59     \path [seperator] (PabloCompiler) -- (LR);   
     55    \coordinate[right of=PabloCompiler, node distance=15em] (LR);
     56    \coordinate[left of=PabloCompiler, node distance=15em] (LL);
     57    \path [separator] (LL) -- (PabloCompiler);
     58    \path [separator] (PabloCompiler) -- (LR);   
    6059   
    61     \coordinate[right of=LLVMCompiler, node distance=20em] (OR);
    62     \coordinate[left of=LLVMCompiler, node distance=20em] (OL);
    63     \path [seperator] (OL) -- (LLVMCompiler);
    64     \path [seperator] (LLVMCompiler) -- (OR);       
     60    \coordinate[right of=LLVMCompiler, node distance=15em] (OR);
     61    \coordinate[left of=LLVMCompiler, node distance=15em] (OL);
     62    \path [separator] (OL) -- (LLVMCompiler);
     63    \path [separator] (LLVMCompiler) -- (OR);       
    6564   
    6665    % Seperator text
     
    7372
    7473\end{center}
    75 \caption{icGrep Architectural Diagram.}
     74\caption{icGrep Architectural Diagram}\label{fig:compiler}
    7675\end{figure}
    77 
    78 As show in Figure \ref{fig:compiler},
    79 icGrep is composed of three logical layers: \RegularExpression{}, \Pablo{} and the LLVM layer, each with their own intermediate representation
    80 (IR), transformation and compilation modules.
    81 %
    82 As we traverse the layers, the IR becomes significantly more complex as it begins to mirror the final machine code.
    83 %
    84 The \REParser{} validates and transforms the input \RegularExpression{} into an abstract syntax tree (AST).
    85 %
    86 The AST is a minimalistic representation that, unlike traditional \RegularExpression{}, is not converted into a NFA or DFA for further processing.
    87 %
    88 Instead, icGrep passes the AST into the transformation module, which includes a set of \RegularExpression{} specific optimization passes.
    89 %
    90 The initial \emph{Nullable} pass, determines whether the \RegularExpression{} contains any prefixes or suffixes that may be removed or
    91 modified whilst still providing the same number of matches as the original expression.
    92 %
    93 For example, ``\verb|a*bc+|'' is equivalent to ``\verb|bc|'' because the Kleene Star (Plus) operator matches zero (one) or more instances of a
    94 specific character.
    95 %
    96 The \emph{toUTF8} pass converts the characters in the input \RegularExpression{} into the equivalent expression(s) that represent the sequences
    97 of 8-bit code units necessary to identify the presence of a particular character.
    98 %
    99 Since some characters have multiple logically equivalent representations, such as \textcolor{red}{\textbf{????}}, this may produce nested sequences or alternations.
    100 %
    101 This is described in more detail in \S\ref{sec:Unicode:toUTF8}.
    102 %
    103 To alleviate this, the final \emph{Simplification} pass flattens nested sequences and alternations into their simplest legal form.
    104 %
    105 For example, ``\verb`a(b((c|d)|e))`'' would become ``\verb`ab(c|d|e)`'' and ``\verb`([0-9]{3,5}){3,5}`'', ``\verb`[0-9]{9,25}`''.
    106 %
    107 
    108 %% DISCUSS ANALYSIS MODULE?
    109 
    110 
    111 The \RegularExpression{} layer has two compilers: the \CodeUnit{} and \RegularExpressionCompiler{}, both of which produce \Pablo{} IR.
    112 %
    113 Recall that the \Pablo{} layer assumes a transposed view of the input data.
    114 %
    115 The \emph{\CodeUnitCompiler{}} transforms the input code unit classes, either extracted from the \RegularExpression{} or produced by the
    116 \emph{toUTF8} transformation, into a series of bit stream equations.
    117 %
    118 The \emph{\RegularExpressionCompiler{}} assumes that these have been calculated and transforms the \RegularExpression{} AST into
    119 a sequence of instructions.
    120 %
    121 For instance, it would convert any alternations into a sequence of calculations that are merged with \verb|OR|s.
    122 %
    123 The results of these passes are combined and transformed through a series of typical optimization passes, including dead code elimination
    124 (DCE), common subexpression elimination (CSE), and constant folding.
    125 %
    126 These are necessary at this stage because the \RegularExpression{} AST may include common subsequences that are costly to recognize in
    127 that form.
    128 %
    129 Similarly, to keep the \CodeUnitCompiler{} a linear time function, it may introduce redundant IR instructions as it applies traditional Boolean
    130 algebra transformations, such as de Morgan's law, to the computed streams.
    131 %
    132 An intended side-effect of these passes is that they eliminate the need to analyze the data-dependencies inherent in the carry-bit logic,
    133 which is necessary for some \Pablo{} instructions but problematic for optimizers to reason about non-conservatively.
    134 %
    135 The \PabloCompiler{} then converts the \Pablo{} IR into LLVM IR.
    136 %
    137 This is a relatively straightforward conversion:
    138 %
    139 the only complexities it introduces is the generation of Phi nodes, linking of statically-compiled functions, and assignment of carry variables.
    140 %
    141 It produces the dynamically-generated match function used by the icGrep.
    142 
    143 \begin{figure}[h] \label{fig:execution}
    144 \begin{center}
    145 \tikzstyle{block} = [rectangle, draw, text width=15em, text centered, minimum height=1.75em, thick, font=\ttfamily\bfseries, node distance=3.5em]
    146 \tikzstyle{line} = [draw, ->, line width=1.4pt]
    147 \tikzstyle{seperator} = [draw, line width=0.125em, dashed]
    148 \tikzset{block/.append style={execute at begin node=\footnotesize}}   
    149 \begin{tikzpicture}[node distance=3cm, auto, >=stealth]
    150 
    151     % Place nodes
    152     \node [draw=none] (InputData) {Input Data};
    153     \node [block, below of=InputData] (S2P) {S2P};
    154     \node [block, below of=S2P] (RequiredStreamsGenerator) {Required Streams Generator};
    155     \node [block, below of=RequiredStreamsGenerator] (JITFunction) {JIT Function};
    156     \node [block, right of=JITFunction, node distance=20em] (NamedPropertyLibaray) {Named Property Library};
    157     \node [block, below of=JITFunction] (MatchScanner) {Match Scanner};
    158     \node [draw=none, below of=MatchScanner, node distance=3.5em] (OutputResult) {Output Result};
    159    
    160     % Draw edges
    161     \path [line] (InputData) -- (S2P);
    162     \path [line] (S2P) -- (RequiredStreamsGenerator);
    163     \path [line] (RequiredStreamsGenerator) -- (JITFunction);
    164     \path [line] (NamedPropertyLibaray) -- (JITFunction);
    165     \path [line] (JITFunction) -- (MatchScanner);
    166     \path [line] (MatchScanner) -- (OutputResult);
    167    
    168 \end{tikzpicture}
    169 
    170 \end{center}
    171 \caption{icGrep Execution Diagram.}
    172 \end{figure}
    173 
    174 As shown in Figure \ref{fig:execution}, icGrep takes the input data and transposed it into 8 parallel bit streams through S2P module.
    175 The required streams, e.g. line break stream, can then be generated using the 8 basis bits streams.
    176 The JIT function retrieves the 8 basis bits and the required streams from their memory addresses and starts the matching process.
    177 Named Property Library that includes all the predefined Unicode categories is installed into JIT function and can be called during the matching process.
    178 JIT function returns one bitstream that marks all the matching positions.
    179 A match scanner will scan through this bitstream and calculate the total counts or write the context of each match position.
    180 
    181 We can also apply a pipeline parallelism strategy to further speed up the process of icGrep.
    182 S2P and Required Streams Generator can be process in a separate thread and start even before the dynamic compilation starts.
    183 The output of S2P and Required Streams Generator, that is the 8 basis bits streams and the required streams,
    184 needs to be stored in a shared memory space so that the JIT function can read from it.
    185 To be more efficient of memory space usage, we only allocate limit amount of space for the shared data.
    186 When each chunk of the shared space is filled up with the bitstream data,
    187 the thread will start writing to the first chunk if it is released by JIT function.
    188 Otherwise, it will wait for JIT function until it finishes processing that chunk.
    189 Therefore, the performance is depended on the slowest thread.
    190 In the case that the cost of transposition and required stream generation is more than the matching process,
    191 we can further divide up the work and assign two threads for S2P and Required Streams Generator.
    192 
    193 
    194 
    195 
    196 
Note: See TracChangeset for help on using the changeset viewer.