Ignore:
Timestamp:
Sep 21, 2015, 12:42:53 PM (4 years ago)
Author:
cameron
Message:

Figure placement according to Springer

File:
1 edited

Legend:

Unmodified
Added
Removed
  • docs/Working/icGrep/architecture.tex

    r4501 r4782  
    11\section{Architecture}\label{sec:architecture}
     2
    23\paragraph{Regular Expression Preprocessing.}
     4As shown in Fig.~\ref{fig:compiler},
     5compilation in \icGrep{} comprises three logical layers: \RegularExpression{},
     6\Pablo{} and the LLVM layer, each with their own intermediate representation
     7(IR), transformation and compilation modules.
     8As we traverse the layers, the IR becomes more complex as it begins to mirror the final machine code.
     9The layering enables further optimization based on information available at each stage.
     10The \REParser{} validates and transforms the input \RegularExpression{} into an abstract syntax tree (AST).
     11Successive \RegularExpression{} Transformations exploit domain
     12knowledge to optimize the regular expressions.
     13The aforementioned \texttt{toUTF8} transformation also applies during this phase to generate code unit classes.
    314
    415\input{fig-compiler}
    516
    6 As shown in Figure \ref{fig:compiler},
    7 compilation in \icGrep{} comprises three logical layers: \RegularExpression{},
    8 \Pablo{} and the LLVM layer, each with their own intermediate representation
    9 (IR), transformation and compilation modules.
    10 %
    11 As we traverse the layers, the IR becomes more complex as it begins to mirror the final machine code.
    12 %
    13 The layering enables further optimization based on information available at each stage.
    14 %
    15 The \REParser{} validates and transforms the input \RegularExpression{} into an abstract syntax tree (AST).
    16 %
    17 %The AST is a minimalistic representation that, unlike traditional \RegularExpression{}, is not converted into a NFA or DFA for further processing.
    18 %
    19 %Instead, \icGrep{} passes the AST into the transformation module, which includes a set of \RegularExpression{} specific optimization passes.
    20 %
    21 Successive \RegularExpression{} Transformations exploit domain
    22 knowledge to optimize the regular expressions.
    23 %
    24 %An initial \emph{Nullable} pass, determines whether the \RegularExpression{}
    25 %contains prefixes or suffixes that may be removed or
    26 %modified whilst matching the same lines of text as the original expression.
    27 %
    28 %For example, ``\verb|a*bc+|'' is equivalent to ``\verb|bc|'' because the Kleene Star (Plus) operator matches zero (one) or more instances of a
    29 %specific character.
    30 %
    31 The aforementioned \texttt{toUTF8} transformation also applies during this phase to generate code unit classes.
    32 %The \emph{toUTF8} pass converts the Unicode character classes in the input \RegularExpression{} into equivalent expression(s) that represent sequences
    33 %of 8-bit code units necessary to identify occurrences of the class.
    34 %
    35 %Since some characters have multiple logically equivalent representations, such as \textcolor{red}{\textbf{????}}, this may produce nested sequences or alternations.
    36 %
    37 %This is described in more detail in \S\ref{sec:Unicode:toUTF8}.
    38 %
    39 %A final \emph{Simplification} pass flattens nested structures into their simplest legal form.
    40 %
    41 %For example, ``\verb`a(b((c|d)|e))`'' becomes ``\verb`ab(c|d|e)`'' and ``\verb`([0-9]{3,5}){3,5}`'' becomes ``\verb`[0-9]{9,25}`''.
    42 %
    43 
    44 %% DISCUSS ANALYSIS MODULE?
    45 
    4617
    4718The next layer transforms this AST into the instructions in the \Pablo{} IR.
    48 %
    49 %has two compilers: the \CodeUnitCompiler{} and \RegularExpressionCompiler{}, both of which produce \Pablo{} IR.
    50 %
    5119Recall that the \Pablo{} layer assumes a transposed view of the input data.
    52 %
    5320The \emph{\RegularExpressionCompiler{}} first transforms all input code unit classes,
    5421analogous to non-Unicode character classes,
     
    6330These optimizations exploit redundancies that are harder to recognize in
    6431the \RegularExpression{} AST itself.
    65 %
    66 %The \emph{\CodeUnitCompiler{}} transforms the input code unit classes,
    67 %either extracted from the \RegularExpression{} or produced by the \emph{toUTF8} transformation,
    68 %into a series of bit stream equations.
    69 %
    70 %The \emph{\RegularExpressionCompiler{}}
    71 %assumes that these have been calculated and
    72 %transforms the \RegularExpression{} AST into
    73 %a sequence of instructions.
    74 %\Pablo{} instructions that use the results of these equations.
    75 %
    76 %For instance, it converts alternations into a sequence of calculations that are merged with \verb|OR|s.
    77 %
    78 %The results of these passes are combined and transformed through a series of typical optimization passes, including dead code elimination
    79 %(DCE), common subexpression elimination (CSE), and constant folding.
    80 %
    81 %These optimizations exploit redundancies that are harder to recognize in the \RegularExpression{} AST itself.
    82 %These are necessary at this stage because the \RegularExpression{} AST may include common subsequences that are costly to recognize in
    83 %that form.
    84 %
    85 %Similarly, to keep the \CodeUnitCompiler{} a linear time function, it may introduce redundant IR instructions as it applies traditional Boolean
    86 %algebra transformations, such as de Morgan's law, to the computed streams.
    87 %
    88 %An intended side-effect of these passes is that they eliminate the need to analyze the data-dependencies inherent in the carry-bit logic,
    89 %which is necessary for some \Pablo{} instructions but problematic for optimizers to reason about non-conservatively.
    90 %
    9132
    9233The \PabloCompiler{} then directly converts the \Pablo{} IR into LLVM IR.
    93 %
    94 %This is a relatively straightforward conversion:
    95 %
    96 %the only complexities it introduces is the generation of Phi nodes, linking of statically-compiled functions, and assignment of carry variables.
    97 %
    9834The LLVM Compiler framework provides flexible APIs for compilation and linking.
    9935Using these, \icGrep{} dynamically generates a match function for identifying
     
    10137
    10238\paragraph{Dynamic Grep Engine.}
    103 \input{fig-executor}
    10439Figure~\ref{fig:execution} shows the structure of the \icGrep{} matching engine.
    10540The input data is transposed into 8 parallel bit streams through the Transposition module.
     
    11146The Dynamic Matcher returns one bitstream that marks all the positions that fully match the compiled regular expression.
    11247Finally, a Match Scanner scans through the returned bitstream to select the matching lines and generate the normal grep output.
     48
     49\input{fig-executor}
    11350
    11451We can also apply a pipeline parallelism strategy to further speed up the process of \icGrep{}.
Note: See TracChangeset for help on using the changeset viewer.