Changeset 4499 for docs


Ignore:
Timestamp:
Feb 11, 2015, 4:12:01 PM (4 years ago)
Author:
nmedfort
Message:

Cleaned up architecture discussion a bit more

Location:
docs/Working/icGrep
Files:
7 edited

Legend:

Unmodified
Added
Removed
  • docs/Working/icGrep/architecture.tex

    r4493 r4499  
    55
    66As shown in Figure \ref{fig:compiler},
    7 \icGrep{} comprises three logical layers: \RegularExpression{}, \Pablo{} and the LLVM layer, each with their own intermediate representation
     7compilation in \icGrep{} comprises three logical layers: \RegularExpression{},
     8\Pablo{} and the LLVM layer, each with their own intermediate representation
    89(IR), transformation and compilation modules.
    910%
    10 As we traverse the layers, the IR becomes significantly more complex as it begins to mirror the final machine code.
     11As we traverse the layers, the IR becomes more complex as it begins to mirror the final machine code.
    1112%
    12 The \REParser{} validates and transforms the input \RegularExpression{} into an abstract syntax tree (AST).
     13The layering enables further optimization based on information available at each stage.
     14%
     15The initial \REParser{} validates and transforms the input \RegularExpression{} into an abstract syntax tree (AST).
    1316%
    1417%The AST is a minimalistic representation that, unlike traditional \RegularExpression{}, is not converted into a NFA or DFA for further processing.
     
    1619%Instead, \icGrep{} passes the AST into the transformation module, which includes a set of \RegularExpression{} specific optimization passes.
    1720%
    18 The initial \emph{Nullable} pass, determines whether the \RegularExpression{} contains any prefixes or suffixes that may be removed or
    19 modified whilst still providing the same number of matches as the original expression.
     21Successive \RegularExpression{} Transformations exploit knowledge domain
     22knowledge to optimize the regular expressions.
     23%
     24An initial \emph{Nullable} pass, determines whether the \RegularExpression{}
     25contains prefixes or suffixes that may be removed or
     26modified whilst matching the same lines of text as the original expression.
    2027%
    2128For example, ``\verb|a*bc+|'' is equivalent to ``\verb|bc|'' because the Kleene Star (Plus) operator matches zero (one) or more instances of a
    2229specific character.
    23 %
    24 The \emph{toUTF8} pass converts the Unicode character classes in the input \RegularExpression{} into the equivalent expression(s) that represent the sequences
    25 of 8-bit code units necessary to identify occurrences of the class.
     30%
     31The aforementioned \texttt{toUTF8} transformation also applies during this phase to generate code unit classes.
     32%The \emph{toUTF8} pass converts the Unicode character classes in the input \RegularExpression{} into equivalent expression(s) that represent sequences
     33%of 8-bit code units necessary to identify occurrences of the class.
    2634%
    2735%Since some characters have multiple logically equivalent representations, such as \textcolor{red}{\textbf{????}}, this may produce nested sequences or alternations.
     
    3745
    3846
    39 The \RegularExpression{} layer has two compilers: the \CodeUnitCompiler{} and \RegularExpressionCompiler{}, both of which produce \Pablo{} IR.
     47The next layer transforms this AST into the instructions in the \Pablo{} IR.
     48%
     49%has two compilers: the \CodeUnitCompiler{} and \RegularExpressionCompiler{}, both of which produce \Pablo{} IR.
    4050%
    4151Recall that the \Pablo{} layer assumes a transposed view of the input data.
    4252%
    43 The \emph{\CodeUnitCompiler{}} transforms the input code unit classes, either extracted from the \RegularExpression{} or produced by the
    44 \emph{toUTF8} transformation, into a series of bit stream equations.
     53The \emph{\RegularExpressionCompiler{}} first transforms all input code unit classes,
     54analogous to non-Unicode character classes,
     55into a series of equations over these transposed bitstreams.
     56It next transforms the AST into \Pablo{} instructions that use the
     57results of these equations.
     58For instance, it converts alternations into a sequence of calculations that
     59are merged with \verb|OR|s.
     60The results of these passes are combined and transformed through typical
     61optimization passes including dead code elimination
     62(DCE), common subexpression elimination (CSE) and constant folding.
     63These optimizations exploit redundancies that are harder to recognize in
     64the \RegularExpression{} AST itself.
    4565%
    46 The \emph{\RegularExpressionCompiler{}} assumes that these have been calculated and transforms the \RegularExpression{} AST into
    47 a sequence of instructions.
     66%The \emph{\CodeUnitCompiler{}} transforms the input code unit classes,
     67%either extracted from the \RegularExpression{} or produced by the \emph{toUTF8} transformation,
     68%into a series of bit stream equations.
    4869%
    49 For instance, it would convert any alternations into a sequence of calculations that are merged with \verb|OR|s.
     70%The \emph{\RegularExpressionCompiler{}}
     71%assumes that these have been calculated and
     72%transforms the \RegularExpression{} AST into
     73%a sequence of instructions.
     74%\Pablo{} instructions that use the results of these equations.
    5075%
    51 The results of these passes are combined and transformed through a series of typical optimization passes, including dead code elimination
    52 (DCE), common subexpression elimination (CSE), and constant folding.
     76%For instance, it converts alternations into a sequence of calculations that are merged with \verb|OR|s.
    5377%
    54 These are necessary at this stage because the \RegularExpression{} AST may include common subsequences that are costly to recognize in
    55 that form.
     78%The results of these passes are combined and transformed through a series of typical optimization passes, including dead code elimination
     79%(DCE), common subexpression elimination (CSE), and constant folding.
    5680%
    57 Similarly, to keep the \CodeUnitCompiler{} a linear time function, it may introduce redundant IR instructions as it applies traditional Boolean
    58 algebra transformations, such as de Morgan's law, to the computed streams.
     81%These optimizations exploit redundancies that are harder to recognize in the \RegularExpression{} AST itself.
     82%These are necessary at this stage because the \RegularExpression{} AST may include common subsequences that are costly to recognize in
     83%that form.
     84%
     85%Similarly, to keep the \CodeUnitCompiler{} a linear time function, it may introduce redundant IR instructions as it applies traditional Boolean
     86%algebra transformations, such as de Morgan's law, to the computed streams.
    5987%
    60 An intended side-effect of these passes is that they eliminate the need to analyze the data-dependencies inherent in the carry-bit logic,
    61 which is necessary for some \Pablo{} instructions but problematic for optimizers to reason about non-conservatively.
     88%An intended side-effect of these passes is that they eliminate the need to analyze the data-dependencies inherent in the carry-bit logic,
     89%which is necessary for some \Pablo{} instructions but problematic for optimizers to reason about non-conservatively.
    6290%
    63 The \PabloCompiler{} then converts the \Pablo{} IR into LLVM IR.
     91
     92The \PabloCompiler{} then directly converts the \Pablo{} IR into LLVM IR.
    6493%
    65 This is a relatively straightforward conversion:
     94%This is a relatively straightforward conversion:
    6695%
    67 the only complexities it introduces is the generation of Phi nodes, linking of statically-compiled functions, and assignment of carry variables.
     96%the only complexities it introduces is the generation of Phi nodes, linking of statically-compiled functions, and assignment of carry variables.
    6897%
    69 It produces the dynamically-generated match function used by the \icGrep{}.
     98The LLVM Compiler framework provides flexible APIs for compilation and linking.
     99Using these, \icGrep{} dynamically generates a match function for identifying
     100occurrences of the \RegularExpression{}.
    70101
    71102\paragraph{Dynamic Grep Engine.}
     
    84115Transposition and the Required Streams Generator can be performed in a separate thread which can start even before the dynamic compilation starts.
    85116The output of Transposition and the Required Streams Generator, that is the 8 basis bits streams and the required streams,
    86 are stored in a shared memory buffer for susequent processing by the Dynamic Matcher once compilation is complete.
     117are stored in a shared memory buffer for subsequent processing by the Dynamic Matcher once compilation is complete.
    87118A single thread performs both compilation and matching using the computed basis and required streams.
    88119To avoid L2 cache contention, we allocate only a limited amount of space for the shared data in a circular buffer.
    89120The performance is dependent on the slowest thread.   
    90121In the case that the cost of transposition and required stream generation is more than the matching process,
    91 we can further divide up the work and assign two threads for Transposition and Required Streams Generator.
     122we can further divide up the work and assign two threads for Transposition and the Required Streams Generator.
    92123
    93124
  • docs/Working/icGrep/evaluation.tex

    r4498 r4499  
    191191For example, the option
    192192{\tt -print-REs} show the parsed regular expression as it goes
    193 through various transformations.   The internal Pablo code generated
    194 may be displayed with {\tt -print-pablo}.  This can be quite useful in
     193through various transformations.   The internal \Pablo{} code generated
     194may be displayed with {\tt -print-\Pablo{}}.  This can be quite useful in
    195195helping understand the match process.   It also possible to print out the
    196196generated LLVM IR code ({\tt -dump-generated-IR}), but this may be
  • docs/Working/icGrep/fig-compiler.tex

    r4476 r4499  
    1414    \node [block, below of=RE] (REParser) {\REParser{}};
    1515    \node [block, below of=REParser] (RETransform) {\RegularExpression{} Transformations};   
    16     \coordinate[below of=RETransform, node distance=3em] (Point);   
    17     \node [block, left of=Point, node distance=10em] (CUCompiler) {\CodeUnitCompiler{}};
    18     \node [block, right of=Point, node distance=10em] (RECompiler) {\RegularExpressionCompiler{}};   
    19     \node [block, below of=Point, node distance=3em] (PabloTransform) {\Pablo{} Transformations};   
     16%    \coordinate[below of=RETransform, node distance=3em] (Point);   
     17%    \node [block, left of=Point, node distance=10em] (CUCompiler) {\CodeUnitCompiler{}};
     18%    \node [block, right of=Point, node distance=10em] (RECompiler) {\RegularExpressionCompiler{}};
     19    \node [block, below of=RETransform] (RECompiler) {\RegularExpressionCompiler{}};
     20    \node [block, below of=RECompiler] (PabloTransform) {\Pablo{} Transformations};   
    2021    \node [block, below of=PabloTransform] (PabloCompiler) {\PabloCompiler{}};
    2122    \node [block, below of=PabloCompiler] (LLVMCompiler) {LLVM Compiler};
     
    2627    %\path [line] (RE) -- (PropertyExtraction);
    2728    \path [line] (REParser) -- (RETransform);
    28     \path [line] (RETransform) -| (CUCompiler);
    29     \path [line] (RETransform) -| (RECompiler);
    30     \path [line] (CUCompiler) |- (PabloTransform);
    31     \path [line] (RECompiler) |- (PabloTransform);
     29%    \path [line] (RETransform) -| (CUCompiler);
     30%    \path [line] (RETransform) -| (RECompiler);
     31%    \path [line] (CUCompiler) |- (PabloTransform);
     32 %   \path [line] (RECompiler) |- (PabloTransform);
     33    \path [line] (RETransform) -- (RECompiler);
     34    \path [line] (RECompiler) -- (PabloTransform);
    3235    \path [line] (PabloTransform) -- (PabloCompiler);
    3336    \path [line] (PabloCompiler) -- (LLVMCompiler);
     
    4043    \path [separator] (REParser) -- (SR);
    4144   
    42     \coordinate[left of=Point, node distance=15em] (PL);
    43     \coordinate[right of=Point, node distance=15em] (PR);
     45    \coordinate[left of=RECompiler, node distance=15em] (PL);
     46    \coordinate[right of=RECompiler, node distance=15em] (PR);
     47    \path [separator] (PL) -- (RECompiler);
     48    \path [separator] (RECompiler) -- (PR);
    4449    %\path [separator] (PL) -- (CUCompiler);
    45     \path [separator] (CUCompiler) -- (RECompiler);
     50%    \path [separator] (CUCompiler) -- (RECompiler);
    4651    %\path [separator] (RECompiler) -- (PR);
    4752
     
    5762   
    5863    % Seperator text
    59     \node [draw=none,anchor=west] at ($(SL)!0.5!(PL)$) {1)~\RegularExpression{}};
     64    \node [draw=none,anchor=west] at ($(SL)!0.5!(PL)$) {1)~\RegularExpression{} AST};
    6065    \node [draw=none,anchor=west] at ($(PL)!0.5!(LL)$) {2)~\Pablo{}};
    6166    \node [draw=none,anchor=west] at ($(LL)!0.5!(OL)$) {3)~LLVM};
     
    6570
    6671\end{center}
    67 \caption{icGrep Architectural Diagram}\label{fig:compiler}
     72\caption{icGrep compilation architecture}\label{fig:compiler}
    6873\end{figure}
  • docs/Working/icGrep/fig-executor.tex

    r4498 r4499  
    5050
    5151\end{center}
    52 \caption{icGrep Execution Diagram} \label{fig:execution}
     52\caption{Data flow in an icGrep execution} \label{fig:execution}
    5353\end{figure}
  • docs/Working/icGrep/icgrep.tex

    r4498 r4499  
    1818
    1919\def\RegularExpression{RegEx}
    20 \def\Pablo{Pablo}
     20\def\Pablo{Parabix}
    2121\def\CodeUnit{Code Unit}
    2222\def\REParser{\RegularExpression{} Parser}
  • docs/Working/icGrep/unicode-re.tex

    r4495 r4499  
    1111over UTF-8 data streams is to translate Unicode regular expressions
    1212over codepoints to corresponding regular expressions over
    13 sequences of UTF-8 bytes.   The {\tt toUTF8} transformation
     13sequences of UTF-8 bytes or \emph{code units}.   The {\tt toUTF8} transformation
    1414performs this as a regular expression transformation, transforming
    1515input expressions such as `\verb:\u{244}[\u{2030}-\u{2137}]:`
Note: See TracChangeset for help on using the changeset viewer.