Changeset 4476


Ignore:
Timestamp:
Feb 7, 2015, 12:52:42 AM (4 years ago)
Author:
nmedfort
Message:

Architecture updates

Location:
docs/Working/icGrep
Files:
7 edited

Legend:

Unmodified
Added
Removed
  • docs/Working/icGrep/architecture.tex

    r4472 r4476  
    44\input{fig-compiler}
    55
    6 As show in Figure \ref{fig:compiler},
    7 \icGrep{} is composed of three logical layers: \RegularExpression{}, \Pablo{} and the LLVM layer, each with their own intermediate representation
     6As shown in Figure \ref{fig:compiler},
     7\icGrep{} comprises three logical layers: \RegularExpression{}, \Pablo{} and the LLVM layer, each with their own intermediate representation
    88(IR), transformation and compilation modules.
    99%
     
    2525of 8-bit code units necessary to identify the presence of a particular character.
    2626%
    27 Since some characters have multiple logically equivalent representations, such as \textcolor{red}{\textbf{????}}, this may produce nested sequences or alternations.
     27%Since some characters have multiple logically equivalent representations, such as \textcolor{red}{\textbf{????}}, this may produce nested sequences or alternations.
    2828%
    29 This is described in more detail in \S\ref{sec:Unicode:toUTF8}.
     29%This is described in more detail in \S\ref{sec:Unicode:toUTF8}.
    3030%
    31 To alleviate this, the final \emph{Simplification} pass flattens nested sequences and alternations into their simplest legal form.
     31A final \emph{Simplification} pass flattens nested sequences and alternations into their simplest legal form.
    3232%
    3333For example, ``\verb`a(b((c|d)|e))`'' would become ``\verb`ab(c|d|e)`'' and ``\verb`([0-9]{3,5}){3,5}`'', ``\verb`[0-9]{9,25}`''.
     
    3737
    3838
    39 The \RegularExpression{} layer has two compilers: the \CodeUnit{} and \RegularExpressionCompiler{}, both of which produce \Pablo{} IR.
     39The \RegularExpression{} layer has two compilers: the \CodeUnitCompiler{} and \RegularExpressionCompiler{}, both of which produce \Pablo{} IR.
    4040%
    4141Recall that the \Pablo{} layer assumes a transposed view of the input data.
     
    7373
    7474
    75 As shown in Figure \ref{fig:execution}, \icGrep{} takes the input data and transposed it into 8 parallel bit streams through S2P module.
    76 The required streams, e.g. line break stream, can then be generated using the 8 basis bits streams.
    77 The JIT function retrieves the 8 basis bits and the required streams from their memory addresses and starts the matching process.
    78 Named Property Library that includes all the predefined Unicode categories is installed into JIT function and can be called during the matching process.
    79 JIT function returns one bitstream that marks all the matching positions.
    80 A match scanner will scan through this bitstream and calculate the total counts or write the context of each match position.
     75As shown in Figure~\ref{fig:execution}, \icGrep{} takes the input data and transposes it into 8 parallel bit streams through the Transposition module.
     76The required streams, e.g. the line break stream, can then be generated using the 8 basis bits streams.
     77The Dynamic Matcher, dynamically compiled via LLVM, retrieves the 8 basis bits and the required streams from their memory addresses and starts the matching process.
     78The Named Property Library that includes all the predefined Unicode categories is installed into the Dynamic Matcher and can be called during the matching process.
     79The Dynamic Matcher returns one bitstream that marks all the matching positions.
     80Finally, a Match Scanner scans through the returned bitstream and calculates the total counts or writes the context of each match position.
    8181
    8282We can also apply a pipeline parallelism strategy to further speed up the process of \icGrep{}.
    83 S2P and Required Streams Generator can be process in a separate thread and start even before the dynamic compilation starts.
    84 The output of S2P and Required Streams Generator, that is the 8 basis bits streams and the required streams,
    85 needs to be stored in a shared memory space so that the JIT function can read from it.
    86 To be more efficient of memory space usage, we only allocate limit amount of space for the shared data.
    87 When each chunk of the shared space is filled up with the bitstream data,
    88 the thread will start writing to the first chunk if it is released by JIT function.
    89 Otherwise, it will wait for JIT function until it finishes processing that chunk.
     83Transposition and the Required Streams Generator can be performed in a separate thread and start even before the dynamic compilation starts.
     84The output of Transposition and the Required Streams Generator, that is the 8 basis bits streams and the required streams,
     85needs to be stored in a shared memory space so that the Dynamic Matcher can read from it.
     86To more efficiently use memory, we allocate only a limited amount of space for the shared data.
     87After each chunk of the shared space is filled with bitstream data,
     88the thread starts writing to the first chunk if it has been released by Dynamic Matcher.
     89Otherwise, it will wait for Dynamic Matcher until it finishes processing that chunk.
    9090Therefore, the performance is depended on the slowest thread.
    9191In the case that the cost of transposition and required stream generation is more than the matching process,
    92 we can further divide up the work and assign two threads for S2P and Required Streams Generator.
     92we can further divide up the work and assign two threads for Transposition and Required Streams Generator.
    9393
    9494
  • docs/Working/icGrep/evaluation.tex

    r4475 r4476  
    3737most of the world's major language families as a test corpus.   For each program
    3838under test, we perform searches for each regular expression against each XML document.
    39 Results are presented in \Figure \ref{fig:property_test}.  Performance is reported
     39Results are presented in Figure \ref{fig:property_test}.  Performance is reported
    4040in CPU cycles per byte on an Intel Core i7 machine.   The results were grouped
    4141by the percentage of matching lines found in the XML document, grouped in
  • docs/Working/icGrep/fig-compiler.tex

    r4466 r4476  
    1 \def\RegularExpression{RegEx}
    2 \def\Pablo{Pablo}
    3 \def\CodeUnit{Code Unit}
    4 \def\REParser{\RegularExpression{} Parser}
    5 \def\CodeUnitCompiler{\CodeUnit{} Compiler}
    6 \def\RegularExpressionCompiler{\RegularExpression{} Compiler}
    7 \def\PabloCompiler{\Pablo{} Compiler}
    81
    92\begin{figure}[tbh]
  • docs/Working/icGrep/fig-executor.tex

    r4468 r4476  
    99    % Place nodes
    1010    \node [draw=none] (InputData) {Input Data};
    11     \node [block, below of=InputData] (S2P) {S2P};
     11    \node [block, below of=InputData] (S2P) {Transposition};
    1212    \node [block, below of=S2P] (RequiredStreamsGenerator) {Required Streams Generator};
    13     \node [block, below of=RequiredStreamsGenerator] (JITFunction) {JIT Function};
     13    \node [block, below of=RequiredStreamsGenerator] (JITFunction) {Dynamic Matcher};
    1414    \node [block, right of=JITFunction, node distance=20em] (NamedPropertyLibaray) {Named Property Library};
    1515    \node [block, below of=JITFunction] (MatchScanner) {Match Scanner};
  • docs/Working/icGrep/icgrep.tex

    r4474 r4476  
    1616\newcommand{\icGrep}[1]{icGrep}
    1717\newcommand{\ICgrep}[1]{ICgrep}
     18
     19\def\RegularExpression{RegEx}
     20\def\Pablo{Pablo}
     21\def\CodeUnit{Code Unit}
     22\def\REParser{\RegularExpression{} Parser}
     23\def\CodeUnitCompiler{\CodeUnit{} Compiler}
     24\def\RegularExpressionCompiler{\RegularExpression{} Compiler}
     25\def\PabloCompiler{\Pablo{} Compiler}
    1826
    1927
  • docs/Working/icGrep/unicode-re.tex

    r4472 r4476  
    3232\subsection{Match Unicode Characters}
    3333After the transformation, we can then generate the character class bitstreams with our character class compiler.
    34 As shown in figure \ref{fig:multibytesequence}, the input data has 4 Chinese characters, each formed by 3 bytes.
    35 To better demonstrate the process, we use $ni3$, $hao$ and $men$ to represent these characters.
    36 $CC_{\mbox{ni3}}$ is the bitstream that marks character $ni3$ and $CC_{\mbox{hao}}$ is the bitstreams that marks character stream $hao$.
    37 To match a two utf8 character sequence $ni3hao$, we need to first create a $Initial$ steams that marks the first byte of all the valid characters.
    38 We also need to generate a $NonFinal$ streams that marks every bytes of all the multibyte characters except the last byte.
    39 Using $Initial$ to ScanThru $NonFinal$, we get the bitstream $M_2$ which marks the positions of the last byte of every character.
     34As shown in Figure \ref{fig:multibytesequence}, the input data has 4 Chinese characters, each formed by 3 bytes.
     35To better demonstrate the process, we use \emph{ni3}, \emph{hao} and \emph{men} to represent these characters.
     36$CC_{\mbox{ni3}}$ is the bitstream that marks character \emph{ni3} and $CC_{\mbox{hao}}$ is the bitstream that marks character stream \emph{hao}.
     37To match a two UTF-8 character sequence \emph{ni3hao}, we need to first create an \emph{Initial} steam that marks the first byte of all the valid characters.
     38We also need to generate a \emph{NonFinal} stream that marks every byte of all the multibyte characters except for the last byte.
     39Using \emph{Initial} to ScanThru \emph{NonFinal}, we get the bitstream $M_2$, which marks the positions of the last byte of every character.
    4040An overlap between $M_2$ and $CC_{\mbox{ni3}}$ gives the start position for matching the next character.
    41 As illustrated by $Adv$, we find two matches for $ni3$ and from these two positions we can start the matching process for the next character $hao$.
     41As illustrated by \emph{Adv}, we find two matches for \emph{ni3} and from these two positions we can start the matching process for the next character \emph{hao}.
    4242The final result stream shows 1 match for Multibyte Sequence ni3hao.
    4343
     
    4848%\begin{tabular}{cr}\\
    4949
    50 \begin{tabular}{cr}\\
     50\begin{tabular}{r@{\hspace{1em}}r}\\
    5151input data  & \verb`ni3hao(Hello),ni3men(You),`\\
    5252$CC_{\mbox{ni3}}$ & \verb`..1.............1.........`\\
    5353$CC_{\mbox{hao}}$ & \verb`.....1....................`\\
    54 $Initial$ & \verb`1..1..111111111..1..111111`\\
    55 $NonFinal$ & \verb`11.11.........11.11.......`\\
    56 $M_1 = ScanThru(Initial, NonFinal)$ & \verb`..1..111111111..1..1111111`\\
    57 $Adv = Advance(M_1 \land CC_{\mbox{ni3}})$ & \verb`...1.............1........`\\
    58 $M_2 = ScanThru(Initial \land Adv, NonFinal)$ & \verb`.....1.............1......`\\
    59 $match = M_2 \land CC_{\mbox{hao}}$ & \verb`.....1....................`
     54\emph{Initial} & \verb`1..1..111111111..1..111111`\\
     55\emph{NonFinal} & \verb`11.11.........11.11.......`\\
     56$M_1 = \mathit{ScanThru}(\mathit{Initial}, \mathit{NonFinal})$ & \verb`..1..111111111..1..1111111`\\
     57$\mathit{Adv} = \mathit{Advance}(M_1 \land CC_{\mbox{ni3}})$ & \verb`...1.............1........`\\
     58$M_2 = \mathit{ScanThru}(\mathit{Initial} \land \mathit{Adv}, \mathit{NonFinal})$ & \verb`.....1.............1......`\\
     59$\mathit{match} = M_2 \land CC_{\mbox{hao}}$ & \verb`.....1....................`
    6060\end{tabular}
    6161
     
    6868\subsection{Predefined Unicode classes}
    6969
    70 Every character in the Unicode database has been assigned to a general category classification based upon the character's type. As the categories seldom change the parallel bitstream equations for the categories have been statically compiled into icGrep. Each of the categories contain a large number of code points, therefore an \emph{If Hierarchy} optimization has been included in the statically compiled implementation of each category.  The optimization works under the assumption that most input documents will only contain the code points of the characters from a small number of writing systems. Processing the blocks of code points for characters that exist outside of this range is unnecessary and will only add to the total running time of the application.  The optimization tests the input text to determine the ranges of the code points that are contained in the input text and it only processes the character class equations and the regular expression matching equations for the code point ranges that the input text contains. The optimization tests the input text with a series of nested \emph{if else} statements, using a process similar to that of a binary search.  As the nesting of the statements increases, the range of the code points in the conditions of the \emph{if} statements narrow until the exact ranges of the code points in the text has been found.
     70The Unicode Consortium assigned every Unicode character to a general category, such as letter or punctuation.
     71These categories seldom change, so we precomputed their parallel bitstream equations and statically compiled them into icGrep.
     72Each category contains many code points, so we further devised an \emph{If Hierarchy} optimization for each category.
     73This optimization assumes that most input documents only contain code points from a small number of writing systems.
     74Including equations for code points outside of the used range is unnecessary and will only add to the total running time of the application.
     75The \emph{If Hierarchy} optimization determines the ranges of the code points in the input text and only processes the character class
     76equations for observed code point ranges.
     77The optimization tests the input text with a series of nested \emph{if else} statements, similar to a static binary search.
     78As the nesting of the \emph{if} statements increases, the range of the code points narrows and the equations for a precise range of
     79code points are selected.
    7180
    72 \subsection{Character Class Intersection and Difference}
    73 \subsection{Unicode Case-Insensitive Matching}
     81%\subsection{Character Class Intersection and Difference}
     82%\subsection{Unicode Case-Insensitive Matching}
Note: See TracChangeset for help on using the changeset viewer.