# Changeset 4476 for docs

Timestamp: Feb 7, 2015, 12:52:42 AM
Location: docs/Working/icGrep
Files: 7 edited
• ## docs/Working/icGrep/architecture.tex

 r4472 \input{fig-compiler} As shown in Figure \ref{fig:compiler}, \icGrep{} comprises three logical layers: \RegularExpression{}, \Pablo{} and the LLVM layer, each with their own intermediate representation (IR), transformation and compilation modules. % of 8-bit code units necessary to identify the presence of a particular character. %Since some characters have multiple logically equivalent representations, such as \textcolor{red}{\textbf{????}}, this may produce nested sequences or alternations. %This is described in more detail in \S\ref{sec:Unicode:toUTF8}. A final \emph{Simplification} pass flattens nested sequences and alternations into their simplest legal form. % For example, \verb!a(b((c|d)|e))! would become \verb!ab(c|d|e)!, and \verb!([0-9]{3,5}){3,5}!, \verb![0-9]{9,25}!. The \RegularExpression{} layer has two compilers: the \CodeUnitCompiler{} and the \RegularExpressionCompiler{}, both of which produce \Pablo{} IR. % Recall that the \Pablo{} layer assumes a transposed view of the input data.
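The effect of the Simplification pass on nested sequences, alternations, and bounded repetitions can be sketched as follows (illustrative Python with made-up AST classes, not icGrep's actual RegEx IR):

```python
# Hypothetical AST node classes for illustration only.
class Seq:
    def __init__(self, *items): self.items = list(items)

class Alt:
    def __init__(self, *items): self.items = list(items)

class Rep:  # bounded repetition: item{lb,ub}
    def __init__(self, item, lb, ub):
        self.item, self.lb, self.ub = item, lb, ub

def simplify(node):
    """Flatten Seq-in-Seq and Alt-in-Alt, and merge nested bounded
    repetitions (valid when the resulting length range is contiguous,
    as in ([0-9]{3,5}){3,5} -> [0-9]{9,25})."""
    if isinstance(node, (Seq, Alt)):
        kind = type(node)
        flat = []
        for item in map(simplify, node.items):
            if isinstance(item, kind):
                flat.extend(item.items)  # splice same-kind child into parent
            else:
                flat.append(item)
        return flat[0] if len(flat) == 1 else kind(*flat)
    if isinstance(node, Rep):
        inner = simplify(node.item)
        if isinstance(inner, Rep):
            return Rep(inner.item, node.lb * inner.lb, node.ub * inner.ub)
        return Rep(inner, node.lb, node.ub)
    return node

# a(b((c|d)|e))  ->  sequence a, b, (c|d|e)
flattened = simplify(Seq('a', Seq('b', Alt(Alt('c', 'd'), 'e'))))
# ([0-9]{3,5}){3,5}  ->  [0-9]{9,25}
merged = simplify(Rep(Rep('[0-9]', 3, 5), 3, 5))
```

A single recursive pass suffices here because each child is simplified before its parent inspects it.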
As shown in Figure~\ref{fig:execution}, \icGrep{} takes the input data and transposes it into 8 parallel bit streams through the Transposition module. The required streams, e.g. the line break stream, can then be generated from the 8 basis bit streams. The Dynamic Matcher, dynamically compiled via LLVM, retrieves the 8 basis bits and the required streams from their memory addresses and starts the matching process. The Named Property Library, which includes all the predefined Unicode categories, is installed into the Dynamic Matcher and can be called during the matching process. The Dynamic Matcher returns one bitstream that marks all the matching positions. Finally, a Match Scanner scans through the returned bitstream and calculates the total counts or writes out the context of each match position.

We can also apply a pipeline parallelism strategy to further speed up \icGrep{}. Transposition and the Required Streams Generator can run in a separate thread and can start even before the dynamic compilation begins. Their output, that is, the 8 basis bit streams and the required streams, is stored in a shared memory space from which the Dynamic Matcher reads. To use memory more efficiently, we allocate only a limited amount of space for the shared data. After each chunk of the shared space is filled with bitstream data, the thread starts writing to the first chunk again once it has been released by the Dynamic Matcher; otherwise, it waits until the Dynamic Matcher finishes processing that chunk. Overall performance therefore depends on the slowest thread. If transposition and required stream generation cost more than the matching process itself, we can further divide the work and assign two threads to Transposition and the Required Streams Generator.
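The transposition step can be illustrated with a naive scalar sketch (icGrep actually uses SIMD bit-twiddling; this version only shows the data layout, with Python integers standing in for bitstreams):

```python
def transpose(data: bytes):
    """Transpose a byte stream into 8 parallel bit streams: stream k
    holds bit k of every byte (k = 0 is the high bit), and bit i of a
    stream corresponds to byte position i in the input."""
    streams = [0] * 8
    for i, byte in enumerate(data):
        for k in range(8):
            if byte & (0x80 >> k):
                streams[k] |= 1 << i
    return streams

# The two CJK characters occupy 6 UTF-8 bytes, followed by ",hi".
data = "你好,hi".encode("utf-8")
basis = transpose(data)
```

Note that stream 0 (the high-bit stream) directly marks every non-ASCII code unit, which is why streams like NonFinal are cheap to derive from the basis bits.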
• ## docs/Working/icGrep/evaluation.tex

 r4475 most of the world's major language families as a test corpus. For each program under test, we perform searches for each regular expression against each XML document. Results are presented in Figure \ref{fig:property_test}. Performance is reported in CPU cycles per byte on an Intel Core i7 machine. The results were grouped by the percentage of matching lines found in the XML document, grouped in
• ## docs/Working/icGrep/fig-compiler.tex

 r4466 \def\RegularExpression{RegEx} \def\Pablo{Pablo} \def\CodeUnit{Code Unit} \def\REParser{\RegularExpression{} Parser} \def\CodeUnitCompiler{\CodeUnit{} Compiler} \def\RegularExpressionCompiler{\RegularExpression{} Compiler} \def\PabloCompiler{\Pablo{} Compiler} \begin{figure}[tbh]
• ## docs/Working/icGrep/fig-executor.tex

 r4468 % Place nodes \node [draw=none] (InputData) {Input Data}; \node [block, below of=InputData] (S2P) {Transposition}; \node [block, below of=S2P] (RequiredStreamsGenerator) {Required Streams Generator}; \node [block, below of=RequiredStreamsGenerator] (JITFunction) {Dynamic Matcher}; \node [block, right of=JITFunction, node distance=20em] (NamedPropertyLibaray) {Named Property Library}; \node [block, below of=JITFunction] (MatchScanner) {Match Scanner};
• ## docs/Working/icGrep/icgrep.tex

• ## docs/Working/icGrep/unicode-re.tex

 r4472 \subsection{Match Unicode Characters} After the transformation, we can generate the character class bitstreams with our character class compiler. As shown in Figure \ref{fig:multibytesequence}, the input data has 4 Chinese characters, each formed from 3 bytes. To better demonstrate the process, we use \emph{ni3}, \emph{hao} and \emph{men} to represent these characters. $CC_{\mbox{ni3}}$ is the bitstream that marks the character \emph{ni3} and $CC_{\mbox{hao}}$ is the bitstream that marks the character \emph{hao}. To match the two-character UTF-8 sequence \emph{ni3hao}, we first create an \emph{Initial} stream that marks the first byte of every valid character. We also generate a \emph{NonFinal} stream that marks every byte of each multibyte character except for the last. Using \emph{Initial} to ScanThru \emph{NonFinal}, we get the bitstream $M_1$, which marks the position of the last byte of every character. An overlap between $M_1$ and $CC_{\mbox{ni3}}$ gives the start position for matching the next character. As illustrated by \emph{Adv}, we find two matches for \emph{ni3}, and from these two positions we can start the matching process for the next character, \emph{hao}. The final result stream shows 1 match for the multibyte sequence \emph{ni3hao}.
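The table below steps through this example; the same marker operations can be executed with ordinary integers as bitstreams (bit $i$ = stream position $i$), a sketch of the parallel bitstream semantics rather than icGrep's SIMD implementation:

```python
def ScanThru(marker, charclass):
    """Move each marker bit through a run of charclass bits to the
    first position past the run (carry-propagating addition)."""
    return (marker + charclass) & ~charclass

def Advance(marker):
    """Shift every marker one position forward."""
    return marker << 1

def bits(picture):
    """Parse a stream picture like '..1..' (leftmost char = position 0)."""
    return sum(1 << i for i, c in enumerate(picture) if c == '1')

Initial  = bits("1..1..111111111..1..111111")
NonFinal = bits("11.11.........11.11.......")
CC_ni3   = bits("..1.............1.........")
CC_hao   = bits(".....1....................")

M1    = ScanThru(Initial, NonFinal)
Adv   = Advance(M1 & CC_ni3)
M2    = ScanThru(Initial & Adv, NonFinal)
match = M2 & CC_hao
```

Running this reproduces each row of the table, with `match` marking the single occurrence of \emph{ni3hao}.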
\begin{tabular}{r@{\hspace{1em}}r}\\
input data & \verb!ni3hao(Hello),ni3men(You),!\\
$CC_{\mbox{ni3}}$ & \verb!..1.............1.........!\\
$CC_{\mbox{hao}}$ & \verb!.....1....................!\\
\emph{Initial} & \verb!1..1..111111111..1..111111!\\
\emph{NonFinal} & \verb!11.11.........11.11.......!\\
$M_1 = \mathit{ScanThru}(\mathit{Initial}, \mathit{NonFinal})$ & \verb!..1..111111111..1..1111111!\\
$\mathit{Adv} = \mathit{Advance}(M_1 \land CC_{\mbox{ni3}})$ & \verb!...1.............1........!\\
$M_2 = \mathit{ScanThru}(\mathit{Initial} \land \mathit{Adv}, \mathit{NonFinal})$ & \verb!.....1.............1......!\\
$\mathit{match} = M_2 \land CC_{\mbox{hao}}$ & \verb!.....1....................!
\end{tabular}

\subsection{Predefined Unicode classes}
The Unicode Consortium has assigned every Unicode character to a general category, such as letter or punctuation. These categories seldom change, so we precomputed their parallel bitstream equations and statically compiled them into icGrep. Each category contains many code points, so we further devised an \emph{If Hierarchy} optimization for each category. This optimization assumes that most input documents only contain code points from a small number of writing systems. Including equations for code points outside of the used range is unnecessary and would only add to the total running time of the application. The \emph{If Hierarchy} optimization determines the ranges of the code points in the input text and only processes the character class equations for the observed code point ranges. The optimization tests the input text with a series of nested \emph{if else} statements, similar to a static binary search. As the nesting of the \emph{if} statements increases, the range of the code points narrows and the equations for a precise range of code points are selected.

%\subsection{Character Class Intersection and Difference} %\subsection{Unicode Case-Insensitive Matching}
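The nested range tests can be sketched as follows (hypothetical Python: the split points and leaf width are illustrative, and icGrep compiles the hierarchy statically rather than pruning at run time):

```python
def narrow(cps, lo, hi):
    """Nested if/else range tests over the code space, similar to a
    binary search: subtrees whose range is absent from the input are
    pruned, so their character class equations are never evaluated."""
    if not any(lo <= c <= hi for c in cps):
        return []                 # no code points in this range: skip subtree
    if hi - lo <= 0xFF:           # range narrow enough: evaluate its equations
        return [(lo, hi)]
    mid = (lo + hi) // 2
    return narrow(cps, lo, mid) + narrow(cps, mid + 1, hi)

# Mixed ASCII and CJK input selects only a few narrow ranges
# out of the full Unicode code space.
code_points = [ord(c) for c in "icGrep: 你好"]
ranges = narrow(code_points, 0x0000, 0x10FFFF)
```

For this input only a handful of leaves survive, so the equations for every other writing system are skipped entirely.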