Changeset 4488 for docs/Working


Ignore:
Timestamp:
Feb 10, 2015, 12:17:27 PM (4 years ago)
Author:
cameron
Message:

Initial bitwise example, table placeholder

Location:
docs/Working/icGrep
Files:
1 added
4 edited

Legend:

Unmodified
Added
Removed
  • docs/Working/icGrep/architecture.tex

    r4479 r4488  
    2222specific character.
    2323%
    24 The \emph{toUTF8} pass converts the characters in the input \RegularExpression{} into the equivalent expression(s) that represent the sequences
    25 of 8-bit code units necessary to identify the presence of a particular character.
     24The \emph{toUTF8} pass converts the Unicode character classes in the input \RegularExpression{} into the equivalent expression(s) that represent the sequences
     25of 8-bit code units necessary to identify occurrences of the class.
    2626%
    2727%Since some characters have multiple logically equivalent representations, such as \textcolor{red}{\textbf{????}}, this may produce nested sequences or alternations.
     
    2929%This is described in more detail in \S\ref{sec:Unicode:toUTF8}.
    3030%
    31 A final \emph{Simplification} pass flattens nested sequences and alternations into their simplest legal form.
     31A final \emph{Simplification} pass flattens nested structures into their simplest legal form.
    3232%
    33 For example, ``\verb`a(b((c|d)|e))`'' would become ``\verb`ab(c|d|e)`'' and ``\verb`([0-9]{3,5}){3,5}`'', ``\verb`[0-9]{9,25}`''.
     33For example, ``\verb`a(b((c|d)|e))`'' becomes ``\verb`ab(c|d|e)`'' and ``\verb`([0-9]{3,5}){3,5}`'' becomes ``\verb`[0-9]{9,25}`''.
    3434%
    3535
  • docs/Working/icGrep/background.tex

    r4472 r4488  
    6767
    6868
     69\begin{figure}
     70\begin{center}
     71\begin{tabular}{cr}\\
     72input data  & \verb`dead dreams defeated.`\\
     73$M_1 = {\text{Advance}}({\text{\tt [d]}})$ & \verb`.1..1.1......1......1`\\
     74$M_2 = {\text{MatchStar}}(M_1, {\text{\tt [a-z]}})$ & \verb`.1111.111111.11111111`\\
     75$M_3 = {\text{Advance}}(M_2 \wedge {\text{\tt [e]}})$ & \verb`..1.....1.....1.1..1.`\\
     76$M_4 = {\text{Advance}}(M_3 \wedge {\text{\tt [d]}})$ & \verb`....................1`
     77\end{tabular}
     78\end{center}
     79\caption{Matching with Bitwise Data Parallelism}\label{fig:bitwisematch}
     80\end{figure}
     81
     82
     83The central concept in regular expression matching is that of marker bitstreams.
     84At each step in the matching process, a marker bitstream indicates the matches
     85that have been found to this point.   The bitwise matching method uses the
     86operations Advance to move an entire stream of markers one step forward, and MatchStar
     87to find all positions matched after a character class repetition.
     88For example,
     89Figure \ref{fig:bitwisematch} shows how the regular expression {\tt d[a-z]*ed} is matched
     90against some input text using bitwise methods. 
     91In the first step the character class stream
     92{\tt [d]} is matched and the results advanced one position to produce marker bitstream $M_1$.   Four match results are now in play
     93simultaneously. 
     94The next step applies the
     95MatchStar operation to find all the matches that may then be reached with the Kleene-* repetition
     96{\tt [a-z]*} ($M_2$).  We now have pending matches at almost all possible positions.   However, most of
     97the positions are filtered out with the match to {\tt [e]} in the computation of $M_3$.
     98The final step produces marker stream $M_4$ indicating that single position at which the entire regular expression is matched.
     99
     100
     101
    69102\comment{
    70103
  • docs/Working/icGrep/evaluation.tex

    r4481 r4488  
    99at the impact of optimizations and multithreading.
    1010
    11 \subsection{ICgrep vs. Contemporary Competitors}
     11\subsection{Simple Property Expressions}
    1212
    1313A key feature of Unicode level 1 support in regular expression engines
     
    2121regular expression formed with a one of the property expressions and a
    2222positive lookbehind assertion on the other, while set difference uses
    23 a negative lookbehind assertion.  As all three programs support lookbehind
    24 assertions in this way, we systematically generated set intersection and
    25 difference in this way.
     23a negative lookbehind assertion. 
    2624
    2725We generated a set of regular expressions involving all Unicode values of
    28 the Unicode general
    29 category property ({\tt gc}) and all values of the Unicode script property ({\tt sc}).  We then generated
     26the Unicode general category property ({\tt gc}) and all values of the Unicode
     27script property ({\tt sc}). 
     28We then generated
    3029expressions involving random pairs of {\tt gc} and {\tt sc}
    3130values combined with a random set operator chosen from union, intersection and difference.
    32 All property values are represented at least once.   A small number of
     31All property values are represented at least once.   
     32A small number of
    3333expressions were removed because they involved properties not supported by pcre2grep.
    3434In the end 246 test expressions were constructed in this process.
     
    8585\end{figure}
    8686
     87\subsection{Complex Expressions}
     88
     89We also comparative performance of the matching engines on a series
     90of more complex expressions as shown in Table \ref{table:complexexpr}.
     91
     92\begin{table}
     93\begin{center}
     94\begin{tabular}{|c|r|r|r|}  \hline
     95Regular & \multicolumn{3}{|c|}{CPU cycles per byte} \\ \cline{2-4}
     96Expression & icGrep{} & pcre2grep & ugrep \\ \hline
     97blah    & 1 & 1000 & 100 \\ \hline
     98\end{tabular}
     99\caption{Matching Times for Complex Expressions}\label{table:complexexpr}
     100\end{center}
     101\end{table}
    87102
    88103\subsection{Optimizations of Bitwise Methods}
     
    113128\verb:(^|[ ])[a-zA-Z]{11,33}([.!? ]|$):, for example.
    114129
    115 To assess the effectiveness of inserting if-statements, the
     130To control the insertion of if-statements into dynamically
     131generated code, the
    116132number of non-nullable pattern elements between the if-tests
    117133can be set with the {\tt -if-insertion-gap=} option.   The
     
    142158of the Unicode blocks represented in the input document.   For the classes
    143159covering the largest numbers of codepoints, we observed slowdowns of up to 5X.
    144 
    145160
    146161
Note: See TracChangeset for help on using the changeset viewer.