Changeset 4488

Ignore:
Timestamp:
Feb 10, 2015, 12:17:27 PM (4 years ago)
Message:

Initial bitwise example, table placeholder

Location:
docs/Working/icGrep
Files:
1 added
4 edited

Unmodified
Added
Removed
• docs/Working/icGrep/architecture.tex

 r4479 specific character. % The \emph{toUTF8} pass converts the characters in the input \RegularExpression{} into the equivalent expression(s) that represent the sequences of 8-bit code units necessary to identify the presence of a particular character. The \emph{toUTF8} pass converts the Unicode character classes in the input \RegularExpression{} into the equivalent expression(s) that represent the sequences of 8-bit code units necessary to identify occurrences of the class. % %Since some characters have multiple logically equivalent representations, such as \textcolor{red}{\textbf{????}}, this may produce nested sequences or alternations. %This is described in more detail in \S\ref{sec:Unicode:toUTF8}. % A final \emph{Simplification} pass flattens nested sequences and alternations into their simplest legal form. A final \emph{Simplification} pass flattens nested structures into their simplest legal form. % For example, \verba(b((c|d)|e))'' would become \verbab(c|d|e)'' and \verb([0-9]{3,5}){3,5}'', \verb[0-9]{9,25}''. For example, \verba(b((c|d)|e))'' becomes \verbab(c|d|e)'' and \verb([0-9]{3,5}){3,5}'' becomes \verb[0-9]{9,25}''. %
• docs/Working/icGrep/background.tex

 r4472 \begin{figure} \begin{center} \begin{tabular}{cr}\\ input data  & \verbdead dreams defeated.\\ $M_1 = {\text{Advance}}({\text{\tt [d]}})$ & \verb.1..1.1......1......1\\ $M_2 = {\text{MatchStar}}(M_1, {\text{\tt [a-z]}})$ & \verb.1111.111111.11111111\\ $M_3 = {\text{Advance}}(M_2 \wedge {\text{\tt [e]}})$ & \verb..1.....1.....1.1..1.\\ $M_4 = {\text{Advance}}(M_3 \wedge {\text{\tt [d]}})$ & \verb....................1 \end{tabular} \end{center} \caption{Matching with Bitwise Data Parallelism}\label{fig:bitwisematch} \end{figure} The central concept in regular expression matching is that of marker bitstreams. At each step in the matching process, a marker bitstream indicates the matches that have been found to this point.   The bitwise matching method uses the operations Advance to move an entire stream of markers one step forward, and MatchStar to find all positions matched after a character class repetition. For example, Figure \ref{fig:bitwisematch} shows how the regular expression {\tt d[a-z]*ed} is matched against some input text using bitwise methods. In the first step the character class stream {\tt [d]} is matched and the results advanced one position to produce marker bitstream $M_1$.   Four match results are now in play simultaneously. The next step applies the MatchStar operation to find all the matches that may then be reached with the Kleene-* repetition {\tt [a-z]*} ($M_2$).  We now have pending matches at almost all possible positions.   However, most of the positions are filtered out with the match to {\tt [e]} in the computation of $M_3$. The final step produces marker stream $M_4$ indicating that single position at which the entire regular expression is matched. \comment{
• docs/Working/icGrep/evaluation.tex

 r4481 at the impact of optimizations and multithreading. \subsection{ICgrep vs. Contemporary Competitors} \subsection{Simple Property Expressions} A key feature of Unicode level 1 support in regular expression engines regular expression formed with a one of the property expressions and a positive lookbehind assertion on the other, while set difference uses a negative lookbehind assertion.  As all three programs support lookbehind assertions in this way, we systematically generated set intersection and difference in this way. a negative lookbehind assertion. We generated a set of regular expressions involving all Unicode values of the Unicode general category property ({\tt gc}) and all values of the Unicode script property ({\tt sc}).  We then generated the Unicode general category property ({\tt gc}) and all values of the Unicode script property ({\tt sc}). We then generated expressions involving random pairs of {\tt gc} and {\tt sc} values combined with a random set operator chosen from union, intersection and difference. All property values are represented at least once.   A small number of All property values are represented at least once. A small number of expressions were removed because they involved properties not supported by pcre2grep. In the end 246 test expressions were constructed in this process. \end{figure} \subsection{Complex Expressions} We also comparative performance of the matching engines on a series of more complex expressions as shown in Table \ref{table:complexexpr}. \begin{table} \begin{center} \begin{tabular}{|c|r|r|r|}  \hline Regular & \multicolumn{3}{|c|}{CPU cycles per byte} \\ \cline{2-4} Expression & icGrep{} & pcre2grep & ugrep \\ \hline blah    & 1 & 1000 & 100 \\ \hline \end{tabular} \caption{Matching Times for Complex Expressions}\label{table:complexexpr} \end{center} \end{table} \subsection{Optimizations of Bitwise Methods} \verb:(^|[ ])[a-zA-Z]{11,33}([.!? ]|\$):, for example. To assess the effectiveness of inserting if-statements, the To control the insertion of if-statements into dynamically generated code, the number of non-nullable pattern elements between the if-tests can be set with the {\tt -if-insertion-gap=} option.   The of the Unicode blocks represented in the input document.   For the classes covering the largest numbers of codepoints, we observed slowdowns of up to 5X.
Note: See TracChangeset for help on using the changeset viewer.