# Changeset 4506

Timestamp: Feb 11, 2015, 8:11:36 PM
Message: Small fixes to eval
Location: docs/Working/icGrep
Files: 4 edited

r4504

  regular expression search was shown to deliver substantial performance acceleration for traditional ASCII regular expression matching tasks,
- often 5X or better \cite{cameron2014bitwise}.
+ often 5$\times$ or better \cite{cameron2014bitwise}.
r4505

  at three aspects.   First, we examine some performance aspects of ICgrep internal methods, looking at the impact of optimizations discussed previously.
- Then we move on to a systematic performance study of \icGrep{} search performance with named Unicode property searches in comparison to two
+ Then we move on to a systematic performance study of \icGrep{} with named Unicode property searches in comparison to two
  contemporary competitors, namely, pcre2grep released in January 2015
- and ugrep of the ICU 54.1 software distribution.  Finally, we look at some more complex expressions and also look at the impact
+ and ugrep of the ICU 54.1 software distribution.  Finally, we examine both more complex expressions and also the impact
  of multithreading \icGrep{}.
  In order to support evaluation of bitwise methods, as well as to support the teaching of those methods and ongoing research, \icGrep{} has an array
- of command-line options.   This makes it relatively straightforward
+ of command-line options.   This makes it straightforward
  to report on certain performance aspects of ICgrep, while others require special builds.
- For example, the command-line switch {\tt -disable-matchstar} can be used
+ For example, the command-line switch \texttt{-disable-matchstar} can be used
  to eliminate the use of the MatchStar operation for handling Kleene-* repetition of character classes.   In this case, \icGrep{} substitutes
  …
  In each block, the maximum iteration count is the maximum length run encountered;
- the overall performance is based on the average of these maximums throughout
+ the overall performance is based on the average of these maxima throughout
  the file.   But when search for XML tags using the regular expression
- \verb:<[^!?][^>]*>:, a slowdown of more than 2X may be found in files
+ \verb:<[^!?][^>]*>:, a slowdown of more than 2$\times$ may be found in files
  with many long tags.
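The MatchStar operation named in the hunk above can be illustrated with a small sketch. It is not taken from the icGrep sources. It assumes bit streams are modelled as unbounded Python integers in which bit `i` corresponds to text position `i`, and it uses the formula usually given for MatchStar in the bitwise data parallelism literature, `(((M & C) + C) ^ C) | M`. The function names `class_stream`, `match_star` and `match_star_loop` are hypothetical, and the loop version is only an assumed stand-in for the kind of iterative replacement the text describes when MatchStar is disabled.

```python
def class_stream(text, predicate):
    """Build a bit stream with bit i set when text[i] satisfies predicate."""
    return sum(1 << i for i, ch in enumerate(text) if predicate(ch))


def match_star(markers, char_class):
    """Single-pass MatchStar: extend each marker through a run of class bits.

    The addition carries each marker through the run of class positions it
    sits on; the XOR and OR then recover every position reachable after
    matching zero or more class characters.
    """
    return (((markers & char_class) + char_class) ^ char_class) | markers


def match_star_loop(markers, char_class):
    """Iterative equivalent (assumed illustration, not icGrep's code).

    Each pass advances the frontier by one position, so the number of
    passes grows with the longest run of class characters.
    """
    reached = markers
    frontier = markers
    while frontier:
        # Advance markers that sit on a class character, keep only new ones.
        frontier = ((frontier & char_class) << 1) & ~reached
        reached |= frontier
    return reached


if __name__ == "__main__":
    text = "a444b"
    digits = class_stream(text, str.isdigit)  # positions 1, 2, 3
    start = 1 << 1                            # marker just after the 'a'
    print(bin(match_star(start, digits)))
    print(bin(match_star_loop(start, digits)))
```

Both functions return the same stream; the difference is that the loop's cost depends on run length, which is consistent with the slowdown the text reports for files with many long tags.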
  To control the insertion of if-statements into dynamically generated code, the number of non-nullable pattern elements between the if-tests can be set with the {\tt -if-insertion-gap=} option.   The number of %non-nullable pattern elements between if-tests can be selected with the {\tt -if-insertion-gap=} option.   The default value in \icGrep{} is 3, setting the gap to 100 effectively
- turns of if-insertion.   Eliminating if-insertion sometimes improves performance by avoiding the extra if tests and branch mispredications.
+ turns off if-insertion.   Eliminating if-insertion sometimes improves performance by avoiding the extra if tests and branch mispredictions.
  For patterns with long strings, however, there can be a substantial slowdown; searching for a pattern of length 40 slows down by more than 50\% without the if-statement short-circuiting.
- ICgrep also provides options that allow
+ \ICgrep{} also provides options that allow
  various internal representations to be printed out.   These can aid in understanding and/or debugging performance issues. For example, the option
- {\tt -print-REs} show the parsed regular expression as it goes
+ {\tt -print-REs} shows the parsed regular expression as it goes
  through various transformations.   The internal \Pablo{} code generated may be displayed with {\tt -print-\Pablo{}}.  This can be quite useful in
  …
  bitwise logic equations are applied for all members of the class independent of the Unicode blocks represented in the input document.   For the classes
- covering the largest numbers of codepoints, we observed slowdowns of up to 5X.
+ covering the largest numbers of codepoints, we observed slowdowns of up to 5$\times$.
  \subsection{Simple Property Expressions}
  A key feature of Unicode level 1 support in regular expression engines is
- how the support that they provide for property expressions and combinations of property expressions
+ the support that they provide for property expressions and combinations of property expressions
  using set union, intersection and difference operators.   Both {\tt ugrep} and {\tt icgrep} provide systematic support for all property expressions
  …
  We selected a set of Wikimedia XML files in several major languages representing
- most of the world's major language families as a test corpus.   For each program under test, we perform searches for each regular expression against each XML document. Results are presented in Figure \ref{fig:property_test}.  Performance is reported
+ most of the world's major language families as a test corpus. For each program under test, we performed searches for each regular expression against each XML document. Results are presented in Figure~\ref{fig:property_test}.  Performance is reported
  in CPU cycles per byte on an Intel Core i7 machine.   The results were grouped by the percentage of matching lines found in the XML document, grouped in
  …
  \end{tabular}
  \caption{Regular Expressions}\label{table:regularexpr}
  \vspace{-1em}
  \end{table}
- We also comparative performance of the matching engines on a series of more complex expressions as shown in Table \ref{table:regularexpr}. The first two are alphanumeric expressions, differing only in the first one is anchored to match the entire line.  The third searches for lines consisting of text in Arabic script.
+ We also examine the comparative performance of the matching engines on a series of more complex expressions as shown in Table \ref{table:regularexpr}. The first two are alphanumeric expressions, differing only in that the first one is anchored to match the entire line. The third searches for lines consisting of text in Arabic script.
  The fourth expression is a published currency expression taken from
- Stewart and Uckelman \cite{stewart2013unicode}. An expression matching runs of 6 or more Cyrillic script characters enclosed
+ Stewart and Uckelman~\cite{stewart2013unicode}. An expression matching runs of six or more Cyrillic script characters enclosed
  in initial/opening and final/ending punctuation is fifth in the list. The final expression is an email expression that allows internationalized
  …
  show dramatic slowdowns with ambiguities in regular expressions. This is most clearly illustrated in the different performance figures
- for the two Alphanumeric test expressions, but is also evident in the Arabic, Currency and Email expressions.   By way of contrast, icGrep{} maintains consistent fast performance in all test scenarios.
+ for the two Alphanumeric test expressions but is also evident in the Arabic, Currency and Email expressions.   By way of contrast, \icGrep{} maintains consistently fast performance in all test scenarios.
  The multithreaded \icGrep{} shows speedup in every case, but balancing of the workload across multiple cores is clearly an area for further work.
- Nevertheless, our three thread system shows a speedup of over
+ Nevertheless, our three thread system shows a speedup over
  the single threaded version by up to 40\%.
r4502

  of dynamic compilation and bitwise data parallelism. In performance comparisons with several contemporary alternatives,
- 10X or better speedups are often observed.
+ 10$\times$ or better speedups are often observed.
  \end{abstract}