source: docs/Working/icGrep/evaluation.tex @ 4474

Last change on this file since 4474 was 4474, checked in by nmedfort, 4 years ago

Minor typo + figure

File size: 6.9 KB
3In this section, we report on the evaluation of ICgrep performance, looking
4at three aspects.   First we consider a performance studies in a series
5of Unicode regular expression search problems in comparison to the
6contemporary competitors, including pcre2grep released in January 2015
7and ugrep of the ICU 54.1 software distribution.  Then we move on to
8investigate some performance aspects of ICgrep internal methods, looking
9at the impact of optimizations and multithreading.
11\subsection{ICgrep vs. Contemporary Competitors}
13A key feature of Unicode level 1 support in regular expression engines
14is how the support that they provide for property expressions and combinations of property expressions
15using set union, intersection and difference operators.   Both {\tt ugrep}
16and {\tt icgrep} provide systematic support for all property expressions
17at Unicode Level 1 as well as set union, intersection and difference.
18On the other hand, {\tt pcre2grep} does not support the set intersection and difference operators directly.
19However, these operators can instead be expressed using a regular expression
20feature known as a lookbehind assertion.   Set intersection involves a
21regular expression formed with a one of the property expressions and a
22positive lookbehind assertion on the other, while set difference uses
23a negative lookbehind assertion.  As all three programs support lookbehind
24assertions in this way, we systematically generated set intersection and
25difference in this way.
27We generated a set of regular expressions involving all Unicode values of
28the Unicode general
29category property ({\tt gc}) and all values of the Unicode script property ({\tt sc}).  We then generated
30expressions involving random pairs of {\tt gc} and {\tt sc}
31values combined with a random set operator chosen from union, intersection and difference.
32All property values are represented at least once.   A small number of
33expressions were removed because they involved properties not supported by pcre2grep.
34In the end 246 test expressions were constructed in this process.
36We selected a set of Wikimedia XML files in several major languages representing
37most of the world's major language families as a test corpus.   For each program
38under test, we perform searches for each regular expression against each XML document.
39Searches were repeated n times.  Table \ref{tbl:property_test} shows the results.
43\caption{Performance of Matching Property and Property Combinations}\label{tbl:property_test}
47\pgfplotstableread[col sep = comma]{data/icgrep-scatter.csv}\icgrep
48\pgfplotstableread[col sep = comma]{data/ugrep541-scatter.csv}\ugrep
49\pgfplotstableread[col sep = comma]{data/pcre2grep-scatter.csv}\pcre
54x tick label style={ /pgf/number format/1000 sep=},
55% x buffer=sort,
56ylabel={Cycles Per Byte},
57xlabel={Match Percentage},
58minor y tick num={1},
62\addplot+[no markers,line width=2pt,color=blue!60,solid] table {\icgrep};
63\addplot+[no markers,line width=2pt,color=red!60,solid] table {\ugrep};
64\addplot+[no markers,line width=2pt,color=brown!60,solid] table {\pcre};
73% \begin{figure}
74% \pgfplotstableread[col sep = comma]{data/icgrep-cp-scatter.csv}\icgrep
75% \pgfplotstableread[col sep = comma]{data/icgrep-flat-cp-scatter.csv}\icgrepf
78% \begin{tikzpicture}
79% \begin{semilogxaxis}[
80% grid=y,
81% x tick label style={ /pgf/number format/1000 sep=},
82% % x buffer=sort,
83% ylabel={Cycles Per Byte},
84% xlabel={Match Percentage},
85% minor y tick num={1},
86% %xmax=100
87% %ymax=30000000
88% ]
89% \addplot+[no markers,line width=2pt,color=blue!60,solid] table {\icgrep};
90% \addplot+[no markers,line width=2pt,color=red!60,solid] table {\icgrepf};
92% \end{semilogxaxis}
93% \end{tikzpicture}
95% \end{figure}
98\subsection{Optimizations of Bitwise Methods}
100In order to support evaluation of bitwise methods, as well as to support
101the teaching of those methods and ongoing research, \icGrep{} has an array
102of command-line options.   This makes it relatively straightforward
103to report on certain performance aspects of ICgrep, while others require
104special builds. 
106For example, the command-line switch {\tt -disable-matchstar} can be used
107to eliminate the use of the MatchStar operation for handling
108Kleene-* repetition of character classes.   In this case, \icGrep{} substitutes
109a while loop that iteratively extends match results.   
110Surprisingly, this
111does not change performance much in many practical cases.   
112In each block,
113the maximum iteration count is the maximum length run encountered; the
114overall performance is based on the average of these maximums throughout the
115file.   But when search for XML tags using the regular expression
116\verb:<[^!?][^>]*>:, a slowdown of more than 2X may be found in files
117with many long tags. 
119The {\tt -disable-log2-bounded-repetition} flag allows these
120effectiveness of the special techniques for bounded repetition of
121byte classes to be assessed.   A slowdown of 30\% was observed
122with the searches using the regular expression
123\verb:(^|[ ])[a-zA-Z]{11,33}([.!? ]|$):, for example.
125To assess the effectiveness of inserting if-statements, the
126number of non-nullable pattern elements between the if-tests
127can be set with the {\tt -if-insertion-gap=} option.   The
128default value in \icGrep{} is 3, setting the gap to 100 effectively
129turns of if-insertion.   Eliminating if-insertion sometimes improves
130performance by avoiding the extra if tests and branch mispredications.
131For patterns with long strings, however, there can be a substantial
132slowdown; searching for a pattern of length 40 slows down by more
133than 50\% without the if-statement short-circuiting.
135ICgrep also provides options that allow
136various internal representations to be printed out.   These
137can aid in understanding and/or debugging performance issues.
138For example, the option
139{\tt -print-REs} show the parsed regular expression as it goes
140through various transformations.   The internal Pablo code generated
141may be displayed with {\tt -print-pablo}.  This can be quite useful in
142helping understand the match process.   It also possible to print out the
143generated LLVM IR code ({\tt -dump-generated-IR}), but this may be
144less useful as it includes many
145details of low-level carry-handling that obscures the core logic.
147The precompiled calculations of the various Unicode properties
148are each placed in if-hierarchies as described previously.   To assess the
149impact of this strategy, we built a version of icGrep without such
150if-hierarchies.  In this case, when a Unicode property class is defined,
151bitwise logic equations are applied for all members of the class independent
152of the Unicode blocks represented in the input document.   For the classes
153covering the largest numbers of codepoints, we observed slowdowns of up to 5X.
157\subsection{Single vs. Multithreaded Performance}
Note: See TracBrowser for help on using the repository browser.