source: docs/Working/icGrep/evaluation.tex @ 4498

Last change on this file since 4498 was 4498, checked in by nmedfort, 4 years ago

Temporary check in. Still getting table formatting right.

File size: 10.3 KB
RevLine 
[4446]1\section{Evaluation}\label{sec:evaluation}
2
[4465]3In this section, we report on the evaluation of ICgrep performance, looking
4at three aspects.   First we consider a performance studies in a series
5of Unicode regular expression search problems in comparison to the
6contemporary competitors, including pcre2grep released in January 2015
7and ugrep of the ICU 54.1 software distribution.  Then we move on to
8investigate some performance aspects of ICgrep internal methods, looking
9at the impact of optimizations and multithreading.
10
[4488]11\subsection{Simple Property Expressions}
[4465]12
[4469]13A key feature of Unicode level 1 support in regular expression engines
14is how the support that they provide for property expressions and combinations of property expressions
15using set union, intersection and difference operators.   Both {\tt ugrep}
16and {\tt icgrep} provide systematic support for all property expressions
17at Unicode Level 1 as well as set union, intersection and difference.
18On the other hand, {\tt pcre2grep} does not support the set intersection and difference operators directly.
19However, these operators can instead be expressed using a regular expression
20feature known as a lookbehind assertion.   Set intersection involves a
21regular expression formed with a one of the property expressions and a
22positive lookbehind assertion on the other, while set difference uses
[4488]23a negative lookbehind assertion. 
[4469]24
25We generated a set of regular expressions involving all Unicode values of
[4488]26the Unicode general category property ({\tt gc}) and all values of the Unicode
27script property ({\tt sc}). 
28We then generated
[4469]29expressions involving random pairs of {\tt gc} and {\tt sc}
30values combined with a random set operator chosen from union, intersection and difference.
[4488]31All property values are represented at least once.   
32A small number of
[4469]33expressions were removed because they involved properties not supported by pcre2grep.
34In the end 246 test expressions were constructed in this process.
35
36We selected a set of Wikimedia XML files in several major languages representing
37most of the world's major language families as a test corpus.   For each program
38under test, we perform searches for each regular expression against each XML document.
[4476]39Results are presented in Figure \ref{fig:property_test}.  Performance is reported
[4475]40in CPU cycles per byte on an Intel Core i7 machine.   The results were grouped
41by the percentage of matching lines found in the XML document, grouped in
425\% increments.  ICgrep shows dramatically better performance, particularly
43when searching for rare items.
44As shown in the figure, pcre2grep and ugrep both show
45increased performance (reduced CPU cycles per byte) with increasing percentage
46of matches found.  In essence, each match found allows these programs
47to skip the full processing of the rest of the line.   On the other
48hand, icGrep shows a slight drop-off in performance with the number
49of matches found.   This is primarily due to property classes that
50include large numbers of codepoints.   These classes require more
51bitstream equations for calculation and also have a greater probability
52of matching.   Nevertheless, the performance of icGrep in matching
53the defined property expressions is stable and well ahead of the competitors
54in all cases.
[4469]55
[4474]56\begin{figure}
[4475]57\begin{center}
[4474]58\pgfplotstableread[col sep = comma]{data/icgrep-scatter.csv}\icgrep
59\pgfplotstableread[col sep = comma]{data/ugrep541-scatter.csv}\ugrep
60\pgfplotstableread[col sep = comma]{data/pcre2grep-scatter.csv}\pcre
[4469]61
[4474]62\begin{tikzpicture}
63\begin{axis}[
64grid=both,
65x tick label style={ /pgf/number format/1000 sep=},
[4475]66ylabel={CPU Cycles Per Byte},
67xlabel={Percentage of Matching Lines},
[4474]68minor y tick num={1},
[4481]69xmax=100,
70height=0.5\textwidth,
71legend style={at={(1.05,0.5)},
72anchor=west,legend columns=1,
73align=left,draw=none,column sep=2ex}
[4474]74]
75\addplot+[no markers,line width=2pt,color=blue!60,solid] table {\icgrep};
76\addplot+[no markers,line width=2pt,color=red!60,solid] table {\ugrep};
77\addplot+[no markers,line width=2pt,color=brown!60,solid] table {\pcre};
78\legend{icGrep,ugrep541,pcre2grep}
79\end{axis}
[4469]80
[4474]81
82\end{tikzpicture}
[4475]83\end{center}
[4494]84\caption{Matching Performance for Simple Property Expressions}\label{fig:property_test}
[4474]85\end{figure}
86
[4488]87\subsection{Complex Expressions}
[4474]88
[4488]89We also comparative performance of the matching engines on a series
90of more complex expressions as shown in Table \ref{table:complexexpr}.
91
[4498]92% \begin{table}
93% \begin{center}
94% \begin{tabular}{|c|r|r|r|}  \hline
95% Regular & \multicolumn{3}{|c|}{CPU cycles per byte} \\ \cline{2-4}
96% Expression & icGrep{} & pcre2grep & ugrep \\ \hline
97% blah  & 1 & 1000 & 100 \\ \hline
98% \end{tabular}
99% \caption{Matching Times for Complex Expressions}\label{table:complexexpr}
100% \end{center}
101% \end{table}
102
103\begin{table}[ht]\centering % requires booktabs
104\small
105\begin{tabular}{@{}p{2.7cm}p{10.8cm}@{}}
106\textbf{Name}&\textbf{Regular Expression}\\
107\toprule
108alphanumeric \#1&\verb`^[\p{L}\p{N}]*((\p{L}\p{N})|(\p{N}\p{L}))[\p{L}\p{N}]*$`\\
109\midrule
110alphanumeric \#2&\verb`[\p{L}\p{N}]*((\p{L}\p{N})|(\p{N}\p{L}))[\p{L}\p{N}]*`\\
111\midrule
112arabic&\verb`^[\p{Arabic}\p{Common}]*\p{Arabic}[\p{Arabic}\p{Common}]*$`\\
113\midrule
114currency&\verb`(\p{Sc}\s*(\d*|(\d{1,3}([,.]\d{3})*))([,.]\d{2}?)?)|`\newline\verb`((\d*|(\d{1,3}([,.]\d{3})*))([,.]\d{2}?)?\s*\p{Sc})`\\
115\midrule
116email &\verb`([^\p{Z}<]+@[\p{L}\p{M}\p{N}.-]+\.(\p{L}\p{M}*){2,6})(>|\p{Z}|$)`\\
117\bottomrule
118\end{tabular}
119\caption{Regular Expressions}\label{table:regularexpr}
120\end{table}
121
122\begin{table}[h]\centering % requires booktabs
123\small
124\begin{tabular}{@{}p{2.7cm}r@{~--~}rr@{~--~}rr@{~--~}rr@{~--~}rr@{~--~}r@{}}
125Expression&\multicolumn{2}{>{\centering}p{2.2cm}}{icgrep}&\multicolumn{2}{>{\centering}p{2.2cm}}{icgrep-mt2}&\multicolumn{2}{>{\centering}p{2.2cm}}{icgrep-mt3}&\multicolumn{2}{>{\centering}p{2.2cm}}{pcre2grep}&\multicolumn{2}{>{\centering}p{2.2cm}}{ugrep541}\\
126\toprule
127alphanumeric \#1&2.4&5.0&2.0&4.1&2.0&4.1&8.2&11.3&8.8&11.3\\
128alphanumeric \#2&2.3&4.9&1.9&4.0&2.0&4.0&209.9&563.5&182.3&457.9\\
129arabic&1.5&3.4&1.1&2.4&1.1&2.5&7.5&270.8&8.9&327.8\\
130currency&0.7&2.1&0.6&1.5&0.4&1.3&188.4&352.3&52.8&152.8\\
131email&3.0&6.9&0.6&2.1&0.4&1.9&67.2&1442.0&108.8&1022.3\\
132\bottomrule
133\end{tabular}
[4488]134\caption{Matching Times for Complex Expressions}\label{table:complexexpr}
135\end{table}
136
[4498]137
138
139
140
141
142
143
144
145
146
147
148
149
[4465]150\subsection{Optimizations of Bitwise Methods}
151
152In order to support evaluation of bitwise methods, as well as to support
[4472]153the teaching of those methods and ongoing research, \icGrep{} has an array
[4465]154of command-line options.   This makes it relatively straightforward
155to report on certain performance aspects of ICgrep, while others require
[4466]156special builds. 
[4465]157
158For example, the command-line switch {\tt -disable-matchstar} can be used
159to eliminate the use of the MatchStar operation for handling
[4472]160Kleene-* repetition of character classes.   In this case, \icGrep{} substitutes
[4465]161a while loop that iteratively extends match results.   
162Surprisingly, this
163does not change performance much in many practical cases.   
164In each block,
165the maximum iteration count is the maximum length run encountered; the
166overall performance is based on the average of these maximums throughout the
167file.   But when search for XML tags using the regular expression
168\verb:<[^!?][^>]*>:, a slowdown of more than 2X may be found in files
[4466]169with many long tags. 
[4465]170
[4466]171The {\tt -disable-log2-bounded-repetition} flag allows these
172effectiveness of the special techniques for bounded repetition of
173byte classes to be assessed.   A slowdown of 30\% was observed
174with the searches using the regular expression
175\verb:(^|[ ])[a-zA-Z]{11,33}([.!? ]|$):, for example.
[4465]176
[4488]177To control the insertion of if-statements into dynamically
178generated code, the
[4466]179number of non-nullable pattern elements between the if-tests
180can be set with the {\tt -if-insertion-gap=} option.   The
[4472]181default value in \icGrep{} is 3, setting the gap to 100 effectively
[4466]182turns of if-insertion.   Eliminating if-insertion sometimes improves
183performance by avoiding the extra if tests and branch mispredications.
184For patterns with long strings, however, there can be a substantial
185slowdown; searching for a pattern of length 40 slows down by more
186than 50\% without the if-statement short-circuiting.
[4465]187
[4466]188ICgrep also provides options that allow
189various internal representations to be printed out.   These
190can aid in understanding and/or debugging performance issues.
191For example, the option
[4465]192{\tt -print-REs} show the parsed regular expression as it goes
193through various transformations.   The internal Pablo code generated
194may be displayed with {\tt -print-pablo}.  This can be quite useful in
195helping understand the match process.   It also possible to print out the
[4466]196generated LLVM IR code ({\tt -dump-generated-IR}), but this may be
197less useful as it includes many
[4465]198details of low-level carry-handling that obscures the core logic.
199
[4473]200The precompiled calculations of the various Unicode properties
201are each placed in if-hierarchies as described previously.   To assess the
202impact of this strategy, we built a version of icGrep without such
203if-hierarchies.  In this case, when a Unicode property class is defined,
204bitwise logic equations are applied for all members of the class independent
205of the Unicode blocks represented in the input document.   For the classes
206covering the largest numbers of codepoints, we observed slowdowns of up to 5X.
207
208
[4446]209\subsection{Single vs. Multithreaded Performance}
[4481]210
211
212\begin{figure}
213\begin{center}
214\pgfplotstableread[col sep = comma]{data/icgrep-scatter-mt.csv}\base
215\pgfplotstableread[col sep = comma]{data/icgrep-mt-scatter-mt.csv}\mt
216\pgfplotstableread[col sep = comma]{data/icgrep-mt3-scatter-mt.csv}\mtt
217\pgfplotstableread[col sep = comma]{data/icgrep-flat-scatter-mt.csv}\flat
218\begin{tikzpicture}
219\begin{axis}[
220grid=both,
221x tick label style={ /pgf/number format/1000 sep=},
222ylabel={Seconds Per GB ($1000^3$)},
223xlabel={Percentage of Matching Lines},
224minor y tick num={1},
225ymin=0,ymax=3,
226xmax=100,
227height=0.5\textwidth,
228legend style={at={(1.05,0.5)},
229anchor=west,legend columns=1,
230align=left,draw=none,column sep=2ex}
231]
232\addplot+[sharp plot, no markers,line width=2pt,color=blue!60,solid] table {\base};
233\addplot+[sharp plot, no markers,line width=2pt,color=red!60,solid] table {\mt};
234\addplot+[sharp plot, no markers,line width=2pt,color=brown!60,solid] table {\mtt};
235%\addplot+[no markers,line width=2pt,color=green!60,solid] table {\flat};
236\legend{icGrep (Base),icGrep (MT2),icGrep (MT3), icGrep (Flat)}
237\end{axis}
238
239
240\end{tikzpicture}
241\end{center}
242\caption{Multithreading Performance}\label{fig:performance_test}
243\end{figure}
Note: See TracBrowser for help on using the repository browser.