# source:docs/Working/icGrep/evaluation.tex@4503

Last change on this file since 4503 was 4503, checked in by cameron, 4 years ago

Evaluation

File size: 12.0 KB
Line
1\section{Evaluation}\label{sec:evaluation}
2
3In this section, we report on the evaluation of ICgrep performance, looking
4at three aspects.   First we examine some performance aspects of
5ICgrep internal methods, looking at the impact of optimizations discussed previously.
6Then we move on to a systematic performance study of icGrep{} search
7performance with named Unicode property searches in comparison to two
8contemporary competitors, namely, pcre2grep released in January 2015
9and ugrep of the ICU 54.1 software distribution.  Finally, we look
10at some more complex expressions and also look at the impact
12
13\subsection{Optimizations of Bitwise Methods}
14
15In order to support evaluation of bitwise methods, as well as to support
16the teaching of those methods and ongoing research, \icGrep{} has an array
17of command-line options.   This makes it relatively straightforward
18to report on certain performance aspects of ICgrep, while others require
19special builds.
20
21For example, the command-line switch {\tt -disable-matchstar} can be used
22to eliminate the use of the MatchStar operation for handling
23Kleene-* repetition of character classes.   In this case, \icGrep{} substitutes
24a while loop that iteratively extends match results.
25Surprisingly, this
26does not change performance much in many practical cases.
27In each block,
28the maximum iteration count is the maximum length run encountered; the
29overall performance is based on the average of these maximums throughout the
30file.   But when search for XML tags using the regular expression
31\verb:<[^!?][^>]*>:, a slowdown of more than 2X may be found in files
32with many long tags.
33
34%The {\tt -disable-log2-bounded-repetition} flag allows these
35%effectiveness of the special techniques for bounded repetition of
36%byte classes to be assessed.   A slowdown of 30\% was observed
37%with the searches using the regular expression
38%\verb:(^|[ ])[a-zA-Z]{11,33}([.!? ]|):, for example. 39 40To control the insertion of if-statements into dynamically 41generated code, the 42number of non-nullable pattern elements between the if-tests 43can be set with the {\tt -if-insertion-gap=} option. The 44default value in \icGrep{} is 3, setting the gap to 100 effectively 45turns of if-insertion. Eliminating if-insertion sometimes improves 46performance by avoiding the extra if tests and branch mispredications. 47For patterns with long strings, however, there can be a substantial 48slowdown; searching for a pattern of length 40 slows down by more 49than 50\% without the if-statement short-circuiting. 50 51ICgrep also provides options that allow 52various internal representations to be printed out. These 53can aid in understanding and/or debugging performance issues. 54For example, the option 55{\tt -print-REs} show the parsed regular expression as it goes 56through various transformations. The internal \Pablo{} code generated 57may be displayed with {\tt -print-\Pablo{}}. This can be quite useful in 58helping understand the match process. It also possible to print out the 59generated LLVM IR code ({\tt -dump-generated-IR}), but this may be 60less useful as it includes many 61details of low-level carry-handling that obscures the core logic. 62 63The precompiled calculations of the various Unicode properties 64are each placed in if-hierarchies as described previously. To assess the 65impact of this strategy, we built a version of icGrep without such 66if-hierarchies. In this case, when a Unicode property class is defined, 67bitwise logic equations are applied for all members of the class independent 68of the Unicode blocks represented in the input document. For the classes 69covering the largest numbers of codepoints, we observed slowdowns of up to 5X. 70 71\subsection{Simple Property Expressions} 72 73A key feature of Unicode level 1 support in regular expression engines 74is how the support that they provide for property expressions and combinations of property expressions 75using set union, intersection and difference operators. Both {\tt ugrep} 76and {\tt icgrep} provide systematic support for all property expressions 77at Unicode Level 1 as well as set union, intersection and difference. 78On the other hand, {\tt pcre2grep} does not support the set intersection and difference operators directly. 79However, these operators can instead be expressed using a regular expression 80feature known as a lookbehind assertion. Set intersection involves a 81regular expression formed with a one of the property expressions and a 82positive lookbehind assertion on the other, while set difference uses 83a negative lookbehind assertion. 84 85We generated a set of regular expressions involving all Unicode values of 86the Unicode general category property ({\tt gc}) and all values of the Unicode 87script property ({\tt sc}). 88We then generated 89expressions involving random pairs of {\tt gc} and {\tt sc} 90values combined with a random set operator chosen from union, intersection and difference. 91All property values are represented at least once. 92A small number of 93expressions were removed because they involved properties not supported by pcre2grep. 94In the end 246 test expressions were constructed in this process. 95 96\begin{figure} 97\begin{center} 98\pgfplotstableread[col sep = comma]{data/icgrep-scatter.csv}\icgrep 99\pgfplotstableread[col sep = comma]{data/ugrep541-scatter.csv}\ugrep 100\pgfplotstableread[col sep = comma]{data/pcre2grep-scatter.csv}\pcre 101 102\begin{tikzpicture} 103\begin{axis}[ 104grid=both, 105x tick label style={ /pgf/number format/1000 sep=}, 106ylabel={CPU Cycles Per Byte}, 107xlabel={Percentage of Matching Lines}, 108minor y tick num={1}, 109xmax=100, 110height=0.5\textwidth, 111legend style={at={(1.05,0.5)}, 112anchor=west,legend columns=1, 113align=left,draw=none,column sep=2ex} 114] 115\addplot+[no markers,line width=2pt,color=blue!60,solid] table {\icgrep}; 116\addplot+[no markers,line width=2pt,color=red!60,solid] table {\ugrep}; 117\addplot+[no markers,line width=2pt,color=brown!60,solid] table {\pcre}; 118\legend{icGrep,ugrep541,pcre2grep} 119\end{axis} 120 121 122\end{tikzpicture} 123\end{center} 124\caption{Matching Performance for Simple Property Expressions}\label{fig:property_test} 125\end{figure} 126 127We selected a set of Wikimedia XML files in several major languages representing 128most of the world's major language families as a test corpus. For each program 129under test, we perform searches for each regular expression against each XML document. 130Results are presented in Figure \ref{fig:property_test}. Performance is reported 131in CPU cycles per byte on an Intel Core i7 machine. The results were grouped 132by the percentage of matching lines found in the XML document, grouped in 1335\% increments. ICgrep shows dramatically better performance, particularly 134when searching for rare items. 135As shown in the figure, pcre2grep and ugrep both show 136increased performance (reduced CPU cycles per byte) with increasing percentage 137of matches found. In essence, each match found allows these programs 138to skip the full processing of the rest of the line. On the other 139hand, icGrep shows a slight drop-off in performance with the number 140of matches found. This is primarily due to property classes that 141include large numbers of codepoints. These classes require more 142bitstream equations for calculation and also have a greater probability 143of matching. Nevertheless, the performance of icGrep in matching 144the defined property expressions is stable and well ahead of the competitors 145in all cases. 146 147 148\subsection{Complex Expressions} 149 150\begin{table}\centering % requires booktabs 151\small 152\begin{tabular}{@{}p{2.7cm}p{10.8cm}@{}} 153\textbf{Name}&\textbf{Regular Expression}\\ 154\toprule 155Alphanumeric \#1&\verb^[\p{L}\p{N}]*((\p{L}\p{N})|(\p{N}\p{L}))[\p{L}\p{N}]*\\
156\midrule
157Alphanumeric \#2&\verb[\p{L}\p{N}]*((\p{L}\p{N})|(\p{N}\p{L}))[\p{L}\p{N}]*\\
158\midrule
159Arabic&\verb^[\p{Arabic}\p{Common}]*\p{Arabic}[\p{Arabic}\p{Common}]*$\\ 160\midrule 161Currency&\verb(\p{Sc}\s*(\d*|(\d{1,3}([,.]\d{3})*))([,.]\d{2}?)?)|\newline\verb((\d*|(\d{1,3}([,.]\d{3})*))([,.]\d{2}?)?\s*\p{Sc})\\ 162\midrule 163Cyrillic&\verb[\p{Pi}\p{Po}]\p{Cyrillic}{6,}[\p{Pf}\p{Pe}]\\ 164\midrule 165Email &\verb([^\p{Z}<]+@[\p{L}\p{M}\p{N}.-]+\.(\p{L}\p{M}*){2,6})(>|\p{Z}|$)\\
166\bottomrule
167\end{tabular}
168\caption{Regular Expressions}\label{table:regularexpr}
169\end{table}
170
171
172We also comparative performance of the matching engines on a series
173of more complex expressions as shown in Table \ref{table:regularexpr}.
174The first two are alphanumeric expressions, differing only in the first
175one is anchored to match the entire line.  The third
176searches for lines consisting of text in Arabic script.
177The fourth expression is a published currency expression taken from
178Stewart and Uckelman \cite{stewart2013unicode}.
179An expression matching runs of 6 or more Cyrillic script characters enclosed
180in initial/opening and final/ending punctuation is fifth in the list.
181The final expression is an email expression that allows internationalized
182names.
183
184
185% \begin{table}
186% \begin{center}
187% \begin{tabular}{|c|r|r|r|}  \hline
188% Regular & \multicolumn{3}{|c|}{CPU cycles per byte} \\ \cline{2-4}
189% Expression & icGrep{} & pcre2grep & ugrep \\ \hline
190% blah  & 1 & 1000 & 100 \\ \hline
191% \end{tabular}
192% \caption{Matching Times for Complex Expressions}\label{table:complexexpr}
193% \end{center}
194% \end{table}
195
196
197\begin{table}\centering % requires booktabs
198\newcolumntype{T}{c}
199\small
200\begin{tabular}{@{}p{3cm}r@{~--~}rp{4pt}r@{~--~}rp{4pt}r@{~--~}rp{4pt}r@{~--~}rp{4pt}@{}}
201&\multicolumn{6}{c}{\textbf{\icGrep{}}}\\
202\cmidrule[1pt](lr){2-7}
203\cmidrule[1pt](lr){8-10}
204\cmidrule[1pt](lr){11-13}
205\textbf{Expression}&\multicolumn{3}{T}{\textbf{SEQ}}&\multicolumn{3}{T}{\textbf{MT}}&\multicolumn{3}{T}{\textbf{pcre2grep}}&\multicolumn{3}{T}{\textbf{ugrep541}}\\
206\toprule
207Alphanumeric \#1&2.4&5.0&&2.1&4.4&&8.2&11.3&&8.8&11.3&\\
208Alphanumeric \#2&2.3&4.9&&2.0&4.1&&209.9&563.5&&182.3&457.9&\\
209Arabic&1.5&3.4&&1.2&2.6&&7.5&270.8&&8.9&327.8&\\
210Currency&0.7&2.1&&0.4&1.4&&188.4&352.3&&52.8&152.8&\\
211Cyrillic&1.6&3.9&&1.3&2.8&&30.5&49.7&&11.2&20.1&\\
212Email&3.0&6.9&&2.7&6.4&&67.2&1442.0&&108.8&1022.3&\\
213\bottomrule
214\end{tabular}
215\caption{Matching Times for Complex Expressions (Seconds Per GB)}\label{table:complexexpr}
216\end{table}
217
218The performance results are shown in Table \ref{table:complexexpr}.
219In each case, we report seconds taken per GB of input averaged over 10
220runs each on our Wikimedia document collection.
221The most striking aspect of the results is that both ugrep and pcregrep
222show dramatic slowdowns with ambiguities in regular expressions.
223This is most clearly illustrated in the different performance figures
224for the two Alphanumeric test expressions, but is also evident in the
225Arabic, Currency and Email expressions.   By way of contrast, icGrep{}
226maintains consistent fast performance in all test scenarios.
227
228The multithreaded \icGrep{} shows speedup in every case, but balancing
229of the workload across multiple cores is clearly an area for further work.
230Nevertheless, our three thread system shows a speedup of over
231the single threaded version by up to 40\%.
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
248%
249%
250% \begin{figure}
251% \begin{center}
256% \begin{tikzpicture}
257% \begin{axis}[
258% grid=both,
259% x tick label style={ /pgf/number format/1000 sep=},
260% ylabel={Seconds Per GB ($1000^3$)},
261% xlabel={Percentage of Matching Lines},
262% minor y tick num={1},
263% ymin=0,ymax=3,
264% xmax=100,
265% height=0.5\textwidth,
266% legend style={at={(1.05,0.5)},
267% anchor=west,legend columns=1,
268% align=left,draw=none,column sep=2ex}
269% ]
270% \addplot+[sharp plot, no markers,line width=2pt,color=blue!60,solid] table {\base};
271% \addplot+[sharp plot, no markers,line width=2pt,color=red!60,solid] table {\mt};
272% \addplot+[sharp plot, no markers,line width=2pt,color=brown!60,solid] table {\mtt};
273% %\addplot+[no markers,line width=2pt,color=green!60,solid] table {\flat};
274% \legend{icGrep (Base),icGrep (MT2),icGrep (MT3), icGrep (Flat)}
275% \end{axis}
276%
277%
278% \end{tikzpicture}
279% \end{center}