source: docs/Working/icGrep/evaluation.tex @ 4503

Last change on this file since 4503 was 4503, checked in by cameron, 4 years ago

Evaluation

File size: 12.0 KB
Line 
1\section{Evaluation}\label{sec:evaluation}
2
3In this section, we report on the evaluation of ICgrep performance, looking
4at three aspects.   First we examine some performance aspects of
5ICgrep internal methods, looking at the impact of optimizations discussed previously.
6Then we move on to a systematic performance study of icGrep{} search
7performance with named Unicode property searches in comparison to two
8contemporary competitors, namely, pcre2grep released in January 2015
9and ugrep of the ICU 54.1 software distribution.  Finally, we look
10at some more complex expressions and also look at the impact
11of multithreading icGrep{}.
12
13\subsection{Optimizations of Bitwise Methods}
14
15In order to support evaluation of bitwise methods, as well as to support
16the teaching of those methods and ongoing research, \icGrep{} has an array
17of command-line options.   This makes it relatively straightforward
18to report on certain performance aspects of ICgrep, while others require
19special builds. 
20
21For example, the command-line switch {\tt -disable-matchstar} can be used
22to eliminate the use of the MatchStar operation for handling
23Kleene-* repetition of character classes.   In this case, \icGrep{} substitutes
24a while loop that iteratively extends match results.   
25Surprisingly, this
26does not change performance much in many practical cases.   
27In each block,
28the maximum iteration count is the maximum length run encountered; the
29overall performance is based on the average of these maximums throughout the
30file.   But when search for XML tags using the regular expression
31\verb:<[^!?][^>]*>:, a slowdown of more than 2X may be found in files
32with many long tags. 
33
34%The {\tt -disable-log2-bounded-repetition} flag allows these
35%effectiveness of the special techniques for bounded repetition of
36%byte classes to be assessed.   A slowdown of 30\% was observed
37%with the searches using the regular expression
38%\verb:(^|[ ])[a-zA-Z]{11,33}([.!? ]|$):, for example.
39
40To control the insertion of if-statements into dynamically
41generated code, the
42number of non-nullable pattern elements between the if-tests
43can be set with the {\tt -if-insertion-gap=} option.   The
44default value in \icGrep{} is 3, setting the gap to 100 effectively
45turns of if-insertion.   Eliminating if-insertion sometimes improves
46performance by avoiding the extra if tests and branch mispredications.
47For patterns with long strings, however, there can be a substantial
48slowdown; searching for a pattern of length 40 slows down by more
49than 50\% without the if-statement short-circuiting.
50
51ICgrep also provides options that allow
52various internal representations to be printed out.   These
53can aid in understanding and/or debugging performance issues.
54For example, the option
55{\tt -print-REs} show the parsed regular expression as it goes
56through various transformations.   The internal \Pablo{} code generated
57may be displayed with {\tt -print-\Pablo{}}.  This can be quite useful in
58helping understand the match process.   It also possible to print out the
59generated LLVM IR code ({\tt -dump-generated-IR}), but this may be
60less useful as it includes many
61details of low-level carry-handling that obscures the core logic.
62
63The precompiled calculations of the various Unicode properties
64are each placed in if-hierarchies as described previously.   To assess the
65impact of this strategy, we built a version of icGrep without such
66if-hierarchies.  In this case, when a Unicode property class is defined,
67bitwise logic equations are applied for all members of the class independent
68of the Unicode blocks represented in the input document.   For the classes
69covering the largest numbers of codepoints, we observed slowdowns of up to 5X.
70
71\subsection{Simple Property Expressions}
72
73A key feature of Unicode level 1 support in regular expression engines
74is how the support that they provide for property expressions and combinations of property expressions
75using set union, intersection and difference operators.   Both {\tt ugrep}
76and {\tt icgrep} provide systematic support for all property expressions
77at Unicode Level 1 as well as set union, intersection and difference.
78On the other hand, {\tt pcre2grep} does not support the set intersection and difference operators directly.
79However, these operators can instead be expressed using a regular expression
80feature known as a lookbehind assertion.   Set intersection involves a
81regular expression formed with a one of the property expressions and a
82positive lookbehind assertion on the other, while set difference uses
83a negative lookbehind assertion. 
84
85We generated a set of regular expressions involving all Unicode values of
86the Unicode general category property ({\tt gc}) and all values of the Unicode
87script property ({\tt sc}). 
88We then generated
89expressions involving random pairs of {\tt gc} and {\tt sc}
90values combined with a random set operator chosen from union, intersection and difference.
91All property values are represented at least once.   
92A small number of
93expressions were removed because they involved properties not supported by pcre2grep.
94In the end 246 test expressions were constructed in this process.
95
96\begin{figure}
97\begin{center}
98\pgfplotstableread[col sep = comma]{data/icgrep-scatter.csv}\icgrep
99\pgfplotstableread[col sep = comma]{data/ugrep541-scatter.csv}\ugrep
100\pgfplotstableread[col sep = comma]{data/pcre2grep-scatter.csv}\pcre
101
102\begin{tikzpicture}
103\begin{axis}[
104grid=both,
105x tick label style={ /pgf/number format/1000 sep=},
106ylabel={CPU Cycles Per Byte},
107xlabel={Percentage of Matching Lines},
108minor y tick num={1},
109xmax=100,
110height=0.5\textwidth,
111legend style={at={(1.05,0.5)},
112anchor=west,legend columns=1,
113align=left,draw=none,column sep=2ex}
114]
115\addplot+[no markers,line width=2pt,color=blue!60,solid] table {\icgrep};
116\addplot+[no markers,line width=2pt,color=red!60,solid] table {\ugrep};
117\addplot+[no markers,line width=2pt,color=brown!60,solid] table {\pcre};
118\legend{icGrep,ugrep541,pcre2grep}
119\end{axis}
120
121
122\end{tikzpicture}
123\end{center}
124\caption{Matching Performance for Simple Property Expressions}\label{fig:property_test}
125\end{figure}
126
127We selected a set of Wikimedia XML files in several major languages representing
128most of the world's major language families as a test corpus.   For each program
129under test, we perform searches for each regular expression against each XML document.
130Results are presented in Figure \ref{fig:property_test}.  Performance is reported
131in CPU cycles per byte on an Intel Core i7 machine.   The results were grouped
132by the percentage of matching lines found in the XML document, grouped in
1335\% increments.  ICgrep shows dramatically better performance, particularly
134when searching for rare items.
135As shown in the figure, pcre2grep and ugrep both show
136increased performance (reduced CPU cycles per byte) with increasing percentage
137of matches found.  In essence, each match found allows these programs
138to skip the full processing of the rest of the line.   On the other
139hand, icGrep shows a slight drop-off in performance with the number
140of matches found.   This is primarily due to property classes that
141include large numbers of codepoints.   These classes require more
142bitstream equations for calculation and also have a greater probability
143of matching.   Nevertheless, the performance of icGrep in matching
144the defined property expressions is stable and well ahead of the competitors
145in all cases.
146
147
148\subsection{Complex Expressions}
149
150\begin{table}\centering % requires booktabs
151\small
152\begin{tabular}{@{}p{2.7cm}p{10.8cm}@{}}
153\textbf{Name}&\textbf{Regular Expression}\\
154\toprule
155Alphanumeric \#1&\verb`^[\p{L}\p{N}]*((\p{L}\p{N})|(\p{N}\p{L}))[\p{L}\p{N}]*$`\\
156\midrule
157Alphanumeric \#2&\verb`[\p{L}\p{N}]*((\p{L}\p{N})|(\p{N}\p{L}))[\p{L}\p{N}]*`\\
158\midrule
159Arabic&\verb`^[\p{Arabic}\p{Common}]*\p{Arabic}[\p{Arabic}\p{Common}]*$`\\
160\midrule
161Currency&\verb`(\p{Sc}\s*(\d*|(\d{1,3}([,.]\d{3})*))([,.]\d{2}?)?)|`\newline\verb`((\d*|(\d{1,3}([,.]\d{3})*))([,.]\d{2}?)?\s*\p{Sc})`\\
162\midrule
163Cyrillic&\verb`[\p{Pi}\p{Po}]\p{Cyrillic}{6,}[\p{Pf}\p{Pe}]`\\
164\midrule
165Email &\verb`([^\p{Z}<]+@[\p{L}\p{M}\p{N}.-]+\.(\p{L}\p{M}*){2,6})(>|\p{Z}|$)`\\
166\bottomrule
167\end{tabular}
168\caption{Regular Expressions}\label{table:regularexpr}
169\end{table}
170
171
172We also comparative performance of the matching engines on a series
173of more complex expressions as shown in Table \ref{table:regularexpr}.
174The first two are alphanumeric expressions, differing only in the first
175one is anchored to match the entire line.  The third
176searches for lines consisting of text in Arabic script.
177The fourth expression is a published currency expression taken from
178Stewart and Uckelman \cite{stewart2013unicode}.
179An expression matching runs of 6 or more Cyrillic script characters enclosed
180in initial/opening and final/ending punctuation is fifth in the list.
181The final expression is an email expression that allows internationalized
182names.
183
184
185% \begin{table}
186% \begin{center}
187% \begin{tabular}{|c|r|r|r|}  \hline
188% Regular & \multicolumn{3}{|c|}{CPU cycles per byte} \\ \cline{2-4}
189% Expression & icGrep{} & pcre2grep & ugrep \\ \hline
190% blah  & 1 & 1000 & 100 \\ \hline
191% \end{tabular}
192% \caption{Matching Times for Complex Expressions}\label{table:complexexpr}
193% \end{center}
194% \end{table}
195
196
197\begin{table}\centering % requires booktabs
198\newcolumntype{T}{c}
199\small
200\begin{tabular}{@{}p{3cm}r@{~--~}rp{4pt}r@{~--~}rp{4pt}r@{~--~}rp{4pt}r@{~--~}rp{4pt}@{}}
201&\multicolumn{6}{c}{\textbf{\icGrep{}}}\\
202\cmidrule[1pt](lr){2-7}
203\cmidrule[1pt](lr){8-10}
204\cmidrule[1pt](lr){11-13}
205\textbf{Expression}&\multicolumn{3}{T}{\textbf{SEQ}}&\multicolumn{3}{T}{\textbf{MT}}&\multicolumn{3}{T}{\textbf{pcre2grep}}&\multicolumn{3}{T}{\textbf{ugrep541}}\\
206\toprule
207Alphanumeric \#1&2.4&5.0&&2.1&4.4&&8.2&11.3&&8.8&11.3&\\
208Alphanumeric \#2&2.3&4.9&&2.0&4.1&&209.9&563.5&&182.3&457.9&\\
209Arabic&1.5&3.4&&1.2&2.6&&7.5&270.8&&8.9&327.8&\\
210Currency&0.7&2.1&&0.4&1.4&&188.4&352.3&&52.8&152.8&\\
211Cyrillic&1.6&3.9&&1.3&2.8&&30.5&49.7&&11.2&20.1&\\
212Email&3.0&6.9&&2.7&6.4&&67.2&1442.0&&108.8&1022.3&\\
213\bottomrule
214\end{tabular}
215\caption{Matching Times for Complex Expressions (Seconds Per GB)}\label{table:complexexpr}
216\end{table}
217
218The performance results are shown in Table \ref{table:complexexpr}.
219In each case, we report seconds taken per GB of input averaged over 10 
220runs each on our Wikimedia document collection.
221The most striking aspect of the results is that both ugrep and pcregrep
222show dramatic slowdowns with ambiguities in regular expressions.
223This is most clearly illustrated in the different performance figures
224for the two Alphanumeric test expressions, but is also evident in the
225Arabic, Currency and Email expressions.   By way of contrast, icGrep{}
226maintains consistent fast performance in all test scenarios. 
227
228The multithreaded \icGrep{} shows speedup in every case, but balancing
229of the workload across multiple cores is clearly an area for further work. 
230Nevertheless, our three thread system shows a speedup of over
231the single threaded version by up to 40\%.
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247% \subsection{Single vs. Multithreaded Performance}
248%
249%
250% \begin{figure}
251% \begin{center}
252% \pgfplotstableread[col sep = comma]{data/icgrep-scatter-mt.csv}\base
253% \pgfplotstableread[col sep = comma]{data/icgrep-mt-scatter-mt.csv}\mt
254% \pgfplotstableread[col sep = comma]{data/icgrep-mt3-scatter-mt.csv}\mtt
255% \pgfplotstableread[col sep = comma]{data/icgrep-flat-scatter-mt.csv}\flat
256% \begin{tikzpicture}
257% \begin{axis}[
258% grid=both,
259% x tick label style={ /pgf/number format/1000 sep=},
260% ylabel={Seconds Per GB ($1000^3$)},
261% xlabel={Percentage of Matching Lines},
262% minor y tick num={1},
263% ymin=0,ymax=3,
264% xmax=100,
265% height=0.5\textwidth,
266% legend style={at={(1.05,0.5)},
267% anchor=west,legend columns=1,
268% align=left,draw=none,column sep=2ex}
269% ]
270% \addplot+[sharp plot, no markers,line width=2pt,color=blue!60,solid] table {\base};
271% \addplot+[sharp plot, no markers,line width=2pt,color=red!60,solid] table {\mt};
272% \addplot+[sharp plot, no markers,line width=2pt,color=brown!60,solid] table {\mtt};
273% %\addplot+[no markers,line width=2pt,color=green!60,solid] table {\flat};
274% \legend{icGrep (Base),icGrep (MT2),icGrep (MT3), icGrep (Flat)}
275% \end{axis}
276%
277%
278% \end{tikzpicture}
279% \end{center}
280% \caption{Multithreading Performance}\label{fig:performance_test}
281% \end{figure}
Note: See TracBrowser for help on using the repository browser.