source: docs/Working/icGrep/evaluation.tex @ 4506

Last change on this file since 4506 was 4506, checked in by nmedfort, 4 years ago

Small fixes to eval

File size: 12.1 KB
Line 
1\section{Evaluation}\label{sec:evaluation}
2
3In this section, we report on the evaluation of ICgrep performance, looking
4at three aspects.   First, we examine some performance aspects of
5ICgrep internal methods, looking at the impact of optimizations discussed previously.
6Then we move on to a systematic performance study of \icGrep{}
7with named Unicode property searches in comparison to two
8contemporary competitors, namely, pcre2grep released in January 2015
9and ugrep of the ICU 54.1 software distribution.  Finally, we examine
10both more complex expressions and also the impact
11of multithreading \icGrep{}.
12
13\subsection{Optimizations of Bitwise Methods}
14
15In order to support evaluation of bitwise methods, as well as to support
16the teaching of those methods and ongoing research, \icGrep{} has an array
17of command-line options.   This makes it straightforward
18to report on certain performance aspects of ICgrep, while others require
19special builds. 
20
21For example, the command-line switch \texttt{-disable-matchstar} can be used
22to eliminate the use of the MatchStar operation for handling
23Kleene-* repetition of character classes.   In this case, \icGrep{} substitutes
24a while loop that iteratively extends match results.   
25Surprisingly, this
26does not change performance much in many practical cases.   
27In each block,
28the maximum iteration count is the maximum length run encountered; the
29overall performance is based on the average of these maxima throughout the
30file.   But when search for XML tags using the regular expression
31\verb:<[^!?][^>]*>:, a slowdown of more than 2$\times$ may be found in files
32with many long tags. 
33
34%The {\tt -disable-log2-bounded-repetition} flag allows these
35%effectiveness of the special techniques for bounded repetition of
36%byte classes to be assessed.   A slowdown of 30\% was observed
37%with the searches using the regular expression
38%\verb:(^|[ ])[a-zA-Z]{11,33}([.!? ]|$):, for example.
39
40To control the insertion of if-statements into dynamically
41generated code, the
42number of
43%non-nullable
44pattern elements between if-tests
45can be selected with the {\tt -if-insertion-gap=} option.   The
46default value in \icGrep{} is 3, setting the gap to 100 effectively
47turns off if-insertion.   Eliminating if-insertion sometimes improves
48performance by avoiding the extra if tests and branch mispredictions.
49For patterns with long strings, however, there can be a substantial
50slowdown; searching for a pattern of length 40 slows down by more
51than 50\% without the if-statement short-circuiting.
52
53\ICgrep{} also provides options that allow
54various internal representations to be printed out.   These
55can aid in understanding and/or debugging performance issues.
56For example, the option
57{\tt -print-REs} shows the parsed regular expression as it goes
58through various transformations.   The internal \Pablo{} code generated
59may be displayed with {\tt -print-\Pablo{}}.  This can be quite useful in
60helping understand the match process.   It also possible to print out the
61generated LLVM IR code ({\tt -dump-generated-IR}), but this may be
62less useful as it includes many
63details of low-level carry-handling that obscures the core logic.
64
65The precompiled calculations of the various Unicode properties
66are each placed in if-hierarchies as described previously.   To assess the
67impact of this strategy, we built a version of icGrep without such
68if-hierarchies.  In this case, when a Unicode property class is defined,
69bitwise logic equations are applied for all members of the class independent
70of the Unicode blocks represented in the input document.   For the classes
71covering the largest numbers of codepoints, we observed slowdowns of up to 5$\times$.
72
73\subsection{Simple Property Expressions}
74
75A key feature of Unicode level 1 support in regular expression engines
76the support that they provide for property expressions and combinations of property expressions
77using set union, intersection and difference operators.   Both {\tt ugrep}
78and {\tt icgrep} provide systematic support for all property expressions
79at Unicode Level 1 as well as set union, intersection and difference.
80On the other hand, {\tt pcre2grep} does not support the set intersection and difference operators directly.
81However, these operators can instead be expressed using a regular expression
82feature known as a lookbehind assertion.   Set intersection involves a
83regular expression formed with a one of the property expressions and a
84positive lookbehind assertion on the other, while set difference uses
85a negative lookbehind assertion. 
86
87We generated a set of regular expressions involving all Unicode values of
88the Unicode general category property ({\tt gc}) and all values of the Unicode
89script property ({\tt sc}). 
90We then generated
91expressions involving random pairs of {\tt gc} and {\tt sc}
92values combined with a random set operator chosen from union, intersection and difference.
93All property values are represented at least once.   
94A small number of
95expressions were removed because they involved properties not supported by pcre2grep.
96In the end 246 test expressions were constructed in this process.
97
98\begin{figure}
99\begin{center}
100\pgfplotstableread[col sep = comma]{data/icgrep-scatter.csv}\icgrep
101\pgfplotstableread[col sep = comma]{data/ugrep541-scatter.csv}\ugrep
102\pgfplotstableread[col sep = comma]{data/pcre2grep-scatter.csv}\pcre
103
104\begin{tikzpicture}
105\begin{axis}[
106grid=both,
107x tick label style={ /pgf/number format/1000 sep=},
108ylabel={CPU Cycles Per Byte},
109xlabel={Percentage of Matching Lines},
110minor y tick num={1},
111xmax=100,
112height=0.5\textwidth,
113legend style={at={(1.05,0.5)},
114anchor=west,legend columns=1,
115align=left,draw=none,column sep=2ex}
116]
117\addplot+[no markers,line width=2pt,color=blue!60,solid] table {\icgrep};
118\addplot+[no markers,line width=2pt,color=red!60,solid] table {\ugrep};
119\addplot+[no markers,line width=2pt,color=brown!60,solid] table {\pcre};
120\legend{icGrep,ugrep541,pcre2grep}
121\end{axis}
122
123
124\end{tikzpicture}
125\end{center}
126\caption{Matching Performance for Simple Property Expressions}\label{fig:property_test}
127\end{figure}
128
129We selected a set of Wikimedia XML files in several major languages representing
130most of the world's major language families as a test corpus.
131For each program under test, we performed searches for each regular
132expression against each XML document.
133Results are presented in Figure~\ref{fig:property_test}.  Performance is reported
134in CPU cycles per byte on an Intel Core i7 machine.   The results were grouped
135by the percentage of matching lines found in the XML document, grouped in
1365\% increments.  ICgrep shows dramatically better performance, particularly
137when searching for rare items.
138As shown in the figure, pcre2grep and ugrep both show
139increased performance (reduced CPU cycles per byte) with increasing percentage
140of matches found.  In essence, each match found allows these programs
141to skip the full processing of the rest of the line.   On the other
142hand, icGrep shows a slight drop-off in performance with the number
143of matches found.   This is primarily due to property classes that
144include large numbers of codepoints.   These classes require more
145bitstream equations for calculation and also have a greater probability
146of matching.   Nevertheless, the performance of icGrep in matching
147the defined property expressions is stable and well ahead of the competitors
148in all cases.
149
150
151\subsection{Complex Expressions}
152
153\begin{table}\centering % requires booktabs
154\small
155\begin{tabular}{@{}p{2.7cm}p{10.8cm}@{}}
156\textbf{Name}&\textbf{Regular Expression}\\
157\toprule
158Alphanumeric \#1&\verb`^[\p{L}\p{N}]*((\p{L}\p{N})|(\p{N}\p{L}))[\p{L}\p{N}]*$`\\
159\midrule
160Alphanumeric \#2&\verb`[\p{L}\p{N}]*((\p{L}\p{N})|(\p{N}\p{L}))[\p{L}\p{N}]*`\\
161\midrule
162Arabic&\verb`^[\p{Arabic}\p{Common}]*\p{Arabic}[\p{Arabic}\p{Common}]*$`\\
163\midrule
164Currency&\verb`(\p{Sc}\s*(\d*|(\d{1,3}([,.]\d{3})*))([,.]\d{2}?)?)|`\newline\verb`((\d*|(\d{1,3}([,.]\d{3})*))([,.]\d{2}?)?\s*\p{Sc})`\\
165\midrule
166Cyrillic&\verb`[\p{Pi}\p{Po}]\p{Cyrillic}{6,}[\p{Pf}\p{Pe}]`\\
167\midrule
168Email &\verb`([^\p{Z}<]+@[\p{L}\p{M}\p{N}.-]+\.(\p{L}\p{M}*){2,6})(>|\p{Z}|$)`\\
169\bottomrule
170\end{tabular}
171\caption{Regular Expressions}\label{table:regularexpr}
172\vspace{-1em}
173\end{table}
174
175
176We also examine the comparative performance of the matching engines on a
177series of more complex expressions as shown in Table \ref{table:regularexpr}.
178The first two are alphanumeric expressions, differing only in that the first
179one is anchored to match the entire line.
180The third searches for lines consisting of text in Arabic script.
181The fourth expression is a published currency expression taken from
182Stewart and Uckelman~\cite{stewart2013unicode}.
183An expression matching runs of six or more Cyrillic script characters enclosed
184in initial/opening and final/ending punctuation is fifth in the list.
185The final expression is an email expression that allows internationalized
186names.
187
188
189% \begin{table}
190% \begin{center}
191% \begin{tabular}{|c|r|r|r|}  \hline
192% Regular & \multicolumn{3}{|c|}{CPU cycles per byte} \\ \cline{2-4}
193% Expression & icGrep{} & pcre2grep & ugrep \\ \hline
194% blah  & 1 & 1000 & 100 \\ \hline
195% \end{tabular}
196% \caption{Matching Times for Complex Expressions}\label{table:complexexpr}
197% \end{center}
198% \end{table}
199
200
201\begin{table}\centering % requires booktabs
202\newcolumntype{T}{c}
203\small
204\begin{tabular}{@{}p{3cm}r@{~--~}rp{4pt}r@{~--~}rp{4pt}r@{~--~}rp{4pt}r@{~--~}rp{4pt}@{}}
205&\multicolumn{6}{c}{\textbf{\icGrep{}}}\\
206\cmidrule[1pt](lr){2-7}
207\cmidrule[1pt](lr){8-10}
208\cmidrule[1pt](lr){11-13}
209\textbf{Expression}&\multicolumn{3}{T}{\textbf{SEQ}}&\multicolumn{3}{T}{\textbf{MT}}&\multicolumn{3}{T}{\textbf{pcre2grep}}&\multicolumn{3}{T}{\textbf{ugrep541}}\\
210\toprule
211Alphanumeric \#1&2.4&5.0&&2.1&4.4&&8.2&11.3&&8.8&11.3&\\
212Alphanumeric \#2&2.3&4.9&&2.0&4.1&&209.9&563.5&&182.3&457.9&\\
213Arabic&1.5&3.4&&1.2&2.6&&7.5&270.8&&8.9&327.8&\\
214Currency&0.7&2.1&&0.4&1.4&&188.4&352.3&&52.8&152.8&\\
215Cyrillic&1.6&3.9&&1.3&2.8&&30.5&49.7&&11.2&20.1&\\
216Email&3.0&6.9&&2.7&6.4&&67.2&1442.0&&108.8&1022.3&\\
217\bottomrule
218\end{tabular}
219\caption{Matching Times for Complex Expressions (Seconds Per GB)}\label{table:complexexpr}
220\end{table}
221
222The performance results are shown in Table \ref{table:complexexpr}.
223In each case, we report seconds taken per GB of input averaged over 10 
224runs each on our Wikimedia document collection.
225The most striking aspect of the results is that both ugrep and pcregrep
226show dramatic slowdowns with ambiguities in regular expressions.
227This is most clearly illustrated in the different performance figures
228for the two Alphanumeric test expressions but is also evident in the
229Arabic, Currency and Email expressions.   By way of contrast, \icGrep{}
230maintains consistently fast performance in all test scenarios. 
231
232The multithreaded \icGrep{} shows speedup in every case, but balancing
233of the workload across multiple cores is clearly an area for further work. 
234Nevertheless, our three thread system shows a speedup over
235the single threaded version by up to 40\%.
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251% \subsection{Single vs. Multithreaded Performance}
252%
253%
254% \begin{figure}
255% \begin{center}
256% \pgfplotstableread[col sep = comma]{data/icgrep-scatter-mt.csv}\base
257% \pgfplotstableread[col sep = comma]{data/icgrep-mt-scatter-mt.csv}\mt
258% \pgfplotstableread[col sep = comma]{data/icgrep-mt3-scatter-mt.csv}\mtt
259% \pgfplotstableread[col sep = comma]{data/icgrep-flat-scatter-mt.csv}\flat
260% \begin{tikzpicture}
261% \begin{axis}[
262% grid=both,
263% x tick label style={ /pgf/number format/1000 sep=},
264% ylabel={Seconds Per GB ($1000^3$)},
265% xlabel={Percentage of Matching Lines},
266% minor y tick num={1},
267% ymin=0,ymax=3,
268% xmax=100,
269% height=0.5\textwidth,
270% legend style={at={(1.05,0.5)},
271% anchor=west,legend columns=1,
272% align=left,draw=none,column sep=2ex}
273% ]
274% \addplot+[sharp plot, no markers,line width=2pt,color=blue!60,solid] table {\base};
275% \addplot+[sharp plot, no markers,line width=2pt,color=red!60,solid] table {\mt};
276% \addplot+[sharp plot, no markers,line width=2pt,color=brown!60,solid] table {\mtt};
277% %\addplot+[no markers,line width=2pt,color=green!60,solid] table {\flat};
278% \legend{icGrep (Base),icGrep (MT2),icGrep (MT3), icGrep (Flat)}
279% \end{axis}
280%
281%
282% \end{tikzpicture}
283% \end{center}
284% \caption{Multithreading Performance}\label{fig:performance_test}
285% \end{figure}
Note: See TracBrowser for help on using the repository browser.