source: docs/Working/icGrep/evaluation.tex @ 4554

Last change on this file since 4554 was 4554, checked in by nmedfort, 4 years ago

Removed ICgrep

File size: 12.2 KB
RevLine 
[4446]1\section{Evaluation}\label{sec:evaluation}
2
[4554]3In this section, we report on the evaluation of \icGrep{} performance, looking
[4505]4at three aspects.   First, we examine some performance aspects of
[4554]5\icGrep{} internal methods, looking at the impact of optimizations discussed previously.
[4506]6Then we move on to a systematic performance study of \icGrep{}
7with named Unicode property searches in comparison to two
[4503]8contemporary competitors, namely, pcre2grep released in January 2015
[4506]9and ugrep of the ICU 54.1 software distribution.  Finally, we examine
10both more complex expressions and also the impact
[4505]11of multithreading \icGrep{}.
[4465]12
[4503]13\subsection{Optimizations of Bitwise Methods}
14
15In order to support evaluation of bitwise methods, as well as to support
16the teaching of those methods and ongoing research, \icGrep{} has an array
[4506]17of command-line options.   This makes it straightforward
[4554]18to report on certain performance aspects of \icGrep{}, while others require
[4503]19special builds. 
20
[4506]21For example, the command-line switch \texttt{-disable-matchstar} can be used
[4503]22to eliminate the use of the MatchStar operation for handling
23Kleene-* repetition of character classes.   In this case, \icGrep{} substitutes
24a while loop that iteratively extends match results.   
25Surprisingly, this
26does not change performance much in many practical cases.   
27In each block,
28the maximum iteration count is the maximum length run encountered; the
[4506]29overall performance is based on the average of these maxima throughout the
[4503]30file.   But when search for XML tags using the regular expression
[4506]31\verb:<[^!?][^>]*>:, a slowdown of more than 2$\times$ may be found in files
[4503]32with many long tags. 
33
34%The {\tt -disable-log2-bounded-repetition} flag allows these
35%effectiveness of the special techniques for bounded repetition of
36%byte classes to be assessed.   A slowdown of 30\% was observed
37%with the searches using the regular expression
38%\verb:(^|[ ])[a-zA-Z]{11,33}([.!? ]|$):, for example.
39
40To control the insertion of if-statements into dynamically
[4554]41generated code, the number of pattern elements between each if-test %non-nullable
42can be selected with the {\tt -if-insertion-gap=} option.   
43%
44The default value in \icGrep{} is 3; setting the gap to 100 effectively
45turns off if-insertion. 
46%
47Eliminating if-insertion sometimes improves performance by avoiding the extra if tests and branch mispredictions.
48%
49For patterns with long strings, however, there can be a substantial slowdown.
[4503]50
[4554]51%; searching for a pattern of length 40 slows down by more
52%than 50\% without the if-statement short-circuiting. %%% I think we'd need to show this always true to make this claim.
53
54Additionally, \icGrep{} provides options that allow
55various internal representations to be printed out.   
56%
57These can aid in understanding and/or debugging performance issues.
[4503]58For example, the option
[4506]59{\tt -print-REs} shows the parsed regular expression as it goes
[4503]60through various transformations.   The internal \Pablo{} code generated
[4554]61may be displayed with {\tt -print-pablo}.  This can be quite useful in
[4503]62helping understand the match process.   It also possible to print out the
63generated LLVM IR code ({\tt -dump-generated-IR}), but this may be
64less useful as it includes many
65details of low-level carry-handling that obscures the core logic.
66
67The precompiled calculations of the various Unicode properties
68are each placed in if-hierarchies as described previously.   To assess the
69impact of this strategy, we built a version of icGrep without such
70if-hierarchies.  In this case, when a Unicode property class is defined,
71bitwise logic equations are applied for all members of the class independent
72of the Unicode blocks represented in the input document.   For the classes
[4506]73covering the largest numbers of codepoints, we observed slowdowns of up to 5$\times$.
[4503]74
[4488]75\subsection{Simple Property Expressions}
[4465]76
[4469]77A key feature of Unicode level 1 support in regular expression engines
[4506]78the support that they provide for property expressions and combinations of property expressions
[4469]79using set union, intersection and difference operators.   Both {\tt ugrep}
80and {\tt icgrep} provide systematic support for all property expressions
81at Unicode Level 1 as well as set union, intersection and difference.
82On the other hand, {\tt pcre2grep} does not support the set intersection and difference operators directly.
83However, these operators can instead be expressed using a regular expression
84feature known as a lookbehind assertion.   Set intersection involves a
85regular expression formed with a one of the property expressions and a
86positive lookbehind assertion on the other, while set difference uses
[4488]87a negative lookbehind assertion. 
[4469]88
89We generated a set of regular expressions involving all Unicode values of
[4488]90the Unicode general category property ({\tt gc}) and all values of the Unicode
91script property ({\tt sc}). 
92We then generated
[4469]93expressions involving random pairs of {\tt gc} and {\tt sc}
94values combined with a random set operator chosen from union, intersection and difference.
[4488]95All property values are represented at least once.   
96A small number of
[4469]97expressions were removed because they involved properties not supported by pcre2grep.
98In the end 246 test expressions were constructed in this process.
99
[4474]100\begin{figure}
[4475]101\begin{center}
[4474]102\pgfplotstableread[col sep = comma]{data/icgrep-scatter.csv}\icgrep
103\pgfplotstableread[col sep = comma]{data/ugrep541-scatter.csv}\ugrep
104\pgfplotstableread[col sep = comma]{data/pcre2grep-scatter.csv}\pcre
[4469]105
[4474]106\begin{tikzpicture}
107\begin{axis}[
108grid=both,
109x tick label style={ /pgf/number format/1000 sep=},
[4475]110ylabel={CPU Cycles Per Byte},
111xlabel={Percentage of Matching Lines},
[4474]112minor y tick num={1},
[4481]113xmax=100,
114height=0.5\textwidth,
115legend style={at={(1.05,0.5)},
116anchor=west,legend columns=1,
117align=left,draw=none,column sep=2ex}
[4474]118]
119\addplot+[no markers,line width=2pt,color=blue!60,solid] table {\icgrep};
120\addplot+[no markers,line width=2pt,color=red!60,solid] table {\ugrep};
121\addplot+[no markers,line width=2pt,color=brown!60,solid] table {\pcre};
122\legend{icGrep,ugrep541,pcre2grep}
123\end{axis}
[4469]124
[4474]125
126\end{tikzpicture}
[4475]127\end{center}
[4494]128\caption{Matching Performance for Simple Property Expressions}\label{fig:property_test}
[4474]129\end{figure}
130
[4503]131We selected a set of Wikimedia XML files in several major languages representing
[4506]132most of the world's major language families as a test corpus.
[4554]133%
134For each program under test, we performed searches for each regular expression against each XML document.
135%
136Performance is reported in CPU cycles per byte on an Intel Core i7 machine.   
137%
138The results are presented in Figure~\ref{fig:property_test}.
139%
140They were ranked by the percentage of matching lines found in the XML document and grouped in 5\% increments. 
141%
142When comparing the three programs, \icGrep{} exhibits dramatically better performance, particularly when searching for rare items.
143%
144The performance of both pcre2grep and ugrep improves (has a reduction in CPU cycles per byte) as the percentage of matching lines increases.
145%
146This occurs because each match allows them to bypass processing the rest of the line.
147%
148On the other hand, \icGrep{} shows a slight drop-off in performance with the number of matches found.   
149%
150This is primarily due to property classes that include large numbers of codepoints.   
151%
152These classes require more bitstream equations for calculation and also have a greater probability of matching.   
153%
154Nevertheless, the performance of \icGrep{} in matching the defined property expressions is stable and well ahead of the competitors in all cases.
[4474]155
[4488]156
[4503]157\subsection{Complex Expressions}
[4498]158
[4502]159\begin{table}\centering % requires booktabs
[4498]160\small
161\begin{tabular}{@{}p{2.7cm}p{10.8cm}@{}}
162\textbf{Name}&\textbf{Regular Expression}\\
163\toprule
[4503]164Alphanumeric \#1&\verb`^[\p{L}\p{N}]*((\p{L}\p{N})|(\p{N}\p{L}))[\p{L}\p{N}]*$`\\
[4498]165\midrule
[4503]166Alphanumeric \#2&\verb`[\p{L}\p{N}]*((\p{L}\p{N})|(\p{N}\p{L}))[\p{L}\p{N}]*`\\
[4498]167\midrule
[4503]168Arabic&\verb`^[\p{Arabic}\p{Common}]*\p{Arabic}[\p{Arabic}\p{Common}]*$`\\
[4498]169\midrule
[4503]170Currency&\verb`(\p{Sc}\s*(\d*|(\d{1,3}([,.]\d{3})*))([,.]\d{2}?)?)|`\newline\verb`((\d*|(\d{1,3}([,.]\d{3})*))([,.]\d{2}?)?\s*\p{Sc})`\\
[4498]171\midrule
[4503]172Cyrillic&\verb`[\p{Pi}\p{Po}]\p{Cyrillic}{6,}[\p{Pf}\p{Pe}]`\\
173\midrule
174Email &\verb`([^\p{Z}<]+@[\p{L}\p{M}\p{N}.-]+\.(\p{L}\p{M}*){2,6})(>|\p{Z}|$)`\\
[4498]175\bottomrule
176\end{tabular}
177\caption{Regular Expressions}\label{table:regularexpr}
[4506]178\vspace{-1em}
[4498]179\end{table}
180
[4503]181
[4506]182We also examine the comparative performance of the matching engines on a
183series of more complex expressions as shown in Table \ref{table:regularexpr}.
184The first two are alphanumeric expressions, differing only in that the first
185one is anchored to match the entire line.
186The third searches for lines consisting of text in Arabic script.
[4503]187The fourth expression is a published currency expression taken from
[4506]188Stewart and Uckelman~\cite{stewart2013unicode}.
189An expression matching runs of six or more Cyrillic script characters enclosed
[4503]190in initial/opening and final/ending punctuation is fifth in the list.
191The final expression is an email expression that allows internationalized
192names.
193
194
195% \begin{table}
196% \begin{center}
197% \begin{tabular}{|c|r|r|r|}  \hline
198% Regular & \multicolumn{3}{|c|}{CPU cycles per byte} \\ \cline{2-4}
199% Expression & icGrep{} & pcre2grep & ugrep \\ \hline
200% blah  & 1 & 1000 & 100 \\ \hline
201% \end{tabular}
202% \caption{Matching Times for Complex Expressions}\label{table:complexexpr}
203% \end{center}
204% \end{table}
205
206
[4502]207\begin{table}\centering % requires booktabs
208\newcolumntype{T}{c}
[4498]209\small
[4502]210\begin{tabular}{@{}p{3cm}r@{~--~}rp{4pt}r@{~--~}rp{4pt}r@{~--~}rp{4pt}r@{~--~}rp{4pt}@{}}
211&\multicolumn{6}{c}{\textbf{\icGrep{}}}\\
212\cmidrule[1pt](lr){2-7}
213\cmidrule[1pt](lr){8-10}
214\cmidrule[1pt](lr){11-13}
215\textbf{Expression}&\multicolumn{3}{T}{\textbf{SEQ}}&\multicolumn{3}{T}{\textbf{MT}}&\multicolumn{3}{T}{\textbf{pcre2grep}}&\multicolumn{3}{T}{\textbf{ugrep541}}\\
[4498]216\toprule
[4502]217Alphanumeric \#1&2.4&5.0&&2.1&4.4&&8.2&11.3&&8.8&11.3&\\
218Alphanumeric \#2&2.3&4.9&&2.0&4.1&&209.9&563.5&&182.3&457.9&\\
219Arabic&1.5&3.4&&1.2&2.6&&7.5&270.8&&8.9&327.8&\\
220Currency&0.7&2.1&&0.4&1.4&&188.4&352.3&&52.8&152.8&\\
221Cyrillic&1.6&3.9&&1.3&2.8&&30.5&49.7&&11.2&20.1&\\
222Email&3.0&6.9&&2.7&6.4&&67.2&1442.0&&108.8&1022.3&\\
[4498]223\bottomrule
224\end{tabular}
[4502]225\caption{Matching Times for Complex Expressions (Seconds Per GB)}\label{table:complexexpr}
[4488]226\end{table}
227
[4503]228The performance results are shown in Table \ref{table:complexexpr}.
229In each case, we report seconds taken per GB of input averaged over 10 
230runs each on our Wikimedia document collection.
231The most striking aspect of the results is that both ugrep and pcregrep
232show dramatic slowdowns with ambiguities in regular expressions.
233This is most clearly illustrated in the different performance figures
[4506]234for the two Alphanumeric test expressions but is also evident in the
235Arabic, Currency and Email expressions.   By way of contrast, \icGrep{}
236maintains consistently fast performance in all test scenarios. 
[4498]237
[4503]238The multithreaded \icGrep{} shows speedup in every case, but balancing
239of the workload across multiple cores is clearly an area for further work. 
[4506]240Nevertheless, our three thread system shows a speedup over
[4503]241the single threaded version by up to 40\%.
[4498]242
243
244
245
246
247
248
249
250
251
252
253
[4465]254
255
256
[4502]257% \subsection{Single vs. Multithreaded Performance}
258%
259%
260% \begin{figure}
261% \begin{center}
262% \pgfplotstableread[col sep = comma]{data/icgrep-scatter-mt.csv}\base
263% \pgfplotstableread[col sep = comma]{data/icgrep-mt-scatter-mt.csv}\mt
264% \pgfplotstableread[col sep = comma]{data/icgrep-mt3-scatter-mt.csv}\mtt
265% \pgfplotstableread[col sep = comma]{data/icgrep-flat-scatter-mt.csv}\flat
266% \begin{tikzpicture}
267% \begin{axis}[
268% grid=both,
269% x tick label style={ /pgf/number format/1000 sep=},
270% ylabel={Seconds Per GB ($1000^3$)},
271% xlabel={Percentage of Matching Lines},
272% minor y tick num={1},
273% ymin=0,ymax=3,
274% xmax=100,
275% height=0.5\textwidth,
276% legend style={at={(1.05,0.5)},
277% anchor=west,legend columns=1,
278% align=left,draw=none,column sep=2ex}
279% ]
280% \addplot+[sharp plot, no markers,line width=2pt,color=blue!60,solid] table {\base};
281% \addplot+[sharp plot, no markers,line width=2pt,color=red!60,solid] table {\mt};
282% \addplot+[sharp plot, no markers,line width=2pt,color=brown!60,solid] table {\mtt};
283% %\addplot+[no markers,line width=2pt,color=green!60,solid] table {\flat};
284% \legend{icGrep (Base),icGrep (MT2),icGrep (MT3), icGrep (Flat)}
285% \end{axis}
286%
287%
288% \end{tikzpicture}
289% \end{center}
290% \caption{Multithreading Performance}\label{fig:performance_test}
291% \end{figure}
Note: See TracBrowser for help on using the repository browser.