source: docs/Working/icGrep/evaluation.tex @ 4606

Last change on this file since 4606 was 4606, checked in by cameron, 4 years ago

Elaborate on if-insertion.

File size: 14.6 KB
Line 
1\section{Evaluation}\label{sec:evaluation}
2
3In this section, we report on the evaluation of \icGrep{} performance, looking at three aspects.   
4%
5First, we discuss some performance aspects of \icGrep{} internal methods, looking at the impact of optimizations discussed previously.
6%
7Then we move on to a systematic performance study of \icGrep{} with named Unicode property searches in comparison to two contemporary competitors,
8namely, pcre2grep released in January 2015 and ugrep of the ICU 54.1 software distribution.
9%
10Finally, we examine complex expressions and the impact of multithreading \icGrep{} on an
11Intel i7-2600 (3.4GHz) and i7-4700MQ (2.4GHz) processor.
12
13\subsection{Optimizations of Bitwise Methods}
14
15In order to support evaluation of bitwise methods, as well as to support
16the teaching of those methods and ongoing research, \icGrep{} has an array
17of command-line options.   This makes it straightforward
18to report on certain performance aspects of \icGrep{}, while others require
19special builds. 
20
21For example, the command-line switch \texttt{-disable-matchstar} can be used
22to eliminate the use of the MatchStar operation for handling
23Kleene-* repetition of character classes.   In this case, \icGrep{} substitutes
24a while loop that iteratively extends match results.   
25Surprisingly, this
26does not change performance much in many practical cases.   
27In each block,
28the maximum iteration count is the maximum length run encountered; the
29overall performance is based on the average of these maxima throughout the
30file.   But when search for XML tags using the regular expression
31\verb:<[^!?][^>]*>:, a slowdown of more than 2$\times$ may be found in files
32with many long tags. 
33
34%The {\tt -disable-log2-bounded-repetition} flag allows these
35%effectiveness of the special techniques for bounded repetition of
36%byte classes to be assessed.   A slowdown of 30\% was observed
37%with the searches using the regular expression
38%\verb:(^|[ ])[a-zA-Z]{11,33}([.!? ]|$):, for example.
39
40In order to short-circuit processing when no remaining matches
41are possible in a block, our regular expression compiler periodically inserts
42if-statements to check whether there are any marker bits still in play.
43To control this feature in dynamically
44generated code, the number of pattern elements between each if-test %non-nullable
45can be selected with the {\tt -if-insertion-gap=} option.   
46%
47The default value in \icGrep{} is 3; setting the gap to 100 effectively
48turns off if-insertion. 
49%
50Eliminating if-insertion sometimes improves performance by avoiding the extra if tests and branch mispredictions.
51%
52For patterns with long strings, however, there can be a substantial slowdown.
53
54%; searching for a pattern of length 40 slows down by more
55%than 50\% without the if-statement short-circuiting. %%% I think we'd need to show this always true to make this claim.
56
57\comment{
58Additionally, \icGrep{} provides options that allow
59various internal representations to be printed out.   
60%
61These can aid in understanding and/or debugging performance issues.
62For example, the option
63{\tt -print-REs} shows the parsed regular expression as it goes
64through various transformations.   The internal \Pablo{} code generated
65may be displayed with {\tt -print-pablo}.  This can be quite useful in
66helping understand the match process.   It also possible to print out the
67generated LLVM IR code ({\tt -dump-generated-IR}), but this may be
68less useful as it includes many
69details of low-level carry-handling that obscures the core logic.
70}
71
72The precompiled calculations of the various Unicode properties
73are each placed in if-hierarchies as described previously.   To assess the
74impact of this strategy, we built a version of icGrep without such
75if-hierarchies.  In this case, when a Unicode property class is defined,
76bitwise logic equations are applied for all members of the class independent
77of the Unicode blocks represented in the input document.   For the classes
78covering the largest numbers of codepoints, we observed slowdowns of up to 5$\times$.
79
80\subsection{Simple Property Expressions}
81
82A key feature of Unicode level 1 support in regular expression engines
83the support that they provide for property expressions and combinations of property expressions
84using set union, intersection and difference operators.   Both {\tt ugrep}
85and {\tt icgrep} provide systematic support for all property expressions
86at Unicode Level 1 as well as set union, intersection and difference.
87Unfortunately, {\tt pcre2grep} does not support the set intersection and difference operators directly.
88However, these operators can be expressed using a regular expression
89feature known as a lookbehind assertion.   Set intersection involves a
90regular expression formed with a one of the property expressions and a
91positive lookbehind assertion on the other, while set difference uses
92a negative lookbehind assertion. 
93
94We generated a set of regular expressions involving all Unicode values of
95the Unicode general category property ({\tt gc}) and all values of the Unicode
96script property ({\tt sc}). 
97We then generated
98expressions involving random pairs of {\tt gc} and {\tt sc}
99values combined with a random set operator chosen from union, intersection and difference.
100All property values are represented at least once.   
101A small number of
102expressions were removed because they involved properties not supported by pcre2grep.
103In the end 246 test expressions were constructed in this process.
104
105\begin{figure}
106\begin{center}
107\pgfplotstableread[col sep = comma]{data/icgrep-scatter.csv}\icgrep
108\pgfplotstableread[col sep = comma]{data/ugrep541-scatter.csv}\ugrep
109\pgfplotstableread[col sep = comma]{data/pcre2grep-scatter.csv}\pcre
110
111\begin{tikzpicture}
112\begin{axis}[
113grid=both,
114x tick label style={ /pgf/number format/1000 sep=},
115ylabel={CPU Cycles Per Byte},
116xlabel={Percentage of Matching Lines},
117minor y tick num={1},
118xmax=100,
119height=0.5\textwidth,
120legend style={at={(1.05,0.5)},
121anchor=west,legend columns=1,
122align=left,draw=none,column sep=2ex}
123]
124\addplot+[no markers,line width=2pt,color=blue!60,solid] table {\icgrep};
125\addplot+[no markers,line width=2pt,color=red!60,solid] table {\ugrep};
126\addplot+[no markers,line width=2pt,color=brown!60,solid] table {\pcre};
127\legend{icGrep,ugrep541,pcre2grep}
128\end{axis}
129
130
131\end{tikzpicture}
132\end{center}
133\caption{Matching Performance for Simple Property Expressions}\label{fig:property_test}
134\end{figure}
135
136We selected a set of Wikimedia XML files in several major languages representing
137most of the world's major language families as a test corpus.
138%
139For each program under test, we performed searches for each regular expression against each XML document.
140%
141Performance is reported in CPU cycles per byte on an Intel i7-2600 machine.   
142%
143The results are presented in Figure~\ref{fig:property_test}.
144%
145They were ranked by the percentage of matching lines found in the XML document and grouped in 5\% increments. 
146%
147When comparing the three programs, \icGrep{} exhibits dramatically better performance, particularly when searching for rare items.
148%
149The performance of both pcre2grep and ugrep improves (has a reduction in CPU cycles per byte) as the percentage of matching lines increases.
150%
151This occurs because each match allows them to bypass processing the rest of the line.
152%
153On the other hand, \icGrep{} shows a slight drop-off in performance with the number of matches found.   
154%
155This is primarily due to property classes that include large numbers of codepoints.   
156%
157These classes require more bitstream equations for calculation and also have a greater probability of matching.   
158%
159Nevertheless, the performance of \icGrep{} in matching the defined property expressions is stable and well ahead of the competitors in all cases.
160
161
162\subsection{Complex Expressions}
163
164\begin{table}\centering % requires booktabs
165\small\vspace{-2em}
166\begin{tabular}{@{}p{2.7cm}p{10.8cm}@{}}
167\textbf{Name}&\textbf{Regular Expression}\\
168\toprule
169Alphanumeric \#1&\verb`^[\p{L}\p{N}]*((\p{L}\p{N})|(\p{N}\p{L}))[\p{L}\p{N}]*$`\\
170\midrule
171Alphanumeric \#2&\verb`[\p{L}\p{N}]*((\p{L}\p{N})|(\p{N}\p{L}))[\p{L}\p{N}]*`\\
172\midrule
173Arabic&\verb`^[\p{Arabic}\p{Common}]*\p{Arabic}[\p{Arabic}\p{Common}]*$`\\
174\midrule
175Currency&\verb`(\p{Sc}\s*(\d*|(\d{1,3}([,.]\d{3})*))([,.]\d{2}?)?)|`\newline\verb`((\d*|(\d{1,3}([,.]\d{3})*))([,.]\d{2}?)?\s*\p{Sc})`\\
176\midrule
177Cyrillic&\verb`[\p{Pi}\p{Po}]\p{Cyrillic}{6,}[\p{Pf}\p{Pe}]`\\
178\midrule
179Email &\verb`([^\p{Z}<]+@[\p{L}\p{M}\p{N}.-]+\.(\p{L}\p{M}*){2,6})(>|\p{Z}|$)`\\
180\bottomrule
181\end{tabular}
182\caption{Regular Expressions}\label{table:regularexpr}
183\vspace{-2em}
184\end{table}
185
186This study evaluates the comparative performance of the matching engines on a
187series of more complex expressions, shown in Table \ref{table:regularexpr}.
188%
189The first two are alphanumeric expressions, differing only in that the first
190one is anchored to match the entire line.
191%
192The third searches for lines consisting of text in Arabic script.
193%
194The fourth expression is a published currency expression taken from
195Stewart and Uckelman~\cite{stewart2013unicode}.
196%
197An expression matching runs of six or more Cyrillic script characters enclosed
198in initial/opening and final/ending punctuation is fifth in the list.
199%
200The final expression is an email expression that allows internationalized names.
201%
202In Table \ref{table:complexexpr}, we show the performance results obtained
203from an Intel i7-2600 using generic 64-bit binaries for each engine.
204We limit the SIMD ISA within \icGrep{} to SSE2 which is available
205on all Intel/AMD 64-bit machines.
206%
207In each case, we report seconds taken per GB of input averaged over 10 
208runs each on our Wikimedia document collection.
209
210\begin{table}[ht]\centering % requires booktabs
211\newcolumntype{T}{c}
212\small\vspace{-2em}
213\begin{tabular}{@{}p{3cm}r@{~--~}rp{4pt}r@{~--~}rp{4pt}r@{~--~}rp{4pt}r@{~--~}rp{4pt}@{}}
214&\multicolumn{6}{c}{\textbf{\icGrep{}}}\\
215\cmidrule[1pt](lr){2-7}
216\cmidrule[1pt](lr){8-10}
217\cmidrule[1pt](lr){11-13}
218\textbf{Expression}&\multicolumn{3}{T}{\textbf{SEQ}}&\multicolumn{3}{T}{\textbf{MT}}&\multicolumn{3}{T}{\textbf{pcre2grep}}&\multicolumn{3}{T}{\textbf{ugrep541}}\\
219\toprule
220Alphanumeric \#1&2.4&5.0&&2.1&4.4&&8.2&11.3&&8.8&11.3&\\
221Alphanumeric \#2&2.3&4.9&&2.0&4.1&&209.9&563.5&&182.3&457.9&\\
222Arabic&1.5&3.4&&1.2&2.6&&7.5&270.8&&8.9&327.8&\\
223Currency&0.7&2.1&&0.4&1.4&&188.4&352.3&&52.8&152.8&\\
224Cyrillic&1.6&3.9&&1.3&2.8&&30.5&49.7&&11.2&20.1&\\
225Email&3.0&6.9&&2.7&6.4&&67.2&1442.0&&108.8&1022.3&\\
226\bottomrule
227\end{tabular}
228\caption{Matching Times for Complex Expressions (s/GB)}\label{table:complexexpr}
229\vspace{-2em}
230\end{table}
231
232
233The most striking aspect of Table \ref{table:complexexpr} is that both ugrep and pcregrep
234show dramatic slowdowns with ambiguities in regular expressions.
235%
236This is most clearly illustrated in the different performance figures
237for the two Alphanumeric test expressions but is also evident in the
238Arabic, Currency and Email expressions.   
239%
240Contrastingly, \icGrep{} maintains consistently fast performance in all test scenarios. 
241%
242The multithreaded \icGrep{} shows speedup in every case but balancing
243of the workload across multiple cores is clearly an area for further work.
244%
245Nevertheless, our three-thread system shows up to a 40\% speedup. %  over the single threaded version
246
247
248
249%
250Table \ref{table:relperf} shows the speedups obtained with \icGrep{}
251on a newer Intel i7-4700MQ machine, considering three SIMD ISA alternatives
252and both single-threaded and multi-threaded versions.
253All speedups are relative to the base single-threaded SSE2 performance on this machine,
254which is given in seconds per GB in the first column.
255%
256The SSE2 results are again using the generic binaries compiled for compatibility
257with all 64-bit processors.   
258%
259The AVX1 results are for Intel AVX instructions
260in 128-bit mode.  The main advantage of AVX1 over SSE2 is its support for 3-operand form,
261which helps reduce register pressure.   The AVX2 results are for \icGrep{}
262compiled to use the 256-bit AVX2 instructions, processing blocks of 256 bytes at a time.
263
264\begin{table}[h]\centering % requires booktabs,siunitx
265\small
266\vspace{-2em}
267\begin{tabular}{@{}p{3cm}l@{~}r@{~~}l@{~}r@{~~}l@{~}r@{~~}l@{~}r@{~~}l@{~}r@{~~}l@{~}r@{~~}@{}}
268&\multicolumn{2}{c}{\textbf{Base}}&\multicolumn{4}{c}{\textbf{SEQ}}&\multicolumn{6}{c}{\textbf{MT}}\\
269\cmidrule[1pt](lr){2-3}
270\cmidrule[1pt](lr){4-7}
271\cmidrule[1pt](lr){8-13}
272\textbf{Expression}&\multicolumn{2}{c}{\textbf{s/GB}}&\multicolumn{2}{c}{\textbf{AVX1}}&\multicolumn{2}{c}{\textbf{AVX2}}&\multicolumn{2}{c}{\textbf{SSE2}}&\multicolumn{2}{c}{\textbf{AVX1}}&\multicolumn{2}{c}{\textbf{AVX2}}\\
273\toprule
274Alphanumeric \#1&2.76&(.65)&1.05&(.03)&1.25&(.08)&1.18&(.02)&1.19&(.03)&1.59&(.10)\\
275Alphanumeric \#2&2.69&(.66)&1.05&(.02)&1.36&(.09)&1.20&(.03)&1.19&(.04)&1.80&(.11)\\
276Arabic&1.82&(.39)&1.05&(.03)&1.15&(.08)&1.37&(.03)&1.37&(.04)&1.66&(.10)\\
277Currency&1.04&(.30)&1.03&(.02)&1.04&(.06)&1.59&(.15)&1.61&(.14)&1.78&(.21)\\
278Cyrillic&2.10&(.44)&1.06&(.02)&0.96&(.06)&1.27&(.02)&1.33&(.04)&1.23&(.09)\\
279Email&3.57&(.87)&1.05&(.03)&1.37&(.14)&1.13&(.03)&1.16&(.04)&1.67&(.18)\\
280\midrule
281\textit{Geomean}&\multicolumn{2}{c}{--}&1.04&&1.18&&1.28&&1.30&&1.61&\\
282\bottomrule
283\end{tabular}
284\caption{Speedup of Complex Expressions on i7-4700MQ $(\sigma)$}\label{table:relperf}
285\vspace{-2em}
286\end{table}
287
288In each case, the use of three-operand form with AVX1 confers a slight
289speedup.  The change to use 256 bits with AVX2 gives a further overall improvement,
290but some mixed results due to the limitations of 256 bit addition.   Combining
291the AVX2 ISA with multithreading gives and average overall 61\% speedup compared to base.
292
293
294
295
296
297
298
299
300
301% \subsection{Single vs. Multithreaded Performance}
302%
303%
304% \begin{figure}
305% \begin{center}
306% \pgfplotstableread[col sep = comma]{data/icgrep-scatter-mt.csv}\base
307% \pgfplotstableread[col sep = comma]{data/icgrep-mt-scatter-mt.csv}\mt
308% \pgfplotstableread[col sep = comma]{data/icgrep-mt3-scatter-mt.csv}\mtt
309% \pgfplotstableread[col sep = comma]{data/icgrep-flat-scatter-mt.csv}\flat
310% \begin{tikzpicture}
311% \begin{axis}[
312% grid=both,
313% x tick label style={ /pgf/number format/1000 sep=},
314% ylabel={Seconds Per GB ($1000^3$)},
315% xlabel={Percentage of Matching Lines},
316% minor y tick num={1},
317% ymin=0,ymax=3,
318% xmax=100,
319% height=0.5\textwidth,
320% legend style={at={(1.05,0.5)},
321% anchor=west,legend columns=1,
322% align=left,draw=none,column sep=2ex}
323% ]
324% \addplot+[sharp plot, no markers,line width=2pt,color=blue!60,solid] table {\base};
325% \addplot+[sharp plot, no markers,line width=2pt,color=red!60,solid] table {\mt};
326% \addplot+[sharp plot, no markers,line width=2pt,color=brown!60,solid] table {\mtt};
327% %\addplot+[no markers,line width=2pt,color=green!60,solid] table {\flat};
328% \legend{icGrep (Base),icGrep (MT2),icGrep (MT3), icGrep (Flat)}
329% \end{axis}
330%
331%
332% \end{tikzpicture}
333% \end{center}
334% \caption{Multithreading Performance}\label{fig:performance_test}
335% \end{figure}
Note: See TracBrowser for help on using the repository browser.