1 | \section{Discussion}\label{sec:Concl} |
---|
2 | \paragraph*{Contributions} A new class of regular expression matching algorithm has been |
---|
3 | introduced based on the concept of bit-parallel data streams |
---|
4 | together with the MatchStar operation. The algorithm is |
---|
5 | fully general for nondeterministic regular expression matching; |
---|
6 | however it does not address the nonregular extensions |
---|
7 | found in Perl-compatible backtracking implementations. |
---|
8 | Taking advantage of the SIMD features available on commodity |
---|
9 | processors, its implementation in grep offers consistently |
---|
10 | good performance in contrast to available alternatives. |
---|
11 | For moderately complex expressions, 10X or better |
---|
12 | performance advantages over DFA-based gre2p and 5X performance |
---|
13 | advantage over nrgrep were frequently seen. |
---|
14 | While lacking some |
---|
15 | special optimizations found in other engines to deal with |
---|
16 | repeated substrings or to perform skipping actions based |
---|
17 | on fixed substrings, it nevertheless performs competitively |
---|
18 | in all cases. |
---|
19 | |
---|
20 | A model for parallelized long-stream addition has also been presented |
---|
21 | in the paper, allowing our techniques to scale beyond |
---|
22 | the blocks of 128 bytes we use with the SSE2 implementation. |
---|
23 | This model allowed straightforward extension to the 256-byte |
---|
24 | block size used in our AVX2 implementation and should |
---|
25 | continue to scale well up for SIMD vectors up to 4096 bytes |
---|
26 | in length based on 64-bit additions. The model also |
---|
27 | supports GPU implementation with some additional |
---|
28 | effort. |
---|
29 | |
---|
30 | \paragraph*{Related Work} Much of the previous work in parallelizing of regular processing |
---|
31 | has dealt with the problem of using parallel resources to handle |
---|
32 | multiple instances of a matching problem in parallel. It is thus |
---|
33 | complementary to our approach which focuses on parallelization |
---|
34 | to accelerate the matching of a single instance. From this |
---|
35 | perspective, the recent work of Mytkowicz et al \cite{DBLP:conf/asplos/MytkowiczMS14} stands |
---|
36 | out as an important comparator in that it also focusses |
---|
37 | on acceleration of matching for a single input stream. |
---|
38 | Mytkowicz use the SIMD byte-shuffle capabilities found, for example, |
---|
39 | in the SSSE3 instruction sets to perform small-table parallel lookups for |
---|
40 | multiple potentially active states of a FSM. Data parallelism is |
---|
41 | achieved by initially considering all possible states at the |
---|
42 | beginning of each data segment, but then relying on convergence |
---|
43 | and range-coalescing optimizations to quickly reduce the number of active states in |
---|
44 | play. Examining a large collection of regular expressions used in |
---|
45 | practice, these techniques were found to be effective, allowing |
---|
46 | matching to proceed with just one or two shuffles per input |
---|
47 | symbol. |
---|
48 | |
---|
49 | However, the Mytkowicz approach is still fundamentally considering |
---|
50 | input elements one byte at a time and would be hard pressed to |
---|
51 | compete with our reported results of 1-4 CPU cycles per input byte. |
---|
52 | It is also dependent on the availability of the SIMD byte-shuffle |
---|
53 | operation, which is unavailable in SIMD instructions sets such as |
---|
54 | SSE2 and ARM Neon, for example. |
---|
55 | Our SIMD implementation relies only on the availability of SIMD pack |
---|
56 | operations to efficiently implement the Parabix transform; |
---|
57 | SIMD pack is widely |
---|
58 | available in current SIMD instruction sets. It is also a special |
---|
59 | case of the more general shuffle operations and hence available |
---|
60 | on any processor that supports byte shuffle. The Parabix |
---|
61 | approach also has the further advantage that performance scales |
---|
62 | with increasing SIMD instruction width, as illustrated |
---|
63 | by our AVX2 performance results in comparison to SSE2. |
---|
64 | |
---|
65 | It is perhaps surprising that the classic nrgrep application is still |
---|
66 | competitive in performance for expressions that allow the BNDM |
---|
67 | algorithm to perform significant character skipping. |
---|
68 | Although the length of possible skipping reduces with the |
---|
69 | complexity of the input expression considered, many applications |
---|
70 | of grep searching tend to use simple expressions in practice. |
---|
71 | Nevertheless, the Parabix approach offers consistently high |
---|
72 | performance often faster than nrgrep by a factor |
---|
73 | of 5X or more. |
---|
74 | |
---|
75 | \paragraph*{Ongoing and Future Work} Based on the techniques presented |
---|
76 | here a fully integrated |
---|
77 | grep version with a dynamic code generator implemented with LLVM |
---|
78 | is being developed by another team working with the Parabix |
---|
79 | technology (Dale Denis, Nick Sumner and Rob Cameron). |
---|
80 | An initial version is available at {\tt |
---|
81 | http://parabix.costar.sfu.ca/icGREP}. |
---|
82 | With icgrep-0.8, total compile-time overhead to translate our |
---|
83 | test expressions into executable x86 code ranges from 0.002 |
---|
84 | seconds to 0.008 seconds for our test cases. |
---|
85 | Although this represents a tolerable overhead |
---|
86 | of 0.64 cycles/byte for our 40 MB test file, |
---|
87 | we expect that a substantial reduction of this |
---|
88 | overhead is feasible. |
---|
89 | |
---|
90 | Further work on the compilation algorithms includes the extending the algorithms |
---|
91 | to use MatchStar in Kleene-* repetitions beyond those |
---|
92 | of single characters (bytes). Each such extension would |
---|
93 | replace while-loop iteration with addition and bitwise logic. |
---|
94 | For example, |
---|
95 | the repetitions of |
---|
96 | variable-length UTF-8 byte sequences |
---|
97 | can be compiled without while loops by taking advantage |
---|
98 | of synchronizing properties of UTF-8. |
---|
99 | |
---|
100 | Future work also includes the development of multicore versions of the |
---|
101 | underlying algorithms to further accelerate performance and to |
---|
102 | handle regular expression matching problems involving larger rule sets than are |
---|
103 | typically encountered in the grep problem. Such implementations |
---|
104 | could have useful application in tokenization and network |
---|
105 | intrusion detection for example. Additional GPU implementation |
---|
106 | work could take advantage of specialized instructions available |
---|
107 | on particular platforms but not generally avaiable through OpenCL. |
---|
108 | For both multicore and GPU implementations, |
---|
109 | data-parallel division of input streams could benefit from |
---|
110 | techniques such as the principled speculation of Zhao et al |
---|
111 | \cite{DBLP:conf/asplos/ZhaoWS14}, |
---|
112 | for example. |
---|
113 | |
---|
114 | Other area of interest include extending the capabilities of |
---|
115 | the underlying method with addition features for substring |
---|
116 | capture, zero-width assertions and possibly |
---|
117 | backreference matching. Adding Unicode support beyond |
---|
118 | basic Unicode character handling to include full Unicode |
---|
119 | character class support and normalization forms is also |
---|
120 | worth investigating. |
---|
121 | |
---|