source: docs/Working/re/conclusion.tex @ 3883

Last change on this file since 3883 was 3883, checked in by cameron, 5 years ago

Reference cleanups; use GPU instead of GPGPU

File size: 5.7 KB
2\paragraph*{Contributions} A new class of regular expression matching algorithm has been
3introduced based on the concept of bit-parallel data streams
4together with the MatchStar operation.  The algorithm is
5fully general for nondeterministic regular expression matching;
6however it does not address the nonregular extensions
7found in Perl-compatible backtracking implementations.
8Taking advantage of the SIMD features available on commodity
9processors, its implementation in grep offers consistently
10good performance in contrast to available alternatives. 
11For moderately complex expressions, 10X or better
12performance advantages over DFA-based gre2p and 5X performance
13advantage over nrgrep were frequently seen.
14While lacking some
15special optimizations found in other engines to deal with
16repeated substrings or to perform skipping actions based
17on fixed substrings, it nevertheless performs competitively
18in all cases. 
20A model for parallelized long-stream addition has also been presented
21in the paper, allowing our techniques to scale beyond
22the blocks of 128 bytes we use with the SSE2 implementation.
23This model allowed straightforward extension to the 256-byte
24block size used in our AVX2 implementation and should
25continue to scale well up for SIMD vectors up to 4096 bytes
26in length based on 64-bit additions.    The model also
27supports GPU implementation with some additional
30\paragraph*{Related Work} Much of the previous work in parallelizing of regular processing
31has dealt with the problem of using parallel resources to handle
32multiple instances of a matching problem in parallel.   It is thus
33complementary to our approach which focuses on parallelization
34to accelerate the matching of a single instance.    From this
35perspective, the recent work of Mytkowicz et al \cite{DBLP:conf/asplos/MytkowiczMS14} stands
36out as an important comparator in that it also focusses
37on acceleration of matching for a single input stream.
38Mytkowicz use the SIMD byte-shuffle capabilities found, for example,
39in the SSE3 instruction sets to perform small-table parallel lookups for
40multiple potentially active states of a FSM.    Data parallelism is
41achieved by initially considering all possible states at the
42beginning of each data segment, but then relying on convergence
43and range-coalescing optimizations to quickly reduce the number of active states in
44play.    Examining a large collection of regular expressions used in
45practice, these techniques were found to be effective, allowing
46matching to proceed with just one or two shuffles per input
49However, the Mytkowicz approach is still fundamentally considering
50input elements one byte at a time and would be hard pressed to
51compete with our reported results of 1-3 CPU cycles per input byte.
52It is also dependent on the availability of the SIMD byte-shuffle
53operation, which is unavailable in SIMD instructions sets such as
54SSE2 and ARM Neon, for example.
55Our SIMD implementation relies only on the availability of SIMD pack
56operations to efficiently implement the Parabix transform;
57SIMD pack is widely
58available in current SIMD instruction sets.   It is also a special
59case of the more general shuffle operations and hence available
60on any processor that supports byte shuffle.   The Parabix
61approach also has the further advantage that performance scales
62with increasing SIMD instruction width, as illustrated
63by our AVX2 performance results in comparison to SSE2.
65It is perhaps surprising that the classic nrgrep application is still
66competitive in performance for expressions that allow the BNDM
67algorithm to perform significant character skipping.     
68Although the length of possible skipping reduces with the
69complexity of the input expression considered, many applications
70of grep searching tend to use simple expressions in practice.
71Nevertheless, the Parabix approach offers consistently high
72performance often faster than nrgrep by a factor
73of 5X or more.
75\paragraph*{Ongoing and Future Work} Based on the techniques presented here a fully integrated
76grep version with a dynamic code generator implemented with LLVM
77is being developed by another team working with the Parabix
78technology (Dale Denis, Nick Sumner and Rob Cameron).
79An initial version is available at
82Further work includes the extending the algorithms
83to use MatchStar in Kleene-* repetitions beyond those
84of single characters (bytes).   Each such extension would
85replace while-loop iteration with addition and bitwise logic.
86The UTF-8 example shows this is possible for repetitions of
87variable-length byte sequences having particular synchronizing
88properties, for example.
90Future work also includes the development of multicore versions of the
91underlying algorithms to further accelerate performance and to
92handle regular expression matching problems involving larger rule sets than are
93typically encountered in the grep problem.  Such implementations
94could have useful application in tokenization and network
95intrusion detection for example.   Additional GPU implementation
96work could take advantage of specialized instructions available
97on particular platforms but not generally avaiable through OpenCL.
98For both multicore and GPU implementations,
99data-parallel division of input streams could benefit from
100techniques such as the principled speculation of Zhao et al
102for example.
104Other area of interest include extending the capabilities of
105the underlying method with addition features for substring
106capture, zero-width assertions and possibly
107backreference matching.  Adding Unicode support beyond
108basic Unicode character handling to include full Unicode
109character class support and normalization forms is also
110worth investigating.
Note: See TracBrowser for help on using the repository browser.