Changeset 4551
- Timestamp:
- May 14, 2015, 7:26:18 AM (4 years ago)
- Location:
- docs/Working/icGrep
- Files:
-
- 2 edited
Legend:
- Unmodified
- Added
- Removed
-
docs/Working/icGrep/introduction.tex
r4550 r4551 23 23 a compressed active state array as well as to handling matching of 24 24 multiple packet instances. 25 These works have not generally tackled Unicode matching problems.26 25 27 26 Using parallel methods to accelerate matching of a single pattern on a … … 35 34 use dynamic convergence and range coalescing to dramatically reduce the number 36 35 of states in play at any point, while Zhao et al. \cite{zhao2014challenging} address the problem using 37 principled speculation. Introducing a second line of research, 38 Cameron et al.~\cite{cameron2014bitwise} investigate new regular expression 39 algorithms based on the concept of bitwise data parallelism, together with 40 implementations that exploit short vector SIMD instructions to accelerate 41 regular expression matching even within a single core. 36 principled speculation. The second line of research is based on Parabix 37 technology: the use of short vector SIMD instructions (e.g., Intel SSE or AVX instructions) 38 to process bit streams in one-to-one correspondence with input character streams. 39 Following this approach, Cameron et al.~\cite{cameron2014bitwise} have recently 40 prototyped a new bitwise data parallel algorithm that shows significant 41 acceleration of regular expression matching performance compared to sequential 42 implementations even on a single core. 43 44 These previous works in applying parallel methods to regular expression matching 45 have generally focused on traditional ASCII-based matching problems. 46 However, documents and data formats increasingly use Unicode, particularly 47 in the form of UTF-8. W3Techs.com reports that the percentage of 48 websites reporting UTF-8 as the character encoding has reached 84\% as of May 2015. 49 But regular expression matching over UTF-8 introduces complexities 50 of variable-length encodings for characters. To process UTF-8 documents 51 directly, Unicode regular expressions must generally undergo considerable 52 state expansion to equivalent regular expressions over byte sequences. 53 Alternatively, if the price is paid to first transcode a UTF-8 document to UTF-16, 54 the problem of very large transition tables arises. 42 55 43 56 In this paper, we report on the use of the implementation of a full 44 U nicoderegular expression search tool, building on the bitwise57 UTF-8 regular expression search tool, building on the bitwise 45 58 data parallel methods of the Parabix framework combined with the 46 59 dynamic compilation capabilities of LLVM~\cite{lattner2004llvm}. 47 60 The result is \icGrep{}, 48 61 a high-performance, full-featured open-source grep implementation 49 with systematic support for Unicode regular expressions. As an alternative 62 with systematic support for Unicode regular expressions, fully implementing 63 Unicode level 1 requirements of Unicode Technical Standard \#18\cite{davis2012unicode}. 64 As an alternative 50 65 to classical grep implementations, \icGrep{} offers dramatic performance 51 66 acceleration in Unicode regular expression matching. 67 68 The main contributions reported here are the extension of the bitwise data 69 parallel methods for regular expression matching to handle UTF-8 natively, 70 validating the bitwise data parallel method with the first complete 71 implementation in a useful tool, and introducing \icGrep{} itself as 72 research artifact. From the latter perspective, \icGrep{} is significant 73 in several ways. First, it allows the performance results reported here to 74 be independently replicated. Secondly, it is a framework for further research 75 into regular expression matching using bitwise methods and is indeed being used 76 to investigate Unicode level 2 requirements in a project funded by Google. 77 Thirdly, it it fosters further research 78 into bitwise data parallel algorithms generally and how they may take advantage 79 of evolving architectural features. Finally, \icGrep{} has also been 80 designed as a teaching tool with many command line options to control 81 algorithm features and print out internal representations of regular expressions and 82 Parabix code. 52 83 53 84 The remainder of this paper is organized as follows. Section \ref{sec:background}
Note: See TracChangeset
for help on using the changeset viewer.