Changeset 4551

May 14, 2015, 7:26:18 AM (3 years ago)

Strengthen claimed contributions, especially icGrep as a research artifact.

2 edited


  • docs/Working/icGrep/introduction.tex

    r4550 r4551  
    2323a compressed active state array as well as to handling matching of
    2424multiple packet instances.
    25 These works have not generally tackled Unicode matching problems.
    2726Using parallel methods to accelerate matching of a single pattern on a
    3534use dynamic convergence and range coalescing to dramatically reduce the number
    3635of states in play at any point, while Zhao et al. \cite{zhao2014challenging} address the problem using
    37 principled speculation.   Introducing a second line of research,
    38 Cameron et al.~\cite{cameron2014bitwise} investigate new regular expression
    39 algorithms based on the concept of bitwise data parallelism, together with
    40 implementations that exploit short vector SIMD instructions to accelerate
    41 regular expression matching even within a single core.
     36principled speculation.   The second line of research is based on Parabix
     37technology: the use of short vector SIMD instructions (e.g., Intel SSE or AVX instructions)
     38to process bit streams in one-to-one correspondence with input character streams.
     39Following this approach, Cameron et al.~\cite{cameron2014bitwise} have recently
     40prototyped a new bitwise data parallel algorithm that shows significant
     41acceleration of regular expression matching performance compared to sequential
     42implementations even on a single core.
     44These previous works in applying parallel methods to regular expression matching
     45have generally focused on traditional ASCII-based matching problems. 
     46However, documents and data formats increasingly use Unicode, particularly
     47in the form of UTF-8. reports that the percentage of
     48websites reporting UTF-8 as the character encoding has reached 84\% as of May 2015.
     49But regular expression matching over UTF-8 introduces complexities
     50of variable-length encodings for characters.   To process UTF-8 documents
     51directly, Unicode regular expressions must generally undergo considerable
     52state expansion to equivalent regular expressions over byte sequences.
     53Alternatively, if the price is paid to first transcode a UTF-8 document to UTF-16,
     54the problem of very large transition tables arises.
    4356In this paper, we report on the use of the implementation of a full
    44 Unicode regular expression search tool, building on the bitwise
     57UTF-8 regular expression search tool, building on the bitwise
    4558data parallel methods of the Parabix framework combined with the
    4659dynamic compilation capabilities of LLVM~\cite{lattner2004llvm}.
    4760The result is \icGrep{},
    4861a high-performance, full-featured open-source grep implementation
    49 with systematic support for Unicode regular expressions.  As an alternative
     62with systematic support for Unicode regular expressions, fully implementing
     63Unicode level 1 requirements of Unicode Technical Standard \#18\cite{davis2012unicode}.
     64As an alternative
    5065to classical grep implementations, \icGrep{} offers dramatic performance
    5166acceleration in Unicode regular expression matching.
     68The main contributions reported here are the extension of the bitwise data
     69parallel methods for regular expression matching to handle UTF-8 natively,
     70validating the bitwise data parallel method with the first complete
     71implementation in a useful tool, and introducing \icGrep{} itself as
     72research artifact.   From the latter perspective, \icGrep{} is significant
     73in several ways.   First, it allows the performance results reported here to
     74be independently replicated.  Secondly, it is a framework for further research
     75into regular expression matching using bitwise methods and is indeed being used
     76to investigate Unicode level 2 requirements in a project funded by Google.
     77Thirdly, it it fosters further research
     78into bitwise data parallel algorithms generally and how they may take advantage
     79of evolving architectural features.   Finally, \icGrep{} has also been
     80designed as a teaching tool with many command line options to control
     81algorithm features and print out internal representations of regular expressions and
     82Parabix code.   
    5384The remainder of this paper is organized as follows.   Section \ref{sec:background}
Note: See TracChangeset for help on using the changeset viewer.