source: docs/Working/icGrep/introduction.tex @ 4489

Last change on this file since 4489 was 4489, checked in by nmedfort, 4 years ago

Minor revision to toUTF8 Transformation

File size: 4.1 KB
3Unix {\tt grep} is a tool widely used
4to search for lines in text files matching a given regular expression
5pattern.   Historical comments ...
7Unicode regular expression matching adds performance challenges...
9Efforts to improve the performance of regular expression matching through
10parallelization have generally concentrated on the use of SIMD, multicore
11or GPU technologies to accelerate multiple instances of independent
12matching problems. 
13Scarpazza~\cite{scarpazza2011top} used SIMD and multicore parallelism
14to accelerate small ruleset tokenization applications on the Cell
15Broadband Engine. Valaspura~\cite{salapura2012accelerating}
16built on these techniques to accelerate business analytics applications
17using SSE instructions on commodity processors.   
18Zu et al~\cite{zu2012gpu} use GPU technology to implement NFA-based regular
19expression matching with parallelism devoted both to processing
20a compressed active state array as well as to handling matching of
21multiple packet instances.
22These works have not generally tackled Unicode matching problems.
24Using parallel methods to accelerate matching of a single pattern on a
25single input stream is more difficult.  Indeed, of the 13 dwarves identified in the Berkeley overview of parallel computing research,
26finite state machines (FSMs) are considered the hardest to parallelize (embarassingly sequential) \cite{asanovic2006landscape}.
27However, some success has been reported recently along two independent lines of
28research.   Mytkowicz et al \cite{mytkowicz2014data} use SIMD shuffle operations
29to implement composable DFA transitions using dynamic convergence to
30reduce the number of states in play at any one time and range coalescing
31to compact the transition tables.   Unfortunately, the method seems unlikely
32to apply well to Unicode regular expression matching problems, which routinely
33require thousands of DFA states for named Unicode properties.
34Building on the Parabix framework, Cameron et al.~\cite{cameron2014bitwise} introduce
35regular expression matching using the bitwise
36data parallel approach together with the MatchStar primitive
37for efficient implementation of
38Kleene-* character-class repetitions.
40In this paper, we report on the use of the implementation of a full
41Unicode regular expression search tool, building on the bitwise
42data parallel methods of the Parabix framework combined with the
43dynamic compilation capabilities of LLVM~\cite{lattner2004llvm}.
44The result is \icGrep{},
45a high-performance, full-featured open-source grep implementation
46with systematic support for Unicode regular expressions addressing the
47requirements of Unicode Technical Standard \#18 \cite{davis2012unicode}.  As an alternative
48to classical grep implementations, \icGrep{} offers dramatic performance
49acceleration in ASCII-based and Unicode matching performance alike.
51The remainder of this paper is organized as follows.   Section \ref{sec:background}
52presents background material dealing with Unicode regular expressions,
54the Parabix framework and regular expression matching techniques
55using bitwise data parallelism.   
56Section \ref{sec:seqpar} expands on previous work on bitwise data
57parallelism by more fully characterizing the paradigm and documenting
58important techniques.
59Section \ref{sec:Unicode} addresses
60the issues and performance challenges associated with meeting Unicode
61regular expression requirements and presents the extensions to the
62Parabix techniques that we have developed to address them. 
63Section \ref{sec:architecture} describes the overall architecture of
64the \icGrep{} implementation with a focus on the integration of Parabix
65and LLVM technologies.
66Section \ref{sec:evaluation} evaluates the performance of \icGrep{} on
67several types of matching problems with contemporary competitors,
68including the latest versions of GNU grep, pcregrep, ugrep of the
69ICU (International Component for Unicode) and re2grep.
70Section \ref{sec:conclusion} concludes the paper with remarks on
71developing the Parabix+LLVM framework for other applications as well
72as identifying further research questions in Unicode regular expression
73matching with bitwise data parallelism.
Note: See TracBrowser for help on using the repository browser.