source: docs/Working/icGrep/introduction.tex @ 4468

Last change on this file since 4468 was 4452, checked in by cameron, 5 years ago

Intro

File size: 4.1 KB
Line 
1\section{Introduction}
2
3The venerable Unix {\tt grep} program is an everyday tool widely used
4to search for lines in text files matching a given regular expression
5pattern.   Historical comments ...
6
7Unicode regular expression matching adds performance challenges...
8
9Efforts to improve the performance of regular expression matching through
10parallelization have generally concentrated on the use of SIMD, multicore
11or GPU technologies to accelerate multiple instances of independent
12matching problems. 
13Scarpazza \cite{scarpazza2011top} used SIMD and multicore parallelism
14to accelerate small ruleset tokenization applications on the Cell
15Broadband Engine while Valaspura \cite{salapura2012accelerating}
16built on these techniques to accelerate business analytics applications
17using SSE instructions on commodity processors.   
18Zu et al \cite{zu2012gpu} use GPU technology to implement NFA-based regular
19expression matching with parallelism devoted both to processing
20a compressed active state array as well as to handling matching of
21multiple packet instances.
22These works have not generally tackled Unicode matching problems.
23
24Using parallel methods to accelerate matching of a single pattern on a
25single input stream is more difficult.  Indeed, of the 13 dwarves identified in the Berkeley overview of parallel computing research,
26finite state machines (FSMs) are considered the hardest to parallelize (embarassingly sequential) \cite{asanovic2006landscape}.
27However, some success has been reported recently along two independent lines of
28research.   Mytkowicz et al \cite{mytkowicz2014data} use SIMD shuffle operations
29to implement composable DFA transitions using dynamic convergence to
30reduce the number of states in play at any one time and range coalescing
31to compact the transition tables.   Unfortunately, the method seems unlikely
32to apply well to Unicode regular expression matching problems, which routinely
33require thousands of DFA states for named Unicode properties.
34Building on the Parabix framework, Cameron at al \cite{cameron2014bitwise} introduce
35regular expression matching using the bitwise
36data parallel approach together with the MatchStar primitive
37for efficient implementation of
38Kleene-* character-class repetitions.
39
40In this paper, we report on the use of the implementation of a full
41Unicode regular expression search tool, building on the bitwise
42data parallel methods of the Parabix framework combined with the
43dynamic compilation capabilities of LLVM.   The result is ICgrep,
44a high-performance, full-featured open-source grep implementation
45with systematic support for Unicode regular expressions addressing the
46requirements of Unicode Technical Standard \#18 \cite{davis2012unicode}.  As an alternative
47to classical grep implementations, ICgrep offers dramatic performance
48acceleration in ASCII-based and Unicode matching performance alike.
49
50The remainder of this paper is organized as follows.   Section \ref{sec:background}
51presents background material dealing with Unicode regular expressions,
52LLVM, the Parabix framework and regular expression matching techniques
53using bitwise data parallelism.   
54Section \ref{sec:seqpar} expands on previous work on bitwise data
55parallelism by more fully characterizing the paradigm and documenting
56important techniques.
57Section \ref{sec:Unicode} addresses
58the issues and performance challenges associated with meeting Unicode
59regular expression requirements and presents the extensions to the
60Parabix techniques that we have developed to address them. 
61Section \ref{sec:architecture} describes the overall architecture of
62the ICgrep implementation with a focus on the integration of Parabix
63and LLVM technologies.
64Section \ref{sec:evaluation} evaluates the performance of ICgrep on
65several types of matching problems with contemporary competitors,
66including the latest versions of GNU grep, pcregrep, ugrep of the
67ICU (International Component for Unicode) and re2grep.
68Section \ref{sec:conclusion} concludes the paper with remarks on
69developing the Parabix+LLVM framework for other applications as well
70as identifying further research questions in Unicode regular expression
71matching with bitwise data parallelism.
72
Note: See TracBrowser for help on using the repository browser.