source: docs/Working/icGrep/introduction.tex @ 4505

Last change on this file since 4505 was 4504, checked in by nmedfort, 4 years ago

Small fix for unicode matching example along with formatting changes

File size: 4.1 KB
Line 
1\section{Introduction}
2
3
4Although well-established technical standards exist for
5Unicode regular expressions \cite{davis2012unicode}, most of today's
6regular expression processing toolsets fail to support the full set of processing
7features even at the most basic level \cite{stewart2013unicode}
8One of the fundamental issues is performance and so it makes good sense
9to consider the ways in which parallel processing approaches can help
10address the gap.
11
12Efforts to improve the performance of regular expression matching through
13parallelization have generally concentrated on the use of SIMD, multicore
14or GPU technologies to accelerate multiple instances of independent
15matching problems. 
16Scarpazza~\cite{scarpazza2011top} used SIMD and multicore parallelism
17to accelerate small ruleset tokenization applications on the Cell
18Broadband Engine. Valaspura~\cite{salapura2012accelerating}
19built on these techniques to accelerate business analytics applications
20using SSE instructions on commodity processors.   
21Zu et al~\cite{zu2012gpu} use GPU technology to implement NFA-based regular
22expression matching with parallelism devoted both to processing
23a compressed active state array as well as to handling matching of
24multiple packet instances.
25These works have not generally tackled Unicode matching problems.
26
27Using parallel methods to accelerate matching of a single pattern on a
28single input stream is more difficult.  Indeed, of the 13 dwarves identified in the Berkeley overview of parallel computing research,
29finite state machines (FSMs) are considered the hardest to parallelize (embarrassingly sequential) \cite{asanovic2006landscape}.
30However, some success has been reported recently along two independent lines of
31research.   Mytkowicz et al \cite{mytkowicz2014data} use SIMD shuffle operations
32to implement composable DFA transitions using dynamic convergence to
33reduce the number of states in play at any one time and range coalescing
34to compact the transition tables.   Unfortunately, the method seems unlikely
35to apply well to Unicode regular expression matching problems, which routinely
36require thousands of DFA states for named Unicode properties.
37Building on the Parabix framework, Cameron et al.~\cite{cameron2014bitwise} introduce
38regular expression matching using a new bitwise
39data parallel approach.
40
41In this paper, we report on the use of the implementation of a full
42Unicode regular expression search tool, building on the bitwise
43data parallel methods of the Parabix framework combined with the
44dynamic compilation capabilities of LLVM~\cite{lattner2004llvm}.
45The result is \icGrep{},
46a high-performance, full-featured open-source grep implementation
47with systematic support for Unicode regular expressions.  As an alternative
48to classical grep implementations, \icGrep{} offers dramatic performance
49acceleration in Unicode regular expression matching.
50
51The remainder of this paper is organized as follows.   Section \ref{sec:background}
52presents background material dealing with Unicode regular expressions,
53%LLVM,
54the Parabix framework and regular expression matching techniques
55using bitwise data parallelism.   
56%Section \ref{sec:seqpar} expands on previous work on bitwise data
57%parallelism by more fully characterizing the paradigm and documenting
58%important techniques.
59Section \ref{sec:Unicode} addresses
60the issues and performance challenges associated with meeting Unicode
61regular expression requirements and presents the extensions to the
62Parabix techniques that we have developed to address them. 
63Section \ref{sec:architecture} describes the overall architecture of
64the \icGrep{} implementation with a focus on the integration of Parabix
65and LLVM technologies.
66Section \ref{sec:evaluation} evaluates the performance of \icGrep{} on
67several types of matching problems with contemporary competitors,
68including the latest versions of GNU grep, pcregrep, ugrep of the
69ICU (International Component for Unicode) and re2grep.
70Section \ref{sec:conclusion} concludes the paper with remarks on
71developing the Parabix+LLVM framework for other applications as well
72as identifying further research questions in Unicode regular expression
73matching with bitwise data parallelism.
74
Note: See TracBrowser for help on using the repository browser.