source: docs/Working/icGrep/introduction.tex @ 4782

Last change on this file since 4782 was 4782, checked in by cameron, 4 years ago

Figure placement according to Springer

File size: 6.2 KB
Line 
1\section{Introduction}
2
3Although well-established technical standards exist for
4Unicode regular expressions \cite{davis2012unicode}, most of today's
5regular expression processing toolsets fail to support the full set of processing
6features even at the most basic level \cite{stewart2013unicode}
7One of the fundamental issues is performance and so it makes good sense
8to consider the ways in which parallel processing approaches can help
9address the gap.
10
11Efforts to improve the performance of regular expression matching through
12parallelization have generally concentrated on the use of SIMD, multicore
13or GPU technologies to accelerate multiple instances of independent
14matching problems. 
15Scarpazza~\cite{scarpazza2011top} used SIMD and multicore parallelism
16to accelerate small ruleset tokenization applications on the Cell
17Broadband Engine. Salapura~\cite{salapura2012accelerating}
18built on these techniques to accelerate business analytics applications
19using SSE instructions on commodity processors.   
20Zu et al.~\cite{zu2012gpu} use GPU technology to implement NFA-based regular
21expression matching with parallelism devoted both to processing
22a compressed active state array as well as to handling matching of
23multiple packet instances.
24
25Using parallel methods to accelerate matching of a single pattern on a
26single input stream is more difficult.  Indeed, of the 13 dwarfs identified
27in the Berkeley overview of parallel computing research,
28finite state machines (FSMs) are considered the hardest to parallelize (embarrassingly sequential) \cite{asanovic2006landscape}.
29However, some success has been reported recently along two independent lines of
30research.   The first focuses on data-parallel division of an input stream into separately
31processed segments on multiple cores.   In this case, the challenge is
32to identify the appropriate starting state or states that need to be
33considered for all but the first segment.   Mytkowicz et al. \cite{mytkowicz2014data} 
34use dynamic convergence and range coalescing to dramatically reduce the number
35of states in play at any point, while Zhao et al. \cite{zhao2014challenging} address the problem using
36principled speculation.   The second line of research is based on Parabix
37technology: the use of short vector SIMD instructions (e.g., Intel SSE or AVX instructions)
38to process bit streams in one-to-one correspondence with input character streams.
39Following this approach, Cameron et al.~\cite{cameron2014bitwise} have recently
40prototyped a new bitwise data parallel algorithm that shows significant
41acceleration of regular expression matching performance compared to sequential
42implementations even on a single core.
43
44These previous works in applying parallel methods to regular expression matching
45have generally focused on traditional ASCII-based matching problems. 
46However, documents and data formats increasingly use Unicode, particularly
47in the form of UTF-8.   W3Techs.com reports that the percentage of
48websites reporting UTF-8 as the character encoding has reached 84\% as of May 2015.
49But regular expression matching over UTF-8 introduces complexities
50of variable-length encodings for characters.   To process UTF-8 documents
51directly, Unicode regular expressions must generally undergo considerable
52state expansion to equivalent regular expressions over byte sequences.
53Alternatively, if the price is paid to first transcode a UTF-8 document to UTF-16,
54the problem of very large transition tables arises.
55
56In this paper, we report on the use of the implementation of a full
57UTF-8 regular expression search tool, building on the bitwise
58data parallel methods of the Parabix framework combined with the
59dynamic compilation capabilities of LLVM~\cite{lattner2004llvm}.
60The result is \icGrep{},
61a high-performance, full-featured open-source grep implementation
62with systematic support for Unicode regular expressions, fully implementing
63Unicode level 1 requirements of Unicode Technical Standard \#18\cite{davis2012unicode}.
64As an alternative
65to classical grep implementations, \icGrep{} offers dramatic performance
66acceleration in Unicode regular expression matching.
67
68The main contributions reported here are the extension of the bitwise data
69parallel methods for regular expression matching to handle UTF-8 natively,
70validating the bitwise data parallel method with the first complete
71implementation in a useful tool, and introducing \icGrep{} itself as
72research artifact.   From the latter perspective, \icGrep{} is significant
73in several ways.   First, it allows the performance results reported here to
74be independently replicated.  Secondly, it is a framework for further research
75into regular expression matching using bitwise methods and is indeed being used
76to investigate Unicode level 2 requirements in a project funded by Google.
77Thirdly, it fosters further research
78into bitwise data parallel algorithms generally and how they may take advantage
79of evolving architectural features.   Finally, \icGrep{} has also been
80designed as a teaching tool with many command line options to control
81algorithm features and display internal representations of regular expressions and
82Parabix code.   
83
84The remainder of this paper is organized as follows.   Section \ref{sec:background}
85presents background material dealing with Unicode regular expressions,
86the Parabix framework and regular expression matching techniques
87using bitwise data parallelism.   
88Section \ref{sec:Unicode} addresses
89the issues and performance challenges associated with meeting Unicode
90regular expression requirements and presents extensions to the
91Parabix techniques that we have developed to address them. 
92Section \ref{sec:architecture} describes the overall architecture of
93the \icGrep{} implementation with a focus on the integration of Parabix
94and LLVM technologies.
95Section \ref{sec:evaluation} evaluates the performance of \icGrep{} on
96several types of matching problems with two contemporary competitors,
97 pcre2grep and ugrep of the
98ICU (International Component for Unicode) software library.
99Section \ref{sec:conclusion} concludes the paper with remarks on
100developing the Parabix+LLVM framework for other applications as well
101as identifying further research questions in Unicode regular expression
102matching with bitwise data parallelism.
103
Note: See TracBrowser for help on using the repository browser.