source: docs/ICA3PP2015/introduction.tex @ 5789

Last change on this file since 5789 was 4786, checked in by cameron, 4 years ago

More formatting for LNCS requirements

File size: 6.0 KB
Line 
1\section{Introduction}
2
3Although well-established technical standards exist for
4Unicode regular expressions \cite{davis2012unicode}, most of today's
5regular expression processing toolsets fail to support the full set of processing
6features even at the most basic level \cite{stewart2013unicode}
7One of the fundamental issues is performance and so it makes good sense
8to consider the ways in which parallel processing approaches can help
9address the gap.
10
11Efforts to improve the performance of regular expression matching through
12parallelization have generally concentrated on the use of SIMD, multicore
13or GPU technologies to accelerate multiple instances of independent
14matching problems. 
15Scarpazza~\cite{scarpazza2011top} used SIMD and multicore parallelism
16to accelerate small ruleset tokenization applications on the Cell
17Broadband Engine. Salapura~\cite{salapura2012accelerating}
18built on these techniques to accelerate business analytics applications
19using SSE instructions on commodity processors.   
20Zu et al.~\cite{zu2012gpu} use GPU technology to implement NFA-based regular
21expression matching with parallelism devoted both to processing
22a compressed active state array as well as to handling matching of
23multiple packet instances.
24
25Using parallel methods to accelerate matching of a single pattern on a
26single input stream is more difficult.  Indeed, of the 13 dwarfs identified
27in the Berkeley overview of parallel computing research,
28finite state machines (FSMs) are considered the hardest to parallelize (embarrassingly sequential) \cite{asanovic2006landscape}.
29However, some success has been reported recently along two independent lines of
30research.   The first focuses on data-parallel division of an input stream into separately
31processed segments on multiple cores.   In this case, the challenge is
32to identify the appropriate starting state or states that need to be
33considered for all but the first segment.   Mytkowicz et al. \cite{mytkowicz2014data} 
34use dynamic convergence and range coalescing to dramatically reduce the number
35of states in play at any point, while Zhao et al. \cite{zhao2014challenging} address the problem using
36principled speculation.   The second line of research is based on Parabix
37technology: the use of short vector SIMD instructions (e.g., Intel SSE or AVX instructions)
38to process bit streams in one-to-one correspondence with input character streams.
39Following this approach, Cameron et al.~\cite{cameron2014bitwise} have recently
40prototyped a new bitwise data parallel algorithm that shows significant
41acceleration of regular expression matching performance compared to sequential
42implementations even on a single core.
43
44These previous works in applying parallel methods to regular expression matching
45have generally focused on traditional ASCII-based matching problems. 
46However, documents and data formats increasingly use Unicode, particularly
47in the form of UTF-8.   W3Techs.com reports that the percentage of
48websites reporting UTF-8 as the character encoding has reached 84\% as of May 2015.
49But regular expression matching over UTF-8 introduces complexities
50of variable-length encodings for characters.   To process UTF-8 documents
51directly, Unicode regular expressions must generally undergo considerable
52state expansion to equivalent regular expressions over byte sequences.
53Alternatively, if the price is paid to first transcode a UTF-8 document to UTF-16,
54the problem of very large transition tables arises.
55
56In this paper, we report on the use of the implementation of a full
57UTF-8 regular expression search tool, building on the bitwise
58data parallel methods of the Parabix framework combined with the
59dynamic compilation capabilities of LLVM~\cite{lattner2004llvm}.
60The result is \icGrep{},
61a high-performance, full-featured open-source grep implementation
62with systematic support for Unicode regular expressions, fully implementing
63Unicode level 1 requirements of Unicode Technical Standard \#18\cite{davis2012unicode}.
64As an alternative
65to classical grep implementations, \icGrep{} offers dramatic performance
66acceleration in Unicode regular expression matching.
67
68The main contributions reported here are the extension of the bitwise data
69parallel methods for regular expression matching to handle UTF-8 natively,
70validating the bitwise data parallel method with the first complete
71implementation in a useful tool, and introducing \icGrep{} itself as
72research artifact.   From the latter perspective, \icGrep{} is significant
73in several ways.   First, it allows the performance results reported here to
74be independently replicated.  Secondly, it is a platform for further research
75into regular expression matching using bitwise methods and is indeed being used
76to investigate Unicode level 2 requirements in a project funded by Google.
77Finally, it is working example demonstrating the Parabix+LLVM framework,
78and useful for guiding further research
79into bitwise data parallel algorithms generally and how they may take advantage
80of evolving architectural features.   
81
82The remainder of this paper is organized as follows.   Section \ref{sec:background}
83presents background material dealing with Unicode regular expressions,
84the Parabix framework and regular expression matching techniques
85using bitwise data parallelism.   
86Section \ref{sec:Unicode} addresses
87the issues and performance challenges associated with meeting Unicode
88regular expression requirements and presents extensions to the
89Parabix techniques that we have developed to address them. 
90Section \ref{sec:architecture} describes the overall architecture of
91the \icGrep{} implementation with a focus on the integration of Parabix
92and LLVM technologies.
93Section \ref{sec:evaluation} evaluates the performance of \icGrep{} on
94several types of matching problems with two contemporary competitors,
95 pcre2grep and ugrep of the
96ICU (International Component for Unicode) software library.
97Section \ref{sec:conclusion} concludes the paper with remarks on
98developing the Parabix+LLVM framework for other applications as well
99as identifying further research questions in Unicode regular expression
100matching with bitwise data parallelism.
101
Note: See TracBrowser for help on using the repository browser.