source: docs/Working/icGrep/introduction.tex @ 4565

Last change on this file since 4565 was 4565, checked in by cameron, 4 years ago

Final changes pre-submission

File size: 6.3 KB
Line 
1\section{Introduction}
2
3
4Although well-established technical standards exist for
5Unicode regular expressions \cite{davis2012unicode}, most of today's
6regular expression processing toolsets fail to support the full set of processing
7features even at the most basic level \cite{stewart2013unicode}
8One of the fundamental issues is performance and so it makes good sense
9to consider the ways in which parallel processing approaches can help
10address the gap.
11
12Efforts to improve the performance of regular expression matching through
13parallelization have generally concentrated on the use of SIMD, multicore
14or GPU technologies to accelerate multiple instances of independent
15matching problems. 
16Scarpazza~\cite{scarpazza2011top} used SIMD and multicore parallelism
17to accelerate small ruleset tokenization applications on the Cell
18Broadband Engine. Valaspura~\cite{salapura2012accelerating}
19built on these techniques to accelerate business analytics applications
20using SSE instructions on commodity processors.   
21Zu et al.~\cite{zu2012gpu} use GPU technology to implement NFA-based regular
22expression matching with parallelism devoted both to processing
23a compressed active state array as well as to handling matching of
24multiple packet instances.
25
26Using parallel methods to accelerate matching of a single pattern on a
27single input stream is more difficult.  Indeed, of the 13 dwarves identified in the Berkeley overview of parallel computing research,
28finite state machines (FSMs) are considered the hardest to parallelize (embarrassingly sequential) \cite{asanovic2006landscape}.
29However, some success has been reported recently along two independent lines of
30research.   The first focuses on data-parallel division of an input stream into separately
31processed segments on multiple cores.   In this case, the challenge is
32to identify the appropriate starting state or states that need to be
33considered for all but the first segment.   Mytkowicz et al. \cite{mytkowicz2014data} 
34use dynamic convergence and range coalescing to dramatically reduce the number
35of states in play at any point, while Zhao et al. \cite{zhao2014challenging} address the problem using
36principled speculation.   The second line of research is based on Parabix
37technology: the use of short vector SIMD instructions (e.g., Intel SSE or AVX instructions)
38to process bit streams in one-to-one correspondence with input character streams.
39Following this approach, Cameron et al.~\cite{cameron2014bitwise} have recently
40prototyped a new bitwise data parallel algorithm that shows significant
41acceleration of regular expression matching performance compared to sequential
42implementations even on a single core.
43
44These previous works in applying parallel methods to regular expression matching
45have generally focused on traditional ASCII-based matching problems. 
46However, documents and data formats increasingly use Unicode, particularly
47in the form of UTF-8.   W3Techs.com reports that the percentage of
48websites reporting UTF-8 as the character encoding has reached 84\% as of May 2015.
49But regular expression matching over UTF-8 introduces complexities
50of variable-length encodings for characters.   To process UTF-8 documents
51directly, Unicode regular expressions must generally undergo considerable
52state expansion to equivalent regular expressions over byte sequences.
53Alternatively, if the price is paid to first transcode a UTF-8 document to UTF-16,
54the problem of very large transition tables arises.
55
56In this paper, we report on the use of the implementation of a full
57UTF-8 regular expression search tool, building on the bitwise
58data parallel methods of the Parabix framework combined with the
59dynamic compilation capabilities of LLVM~\cite{lattner2004llvm}.
60The result is \icGrep{},
61a high-performance, full-featured open-source grep implementation
62with systematic support for Unicode regular expressions, fully implementing
63Unicode level 1 requirements of Unicode Technical Standard \#18\cite{davis2012unicode}.
64As an alternative
65to classical grep implementations, \icGrep{} offers dramatic performance
66acceleration in Unicode regular expression matching.
67
68The main contributions reported here are the extension of the bitwise data
69parallel methods for regular expression matching to handle UTF-8 natively,
70validating the bitwise data parallel method with the first complete
71implementation in a useful tool, and introducing \icGrep{} itself as
72research artifact.   From the latter perspective, \icGrep{} is significant
73in several ways.   First, it allows the performance results reported here to
74be independently replicated.  Secondly, it is a framework for further research
75into regular expression matching using bitwise methods and is indeed being used
76to investigate Unicode level 2 requirements in a project funded by Google.
77Thirdly, it it fosters further research
78into bitwise data parallel algorithms generally and how they may take advantage
79of evolving architectural features.   Finally, \icGrep{} has also been
80designed as a teaching tool with many command line options to control
81algorithm features and print out internal representations of regular expressions and
82Parabix code.   
83
84The remainder of this paper is organized as follows.   Section \ref{sec:background}
85presents background material dealing with Unicode regular expressions,
86%LLVM,
87the Parabix framework and regular expression matching techniques
88using bitwise data parallelism.   
89%Section \ref{sec:seqpar} expands on previous work on bitwise data
90%parallelism by more fully characterizing the paradigm and documenting
91%important techniques.
92Section \ref{sec:Unicode} addresses
93the issues and performance challenges associated with meeting Unicode
94regular expression requirements and presents the extensions to the
95Parabix techniques that we have developed to address them. 
96Section \ref{sec:architecture} describes the overall architecture of
97the \icGrep{} implementation with a focus on the integration of Parabix
98and LLVM technologies.
99Section \ref{sec:evaluation} evaluates the performance of \icGrep{} on
100several types of matching problems with two contemporary competitors,
101 pcre2grep and ugrep of the
102ICU (International Component for Unicode) software library.
103Section \ref{sec:conclusion} concludes the paper with remarks on
104developing the Parabix+LLVM framework for other applications as well
105as identifying further research questions in Unicode regular expression
106matching with bitwise data parallelism.
107
Note: See TracBrowser for help on using the repository browser.