source: docs/Working/icGrep/introduction.tex @ 4446

Last change on this file since 4446 was 4446, checked in by cameron, 5 years ago

Initial check-in of icGrep working paper

File size: 3.0 KB
2The venerable Unix {\tt grep} program is an everyday tool widely used
3to search for lines in text files matching a given regular expression
4pattern.   Historical comments ...
6Unicode regular expression matching adds performance challenges...
8Efforts to improve the performance of regular expression matching through
9parallelization have generally concentrated on the use of SIMD, multicore
10or GPU technologies to accelerate multiple instances of independent
11matching problems.   Brief review ...
12These works have not generally tackled Unicode matching problems.
14Using parallel methods to accelerate matching of a single pattern on a
15single input stream is more difficult.  Indeed, the finite state machines (FSMs)
16underlying are considered the hardest to parallelize (embarassingly sequential)
17of the 13 dwarves identified in the Berkeley landscape of parallel computing research \cite{}.
18However, some success has been reported recently along two independent lines of
19research.   Mytkowicz...   Cameron/Parabix....
20These works do not address Unicode issues.
22In this paper, we report on the use of the implementation of a full
23Unicode regular expression search tool, building on the bitwise
24data parallel methods of the Parabix framework combined with the
25dynamic compilation capabilities of LLVM.   The result is ICgrep,
26a high-performance, full-featured open-source grep implementation
27with systematic support for Unicode regular expressions addressing the
28requirements of Unicode Technical Standard \#18 \cite{}.  As an alternative
29to classical grep implementations, ICgrep offers dramatic performance
30acceleration in ASCII-based and Unicode matching performance alike.
34The remainder of this paper is organized as follows.   Section \ref{sec:background}
35presents background material dealing with Unicode regular expressions,
36LLVM, the Parabix framework and regular expression matching techniques
37using bitwise data parallelism.   
38Section \ref{sec:seqpar} expands on previous work on bitwise data
39parallelism by more fully characterizing the paradigm and documenting
40important techniques.
41Section \ref{sec:Unicode} addresses
42the issues and performance challenges associated with meeting Unicode
43regular expression requirements and presents the extensions to the
44Parabix techniques that we have developed to address them. 
45Section \ref{sec:architecture} describes the overall architecture of
46the ICgrep implementation with a focus on the integration of Parabix
47and LLVM technologies.
48Section \ref{sec:evaluation} evaluates the performance of ICgrep on
49several types of matching problems with contemporary competitors,
50including the latest versions of GNU grep, pcregrep, ugrep of the
51ICU (International Component for Unicode) and re2grep.
52Section \ref{sec:conclusion} concludes the paper with remarks on
53developing the Parabix+LLVM framework for other applications as well
54as identifying further research questions in Unicode regular expression
55matching with bitwise data parallelism.
Note: See TracBrowser for help on using the repository browser.