wiki:ICgrep

icGrep: The Modern Grep Replacement

Building off the Parabix transform representation of text as well as the highly-regarded LLVM compiler infrastructure, icGrep is the modern replacement for the venerable grep utility, offering fundamentally parallel and consistently high-performance search as well as broad support for Unicode regular expressions.

icGrep embodies a completely new algorithmic approach to high-performance regular expression matching. In contrast to the byte-at-a-time approach of NFA, DFA, and backtracking matchers, icGrep 1.0 processes UTF-8 input streams 128 code units at a time using the SSE2 technology available on all x86-64 processors. The fundamental notion underlying icGrep is that of bitwise data parallelism as described in our 2014 PACT paper: Cameron, Robert D., Thomas C. Shermer, Arrvindh Shriraman, Kenneth S. Herdy, Dan Lin, Benjamin R. Hull, and Meng Lin.  Bitwise data parallelism in regular expression matching. In Proceedings of the 23rd international conference on Parallel architectures and compilation, pp. 139-150. ACM, 2014. Building on that research, Dale Denis implemented the first version of icGrep for his MSc thesis. Nigel Medforth and Rob Cameron with the assistance of Nick Sumner then moved on to do the additional software engineering work to complete version 1.0 including full support for Unicode level 1 requirements.

icGrep is under continuing active development as we work towards implementation of Unicode level 2 requirements as well as advancing the underlying technology.

Predictable, Consistent Performance

icGrep 1.0 Performance

While GNU grep and other grep utilities often exhibit dramatic slowdown for complex regular expressions involving ambiguous subexpressions, case insensitive matching or large Unicode character classes, icGrep offers consistent high performance in all cases. In many such cases, icGrep is more than 100X faster than the competition, although a 5-10X ratio is more common.

icGrep has a significant compilation overhead that may dominate searches in short files of less than 10MB in size.

icGrep also has the overhead of applying the Parabix transform to input text, roughly about 1 CPU cycle per byte on modern processors. However, once transformed, the bitwise data parallel methods of icGrep often allow searches to conclude within an additional 1 CPU cycle per byte, much faster than is possible with a traditional byte-at-a-time loop.

For simple, fixed string searches, grep and other utilities can employ Boyer-Moore type algorithms to sometimes outperform icGrep. But not by much.

Performance Roadmap

icGrep 1.0 is a single-threaded search tool relying on widely available SSE2 technology and the LLVM 3.5 compiler infrastructure. However, our research work has shown that we can accelerate icGrep using multithreading with pipeline parallelism, wider SIMD technologies such as the 256-bit AVX2 technology on current Intel processors, and improved compiler support for the Parabix primitives in our custom LLVM version.

Within icGrep 1.0, the MatchStar primitive elegantly handles Kleeene-* repetitions of character classes (e.g. [a-z]*) using a single long-stream addition operation together with bitwise logic. Other Kleene-* repetitions are currently handled using while loops, but it is known that the MatchStar technique can be adapted to apply in some cases. Codifying and extending these techniques is an important area for algorithmic work.

Unicode Support

ICgrep 1.0

icGrep 1.0 accepts ASCII or UTF-8 input files and provides a full suite of Unicode processing features meeting the requirements of Unicode Level 1 support of  Unicode Technical Standard #18. See Unicode Level 1 Support in icGrep for details.

icGrep Unicode Roadmap

We are pleased to be moving forward with implementation of Unicode Level 2 support within icGrep with the support of a  Google Faculty Research Award.

icGrep Regular Expression Syntax

icGrep uses an extended regular expression syntax broadly compatible with egrep and including both the "Perl-compatible" extensions of PCRE as well as the Unicode regular expression features of the ICU software distribution. icGrep does not support back references. For more details, see the ICgrepSyntax page.

icGrep Features and Options

Standard Grep Options

  -help                            - Display available options (-help-hidden for more)
  -version                         - Display the version of this program
  -H                               - Show the file name with each matching line.
  -c                               - Count and display the matching lines per file only.
  -n                               - Show the line number with each matching line.
  -normalize-line-breaks           - Normalize line breaks to std::endl.
  -e=<string>                      - Regular expression
  -f=<regex file>                  - Take regular expressions (one per line) from a file
  -i                               - Ignore case distinctions in the pattern and the file.

Algorithm Control Options

The following options are useful for experimentation and/or improved performance in some cases.

  -disable-CSE                     - Disable Pablo common subexpression elimination/dead code elimination
  -sinking                         - Moves all instructions into the innermost legal If-scope so that they are only executed when needed.
  -disable-Unicode-linebreak       - disable Unicode line breaks - use LF only
  -disable-Unicode-matchstar       - disable Unicode MatchStar optimization
  -disable-log2-bounded-repetition - disable log2 optimizations for bounded repetition of bytes
  -disable-matchstar               - disable MatchStar optimization
  -if-insertion-gap=<int>          - minimum number of nonempty elements between inserted if short-circuit tests

Educational and Debugging Options

The following options are useful for understanding how icGrep works or debugging.

  -print-REs                       - print regular expression passes
  -print-named-REs                 - print out named REs
  -print-parsed-REs                - print out parsed regular expressions
  -print-simplified-REs            - print out final simplified REs
  -print-stripped-REs              - print out REs with nullable prefixes/suffixes removed
  -print-utf8-REs                  - print out UTF-8 REs
  -print-CC-pablo                  - print Pablo output from character class compiler
  -print-RE-pablo                  - print Pablo output from the regular expression compiler
  -print-pablo                     - print final optimized Pablo code
  -dump-generated-IR               - print LLVM IR generated by RE compilation

icGrep Downloads

Browse the icgrep 1.0 source code!

Download precompiled icgrep software from  http://www.icgrep.com.

Download and build icgrep 1.0 (Linux or Mac OS X) using svn.

svn co http://parabix.costar.sfu.ca/svn/tags/icgrep1.0

Alternatively, download the development version.

svn co http://parabix.costar.sfu.ca/svn/icGREP/icgrep-devel

Parabix Technology

icGrep is built using Parabix technology, developed at Simon Fraser University and commercialized by  International Characters, Inc. Patents on Parabix technology have been dedicated for free use in open source software, research, teaching and experimentation  [covenant].