Version 2 (modified by cameron, 4 years ago) (diff)


ICgrep: The Modern Grep Replacement

Building off the Parabix transform representation of text as well as the highly-regarded LLVM compiler infrastructure, icGrep is the modern replacement for the venerable grep utility, offering fundamentally parallel and consistently high-performance search as well as broad support for Unicode regular expressions.

ICgrep embodies a completely new algorithmic approach to high-performance regular expression matching. In contrast to the byte-at-a-time approach of NFA, DFA, and backtracking matchers, icGrep 1.0 processes UTF-8 input streams 128 code units at a time using the SSE2 technology available on all x86-64 processors. The fundamental notion underlying icGrep is that of bitwise data parallelism as described in our 2014 PACT paper: Cameron, Robert D., Thomas C. Shermer, Arrvindh Shriraman, Kenneth S. Herdy, Dan Lin, Benjamin R. Hull, and Meng Lin. "Bitwise data parallelism in regular expression matching." In Proceedings of the 23rd international conference on Parallel architectures and compilation, pp. 139-150. ACM, 2014.

Predictable, Consistent Performance

ICgrep 1.0 Performance

While GNU grep and other grep utilities often exhibit dramatic slowdown for complex regular expressions involving ambiguous subexpressions, case insensitive matching or large Unicode character classes, icGrep offers consistent high performance in all cases. In many such cases, icGrep is more than 100X faster than the competition, although a 5-10X ratio is more common.

ICgrep has a significant compilation overhead that may dominate searches in short files of less than 10MB in size.

ICgrep also has the overhead of applying the Parabix transform to input text, roughly about 1 CPU cycle per byte on modern processors. However, once transformed, the bitwise data parallel methods of icGrep often allow searches to conclude within an additional 1 CPU cycle per byte, much faster than is possible with a traditional byte-at-a-time loop.

For simple, fixed string searches, grep and other utilities can employ Boyer-Moore type algorithms to sometimes outperform icGrep. But not by much.

Performance Roadmap

ICgrep 1.0 is a single-threaded search tool relying on widely available SSE2 technology and the LLVM 3.5 compiler infrastructure. However, our research work has shown that we can accelerate icGrep using multithreading with pipeline parallelism, wider SIMD technologies such as the 256-bit AVX2 technology on current Intel processors, and improved compiler support for the Parabix primitives in our custom LLVM version.

Unicode Support

ICgrep 1.0

ICgrep 1.0 accepts ASCII or UTF-8 input files and provides a full suite of Unicode processing features meeting the requirements of Unicode Level 1 support of Unicode Technical Standard #18. See Unicode Level 1 Support in icGrep for details.

ICgrep Unicode Roadmap

We are pleased to be moving forward with implementation of Unicode Level 2 support within icGrep with the support of a Google Faculty Research Award.

ICgrep Downloads

Browse the icgrep 1.0 source code!

Download precompiled icgrep software from [].

Download and build icgrep 1.0 (Linux or Mac OS X) using svn.

svn co

Alternatively, download the development version.

svn co