wiki:IcGrep

Version 4 (modified by cameron, 5 years ago) (diff)

--

IcGrep: Blazingly Fast Regular Expression Search

Building off the Parabix transform representation of text, icGrep embodies a completely new algorithmic approach to high-performance regular expression matching. In contrast to the byte-at-a-time approach of NFA, DFA, and backtracking matchers, icGrep processes UTF-8 input streams 128 code units at a time (using SSE2 technology, or 256 code units at a time with Intel's new AVX2 instructions).

Unicode Support

icGrep accepts ASCII or UTF-8 input files and provides a full suite of Unicode processing features meeting the requirements of Unicode Level 1 support of Unicode Technical Standard #18. See Unicode Level 1 Support in icGrep for details.

Linux comparison: icGrep vs egrep

Here's an interesting example of >100X speedup vs. egrep.

perf stat ./icgrep -c '[ ](0x)?([a-fA-F0-9][a-fA-F0-9])+[.:,?! ]' ../performance/data/howto 
130243

 Performance counter stats for './icgrep -c [ ](0x)?([a-fA-F0-9][a-fA-F0-9])+[.:,?! ] ../performance/data/howto':

         33.084992 task-clock                #    0.989 CPUs utilized          
                 6 context-switches          #    0.000 M/sec                  
                 0 CPU-migrations            #    0.000 M/sec                  
            12,132 page-faults               #    0.367 M/sec                  
       120,678,424 cycles                    #    3.648 GHz                     [75.94%]
        47,567,214 stalled-cycles-frontend   #   39.42% frontend cycles idle    [84.44%]
        40,414,969 stalled-cycles-backend    #   33.49% backend  cycles idle    [75.91%]
       182,642,376 instructions              #    1.51  insns per cycle        
                                             #    0.26  stalled cycles per insn [87.96%]
        14,480,982 branches                  #  437.690 M/sec                   [87.99%]
           509,098 branch-misses             #    3.52% of all branches         [80.41%]

       0.033460665 seconds time elapsed
perf stat egrep -c '[ ](0x)?([a-fA-F0-9][a-fA-F0-9])+[.:,?! ]' ../performance/data/howto 
130243

 Performance counter stats for 'egrep -c [ ](0x)?([a-fA-F0-9][a-fA-F0-9])+[.:,?! ] ../performance/data/howto':

       4019.404953 task-clock                #    0.996 CPUs utilized          
               342 context-switches          #    0.000 M/sec                  
                 5 CPU-migrations            #    0.000 M/sec                  
               321 page-faults               #    0.000 M/sec                  
    14,845,377,498 cycles                    #    3.693 GHz                     [83.31%]
     2,246,155,037 stalled-cycles-frontend   #   15.13% frontend cycles idle    [83.34%]
       699,469,381 stalled-cycles-backend    #    4.71% backend  cycles idle    [66.69%]
    36,936,781,254 instructions              #    2.49  insns per cycle        
                                             #    0.06  stalled cycles per insn [83.34%]
     6,793,009,741 branches                  # 1690.054 M/sec                   [83.34%]
        20,084,331 branch-misses             #    0.30% of all branches         [83.34%]

       4.034369628 seconds time elapsed

Browse the source code!