wiki:GigabytePerSecondGrep

Version 1 (modified by cameron, 2 years ago) (diff)

--

Gigabyte Per Second Grep

Search Example: Greek Text in Arabic Wikipedia

Let's look for lines containing Greek characters in 2.78 GB Arabic language Wikipedia file.

We'll use a tiny Intel NUC box with a low-power Intel Core i3-5010U CPU @ 2.10GHz and a Samsung SM 951 SSD (256GB solid-state drive with PCI express interface).

We'll also use stock single-threaded icgrep 1.0 compiled with BLOCK_SIZE=128.

Here's our search using the Unicode property expression \p{Greek}.

icgrep 1.0

$ perf stat ./icgrep -c '\p{Greek}' ~/arwiki-20150901-pages-articles.xml 
13315

 Performance counter stats for './icgrep -c \p{Greek} /home/cameron/arwiki-20150901-pages-articles.xml':

       2582.128839      task-clock (msec)         #    0.963 CPUs utilized          
             1,149      context-switches          #    0.445 K/sec                  
                 4      cpu-migrations            #    0.002 K/sec                  
            53,705      page-faults               #    0.021 M/sec                  
     5,389,976,796      cycles                    #    2.087 GHz                    
    15,608,445,449      instructions              #    2.90  insns per cycle        
       757,755,466      branches                  #  293.462 M/sec                  
         6,690,654      branch-misses             #    0.88% of all branches        

       2.682235100 seconds time elapsed

On large files, icgrep can often perform complex regular expression searches at faster than 1.0 GB/second.

grep -P

For interest's sake, let's look at the same search with grep -P (necessary for Unicode property support).

$ perf stat grep -P -c '\p{Greek}' ~/arwiki-20150901-pages-articles.xml 
13315

 Performance counter stats for 'grep -P -c \p{Greek} /home/cameron/arwiki-20150901-pages-articles.xml':

      73181.965194      task-clock (msec)         #    1.000 CPUs utilized          
             1,450      context-switches          #    0.020 K/sec                  
                 9      cpu-migrations            #    0.000 K/sec                  
               137      page-faults               #    0.002 K/sec                  
   153,230,499,385      cycles                    #    2.094 GHz                    
   498,018,858,602      instructions              #    3.25  insns per cycle        
   110,260,648,689      branches                  # 1506.664 M/sec                  
       347,769,208      branch-misses             #    0.32% of all branches        

      73.153257137 seconds time elapsed

Dramatically slower than icgrep.

RE2

Or if we use the RE2 engine with the gre2p driver, the search is still 7X slower than icgrep.

$ gre2p -c '\p{Greek}' ~/arwiki-20150901-pages-articles.xml 
13315

 Performance counter stats for 'gre2p/gre2p -c \p{Greek} /home/cameron/arwiki-20150901-pages-articles.xml':

      16913.945598      task-clock (msec)         #    1.000 CPUs utilized          
               553      context-switches          #    0.033 K/sec                  
                 3      cpu-migrations            #    0.000 K/sec                  
           502,919      page-faults               #    0.030 M/sec                  
    35,409,521,317      cycles                    #    2.094 GHz                    
    74,641,629,425      instructions              #    2.11  insns per cycle        
    20,436,313,593      branches                  # 1208.252 M/sec                  
        17,952,867      branch-misses             #    0.09% of all branches        

      16.909611459 seconds time elapsed

icgrep 4850/AVX2

We're always striving to improve performance. With revision 4850 compiled to use the AVX2 SIMD features, performance is over 1.8 GB/second!

$ perf stat ./icgrep4850 -c '\p{Greek}' ~/arwiki-20150901-pages-articles.xml 
13315

 Performance counter stats for './icgrep4850 -c \p{Greek} /home/cameron/arwiki-20150901-pages-articles.xml':

       1523.205983      task-clock (msec)         #    1.000 CPUs utilized          
                40      context-switches          #    0.026 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
            43,240      page-faults               #    0.028 M/sec                  
     3,186,663,876      cycles                    #    2.092 GHz                    
     7,074,846,613      instructions              #    2.22  insns per cycle        
       327,301,431      branches                  #  214.877 M/sec                  
         2,814,058      branch-misses             #    0.86% of all branches        

       1.523021857 seconds time elapsed