    136 136   in $\lceil\log_{64}{n}\rceil$ steps.   We ultimately apply this technique,
    137 137   for example, to perform
    138     - synchronized 4096-bit addition on GPGPU warps of 64 threads.
        138 + synchronized 4096-bit addition on GPGPU wavefronts of 64 threads.
    140 140   There is also a strong keyword match between the bit-parallel
    584 584   in OpenCL.
    585 585   We arranged for 64 work groups each having 64 threads.
    586     - The size of work group and number of work groups is choosen
        586 + The size of work group and number of work groups is chosen
    587 587   to provide the best occupancy calculated by AMD App Profiler.
    588 588   Input files are divided in data parallel fashion among
    613 613   64 work groups with 4096 bytes each at a time rather than 128 bytes
    614 614   at a time on SSE or 256 bytes at a time on AVX, the performance
    615     - improvement is less than 55\%. The first reason is hardware
        615 + improvement is less than 60\%. The first reason is hardware
    616 616   limitations. Our kernel occupancy is limited by register usage
    617 617   and not all the work groups can be scheduled at the same time.
    623 623   further processing rather than jump to the next block with a
    624 624   simple IF test. Therefore, the performance of different
    625     - regular expresions is dependent on the number of
        625 + regular expressions is dependent on the number of
    626 626   long-stream addition operations and the total number of matches
    627 627   of a given input.
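
The hunks above refer to synchronized 4096-bit addition across a 64-thread wavefront and to long-stream addition operations, but the kernel itself is not shown here. As a rough illustration only, the following Python sketch simulates one common carry-mask/bubble-mask formulation of such an addition, assuming each of the 64 lanes holds one 64-bit digit. The names `to_digits`, `from_digits`, and `long_stream_add` are hypothetical, and this is not the OpenCL code under discussion.

```python
DIGIT_BITS = 64                      # assumed width of the digit held by one thread
LANES = 64                           # threads per wavefront: 64 * 64 = 4096 bits
DIGIT_MASK = (1 << DIGIT_BITS) - 1


def to_digits(value):
    """Split an integer below 2**4096 into 64 little-endian 64-bit digits."""
    return [(value >> (DIGIT_BITS * k)) & DIGIT_MASK for k in range(LANES)]


def from_digits(digits):
    """Reassemble little-endian 64-bit digits into one integer."""
    return sum(d << (DIGIT_BITS * k) for k, d in enumerate(digits))


def long_stream_add(x, y, carry_in=0):
    """Add two 64-digit numbers; returns (digits, carry_out)."""
    # Step 1: every lane adds its own pair of digits independently.
    sums = [a + b for a, b in zip(x, y)]
    r = [s & DIGIT_MASK for s in sums]

    # Step 2: gather one bit per lane into two 64-bit masks.
    # carry: the lane overflowed; bubble: the lane is all ones, so an
    # incoming carry would pass straight through it.
    carry = sum(1 << k for k, s in enumerate(sums) if s > DIGIT_MASK)
    bubble = sum(1 << k for k, d in enumerate(r) if d == DIGIT_MASK)

    # Step 3: a carry arrives one lane above where it was generated.
    # A single addition on the masks then pushes each arriving carry
    # through any run of bubble lanes.
    arriving = (carry << 1) | carry_in
    increment = (((arriving & bubble) + bubble) ^ bubble) | arriving

    # Step 4: each lane adds its own bit of the increment mask.
    z = [(d + ((increment >> k) & 1)) & DIGIT_MASK for k, d in enumerate(r)]
    carry_out = (increment >> LANES) & 1
    return z, carry_out
```

In this sketch the only cross-lane work is building the two masks and performing one addition on them; everything else is lane-local, which is the property that makes the approach attractive on a wavefront.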
