# Changeset 3490 for docs/Working/re/ppopp-re.tex

Timestamp: Sep 15, 2013, 9:16:37 AM
\documentclass[preprint]{sigplanconf}

% The following \documentclass options may be useful:

\setlength{\pdfpagewidth}{\paperwidth}

\conferenceinfo{PPoPP 2014}{February 15-19, 2014, Orlando, Florida, United States}
\copyrightyear{2013}
\copyrightdata{978-1-nnnn-nnnn-n/yy/mm}

\titlebanner{}        % These are ignored unless
\preprintfooter{Bitwise Data Parallel Grep}   % 'preprint' option specified.

\title{Bitwise Data Parallelism in Regular Expression Matching}
%\subtitle{Subtitle Text, if any}

\authorinfo{Anonymous Authors}{Institutions}{emails}
%\authorinfo{Robert D. Cameron \and Kenneth S. Herdy \and Dan Lin \and Meng Lin \and Ben Hull \and Thomas S. Shermer \and Arrvindh Shriraman}
%           {Simon Fraser University}
%           {\{cameron,ksherdy,lindanl,linmengl,bhull,shermer,ashriram\}@cs.sfu.ca}

\maketitle

Thompson \cite{thompson1968} is credited with the first construction to convert regular expressions to nondeterministic finite automata (NFA). Following Thompson's approach, a regular expression of length $m$ is first converted to an NFA with $O(m)$ nodes. It is then possible to search a text of length $n$ using the NFA in worst case $O(mn)$ time.
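To make the $O(mn)$ bound concrete, the following toy sketch (ours, not Thompson's actual construction; it supports only literal characters plus a `.` wildcard, whereas Thompson's NFAs also handle alternation and repetition) advances a set of active states, one per pattern position, for each input character:

```python
# Toy NFA-style search: one state per pattern position (O(m) states).
# An illustration of the O(mn) bound only -- not Thompson's construction.
def nfa_search(pattern, text):
    m = len(pattern)
    active = set()                      # currently active NFA states
    for ch in text:
        next_active = set()
        # state 0 is re-activated at every position (unanchored search)
        for s in active | {0}:
            if s < m and (pattern[s] == '.' or pattern[s] == ch):
                next_active.add(s + 1)
        if m in next_active:            # accepting state reached
            return True
        active = next_active            # up to m states touched per char: O(mn)
    return False
```

Since up to $m$ states can be active simultaneously, each of the $n$ input characters costs $O(m)$ work, which is exactly the overhead the bit-parallel techniques below attack.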
Often, a more efficient choice is to convert the NFA into a DFA. A DFA has only a single active state at any time in the matching process and hence it is possible to search a text of length $n$ in $O(n)$ time. However, it is well known that the conversion of an NFA to an equivalent DFA may result in state explosion: the number of resultant DFA states may increase exponentially.

In \cite{Baeza-yates_anew} a new approach to text searching was proposed based on bit-parallelism \cite{baeza1992new}. This technique takes advantage of the intrinsic parallelism of bitwise operations within a computer word. Given a $w$-bit word, the Shift-Or algorithm \cite{Baeza-yates_anew} uses the bit-parallel approach to simulate an NFA in $O(nm/w)$ worst-case time.

A disadvantage of the bit-parallel Shift-Or pattern matching approach
and hence the name nrgrep \cite{navarro2000}.

{\em a brief review}

There has been considerable interest in using parallelization techniques to improve the performance of regular expression matching on parallel hardware such as multi-core processors (CPUs), graphics processing units (GPUs), field-programmable gate arrays (FPGAs), and specialized architectures such as the Cell Broadband Engine (Cell BE).
% FPGA results (synthesis of patterns into logic circuits) vs. memory based approaches (STTs in memory)
%CPU processing of data stream elements.

Indeed, our most abstract model is that of unbounded data parallelism: processing all elements of the input data stream simultaneously.
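The Shift-Or recurrence itself is compact enough to sketch directly. Here Python's arbitrary-precision integers stand in for the $w$-bit machine word (an illustrative sketch for exact matching, not the authors' code):

```python
# Minimal Shift-Or sketch (after Baeza-Yates & Gonnet) for exact matching.
# Bit i of D is 0 iff pattern[0..i] is a suffix of the text read so far,
# so the whole NFA advances with O(1) word operations per character:
# O(nm/w) overall for an m-character pattern and w-bit words.
def shift_or(pattern, text):
    m = len(pattern)
    full = (1 << m) - 1
    # B[c]: bit i is 0 iff pattern[i] == c (characters absent from the
    # pattern keep the all-ones default)
    B = {}
    for i, c in enumerate(pattern):
        B[c] = B.get(c, full) & ~(1 << i)
    D = full                            # all ones: nothing matched yet
    hits = []
    for pos, c in enumerate(text):
        # the shift brings a 0 into bit 0 (the empty prefix always matches)
        D = ((D << 1) & full) | B.get(c, full)
        if D & (1 << (m - 1)) == 0:     # bit m-1 clear: full pattern matched
            hits.append(pos - m + 1)    # record the starting position
    return hits
```

For example, `shift_or("ab", "xabab")` reports starting positions 1 and 3; the OR with the character mask is what gives the algorithm its name.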
In essence, data streams are viewed as (very large) integers.   The fundamental operations are bitwise logic, stream shifting and long-stream addition.

Depending on the available parallel processing resources, an actual
case, we rely on the Parabix tool chain \cite{lin2012parabix} to handle the details of compilation to block-by-block processing.

For our GPGPU implementation, we have developed a long-stream addition technique that allows us to perform 4096-bit additions using 64 threads working in lock-step SIMT fashion.  Using scripts to modify the output of the Parabix tools, we effectively divide the input into blocks of 4K bytes processed in a fully data-parallel manner.

A key concept in this streaming approach is the derivation of bit streams
or not.  Figure \ref{fig:streams} shows an example of an input byte stream in ASCII, the eight basis bit streams of the transposed representation, and the character class bit streams \verb:[a]:, \verb:[z]:, and \verb:[0-9]: that may be computed from the basis bit streams using bitwise logic. Transposition and character class construction are straightforward using the Parabix tool chain \cite{lin2012parabix}.
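As an illustration of this view (Python big integers standing in for unbounded bit streams; a naive sketch only, not the Parabix implementation), the following computes the basis bit streams by transposition and then the character classes of Figure \ref{fig:streams} with bitwise logic:

```python
# Sketch of the transposed representation: Python big integers model
# unbounded bit streams (bit k of a stream corresponds to input position k).
def transpose(data: bytes):
    """Form the eight basis bit streams: stream b collects bit b of every
    input byte (a naive loop, not the actual SIMD transposition network)."""
    basis = [0] * 8
    for pos, byte in enumerate(data):
        for b in range(8):
            if (byte >> b) & 1:
                basis[b] |= 1 << pos
    return basis

def char_classes(data: bytes):
    """Compute the [a], [z] and [0-9] class streams with bitwise logic."""
    ones = (1 << len(data)) - 1         # width mask so ~ acts fixed-width
    b = transpose(data)
    # 'a' = 0x61 = 01100001, 'z' = 0x7A = 01111010
    cc_a = ~b[7] & b[6] & b[5] & ~b[4] & ~b[3] & ~b[2] & ~b[1] & b[0] & ones
    cc_z = ~b[7] & b[6] & b[5] & b[4] & b[3] & ~b[2] & b[1] & ~b[0] & ones
    # digits 0x30..0x39: high nibble 0011, low nibble not in 1010..1111
    cc_digit = ~b[7] & ~b[6] & b[5] & b[4] & ~(b[3] & (b[2] | b[1])) & ones
    return cc_a, cc_z, cc_digit
```

Each class stream is thus a constant number of bitwise operations over the basis streams, independent of input length per word processed.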
\begin{figure}[tbh]
\begin{center}
\begin{tabular}{cr}\\
input data  & \verb:a4534q--b29z---az---a4q--bca22z--:\\
$B_7$ & \verb:.................................:\\
$B_6$ & \verb:1....1..1..1...11...1.1..111..1..:\\
$B_5$ & \verb:111111111111111111111111111111111:\\
$B_4$ & \verb:.11111...111....1....11.....111..:\\
$B_3$ & \verb:......11..11111.1111...11.....111:\\
$B_2$ & \verb:.11.1.11....111..111.1.11......11:\\
$B_1$ & \verb:...1....11.1....1........11.111..:\\
$B_0$ & \verb:1.11.111..1.1111.1111.111.11...11:\\
\verb:[a]: & \verb:1..............1....1......1.....:\\
\verb:[z]: & \verb:...........1....1.............1..:\\
\verb:[0-9]: & \verb:.1111....11..........1......11...:
\end{tabular}
\end{center}
\caption{Basis and Character Class Streams}
\label{fig:streams}
\end{figure}

\paragraph*{Marker Streams.}  Now consider how bit-parallel data

\begin{itemize}
\item  \verb#hsimd<64>::mask(X)# : gathering the high bit of each 64-bit field into a single compressed $f$-bit mask value,
\item normal bitwise logic operations on $f$-bit masks, and
\item  \verb#simd<64>::spread(x)# : distributing the bits of an $f$-bit mask, one bit each to the $f$ 64-bit fields of a vector.
\end{itemize}
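The two field-movement operations can be modelled on big integers as follows (a sketch under the assumption that `spread` deposits each mask bit into the low bit of its field; the actual instruction sequences are architecture-specific):

```python
# Big-integer model of the mask/spread field-movement operations, for a
# vector of f 64-bit fields.  ASSUMPTION: spread targets the LOW bit of
# each field; real SIMD implementations vary.
F = 64                                   # field width in bits

def mask(x, f):
    """Gather the high bit (bit 63) of each 64-bit field into an f-bit mask."""
    m = 0
    for i in range(f):
        m |= ((x >> (i * F + F - 1)) & 1) << i
    return m

def spread(m, f):
    """Distribute the f bits of mask m, one to the low bit of each field."""
    x = 0
    for i in range(f):
        x |= ((m >> i) & 1) << (i * F)
    return x
```

Note that `spread` followed by a shift of each bit to its field's high position round-trips through `mask`, recovering the original $f$-bit value.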