source: docs/Working/icGrep/background.tex @ 4488

Last change on this file since 4488 was 4488, checked in by cameron, 4 years ago

Initial bitwise example, table placeholder

File size: 6.5 KB
Line 
1\section{Background} \label{sec:background}
2\subsection{Unicode Regular Expression Requirements}
3Unicode is system for organizing the characters from all languages
4and symbol systems into a single numeric space and encoding those
5values in convenient formats for computerized processing.
6The numeric values form a space of integer {\em codepoints} from 0
7through hexadecimal 10FFFF.  The computerized formats represent
8codepoint values as (possibly variable length) sequences of fixed-width {\em code units}.
9UTF-8 represents each codepoint using a sequence of one to four octets (8-bit bytes),
10UTF-16 represents each codepoint using one or two 16-bit code units and UTF-32
11represents each codepoint as a single 32-bit unit.  The format used most
12often for storage and transmission of Unicode data is UTF-8; this is the format
13assumed through this paper.
14
15Traditional grep syntax is oriented towards
16string search using regular expressions over ASCII or extended-ASCII byte sequences.
17A grep search for a line beginning with a capitalized word might use the
18pattern ``\verb:^[A-Z][a-z]+:'' (``extended'' syntax).  Here, ``\verb:^:'' is a zero-width assertion
19matching only at the start of a line, ``\verb:[A-Z]:'' is a character class
20that matches any single character in the contiguous range of characters from A through Z,
21while the  plus operator in ``\verb:[a-z]+:'' denotes repetition of one or more lower
22case ASCII letters.   
23
24While explicit listing of characters of interest is
25practical with ASCII, it is less so with Unicode.   In the Unicode 7.0 database,
26there are 1490 characters categorized as upper case and 1841 categorized as lower case.
27Rather than explicit listing of all characters of interest, then, it is more
28practical to use named character classes, such as \verb:Lu: for upper case letters and
29\verb:Ll: for lower case letters.   Using these names, our search might be rewritten
30to find capitalized words in any language as ``\verb!^[[:Lu:]][[:Ll:]]+!'' (Posix
31syntax)  or ``\verb:^\p{Lu}\p{Ll}+:'' (Perl-compatible syntax).   
32The Unicode consortium has defined an extensive list of named properties that can
33be used in regular expressions.
34
35Beyond named properties, Unicode Technical Standard \#18 defines additional
36requirements for Unicode regular expressions, at three levels of
37complexity~\cite{davis2012unicode}.   Level 1 generally relates to
38properties expressed in terms of individual Unicode codepoints, while level 2
39introduces complexities due to codepoint sequences that form grapheme clusters,
40and level 3 relates to tailored locale-specific support.
41We consider only Unicode level 1 requirements
42in this paper, as most grep implementations are incomplete with respect
43the requirements even at this level.   
44The additional level 1 regular expression requirements primarily relate to larger
45classes of characters that are used in identifying line breaks, word breaks
46and case-insensitive matching.   
47Beyond this, there is one important syntactic extension: the ability to refine
48character class specifications using set
49intersection and subtraction.   For example, \verb:[\p{Greek}&&\p{Lu}]:
50denotes the class of upper case Greek letters, while \verb:[\p{Ll}--\p{ASCII}]:
51denotes the class of all non-ASCII lower case letters.
52
53\subsection{Parabix}
54
55The Parabix toolchain is a set of compilers and run-time libraries designed
56to take advantage of the SIMD features of commodity processors
57to support high-performance streaming text processing.
58% based on a bit-parallel
59%transform representation of text.
60Parabix leverages a bit-parallel transform representation of text, where a
61text $T$ is represented as a set of parallel
62bit streams $B_i$, such that bit $j$ of stream $B_i$ is the $i^{\mbox{th}}$ bit of
63character code unit $j$ of $T$.
64The Parabix methods have been used to
65accelerate Unicode transcoding~\cite{cameron2008case}, protein search~\cite{green2009modeling},
66XML parsing~\cite{cameron2011parallel}, and, most recently, regular expression search~\cite{cameron2014bitwise}.
67
68
69\begin{figure}
70\begin{center}
71\begin{tabular}{cr}\\
72input data  & \verb`dead dreams defeated.`\\
73$M_1 = {\text{Advance}}({\text{\tt [d]}})$ & \verb`.1..1.1......1......1`\\
74$M_2 = {\text{MatchStar}}(M_1, {\text{\tt [a-z]}})$ & \verb`.1111.111111.11111111`\\
75$M_3 = {\text{Advance}}(M_2 \wedge {\text{\tt [e]}})$ & \verb`..1.....1.....1.1..1.`\\
76$M_4 = {\text{Advance}}(M_3 \wedge {\text{\tt [d]}})$ & \verb`....................1`
77\end{tabular}
78\end{center}
79\caption{Matching with Bitwise Data Parallelism}\label{fig:bitwisematch}
80\end{figure}
81
82
83The central concept in regular expression matching is that of marker bitstreams.
84At each step in the matching process, a marker bitstream indicates the matches
85that have been found to this point.   The bitwise matching method uses the
86operations Advance to move an entire stream of markers one step forward, and MatchStar
87to find all positions matched after a character class repetition.
88For example,
89Figure \ref{fig:bitwisematch} shows how the regular expression {\tt d[a-z]*ed} is matched
90against some input text using bitwise methods. 
91In the first step the character class stream
92{\tt [d]} is matched and the results advanced one position to produce marker bitstream $M_1$.   Four match results are now in play
93simultaneously. 
94The next step applies the
95MatchStar operation to find all the matches that may then be reached with the Kleene-* repetition
96{\tt [a-z]*} ($M_2$).  We now have pending matches at almost all possible positions.   However, most of
97the positions are filtered out with the match to {\tt [e]} in the computation of $M_3$.
98The final step produces marker stream $M_4$ indicating that single position at which the entire regular expression is matched.
99
100
101
102\comment{
103
104\subsection{LLVM}
105
106The LLVM compiler infrastructure is a set of modular compiler components and tools
107organized around a powerful generic intermediate representation (LLVM IR) that is
108agnostic with respect to source language and code-generation targets.   
109Beginning as an MSc research project at the University of Illinois \cite{lattner2004llvm},
110LLVM is now an open-source codebase supported by a broad community of
111researchers, developers and commercial organizations.   
112
113LLVM features a flexible multi-stage compilation structure that can be organized in
114passes in many ways, including fully dynamic just-in-time compilation.   Of particular
115importance to the \icGrep{} project, the LLVM IR supports arbitrarily-sized vectors of
116arbitrary-width integers, well suited to code generation
117targetting the SIMD integer instructions of commodity processors.
118
119\subsection{Parabix Regular Expression Matching}
120}
Note: See TracBrowser for help on using the repository browser.