source: docs/Working/icGrep/background.tex @ 4450

Last change on this file since 4450 was 4450, checked in by cameron, 4 years ago

More background

File size: 4.8 KB
Line 
1\section{Background} \label{sec:background}
2\subsection{Unicode Regular Expression Requirements}
3Unicode is system for organizing the characters from all languages
4and symbol systems into a single numeric space and encoding those
5values in convenient formats for computerized processing.
6The numeric values form a space of integer {\em codepoints} from 0
7through hexadecimal 10FFFF.  The computerized formats represent
8codepoint values as (possibly variable length) sequences of fixed-width {\em code units}.
9UTF-8 represents each codepoint using a sequence of one to four octets (8-bit bytes),
10UTF-16 represents each codepoint using one or two 16-bit code units and UTF-32
11represents each codepoint as a single 32-bit unit.  The format used most
12often for storage and transmission of Unicode data is UTF-8; this is the format
13assumed through this paper.
14
15Traditional grep syntax is oriented towards
16string search using regular expressions over ASCII or extended-ASCII byte sequences.
17A grep search for a line beginning with a capitalized word might use the
18pattern ``\verb:^[A-Z][a-z]+:'' (``extended'' syntax).  Here, ``\verb:^:'' is a zero-width assertion
19matching only at the start of a line, ``\verb:[A-Z]:'' is a character class
20that matches any single character in the contiguous range of characters froms A through Z,
21while the  plus operator in ``\verb:[a-z]+:'' denotes repetition of one or more lower
22case ASCII letters.   
23
24While explicit listing of characters of interest is
25practical with ASCII, it is less so with Unicode.   In the Unicode 7.0 database,
26there are 1490 characters categorized as upper case and 1841 categorized as lower case.
27Rather than explicit listing of all characters of interest, then, it is more
28practical to use named character classes, such as \verb:Lu: for upper case letters and
29\verb:Ll: for lower case letters.   Using these names, our search might be rewritten
30to find capitalized words in any language as ``\verb!^[[:Lu:]][[:Ll:]]+!'' (Posix
31syntax)  or ``\verb:^\p{Lu}\p{Ll}+:'' (Perl-compatible syntax).   
32The Unicode consortium has defined an extensive list of named properties that can
33be used in regular expressions.
34
35Beyond named properties, Unicode Technical Standard \#18 defines additional
36requirements for Unicode regular expressions, at three levels of
37complexity \cite{davis2012unicode}.   Level 1 generally relates to
38properties expressed in terms of individual Unicode codepoints, while level 2
39introduces complexities due to codepoint sequences that form grapheme clusters,
40and level 3 relates to tailored locale-specific support.
41We consider only Unicode level 1 requirements
42in this paper, as most grep implementations are incomplete with respect
43the requirements even at this level.   
44The additional level 1 regular expression requirements primarily relate to larger
45classes of characters that are used in identifying line breaks, word breaks
46and case-insensitive matching.   
47Beyond this, there is one important syntactic extension: the ability to refine
48character class specifications using set
49intersection and subtraction.   For example, \verb:[\p{Greek}&&\p{Lu}]:
50denotes the class of upper case Greek letters, while \verb:[\p{Ll}--\p{ASCII}]:
51denotes the class of all non-ASCII lower case letters.
52
53\subsection{Parabix}
54
55The Parabix toolchain is a set of compilers and run-time libraries designed
56to take advantage of the SIMD features of commodity processors
57to support high-performance streaming text processing based on a bit-parallel
58transform represenation of text.  In this representation, a text $T$ is represented as a set of parallel
59bit streams $B_i$, such that bit $j$ of stream $B_i$ is the $i^{\text{th}}$ bit of
60character code unit $j$ of $T$.  The Parabix methods have been used to
61accelerate Unicode transcoding \cite{cameron2008case}, protein search \cite{green2009modeling},
62XML parsing \cite{cameron2011parallel}, and, most recently, regular expression search \cite{cameron2014bitwise}.
63
64
65
66\subsection{LLVM}
67
68The LLVM compiler infrastructure is a set of modular compiler components and tools
69organized around a powerful generic intermediate representation (LLVM IR) that is
70agnostic with respect to source language and code-generation targets.   
71Beginning as an MSc research project at the University of Illinois \cite{lattner2004llvm},
72LLVM is now an open-source codebase supported by a broad community of
73researchers, developers and commercial organizations.   
74
75LLVM features a flexible multi-stage compilation structure that can be organized in
76passes in many ways, including fully dynamic just-in-time compilation.   Of particular
77importance to the icGrep project, the LLVM IR supports arbitrarily-sized vectors of
78arbitrary-width integers, well suited to code generation
79targetting the SIMD integer instructions of commodity processors.
80
81\subsection{Parabix Regular Expression Matching}
Note: See TracBrowser for help on using the repository browser.