# Changeset 4450

Ignore:
Timestamp:
Feb 3, 2015, 10:14:39 AM (4 years ago)
Message:

More background

File:
1 edited

### Legend:

Unmodified
 r4449 be used in regular expressions. Beyond named properties, Unicode Technical Standard \#18 defines a number of additional requirements for Unicode regular expressions, at three levels of complexity \cite{davis2012unicode}.   We consider only Unicode level 1 requirements in this paper, as most grep implementations are incomplete with respect to Unicode requirements at this level.   At level 1, the primary additional requirements relate to more complicated rules rules for identifying line breaks, word breaks Beyond named properties, Unicode Technical Standard \#18 defines additional requirements for Unicode regular expressions, at three levels of complexity \cite{davis2012unicode}.   Level 1 generally relates to properties expressed in terms of individual Unicode codepoints, while level 2 introduces complexities due to codepoint sequences that form grapheme clusters, and level 3 relates to tailored locale-specific support. We consider only Unicode level 1 requirements in this paper, as most grep implementations are incomplete with respect the requirements even at this level. The additional level 1 regular expression requirements primarily relate to larger classes of characters that are used in identifying line breaks, word breaks and case-insensitive matching. Beyond this, there is one important syntactic extension: the ability to refine character class specifications using set intersectiona and subtraction.   For example, \verb:[\p{Greek}&&\p{Lu}]: Beyond this, there is one important syntactic extension: the ability to refine character class specifications using set intersection and subtraction.   For example, \verb:[\p{Greek}&&\p{Lu}]: denotes the class of upper case Greek letters, while \verb:[\p{Ll}--\p{ASCII}]: denotes the class of all non-ASCII lower case letters. \subsection{Parabix} The Parabix toolchain is a set of compilers and run-time libraries designed to take advantage of the SIMD features of commodity processors to support high-performance streaming text processing based on a bit-parallel transform represenation of text.  In this representation, a text $T$ is represented as a set of parallel bit streams $B_i$, such that bit $j$ of stream $B_i$ is the $i^{\text{th}}$ bit of character code unit $j$ of $T$.  The Parabix methods have been used to accelerate Unicode transcoding \cite{cameron2008case}, protein search \cite{green2009modeling}, XML parsing \cite{cameron2011parallel}, and, most recently, regular expression search \cite{cameron2014bitwise}. \subsection{LLVM} The LLVM compiler infrastructure is a set of modular compiler components and tools organized around a powerful generic intermediate representation (LLVM IR) that is agnostic with respect to source language and code-generation targets. Beginning as an MSc research project at the University of Illinois \cite{lattner2004llvm}, LLVM is now an open-source codebase supported by a broad community of researchers, developers and commercial organizations. \subsection{Parabix} LLVM features a flexible multi-stage compilation structure that can be organized in passes in many ways, including fully dynamic just-in-time compilation.   Of particular importance to the icGrep project, the LLVM IR supports arbitrarily-sized vectors of arbitrary-width integers, well suited to code generation targetting the SIMD integer instructions of commodity processors. \subsection{Parabix Regular Expression Matching}