Changeset 4450 for docs


Ignore:
Timestamp:
Feb 3, 2015, 10:14:39 AM (4 years ago)
Author:
cameron
Message:

More background

File:
1 edited

Legend:

Unmodified
Added
Removed
  • docs/Working/icGrep/background.tex

    r4449 r4450  
    3333be used in regular expressions.
    3434
    35 Beyond named properties, Unicode Technical Standard \#18 defines a number of
    36 additional requirements for Unicode regular expressions, at three levels of
    37 complexity \cite{davis2012unicode}.   We consider only Unicode level 1 requirements in this paper,
    38 as most grep implementations are incomplete with respect to Unicode requirements
    39 at this level.   At level 1, the primary additional requirements relate to
    40 more complicated rules rules for identifying line breaks, word breaks
     35Beyond named properties, Unicode Technical Standard \#18 defines additional
     36requirements for Unicode regular expressions, at three levels of
     37complexity \cite{davis2012unicode}.   Level 1 generally relates to
     38properties expressed in terms of individual Unicode codepoints, while level 2
     39introduces complexities due to codepoint sequences that form grapheme clusters,
     40and level 3 relates to tailored locale-specific support.
     41We consider only Unicode level 1 requirements
     42in this paper, as most grep implementations are incomplete with respect
     43the requirements even at this level.   
     44The additional level 1 regular expression requirements primarily relate to larger
     45classes of characters that are used in identifying line breaks, word breaks
    4146and case-insensitive matching.   
    42 Beyond this, there is one important syntactic
    43 extension: the ability to refine character class specifications using set
    44 intersectiona and subtraction.   For example, \verb:[\p{Greek}&&\p{Lu}]:
     47Beyond this, there is one important syntactic extension: the ability to refine
     48character class specifications using set
     49intersection and subtraction.   For example, \verb:[\p{Greek}&&\p{Lu}]:
    4550denotes the class of upper case Greek letters, while \verb:[\p{Ll}--\p{ASCII}]:
    4651denotes the class of all non-ASCII lower case letters.
    4752
     53\subsection{Parabix}
     54
     55The Parabix toolchain is a set of compilers and run-time libraries designed
     56to take advantage of the SIMD features of commodity processors
     57to support high-performance streaming text processing based on a bit-parallel
     58transform represenation of text.  In this representation, a text $T$ is represented as a set of parallel
     59bit streams $B_i$, such that bit $j$ of stream $B_i$ is the $i^{\text{th}}$ bit of
     60character code unit $j$ of $T$.  The Parabix methods have been used to
     61accelerate Unicode transcoding \cite{cameron2008case}, protein search \cite{green2009modeling},
     62XML parsing \cite{cameron2011parallel}, and, most recently, regular expression search \cite{cameron2014bitwise}.
     63
     64
     65
    4866\subsection{LLVM}
    4967
     68The LLVM compiler infrastructure is a set of modular compiler components and tools
     69organized around a powerful generic intermediate representation (LLVM IR) that is
     70agnostic with respect to source language and code-generation targets.   
     71Beginning as an MSc research project at the University of Illinois \cite{lattner2004llvm},
     72LLVM is now an open-source codebase supported by a broad community of
     73researchers, developers and commercial organizations.   
    5074
    51 \subsection{Parabix}
     75LLVM features a flexible multi-stage compilation structure that can be organized in
     76passes in many ways, including fully dynamic just-in-time compilation.   Of particular
     77importance to the icGrep project, the LLVM IR supports arbitrarily-sized vectors of
     78arbitrary-width integers, well suited to code generation
     79targetting the SIMD integer instructions of commodity processors.
     80
    5281\subsection{Parabix Regular Expression Matching}
Note: See TracChangeset for help on using the changeset viewer.