Changeset 4449

Feb 3, 2015, 7:30:07 AM

Unicode regexp background

1 added
1 edited


  • docs/Working/icGrep/background.tex

    r4446 r4449  
    99UTF-8 represents each codepoint using a sequence of one to four octets (8-bit bytes),
    1010UTF-16 represents each codepoint using one or two 16-bit code units and UTF-32
    11 represents each codepoint as a single 32-bit unit.
     11represents each codepoint as a single 32-bit unit.  The format used most
     12often for storage and transmission of Unicode data is UTF-8; this is the format
     13assumed throughout this paper.
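As an illustration of the three encoding forms described above, the following minimal Python sketch counts the code units needed for a few sample codepoints. The helper name `encoded_units` and the chosen sample characters are assumptions made for this example; they are not part of the paper.

```python
# Illustrative sketch only: count code units per codepoint in each encoding form.
# `encoded_units` is a hypothetical helper name, not something from the paper.

def encoded_units(ch: str) -> dict:
    """Return the number of code units needed to encode one codepoint."""
    return {
        "utf8_octets": len(ch.encode("utf-8")),            # 1 to 4 octets
        "utf16_units": len(ch.encode("utf-16-le")) // 2,   # 1 or 2 16-bit units
        "utf32_units": len(ch.encode("utf-32-le")) // 4,   # always 1 32-bit unit
    }

# Sample codepoints: U+0041, U+00E9, U+20AC, U+1F600
for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    print(f"U+{ord(ch):04X}", encoded_units(ch))
```

The `-le` codec variants are used so that no byte-order mark is counted as a code unit.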
    13 Traditional grep implementations (GNU grep, BSD grep, pcregrep) are oriented towards
     15Traditional grep syntax is oriented towards
    1416string search using regular expressions over ASCII or extended-ASCII byte sequences.
     17A grep search for a line beginning with a capitalized word might use the
     18pattern ``\verb:^[A-Z][a-z]+:'' (``extended'' syntax).  Here, ``\verb:^:'' is a zero-width assertion
     19matching only at the start of a line, ``\verb:[A-Z]:'' is a character class
     20that matches any single character in the contiguous range of characters from A through Z,
     21while the plus operator in ``\verb:[a-z]+:'' denotes repetition of one or more lower
     22case ASCII letters.   
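The ASCII-oriented pattern above can be tried with any regular expression engine. The sketch below uses Python's standard `re` module rather than grep itself, applying the pattern line by line in the way a grep search would. The function name `grep_capitalized` and the sample text are assumptions made for illustration.

```python
import re

# The pattern from the text: start of line, one ASCII upper case letter,
# then one or more ASCII lower case letters.
CAPITALIZED = re.compile(r"^[A-Z][a-z]+")

def grep_capitalized(text: str) -> list:
    """Return the lines of `text` that begin with a capitalized ASCII word.

    Hypothetical helper: emulates line-oriented grep matching.
    """
    return [line for line in text.splitlines() if CAPITALIZED.search(line)]

sample = "Hello world\nhello World\n\u00c9cole\nOk"
print(grep_capitalized(sample))
```

The line beginning with U+00C9 (É) is not selected, because the class `[A-Z]` covers only the ASCII range.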
     24While explicit listing of characters of interest is
     25practical with ASCII, it is less so with Unicode.   In the Unicode 7.0 database,
     26there are 1490 characters categorized as upper case and 1841 categorized as lower case.
     27Rather than explicit listing of all characters of interest, then, it is more
     28practical to use named character classes, such as \verb:Lu: for upper case letters and
     29\verb:Ll: for lower case letters.   Using these names, our search might be rewritten
     30to find capitalized words in any language as ``\verb!^[[:Lu:]][[:Ll:]]+!'' (POSIX
     31syntax)  or ``\verb:^\p{Lu}\p{Ll}+:'' (Perl-compatible syntax).   
     32The Unicode Consortium has defined an extensive list of named properties that can
     33be used in regular expressions.
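To make the role of the `Lu` and `Ll` classes concrete, here is a minimal Python sketch built on the standard `unicodedata` module. Python's built-in `re` module does not accept `\p{Lu}`, so the check is written by hand; the helper names `starts_with_capitalized_word` and `count_category` are assumptions for this example. The counts it reports depend on the Unicode version bundled with the Python interpreter and need not equal the Unicode 7.0 figures quoted above.

```python
import sys
import unicodedata

def starts_with_capitalized_word(line: str) -> bool:
    # Hand-written equivalent of ^\p{Lu}\p{Ll}+ : one upper case letter (Lu)
    # followed by at least one lower case letter (Ll) at the start of the line.
    # Checking the first two characters is enough to decide whether a match exists.
    return (
        len(line) >= 2
        and unicodedata.category(line[0]) == "Lu"
        and unicodedata.category(line[1]) == "Ll"
    )

def count_category(cat: str) -> int:
    # Count codepoints whose General_Category equals `cat`,
    # according to the Unicode database shipped with this Python.
    return sum(
        1 for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)) == cat
    )

print(starts_with_capitalized_word("\u00c9cole"))        # starts with U+00C9
print(starts_with_capitalized_word("\u0391\u03b8\u03ae\u03bd\u03b1"))  # Greek word
print(unicodedata.unidata_version, count_category("Lu"), count_category("Ll"))
```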
     35Beyond named properties, Unicode Technical Standard \#18 defines a number of
     36additional requirements for Unicode regular expressions, at three levels of
     37complexity \cite{davis2012unicode}.   We consider only Unicode level 1 requirements in this paper,
     38as most grep implementations are incomplete with respect to Unicode requirements
     39at this level.   At level 1, the primary additional requirements relate to
     40more complicated rules for identifying line breaks, word breaks
     41and case-insensitive matching.   
     42Beyond this, there is one important syntactic
     43extension: the ability to refine character class specifications using set
     44intersection and subtraction.   For example, \verb:[\p{Greek}&&\p{Lu}]:
     45denotes the class of upper case Greek letters, while \verb:[\p{Ll}--\p{ASCII}]:
     46denotes the class of all non-ASCII lower case letters.
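The two set operations can be sketched as membership predicates in Python. This is only an approximation under stated assumptions: the standard `unicodedata` module does not expose the Script property, so membership in `\p{Greek}` is approximated here by testing whether the character name begins with "GREEK ", and `\p{ASCII}` is taken to be the range U+0000 to U+007F. The function names are hypothetical.

```python
import unicodedata

def is_ascii(ch: str) -> bool:
    # \p{ASCII}: codepoints U+0000 through U+007F.
    return ord(ch) <= 0x7F

def is_greek_approx(ch: str) -> bool:
    # ASSUMPTION: approximates \p{Greek} by character name, because
    # unicodedata does not provide the Script property.
    return unicodedata.name(ch, "").startswith("GREEK ")

def in_greek_and_lu(ch: str) -> bool:
    # [\p{Greek}&&\p{Lu}]: intersection of the two classes.
    return is_greek_approx(ch) and unicodedata.category(ch) == "Lu"

def in_ll_minus_ascii(ch: str) -> bool:
    # [\p{Ll}--\p{ASCII}]: lower case letters with the ASCII ones subtracted.
    return unicodedata.category(ch) == "Ll" and not is_ascii(ch)

print(in_greek_and_lu("\u0391"), in_greek_and_lu("\u03b1"), in_greek_and_lu("A"))
print(in_ll_minus_ascii("\u00e9"), in_ll_minus_ascii("e"))
```

Intersection corresponds to a logical "and" of the two membership tests, and subtraction to "and not".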
    17 \subsection{LLVM}
    1952\subsection{Parabix Regular Expression Matching}