source: docs/Working/icGrep/background.tex @ 4449

Last change on this file since 4449 was 4449, checked in by cameron, 4 years ago

Unicode regexp background

File size: 3.0 KB
1\section{Background} \label{sec:background}
2\subsection{Unicode Regular Expression Requirements}
3Unicode is system for organizing the characters from all languages
4and symbol systems into a single numeric space and encoding those
5values in convenient formats for computerized processing.
6The numeric values form a space of integer {\em codepoints} from 0
7through hexadecimal 10FFFF.  The computerized formats represent
8codepoint values as (possibly variable length) sequences of fixed-width {\em code units}.
9UTF-8 represents each codepoint using a sequence of one to four octets (8-bit bytes),
10UTF-16 represents each codepoint using one or two 16-bit code units and UTF-32
11represents each codepoint as a single 32-bit unit.  The format used most
12often for storage and transmission of Unicode data is UTF-8; this is the format
13assumed through this paper.
15Traditional grep syntax is oriented towards
16string search using regular expressions over ASCII or extended-ASCII byte sequences.
17A grep search for a line beginning with a capitalized word might use the
18pattern ``\verb:^[A-Z][a-z]+:'' (``extended'' syntax).  Here, ``\verb:^:'' is a zero-width assertion
19matching only at the start of a line, ``\verb:[A-Z]:'' is a character class
20that matches any single character in the contiguous range of characters froms A through Z,
21while the  plus operator in ``\verb:[a-z]+:'' denotes repetition of one or more lower
22case ASCII letters.   
24While explicit listing of characters of interest is
25practical with ASCII, it is less so with Unicode.   In the Unicode 7.0 database,
26there are 1490 characters categorized as upper case and 1841 categorized as lower case.
27Rather than explicit listing of all characters of interest, then, it is more
28practical to use named character classes, such as \verb:Lu: for upper case letters and
29\verb:Ll: for lower case letters.   Using these names, our search might be rewritten
30to find capitalized words in any language as ``\verb!^[[:Lu:]][[:Ll:]]+!'' (Posix
31syntax)  or ``\verb:^\p{Lu}\p{Ll}+:'' (Perl-compatible syntax).   
32The Unicode consortium has defined an extensive list of named properties that can
33be used in regular expressions.
35Beyond named properties, Unicode Technical Standard \#18 defines a number of
36additional requirements for Unicode regular expressions, at three levels of
37complexity \cite{davis2012unicode}.   We consider only Unicode level 1 requirements in this paper,
38as most grep implementations are incomplete with respect to Unicode requirements
39at this level.   At level 1, the primary additional requirements relate to
40more complicated rules rules for identifying line breaks, word breaks
41and case-insensitive matching.   
42Beyond this, there is one important syntactic
43extension: the ability to refine character class specifications using set
44intersectiona and subtraction.   For example, \verb:[\p{Greek}&&\p{Lu}]:
45denotes the class of upper case Greek letters, while \verb:[\p{Ll}--\p{ASCII}]:
46denotes the class of all non-ASCII lower case letters.
52\subsection{Parabix Regular Expression Matching}
Note: See TracBrowser for help on using the repository browser.