Version 2 (modified by cameron, 4 years ago)


Unicode Level 1 Support in icGrep

RL1.1 Hex Notation

icGrep implements RL1.1 using backslash escape sequences beginning with \x, \u and \U. An arbitrary Unicode codepoint may be represented by 1 to 6 hexadecimal digits enclosed in braces following either the \x or the \u escape. Alternatively, a codepoint may be represented by exactly 8 hexadecimal digits following the \U escape, without braces. Thus the forms \x{1D11E}, \u{1D11E} and \U0001D11E all represent U+1D11E (MUSICAL SYMBOL G CLEF).
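The braced and 8-digit forms described above can be sketched in Python as follows. This is only an illustration of the notation, not icGrep's own parser; the function name parse_hex_escape is hypothetical, and rejecting values above U+10FFFF is an assumption of the sketch.

```python
import re

# Illustrative sketch only, not icGrep's parser.
_HEX_ESCAPE = re.compile(
    r"\\[xu]\{([0-9A-Fa-f]{1,6})\}"   # \x{h...} or \u{h...}, 1 to 6 hex digits
    r"|\\U([0-9A-Fa-f]{8})"            # \U followed by exactly 8 hex digits
)

MAX_CODEPOINT = 0x10FFFF


def parse_hex_escape(text: str) -> int:
    """Return the codepoint denoted by one braced or 8-digit hex escape."""
    m = _HEX_ESCAPE.fullmatch(text)
    if m is None:
        raise ValueError(f"not a recognized hex escape: {text!r}")
    codepoint = int(m.group(1) or m.group(2), 16)
    # Assumption of this sketch: values beyond the Unicode range are errors.
    if codepoint > MAX_CODEPOINT:
        raise ValueError(f"codepoint out of range: {text!r}")
    return codepoint
```

With this sketch, the three forms \x{1D11E}, \u{1D11E} and \U0001D11E all yield 0x1D11E, while a form with a stray brace or the wrong digit count is rejected.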

For compatibility with legacy implementations, icGrep also accepts short forms without braces, consisting of 1 or 2 hex digits following \x or exactly 4 hex digits following \u.

Also for compatibility with legacy implementations, icGrep accepts octal notation. An arbitrary codepoint may be represented by 1 to 8 octal digits enclosed in braces following the \o escape. The short form consisting of 0 to 3 octal digits following \0 (without braces) is also recognized.
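As a rough illustration of the legacy forms, the Python sketch below recognizes the short hex forms and both octal forms. It is not icGrep's implementation: the name parse_legacy_escape is hypothetical, and treating a bare \0 as U+0000 and rejecting values above U+10FFFF are assumptions of the sketch.

```python
import re

# Illustrative sketch only, not icGrep's parser.
_LEGACY_ESCAPE = re.compile(
    r"\\x([0-9A-Fa-f]{1,2})"    # \x with 1 or 2 hex digits
    r"|\\u([0-9A-Fa-f]{4})"     # \u with exactly 4 hex digits
    r"|\\o\{([0-7]{1,8})\}"     # \o{...} with 1 to 8 octal digits
    r"|\\0([0-7]{0,3})"         # \0 followed by 0 to 3 octal digits
)


def parse_legacy_escape(text: str) -> int:
    """Return the codepoint denoted by one legacy short-form or octal escape."""
    m = _LEGACY_ESCAPE.fullmatch(text)
    if m is None:
        raise ValueError(f"not a recognized legacy escape: {text!r}")
    hex_digits = m.group(1) or m.group(2)
    if hex_digits is not None:
        return int(hex_digits, 16)
    if m.group(3) is not None:
        codepoint = int(m.group(3), 8)
    else:
        # Assumption: a bare \0 with no further digits denotes U+0000.
        codepoint = int(m.group(4) or "0", 8)
    if codepoint > 0x10FFFF:
        raise ValueError(f"codepoint out of range: {text!r}")
    return codepoint
```

For example, \u00E9 and \o{351} both yield 0xE9 under this sketch, and \0101 yields 0x41.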

RL1.2 Properties

icGrep implements the full set of Unicode properties required by RL1.2. Properties may be named by their full property names, by their aliases, or by any variant of either that is permitted by the matching rules of Unicode Standard Annex #44. The following syntactic alternatives are supported:

  • \p{property-name} for binary properties
  • \p{property-name=property-value}
  • \p{property-value} for values of the General_Category or Script properties.

Following Perl syntactic conventions, negated forms of property expressions (matching all codepoints that do not have the specified property) use the \P syntax.
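The three syntactic alternatives and the \P negation can be separated with a small parser. The Python sketch below is illustrative only; PropertyExpr and parse_property are hypothetical names, and the sketch does not consult any property tables, so it cannot tell a binary property name from a bare General_Category or Script value.

```python
import re
from typing import NamedTuple, Optional


class PropertyExpr(NamedTuple):
    negated: bool          # True for \P{...}, False for \p{...}
    key: str               # property name, or the bare name/value
    value: Optional[str]   # property value when "name=value" was given


# Illustrative sketch only, not icGrep's parser.
_PROPERTY = re.compile(r"\\([pP])\{([^{}=]+)(?:=([^{}=]+))?\}")


def parse_property(text: str) -> PropertyExpr:
    """Split one property expression into its syntactic parts."""
    m = _PROPERTY.fullmatch(text)
    if m is None:
        raise ValueError(f"not a property expression: {text!r}")
    negated = m.group(1) == "P"
    key = m.group(2).strip()
    value = m.group(3).strip() if m.group(3) is not None else None
    # With no "=", key is either a binary property name or a value of
    # General_Category or Script; resolving which needs property tables.
    return PropertyExpr(negated, key, value)
```

Under this sketch, \P{script=Arabic} parses to a negated expression with key "script" and value "Arabic", while \p{Letter} parses to a non-negated expression with key "Letter" and no value.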

1.2.1 General_Category

icGrep implements the General_Category property using full property-value names, or the standard one- or two-letter codes. For example, the following notations all represent expressions matching any codepoint in the general category Letter: \p{Letter}, \p{General_Category=Letter}, \p{L}, \p{generalcategory=l}.
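The equivalence of these four notations follows from loose matching of names. The Python sketch below is a simplified illustration that ignores case, whitespace, underscores and hyphens when comparing names. The alias tables are a tiny hand-written excerpt and the function names are hypothetical; none of this is icGrep's actual code or data.

```python
import re


def loose_key(name: str) -> str:
    """Normalize a name by dropping whitespace, underscores, hyphens and case."""
    return re.sub(r"[\s_\-]", "", name).lower()


# Tiny illustrative excerpts, keyed by normalized name.
PROPERTY_ALIASES = {"gc": "General_Category", "generalcategory": "General_Category"}
GC_VALUES = {
    "l": "Letter",
    "letter": "Letter",
    "lu": "Uppercase_Letter",
    "uppercaseletter": "Uppercase_Letter",
}


def resolve_general_category(body: str) -> str:
    """Resolve the text inside \\p{...} to a canonical General_Category value."""
    name, sep, value = body.partition("=")
    if sep:
        if PROPERTY_ALIASES.get(loose_key(name)) != "General_Category":
            raise ValueError(f"not a General_Category expression: {body!r}")
    else:
        value = name  # bare value form, as in \p{L}
    try:
        return GC_VALUES[loose_key(value)]
    except KeyError:
        raise ValueError(f"unknown General_Category value: {value!r}") from None
```

With these tables, "Letter", "General_Category=Letter", "L" and "generalcategory=l" all resolve to "Letter".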

In addition, icGrep implements \p{ANY}, \p{ASCII}, and \p{ASSIGNED} as equivalent to [\u{0}-\u{10FFFF}], [\u{0}-\u{7F}], and \P{GC=Unassigned} respectively.
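These three equivalences can be stated as simple predicates on codepoints. The Python sketch below uses the standard unicodedata module, so the result for ASSIGNED depends on the Unicode version bundled with the Python interpreter rather than on icGrep's tables; the function names are hypothetical.

```python
import unicodedata


def is_any(cp: int) -> bool:
    """\\p{ANY}: every codepoint from U+0000 to U+10FFFF."""
    return 0 <= cp <= 0x10FFFF


def is_ascii(cp: int) -> bool:
    """\\p{ASCII}: codepoints from U+0000 to U+007F."""
    return 0 <= cp <= 0x7F


def is_assigned(cp: int) -> bool:
    """\\p{ASSIGNED}: General_Category is anything other than Unassigned (Cn)."""
    return unicodedata.category(chr(cp)) != "Cn"
```

For example, U+0041 satisfies all three predicates, while U+0080 satisfies ANY and ASSIGNED but not ASCII.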

1.2.2 Script and Script Extensions Properties

Codepoints having particular Script property values may be specified by the script name or its 4-letter code. For example, \p{Arab}, \p{script=Arabic} and \p{sc=arab} are all equivalent script designations.

To specify codepoints whose Script_Extensions property includes a particular value, the property name Script_Extensions or its short form scx must be given explicitly, for example \p{scx=arab}.
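The difference between the two properties can be illustrated with a small lookup. The Python sketch below uses a hand-written, deliberately partial excerpt of Unicode script data for two codepoints; the tables and function names are hypothetical and are not icGrep's data structures. It relies on the Unicode convention that a codepoint without an explicit Script_Extensions entry has just its Script value as its extension set.

```python
# Partial, hand-written excerpt for illustration only.
SCRIPT = {
    0x0627: "Arab",  # ARABIC LETTER ALEF
    0x0640: "Zyyy",  # ARABIC TATWEEL has Script=Common
}
SCRIPT_EXTENSIONS = {
    # Incomplete set: TATWEEL is shared by Arabic and several other scripts.
    0x0640: {"Arab", "Syrc", "Mand"},
}


def matches_sc(cp: int, script: str) -> bool:
    """Script test: the single Script value must equal the given code."""
    return SCRIPT.get(cp) == script


def matches_scx(cp: int, script: str) -> bool:
    """Script_Extensions test: the extension set must include the given code."""
    extensions = SCRIPT_EXTENSIONS.get(cp)
    if extensions is None:
        # No explicit entry: the extension set is just the Script value.
        return matches_sc(cp, script)
    return script in extensions
```

In this excerpt, U+0640 is not matched by the Script test for Arab, since its Script value is Common, but it is matched by the Script_Extensions test for Arab. U+0627 is matched by both.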