Version 7 (modified by cameron, 4 years ago)


Unicode Level 1 Support in icGrep

RL1.1 Hex Notation

icGrep implements RL1.1 using backslash escape sequences beginning with \x, \u and \U. An arbitrary Unicode codepoint may be represented by 1 to 6 hexadecimal digits enclosed in braces following either the \x or the \u escape. Alternatively, a codepoint may be represented using exactly 8 hexadecimal digits following the \U escape, without braces. Thus, the forms \x{1D11E}, \u{1D11E} and \U0001D11E all represent U+1D11E (musical symbol G clef).

For compatibility with legacy implementations, icGrep also accepts short forms without braces, consisting of 1 or 2 hex digits following \x or exactly 4 hex digits following \u.
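
The hex forms above can be illustrated with a small decoder. This is only a sketch in Python, not icGrep's own parser: the names HEX_ESCAPE and decode_hex_escape are hypothetical, and rejecting values above U+10FFFF is an assumption rather than documented icGrep behavior.

```python
import re

# Hypothetical pattern for the hex forms described above (not icGrep's own parser).
HEX_ESCAPE = re.compile(
    r"""\\(?:
        [xu]\{(?P<braced>[0-9A-Fa-f]{1,6})\}  # braced form after x or u: 1 to 6 digits
      | U(?P<long>[0-9A-Fa-f]{8})             # U form: exactly 8 digits, no braces
      | u(?P<u4>[0-9A-Fa-f]{4})               # legacy short form after u: exactly 4 digits
      | x(?P<x2>[0-9A-Fa-f]{1,2})             # legacy short form after x: 1 or 2 digits
    )""",
    re.VERBOSE,
)


def decode_hex_escape(text):
    """Return the codepoint denoted by a single hex escape sequence."""
    match = HEX_ESCAPE.fullmatch(text)
    if match is None:
        raise ValueError(f"not a recognized hex escape: {text!r}")
    # Exactly one of the named groups captured the digits.
    digits = next(g for g in match.groups() if g is not None)
    codepoint = int(digits, 16)
    if codepoint > 0x10FFFF:  # assumption: out-of-range values are rejected
        raise ValueError(f"beyond U+10FFFF: {text!r}")
    return codepoint
```

With this sketch, decode_hex_escape(r"\x{1D11E}"), decode_hex_escape(r"\u{1D11E}") and decode_hex_escape(r"\U0001D11E") all yield 0x1D11E, while the short forms r"\x41" and r"\u00e9" yield 0x41 and 0xE9.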

Also for compatibility with legacy implementations, icGrep accepts octal notation. An arbitrary codepoint may be represented by 1 to 8 octal digits enclosed in braces following the \o escape. The short form consisting of 0 to 3 octal digits following \0 (without braces) is also recognized.
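
The octal forms can be sketched the same way. The names OCTAL_ESCAPE and decode_octal_escape are hypothetical, and treating a bare \0 (zero following digits) as U+0000 is an assumption about the intended meaning, not something stated above.

```python
import re

# Hypothetical pattern: braced form with 1 to 8 octal digits after "o",
# or a literal 0 followed by 0 to 3 further octal digits.
OCTAL_ESCAPE = re.compile(r"\\(?:o\{([0-7]{1,8})\}|0([0-7]{0,3}))")


def decode_octal_escape(text):
    """Return the codepoint denoted by a single octal escape sequence."""
    match = OCTAL_ESCAPE.fullmatch(text)
    if match is None:
        raise ValueError(f"not a recognized octal escape: {text!r}")
    braced, short = match.groups()
    digits = braced if braced is not None else short
    # Assumption: a bare backslash-zero with no further digits denotes U+0000.
    codepoint = int(digits, 8) if digits else 0
    if codepoint > 0x10FFFF:  # assumption: out-of-range values are rejected
        raise ValueError(f"beyond U+10FFFF: {text!r}")
    return codepoint
```

Under these assumptions, decode_octal_escape(r"\o{350436}") yields 0x1D11E (the G clef again) and decode_octal_escape(r"\0101") yields 0x41.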

RL1.2 Properties

icGrep implements the full set of Unicode properties required by RL1.2. Properties may be designated by their full names or their aliases, with any variation in spelling permitted by the loose matching rules of Unicode Standard Annex #44. The following syntactic alternatives are supported:

  • \p{property-name} for binary properties
  • \p{property-name=property-value}
  • \p{property-value} for values of the General_Category or Script properties.

Following Perl syntactic conventions, negated forms of property expressions (matching all values not having the specified property) use the \P syntax.
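
As a rough illustration of how these forms decompose, the following Python sketch splits a property expression into a negation flag, a property name and an optional value, and compares names loosely. The names PROPERTY_EXPR, loose_key and parse_property are hypothetical, and loose_key is a simplified reading of the UAX #44 loose matching rule (it ignores case, whitespace, underscores and hyphens, but does not handle the optional "is" prefix).

```python
import re

# Hypothetical pattern: \p{...} or \P{...} with an optional =value part.
PROPERTY_EXPR = re.compile(r"\\([pP])\{([^=}]+)(?:=([^}]+))?\}")


def loose_key(name):
    """Simplified loose matching: drop whitespace, '_' and '-', fold case."""
    return re.sub(r"[\s_\-]", "", name).lower()


def parse_property(expr):
    """Return (negated, loose property-or-value name, loose value or None)."""
    match = PROPERTY_EXPR.fullmatch(expr)
    if match is None:
        raise ValueError(f"not a property expression: {expr!r}")
    kind, name, value = match.groups()
    negated = kind == "P"  # upper-case P selects the complement
    return negated, loose_key(name), None if value is None else loose_key(value)
```

For instance, parse_property(r"\p{General_Category=Letter}") and parse_property(r"\p{general-category = LETTER}") produce the same triple, and parse_property(r"\P{White_Space}") reports a negated expression with no explicit value.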

1.2.1 General_Category

icGrep implements the General_Category property using full property-value names, or the standard one- or two-letter codes. For example, the following notations all represent expressions matching any codepoint in the general category Letter: \p{Letter}, \p{General_Category=Letter}, \p{L}, \p{generalcategory=l}.

In addition, icGrep implements \p{ANY}, \p{ASCII}, and \p{ASSIGNED} as equivalent to [\u{0}-\u{10FFFF}], [\u{0}-\u{7F}], and \P{GC=Unassigned} respectively.
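
The meaning of these designations can be approximated with Python's standard unicodedata module. This is a sketch only: GC_ALIASES is a tiny hand-written subset of the value aliases, all function names are hypothetical, and the results depend on the Unicode version bundled with the Python interpreter rather than the one used by icGrep.

```python
import unicodedata

# Hand-written subset of General_Category value aliases, keyed by loose name.
GC_ALIASES = {
    "l": "L", "letter": "L",
    "lu": "Lu", "uppercaseletter": "Lu",
    "ll": "Ll", "lowercaseletter": "Ll",
    "n": "N", "number": "N",
    "nd": "Nd", "decimalnumber": "Nd",
    "cn": "Cn", "unassigned": "Cn",
}


def loose(name):
    """Fold case and drop spaces, underscores and hyphens."""
    return "".join(c for c in name.lower() if c not in " _-")


def in_general_category(cp, name):
    """True if codepoint cp belongs to the named category or category group."""
    code = GC_ALIASES[loose(name)]
    # A one-letter code such as "L" covers every two-letter code starting with it.
    return unicodedata.category(chr(cp)).startswith(code)


def is_any(cp):
    return 0 <= cp <= 0x10FFFF


def is_ascii(cp):
    return 0 <= cp <= 0x7F


def is_assigned(cp):
    return not in_general_category(cp, "Unassigned")
```

Here in_general_category(0xE9, "Letter") and in_general_category(0xE9, "l") agree, and is_assigned(0x0378) is false because U+0378 is a reserved codepoint.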

1.2.2 Script and Script Extensions Properties

Codepoints having a particular Script property value may be specified by the script name or its 4-letter code. For example, \p{Arab}, \p{script=Arabic} and \p{sc=arab} are all equivalent script designations.

To specify codepoints whose Script_Extensions property includes a particular value, the property name or its short form scx must be specified, for example \p{scx=arab}.

1.2.3 Other Properties

icGrep implements each of the binary properties required by RL1.2 (Alphabetic, Uppercase, Lowercase, White_Space, Noncharacter_Code_Point, Default_Ignorable_Code_Point). When the property name is specified alone, codepoints having the value Y for the property are selected. The property-value may be specified if desired. The following are all equivalent for codepoints that are not Uppercase: \p{Uppercase=N}, \P{uppercase=true}, \P{upper}.
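
The equivalence of the three notations can be made concrete by reducing each expression to a property name and the value being selected. The sketch below is illustrative only: the alias tables are a small hand-written subset, and the names EXPR, BINARY_ALIASES and resolve_binary are hypothetical.

```python
import re

EXPR = re.compile(r"\\([pP])\{([^=}]+)(?:=([^}]+))?\}")

# Small hand-written subset of binary property aliases, keyed by loose name.
BINARY_ALIASES = {
    "upper": "Uppercase", "uppercase": "Uppercase",
    "lower": "Lowercase", "lowercase": "Lowercase",
    "alpha": "Alphabetic", "alphabetic": "Alphabetic",
}
TRUE_VALUES = {"y", "yes", "t", "true"}
FALSE_VALUES = {"n", "no", "f", "false"}


def loose(name):
    return re.sub(r"[\s_\-]", "", name).lower()


def resolve_binary(expr):
    """Return (property name, value selected) for a binary property expression."""
    match = EXPR.fullmatch(expr)
    if match is None:
        raise ValueError(f"not a property expression: {expr!r}")
    kind, name, value = match.groups()
    prop = BINARY_ALIASES[loose(name)]
    if value is None:
        wanted = True  # a bare property name selects the value Y
    elif loose(value) in TRUE_VALUES:
        wanted = True
    elif loose(value) in FALSE_VALUES:
        wanted = False
    else:
        raise ValueError(f"not a binary value: {value!r}")
    if kind == "P":  # the negated form flips the selection
        wanted = not wanted
    return prop, wanted
```

All of resolve_binary(r"\p{Uppercase=N}"), resolve_binary(r"\P{uppercase=true}") and resolve_binary(r"\P{upper}") reduce to ("Uppercase", False).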

Many other binary properties specified by the Unicode Character Database (UCD) are also supported by icGrep. These include all the binary properties specified in the PropList.txt and DerivedCoreProperties.txt files.

1.2.4 Age

icGrep does not implement the Age property (not required at Unicode Level 1).

1.2.5 Blocks

icGrep implements the Block property. The property name Block or its short form blk must be used in property designations. For example, codepoints in the Greek_and_Coptic block (U+0370..U+03FF) may be specified using \p{blk=Greek}. By contrast, the notation \p{Greek} specifies codepoints in the Greek script; this set omits unassigned codepoints within the Greek_and_Coptic block and includes codepoints from other blocks such as Greek_Extended.
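
The difference between the block and the script can be checked for individual codepoints. Python's standard library carries no Script data, so this sketch only demonstrates the block side: the range is taken from the text above, the function names are hypothetical, and the unassigned check depends on the Unicode version bundled with Python.

```python
import unicodedata

GREEK_AND_COPTIC = range(0x0370, 0x0400)  # block U+0370..U+03FF


def in_greek_block(cp):
    """Block membership is a pure range test, assigned or not."""
    return cp in GREEK_AND_COPTIC


def is_unassigned(cp):
    return unicodedata.category(chr(cp)) == "Cn"


# U+0378 lies inside the block but is a reserved (unassigned) codepoint.
inside_but_unassigned = in_greek_block(0x0378) and is_unassigned(0x0378)

# U+1F00 is a Greek letter that lives in another block (Greek_Extended).
outside_block_name = unicodedata.name(chr(0x1F00))
outside_block = not in_greek_block(0x1F00)
```

A codepoint such as U+0378 is therefore selected by the block designation but not by the script designation, while U+1F00 (whose character name begins with GREEK) is selected by the script designation only.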

RL1.3 Subtraction and Intersection

icGrep implements set subtraction and intersection within character class expressions. The -- operator specifies subtraction, while the && operator specifies intersection. For example, Greek upper case letters may be specified using [\p{Greek}&&\p{Lu}].
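
The two operators correspond to ordinary set intersection and difference over codepoints. The Python sketch below models this with built-in sets; because the standard library has no Script data, the Greek_and_Coptic block range is used as a stand-in for \p{Greek}, which is an assumption made only for illustration. By analogy with the example above, the subtraction would presumably be written [\p{Greek}--\p{Lu}].

```python
import unicodedata

ALL_CODEPOINTS = range(0x110000)

# Codepoints with General_Category Lu, per the Unicode data bundled with Python.
uppercase_letters = {
    cp for cp in ALL_CODEPOINTS if unicodedata.category(chr(cp)) == "Lu"
}

# Stand-in for the Greek script: the Greek_and_Coptic block U+0370..U+03FF.
greek = set(range(0x0370, 0x0400))

greek_upper = greek & uppercase_letters      # intersection, like the && operator
greek_not_upper = greek - uppercase_letters  # subtraction, like the -- operator
```

In this model U+0391 (capital alpha) falls in greek_upper, while U+03B1 (small alpha) falls in greek_not_upper.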

RL1.4 Simple Word Boundaries

\b: description to be written

RL1.5 Simple Loose Matches

(?i) description to be written

RL1.6 Line Boundaries

icGrep implements full Unicode line boundaries. Additional description to follow.

RL1.7 Supplementary Code Points

icGrep supports the full range of Unicode codepoints including codepoints in the supplementary planes above U+FFFF. Isolated surrogate code points occurring in a UTF-8 file may be matched using \u{D800}, for example.
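
To show what such input looks like at the byte level, the following Python sketch produces the UTF-8 byte sequences for a supplementary-plane codepoint and for an isolated surrogate. The helper name utf8_bytes is hypothetical; Python's "surrogatepass" error handler is used because strict UTF-8 encoders refuse surrogates. This says nothing about how icGrep processes these bytes internally.

```python
def utf8_bytes(cp):
    """Encode one codepoint as UTF-8, allowing isolated surrogates."""
    return chr(cp).encode("utf-8", errors="surrogatepass")


g_clef = utf8_bytes(0x1D11E)     # four-byte sequence for a supplementary codepoint
surrogate = utf8_bytes(0xD800)   # three-byte sequence for an isolated surrogate


def is_strict_utf8(data):
    """True if data is well-formed UTF-8 under Python's strict decoder."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

The G clef encodes as the bytes F0 9D 84 9E, and U+D800 as ED A0 80; a strict decoder accepts the former and rejects the latter, which is why matching isolated surrogates needs explicit support.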