Unicode Level 1 Support in icGrep

RL1.1 Hex Notation

icGrep implements RL1.1 using backslash escape sequences beginning with \x, \u and \U. An arbitrary Unicode codepoint may be represented by 1 to 6 hexadecimal digits enclosed in braces following either the \x or the \u escape. Alternatively, a codepoint may be represented by exactly 8 hexadecimal digits following the \U escape, without braces. Thus the forms \x{1D11E}, \u{1D11E} and \U0001D11E all represent U+1D11E (musical symbol G clef).

For compatibility with legacy implementations, icGrep also accepts short forms without braces, consisting of 1 or 2 hex digits following \x or exactly 4 hex digits following \u.

Also for compatibility with legacy implementations, icGrep accepts octal notation. An arbitrary codepoint may be represented by 1 to 8 octal digits enclosed in braces following the \o escape. The short form consisting of 0 to 3 octal digits following \0 (without braces) is also recognized.
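
The escape forms described in this section can be summarized as a short decoder. This is a minimal Python sketch, not icGrep's parser: the name decode_escape and the regular expressions are illustrative assumptions that only restate the digit counts given above.

```python
import re

# Each entry mirrors one escape form from the text: group 1 captures the
# digits, and the paired integer is the numeric base of those digits.
_ESCAPE_FORMS = [
    (re.compile(r"\\[xu]\{([0-9A-Fa-f]{1,6})\}"), 16),  # \x{...} or \u{...}
    (re.compile(r"\\U([0-9A-Fa-f]{8})"), 16),            # \U + exactly 8 hex digits
    (re.compile(r"\\x([0-9A-Fa-f]{1,2})"), 16),          # legacy \xH or \xHH
    (re.compile(r"\\u([0-9A-Fa-f]{4})"), 16),            # legacy \uHHHH
    (re.compile(r"\\o\{([0-7]{1,8})\}"), 8),             # \o{...}
    (re.compile(r"\\0([0-7]{0,3})"), 8),                 # legacy \0 + 0 to 3 octal digits
]


def decode_escape(text):
    """Return the codepoint named by a single escape, or raise ValueError."""
    for pattern, base in _ESCAPE_FORMS:
        match = pattern.fullmatch(text)
        if match:
            digits = match.group(1) or "0"  # a bare \0 has no digits: U+0000
            codepoint = int(digits, base)
            if codepoint > 0x10FFFF:
                raise ValueError("beyond the Unicode codespace: " + text)
            return codepoint
    raise ValueError("not a recognized escape: " + text)
```

With this sketch, decode_escape(r"\x{1D11E}"), decode_escape(r"\u{1D11E}") and decode_escape(r"\U0001D11E") all return 0x1D11E, while a malformed form such as \U0001D11E} is rejected.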

RL1.2 Properties

icGrep implements the full set of Unicode properties required by RL1.2. Properties may be named by their full property names, by their aliases, or by any variant of either that is permitted by the matching rules of Unicode Standard Annex #44. The following syntactic alternatives are supported.

  • \p{property-name} for binary properties
  • \p{property-name=property-value}
  • \p{property-value} for values of the General_Category or Script properties.

Following Perl syntactic conventions, negated forms of property expressions (matching all codepoints that do not have the specified property) use the \P syntax.
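
The loose matching of property names can be illustrated with a small normalization function. This is a sketch of the UAX #44 rule for symbolic values (UAX44-LM3: ignore case, whitespace, underscores, hyphens and an initial "is" prefix), written in Python for illustration only; loose_key and loosely_equal are hypothetical names, and icGrep's own implementation may differ in detail.

```python
import re


def loose_key(name):
    """Reduce a property name or value to a key for loose comparison."""
    # Drop whitespace, underscores and hyphens, then ignore case.
    key = re.sub(r"[\s_\-]", "", name).lower()
    # An initial "is" prefix is also ignored under this rule.
    if key.startswith("is"):
        key = key[2:]
    return key


def loosely_equal(a, b):
    """True if two property names or values match under loose matching."""
    return loose_key(a) == loose_key(b)
```

Because both sides of a comparison are normalized the same way, General_Category, generalcategory and "general category" all compare equal.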

RL1.2a Compatibility Properties

Except for \X for extended grapheme clusters, icGrep implements the full set of compatibility properties in Annex C of Unicode Technical Standard #18, following Unicode definitions in preference to POSIX definitions. POSIX bracket expressions such as [:punct:] may be used within character classes.

1.2.1 General_Category

icGrep implements the General_Category property using full property-value names, or the standard one- or two-letter codes. For example, the following notations all represent expressions matching any codepoint in the general category Letter: \p{Letter}, \p{General_Category=Letter}, \p{L}, \p{generalcategory=l}, [[:gc=l:]].

In addition, icGrep implements \p{ANY}, \p{ASCII}, and \p{ASSIGNED} as equivalent to [\u{0}-\u{10FFFF}], [\u{0}-\u{7F}], and \P{GC=Unassigned} respectively.
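
The relationship between one-letter and two-letter General_Category codes, and the definitions of \p{ASCII} and \p{ASSIGNED} given above, can be sketched with Python's unicodedata module. The function names are illustrative, and the Unicode version bundled with Python may differ from the one icGrep was built against, so results for recently assigned codepoints are an assumption.

```python
import unicodedata


def has_general_category(ch, code):
    """True if ch belongs to the category named by a one- or two-letter code.

    A one-letter code such as "L" covers every two-letter code that starts
    with it (Lu, Ll, Lt, Lm, Lo).
    """
    return unicodedata.category(ch).startswith(code)


def is_ascii(ch):
    """Equivalent of [\\u{0}-\\u{7F}]."""
    return ord(ch) <= 0x7F


def is_assigned(ch):
    """Equivalent of \\P{GC=Unassigned}: any category other than Cn."""
    return unicodedata.category(ch) != "Cn"
```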

1.2.2 Script and Script Extensions Properties

Codepoints having particular Script property values may be specified by the script name or its 4-letter code. \p{Arab}, \p{script=Arabic}, \p{sc=arab} are all equivalent script designations.

To specify codepoints whose Script_Extensions property includes a particular value, the property name or its short form scx must be specified, for example \p{scx=arab}.

1.2.3 Other Properties

icGrep implements each of the binary properties required by RL1.2 (Alphabetic, Uppercase, Lowercase, White_Space, Noncharacter_Code_Point, Default_Ignorable_Code_Point). When the property name is specified alone, codepoints having the value Y for the property are selected. The property-value may be specified if desired. The following are all equivalent for codepoints that are not Uppercase: \p{Uppercase=N}, \P{uppercase=true}, \P{upper}.
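
The equivalence of the three notations above follows from combining the \p or \P polarity with the property value. The following Python sketch makes that explicit; parse_binary_property is a hypothetical helper, and its alias tables contain only a handful of entries from the UCD alias files, chosen for illustration.

```python
import re

# Tiny illustrative alias tables; the UCD alias files are far larger.
_NAME_ALIASES = {"upper": "uppercase", "lower": "lowercase", "alpha": "alphabetic"}
_TRUE_VALUES = {"y", "yes", "t", "true"}
_FALSE_VALUES = {"n", "no", "f", "false"}


def _key(text):
    """Loose-matching key: ignore case, whitespace, underscores, hyphens."""
    return re.sub(r"[\s_\-]", "", text).lower()


def parse_binary_property(expr):
    """Return (property_name, selects_true) for a binary property expression."""
    match = re.fullmatch(r"\\([pP])\{([^=}]+)(?:=([^}]+))?\}", expr)
    if not match:
        raise ValueError("not a property expression: " + expr)
    negated = match.group(1) == "P"
    name = _key(match.group(2))
    name = _NAME_ALIASES.get(name, name)
    # A property name on its own selects codepoints with the value Y.
    value = _key(match.group(3)) if match.group(3) else "y"
    if value in _TRUE_VALUES:
        selects_true = True
    elif value in _FALSE_VALUES:
        selects_true = False
    else:
        raise ValueError("not a binary property value: " + value)
    # \P flips whichever set the value selected.
    return name, selects_true != negated
```

All three notations from the text reduce to ("uppercase", False), that is, the codepoints that are not Uppercase.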

Many other binary properties specified by the Unicode Database (UCD) are also supported by icGrep. These include all the binary properties specified in the PropList.txt and DerivedCoreProperties.txt files.

1.2.4 Age

icGrep does not implement the Age property (not required at Unicode Level 1).

1.2.5 Blocks

icGrep implements the Block property. The property name Block or its short form blk must be used in property designations. For example, codepoints in the Greek_and_Coptic block [0370..03FF] may be specified using \p{blk=Greek}. Note that the notation \p{Greek} specifies codepoints in the Greek script, which omits unassigned codepoints within the Greek block, and includes codepoints from other blocks such as the Greek_Extended block.
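
The difference between a block and a script can be seen by treating the block as a plain codepoint range. The Python sketch below uses only the range quoted above; the names are illustrative, and the list of unassigned codepoints depends on the Unicode version bundled with the Python interpreter.

```python
import unicodedata

# The Greek_and_Coptic block as given in the text: 0370..03FF inclusive.
GREEK_AND_COPTIC = range(0x0370, 0x0400)


def in_greek_block(codepoint):
    """Block membership is purely a range test."""
    return codepoint in GREEK_AND_COPTIC


def unassigned_in_block(block):
    """Codepoints inside the block that have General_Category Cn."""
    return [cp for cp in block if unicodedata.category(chr(cp)) == "Cn"]
```

U+03A2, for example, lies inside the block but is unassigned, so a block-based expression covers it while a script-based one does not. Conversely, U+1F00 in the Greek_Extended block falls outside the range.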

RL1.3 Subtraction and Intersection

icGrep implements set subtraction and intersection within character class expressions. The -- operator specifies subtraction, while the && operator specifies intersection. For example, Greek upper case letters may be specified using [\p{Greek}&&\p{Lu}].
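
The two operators behave like ordinary set intersection and difference over codepoints. The Python sketch below models [\p{blk=Greek}&&\p{Lu}] and [\p{blk=Greek}--\p{Lu}]; it uses the Greek block rather than the Greek script because Python's standard library does not expose the Script property, and it restricts the universe to U+0000..U+04FF purely to keep the example small. Both choices are assumptions of the sketch.

```python
import unicodedata

# Assumed small universe of codepoints, for illustration only.
UNIVERSE = [chr(cp) for cp in range(0x0000, 0x0500)]

# \p{blk=Greek}: codepoints in the Greek_and_Coptic block.
greek_block = {ch for ch in UNIVERSE if 0x0370 <= ord(ch) <= 0x03FF}

# \p{Lu}: codepoints with General_Category Uppercase_Letter.
uppercase_letters = {ch for ch in UNIVERSE if unicodedata.category(ch) == "Lu"}

# [\p{blk=Greek}&&\p{Lu}]: intersection.
greek_upper = greek_block & uppercase_letters

# [\p{blk=Greek}--\p{Lu}]: subtraction.
greek_not_upper = greek_block - uppercase_letters
```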

RL1.4 Simple Word Boundaries

\b: icGrep implements zero-width word-boundary assertions using the Unicode definition of word characters.
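
The effect of a Unicode definition of word characters can be seen by comparison with an ASCII-only definition. The sketch below uses Python's re module, which is a different engine from icGrep and is shown only because its \b and \w also follow Unicode definitions by default for string patterns; the helper name is illustrative.

```python
import re


def words(text, ascii_only=False):
    """Return the runs of word characters delimited by \\b assertions."""
    flags = re.ASCII if ascii_only else 0
    return re.findall(r"\b\w+\b", text, flags)
```

With Unicode word characters, "na\u00efve caf\u00e9" yields two words. With an ASCII-only definition the accented letters count as non-word characters, so boundaries appear in the middle of each word.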

RL1.5 Simple Loose Matches

icGrep implements case-insensitive matching using Unicode simple case-folding rules. Case-insensitive matching is specified with the -i command-line parameter or applied within individual parts of a regular expression using the (?i) and (?i:<regexp>) notations. Case-insensitive matching applies to literally and numerically specified characters; icGrep does not apply case-insensitivity rules to change the interpretation of property expressions.
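
The scoped (?i:<regexp>) notation limits case-insensitivity to the enclosed part of the expression. The sketch below demonstrates the scoping with Python's re module, which accepts the same notation; it is a different engine from icGrep, so it illustrates only the syntax, not icGrep's case-folding tables. The helper name is illustrative.

```python
import re


def matches(pattern, text):
    """True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None


# Only "gr" is case-insensitive; "eek" must match exactly.
SCOPED = r"(?i:gr)eek"

# A literal non-ASCII character inside the scoped group:
# GREEK SMALL LETTER SIGMA (U+03C3).
SIGMA = "(?i:\u03c3)"
```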

RL1.6 Line Boundaries

icGrep implements full Unicode line boundaries.
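
RL1.6 of UTS #18 lists the line terminators LF, VT, FF, CR, NEL (U+0085), LS (U+2028) and PS (U+2029), with the sequence CR LF treated as a single boundary. The Python sketch below splits text on exactly that set; it restates the standard's list and is not taken from icGrep's source, and split_lines is a hypothetical name.

```python
import re

# CR LF is listed first so that it is consumed as one terminator
# rather than as a CR followed by a separate LF.
_LINE_TERMINATOR = re.compile("\r\n|[\n\x0b\x0c\r\x85\u2028\u2029]")


def split_lines(text):
    """Split text at every Unicode line boundary named in RL1.6."""
    return _LINE_TERMINATOR.split(text)
```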

RL1.7 Supplementary Code Points

icGrep supports the full range of Unicode codepoints including codepoints in the supplementary planes above U+FFFF. Isolated surrogate code points occurring in a UTF-8 file may be matched using \u{D800}, for example.
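
For reference, the byte sequences involved can be computed directly. The Python sketch below encodes a supplementary codepoint and an isolated surrogate; the surrogatepass error handler is needed because a surrogate on its own is not valid in well-formed UTF-8, which is why such bytes appear only in irregular input. The helper name is illustrative.

```python
def utf8_bytes(codepoint):
    """Encode one codepoint as UTF-8, allowing isolated surrogates."""
    return chr(codepoint).encode("utf-8", errors="surrogatepass")


g_clef = utf8_bytes(0x1D11E)     # four bytes: supplementary plane
surrogate = utf8_bytes(0xD800)   # three bytes: isolated high surrogate
```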

Summary

By meeting each of the requirements RL1.1 through RL1.7, icGrep fully satisfies the Unicode Level 1 requirements of Unicode Technical Standard #18.

Beyond Unicode Level 1

Unicode Level 2 Support in icGrep

Last modified on Nov 1, 2015, 10:38:05 AM