wiki:CharacterClassCompiler

The Parabix Character Class Compilers

Basic Concepts

Character Class Bit Streams

Given an input stream of character code units, a character class bit stream is a stream of bits defined in one-to-one correspondence with the input stream such that one bits mark instances of character code units within the class, and zero bits mark instances of character code units outside the class.

For example, consider the ASCII character class expression [abc] standing for the class comprising ASCII bytes having code unit values 97 (ASCII value for a), 98 (ASCII value for b) or 99 (ASCII value for c). The following example shows the [abc] character class bit stream aligned with an example ASCII input stream.

input:  This is an example ASCII byte stream.  ASCII is an abbreviation.  
[abc]:  ........1....1...........1........1.............1..111..........

By convention, zero bits within a character class bit stream are marked with periods, so that the one bits (each marked with the digit 1) stand out.

Character Class Expressions

Following traditional regular expression notation, character classes can be defined by listing individual characters and ranges of characters within square brackets. For example, the set of delimiter characters consisting of colon, period, comma, semicolon may be denoted [:.,;], while the alphanumeric ASCII characters (any uppercase or lowercase ASCII letter or any ASCII digit) is represented [A-Za-z0-9].

Characters vs. Code Units

Code units are the fixed-size units that are used in defining a character encoding system. Very often, 8-bit code units (bytes) are used as the basis of an encoding system. But in some cases, such as the UTF-8 representation of Unicode, multiple code units may be required to define a single character. In UTF-8, characters are encoded using sequences that are either one, two, three, or four code units in length.

At the fundamental level, the Parabix character class compilers operate as compilers for identifying individual code units. Defining characters that are comprised of sequences of code units involves an additional transformation structure.

The Compilers

The Python Static Character Class Compiler

The python character class compiler charsetcompiler.py takes a data file of character class definitions as input and produces a set of bitwise logic equations as output.

For example, consider the input file "delim_and_alphanum" consisting of the following definitions:

delimiters = [:.,;]
alphanumeric = [A-Za-z0-9]

A set of equations to compute these character classes from the eight basis bit streams can then be produced by the running the compiler as follows. python charset_compiler.py delim_and_alphanum The following results are produced.

	temp1 = (basis_bits.bit_0 | basis_bits.bit_1)
	temp2 = (basis_bits.bit_2 &~ basis_bits.bit_3)
	temp3 = (temp2 &~ temp1)
	temp4 = (basis_bits.bit_4 & basis_bits.bit_5)
	temp5 = (basis_bits.bit_6 | basis_bits.bit_7)
	temp6 = (basis_bits.bit_6 &~ basis_bits.bit_7)
	temp7 = (temp5 &~ temp6)
	temp8 = (temp4 &~ temp7)
	temp9 = (temp3 & temp8)
	temp10 = (basis_bits.bit_2 & basis_bits.bit_3)
	temp11 = (temp10 &~ temp1)
	temp12 = (basis_bits.bit_4 &~ basis_bits.bit_5)
	temp13 = (temp12 & basis_bits.bit_6)
	temp14 = (temp11 & temp13)
	delimiters = (temp9 | temp14)
	temp15 = (basis_bits.bit_5 | basis_bits.bit_6)
	temp16 = (basis_bits.bit_4 & temp15)
	temp17 = (temp11 &~ temp16)
	temp18 = (basis_bits.bit_1 &~ basis_bits.bit_0)
	temp19 = (temp18 &~ basis_bits.bit_2)
	temp20 = (basis_bits.bit_6 & basis_bits.bit_7)
	temp21 = (basis_bits.bit_5 | temp20)
	temp22 = (basis_bits.bit_4 & temp21)
	temp23 = (~temp22)
	temp24 = (basis_bits.bit_4 | basis_bits.bit_5)
	temp25 = (temp24 | temp5)
	temp26 = ((basis_bits.bit_3 & temp23)|(~(basis_bits.bit_3) & temp25))
	temp27 = (temp19 & temp26)
	temp28 = (temp17 | temp27)
	temp29 = (temp18 & basis_bits.bit_2)
	temp30 = (temp29 & temp26)
	alphanumeric = (temp28 | temp30)

The python compiler supports several options which can be displayed with the following help command. python charset_compiler.py -h

The Parabix+LLVM Dynamic Character Class Compiler

The Parabix+LLVM toolchain incorporates a dynamic character class compiler that allows applications to be built when the definition of a character class is determined at run-time. For example, this supports regular expression applications such as icgrep, in which users may provide arbitrary regular expressions involving arbitrary character classes as input.

Construction of Character Class Objects

  1. Character classes may be built-up using the operations defined in icGREP/icgrep-devel/icgrep/re/re_cc.h, as follows.
    1. Basic character classes may be constructed using a single codepoint or a codepoint range, using re::makeCC(cp) or re:makeCC(lo_cp, hi_cp), respectively.
    2. Character classes may be further constructed by union of two character classes using re:makeCC(cc1, cc2).
    3. Subtraction and intersection of character classes is also supported.
  2. Character classes may also be constructed by parsing a character class expression as a regular expression object, using the re::parse routine of icGREP/icgrep/icgrep-devel/re/re_parser.h
  3. In the Parabix+LLVM framework, the current character class objects are based on Unicode codepoints in the range 0 to 0x10FFFF.
    1. Character class definition based on other alphabets can also potentially be supported - future work.

Compilation of Character Code Unit Classes

  1. When a character class is confined to a single code unit (byte, presently), the compileCC operations of icGREP/icgrep/icgrep-devel/cc/cc_compiler.h can be used to generate Pablo code, in a manner analogous to the Python character class compiler.

Compilation of Full UTF-8 Character Classes

  1. Character class objects include code unit values from the full space of Unicode codepoints.
  2. Compiling full Unicode definitions can be performed by the UCD compiler.
Last modified 19 months ago Last modified on Mar 8, 2016, 3:54:06 PM