Version 4 (modified by cameron, 3 years ago) (diff)


The Parabix Character Class Compilers

Basic Concepts

Character Class Bit Streams

Given an input stream of character code units, a character class bit stream is a stream of bits defined in one-to-one correspondence with the input stream such that one bits mark instances of character code units within the class, and zero bits mark instances of character code units outside the class.

For example, consider the ASCII character class expression [abc] standing for the class comprising ASCII bytes having code unit values 97 (ASCII value for a), 98 (ASCII value for b) or 99 (ASCII value for c). The following example shows the [abc] character class bit stream aligned with an example ASCII input stream.

input:  This is an example ASCII byte stream.  ASCII is an abbreviation.  
[abc]:  ........1....1...........1........1.............1..111..........

By convention, zero bits within a character class bit stream are marked with periods, so that the one bits (each marked with the digit 1) stand out.

Character Class Expressions

Following traditional regular expression notation, character classes can be defined by listing individual characters and ranges of characters within square brackets. For example, the set of delimiter characters consisting of colon, period, comma, semicolon may be denoted [:.,;], while the alphanumeric ASCII characters (any uppercase or lowercase ASCII letter or any ASCII digit) is represented [A-Za-z0-9].

Characters vs. Code Units

Code units are the fixed-size units that are used in defining a character encoding system. Very often, 8-bit code units (bytes) are used as the basis of an encoding system. But in some cases, such as the UTF-8 representation of Unicode, multiple code units may be required to define a single character. In UTF-8, characters are encoded using sequences that are either one, two, three, or four code units in length.

At the fundamental level, the Parabix character class compilers operate as compilers for identifying individual code units. Defining characters that are comprised of sequences of code units involves an additional transformation structure.

The Compilers

The Python Static Character Class Compiler

The python character class compiler takes a data file of character class definitions as input and produces a set of bitwise logic equations as output.

For example, consider the input file "delim_and_alphanum" consisting of the following definitions:

delimiters = [:.,;]
alphanumeric = [A-Za-z0-9]

A set of equations to compute these character classes from the eight basis bit streams can then be produced by the running the compiler as follows. python delim_and_alphanum The following results are produced.

	temp1 = (basis_bits.bit_0 | basis_bits.bit_1)
	temp2 = (basis_bits.bit_2 &~ basis_bits.bit_3)
	temp3 = (temp2 &~ temp1)
	temp4 = (basis_bits.bit_4 & basis_bits.bit_5)
	temp5 = (basis_bits.bit_6 | basis_bits.bit_7)
	temp6 = (basis_bits.bit_6 &~ basis_bits.bit_7)
	temp7 = (temp5 &~ temp6)
	temp8 = (temp4 &~ temp7)
	temp9 = (temp3 & temp8)
	temp10 = (basis_bits.bit_2 & basis_bits.bit_3)
	temp11 = (temp10 &~ temp1)
	temp12 = (basis_bits.bit_4 &~ basis_bits.bit_5)
	temp13 = (temp12 & basis_bits.bit_6)
	temp14 = (temp11 & temp13)
	delimiters = (temp9 | temp14)
	temp15 = (basis_bits.bit_5 | basis_bits.bit_6)
	temp16 = (basis_bits.bit_4 & temp15)
	temp17 = (temp11 &~ temp16)
	temp18 = (basis_bits.bit_1 &~ basis_bits.bit_0)
	temp19 = (temp18 &~ basis_bits.bit_2)
	temp20 = (basis_bits.bit_6 & basis_bits.bit_7)
	temp21 = (basis_bits.bit_5 | temp20)
	temp22 = (basis_bits.bit_4 & temp21)
	temp23 = (~temp22)
	temp24 = (basis_bits.bit_4 | basis_bits.bit_5)
	temp25 = (temp24 | temp5)
	temp26 = ((basis_bits.bit_3 & temp23)|(~(basis_bits.bit_3) & temp25))
	temp27 = (temp19 & temp26)
	temp28 = (temp17 | temp27)
	temp29 = (temp18 & basis_bits.bit_2)
	temp30 = (temp29 & temp26)
	alphanumeric = (temp28 | temp30)

The python compiler supports several options which can be displayed with the following help command. python -h

The Parabix+LLVM Dynamic Character Class Compiler