Changes between Version 2 and Version 3 of CharacterClassCompiler


Timestamp:
Mar 8, 2016, 1:16:31 PM (4 years ago)
Author:
cameron
Comment:

--

  • CharacterClassCompiler

    v2 v3  
    1 = The Parabix Character Class Compiler =
     1= The Parabix Character Class Compilers =
    22
    3 == Character Class Bit Streams ==
     3== Basic Concepts ==
     4
     5=== Character Class Bit Streams ===
    46
    57Given an input stream of character code units, a ''character class'' bit stream
     
    2123By convention, zero bits within a character class bit stream are marked with periods,
    2224so that the one bits (each marked with the digit 1) stand out.
     25
     26=== Character Class Expressions ===
     27
     28Following traditional regular expression notation, character classes can be
     29defined by listing individual characters and ranges of characters within
     30square brackets.   For example, the set of delimiter characters consisting
     31of colon, period, comma, and semicolon may be denoted {{{[:.,;]}}}, while the
     32alphanumeric ASCII characters (any uppercase or lowercase ASCII letter
     33or any ASCII digit) are represented by {{{[A-Za-z0-9]}}}.
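
To make the notation concrete, the following is a minimal sketch (not part of the Parabix tools) of how a simple bracket expression could be expanded into a set of characters and then displayed as a character class bit stream, using the period/1 convention described above. The helper names `expand_class` and `class_stream` are hypothetical, and the sketch assumes no negation, escapes, or named classes.

```python
def expand_class(expr):
    """Expand a simple bracket expression such as '[A-Za-z0-9]' into a set
    of characters. Illustrative only: no negation, escapes, or named classes."""
    if not (expr.startswith("[") and expr.endswith("]")):
        raise ValueError("expected a bracket expression")
    body = expr[1:-1]
    members = set()
    i = 0
    while i < len(body):
        # A range looks like 'x-y': a character, a hyphen, another character.
        if i + 2 < len(body) and body[i + 1] == "-":
            lo, hi = ord(body[i]), ord(body[i + 2])
            members.update(chr(c) for c in range(lo, hi + 1))
            i += 3
        else:
            members.add(body[i])
            i += 1
    return members


def class_stream(text, members):
    """Render the character class bit stream for text: '1' where the
    character is in the class, '.' where it is not."""
    return "".join("1" if ch in members else "." for ch in text)


delimiters = expand_class("[:.,;]")
alphanumeric = expand_class("[A-Za-z0-9]")

sample = "x = a1;"
print(sample)
print(class_stream(sample, delimiters))
print(class_stream(sample, alphanumeric))
```

This computes each class one character at a time; the compilers described below instead produce bitwise equations that process many positions at once.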
     34
     35=== Characters vs. Code Units ===
     36
     37''Code units'' are the fixed-size units that are used in defining a character
     38encoding system.   Very often, 8-bit code units (bytes) are used as the
     39basis of an encoding system.   But in some cases, such as the UTF-8 representation
     40of Unicode, multiple code units may be required to define a single character.
     41In UTF-8, characters are encoded using sequences that are either one, two,
     42three, or four code units in length.
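
As a small illustration of the distinction, this sketch uses Python's built-in UTF-8 encoder to list the code units for four sample characters, one for each possible sequence length. The function name `utf8_code_units` and the choice of sample characters are assumptions made for the example.

```python
def utf8_code_units(ch):
    """Return the UTF-8 code units (bytes) of a single character as a list of ints."""
    return list(ch.encode("utf-8"))


# One sample character per sequence length: U+0041, U+00E9, U+20AC, U+1F600.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

for ch in samples:
    units = utf8_code_units(ch)
    hex_units = " ".join("%02X" % u for u in units)
    print("U+%04X: %d code unit(s): %s" % (ord(ch), len(units), hex_units))
```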
     43
     44At the fundamental level, the Parabix character class compilers operate
     45as compilers for identifying individual code units.   Defining characters
     46that are composed of sequences of code units involves an additional transformation
     47structure.
     48
     49
     50== The Compilers ==
     51
     52=== The Python Character Class Compiler ===
     53
     54The Python character class compiler {{{charset_compiler.py}}} takes a data file
     55of character class definitions as input and produces a set of bitwise logic equations
     56as output.
     57
     58For example, consider the input file "{{{delim_and_alphanum}}}" consisting of the following definitions:
     59{{{
     60delimiters = [:.,;]
     61alphanumeric = [A-Za-z0-9]
     62}}}
     63
     64A set of equations to compute these character classes from the eight basis
     65bit streams can then be produced by running the compiler as follows.
     66{{{python charset_compiler.py delim_and_alphanum}}}
     67The following results are produced.
     68{{{
     69        temp1 = (basis_bits.bit_0 | basis_bits.bit_1)
     70        temp2 = (basis_bits.bit_2 &~ basis_bits.bit_3)
     71        temp3 = (temp2 &~ temp1)
     72        temp4 = (basis_bits.bit_4 & basis_bits.bit_5)
     73        temp5 = (basis_bits.bit_6 | basis_bits.bit_7)
     74        temp6 = (basis_bits.bit_6 &~ basis_bits.bit_7)
     75        temp7 = (temp5 &~ temp6)
     76        temp8 = (temp4 &~ temp7)
     77        temp9 = (temp3 & temp8)
     78        temp10 = (basis_bits.bit_2 & basis_bits.bit_3)
     79        temp11 = (temp10 &~ temp1)
     80        temp12 = (basis_bits.bit_4 &~ basis_bits.bit_5)
     81        temp13 = (temp12 & basis_bits.bit_6)
     82        temp14 = (temp11 & temp13)
     83        delimiters = (temp9 | temp14)
     84        temp15 = (basis_bits.bit_5 | basis_bits.bit_6)
     85        temp16 = (basis_bits.bit_4 & temp15)
     86        temp17 = (temp11 &~ temp16)
     87        temp18 = (basis_bits.bit_1 &~ basis_bits.bit_0)
     88        temp19 = (temp18 &~ basis_bits.bit_2)
     89        temp20 = (basis_bits.bit_6 & basis_bits.bit_7)
     90        temp21 = (basis_bits.bit_5 | temp20)
     91        temp22 = (basis_bits.bit_4 & temp21)
     92        temp23 = (~temp22)
     93        temp24 = (basis_bits.bit_4 | basis_bits.bit_5)
     94        temp25 = (temp24 | temp5)
     95        temp26 = ((basis_bits.bit_3 & temp23)|(~(basis_bits.bit_3) & temp25))
     96        temp27 = (temp19 & temp26)
     97        temp28 = (temp17 | temp27)
     98        temp29 = (temp18 & basis_bits.bit_2)
     99        temp30 = (temp29 & temp26)
     100        alphanumeric = (temp28 | temp30)
     101}}}
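
The generated equations can be checked with a short simulation. The sketch below is illustrative and is not produced by the compiler: it models each bit stream as a Python integer (bit i of the integer corresponds to input position i), builds the eight basis bit streams with a hypothetical `transpose` helper, and then evaluates the `delimiters` equations copied from the output above. It assumes that `bit_0` is the most significant bit of each byte, which is the numbering under which the equations select exactly `:`, `.`, `,` and `;`.

```python
from types import SimpleNamespace


def transpose(data):
    """Build the eight basis bit streams for a byte string.
    Assumption: bit_0 is the most significant bit of each byte."""
    streams = [0] * 8
    for pos, byte in enumerate(data):
        for k in range(8):
            if (byte >> (7 - k)) & 1:
                streams[k] |= 1 << pos
    return SimpleNamespace(**{"bit_%d" % k: s for k, s in enumerate(streams)})


def delimiters_stream(basis_bits):
    """Evaluate the generated equations for the delimiters class [:.,;]."""
    temp1 = (basis_bits.bit_0 | basis_bits.bit_1)
    temp2 = (basis_bits.bit_2 &~ basis_bits.bit_3)
    temp3 = (temp2 &~ temp1)                       # high nibble is 0x2
    temp4 = (basis_bits.bit_4 & basis_bits.bit_5)
    temp5 = (basis_bits.bit_6 | basis_bits.bit_7)
    temp6 = (basis_bits.bit_6 &~ basis_bits.bit_7)
    temp7 = (temp5 &~ temp6)
    temp8 = (temp4 &~ temp7)                       # low nibble is 0xC or 0xE
    temp9 = (temp3 & temp8)                        # ',' (0x2C) or '.' (0x2E)
    temp10 = (basis_bits.bit_2 & basis_bits.bit_3)
    temp11 = (temp10 &~ temp1)                     # high nibble is 0x3
    temp12 = (basis_bits.bit_4 &~ basis_bits.bit_5)
    temp13 = (temp12 & basis_bits.bit_6)           # low nibble is 0xA or 0xB
    temp14 = (temp11 & temp13)                     # ':' (0x3A) or ';' (0x3B)
    return (temp9 | temp14)


def show(stream, length):
    """Render a bit stream with '1' for one bits and '.' for zero bits."""
    return "".join("1" if (stream >> i) & 1 else "." for i in range(length))


text = b"a:b.c,d;e"
marks = show(delimiters_stream(transpose(text)), len(text))
print(text.decode("ascii"))
print(marks)
```

Because every operation is a bitwise operation on whole streams, all positions of the input are classified together rather than one code unit at a time.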
     102
     103set of bitstream equations