| 25 | |
| 26 | === Character Class Expressions === |
| 27 | |
| 28 | Following traditional regular expression notation, character classes can be |
| 29 | defined by listing individual characters and ranges of characters within |
| 30 | square brackets. For example, the set of delimiter characters consisting |
| 31 | of colon, period, comma, and semicolon may be denoted {{{[:.,;]}}}, while the |
| 32 | alphanumeric ASCII characters (any uppercase or lowercase ASCII letter |
| 33 | or any ASCII digit) are represented by {{{[A-Za-z0-9]}}}. |
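
To make the bracket notation concrete, the following minimal sketch expands a class expression into the set of characters it denotes. It is an illustration only, not the parser used by the Parabix compilers: the function name `expand_class` is hypothetical, and it handles only literal characters and simple `x-y` ranges (no negation or escapes).

```python
def expand_class(expr):
    """Expand a simple bracket expression such as "[A-Za-z0-9]" into a set of characters.

    Hypothetical helper for illustration; supports only literals and x-y ranges.
    """
    if not (expr.startswith("[") and expr.endswith("]")):
        raise ValueError("expected an expression of the form [...]")
    body = expr[1:-1]
    members = set()
    i = 0
    while i < len(body):
        # A range looks like "lo-hi": a character, a hyphen, another character.
        if i + 2 < len(body) and body[i + 1] == "-":
            lo, hi = body[i], body[i + 2]
            members.update(chr(c) for c in range(ord(lo), ord(hi) + 1))
            i += 3
        else:
            members.add(body[i])
            i += 1
    return members


delimiters = expand_class("[:.,;]")
alphanumeric = expand_class("[A-Za-z0-9]")
print(sorted(delimiters))   # [',', '.', ':', ';']
print(len(alphanumeric))    # 62
```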
| 34 | |
| 35 | === Characters vs. Code Units === |
| 36 | |
| 37 | ''Code units'' are the fixed-size units that are used in defining a character |
| 38 | encoding system. Very often, 8-bit code units (bytes) are used as the |
| 39 | basis of an encoding system. But in some cases, such as the UTF-8 representation |
| 40 | of Unicode, multiple code units may be required to define a single character. |
| 41 | In UTF-8, each character is encoded as a sequence of one, two, |
| 42 | three, or four code units. |
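
The variable length of UTF-8 sequences can be seen with a short Python sketch. The four sample characters are chosen here for illustration (they do not come from the Parabix sources); each is encoded and its code units are counted.

```python
# One sample character for each possible UTF-8 sequence length.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

lengths = []
for ch in samples:
    units = ch.encode("utf-8")            # the code units (bytes) of this character
    lengths.append(len(units))
    hex_units = " ".join(f"{b:02x}" for b in units)
    print(f"U+{ord(ch):04X}: {len(units)} code unit(s): {hex_units}")

print(lengths)  # [1, 2, 3, 4]
```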
| 43 | |
| 44 | At the fundamental level, the Parabix character class compilers generate |
| 45 | logic for identifying individual code units. Defining characters |
| 46 | that are composed of sequences of code units requires an additional transformation |
| 47 | structure. |
| 48 | |
| 49 | |
| 50 | == The Compilers == |
| 51 | |
| 52 | === The Python Character Class Compiler === |
| 53 | |
| 54 | The Python character class compiler {{{charset_compiler.py}}} takes a data file |
| 55 | of character class definitions as input and produces a set of bitwise logic equations |
| 56 | as output. |
| 57 | |
| 58 | For example, consider the input file "{{{delim_and_alphanum}}}" consisting of the following definitions: |
| 59 | {{{ |
| 60 | delimiters = [:.,;] |
| 61 | alphanumeric = [A-Za-z0-9] |
| 62 | }}} |
| 63 | |
| 64 | A set of equations to compute these character classes from the eight basis |
| 65 | bit streams can then be produced by running the compiler as follows. |
| 66 | {{{python charset_compiler.py delim_and_alphanum}}} |
| 67 | The following results are produced. |
| 68 | {{{ |
| 69 | temp1 = (basis_bits.bit_0 | basis_bits.bit_1) |
| 70 | temp2 = (basis_bits.bit_2 &~ basis_bits.bit_3) |
| 71 | temp3 = (temp2 &~ temp1) |
| 72 | temp4 = (basis_bits.bit_4 & basis_bits.bit_5) |
| 73 | temp5 = (basis_bits.bit_6 | basis_bits.bit_7) |
| 74 | temp6 = (basis_bits.bit_6 &~ basis_bits.bit_7) |
| 75 | temp7 = (temp5 &~ temp6) |
| 76 | temp8 = (temp4 &~ temp7) |
| 77 | temp9 = (temp3 & temp8) |
| 78 | temp10 = (basis_bits.bit_2 & basis_bits.bit_3) |
| 79 | temp11 = (temp10 &~ temp1) |
| 80 | temp12 = (basis_bits.bit_4 &~ basis_bits.bit_5) |
| 81 | temp13 = (temp12 & basis_bits.bit_6) |
| 82 | temp14 = (temp11 & temp13) |
| 83 | delimiters = (temp9 | temp14) |
| 84 | temp15 = (basis_bits.bit_5 | basis_bits.bit_6) |
| 85 | temp16 = (basis_bits.bit_4 & temp15) |
| 86 | temp17 = (temp11 &~ temp16) |
| 87 | temp18 = (basis_bits.bit_1 &~ basis_bits.bit_0) |
| 88 | temp19 = (temp18 &~ basis_bits.bit_2) |
| 89 | temp20 = (basis_bits.bit_6 & basis_bits.bit_7) |
| 90 | temp21 = (basis_bits.bit_5 | temp20) |
| 91 | temp22 = (basis_bits.bit_4 & temp21) |
| 92 | temp23 = (~temp22) |
| 93 | temp24 = (basis_bits.bit_4 | basis_bits.bit_5) |
| 94 | temp25 = (temp24 | temp5) |
| 95 | temp26 = ((basis_bits.bit_3 & temp23)|(~(basis_bits.bit_3) & temp25)) |
| 96 | temp27 = (temp19 & temp26) |
| 97 | temp28 = (temp17 | temp27) |
| 98 | temp29 = (temp18 & basis_bits.bit_2) |
| 99 | temp30 = (temp29 & temp26) |
| 100 | alphanumeric = (temp28 | temp30) |
| 101 | }}} |
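
One way to sanity-check the generated equations is to evaluate them on a small input. The sketch below is not part of the Parabix tools: `transpose`, `delimiters_stream`, and `marked_positions` are hypothetical helper names, each bit stream is modeled as a Python integer whose bit i corresponds to byte i of the input, and `bit_0` is assumed to be the most significant bit of each byte. The body of `delimiters_stream` copies the first fifteen equations above unchanged.

```python
from types import SimpleNamespace


def transpose(data):
    """Build eight basis bit streams from bytes (assumption: bit_0 is the MSB)."""
    streams = [0] * 8
    for pos, byte in enumerate(data):
        for k in range(8):
            if (byte >> (7 - k)) & 1:
                streams[k] |= 1 << pos
    return SimpleNamespace(**{f"bit_{k}": s for k, s in enumerate(streams)})


def delimiters_stream(basis_bits):
    # Python's ~ yields a negative integer, but every use below is ANDed
    # with a non-negative stream, so all results stay non-negative.
    temp1 = (basis_bits.bit_0 | basis_bits.bit_1)
    temp2 = (basis_bits.bit_2 &~ basis_bits.bit_3)
    temp3 = (temp2 &~ temp1)
    temp4 = (basis_bits.bit_4 & basis_bits.bit_5)
    temp5 = (basis_bits.bit_6 | basis_bits.bit_7)
    temp6 = (basis_bits.bit_6 &~ basis_bits.bit_7)
    temp7 = (temp5 &~ temp6)
    temp8 = (temp4 &~ temp7)
    temp9 = (temp3 & temp8)
    temp10 = (basis_bits.bit_2 & basis_bits.bit_3)
    temp11 = (temp10 &~ temp1)
    temp12 = (basis_bits.bit_4 &~ basis_bits.bit_5)
    temp13 = (temp12 & basis_bits.bit_6)
    temp14 = (temp11 & temp13)
    delimiters = (temp9 | temp14)
    return delimiters


def marked_positions(stream):
    """List the byte positions whose bit is set in the stream."""
    return [i for i in range(stream.bit_length()) if (stream >> i) & 1]


text = b"a:b.c,d;e"
positions = marked_positions(delimiters_stream(transpose(text)))
print(positions)  # [1, 3, 5, 7]
```

Each set bit in the result marks one input position, so a single pass over the equations classifies every byte of the input at once.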
| 102 | |
| 103 | set of bitstream equations |