Changes between Version 4 and Version 5 of ParabixRegexRoadMap


Ignore:
Timestamp:
Apr 22, 2016, 3:51:35 PM (17 months ago)
Author:
cameron
Comment:

--

Legend:

Unmodified
Added
Removed
Modified
  • ParabixRegexRoadMap

    v4 v5  
    1 == Parabix Regular Expression Road Map ==
     1= Parabix Regular Expression Road Map =
    22
    3 Parabix regular expression matching offers consistent, ultra-fast regular expression matching performance,
    4 dramatically outperforming other regular expression technologies.   
     3The overall Parabix regular expression project will move beyond
     4our current [wiki:ICgrep icgrep] application to consider extended regular expression support for
     5a broad array of applications.
     6
     7 * Regular expressions over arbitrary alphabets and character encodings.
     8 * Systematic support for disciplined subexpression capture.
     9 * Disciplined backreference support.
     10 * Extended RE operators including intersection, difference and permutation.
     11 * New regular expression syntax to improve readability and composability of regular expressions.
     12 * Systematic naming and namespace rules for libraries of important regular expressions.
     13 * A file modeling language based on regular expressions and new tools based on these models.
     14 * Additional algorithmic and architectural work to improve performance.
     15 * Support for new general purpose regular expression matching APIs.
     16 * Systematic investigation of regular expression support requirements for important application areas.
     17
     18== Regular Expression Abstract Data Type ==
     19
     20The current implementation of the regular expression data type is found within the [source:icGREP/icgrep-devel/icgrep/re icgrep/re] directory.
     21
     22The following roadmap outlines the plans for the further development of regular
     23expression support within the Parabix-LLVM framework.
     24
     25 1.  Abstraction over Alphabet (Character Set)
     26
     27 At present, icgrep supports regular expressions over the UTF-8 representation of Unicode.   However,
     28 there are many applications for regular expression processing over different alphabets.   These
     29 include:
     30       * Regular expressions over extended ASCII byte sequences
     31       * Regular expressions over DNA nucleotides (A, C, G, T)
     32       * Regular expressions over amino acid codes
     33       * Regular expressions over the UTF-16 representation of Unicode
     34 A mechanism for specializing regular expression support for any given alphabet will be introduced.
     35 2.  Names
     36
     37 We will introduce a systematic approach to named character classes and regular expressions.   
     38 including both local and external names.   Local names will be used to create modular
     39 regular expressions: in essence we will have regular grammars, with names used for substructures
     40 as desired.   
     41
     42     External names will be defined within a namespace system.   POSIX character classes
     43     and Unicode property support will fit under this approach.
     44
     45     External names of full REs will also be developed, so that developers can simply
     46     refer to defined REs rather than reinventing them all the time.
     47
     48 3.  Backreferences
     49
     50 We will introduce a RE subtype for backreferences.   It is not possible to implement efficient
     51 backreference support in the general case.   However, we may be able to develop techniques
     52 for important specical cases.   In addition, we can apply a systematic two-pass approach
     53 with an external tool for full backreference support.   In the first pass, the backreference
     54 is treated merely as copy of the referenced RE, so that the substrings matched by the original
     55 subexpression and the backreference are not necessarily the same.   In the second pass, the
     56 matches are filted by string equality  on these two components.    For example, icgrep could
     57 be used to perform the initial matching, after which the icgrep output is piped to grep
     58 to complete the backreference support.
     59
     60 4.  Unification of Assertions
     61
     62 At present, icgrep supports the following zero-width assertions:  Start, End, !GraphemeClusterBoundary,
     63 Assertion(Lookahead, Positive), Assertion(Lookahead, Negative), Assertion(Lookbehind, Positive), Assertion(Lookbehind, Negative).
     64
     65 Start is a positive lookbehind assertion for start of line or string, depending on context.   End is a similar
     66 lookahead assertion for end of line or string.    Grapheme cluster boundary is a special assertion for Unicode
     67 processing, composed of a hierarchy of primitive assertions.
     68
     69 We also have the set operators Intersect and Diff.   When applied to individual characters, Intersect is
     70 a positive lookbehind assertion, while diff is a negative lookbehind assertion.
     71
     72 A unification of zero-width assertions would simplify analysis algorithms as well as compilation
     73 algorithms, by eliminating duplicate logic.
     74
     75 5.  Arbitrary Lookahead Assertion Support
     76
     77 Arbitrary lookahed assertion support will be implemented using RE transformations, (this can build on the regular expression derivatives concept).
     78
     79 This is necessary for implementation of some Unicode level 2 requirements, including
     80 the 2-codepoint lookahead for word and sentence boundaries.
     81
     82 6.  Lookaround Assertion Optimization
     83
     84 Lookaround assertions can be optimizied by partial evaluation in the context of
     85 the particular regular expression components to which they apply.   This will be useful
     86 in implementing arbitrary lookahead assertion support as well as implementing Unicode level 2
     87 break properties (graphemes, words, sentences).
     88
     89 7.  Permutations
     90
     91 A permutation RE allows items to appear in any order.  For example, {{{a||b||c||d}}} stands for the 24 different strings having one each of the letters {{{a}}} through {{{d}}}.   Permutations may address problems such as matching equivalent grapheme clusters.
     92
     93 8.  Immutability
     94
     95 Regular expression objects will be immutable.   Algorithms that transform regular expressions
     96 can use shared structure for components that do not change, creating new structures when they do change.
     97 A slab allocator strategy will be used for regular expressions.
     98
     99 Immutability will be useful for many cases where optimized REs are used in the initial
     100 search pass, wheres as the full final RE may be needed for matched string identification,
     101 for example, for colorization.
     102
     103 9.  Consolidation
     104 
     105 A single header file will be introduced for the regular expression ADT.
     106
     107 10.  Parsers
     108
     109 We will introduce different parsers for regular expressions in different notations.   
     110   * Posix BRE
     111   * Posix ERE
     112   * PCRE
     113   * PROSITE protein patterns
     114
    5115
    6116== icgrep 2.0 Road Map ==
     
    20130 * fully dynamic processor detection and SIMD feature selection
    21131
    22 == Beyond icgrep ==
    23 
    24 Leveraging the Parabix performance advantage, our overall regular expression project will move beyond
    25 our current [wiki:ICgrep icgrep] application to consider extended regular expression support for
    26 a broad array of applications.
    27 
    28  * New regular expression syntax to improve readability and composability of regular expressions.
    29  * Systematic naming and namespace rules for libraries of important regular expressions.
    30  * Systematic support for disciplined subexpression capture.
    31  * Disciplined backreference support.
    32  * A file modeling language based on regular expressions and new tools based on these models.
    33  * Additional algorithmic and architectural work to improve performance.
    34  * Support for new general purpose regular expression matching APIs.
    35  * Systematic investigation of regular expression support requirements for important application areas.
    36 
    37132== Bioinformatics ==
    38133