Parabix Regular Expression Road Map
The overall Parabix regular expression project will move beyond our current icgrep application to consider extended regular expression support for a broad array of applications.
- Regular expressions over arbitrary alphabets and character encodings.
- Systematic support for disciplined subexpression capture.
- Disciplined backreference support.
- Extended RE operators including intersection, difference and permutation.
- New regular expression syntax to improve readability and composability of regular expressions.
- Systematic naming and namespace rules for libraries of important regular expressions.
- A file modeling language based on regular expressions and new tools based on these models.
- Additional algorithmic and architectural work to improve performance.
- Support for new general purpose regular expression matching APIs.
- Systematic investigation of regular expression support requirements for important application areas.
Regular Expression Abstract Data Type
The current implementation of the regular expression data type is found within the icgrep/re directory.
The following roadmap outlines the plans for the further development of regular expression support within the Parabix-LLVM framework.
- Abstraction over Alphabet (Character Set)
At present, icgrep supports regular expressions over the UTF-8 representation of Unicode. However, there are many applications for regular expression processing over different alphabets. These include:
- Regular expressions over extended ASCII byte sequences
- Regular expressions over DNA nucleotides (A, C, G, T)
- Regular expressions over amino acid codes
- Regular expressions over the UTF-16 representation of Unicode
A mechanism for specializing regular expression support for any given alphabet will be introduced.
We will introduce a systematic approach to named character classes and regular expressions. including both local and external names. Local names will be used to create modular regular expressions: in essence we will have regular grammars, with names used for substructures as desired.
External names will be defined within a namespace system. POSIX character classes and Unicode property support will fit under this approach.
External names of full REs will also be developed, so that developers can simply refer to defined REs rather than reinventing them all the time.
We will introduce a RE subtype for backreferences. It is not possible to implement efficient backreference support in the general case. However, we may be able to develop techniques for important specical cases. In addition, we can apply a systematic two-pass approach with an external tool for full backreference support. In the first pass, the backreference is treated merely as copy of the referenced RE, so that the substrings matched by the original subexpression and the backreference are not necessarily the same. In the second pass, the matches are filted by string equality on these two components. For example, icgrep could be used to perform the initial matching, after which the icgrep output is piped to grep to complete the backreference support.
- Unification of Assertions
At present, icgrep supports the following zero-width assertions: Start, End, GraphemeClusterBoundary, Assertion(Lookahead, Positive), Assertion(Lookahead, Negative), Assertion(Lookbehind, Positive), Assertion(Lookbehind, Negative).
Start is a positive lookbehind assertion for start of line or string, depending on context. End is a similar lookahead assertion for end of line or string. Grapheme cluster boundary is a special assertion for Unicode processing, composed of a hierarchy of primitive assertions.
We also have the set operators Intersect and Diff. When applied to individual characters, Intersect is a positive lookbehind assertion, while diff is a negative lookbehind assertion.
A unification of zero-width assertions would simplify analysis algorithms as well as compilation algorithms, by eliminating duplicate logic.
- Arbitrary Lookahead Assertion Support
Arbitrary lookahed assertion support will be implemented using RE transformations, (this can build on the regular expression derivatives concept).
This is necessary for implementation of some Unicode level 2 requirements, including the 2-codepoint lookahead for word and sentence boundaries.
- Lookaround Assertion Optimization
Lookaround assertions can be optimizied by partial evaluation in the context of the particular regular expression components to which they apply. This will be useful in implementing arbitrary lookahead assertion support as well as implementing Unicode level 2 break properties (graphemes, words, sentences).
A permutation RE allows items to appear in any order. For example, a||b||c||d stands for the 24 different strings having one each of the letters a through d. Permutations may address problems such as matching equivalent grapheme clusters.
Regular expression objects will be immutable. Algorithms that transform regular expressions can use shared structure for components that do not change, creating new structures when they do change. A slab allocator strategy will be used for regular expressions.
Immutability will be useful for many cases where optimized REs are used in the initial search pass, wheres as the full final RE may be needed for matched string identification, for example, for colorization.
A single header file will be introduced for the regular expression ADT.
We will introduce different parsers for regular expressions in different notations.
- Posix BRE
- Posix ERE
- PROSITE protein patterns
icgrep 2.0 Road Map
Building on the solid foundation of icgrep 1.0, our current development plan for icgrep 2.0 includes the following features.
- Extending Unicode support from the existing Unicode Level 1 conformance to Unicode Level 2 conformance UTS #18.
See Unicode Level 2 Support in icGrep for current status.
- recursive searches with -r flag
- multithreaded searches for improved performance on multi-pattern matching and multi-file matching
- fully dynamic processor detection and SIMD feature selection
- PROSITE regular expression patterns and matching