source: docs/Working/backref.tex @ 6183

Last change on this file since 6183 was 6183, checked in by cameron, 5 months ago

Backreference working document

File size: 4.7 KB
8\title{Parabix Methods for Backreferences}
9\author{Robert D. Cameron}
16The {\em backreference} problem is to support patterns of the
17form \verb'('$R_1$\verb')'$R_2$\verb'\1', where $R_1$ and $R_2$ are regular expressions
18and \verb'\1' is a backreference that must match the same string matched by $R_1$.
19Thus the pattern \verb:([a-z])\1: matches consecutive pairs of characters consisting of the same
20two lower case letters.  The subexpression enclosed within parentheses is said to
21{\em capture} a string to be later matched by a corresponding numbered backreference.
22Multiple parenthesized capturing groups and corresponding backreferences may be used within a
24For example, consider the following pattern for 6-letter palindromes:
25\verb:\b([a-z])(a-z)(a-z)\3\2\1\b: where \verb:\b: is a zero-width assertion
26matching a word boundary.   Here there are three capturing groups and
27three corresponding backreferences to the captured substrings in reversed
28order.  A search for matches to this pattern within
29the string ``red, redder and reddest'' will find the substring ``redder.''
30%Another example is a simplified pattern to match XML elements having simple ASCII names:
32%Here the backreference enforces the requirement that the end tag of the element has
33%the same element name as the start tag.
35Backreferences are supported by a number of libraries and tools for
36regular expression processing, including classical tools such as grep.
37However, from a theoretical perspective, backreferences are not regular.   
38That is, they are not features that can be implemented by simple translation to
39other regular expression features.   Consequently, backreferences are
40not generally supported by regular expression implementations that are
41strictly based on theoretical frameworks such the correspondence between regular
42expressions and finite automata.  Implementations that do support
43backreferences typically use a backtracking strategy, with exponential
44worst-case computational cost.
47\paragraph*{Two-Pass Backreference Support}
49One approach to backreference support is to use a two-pass approach.
50In the first pass, potential matches are found without enforcing the
51backreference constraint.   In the second pass, the potential matches
52are filtered to include only those that do properly satisfy the
53backreference constraint.   In the first pass, matching proceeds
54using a modified pattern, in which each backreference is replaced
55by a copy of the corresponding capture expression.  This will ensure that
56any string matching the original pattern is matched, but will also
57potentially match many strings such that the substring matched by the
58capture expression is not the same as the substring matched by the
59copied expression.  A second pass is then necessary to discard
60matches that violate the backreference constraint.
62The two-pass approach has two principal benefits.   The first is
63that the modified pattern used in the first pass is a proper
64regular expression.   Regular expression processing may thus
65proceed with any of the standard implementation methods.
66Secondly, the computational cost associated with backreferences
67are eliminated in the first pass.   In the second pass,
68the cost of backreference constraint verification is only paid
69when a substring is found that otherwise fully matches the original
70pattern.   By only considering these instances, the cost of
71backreference processing may be considerably reduced.
73\paragraph{Fixed-Length Capture and Gap Expressions}
75Consider the pattern \verb'('$R_1$\verb')'$R_2$\verb'\1' where
76the parenthesized capture subexpression $R_1$ and the gap subexpression
77$R_2$ each match strings of a known fixed length $L_1$ and $L_2$
78respectively.  Using Parabix methods,
79let $B_i$ be the $i^\text{th}$ basis bitstream of an input text stream $T$.
80Then compute \[S = \bigwedge_{i = 0}^{7} B_i = \textrm{Advance}(B_i, L_1+L_2) \] 
81$S$ is the bitstream identifying all positions at which a given byte is
82the same as the one $L_1+L_2$ positions earlier.   Further, let
83$M_2$ be the bitstream marking positions at which substrings
84matching the regular expression $R_1R_2$ occur.  $M_2$ is computed
85by in the usual way by Parabix regular expression methods.   The
86matches to \verb'('$R_1$\verb')'$R_2$\verb'\1' may then be found
87as those positions of $\textrm{Advance}(M_2, L_1)$ at which
88a run of $L_1$ consecutive one bits is found in $S$.   
90Fortunately, the problem of finding in parallel the runs
91of $n$ consecutive one bits can be implemented in $\log_2 n$
92steps using the Parabix techniques for bounded repetition.
Note: See TracBrowser for help on using the repository browser.