1 | \documentclass{article} |
---|
2 | \usepackage{amsmath} |
---|
3 | \usepackage{fullpage} |
---|
4 | \usepackage{amssymb} |
---|
5 | \newcommand\ceil[1]{\lceil#1\rceil} |
---|
6 | \DeclareMathOperator\shl{\texttt{<<}} |
---|
7 | |
---|
8 | \title{Parabix Methods for Backreferences} |
---|
9 | \author{Robert D. Cameron} |
---|
10 | |
---|
11 | \begin{document} |
---|
12 | \maketitle |
---|
13 | |
---|
14 | \paragraph*{Backreferences.} |
---|
15 | |
---|
16 | The {\em backreference} problem is to support patterns of the |
---|
17 | form \verb'('$R_1$\verb')'$R_2$\verb'\1', where $R_1$ and $R_2$ are regular expressions |
---|
18 | and \verb'\1' is a backreference that must match the same string matched by $R_1$. |
---|
19 | Thus the pattern \verb:([a-z])\1: matches consecutive pairs of characters consisting of the same |
---|
20 | two lower case letters. The subexpression enclosed within parentheses is said to |
---|
21 | {\em capture} a string to be later matched by a corresponding numbered backreference. |
---|
22 | Multiple parenthesized capturing groups and corresponding backreferences may be used within a |
---|
23 | pattern. |
---|
24 | For example, consider the following pattern for 6-letter palindromes: |
---|
25 | \verb:\b([a-z])(a-z)(a-z)\3\2\1\b: where \verb:\b: is a zero-width assertion |
---|
26 | matching a word boundary. Here there are three capturing groups and |
---|
27 | three corresponding backreferences to the captured substrings in reversed |
---|
28 | order. A search for matches to this pattern within |
---|
29 | the string ``red, redder and reddest'' will find the substring ``redder.'' |
---|
30 | %Another example is a simplified pattern to match XML elements having simple ASCII names: |
---|
31 | %\verb'<([A-Za-z]*)>.*</\1>'. |
---|
32 | %Here the backreference enforces the requirement that the end tag of the element has |
---|
33 | %the same element name as the start tag. |
---|
34 | |
---|
35 | Backreferences are supported by a number of libraries and tools for |
---|
36 | regular expression processing, including classical tools such as grep. |
---|
37 | However, from a theoretical perspective, backreferences are not regular. |
---|
38 | That is, they are not features that can be implemented by simple translation to |
---|
39 | other regular expression features. Consequently, backreferences are |
---|
40 | not generally supported by regular expression implementations that are |
---|
41 | strictly based on theoretical frameworks such the correspondence between regular |
---|
42 | expressions and finite automata. Implementations that do support |
---|
43 | backreferences typically use a backtracking strategy, with exponential |
---|
44 | worst-case computational cost. |
---|
45 | |
---|
46 | |
---|
47 | \paragraph*{Two-Pass Backreference Support} |
---|
48 | |
---|
49 | One approach to backreference support is to use a two-pass approach. |
---|
50 | In the first pass, potential matches are found without enforcing the |
---|
51 | backreference constraint. In the second pass, the potential matches |
---|
52 | are filtered to include only those that do properly satisfy the |
---|
53 | backreference constraint. In the first pass, matching proceeds |
---|
54 | using a modified pattern, in which each backreference is replaced |
---|
55 | by a copy of the corresponding capture expression. This will ensure that |
---|
56 | any string matching the original pattern is matched, but will also |
---|
57 | potentially match many strings such that the substring matched by the |
---|
58 | capture expression is not the same as the substring matched by the |
---|
59 | copied expression. A second pass is then necessary to discard |
---|
60 | matches that violate the backreference constraint. |
---|
61 | |
---|
62 | The two-pass approach has two principal benefits. The first is |
---|
63 | that the modified pattern used in the first pass is a proper |
---|
64 | regular expression. Regular expression processing may thus |
---|
65 | proceed with any of the standard implementation methods. |
---|
66 | Secondly, the computational cost associated with backreferences |
---|
67 | are eliminated in the first pass. In the second pass, |
---|
68 | the cost of backreference constraint verification is only paid |
---|
69 | when a substring is found that otherwise fully matches the original |
---|
70 | pattern. By only considering these instances, the cost of |
---|
71 | backreference processing may be considerably reduced. |
---|
72 | |
---|
73 | \paragraph{Fixed-Length Capture and Gap Expressions} |
---|
74 | |
---|
75 | Consider the pattern \verb'('$R_1$\verb')'$R_2$\verb'\1' where |
---|
76 | the parenthesized capture subexpression $R_1$ and the gap subexpression |
---|
77 | $R_2$ each match strings of a known fixed length $L_1$ and $L_2$ |
---|
78 | respectively. Using Parabix methods, |
---|
79 | let $B_i$ be the $i^\text{th}$ basis bitstream of an input text stream $T$. |
---|
80 | Then compute \[S = \bigwedge_{i = 0}^{7} B_i = \textrm{Advance}(B_i, L_1+L_2) \] |
---|
81 | $S$ is the bitstream identifying all positions at which a given byte is |
---|
82 | the same as the one $L_1+L_2$ positions earlier. Further, let |
---|
83 | $M_2$ be the bitstream marking positions at which substrings |
---|
84 | matching the regular expression $R_1R_2$ occur. $M_2$ is computed |
---|
85 | by in the usual way by Parabix regular expression methods. The |
---|
86 | matches to \verb'('$R_1$\verb')'$R_2$\verb'\1' may then be found |
---|
87 | as those positions of $\textrm{Advance}(M_2, L_1)$ at which |
---|
88 | a run of $L_1$ consecutive one bits is found in $S$. |
---|
89 | |
---|
90 | Fortunately, the problem of finding in parallel the runs |
---|
91 | of $n$ consecutive one bits can be implemented in $\log_2 n$ |
---|
92 | steps using the Parabix techniques for bounded repetition. |
---|
93 | \end{document} |
---|
94 | |
---|