source: docs/Working/re/compilation.tex @ 3653

Last change on this file since 3653 was 3623, checked in by cameron, 6 years ago

Explain while loop termination; tone down long-addition claims

File size: 3.8 KB
3Using the marker stream and MatchStar concept, we now
4outline our compilation algorithm.  This is implemented
5in a Java program.   First the regular expression is
6parsed and represented as an abstract syntax tree.
7Second, the various character classes used in the
8regular expression are extracted.  The character
9class compiler of the Parabix framework is invoked
10to generate the bit stream equations required for
11each character class.  Then the syntax tree is
12walked to generate code for each type of regular expression
13structure as follows.
16\item An initial marker stream $M_0$ is set to be all
17ones, indicating that every position in the input file
18is a potential  match if we have not yet examined any
19pattern elements.
21\item If we have a regular expression formed as an
22alternation of subexpressions, we compile each of these
23in turn, providing the current input marker stream as input
24to each of them.  The final marker streams of the
25compiled forms of each subexpression are then just combined
26using a bitwise-or to produce the overall final marker
27stream of the alternation.  That is, a match occurs at
28any position that can be reached by matching any one of the
31\item If we have a marker stream formed as a concatenation of
32subexpressions, then we compile each of these in turn, providing
33the output marker stream of each compilation as the input
34marker stream for the compilation of the next pattern element.
36\item If a regular expression is a character class expression
37or a single character, then we form the bitwise-and of the
38current marker stream and the character class stream to
39filter out current marker positions that do not have
40a match to the class.   The result is then shifted forward
41one position to identify the successful matches.
43\item If a regular expression is an optional expression of the
44form {\tt R?} for some subexpression {\tt R}, then the
45output marker stream is simply formed as the bitwise-or
46of the input marker stream (zero occurrences of {\tt R}
47matched) and the output stream produced by compiling {\tt R}
48in the context of the current input marker stream (one
49occurrence matched).
51\item If a regular expression is a repetition of a character
52class of the form {\tt C*}, then the compiled form uses the MatchStar operation
53to produce the output marker stream from the input stream
54and the compiled stream for character class {\tt C}.
56\item If a regular expression is a repetition of a non
57character class of the form {\tt R*}, then a Pablo while loop
58is created conditioned on a control marker stream still
59having bits marking match positions to  be considered.   The body of the
60while consists of the compiled form of the expression {\tt R},
61taking as input the marker stream at the beginning of the iteration
62and producing as output all positions that can be reached from
63the input positions in a single step.   These output positions
64are candidates for further iteration, but positions that
65have already been considered are removed.  This guarantees
66termination of the loop; iteration continues only if a new
67marker position is reached that has not been previously
68considered as an output.
69The final output is the bitwise-or of matches determined in each loop
72\item If a regular expression is a bounded repetition of the form \verb:R{m,n}:,
73then it is compiled according to the equivalent form
74consisting of $m$ concatenations of \verb:R: followed
75by $n-m$ concatenations of \verb:R?:.
77\item If a regular expression is a bounded repetition of the form \verb:R{m,}:,
78then it is compiled according to the equivalent form
79consisting of $m$ concatenations of \verb:R: followed
80by \verb:R*:.
83The output of the regular expression compiler is then fed
84as input to the Pablo compiler of the Parabix tool chain.
85The result is then compiled with a C++ compiler linked with
86the Parabix run-time libraries.
Note: See TracBrowser for help on using the repository browser.