source: proto/pebble/trunk/src/lexicalStream/readme.txt @ 1249

Last change on this file since 1249 was 1249, checked in by shermer, 8 years ago

improvements to lexicalStream package.
The unused position-tracking code in CharClassDescriptor? was removed.
LexicalDescriptor? was not doing much, so its functionality was pulled down and removed.
other minor changes.

  • Property svn:mime-type set to text/plain
File size: 6.8 KB
Line 
1Package lexicalStream
2---------------------
3
4
5------------------
6Contents
7------------------
8
9This package contains objects for handling lexical streams in pebble.
10See the terminology section below for any unexplained terms.
11
12It contains an ASTVisitor:
13
14        MoveLexdefsToBranchStarts
15                                This replaces any lexical node by a temporary variable and
16                                moves the computation of the node's stream (as an assignment
17                                to the temporary) to the beginning of the nearest enclosing
18                                branch.  It ensures that all character classes (shifted or
19                                not) required for any sequences present are also defined.
20                                It then removes any redundancy in definitions (including
21                                that at sub-branches) and expands the definitions
22                                for the lexical types into basic boolean bitwise and shift
23                                operations.  The tree is left without any lexical node types.
24                               
25                               
26The other classes are:
27
28        LexicalCollector
29                                This is essentially a modified type-aware symbol table for
30                                lexical streams.  There is one of these objects for each
31                                branch in the program.  It collects information about the
32                                lexical streams in the branch and does the heavy lifting
33                                for eliminating redundancy, figuring out shifting, and
34                                placing streams in the correct order on a branch.  These
35                                structures get linked into a tree by MoveLexdefs...
36       
37        CharClassDescriptor
38                                A class that describes a character class.  This class, in
39                                conjunction with util.Range, creates the AST of boolean operations
40                                to represent a character class.  (It converts from specifier
41                                to bitwise boolean AST).
42       
43        SequenceDescriptor
44                                A class that describes a sequence composed of characters or
45                                character classes.  It creates the expressions that form the
46                                stream that marks occurrences of its sequence.
47                               
48                               
49                               
50
51------------------
52Examples
53------------------
54       
55Two pebble token types create the lexical streams.  They are the character-
56class tokens, which are enclosed in square brackets, and the sequence tokens,
57enclosed in double-quotes.
58
59An example character-class specifier is
60        [h-kH-K]
61
62It is composed of two ranges, (h-k and H-K), but it needs only a single
63lexical stream [H-Kh-k] to represent it.
64
65However, a sequence specifier
66        "abc"
67needs the three lexical streams [a], [b], and [c], shifted forward 2, 1,
68and 0, respectively.  It also needs a lexical (sequence) stream marking
69the end of each a-b-c sequence.
70
71Character classes are allowed as elements in a sequence. For instance,
72the sequence
73        "a[0-9][0-9]"
74matches the character 'a' followed by two digits.
75
76
77
78------------------
79Ordinary operation
80------------------
81
82This is an older description of the operation of this package that
83may need further updating to be accurate.
84
85In operation, the visitor MoveLexdefsToBranchStarts is applied directly to the
86output AST of the grammar Pebble.  From there the transformations in
87ast.transforms get applied.
88
89In walking the AST, MoveLexdefsToBranchStarts finds any occurrences of
90character-class or sequence literals, and then it hands said literal to
91LexicalCollector for processing.
92
93LexicalCollector first checks to see if the given literal is already
94known.  If so, it returns the descriptor (instance of subclass of
95LexicalDescriptor) for that known literal.  If not, it creates a
96descriptor and returns that.  For sequence literals, it also recursively
97creates the necessary character-class stream descriptors.
98
99The returned descriptor is stored in hash table in LexicalCollector, with
100the key being the specifier for the stream.  Since specifiers are unique, this
101enables later stages to access the descriptors quickly.  (a previous implementation
102had ...TokenData classes store the descriptor, but I eliminated that to help decouple
103the lexical stream descriptor classes from the token data classes.)
104
105MoveLexdefsToBranchStarts also links the Collectors into a tree structure
106reflecting the static nesting of their containing branches.
107
108After the visitor-driven processing, the LexicalCollector may
109be asked to generate the lexical struct classes (for the pablo/C++ backend)
110which are required to store the lexical computation.  This should only be
111invoked for the top LexicalCollector in a tree.
112
113
114
115------------------
116Terminology
117------------------
118
119The term "specifier" is used as a unique string per lexical stream.  No
120two different lexical streams should yield the same specifier.  Ensure
121that this property stays intact.
122
123The term "variableName" is used as a unique identifier for a lexical
124stream.  It is be formed from the specifier by util.VariableNamer.
125
126The term "descriptor" is used for an instance of CharClassDescriptor or
127SequenceDescriptor, and these instances sort-of contain and coordinate
128all of the information about the lexical stream.  I note exceptions in
129that (1) whether a character class is complemented is not stored in the
130CharClassDescriptor, and (2) positions at which a character class are
131required are stored in a LexicalCollector.CharClassPositionsPair, not in
132the CharClass Descriptor.
133
134The term "canonical" means a unique description of a lexical stream; I
135use "canonize" to mean the process of turning a description into a
136canonical one.
137
138The term "branch" is used to mean a section of code that is either
139        one of the blockStatements (then clause or else clause) of an IfStatement, or
140        an ExecutionBlock.
141
142By "Lexical node type", we mean either of the two AST node types
143         CharClassASTB or
144         SequenceASTB.   
145We do not consider ScanThroughASTB or ShiftASTB as lexical node types.
146They are considered "manual" non-bitwise operations.
147
148On output, lexical streams (streams derived from lexical node types)
149are divided into three groups:
150                Character classes,
151                Shifted character classes, and
152                Sequences
153               
154
155
156
157------------------
158Architecture notes
159------------------
160
161I am currently unsure about the distinction between the lexical stream
162descriptor classes and the ...TokenData classes in the parser package.  The
163descriptors are meant to be attached to canonized objects rather than the
164rough parsed objects. There should be one descriptor per lexical stream
165needed, even if it occurs in many places.  This is currently not the case,
166in that (say) a character class needed in one sub-branch of the AST does not
167have the same descriptor as the same character class needed in an unrelated
168sub-branch of the AST.   
169
170Anyhow, a clean distinction between the two types of classes(TokenData and
171Descriptors), or a merging of them, might be helpful.
172
173The collector now uses a (descriptor, positionList) pair for character classes.
174This adds complicating logic to the character-class half with no corresponding
175complication for the sequence half.  The collector is getting fairly complicated
176and probably should be split into smaller objects.
177
178
179
Note: See TracBrowser for help on using the repository browser.