wiki:compiler/newdesign

Version 12 (modified by eamiri, 9 years ago) (diff)

--

Visitor Pattern

Usually compiler has a number of passes and each pass traverses the code line by line and each line of code has a specific type from among a few different types. The work that a pass does on each line of code depends on the type of the line. The simplest implementation for this is that each pass, checks the type of each line of code. The main drawback of this solution is that the compiler code cannot be extended easily by introducing new operations.

The solution to this problem is visitor pattern. A good explanation of visitor pattern is available here. To implement it a python ideas suggested here and here are useful.

Python AST module has two classes NodeVisitor? and NodeTransformer? that implement visitor pattern. Subclasses of the first class cannot modify the node they are visiting and for this purpose the second class should be subclassed. Visitor methods for nodes of type X are named visit_X and there is a generic_visit() method as well.

Symbol Table

1- Must support line numbering and collapsing.
2- Given varname should return uses and/or defs (how to handle use of similar names in different scopes?)
3- Given line number should return variables used and/or defined in that line.

To handle 1, the numbering will be cascaded.
numbers 1,2,3,... show lines of code with all bodies collapsed.
if line m is a while loop, m-n means line n in the body of while.
if line m is an if statement then m-0-i refers to line i in the true branch and m-1-i refers to line i in false branch. The problem with this scheme is that numbers in a tuple are not semantically homogeneous.
Implementation of line numbers will use tuples.

Common Subexpression Elimination

This should be written as an independent pass that can be called whenever needed.

Boolean Simplification

In a couple of places in the code we need to use Boolean logic to simplify the code. This happens in Normalization and in Code Simplification. In both cases we need simplify one formula. In code simplification this pretty straightforward, as the formula has only one operator. In Normalization it might be of arbitrary width. It's important to apply proper Boolean algebra rules in situation.

AST

Currently we translate python AST to our AST. Is this necessary? An alternative solution might be using python AST and potentially subclassing it.

Code Inlining

Implementation note: PyPy? supports code inlining, but after a first look at the source look, it seems that we cannot use their code. Inlining in PyPy? is considered a backend optimization. After they construct a flow graph and annotating it with information about variable types, they pass it to the code inliner. If we want to use inlining we may need to accept other data structures and passes of their translation process.

Unique Prefix for Names

Sometimes compiler needs to generate prefixes for names to make sure the name is unique. Prefixes of some names may need to be consistent with each other. (For example when modify variable names in a function to inline the function, we want a unique prefix for all variables). We need a component that generates a unique name for a given name. This unique name might be based on some extra input (like function name in inlining example).

Supporting Different Architectures

This is a draft. We want to be able to configure the compiler to use different architectures and different implementations for each high level operation. The following steps need to be taken.

  1. We need a description of high level operations.
  2. We need a generic family of registers with generic instruction set. High level operations should be implemented on this generic machine.(BitBlaze? Vine IL may be a starting point for this).
  3. For a given machine, each family of registers and operations available on them should be specified. We need a grammar for this spec.
  4. Instructions to communicate between register families and memory should be in spec.
  5. Given 2,3,4 we need a program that generates proper low level implementations for high level operations.
  6. Compiler decides which low level implementation for each high level operation is chosen. This may depend on input provided by the user in a config file. For example there are two implementations for Advance. User may specify preferences among these implementations and the compiler uses the first available implementation.

Note that the C output of the compiler is mostly a series of assembly instructions (intrinsics) for which we don't need to worry about register allocation. Also we have a couple of high level constructs for conditionals and loops. So the above scheme works for C as well.

So we have the following elements

  1. High level operations, available to the programmer.
  2. Low level implementations.
  3. A description of low level implementation of each high level operation. So the compiler can choose one if there is more than one implementation available. The choice of compiler can be guided by a config file edited by programmer.