DLR - Build Your Own Language (part 3)

In our simple calculator so far, we’ve been able to evaluate expressions like 1+2/3 or (1+2)/3 - but not much else. But even the simplest calculators have a ’memory’ function that allows you to store and retrieve one result of a calculation. This month, we’’ generalize this memory facility by the means of the more general ’assignment statement’.

In most (if not all) programming languages, an assignment statement is one of the most fundamental concepts. When in C# or Java or Visual Basic, we say x = 1 we are using a shorthand for a complex sequence of smaller, simpler instructions. And when you get down to the microcode, it can be a very complex sequence indeed. But in essence, we’re asking the computer to give a name x to an item of memory (a ’location’) and to set this location to the value 1.

Further, we want to be able to get the value back at some future time, so that we can write 1 = x and get the value 2 returned to us.

To do this, we need to modify the original ANTLR3 lexer to allow some new constructs, specifically an assignment using the = operator.

First, we need to add something that recognises the = and also something that handles names such as x. The assignment bit is simple – just add a rule like this (note that lexer rules always start with a capital letter):

This says (in Antlr speak) that an identifier starts with a lower-case letter and can be followed by zero or more lower case letters or digits.

fragment

DIGIT : '0'..'9';



fragment

LCLETTER : 'a'..'z';

Notice that each of these sub-rules is prefixed by the keyword fragment. This tells Antlr not to match this rule directly, but only to use this rule from within another rule (in this case IDENTIFIER). The reason is that if this wasn’t the case, both IDENTIFIER and LCLETTER would match the input z – and this would be ambiguous: ANTLR would not know what to do. You can see the effect by leaving out the fragment keyword and running the resultant grammar through Antlr. In other words, the effect of the fragment keyword is to protect a sub-rule from being used directly.

Next to the parser. The IDENTIFIER lexer rule is used in two places. First, in the parser rule primary. Now when you look at grammars whether they are old style yacc or the fancy new ANTLR3, you’ll see lots of ‘primary’s around. A primary is a sort of stand-alone thing, such as a number or an identifier. In this simple case, our primary is something like ‘x’, or ‘123’ or a parenthesised expression ‘(x + 2)’. The other place IDENTIFIER is used is in the ‘expr’ rule:

This says that a valid use of an identifier is ‘x = 2’ say. But there’s another part to this rule as well. That’s the part that follows the -> operator. This is a ‘re-write rule’ – and it’s one of the things that make Antlr incredibly powerful. If I didn’t have the re-write rule, the output here would be three tokens: x, =, 2. With the re-write rule, we get a tree (an Abstract Syntax Tree or AST) with the = as the root node and x and 2 as sub-nodes. At first sight, this may seem to be rather an odd thing to want to do, but in fact it’s exactly what’s needed for the next stage of the business – ‘walking the tree’.

Walking the Dog

Having got an AST in the format we want, the next thing is to ‘walk’ it – that is, to traverse the tree, emitting code (or whatever) as we do so. In our case, we don’t want to emit code, but rather we want to emit instructions to the DLR. There are two ways to walk the tree. The first (old) way is by recursive descent – write some code that starts at the root of the tree and works its way down. This is hard work – having done several tree walkers like this, I can vouch for the fact. In addition, it’s tedious and error prone.

The other way is to use Antlr’s tree grammar. There’s nothing magical about a tree grammar. A tree grammar is just a parser that matches an AST rather than a lexical script. So if I have a tree that consists of a root node = and two child nodes x and 2, I want a rule that matches just that. And here is one:

program returns [MyLStatement result]

        : e=expr  {$result = new MyLPrint(new SourceSpan(), $e.result);}

        | ^(ASSIGN lvalue=IDENTIFIER rvalue=expr) {

                        MyLIdentifier id;

                        MyLAssign assign;

                        

                        id = new MyLIdentifier(new SourceSpan(), $lvalue.Text);

                        assign = new MyLAssign(new SourceSpan(), id, $rvalue.result);

                        $result = new MyLPrint(new SourceSpan(), assign);

                        }

        ;

This is a rule that matches either an expression or an assignment. Look at the second part. This matches an ASSIGN symbol as a root (that’s what the ^ operator means) followed by an IDENTIFIER and an expr as child nodes. Two labels, lvalue and rvalue, are set as labels so that the identifier and expression can be referred to subsequently. Following the ^(ASSIGN lvalue=IDENTIFIER rvalue=expr) you’ll see something in curly braces that looks suspiciously like C# code. In fact, it is C# code and with a couple of substitutions it is what ANTLR will generate as the code for this particular rule. When the AST (that’s what the parser has generated, remember) is matched for this rule, this generated code is executed.

Looking at the rule above, you’ll see a couple of things like $lvalue.Text. This is an instruction to Antlr to use the text of the identifier – the label lvalue was assigned to the identifier by this fragment:

The dollar symbol simply tells ANTLR that lvalue is a label which it knows about and, in this case, to retrieve the text.

But what’s going on here $rvalue.result? Well, if you look at the following rule:

expr returns [MyLExpression result]

        : ^(op=(PLUS|MINUS|MULTIPLY|DIVIDE) left=expr right=expr) {$result = new MyLBinary(new SourceSpan(), $op.Token.Type, $left.result, $right.result);}

        | ^(op=(UNARY_PLUS|UNARY_MINUS) arg=expr) {$result = new MyLUnary(new SourceSpan(), $op.Token.Type, $arg.result);}

        | id=IDENTIFIER {$result = new MyLIdentifier(new SourceSpan(), $id.Text);}

        | n=NUMBER {$result = new MyLConstant(new SourceSpan(), double.Parse($n.Text));}

        ;

...you’ll see that expr returns something – that’s what the syntax expr returns [MyLExpression result] means. Here, I’m returning a DLR type MyLExpression. So the syntax $rvalue.result gets the result (a MyLExpression returned by the rule expr. Looking again at the first program rule above, that also returns something. In this case, it’s a MyLStatement and I can refer to it in C# code like this:

DLR Parser

Lets now turn to the DLR parts and look at the ParseSourceCode routine in the file MyLanguageContext.cs. This is the key part of assembling all the ANTLR bits I’ve discussed above and hooking up to the DLR. I’ll go through each step.

This initializes the symbol table. It’s not really relevant to hooking this up, so I’ll leave discussion of that till later.

lexer = new MyLLexer(new ANTLRStringStream(context.SourceUnit.GetCode()));

tokenStream = new CommonTokenStream(lexer);

parser = new MyLParser(tokenStream);

This code creates a new lexer object with an input stream that in reality will be a line from the console, then creates an ANTLR CommonTokenStream from it and then, finally, creates a new parser object using that as the input. This is just the standard way of creating a lexer/parser combination in ANTLR. Note that at this point we haven’t actually done any parsing or lexing – we’ve just set things up so that the next step will work:

This invokes the lexer and parser and if we haven’t made any typos in the single input line from the console when we run the program (line-feed will terminate the parse) we will get an AST in the variable r. Now we need to do some more ANTLR wiring up:

t = (CommonTree)r.Tree;

n = new CommonTreeNodeStream(t);

w = new MyLTree(n);

At this point, we have a nice new AST in w ready for the tree grammar to do its magic. Note again that at this point the tree has not been ‘walked’. This is done next:

That’s it: tree walked, job done! Well, not quite – we’ve now got a MyLStatement in s, so we need to ‘generate’ the code for the DLR:

From the DLR side, this walks the tree created by our tree walker (in the statement s = w.program().result; and calls, for each node in the tree, the associated Generate code. If you look inside a DLR class say a nice simple one like MyLConstant (which handles simple numbers like 1.234), you’ll see that there are two methods.

class MyLConstant : MyLExpression {



                private readonly object _value;



                public MyLConstant(SourceSpan span, object value)

                        : base(span) {

                        _value = value;

                }



                protected internal override Expression Generate (MyLLanguageContext context) {

                        return Expression.Constant(_value);

                }

        }

The first is the constructor method. This is called by the ANTLR tree walker like this:

n=NUMBER {$result = new MyLConstant(new SourceSpan(), double.Parse($n.Text));}

The second is the Generate method called by the DLR when it walks our tree. This method returns (in this case) a Constant DLR Expression that will contain our value.

When I first came across this, I had the feeling that someone had performed a magic trick and a rabbit had appeared from somewhere: I just could not see what had gone on before my eyes, so I’ll summarize what we’ve done:

we’ve created a basic AST using an ANTLR parser/lexer.

next, we’ve transformed that AST into a DLR tree using an ANTLR tree grammar. The nodes of our new tree were created by the constructors of the DLR entities we want to handle (such as MyLConstant.

this DLR tree is then walked again by the DLR by calling the Generate methods of the tree node. It’s this last bit that is the real code generator and this creates the DLR ‘expression tree’ that is executed.

What’s in a Name?

There’s one bit I skimmed over at little earlier – names. We need somewhere to store the identifier names that we are using so that if we have two statements like:

...the DLR needs to be able to retrieve the name ‘x’ and its associated value for use in the second statement. This is where the ‘symbol table’ comes in. A symbol table is just a dictionary of names. In more complicated languages, names have ‘scopes’; that is, they can come into and go out of existence depending on where they are used or declared. But here, I’ve just got one scope – the ‘global’ scope. So when a name like x is declared, it stays declared.

This is handled by the machinery in the classes MyLSymbolTable and MyLScope. Since there is only every one symbol table, it’s a good idea to make its methods static:

class MyLSymbolTable {



                private static MyLScope _globalScope;

                public static MyLScope GlobalScope {

                        get { return MyLSymbolTable._globalScope; }

                }



                public static void Initialize () {



                        _globalScope = new MyLScope("Global");

                }



                public static Expression LookupName (string name) {



                        return _globalScope.LookupName(name);



                }



                public static Expression GetOrMakeGlobal (string name) {



                        return _globalScope.GetOrMakeGlobal(name, typeof(object));

                }



        }

Here you can see that a single scope of type MyLScope is created by Initialize. I could have made the methods of MyLScope static as well and combined them into the MyLSymbolTable class, since there’s also only one scope, but I may want to be a bit cleverer later on, so I’ve declared MyLScope as:

class MyLScope {





                private Dictionary<string, Expression> _variables = new Dictionary<string, Expression>();



                private LambdaBuilder _block;

                public LambdaBuilder Block {

                        get { return _block; }

                }



                public MyLScope(string name) {



                        _block = Utils.Lambda(typeof(object), name, Annotations.Empty);

                }



                public Expression GetOrMakeGlobal (string name, Type type) {

                        Expression v;



                        if (!_variables.TryGetValue(name, out v)) {

                                v = _block.CreateGlobalVariable(SymbolTable.StringToId(name), type);

                                _variables[name] = v;

                        }

                        return v;

                }



                public Expression LookupName (string name) {

                        Expression v;



                        _variables.TryGetValue(name, out v);

                        return v;

                }



                public LambdaBuilder FinishScope (Expression body) {



                        _block.Body = body;

                        return _block;

                }

        }

There are just two DLR bits here, but they are quite fundamental. The first is in the constructor where a LamdaBuilder is created. The main purpose of a LambdaBuilder is to create a DLR LambdaExpression but before this can happen, the LambdaBuilder (here _block) needs to be set to the expression tree that we’ve generated via Antlr and the DLR Generate command.

The very last bit is then to create a LambaExpression from the LambaBuilder in :

lb =  MyLSymbolTable.GlobalScope.FinishScope(e);

le = lb.MakeLambda();

This two lines return a complete LambdaExpression to the DLR, a LambdaExpression corresponding to a .NET method body.

Running the Program

To run the program, just compile the C# solution in Visual Studio 2008 and press F5. You should see a console appear into which you can type simple assignment expressions like:

Dermot Hogan is the chief architect of the Ruby In Steel IDE and he is currently involved in the design and implementation of the new Sapphire language for the Dynamic Language Runtime.