
DLR - Build Your Own Language (part 3)

Assignment statements
Monday 1 September 2008.
 

Dermot Hogan looks at what’s required to build your very own computer language using two new – and remarkable – tools: Microsoft’s Dynamic Language Runtime and ANTLR3 by Terence Parr of the University of San Francisco.


- Download The Source Code

See also:
- Part One - The Basics
- Part Two - The Grammar
- Updating to Antlr 3.1 and the latest DLR


In our simple calculator so far, we’ve been able to evaluate expressions like 1+2/3 or (1+2)/3 – but not much else. But even the simplest calculators have a ’memory’ function that allows you to store and retrieve one result of a calculation. This month, we’ll generalize this memory facility by means of the more general ’assignment statement’.

In most (if not all) programming languages, an assignment statement is one of the most fundamental concepts. When, in C# or Java or Visual Basic, we say x = 1, we are using a shorthand for a complex sequence of smaller, simpler instructions. And when you get down to the microcode, it can be a very complex sequence indeed. But in essence, we’re asking the computer to give a name x to an item of memory (a ’location’) and to set this location to the value 1.

Further, we want to be able to get the value back at some future time, so that we can write 1 + x and get the value 2 returned to us.

To do this, we need to modify the original ANTLR3 grammar to allow some new constructs, specifically an assignment using the = operator.

First, we need to add something that recognises the = and also something that handles names such as x. The assignment bit is simple – just add a rule like this (note that lexer rules always start with a capital letter):

ASSIGN : '=' ;

and an identifier is defined by this:

IDENTIFIER : LCLETTER (LCLETTER | DIGIT)*;

This says (in Antlr speak) that an identifier starts with a lower-case letter and can be followed by zero or more lower case letters or digits.

The sub rules for lower-case letters and digits are:

fragment
DIGIT : '0'..'9';

fragment
LCLETTER : 'a'..'z';

Notice that each of these sub-rules is prefixed by the keyword fragment. This tells Antlr not to match this rule directly, but only to use this rule from within another rule (in this case IDENTIFIER). The reason is that if this wasn’t the case, both IDENTIFIER and LCLETTER would match the input z – and this would be ambiguous: ANTLR would not know what to do. You can see the effect by leaving out the fragment keyword and running the resultant grammar through Antlr. In other words, the effect of the fragment keyword is to protect a sub-rule from being used directly.
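
If it helps to see the shape of the IDENTIFIER rule outside ANTLR, it corresponds to the regular expression [a-z][a-z0-9]*. The comparison (and the little C# check below) is mine and isn’t part of the grammar file:

using System;
using System.Text.RegularExpressions;

class IdentifierCheck {
        static void Main () {
                // Same shape as the IDENTIFIER lexer rule: a lower-case letter
                // followed by zero or more lower-case letters or digits.
                Regex identifier = new Regex("^[a-z][a-z0-9]*$");
                Console.WriteLine(identifier.IsMatch("x1"));    // True
                Console.WriteLine(identifier.IsMatch("1x"));    // False (starts with a digit)
        }
}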

Next, to the parser. The IDENTIFIER lexer rule is used in two places. First, in the parser rule primary. Now when you look at grammars, whether they are old-style yacc or the fancy new ANTLR3, you’ll see lots of ‘primary’s around. A primary is a sort of stand-alone thing, such as a number or an identifier. In this simple case, our primary is something like ‘x’, ‘123’ or a parenthesised expression ‘(x + 2)’. The other place IDENTIFIER is used is in the ‘expr’ rule:

IDENTIFIER ASSIGN expr -> ^(ASSIGN IDENTIFIER expr)

This says that a valid use of an identifier is ‘x = 2’ say. But there’s another part to this rule as well. That’s the part that follows the -> operator. This is a ‘re-write rule’ – and it’s one of the things that make Antlr incredibly powerful. If I didn’t have the re-write rule, the output here would be three tokens: x, =, 2. With the re-write rule, we get a tree (an Abstract Syntax Tree or AST) with the = as the root node and x and 2 as sub-nodes. At first sight, this may seem to be rather an odd thing to want to do, but in fact it’s exactly what’s needed for the next stage of the business – ‘walking the tree’.
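
To make the effect concrete, here is a small check of my own (not something from the article’s download) that prints the rewritten tree in ANTLR’s LISP-like notation. For the input x = 2 it shows = as the root with x and 2 as its children:

// assumes: using System; using Antlr.Runtime; using Antlr.Runtime.Tree;
MyLLexer lexer = new MyLLexer(new ANTLRStringStream("x = 2"));
MyLParser parser = new MyLParser(new CommonTokenStream(lexer));
// ToStringTree() prints the AST in LISP style; the re-write rule gives: (= x 2)
Console.WriteLine(((CommonTree)parser.expr().Tree).ToStringTree());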

Walking the Dog

Having got an AST in the format we want, the next thing is to ‘walk’ it – that is, to traverse the tree, emitting code (or whatever) as we do so. In our case, we don’t want to emit code, but rather we want to emit instructions to the DLR. There are two ways to walk the tree. The first (old) way is by recursive descent – write some code that starts at the root of the tree and works its way down. This is hard work – having done several tree walkers like this, I can vouch for the fact. In addition, it’s tedious and error prone.

The other way is to use Antlr’s tree grammar. There’s nothing magical about a tree grammar. A tree grammar is just a parser that matches an AST rather than a stream of tokens. So if I have a tree that consists of a root node = and two child nodes x and 2, I want a rule that matches just that. And here is one:

program returns [MyLStatement result]
        : e=expr  {$result = new MyLPrint(new SourceSpan(), $e.result);}
        | ^(ASSIGN lvalue=IDENTIFIER rvalue=expr) {
                        MyLIdentifier id;
                        MyLAssign assign;
                       
                        id = new MyLIdentifier(new SourceSpan(), $lvalue.Text);
                        assign = new MyLAssign(new SourceSpan(), id, $rvalue.result);
                        $result = new MyLPrint(new SourceSpan(), assign);
                        }
        ;

This is a rule that matches either an expression or an assignment. Look at the second part. This matches an ASSIGN symbol as a root (that’s what the ^ operator means) followed by an IDENTIFIER and an expr as child nodes. Two labels, lvalue and rvalue, are attached so that the identifier and expression can be referred to subsequently. Following the ^(ASSIGN lvalue=IDENTIFIER rvalue=expr) you’ll see something in curly braces that looks suspiciously like C# code. In fact, it is C# code, and with a couple of substitutions it is what ANTLR will generate as the code for this particular rule. When the AST (that’s what the parser has generated, remember) is matched by this rule, this generated code is executed.

Looking at the rule above, you’ll see a couple of things like $lvalue.Text. This is an instruction to Antlr to use the text of the identifier – the label lvalue was assigned to the identifier by this fragment:

lvalue=IDENTIFIER

The dollar symbol simply tells ANTLR that lvalue is a label which it knows about and, in this case, to retrieve the text.

But what’s going on here with $rvalue.result? Well, if you look at the following rule:

expr returns [MyLExpression result]
        : ^(op=(PLUS|MINUS|MULTIPLY|DIVIDE) left=expr right=expr) {$result = new MyLBinary(new SourceSpan(), $op.Token.Type, $left.result, $right.result);}
        | ^(op=(UNARY_PLUS|UNARY_MINUS) arg=expr) {$result = new MyLUnary(new SourceSpan(), $op.Token.Type, $arg.result);}
        | id=IDENTIFIER {$result = new MyLIdentifier(new SourceSpan(), $id.Text);}
        | n=NUMBER {$result = new MyLConstant(new SourceSpan(), double.Parse($n.Text));}
        ;

...you’ll see that expr returns something – that’s what the syntax expr returns [MyLExpression result] means. Here, I’m returning a DLR type MyLExpression. So the syntax $rvalue.result gets the result (a MyLExpression) returned by the rule expr. Looking again at the first program rule above, that also returns something. In this case, it’s a MyLStatement and I can refer to it in C# code like this:

s = w.program().result;

This is what we want - the link between Antlr and the DLR.

DLR Parser

Let’s now turn to the DLR parts and look at the ParseSourceCode routine in the file MyLanguageContext.cs. This is the key part: assembling all the ANTLR bits I’ve discussed above and hooking them up to the DLR. I’ll go through each step.

MyLSymbolTable.Initialize();

This initializes the symbol table. It’s not really relevant to hooking this up, so I’ll leave discussion of that till later.

lexer = new MyLLexer(new ANTLRStringStream(context.SourceUnit.GetCode()));
tokenStream = new CommonTokenStream(lexer);
parser = new MyLParser(tokenStream);

This code creates a new lexer object with an input stream that in reality will be a line from the console, then creates an ANTLR CommonTokenStream from it and then, finally, creates a new parser object using that as the input. This is just the standard way of creating a lexer/parser combination in ANTLR. Note that at this point we haven’t actually done any parsing or lexing – we’ve just set things up so that the next step will work:

r = parser.program();

This invokes the lexer and parser. If we haven’t made any typos in the single input line from the console when we run the program (a line-feed terminates the parse), we get the parse result in the variable r, with the AST available through its Tree property. Now we need to do some more ANTLR wiring up:

t = (CommonTree)r.Tree;
n = new CommonTreeNodeStream(t);
w = new MyLTree(n);

At this point, we have the AST wrapped in a node stream and a tree walker w ready for the tree grammar to do its magic. Note again that at this point the tree has not been ‘walked’. This is done next:

s = w.program().result;

That’s it: tree walked, job done! Well, not quite – we’ve now got a MyLStatement in s, so we need to ‘generate’ the code for the DLR:

e = s.Generate(this);

From the DLR side, this walks the tree created by our tree walker (in the statement s = w.program().result;) and calls, for each node in the tree, the associated Generate code. If you look inside a DLR node class, say a nice simple one like MyLConstant (which handles simple numbers like 1.234), you’ll see that there are two methods.

class MyLConstant : MyLExpression {

                private readonly object _value;

                public MyLConstant(SourceSpan span, object value)
                        : base(span) {
                        _value = value;
                }

                protected internal override Expression Generate (MyLLanguageContext context) {
                        return Expression.Constant(_value);
                }
        }

The first is the constructor method. This is called by the ANTLR tree walker like this:

n=NUMBER {$result = new MyLConstant(new SourceSpan(), double.Parse($n.Text));}

The second is the Generate method called by the DLR when it walks our tree. This method returns (in this case) a Constant DLR Expression that will contain our value.
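
By way of comparison, here is how a binary node such as MyLBinary might plausibly implement Generate. This sketch is mine rather than a copy from the download: the field names, the MyLParser token constants and, above all, the assumption that both operands already generate double-typed expressions are simplifications (the real code also has to cope with the object-typed global variables):

class MyLBinary : MyLExpression {

                private readonly int _op;                // ANTLR token type (PLUS, MINUS, ...)
                private readonly MyLExpression _left;
                private readonly MyLExpression _right;

                public MyLBinary(SourceSpan span, int op, MyLExpression left, MyLExpression right)
                        : base(span) {
                        _op = op;
                        _left = left;
                        _right = right;
                }

                protected internal override Expression Generate (MyLLanguageContext context) {
                        // Generate the children first, then combine them with the
                        // standard Expression factory methods.
                        Expression l = _left.Generate(context);
                        Expression r = _right.Generate(context);
                        switch (_op) {
                                case MyLParser.PLUS: return Expression.Add(l, r);
                                case MyLParser.MINUS: return Expression.Subtract(l, r);
                                case MyLParser.MULTIPLY: return Expression.Multiply(l, r);
                                case MyLParser.DIVIDE: return Expression.Divide(l, r);
                                default: throw new InvalidOperationException("unknown operator");
                        }
                }
        }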

When I first came across this, I had the feeling that someone had performed a magic trick and a rabbit had appeared from somewhere: I just could not see what had gone on before my eyes, so I’ll summarize what we’ve done:

- we’ve created a basic AST using an ANTLR parser/lexer.
- next, we’ve transformed that AST into a DLR tree using an ANTLR tree grammar. The nodes of our new tree were created by the constructors of the DLR entities we want to handle (such as MyLConstant).
- this DLR tree is then walked again by the DLR by calling the Generate methods of the tree nodes. It’s this last bit that is the real code generator, and this creates the DLR ‘expression tree’ that is executed.
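
Pulling the fragments of ParseSourceCode together, the whole pipeline reads roughly like this. This is a condensed sketch of mine: the explicit declarations and comments are not how the download lays it out, and error handling is omitted:

MyLSymbolTable.Initialize();

// Antlr side: lex and parse the console line into an AST
MyLLexer lexer = new MyLLexer(new ANTLRStringStream(context.SourceUnit.GetCode()));
CommonTokenStream tokenStream = new CommonTokenStream(lexer);
MyLParser parser = new MyLParser(tokenStream);
var r = parser.program();

// Antlr side: wrap the AST in a node stream and walk it with the tree grammar,
// building the MyL (DLR-side) tree as we go
CommonTree t = (CommonTree)r.Tree;
CommonTreeNodeStream n = new CommonTreeNodeStream(t);
MyLTree w = new MyLTree(n);
MyLStatement s = w.program().result;

// DLR side: ask each MyL node to emit its DLR expression, then wrap the result
// in a lambda via the global scope's LambdaBuilder
Expression e = s.Generate(this);
LambdaBuilder lb = MyLSymbolTable.GlobalScope.FinishScope(e);
LambdaExpression le = lb.MakeLambda();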

What’s in a Name?

There’s one bit I skimmed over a little earlier – names. We need somewhere to store the identifier names that we are using so that, if we have two statements like:

x = 1
1 + x

...the DLR needs to be able to retrieve the name ‘x’ and its associated value for use in the second statement. This is where the ‘symbol table’ comes in. A symbol table is just a dictionary of names. In more complicated languages, names have ‘scopes’; that is, they can come into and go out of existence depending on where they are used or declared. But here, I’ve just got one scope – the ‘global’ scope. So when a name like x is declared, it stays declared.

This is handled by the machinery in the classes MyLSymbolTable and MyLScope. Since there is only ever one symbol table, it’s a good idea to make its methods static:

class MyLSymbolTable {

                private static MyLScope _globalScope;
                public static MyLScope GlobalScope {
                        get { return MyLSymbolTable._globalScope; }
                }

                public static void Initialize () {

                        _globalScope = new MyLScope("Global");
                }

                public static Expression LookupName (string name) {

                        return _globalScope.LookupName(name);

                }

                public static Expression GetOrMakeGlobal (string name) {

                        return _globalScope.GetOrMakeGlobal(name, typeof(object));
                }

        }

Here you can see that a single scope of type MyLScope is created by Initialize. I could have made the methods of MyLScope static as well and combined them into the MyLSymbolTable class, since there’s also only one scope, but I may want to be a bit cleverer later on, so I’ve declared MyLScope as:

class MyLScope {


                private Dictionary<string, Expression> _variables = new Dictionary<string, Expression>();

                private LambdaBuilder _block;
                public LambdaBuilder Block {
                        get { return _block; }
                }

                public MyLScope(string name) {

                        _block = Utils.Lambda(typeof(object), name, Annotations.Empty);
                }

                public Expression GetOrMakeGlobal (string name, Type type) {
                        Expression v;

                        if (!_variables.TryGetValue(name, out v)) {
                                v = _block.CreateGlobalVariable(SymbolTable.StringToId(name), type);
                                _variables[name] = v;
                        }
                        return v;
                }

                public Expression LookupName (string name) {
                        Expression v;

                        _variables.TryGetValue(name, out v);
                        return v;
                }

                public LambdaBuilder FinishScope (Expression body) {

                        _block.Body = body;
                        return _block;
                }
        }

There are just two DLR bits here, but they are quite fundamental. The first is in the constructor, where a LambdaBuilder is created. The main purpose of a LambdaBuilder is to create a DLR LambdaExpression, but before this can happen the body of the LambdaBuilder (here _block) needs to be set to the expression tree that we’ve generated via Antlr and the DLR Generate call.
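
To close the loop on how names reach this machinery: an identifier node would plausibly fetch (or create) its global variable from the symbol table in its Generate method, along these lines. Again, this is my sketch rather than the code from the download:

class MyLIdentifier : MyLExpression {

                private readonly string _name;

                public MyLIdentifier(SourceSpan span, string name)
                        : base(span) {
                        _name = name;
                }

                protected internal override Expression Generate (MyLLanguageContext context) {
                        // Ask the single, global symbol table for the variable Expression,
                        // creating it on first use.
                        return MyLSymbolTable.GetOrMakeGlobal(_name);
                }
        }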

The very last bit is then to create a LambdaExpression from the LambdaBuilder:

lb =  MyLSymbolTable.GlobalScope.FinishScope(e);
le = lb.MakeLambda();

These two lines return a complete LambdaExpression to the DLR – a LambdaExpression corresponding to a .NET method body.
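
If the idea of a LambdaExpression as a method body seems abstract, this stand-alone snippet shows the same idea in miniature. It uses the ordinary System.Linq.Expressions API rather than the sample’s DLR hosting path, so it is an illustration only:

using System;
using System.Linq.Expressions;

class LambdaDemo {
        static void Main () {
                // A lambda whose body is just the constant 42.0: compile it to a
                // delegate and invoke it, much as the DLR does with our tree.
                LambdaExpression le = Expression.Lambda(Expression.Constant(42.0));
                Delegate d = le.Compile();
                Console.WriteLine(d.DynamicInvoke());   // prints 42
        }
}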

Running the Program

To run the program, just compile the C# solution in Visual Studio 2008 and press F5. You should see a console appear into which you can type simple assignments and expressions like:

x = 1
y = 3
x + y

...and so on.

Next month...

We’ll look at adding functions like ’sqrt’.


Dermot Hogan is the chief architect of the Ruby In Steel IDE and he is currently involved in the design and implementation of the new Sapphire language for the Dynamic Language Runtime.


Forum

  • DLR - Build Your Own Language (part 3)
    1 September 2008, by raven

    Hi,

    I’d like to ask a few questions.

    1. What version of the DLR is this program using? It seems like the newest files in the DLR date to June 9, which is pretty far from being the latest. The namespace of the DLR AST hasn’t switched to System.Linq.Expressions, and core DLR functionality hasn’t moved into System.Scripting. I can’t find the corresponding DLR version in the IronPython source code changesets released on CodePlex.

    2. What’s the merit of separating ANTLR AST construction from the language’s own AST construction? If the tree grammar generated the DLR AST directly instead of the custom MyL AST, it would save a whole pass of tree walking, and it would also save the effort of writing this whole custom AST class hierarchy, wouldn’t it? Parsing support for language tools (such as IDEs) can still be done with the ANTLR AST, I guess. I’d really like to know if there’s anything I missed. If it’s about code maintainability, I’d have second thoughts on whether a whole pass of tree walking is worth it.

    • DLR - Build Your Own Language (part 3)
      1 September 2008

      The DLR is still changing at a rapid rate. I got this DLR from Iron Python beta 2, the logic being that if the DLR works for Iron Python, it’s going to work for me (though in fact, it turned out that ToyScript was broken). I tend to avoid the latest and greatest DLR since there’s a good chance that something will have broken or changed, and to be honest, I don’t want to be the person who finds out. With zero documentation and just slightly more comments in the DLR code, it can be very time-consuming :)

      I don’t think you can really generate the DLR tree directly. The purpose of walking the Antlr AST via the tree grammar is to generate a DLR-compatible AST, so the tree grammar transforms the base AST into a DLR AST (that’s one of the main purposes of Antlr tree grammars, though it may be done better with Antlr string templates – but I haven’t looked at those yet). You might be able to do a simple DLR AST directly from the Antlr parser, but I think it would become unmaintainable pretty rapidly.

      Personally, I’m not concerned about an extra pass (or even two). Antlr is reasonably fast (tho’ not as fast as yacc) and the difference between 5ms and 10ms (say) isn’t really noticeable. It’s far better to go for separation of function between the parser (whose job is to check syntax, produce error messages and generate a valid AST) and the tree grammar (whose job here is to transform the valid AST into some other valid AST). You might well go through the base AST again to optimize it before transforming it to a DLR tree. But that’s just my own view.

      Dermot

      • DLR - Build Your Own Language (part 3)
        1 September 2008, by ravenex

        Thanks for the reply. It’s always good to hear rationales behind the design :-)

        So the DLR came from IronPython Beta 2. I was just comparing it with the source code changesets and missed the beta releases...for sure, releases have more documentation out-of-the-box than the source code changesets. And the documentation for DLR is increasing as each new changeset comes out, which is good. The DLR’s changing so rapidly that I guess the team themselves could keep ToyScript sample in sync with it, so they just removed it in the more recent changesets, which ain’t so good.

        On tree generation, what I’m trying to say is that, the MyL sample in this series goes like:

        source code => (via ANTLR built-ins) -> parse tree => (via ANTLR lexer/parser grammar) -> ANTLR AST => (via ANTLR tree grammar) -> base AST => (via recursive "Generate" call) -> DLR tree

        XRuby, if you’ve heard of it, goes through almost the same steps. Only that the final target wasn’t DLR tree, but instead JVM bytecode. I asked the project lead for the design rationale, but to my surprise he said there wasn’t much of any rationale for this, and the main concern was to separate the parser grammar from base AST building.

        I believe there’s a way you can attach extra information to the ANTLR AST; subclassing CommonTree will pretty much do the job. Walking the ANTLR AST for different purposes (such as type checking, optimization rewrites or code generation) can be done by writing more tree grammars, one for each walk over the tree. The "Generate" method on the base AST classes can alternatively be implemented with the Visitor pattern, and an ANTLR tree grammar can be a replacement for such a pattern. I don’t find any information that lives in the base AST that doesn’t come from the ANTLR AST. Since the base AST is now only used as temporary storage for code generation, I think doing codegen directly via an ANTLR tree grammar will suffice, without having to bother maintaining a separate class hierarchy. And we save a step in the compilation:

        source code => (via ANTLR built-ins) -> parse tree => (via ANTLR lexer/parser grammar) -> custom ANTLR AST => // (via ANTLR tree grammar) -> type check, constant folding, simple flow analysis... // => (via ANTLR tree grammar) -> DLR tree

        I’d like to hear more on your opinion.

        • DLR - Build Your Own Language (part 3)
          2 September 2008

          I see what you’re getting at ... yes, I think you could do what you want by making all the DLR AST classes based on CommonTree and then generating the nodes explicitly in the parser. However, it would clutter up the parser with C# code, and that seems to me to be the wrong place to do it. But on the other hand, for a simple parser like MyL it wouldn’t be too bad.

          For a more complicated parser like Ruby (we use our own which is currently based on Antlr 2.7), I really think it wouldn’t be a good idea. I’ve kept our Ruby parser quite clean by doing all of the complicated stuff in C# and using virtual stubs where necessary in the parser (though actually, it’s the lexer that’s the killer in Ruby). For something like Ruby where you can go cross-eyed trying to figure out what Matz actually meant, that’s pretty important. (I’ve had a look at the Xruby one - I wouldn’t like to maintain it :) ). I didn’t use the Antlr 2.7 tree grammar thing because it seemed to be pretty limited - I did use it once to try to build the expression parser for our Visual Studio debugger, but it turned out to be far simpler to walk the tree in C#.

          I’ve just written a pretty complicated tree grammar (500+ lines, and it will be a couple of thousand lines when it’s all done) for a new project I’m working on, and with Antlr 3.1 it’s a dream. Again, it seems to me to be far better to put the error handling, messages, syntax checking and the like in the parser, so that the parser passes a clean stream to the tree grammar, and then handle the transformation and symbol table building there.

          I suspect that my view is colored by the fact that I have to fix our bugs pretty fast (that’s what our customers expect, after all) and anything I can do to make this easier, I’ll go out of my way to do, even if it does make the thing go slower.

          Dermot

          • DLR - Build Your Own Language (part 3)
            3 September 2008, by raven

            Thank you very much for sharing your rationales.

            I’m not a fan of mixing two languages in one source file; embedding Java or C# code in an ANTLR grammar looks so foreign...worse, there’s not gonna be enough IDE support: either it’ll support ANTLR grammar’s grammar, or it’ll support Java/C#’s, but not both at once. So, I agree that it’s better just to write more of the programmatic stuff in C# and let ANTLR do what it does best — parsing. I’ve been trying to convince myself that walking the AST an extra pass isn’t that bad, and that’s why I asked for rationales.

            Last semester, when I was working on an assignment for my compiler course, I wrote the lexer and parser by hand for a Pascal-like little language. That wasn’t too bad, but I thought it would have been a lot easier if the parser had been generated instead. I did the type checking, constant propagation and the like via the Visitor pattern, and that was a pain. As I was writing more code, I refactored the AST classes from time to time; at first the Visitor base class was also hand-written, so whenever I added or removed an AST class I’d have to go through all my Visitor classes to sync them back up. It was so annoying that I decided to generate the visitors with a Ruby script.

            Walking the tree with an ANTLR tree grammar is an alternative to that, but I’ve never actually tried it, because I don’t like mixing languages within the same source file.

            XRuby is based on ANTLR 2.7 too. They tried to move the project to ANTLR 3.x, but it doesn’t seem they’ve succeeded. And yes, lexing Ruby is really hard. It seems like the lexer needs to know what an identifier is bound to, so that it’ll generate the right token type for the parser. That’s almost like a parser of its own inside the lexer. I can imagine how hard it is to build Ruby in Steel. Good job.

            Keep up the good work, I’m looking forward to reading more on this series :-)

      • DLR - Build Your Own Language (part 3)
        1 September 2008, by ravenex

        Ah, there’s a typo in the last reply: I meant to say maybe the DLR team *couldn’t* keep ToyScript in sync with the latest sources of the DLR, so they removed it in the recent changesets.

      • DLR - Build Your Own Language (part 3)
        1 December 2008, by gotenks

        > I don’t think you can really generate the DLR tree directly.

        I think you can. MyJScript does it :).

