
BW2 :: the bitwise supplement :: http://www.bitwisemag.com/2

Building your own Language – without tears
Dermot Hogan looks at what’s required to build your very own computer language using two new – and remarkable – tools: Microsoft’s Dynamic Language Runtime and ANTLR3 by Terence Parr from the University of San Francisco.

2 April 2008

by Dermot Hogan

In this series, I’m going to start at the bottom of the DLR pond and work upwards towards the light. Specifically, I’m going to construct an ANTLR tree grammar for a calculator and show you how to wire this into a DLR framework. This is about as simple as you can get with the DLR and still do something meaningful. It’s a lot simpler than the Microsoft example, ToyScript, which comes with the IronPython distribution. I’ve tried hard to pare the calculator example down to the absolute minimum required to do something non-trivial. I don’t want to denigrate ToyScript – it’s an excellent example of how to use the DLR – but in my view it’s not quite introductory enough.



See also:
- Part One
- Part Three

To build your own language in the DLR you need two things: first, the DLR itself and, second, ANTLR3. You can get the DLR from CodePlex. This is the IronPython distribution, which (according to Microsoft) will contain the latest version of the DLR for the near future. Currently this is DLR beta 1, the first official beta of the DLR, released a couple of weeks ago. To save you downloading the whole of IronPython when all you want is the DLR, I’ve built a complete project sample which contains the DLR (and ANTLR3). The DLR source code is distributed under the Microsoft Public License (there’s a copy in the zip) and is free. Similarly, ANTLR3 is distributed under the BSD license and is also free. The current ANTLR3 release is 3.0.1 (you can get it from here), but we will need ANTLR 3.1, which is currently in beta. I have included a copy of the ANTLR3 runtimes (build 2008-02-27.17), but only the Java jar files, not the source code. So besides the zip file, all you need to build and run the example is Visual Studio 2008 with C#, plus Java.

As I mentioned last month, the DLR essentially interprets (and compiles) Abstract Syntax Trees (ASTs) into the .NET CLR Intermediate Language (IL) and executes it. There are several ways you can build an AST – by hand, or using something like yacc, or the easy way using ANTLR. And since you need an AST before you can really do anything with the DLR, it makes sense to start at the ANTLR end of things.

Jumping in

To start off, download the zip file CalculatorDLR.zip and unzip it into a suitable directory. You’ll see that there are three main sub-directories: Antlr contains the ANTLR3 components, Microsoft Scripting holds the DLR source and, lastly, Calculator contains the additions to the DLR for our simple DLR Calculator example.

To try things out (without any explanation), you can load the solution MyL.sln into Visual Studio 2008 (I think the Express Edition of C# will work, though I haven’t tried it) and just press F5. You should see a console window appear.

Typing a few simple expressions like 1 + 1 should give the correct answer. If you want to use the debugger to start examining the internals of the DLR, then I suggest setting a breakpoint on ParseSourceCode in the file MyLLanguageContext.cs and running the program. The breakpoint will fire once you’ve typed an expression in the console and you can go from there.

A pedestrian’s guide to ANTLR3

I want to be upfront about this: ANTLR has a learning curve – and it can be steep. It’s a very powerful tool for generating ASTs for a number of languages and it’s widely used both in industry and in universities. However, the good news is that there is an excellent book written by Terence Parr (‘The Definitive ANTLR Reference: Building Domain-Specific Languages’ from Pragmatic Programmers) which goes into great detail about how to use ANTLR3. While I’d recommend this book to anyone who is serious about using ANTLR3, you don’t need to buy it to get started. The ANTLR website has a lot of documentation on the subject and I’d particularly suggest the Five minute introduction to ANTLR 3.

Incidentally, when I was deciding on which tool to use to write the Ruby In Steel parser, I decided on ANTLR not from any deep understanding of language tools and compilers, etc. – I hardly knew what an AST was at the time! – but because ANTLR was clearly going somewhere with an active and vigorous community behind it. It’s not a decision I’ve regretted.

The calculator example has three ANTLR files – TestLexer.g, TestParser.g and TestTree.g. The first is the ‘lexer’ definition. The function of a lexer is to split up a stream of input text into ‘tokens’ that can be used by the next stage – the ‘parser’ defined in TestParser.g. The output of the parser is an AST which is then fed into the third stage – the ‘tree grammar’ which determines how the AST is ‘walked’. It’s the output of this last stage that’s used in the DLR. This might sound like a long-winded way to get into the DLR, but if you break it down into stages, you’ll see that it’s really pretty simple stuff.
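To make the three stages concrete, here is a minimal Python sketch of the same shape: a lexer, a parser that builds a tree, and a walker. It is not what ANTLR generates and it is not the code in the zip. The function names, the tuple-based tree and the whitespace-separated input are all assumptions made for illustration, and the walker here simply evaluates the tree, whereas in the calculator example the walker's output is what gets handed to the DLR.

```python
# Three-stage sketch: lexer -> parser -> tree walker.
# Not ANTLR output; every name here is hypothetical.

def lex(text):
    """Stage 1: split input text into tokens (here, whitespace-separated)."""
    return text.split()

def parse(tokens):
    """Stage 2: build an AST for left-associative '+' and '-' over integers.

    Each node is either an int (a leaf) or a tuple (operator, left, right).
    """
    tree = int(tokens[0])
    for i in range(1, len(tokens), 2):
        op, operand = tokens[i], int(tokens[i + 1])
        if op not in ("+", "-"):
            raise ValueError(f"unexpected token {op!r}")
        tree = (op, tree, operand)
    return tree

def walk(node):
    """Stage 3: walk the AST; this walker just evaluates it."""
    if isinstance(node, int):
        return node
    op, left, right = node
    lhs, rhs = walk(left), walk(right)
    return lhs + rhs if op == "+" else lhs - rhs

tokens = lex("1 + 2 - 4")
tree = parse(tokens)
result = walk(tree)
```

Each stage only sees the output of the one before it, which is why the stages can be described (and debugged) separately.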

So let’s have a look at the lexer first. The way ANTLR works is that it takes a ‘.g’ file and produces a ‘.cs’ file. If you’ve unzipped CalculatorDLR.zip, open a command console and change to the top-level directory of the unzipped files. You should see a file antlr.bat there. This contains the code for firing up Java and running ANTLR. It sets the Java CLASSPATH to relative directories, so it will only work if you run it in this directory. Type the following (note that the filename ‘TestLexer.g’ is case sensitive):

Antlr Calculator\Parser\TestLexer.g

and you should see ANTLR’s output in the console. ANTLR3 will now have created a new version of the C# file TestLexerLexer.cs in the Test\Parser directory. This C# file will be used in the DLR program to analyze command line input; the TestLexer.g file only describes to ANTLR what the C# file should look like, and ANTLR creates the C# file from that grammar. Now open the TestLexer.g file in Visual Studio (or any other text editor). If you open the solution file MyL.sln, you’ll see it in the Calculator project under the Parser folder.

The first thing you’ll see is a comment block. ANTLR comments work in the same way as in C/Java/C++. That is, a comment block is delimited by /* ... */ and line comments start with //. There then follow a few sections that describe the lexer to ANTLR and determine what output is generated: for example, C# code with a namespace of MyL.parser.AST.

The first real lexer instruction (or ‘rule’) is the line

PLUS : '+';

This says that when the lexer encounters the character + it emits a token with a value PLUS. Usually, we’re not interested in what the actual (numeric) value of PLUS is, just that we’ve encountered a + symbol and a PLUS token has been emitted for use by something else, normally the parser.
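The point about the numeric value not mattering can be shown with a small Python sketch. This is an analogy only: the TokenType enum, the lookup table and the token_for helper are hypothetical names, and the '-' character for MINUS is an assumption, since the rule itself isn't shown here.

```python
from enum import Enum, auto

class TokenType(Enum):
    # auto() hands out arbitrary distinct numbers, much as a lexer generator would.
    PLUS = auto()
    MINUS = auto()

# Hypothetical table mapping a single character to its token type.
# '-' for MINUS is assumed, not taken from the grammar file.
SINGLE_CHAR_TOKENS = {"+": TokenType.PLUS, "-": TokenType.MINUS}

def token_for(ch):
    """Return the token type for a single character, or None if it has none."""
    return SINGLE_CHAR_TOKENS.get(ch)

# A consumer compares against the name, never the underlying number.
is_plus = token_for("+") is TokenType.PLUS
```

Whatever number sits behind PLUS, the code that consumes tokens only ever asks “is this a PLUS?”.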

Continuing down the file, there are other simple token mapping instructions, MINUS, MULTIPLY and so on until we reach the rule WS. This is a bit more complicated.

WS : (' ' | '\t')+ {Skip();};

This rule matches ‘whitespace’, that is any sequence of tabs and spaces. The fact that it will match a sequence is indicated by the ‘+’ modifier. Without the ‘+’, the rule would just match a single space or tab – these alternatives are indicated by the vertical bar ‘|’. The ‘+’ modifier means ‘one or more’. There are also two other modifiers – ‘?’ which means “zero or one” and ‘*’ meaning “zero or more”. Lastly, there’s a C# instruction contained between two curly braces. This is a call to the ANTLR method Skip which instructs ANTLR not to emit a token. The effect of this rule is to simply ‘swallow’ whitespace.
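If the three modifiers are new to you, they behave much like the same symbols in regular expressions. The Python sketch below uses the standard re module as a stand-in. The syntax is regex, not ANTLR grammar, and the variable names are purely illustrative.

```python
import re

# The character class [ \t] stands in for the grammar's (' ' | '\t') alternatives.
one_or_more = re.compile(r"[ \t]+")   # like (...)+
zero_or_one = re.compile(r"[ \t]?")   # like (...)?
zero_or_more = re.compile(r"[ \t]*")  # like (...)*

text = " \t 1"
plus_match = one_or_more.match(text).group()     # the whole run of whitespace
opt_match = zero_or_one.match(text).group()      # at most one character
star_on_digit = zero_or_more.match("1").group()  # matches, but consumes nothing
plus_on_digit = one_or_more.match("1")           # no match at all
```

The difference between ‘+’ and ‘*’ only shows up when there is nothing to match: ‘*’ succeeds with an empty match, while ‘+’ fails.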

Finally, at the end of the file, there are two other rules: NEWLINE, which matches a newline sequence of either a CR-LF pair or just a single LF,

NEWLINE : '\r'? '\n';

and a rule INT which matches a sequence of digits like ‘123’. The two dots here indicate a ‘range’ of characters from 0 to 9:

INT : ('0'..'9')+;
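Putting the rules together, the behaviour of the whole lexer can be approximated in a few lines of Python. This is a hand-written sketch, not the C# that ANTLR generates: the RULES table, the tokenize function and the regex patterns are my own stand-ins, and the '-' and '*' characters for MINUS and MULTIPLY are assumptions because those rules aren't reproduced above.

```python
import re

# Regex stand-ins for the grammar rules described in the text.
# '-' and '*' for MINUS and MULTIPLY are assumed.
RULES = [
    ("PLUS", r"\+"),
    ("MINUS", r"-"),
    ("MULTIPLY", r"\*"),
    ("WS", r"[ \t]+"),
    ("NEWLINE", r"\r?\n"),
    ("INT", r"[0-9]+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in RULES))

def tokenize(text):
    """Return (token name, matched text) pairs, swallowing whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        if match.lastgroup != "WS":  # the equivalent of the Skip() action
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens

tokens = tokenize("12 + 3\r\n")
```

Here tokens ends up holding an INT, a PLUS, another INT and a NEWLINE; the spaces never appear because the WS matches are dropped rather than emitted.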

Next month, I’ll look at the parser side of things and show you how a tree grammar is used to connect into the DLR itself.


Dermot Hogan is the chief architect of the Ruby In Steel IDE and he is currently involved in the design and implementation of the new Sapphire language for the Dynamic Language Runtime.