In this series, Dermot explains how to integrate
a programming language into Visual Studio 2005. Part
Two: Colour Coding (see also: Part
One and Part
Three)
Visual Studio provides two
ways of adding colour to your code. One uses the traditional
COM interfaces and you would probably use C++. The other
is via the Managed Package Framework (MPF). While they
both use the same fundamental Visual Studio interface,
you’ll typically find that using the COM interfaces directly
leads to one approach, while using the MPF leads to
another.
Microsoft does not include any parsing tools with Visual
Studio. Nor does it recommend any - you either have to
write your own or use external tools. The most widely
available tools (and the ones that have been around the
longest) are Lex and Yacc, deriving initially from AT&T’s
Unix in the 1970s. In their more modern forms,
these come as Flex and Bison respectively. Flex and Bison
are available from several places on the web (try here: http://dinosaur.compilertools.net)
and are free, though they do come with licence conditions (Bison is released under the GNU GPL).
They are essentially C or C++ tools and when given an
input ‘grammar’ produce C or C++ output files.
You then compile the files in the usual way and incorporate
them into Visual Studio. The Visual Studio SDK comes with a package,
the Babel package, that has hooks into the code produced
by Bison/Flex, and using Babel you can wire up a simple
language colouriser quite quickly.
The other alternative, via the MPF system, means that
you have to find a C# variant of Bison and Flex. I have
found a couple, but the results were not really satisfactory
and, after some experimentation and a dead end or so,
I abandoned this line. There is however a much better
way to work with the MPF and that’s to use a parsing
system called Antlr (www.antlr.org).
There are two things to consider when using Antlr. The
first is that the MPF lexer is more complicated to get
going. The other ‘problem’ is that Antlr
generates a ‘recursive descent’ LL(k) parser, quite
different from the more common Yacc/Bison LALR(1) parser.
In my view, this isn’t a real problem at all, as
there are many advantages to using an LL(k) parser over
a LALR one. But be aware of the difference when you start
out: it’s not easy to change direction once
you are halfway into building your new Visual Studio
Package.
The track I’ll look at here is the simpler Bison/Flex
route and the example I’ll use to illustrate the
techniques is an assembler. Actually, it’s a Microchip
assembler for the dsPIC30 digital signal processors.
Microchip dsPIC programming is a whole different world
from Visual Studio programming, and you might wonder
how the two are related.
Fundamentally, I didn’t (and still don’t)
like Microchip’s proprietary IDE (MPLAB). I wanted
a different way to program my dsPIC chips. One part of
this was to integrate Microchip’s GNU-based C compiler
into Visual Studio; this is really pretty simple to do.
Another part was to host the dsPIC assembler in Visual
Studio (this is the part I’ll cover here). The final
bit is to program the dsPIC chip from Visual Studio.
That’s not so easy! But the end result is that
I have a nice Visual Studio based IDE for developing
dsPIC devices. And I don’t intend to go back to
Microchip’s MPLAB any time soon.
Flexing Your Muscles
Most parsing systems split the job of making sense of
an input file into two parts: first the ‘lexer’ (here
Flex) reads the raw input file and spits out a stream
of ‘tokens’. The real parser (Bison) then
makes sense of the token stream, checking it for correct
syntax and usually building a ‘parse tree’ out
of the tokens. For simple colourisation, the parser isn’t
really necessary, but since Flex and Bison work as a
pair, it makes sense to have a Bison parser as
well as the Flex tokeniser.
An assembler is essentially a line based language. Each
line of assembly instructions stands on its own and isn’t
usually related to another line. This makes it quite
a bit easier to parse than something like C++ (very,
very difficult). The input to Flex is a set of
regular expressions which describe how a token is composed.
Here are the definitions for a ‘name’ and
a ‘label’ in Microchip assembler:
name [a-z_\.][0-9a-z_\.]*
label {name}\:
If you aren’t familiar with regular expressions,
in English this reads: ‘the first character of
a name is a lower case letter between a and z, an
underscore or a dot. This is followed by zero or
more characters from the same set, with digits allowed as well’.
A label is a name followed by a colon character.
And so on. Regular expressions define the input grammar
to Flex and it’s essential to understand them before
writing a lexer or parser. However, the good news is
that the Flex manual is pretty easy to follow and is
quite thorough. Also, the regular expressions that Flex
understands are pretty simple examples of the beasts (there
are no named extensions or ‘lazy’ matching), so
it’s really very quick to build up a lexer that
recognises the tokens that you want to colour.
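If you want to try the two patterns outside Flex, here is a minimal sketch that restates them for the C++ standard library's std::regex. The function names isName and isLabel are made up for this illustration, and the sketch assumes the lexer is built case-sensitive, so only lower case letters match.

```cpp
#include <regex>
#include <string>

// The 'name' and 'label' definitions from the Flex file, restated as
// ECMAScript patterns. Inside a character class the dot is literal.
// Assumption: matching is case-sensitive, as in the English description.
static const std::regex kName(R"([a-z_.][0-9a-z_.]*)");
static const std::regex kLabel(R"([a-z_.][0-9a-z_.]*:)");

// True when the whole string is a 'name'.
bool isName(const std::string& text) {
    return std::regex_match(text, kName);
}

// True when the whole string is a 'label': a name plus a trailing colon.
bool isLabel(const std::string& text) {
    return std::regex_match(text, kLabel);
}
```

With these, isName("_loop.1") and isLabel("start:") hold, while isName("1abc") does not, because a name may not start with a digit.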
In the Flex definition file, you then instruct Flex
to emit tokens to Bison like this:
{label} { return LABEL; }
This just tells Flex to generate C code to emit a LABEL
token when a ‘label’ regular expression
is matched. Bison picks this up and decides if it makes
sense in a given context. You generally need some sort
of parser even if all you want to do is colour a particular
token. The reason is that the lexer cannot usually distinguish
between tokens which are used in two different contexts.
If you rely on just the lexer to categorise your tokens,
you may find that you end up with some odd or undesirable
colouring in certain places. Visual Studio itself doesn’t
require that you use a parser, but since the Babel package
hooks into the code that Bison produces, you really have to
implement a parser, even if it’s a trivial one.
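To see why context matters, consider the line "loop: mov count, w0". To a lexer, mov and count both look like plain names; only their position in the line says which is the instruction and which is an operand. The sketch below is a hand-written stand-in for that idea, not the code Bison generates. The Kind enum, the Classified struct and classifyLine are all hypothetical names, and it assumes a simplified line layout of optional label, then opcode, then operands, with no comments.

```cpp
#include <algorithm>
#include <sstream>
#include <string>
#include <vector>

enum class Kind { Label, OpCode, Operand };

struct Classified {
    std::string text;
    Kind kind;
};

// Classify the words of one assembler line by their position in the line.
// Assumed layout: [label:] opcode [operand[, operand...]]
std::vector<Classified> classifyLine(std::string line) {
    // Treat commas as separators so operands come out clean.
    std::replace(line.begin(), line.end(), ',', ' ');

    std::vector<Classified> out;
    std::istringstream words(line);
    std::string word;
    bool seenOpCode = false;
    while (words >> word) {
        if (!seenOpCode && word.back() == ':') {
            out.push_back({word, Kind::Label});    // name followed by a colon
        } else if (!seenOpCode) {
            out.push_back({word, Kind::OpCode});   // first plain name on the line
            seenOpCode = true;
        } else {
            out.push_back({word, Kind::Operand});  // everything after the opcode
        }
    }
    return out;
}
```

A real grammar would express the same rule declaratively, but the point is the same: the colour for a name depends on where it appears, which is information the lexer alone doesn't have.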
Coding
Once you’ve defined the Flex core lexer and a
simple Bison parser, you can incorporate them into a
DLL which can be called from the Visual Studio core.
It’s pretty easy to build this DLL, since Visual
Studio has a good wizard that builds the skeleton for
you.
You can build a Babel-compatible DLL quite easily using
the Visual Studio Language Package wizard. Since Flex
and Bison don’t come with Visual Studio, you’ll need to
modify the default Flex/Bison tool location to where
you’ve installed them.
If you are just creating a simple colouring service,
there are just two or three methods to implement in the
service.cpp file (generated for you by the Language Package
wizard). This file contains overrides for the methods
in stdservice_.cpp, which does the real work of implementing
the IBabelService COM interface. Normally, you wouldn’t
be bothered with the base classes, but they can be useful
in setting breakpoints to find out why something hasn’t
coloured correctly.
There are three methods you need to override. The first
is getCommentFormat:
override const CommentFormat* Service::getCommentFormat() const {
    static CommentFormat commentFormat = { ";", NULL, NULL, true };
    return &commentFormat;
}
This really just allows the IDE to comment selections
of code with the correct line comment character.
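To make concrete what ‘commenting a selection’ amounts to, here is a rough sketch of the effect. It is not Visual Studio’s own code, and commentSelection is a hypothetical helper; it assumes the first field of the CommentFormat above (";") is the line-comment prefix.

```cpp
#include <sstream>
#include <string>

// Prefix every line of the selected text with the line-comment string.
// Each output line ends with a newline.
std::string commentSelection(const std::string& selection,
                             const std::string& lineStart) {
    std::istringstream in(selection);
    std::string line;
    std::string out;
    while (std::getline(in, line)) {
        out += lineStart + line + "\n";
    }
    return out;
}
```

Calling commentSelection("mov w0, w1\nnop\n", ";") gives back the same two lines, each starting with a semicolon.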
Next, override the getTokenInfo method. This method
connects a token type (an assembler REGISTER, say)
with a colour class:
override const TokenInfo* Service::getTokenInfo() const {
    static TokenInfo tokenInfoTable[] = {
        { REGISTER, ClassRegister, "operand ('%s')", CharKeyword }, …
Finally, you need to define the colours:
override const ColorInfo* Service::getColorInfo() const {
    static ColorInfo colorInfoTable[] = {
        { ClassRegister, "Text", "color: darkgray" }, …
Additionally, you need to declare an enum which is used
to index the arrays:
enum MyColorClass {
    ClassOpCode = ClassDefaultLast + 1,
    ClassLabel,
    ClassDirective,
    ClassSymbol,
    ClassLiteral,
    ClassOperator,
    ClassSpecial,
    ClassRegister
};
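The reason the enum starts at ClassDefaultLast + 1 is that the custom classes carry on numbering where the built-in ones stop, so a class value can be turned into a table position by subtracting the first custom value. The sketch below shows that arithmetic in isolation. It is an illustration only: ClassDefaultLast is given a made-up value here (the real one comes from the Babel headers), the table is a plain array of strings rather than Babel’s ColorInfo, styleFor is a hypothetical helper, and every style except the darkgray register entry is a placeholder. Whether Babel performs exactly this lookup internally is an assumption.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in; the real ClassDefaultLast comes from the Babel headers.
const int ClassDefaultLast = 5;

enum MyColorClass {
    ClassOpCode = ClassDefaultLast + 1,
    ClassLabel,
    ClassDirective,
    ClassSymbol,
    ClassLiteral,
    ClassOperator,
    ClassSpecial,
    ClassRegister
};

// One style string per custom class, in the same order as the enum.
static const char* const demoStyles[] = {
    "color: blue",      // ClassOpCode    (placeholder)
    "color: maroon",    // ClassLabel     (placeholder)
    "color: purple",    // ClassDirective (placeholder)
    "color: black",     // ClassSymbol    (placeholder)
    "color: red",       // ClassLiteral   (placeholder)
    "color: black",     // ClassOperator  (placeholder)
    "color: teal",      // ClassSpecial   (placeholder)
    "color: darkgray"   // ClassRegister  (from the colour table above)
};

// Look up the style for a custom colour class by its offset from the
// first custom class.
std::string styleFor(MyColorClass colorClass) {
    const int index = colorClass - ClassOpCode;
    const int count = static_cast<int>(sizeof(demoStyles) / sizeof(demoStyles[0]));
    assert(index >= 0 && index < count);
    return demoStyles[index];
}
```

The practical consequence is that the order of the enum and the order of the table have to stay in step: add a class in one and you must add its entry at the same position in the other.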
Here’s the result – a nicely coloured assembly
code fragment hosted in Visual Studio. With my package
installed in the Visual Studio IDE, I can now edit assembler
and C files, use the Microchip compiler, assembler and
linker to create a ‘hex’ file and even program
it via a USB link into a Microchip microcontroller:
With a combination of Flex, Bison and the Babel package,
it’s relatively easy to build a language colouriser
for Visual Studio.
Troubleshooting
It has to be said that while all the above looks to
be simple and straightforward, debugging it can be tricky.
There are really two problem areas. The first is that
you build your package, implement the Babel interface,
load up your file and – it’s not coloured!
Totally monochrome, in fact.
The root of this problem probably lies in the Registry.
Visual Studio is COM based and all Visual Studio extensions
use COM interfaces to communicate with any external code.
When Visual Studio tries to determine if you want colouring
for your code, it looks up the extension in the Visual
Studio part of the Registry, finds out the GUID (CLSID)
of the COM component that does the colouring and co-creates
it (instantiates it via COM). Additionally, the Babel
package that comes with the Visual Studio SDK must be
installed – look for the BabelPackage.msi installer
in the SDK. The registration of your language colouring
package is handled automatically by a post-build step
(set up by the Language Package wizard). This calls the
VS SDK ‘regit.exe’ program to do the business.
Typically, if you get no colourisation at all, one of
the above steps has gone astray. The trouble is that,
being COM, it can be very difficult and frustrating to
find out exactly what is wrong.
The second potential problem is when there is some colourisation,
but not quite what you expected. The problem here is
usually in the Flex definitions. However, there isn’t
an easy way to debug these. Flex and Bison are ‘state-driven’:
they use tables of numbers to decide what to do next.
With a simple language like an assembler, tracking down
these problems is normally quite easy, if a little laborious.
It’s a different ball game if you get into a complicated
language like C++ or Ruby.
I’ve been of the opinion for a long time that
the speed of development is largely related to how quickly
you can debug something. Debugging complex Yacc/Lex state
tables isn’t to be undertaken lightly. It’s
for this reason that I abandoned the LALR parsers and
turned to Antlr - a tool with far better debugging and
ease of use. But for simple, fast colouring, Bison and
Flex are hard to beat.
The registry is the root of (most) COM evil. The trouble
is that there isn’t a simple way of finding out what is
wrong. You have to do it the hard way: by eyeball.
In the next part of this series, Dermot looks at MSBuild...
April 2006