See also part one of
this series and the conclusion, part three
“The best laid schemes o’ mice an’ men gang aft agley.” Or, if you’re not familiar with the Scots dialect of Robert Burns: the best laid schemes of mice and men have a distinct tendency to go a bit pear-shaped. And so it often goes with programming. What seems at first sight to be a simple operation can turn out to be, shall we say, a tad more complicated.
This month Dermot provides a handy guide to some of the more obscure corners of .NET, Windows, VB and VMS, with occasional excursions into Lowland Scots, English slang, the poetry of Burns and the rules of snooker.
Last month, I confidently set out to ‘roll my own’ .NET
communications control. Initially, it all went pretty
well. In .NET, it’s easy to start a thread and
perform whatever I/O operations are required for the
communications port (COM1, etc.) on that thread, leaving
the main program thread free to handle the keyboard and
display. It’s also easy to combine everything in
a user control and use properties to communicate with
the control. In last month’s column, I created
a ‘Comm’ control which allowed you to open
a communication port, set and clear the DTR signal line
and be notified in the main program thread when the input
signal line corresponding to DSR changed.
So far, so good. It looked as though it would be all
downhill from there. My aim was to end up with a workable
.NET Comm control, equivalent to (or better than) Microsoft’s
somewhat limited MSComm control from earlier versions
of Visual Basic. However, there was a ‘gotcha’ lurking!
The problem comes from the nature of the WaitCommEvent
API. This, in the simple form used last month, does just
what it says – it waits for a control signal to
change on the COM port. But if the thread that handles
the COM events issues the WaitCommEvent call, and then
the main thread attempts to set DTR, the program hangs.
It doesn’t crash, nor gobble up all the CPU in
a loop or indeed do anything uncivilized. It simply does
nothing!
Behind The Eight Ball?
After some head scratching, the light dawned: the WaitCommEvent API is a ‘blocking’ I/O request. While it is in progress, any other I/O request on the same port must wait ‘behind’ it until it has finished. So the request to set DTR is queued behind the WaitCommEvent I/O request and never gets to run until WaitCommEvent completes, even though it was issued from a different thread. Worse still, the main thread goes into a ‘wait’ state from which you can’t wake it. All you can do is kill off the program or satisfy WaitCommEvent by asserting DSR on the port.
In other words (not from Rabbie Burns this time), we’re
snookered.
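The real Win32 calls only run on Windows, so here is a portable Python toy model of the queueing just described. `SerialisedPort` and its methods are hypothetical names for illustration. A `threading.Lock` held for the whole life of a request stands in for the port's one-at-a-time I/O queue. Unlike the real call, `set_dtr` takes a timeout, so the demo can report being stuck instead of actually hanging.

```python
import threading
import time


class SerialisedPort:
    """Hypothetical toy model of a COM port opened WITHOUT overlapped I/O.

    Each request holds the port for its whole lifetime, so later requests
    queue behind it, even when they come from another thread.
    """

    def __init__(self):
        self._io_lock = threading.Lock()      # models the one-at-a-time I/O queue
        self.dsr_changed = threading.Event()  # stands in for the DSR line changing
        self.dtr = False

    def busy(self):
        return self._io_lock.locked()

    def wait_comm_event(self):
        # Like a synchronous WaitCommEvent: occupies the port until DSR changes.
        with self._io_lock:
            self.dsr_changed.wait()

    def set_dtr(self, value, timeout):
        # Queue behind any request in progress. The timeout exists only so the
        # demo can report the hang; the real call would simply never return.
        if not self._io_lock.acquire(timeout=timeout):
            return False
        try:
            self.dtr = value
            return True
        finally:
            self._io_lock.release()


def demo():
    port = SerialisedPort()
    waiter = threading.Thread(target=port.wait_comm_event)  # the comm thread
    waiter.start()
    while not port.busy():                       # wait until the request is active
        time.sleep(0.001)
    stuck = not port.set_dtr(True, timeout=0.1)  # main thread: queued behind it
    port.dsr_changed.set()                       # "assert DSR": the wait completes
    waiter.join()
    freed = port.set_dtr(True, timeout=0.1)      # the port is free again
    return stuck, freed


if __name__ == "__main__":
    print(demo())  # (True, True)
```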
Now it turns out that this behaviour is entirely by design. The reason is that Windows XP, by way of Windows NT, owes much of its design to an older operating system called VMS from Digital Equipment Corporation. VMS itself drew on a much earlier operating system called RSX. And the primary I/O interface in RSX was a system call named QIO (that is, Queue I/O), which exhibited exactly this behaviour: I/O requests to a device were queued behind one another and executed serially, one at a time. Hence the name. There’s another interesting piece of information: all three operating systems were designed by a certain Dave Cutler, who was lured away from a declining Digital Equipment by Bill Gates, no doubt in return for many squillions of Microsoft stock options.
So maybe all that’s required is a good working
knowledge of VMS internals! And sure enough, after a
bit of deep recall (and some nostalgia for my misspent
youth, hacking away at VMS), the mechanisms required
to solve this particular problem came to the surface.
It was quicker than searching through the Windows documentation,
anyway.
The first thing to do is to prime the I/O sub-system to expect asynchronous I/O operations, that is, operations which are not complete when the I/O API returns. In Windows this is called ‘overlapped’ I/O, and you need to set a flag when you open the COM port (or open a file):
' A comm port must be opened for exclusive access (share mode 0) with OPEN_EXISTING
h = CreateFile(portname, GENERIC_READ Or GENERIC_WRITE, _
    0, 0, OPEN_EXISTING, FILE_FLAG_OVERLAPPED, 0)
This tells the operating system that certain APIs may return to the calling thread before the I/O operation they requested has completed. The I/O is still active deep inside the operating system; all that has happened is that the request has been accepted and the API call has returned. So now another I/O request can be queued to the device by another API call, and we’re no longer blocked.
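Continuing the same toy approach (a Python simulation, not Win32 code), the sketch below models what FILE_FLAG_OVERLAPPED changes. `wait_comm_event` holds the port only long enough to register the request. It then returns the real Win32 value of ERROR_IO_PENDING (997) while the operation stays outstanding. `OverlappedPort` and its methods are hypothetical.

```python
import threading

ERROR_IO_PENDING = 997  # real Win32 error code value


class OverlappedPort:
    """Hypothetical toy model of a COM port opened with FILE_FLAG_OVERLAPPED."""

    def __init__(self):
        self._io_lock = threading.Lock()  # held only while a request is queued
        self._pending = []                # operations still active "in the OS"
        self.dtr = False

    def wait_comm_event(self):
        with self._io_lock:
            done = threading.Event()      # signalled when the operation completes
            self._pending.append(done)
        return ERROR_IO_PENDING, done     # returns before the I/O has finished

    def set_dtr(self, value, timeout=0.1):
        if not self._io_lock.acquire(timeout=timeout):
            return False
        try:
            self.dtr = value
            return True
        finally:
            self._io_lock.release()

    def dsr_changes(self):
        # The "driver" completes every outstanding wait.
        with self._io_lock:
            for done in self._pending:
                done.set()
            self._pending.clear()


if __name__ == "__main__":
    port = OverlappedPort()
    status, done = port.wait_comm_event()
    print(status == ERROR_IO_PENDING, port.set_dtr(True), done.is_set())  # True True False
    port.dsr_changes()
    print(done.is_set())  # True
```

Setting DTR now succeeds while the wait is still outstanding, which is exactly the behaviour the blocking version could not provide.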
Next, we need an ‘OVERLAPPED’ structure,
which is used to communicate with the still active I/O
operation. Indeed, the existence of an OVERLAPPED structure
in the arguments to an API usually indicates that the
I/O is to be asynchronous. Here’s the structure:
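Structure OVERLAPPED
    ' Integer fields match the 32-bit (x86) Windows layout; on 64-bit
    ' Windows, Internal, InternalHigh and hEvent are pointer-sized (IntPtr)
    Private Internal As Integer
    Private InternalHigh As Integer
    Private Offset As Integer
    Private OffsetHigh As Integer
    Public hEvent As Integer
End Structure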
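The Integer-based layout matches the C definition only in a 32-bit process. In the Windows SDK, Internal and InternalHigh are ULONG_PTR and hEvent is a HANDLE, all pointer-sized. Offset and OffsetHigh share a union with a PVOID Pointer member, which occupies the same 8 bytes. Python's ctypes is used below purely as a layout calculator to compare the two versions; it is not part of the VB program.

```python
import ctypes


class OVERLAPPED(ctypes.Structure):
    """Win32 OVERLAPPED layout (the Offset/OffsetHigh union shown as two DWORDs)."""
    _fields_ = [
        ("Internal", ctypes.c_size_t),      # ULONG_PTR: pointer-sized
        ("InternalHigh", ctypes.c_size_t),  # ULONG_PTR: pointer-sized
        ("Offset", ctypes.c_uint32),        # DWORD
        ("OffsetHigh", ctypes.c_uint32),    # DWORD
        ("hEvent", ctypes.c_void_p),        # HANDLE: pointer-sized
    ]


class OVERLAPPED_AS_INTEGERS(ctypes.Structure):
    """The VB declaration's layout: five 32-bit Integers."""
    _fields_ = [(name, ctypes.c_int32) for name in
                ("Internal", "InternalHigh", "Offset", "OffsetHigh", "hEvent")]


if __name__ == "__main__":
    print(ctypes.sizeof(OVERLAPPED_AS_INTEGERS))  # 20 on every platform
    print(ctypes.sizeof(OVERLAPPED))              # 20 in a 32-bit process, 32 in a 64-bit one
```

So on 64-bit Windows the VB declaration would need IntPtr for those three fields (and a matching CreateEvent declaration) to line up with what the kernel expects.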
The only thing of interest to us in this is the hEvent
member. This must be set to the handle of an ‘event’.
Events are just one type of signalling mechanism used
by the operating system (see Signals
and Mutexes,
below). One way to think of an event is as a flag that
pops up when something occurs. Other things (threads)
are looking for the flag and wait until they see it.
When they do, off they go again running whatever code
is in the thread.
Events are used to synchronize various activities between
threads and inside the operating system itself. But they
are also used to communicate between the operating system
core (the ‘kernel’) and the threads that
run in your program. It might seem a bit strange at first sight that this should be a problem at all. But the kernel runs according to a very strict set of rules, and the ability to poke around in a user-level thread is not among them.
Events can either be reset automatically (an auto-reset event returns to non-signalled as soon as it has released a single waiting thread) or be reset manually, staying signalled until a thread calls ResetEvent. I prefer the latter, since it’s then clear what is going on; an overlapped WaitCommEvent in fact requires a manual-reset event. The event can also be created either signalled (flag raised) or cleared (flag down):
o.hEvent = CreateEvent(Nothing, 1, 0, portnumber.ToString)
This API creates an event with default security attributes (first argument), manual reset (second), initially cleared (third) and with the port number as the event’s name (last argument).
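Python's `threading.Event` behaves like a manual-reset event. Once set, it stays set and every wait succeeds until `clear()` is called, which is the equivalent of ResetEvent. Python has no built-in auto-reset event, so `AutoResetEvent` below is a hypothetical helper built on a condition variable. It mimics the Windows behaviour of releasing one waiter and then resetting itself.

```python
import threading


class AutoResetEvent:
    """Hypothetical auto-reset event: a successful wait consumes the signal."""

    def __init__(self, initially_signalled=False):
        self._cond = threading.Condition()
        self._signalled = initially_signalled

    def set(self):
        with self._cond:
            self._signalled = True
            self._cond.notify()            # wake (at most) one waiter

    def wait(self, timeout=None):
        with self._cond:
            if not self._cond.wait_for(lambda: self._signalled, timeout):
                return False               # timed out, still non-signalled
            self._signalled = False        # reset automatically
            return True


if __name__ == "__main__":
    manual = threading.Event()             # manual reset, initially cleared
    manual.set()
    print(manual.wait(0), manual.wait(0))  # True True: stays signalled
    manual.clear()                         # the equivalent of ResetEvent
    print(manual.wait(0))                  # False

    auto = AutoResetEvent()
    auto.set()
    print(auto.wait(0), auto.wait(0))      # True False: first wait consumed it
```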
One of the problems with
asynchronous I/O is the number of APIs you need to declare
to do anything.
And they’re
all several lines long, as well.
Hanging Around
Now we’re ready for the modified event-handling thread. The WaitCommEvent API actually takes three parameters. The first two, the port’s handle and the event mask used to indicate which signal caused the communication event, are the same as we used last month. But the third, lpOverlapped, must now point to an OVERLAPPED structure. In that case the API normally doesn’t wait: it returns immediately, reporting failure, and GetLastError gives ERROR_IO_PENDING. This isn’t really an error (it’s what we expect in an overlapped I/O operation), but we’ve got to check anyway:
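' WaitCommEvent returns zero on failure; any nonzero value means success
If r = 0 Then
    ' Marshal.GetLastWin32Error (System.Runtime.InteropServices) is reliable
    ' after an API declared with Declare or SetLastError:=True; calling
    ' GetLastError yourself may see a value the runtime has overwritten
    r = Marshal.GetLastWin32Error()
    If r <> ERROR_IO_PENDING Then
        Throw New System.Exception("WaitCommEvent failed")
    End If
End If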
The real wait must be done next, using the WaitForSingleObject
API:
r = WaitForSingleObject(o.hEvent, INFINITE)
This waits (yes, this time it really does wait as advertised) until the event specified by the event handle is raised. The event is the one in the OVERLAPPED structure that was given to the earlier WaitCommEvent call, and the operating system sets it when the I/O completes. The second parameter is a timeout in milliseconds: if the event hasn’t been signalled by then, the call returns anyway. We don’t need a timeout here, so it’s set to INFINITE.
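If you do want a timeout, the return value tells you which case happened: WAIT_OBJECT_0 (0) when the event was signalled and WAIT_TIMEOUT (0x102) when the time ran out. The Python sketch below simulates that over a `threading.Event`. `wait_for_single_object` and `comm_loop` are hypothetical helpers; only the constant values come from the Windows SDK. Waiting in finite slices is one possible design, not the article's approach, that lets a communications thread notice a shutdown request.

```python
import threading

# Real Win32 values, used here only as return codes of the simulation.
WAIT_OBJECT_0 = 0x00000000
WAIT_TIMEOUT = 0x00000102
INFINITE = 0xFFFFFFFF


def wait_for_single_object(event, timeout_ms):
    """Simulated WaitForSingleObject over a threading.Event (hypothetical helper)."""
    timeout = None if timeout_ms == INFINITE else timeout_ms / 1000.0
    return WAIT_OBJECT_0 if event.wait(timeout) else WAIT_TIMEOUT


def comm_loop(io_done, stop_requested, poll_ms=50):
    """Wait for I/O in slices so the thread can notice a shutdown request."""
    while not stop_requested.is_set():
        if wait_for_single_object(io_done, poll_ms) == WAIT_OBJECT_0:
            return "io completed"
    return "stopped"


if __name__ == "__main__":
    ev = threading.Event()
    print(wait_for_single_object(ev, 10) == WAIT_TIMEOUT)  # True
    ev.set()
    print(comm_loop(ev, threading.Event()))                 # io completed
```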
Now you might expect that the event mask originally passed to WaitCommEvent would now simply be valid. After all, the I/O has completed and the result should be there. Not so fast: there are two problems. First, you have to ask for the outcome using yet another API, GetOverlappedResult. The original WaitCommEvent call returned with ERROR_IO_PENDING before any result existed, so it had no way to report one. GetOverlappedResult reports whether the operation succeeded. Only then can you trust the mask, which the system writes into the variable originally passed to WaitCommEvent. This may sound peculiar, but that’s the way Windows XP works. By and large, the operating system doesn’t come back and say ‘I’m done’; you have to ask it whether it’s done, using the GetOverlappedResult API.
Secondly, the I/O might not have completed, in spite of the event being signalled. This can actually happen, though whether it’s a feature or a bug is anyone’s guess. In any case, it’s good coding practice to check. The event can be signalled by anything that holds a handle to it, and, since it is a named event, by any process that opens it by name. The WaitCommEvent completion code in the operating system kernel is not the only possible source.
So first, the GetOverlappedResult API:
r = GetOverlappedResult(h, o, Nothing, 0)
The first parameter is the communication port’s
handle, the second is the Overlapped structure used previously,
the third is used to return the number of bytes transferred
(not relevant here) and the last indicates if we want
to wait – which we definitely do not.
Now we have to check if the signal really did indicate
that the I/O was completed:
If r = 0 Then
    r = Marshal.GetLastWin32Error()   ' rather than calling GetLastError directly
    If r = ERROR_IO_INCOMPLETE Then
        evtMask = -1
    Else
        Throw New System.Exception("GetOverlappedResult failed")
    End If
End If
Here, the code sets evtMask to -1 to indicate that the signal was spurious and should be ignored. And finally, we need to reset the event:
ResetEvent(o.hEvent)
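Putting the steps together, here is a Python simulation of one pass of the event-handling thread: issue the request, wait on the event, ask whether the I/O really completed, then reset the event. The `Sim*` names, `dsr_changes` and `stray_signal` are hypothetical stand-ins for the kernel and the hardware. The error codes (ERROR_IO_INCOMPLETE = 996, ERROR_IO_PENDING = 997) and EV_DSR = 0x10 are the real Win32 values.

```python
import threading

# Real Win32 values; everything else here is a hypothetical simulation.
ERROR_IO_INCOMPLETE = 996
ERROR_IO_PENDING = 997
EV_DSR = 0x0010


class SimOverlapped:
    """Stand-in for OVERLAPPED plus the driver's view of the request."""

    def __init__(self):
        self.hEvent = threading.Event()   # manual-reset event
        self.completed = False
        self.evt_mask = 0                 # where the "driver" writes the mask


def sim_wait_comm_event(ov):
    """Always goes pending, like an overlapped WaitCommEvent that can't finish yet."""
    ov.completed = False
    ov.hEvent.clear()
    return False, ERROR_IO_PENDING


def sim_get_overlapped_result(ov):
    """bWait = False: report the completion status without waiting."""
    return (True, 0) if ov.completed else (False, ERROR_IO_INCOMPLETE)


def wait_for_comm_event(ov, start_device):
    ok, err = sim_wait_comm_event(ov)
    if not ok and err != ERROR_IO_PENDING:
        raise OSError(err, "WaitCommEvent failed")
    start_device()                        # lets the caller drive the "hardware"
    ov.hEvent.wait()                      # WaitForSingleObject(o.hEvent, INFINITE)
    ok, err = sim_get_overlapped_result(ov)
    if ok:
        mask = ov.evt_mask
    elif err == ERROR_IO_INCOMPLETE:
        mask = -1                         # spurious signal: ignore it
    else:
        raise OSError(err, "GetOverlappedResult failed")
    ov.hEvent.clear()                     # ResetEvent(o.hEvent)
    return mask


def dsr_changes(ov):
    ov.evt_mask, ov.completed = EV_DSR, True
    ov.hEvent.set()


def stray_signal(ov):
    ov.hEvent.set()                       # signalled, but the I/O isn't done


if __name__ == "__main__":
    ov = SimOverlapped()
    # The "hardware" fires from another thread shortly after the request is issued.
    print(wait_for_comm_event(ov, lambda: threading.Timer(0.01, dsr_changes, (ov,)).start()))  # 16
    print(wait_for_comm_event(ov, lambda: threading.Timer(0.01, stray_signal, (ov,)).start()))  # -1
```

The second call shows the spurious case: the event is raised, but the completion check reports the I/O as incomplete, so the mask comes back as -1 and the caller ignores it.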
When you run the program, you need to initialize the
communications ports by clicking the ‘Initialize’ button.
Connect the two communication ports together using a null-modem (cross-over) cable (either make one or buy one), and you should then be able to set and clear DTR/RTS on both ports. When you do, you’ll get simple diagnostic pop-up messages indicating that the state of the corresponding DSR/CTS lines on the other port has changed. Finally, you can use the ‘Check’ button to determine the DSR/CTS states of the ports.
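GetCommModemStatus and WaitCommEvent report the signal lines with different bit masks, which is an easy place to trip up. In the Windows headers, MS_CTS_ON is 0x10, MS_DSR_ON 0x20, MS_RING_ON 0x40 and MS_RLSD_ON 0x80. The WaitCommEvent mask uses EV_CTS 0x08, EV_DSR 0x10, EV_RLSD 0x20 and EV_RING 0x100. The Python helpers below (hypothetical names) decode both. The last line of the usage shows that an EV_DSR mask misread as a modem status looks like CTS.

```python
# Real Win32 constants (winbase.h).
MS_CTS_ON, MS_DSR_ON, MS_RING_ON, MS_RLSD_ON = 0x0010, 0x0020, 0x0040, 0x0080
EV_CTS, EV_DSR, EV_RLSD, EV_RING = 0x0008, 0x0010, 0x0020, 0x0100


def modem_lines(status):
    """Decode a GetCommModemStatus value into line states."""
    return {
        "CTS": bool(status & MS_CTS_ON),
        "DSR": bool(status & MS_DSR_ON),
        "RING": bool(status & MS_RING_ON),
        "RLSD": bool(status & MS_RLSD_ON),
    }


def changed_lines(evt_mask):
    """List the signal changes a WaitCommEvent mask reports (-1 = spurious)."""
    if evt_mask == -1:
        return []
    names = [("CTS", EV_CTS), ("DSR", EV_DSR), ("RLSD", EV_RLSD), ("RING", EV_RING)]
    return [name for name, bit in names if evt_mask & bit]


if __name__ == "__main__":
    print(modem_lines(0x30))               # {'CTS': True, 'DSR': True, 'RING': False, 'RLSD': False}
    print(changed_lines(EV_DSR | EV_CTS))  # ['CTS', 'DSR']
    print(modem_lines(EV_DSR)["CTS"])      # True: wrong mask, wrong answer
```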
After some hard grind, the program now works correctly: the WaitCommEvent API no longer ‘blocks’ the other APIs.
How to believe six impossible things before breakfast...
While I was developing the code for this article, I
came across a really peculiar bug. The symptoms were
as follows. The ports were initialised as usual and then
a test call was made to check on the state of the DSR
or CTS line. Suddenly, an error occurred: ‘object
reference not set to an instance of an object’.
Not only was the error message incomprehensible, the
call stack was reset to about two levels deep – the
sort of thing that regularly happens in C++ programs.
I tracked the bug down (eventually) by displaying the Disassembly window and single stepping through the instructions it showed: the machine code that the JIT compiler generates from the MSIL (Intermediate Language). Now you don’t have to be an expert in assembly language to follow what’s going on, especially if there’s something dramatic
about to happen. At the critical point just before the
error occurred, everything looked normal. Just afterwards
the call stack indicated that the program was somewhere
silly.
Even in .NET with all
its type checking and validation, it’s possible
to get errors due to data corruption.
They can be difficult to track down too.
That turned out to be the key. The error happened
at the call to GetCommModemStatus where the .NET framework
seemed to be doing the impossible. The final clue was
the error message ‘Fatal execution engine error’ which
turned up about half of the time. According to the .NET
documentation, these "should never occur"!
So it looked as if the .NET Framework was corrupt – maybe
a Microsoft bug? Now the first thing to do if you suspect
Microsoft of writing buggy code (it does occur) is remember
the Biblical proverb about beams, motes and eyes. It
is much, much more likely that you’ve goofed than
Microsoft – whatever you think of them.
And so it turned out in this case. I had entered the API definition for GetCommModemStatus incorrectly, inadvertently aliasing it to GetCommState. This is an entirely different beast: it expects a pointer to a Device Control Block (DCB), a 28-byte structure, and fills in the whole thing. Given a pointer to a 4-byte Integer instead, it overwrote whatever lay next to that Integer in memory, which is presumably how it managed to corrupt the .NET engine.
The moral of the story: be very careful about how you
define API calls.
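Comparing sizes shows why the mistake was so destructive. The sketch below uses Python's ctypes as a layout calculator for the Win32 DCB structure. Its one-bit and two-bit flag fields are lumped into a single DWORD, which doesn't change the size. `bytes_overrun` is a hypothetical helper.

```python
import ctypes

DWORD, WORD, BYTE, CHAR = ctypes.c_uint32, ctypes.c_uint16, ctypes.c_uint8, ctypes.c_char


class DCB(ctypes.Structure):
    """Win32 DCB; the 32 bits of flag bitfields are lumped into one DWORD here."""
    _fields_ = [
        ("DCBlength", DWORD), ("BaudRate", DWORD), ("flags", DWORD),
        ("wReserved", WORD), ("XonLim", WORD), ("XoffLim", WORD),
        ("ByteSize", BYTE), ("Parity", BYTE), ("StopBits", BYTE),
        ("XonChar", CHAR), ("XoffChar", CHAR), ("ErrorChar", CHAR),
        ("EofChar", CHAR), ("EvtChar", CHAR), ("wReserved1", WORD),
    ]


def bytes_overrun(expected_type, supplied_type):
    """How many bytes a callee writing `expected_type` spills past `supplied_type`."""
    return max(0, ctypes.sizeof(expected_type) - ctypes.sizeof(supplied_type))


if __name__ == "__main__":
    print(ctypes.sizeof(DCB))                  # 28
    print(bytes_overrun(DCB, ctypes.c_int32))  # 24 bytes of someone else's memory
```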
Here’s a ‘fatal execution engine error’ – a
sad end for a program.
According to Microsoft, this should
never occur. Great. Unfortunately it just did.
Signals and Mutexes
Any operating system worth its name has to offer some sort of synchronisation between competing processes or threads. In Windows, two of the most basic synchronization objects are the event and the mutex.
An event just has two states: it’s either ‘signalled’ or
the opposite, ‘non-signalled’. Any thread
can set an event to signalled, as can the operating system
when, for example, an I/O completes. You can force a
thread to wait for an event to be signalled and, further,
any number of threads can wait (or ‘block’)
for a single event. All the waiting threads will be released
(or unblocked) when the event is signalled.
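Python's documentation for `threading.Event.set()` states that all threads waiting for the event are awakened, which matches the behaviour described here. The sketch below starts several threads blocked on one event and signals it once. `release_all_waiters` is a hypothetical name.

```python
import threading


def release_all_waiters(n_threads=5):
    """Start n threads blocked on one event, signal it once, record who got through."""
    event = threading.Event()             # behaves like a manual-reset event
    released = []
    lock = threading.Lock()
    ready = threading.Barrier(n_threads + 1)

    def worker(i):
        ready.wait()                      # all threads started
        event.wait()                      # block until signalled (a thread arriving
        with lock:                        # after set() passes too: it stays signalled)
            released.append(i)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
    for t in threads:
        t.start()
    ready.wait()
    event.set()                           # one signal...
    for t in threads:
        t.join()
    return sorted(released)               # ...releases every waiter


if __name__ == "__main__":
    print(release_all_waiters(5))  # [0, 1, 2, 3, 4]
```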
The other basic synchronization primitive, the ‘mutex’, works rather differently. ‘Mutex’ stands for ‘mutual exclusion’: one and only one thread can own a mutex at any given time. This is useful where access to a shared resource (say, memory) must be controlled. In fact, you can use an auto-reset event to achieve much the same effect. But a mutex has one further property which differentiates it from an event: threads waiting for ownership of the mutex are queued on a first-in-first-out basis. So when the mutex is released, the first thread in the queue, regardless of priority, acquires it. (Microsoft’s documentation does caution against relying on strict FIFO order, since external events such as kernel-mode APCs can change it.)
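Python's own `threading.Lock` makes no promise about which waiter gets the lock next. To illustrate first-in-first-out hand-over, the sketch below uses a ticket lock, a standard construction in which each arrival takes a numbered ticket and the lock passes to tickets strictly in order. `TicketLock` and `arrival_order_demo` are hypothetical names. This shows the queueing idea, not how Windows implements mutexes.

```python
import threading
import time


class TicketLock:
    """Mutual exclusion with strict first-in-first-out hand-over."""

    def __init__(self):
        self._cond = threading.Condition()
        self._next_ticket = 0    # next ticket to hand out
        self._now_serving = 0    # ticket allowed to hold the lock

    def acquire(self):
        with self._cond:
            my_ticket = self._next_ticket
            self._next_ticket += 1
            while my_ticket != self._now_serving:
                self._cond.wait()

    def release(self):
        with self._cond:
            self._now_serving += 1
            self._cond.notify_all()   # waiters re-check; only the next ticket proceeds

    def tickets_issued(self):
        with self._cond:
            return self._next_ticket


def arrival_order_demo(n=4):
    lock, order = TicketLock(), []

    def worker(i):
        lock.acquire()
        order.append(i)
        lock.release()

    lock.acquire()                          # main thread holds ticket 0
    threads = []
    for i in range(n):
        t = threading.Thread(target=worker, args=(i,))
        t.start()
        while lock.tickets_issued() < i + 2:  # wait until worker i has queued
            time.sleep(0.001)
        threads.append(t)
    lock.release()                          # workers now run in arrival order
    for t in threads:
        t.join()
    return order


if __name__ == "__main__":
    print(arrival_order_demo(4))  # [0, 1, 2, 3]
```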
An Event / A Mutex: events and mutexes are used to synchronize access to a given resource. However, mutexes form an orderly queue.
VB .NET doesn’t just let you handle asynchronous events in a sensible manner; it also lets you handle the memory buffers that hold the received data with reasonable efficiency. This was always a bit tricky in earlier versions of Visual Basic, requiring calls to undocumented functions like StrPtr and ObjPtr.
VB .NET cleans all this up by clearly defining the interface between ‘managed’ memory (memory that’s controlled by the Common Language Runtime) and the ‘unmanaged’ memory required by the fundamental Windows API.
The key to handling memory that’s used by the API is to ‘pin’ the memory so that it can’t be moved by the .NET garbage collector. This is essential for API calls that return values to memory locations. Fortunately, when you call an API, .NET will automatically pin most variables for you for the duration of the call. So in the API call:
r = GetCommModemStatus(h, modem_status)
the variable h is passed by value and so doesn’t need pinning, while modem_status is passed by reference and so must be pinned. This is done automatically, so you don’t have to do anything.
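But where you have an API that uses overlapped I/O, you may have to pin any OVERLAPPED structures yourself, along with the event-mask variable passed to WaitCommEvent, which Windows also writes to when the I/O completes. This is because the operating system may use them after the API has returned but before the actual I/O has completed, and the automatic pinning ends when the call returns. Allowing the garbage collector to move the object under these circumstances could be fatal: the operating system would write into memory that now belongs to something else. It’s better to be safe than sorry. One way is GCHandle.Alloc with GCHandleType.Pinned; another is to allocate the structure in unmanaged memory with Marshal.AllocHGlobal.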
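To see why pinning matters, here is a toy compacting heap in Python in which ‘addresses’ are just slot numbers. It is purely illustrative, since the real .NET garbage collector is far more sophisticated, and `ToyHeap` is a hypothetical name. Compaction slides movable objects down to close gaps. A pinned object, like an OVERLAPPED structure whose address the operating system is holding, stays where it is.

```python
class ToyHeap:
    """Hypothetical compacting heap: addresses are slot indices; pinned objects never move."""

    def __init__(self, size):
        self.slots = [None] * size        # each entry: (name, pinned) or None

    def alloc(self, name, pinned=False):
        addr = self.slots.index(None)     # first free slot (ValueError if full)
        self.slots[addr] = (name, pinned)
        return addr

    def free(self, name):
        self.slots[self.address_of(name)] = None

    def address_of(self, name):
        for addr, obj in enumerate(self.slots):
            if obj is not None and obj[0] == name:
                return addr
        raise KeyError(name)

    def compact(self):
        new = [None] * len(self.slots)
        for addr, obj in enumerate(self.slots):   # pinned objects stay put
            if obj is not None and obj[1]:
                new[addr] = obj
        free = (a for a in range(len(new)) if new[a] is None)
        for obj in self.slots:                    # movable objects slide down
            if obj is not None and not obj[1]:
                new[next(free)] = obj
        self.slots = new


if __name__ == "__main__":
    heap = ToyHeap(6)
    for name in ("temp", "buffer"):
        heap.alloc(name)
    ov = heap.alloc("overlapped", pinned=True)    # address handed to the "OS"
    data = heap.alloc("data")
    heap.free("temp")
    heap.compact()
    print(heap.address_of("overlapped") == ov)    # True: still where the OS expects
    print(data, "->", heap.address_of("data"))    # 3 -> 1: an old raw address is now stale
```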
Before garbage collection (on the left) and after garbage collection (on the right). Memory in use by an API must not be moved by the garbage collector; allowing that can lead to disaster.
Now that was seriously hard work. And all because WaitCommEvent doesn’t do quite what you might expect, though it does work as documented. But next month, we should
be able to move to reading the odd character from the
port – and some can be very odd, as we’ll
see – and writing data as well.
September 2005