Aeon Emulator Blog

October 6, 2009

Performance Considerations – Low-level Optimization

Filed under: Aeon, Performance — Greg Divis @ 4:09 pm

[Note: All of the disassembly in this post is for the 64-bit x64 architecture. It should be similar enough to 32-bit x86 to recognize if you’re already familiar with that. Yes, I know that the code generated by the 32-bit and 64-bit JIT compilers can differ significantly, but that’s not addressed here.]

There are many different ways to optimize the performance of code. The ideal solution is to refactor or redesign the offending component, using a more efficient algorithm to perform the same task. Usually this is enough, but once in a while you bump up against a limitation of the compiler or language that you’re using and things get more interesting.

Checking the JIT

The just-in-time compiler in the .NET Common Language Runtime does a good job, but it doesn’t have a lot of time to perform complex optimizations, so sometimes I had to attach a debugger to an optimized build of Aeon just to look at the native machine code it was producing. Fortunately, if you know what settings to change in Visual Studio, it’s pretty easy to do this right from the C# IDE.

Let’s have a look at the disassembly of part of an earlier version of my stosb instruction implementation:


This small code fragment increments the DI register if the processor’s direction flag is not set, otherwise it decrements the register. Just like in hardware, the processor flags are stored in a special register in Aeon as a set of bitflags; normally these flags are accessed in code via the Processor.Flags field, but I also added some convenience properties for getting/setting specific flags. For reference, here is the implementation of the Processor.DirectionFlag property getter:

return (this.Flags & EFlags.Direction) != 0;
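To make the layout concrete, here is a minimal sketch of how a flags register with a convenience property might be declared. Only `Flags`, `DirectionFlag`, `EFlags.Direction` and the getter expression come from the post; the other enum members, the setter, and the choice of `uint` as the underlying type are assumptions added for illustration, not Aeon's actual source.

```csharp
using System;

// Bit positions follow the x86 FLAGS register; Direction is the 400h
// constant that shows up in the disassembly.
[Flags]
public enum EFlags : uint
{
    None = 0,
    Carry = 0x0001,     // assumed member, included only for illustration
    Zero = 0x0040,      // assumed member, included only for illustration
    Direction = 0x0400
}

public sealed class Processor
{
    // All processor flags packed into one value, as in hardware.
    public EFlags Flags;

    // Convenience property for reading or changing a single flag.
    public bool DirectionFlag
    {
        get { return (this.Flags & EFlags.Direction) != 0; }
        set
        {
            // Hypothetical setter: set or clear only the direction bit.
            if (value)
                this.Flags |= EFlags.Direction;
            else
                this.Flags &= ~EFlags.Direction;
        }
    }
}
```

The getter produces a `bool`, which is the detail that matters for the generated code discussed next.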

We can clearly see that the property get method is getting inlined by the JIT, as there are no calls or jumps to external code, and we can see the direction flag (400h) being AND-ed with the flags register on lines 0x3d and 0x43. Also, we can see that lines 0x57-0x61 are the instructions generated for vm.Processor.DI++; and lines 0x66-0x70 are generated for vm.Processor.DI--;

The code for incrementing or decrementing DI is pretty straightforward: load the DI register into eax, add/subtract 1, and store the result back into DI (remember that DI is actually a pointer to the emulated register in memory). That’s great, but why does the actual direction flag test take so many instructions?

An Easy Improvement

It seems that the JIT is being a tad too literal. It has nicely inlined the property, but it also generated code to evaluate the expression in the property getter, expand the result to a boolean value, and then test that result for a nonzero value. I can’t really complain here because this is exactly what I asked it to do, and in most cases it wouldn’t matter: the property is still inlined, which is usually efficient enough.

However, this bit of code is in the stosb instruction implementation, an instruction likely to be used repeatedly to write data to memory – it needs to be as efficient as possible. Fortunately, there’s a very simple way to improve it – just check the direction flag directly instead of using the DirectionFlag property to do it:


Much better. Now the bit flag is simply tested and the result used directly to determine whether to jump to the else case or execute the increment.
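In C# terms, the two forms being compared look roughly like the sketch below. It is a reconstruction from the description, not the original listing: the `StringOps` class and its method names are hypothetical, `DI` is a plain `ushort` field here rather than something backed by a pointer into emulated register memory, and the processor is passed in directly instead of being reached through `vm.Processor`.

```csharp
using System;

[Flags]
public enum EFlags : uint
{
    None = 0,
    Direction = 0x0400
}

public sealed class Processor
{
    public EFlags Flags;
    public ushort DI; // simplified: a plain field instead of a pointer-backed register

    public bool DirectionFlag
    {
        get { return (this.Flags & EFlags.Direction) != 0; }
    }
}

public static class StringOps // hypothetical container for the two variants
{
    // Earlier form: the test goes through the bool-returning property.
    public static void AdvanceDIViaProperty(Processor p)
    {
        if (!p.DirectionFlag)
            p.DI++;
        else
            p.DI--;
    }

    // Revised form: the bit is masked and compared in the condition itself,
    // so no intermediate bool has to be produced.
    public static void AdvanceDIDirect(Processor p)
    {
        if ((p.Flags & EFlags.Direction) == 0)
            p.DI++;
        else
            p.DI--;
    }
}
```

Both methods behave identically; any difference is in the machine code the JIT emits. A newer JIT may well compile the two to the same instructions, so the disassembly is worth checking before assuming the rewrite still buys anything.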

Knowing When to Stop

By replacing the property evaluation with a bit flag test, I’ve traded a small amount of readability for a slightly reduced number of instructions. When I first made this change, I was unfortunately tempted to try to improve on it further, even though there really aren’t that many instructions left. For instance, I tried factoring out the pointer to the emulated DI register to eliminate an instruction:


Ugh. Well, I’ve succeeded in eliminating a mov instruction from both clauses of the if, and replacing them with the mov on line 0x36. (Interestingly, the compiler also chose to use a sub instruction to decrement DI this time instead of add.) However, the same number of instructions is executed as before, since only one or the other clause is run at a time.
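For readers without the original listing, the kind of change described here can be approximated as follows. This is only a sketch under assumptions: it uses a C# 7 `ref` local as a safe stand-in for the raw pointer to the emulated register (ref locals did not exist when this post was written), and `StringOps.AdvanceDIHoisted` is a hypothetical name.

```csharp
using System;

[Flags]
public enum EFlags : uint
{
    None = 0,
    Direction = 0x0400
}

public sealed class Processor
{
    public EFlags Flags;
    public ushort DI; // simplified stand-in for the pointer-backed register
}

public static class StringOps // hypothetical name
{
    // The reference to DI is obtained once, before the branch, instead of
    // once inside each clause of the if.
    public static void AdvanceDIHoisted(Processor p)
    {
        ref ushort di = ref p.DI;

        if ((p.Flags & EFlags.Direction) == 0)
            di++;
        else
            di--;
    }
}
```

The source now has an extra local and an extra level of indirection to read past, while each path through the method still performs one load of the register's location, one add or subtract, and one store.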

There’s a much more significant accomplishment here than eliminating a single instruction – I’ve managed to overcomplicate one of the simplest of all programming tasks: incrementing a number. Giving up readability for some measure of performance can be an acceptable trade-off, but trading readability for no measurable improvement is no trade at all. All you’re getting is some obfuscated C# code.

Needless to say, I left it at my original improvement and called it good. Even that change is questionable – I know it’s better from the disassembly, but it’s difficult to measure how much better it is. The only reason I left it that way is that it’s still quite readable.

The Big Picture

It’s worth noting that this sort of fine-tuning was not the first thing I did to speed up emulation; I’ve actually rewritten the code responsible for decoding opcodes and their operands a couple of times before resorting to this. Aeon’s at a point now where there aren’t too many other structural changes I can make to greatly improve CPU emulation performance short of implementing a proper dynamic recompiler. (I do plan to do this eventually…)

Since performance on my test systems has been good enough for me, I’ve decided to shift to improving compatibility and getting all of the plumbing installed for emulating x86 protected mode. I’m not completely happy with my implementation so far, and there is certainly room for improvement in its performance. But this project is a hobby of mine, so I have to be my own project manager, and improving performance any further just isn’t on my short-term schedule.

This is the last generic “performance”-related post I’m going to write for a while. The goal here was just to illustrate some of the more basic strategies I used to get reasonable performance without too much development effort. Expect some more diverse and hopefully interesting posts from now on. :)

