Aeon Emulator Blog

September 28, 2009

Performance Considerations – Going Unsafe

Filed under: Aeon, Performance — Greg Divis @ 8:44 pm

I like the unsafe keyword in C#. Don’t get me wrong, I don’t like to use it, but I like that it’s called “unsafe.” Not only is it “unsafe” in that it might cause an access violation outright, but using it throws any guarantee of type safety in the Common Language Runtime right out the window. In an unsafe block of code, there’s nothing stopping you from munging data in your process’s address space just like you could in good old C++. Even worse, using unsafe code means you’re probably messing around with native pointers, whose size and behavior may vary depending on what platform the program is running on. On x86 and x64 the only significant difference in pointers (at least as it relates here) is their size – but on other platforms there can be bigger issues with endianness and pointer alignment.

Unsafe code is disabled by default for new C# projects in Visual Studio (with good reason), but there are a couple of cases where it’s incredibly useful: interoperability with native code, and spot optimization. Aeon uses unsafe code for both, though I tried to avoid it as much as possible.

Conventional Memory

If you’re unfamiliar with the “real-mode” architecture of an x86 processor, count yourself lucky. I’ll be going into lots more detail about this later on, but for now you just need to know that it has 1 MB of addressable RAM. Memory on a PC may be byte-addressable, but that doesn’t necessarily mean you always want to treat it as an array of bytes. Some instructions operate on bytes, others on 16-bit words, and others on 32-bit doublewords. Modeling RAM as a .NET Byte array would be possible, but clumsy and slower than I’d like for anything but byte-value access. In this case, I have a block of memory I just want to set aside and store arbitrary values in, because that’s exactly what it’s modeling.

Fortunately, unsafe allows me to access memory using pointer arithmetic and a specified word size, just like in C++. Consider the following code snippet from Aeon’s PhysicalMemory.GetUInt32 method:

unsafe
{
    return *(uint*)(rawView + addr);
}
 

In this snippet, rawView is a byte pointer to the allocated block of emulated RAM, and addr is the byte offset of the address to read. Emulated video RAM works in a similar fashion.
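
To give a bit more context, here’s a minimal sketch of the general approach, assuming the block is allocated on the native heap with Marshal.AllocHGlobal. The class and member layout here is illustrative only, not Aeon’s actual implementation:

using System;
using System.Runtime.InteropServices;

// Minimal sketch (not Aeon's actual code): 1 MB of emulated RAM allocated on
// the native heap and read with pointer arithmetic at whatever word size an
// instruction needs. A real implementation would also zero the block and
// provide matching Set methods.
public sealed unsafe class EmulatedRam : IDisposable
{
    private const int Size = 1024 * 1024;
    private readonly IntPtr block;
    private readonly byte* rawView;

    public EmulatedRam()
    {
        // A native allocation never moves, so a raw pointer to it stays valid.
        this.block = Marshal.AllocHGlobal(Size);
        this.rawView = (byte*)this.block;
    }

    public byte GetByte(uint addr)
    {
        return *(this.rawView + addr);
    }

    public ushort GetUInt16(uint addr)
    {
        // Reinterpret the two bytes at the offset as a 16-bit word.
        return *(ushort*)(this.rawView + addr);
    }

    public uint GetUInt32(uint addr)
    {
        // Reinterpret the four bytes at the offset as a 32-bit doubleword.
        return *(uint*)(this.rawView + addr);
    }

    public void Dispose()
    {
        Marshal.FreeHGlobal(this.block);
    }
}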

Processor

So far everything is pretty basic, but there is one other area where I used unsafe code aside from emulated memory – the CPU. The x86 processor architecture has a fairly small number of registers, but to maintain backward compatibility with 16-bit and even 8-bit instructions, some of them are a little strange. Take, for instance, the AL, AH, AX, and EAX registers – each of these refers to a different part of the same 32-bit value:

[Image: diagram of the AL, AH, AX, and EAX registers overlapping within a single 32-bit value]

So why does this matter? Since the x86 is a little-endian architecture, any arbitrary 32-bit value in memory may be treated like this collection of registers. In the constructor for Aeon’s Processor class, I just allocate a small block of memory on a native heap, initialize a bunch of pointers to the different registers within that block, and then I can use properties to wrap this:

/// <summary>
/// Gets or sets the value of the EAX register.
/// </summary>
public int EAX
{
    get
    {
        unsafe
        {
            return *(int*)PAX;
        }
    }
    set
    {
        unsafe
        {
            *(int*)PAX = value;
        }
    }
}

This has the benefit of automatically keeping the values of AL, AH, AX, and EAX in sync every time one of them changes, without the need for any explicit masking and shifting, but it also allows the registers to be accessed efficiently using an index. When decoding instructions, operands that refer to registers use three bits as a register identifier. By also initializing an array of pointers with these identifiers as indices, I can get a pointer to any decoded register very quickly:

return wordRegisterPointers[rmCode];
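
To make that more concrete, here’s a rough sketch of what the setup might look like. The names PAX and wordRegisterPointers come from the snippets above; the layout, offsets, and everything else are my own illustrative guesses, not Aeon’s actual code:

using System;
using System.Runtime.InteropServices;

// Hypothetical sketch of the register block and lookup table described above.
public sealed unsafe class Processor
{
    private readonly IntPtr registerBlock;
    private readonly byte* PAX;   // EAX, AX, AH, and AL all live at this address
    private readonly byte* PCX;
    private readonly byte* PDX;
    private readonly byte* PBX;
    private readonly ushort*[] wordRegisterPointers = new ushort*[8];

    public Processor()
    {
        // One small native block holds the general-purpose registers.
        // (A real class would zero it and free it when done.)
        this.registerBlock = Marshal.AllocHGlobal(16);
        byte* p = (byte*)this.registerBlock;
        this.PAX = p;        // EAX at offset 0
        this.PCX = p + 4;    // ECX at offset 4
        this.PDX = p + 8;    // EDX at offset 8
        this.PBX = p + 12;   // EBX at offset 12

        // Index the 16-bit registers by their 3-bit identifier from the
        // instruction encoding: 000=AX, 001=CX, 010=DX, 011=BX. (SP, BP, SI,
        // and DI are omitted from this sketch.)
        this.wordRegisterPointers[0] = (ushort*)this.PAX;
        this.wordRegisterPointers[1] = (ushort*)this.PCX;
        this.wordRegisterPointers[2] = (ushort*)this.PDX;
        this.wordRegisterPointers[3] = (ushort*)this.PBX;
    }

    // Because x86 is little-endian, AL is the first byte and AH the second
    // byte of the same doubleword that EAX occupies.
    public byte AL
    {
        get { return *this.PAX; }
        set { *this.PAX = value; }
    }

    public byte AH
    {
        get { return *(this.PAX + 1); }
        set { *(this.PAX + 1) = value; }
    }

    public ushort AX
    {
        get { return *(ushort*)this.PAX; }
        set { *(ushort*)this.PAX = value; }
    }

    // Decoded operands can then resolve a register with a single array lookup.
    public ushort* GetWordRegisterPointer(int rmCode)
    {
        return this.wordRegisterPointers[rmCode];
    }
}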

Creating my own backing store for what are essentially members of a class is not something I make a habit out of, and I am not advocating this as “faster” or “better” than using normal C# fields or auto-properties to store data. In fact, in almost all cases doing what I’ve done here is a terrible abuse of unsafe code in C#. However, given the performance advantages in this case, I believe it’s justified, and it’s really nice to have this kind of power while still having all the advantages of C#’s abstraction.

Target Architecture

The whole reason I’ve been able to get away with most of these unsafe “enhancements” is that I’ve limited the processors I want Aeon to run on to x86 and x64 architectures. What I’m really doing is taking advantage of the fact that the emulated processor and the host processor are both byte-addressable and little-endian. I could certainly work around this and, indeed, all of this used to be implemented as some bitwise logic in the register get/set properties. However, placing that much logic inside properties tended to prevent their inlining by the JIT compiler (I’ll get more into JIT issues next time), so the instruction decoding phase was significantly slower.
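
For comparison, the earlier bit-twiddling version would have looked roughly like this. This is a hypothetical reconstruction of that approach, not the actual old code:

// Hypothetical reconstruction of the old approach: one managed field plus
// masking and shifting in every property.
public sealed class LegacyRegisters
{
    private uint eax;

    public uint EAX
    {
        get { return this.eax; }
        set { this.eax = value; }
    }

    public ushort AX
    {
        get { return (ushort)(this.eax & 0xFFFF); }
        set { this.eax = (this.eax & 0xFFFF0000) | value; }
    }

    public byte AL
    {
        get { return (byte)(this.eax & 0xFF); }
        set { this.eax = (this.eax & 0xFFFFFF00) | value; }
    }

    public byte AH
    {
        get { return (byte)((this.eax >> 8) & 0xFF); }
        set { this.eax = (this.eax & 0xFFFF00FF) | ((uint)value << 8); }
    }
}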

September 27, 2009

Commander Keen 1

Filed under: Aeon, Fun & Games — Greg Divis @ 7:27 pm

One of the nice things about a project like Aeon is that it let me use old games I’d played as a kid as milestones. The first graphical game I got working in Aeon was Commander Keen: Invasion of the Vorticons.

[Image: Commander Keen 1 title screen]

The graphics are pretty primitive, and the only sound effects you get are from the PC speaker, but it boasts large, smooth-scrolling levels and a nice variety of hazards and enemies. For a shareware game released in 1990, it’s really quite impressive, and it was followed by five more episodes.

[Image: Commander Keen 1 gameplay screenshot]

This was one of id Software’s first games, and you can still download the shareware episode from 3D Realms.

I’ll be back to that performance series next time I post.

September 24, 2009

Performance Considerations – Garbage Collection

Filed under: Aeon, Performance — Greg Divis @ 9:06 pm

I’m going to start my series on performance with something that other people have covered many times already, and it’s something that almost always comes up when you’re talking about performance in C# – garbage collection. From my experience, lots of developers don’t understand it, and some think they understand it but don’t. I don’t claim to be an expert on the subject, but I at least have a high-level understanding of how the .NET garbage collector works. If you’re developing something that may be CPU-bound in .NET, do yourself a favor and read a little bit about the garbage collector if you haven’t already.

Ok, now we should be on the same page. To sum up the parts relevant to this post: we have a generational GC with 3 generations and a large object heap. Allocations less than about 85k are taken from Gen 0 and are very fast. The GC suspends all of your application threads when it runs and compacts the generational heap (or part of it) to make it contiguous, and it’s adaptive so it will run more often if you allocate memory on the heap frequently.

This is a great design, because most applications only use the CPU for short periods, then sit idle. The GC works behind the scenes and you don’t even notice it. Aeon, however, doesn’t sit idle unless there’s no program loaded. During emulation, it’s constantly decoding instructions and emulating them, so I had to be careful not to create any garbage along this path.

The Garbage Collector and Aeon

One of the nice things about .NET is all of the instrumentation you get for free. I fired up PerfMon and set it to display the total number of Gen 0-2 collections as Aeon is running Wolfenstein 3D:
[Image: PerfMon graph of Gen 0–2 collection counts while Aeon runs Wolfenstein 3D]

This is about as close to never running as it gets. There’s a Gen 0 collection every 10 seconds, and that’s about it. A look at heap size confirms this:
[Image: PerfMon graph of Aeon’s managed heap sizes]

Too Much Garbage

Next, let’s have a look at what happens if I create something on the managed heap for each emulated instruction. I added this line to the critical path so it runs on every instruction:

var rubbish = new object();

Now, System.Object is about as small as it gets for an individual allocation on the managed heap, but this line runs millions of times per second. What effect does this have on performance? Let’s have a look at those Gen 0–2 collections again:
[Image: PerfMon graph of Gen 0–2 collection counts with the extra allocation on every instruction]

Hey, that doesn’t look so bad, right? I mean, it’s going up constantly, sure, but not too quickly. Actually, take a look at the scale – the value is divided by 1000 before it’s plotted. If it weren’t scaled, the red line would go straight up. Now let’s look at the heap sizes again:
[Image: PerfMon graph of managed heap sizes with the extra allocation on every instruction]

Clearly there’s a lot of activity, and this isn’t the kind of thing you’d like to see happen in a tight loop. There’s one more bit of data to take a look at before we move on, and that’s the percentage of time spent in garbage collections:
[Image: PerfMon graph of the percentage of time spent in garbage collection]

It’s a horizontal line, which we should expect since we’re creating garbage at a constant rate, but it seems a little too low to worry about, right? It’s only averaging 3.6% (without the inserted line of code, it stays at 0). Well… maybe, and maybe not. Here’s the actual instruction throughput of Wolfenstein 3D in the normal and extra-garbage cases:

            Normal    Garbage Mode    % difference
Aeon MIPS   22        15              32%

So the GC running 4% of the time actually caused a decrease in throughput of 32% – pretty significant, but not necessarily a show-stopper. After all, before a bunch of optimizations I made a while ago, Aeon ran this game at around 15 MIPS and it played at full speed. But wait – when I run it in “garbage mode,” the game is noticeably sluggish and the music even plays back in slow-motion. What’s going on?

Nasty Side-effects

We really have two unsolved issues here. First of all, why does a GC time of 3.6% cause such a large decrease in instruction throughput? Look again at the graphs above – the number of Gen 0 collections is rapidly increasing at a constant rate, but the % time in GC is pretty low. So… we’re collecting a lot of garbage but not spending much time doing it. I guess time spent in the garbage collector isn’t our problem. Actually, the fact that it’s only 3.6% is pretty amazing given all the garbage being thrown at it. (One of the reasons it’s so quick in this case is that all of this garbage is short-lived and never makes it out of Gen 0.)

This is where I have to enter guessing territory. I know that all of the managed .NET threads are suspended while the GC runs. Suspending and resuming a thread has some overhead, which could add up if it’s happening constantly. Another consequence of the GC running is that it touches regions of memory our program isn’t otherwise concerned with, which could be polluting the CPU cache. When the collector runs infrequently this isn’t a big deal, but we’re hammering on it here and could be causing way more cache misses than before. Nothing kills CPU performance like cache misses.

As for the game running sluggishly, I suspect this is also due to the fact that all of the managed threads are getting constantly suspended – including input handling, WPF, FM music synthesis, and everything else. I’m open to corrections on any of this.

Lessons Learned

I may be way off base in my wild speculation about why the performance penalty for excessive collections is so severe, but really that doesn’t even matter. This is an extreme case, and one look at the graph showing GC collections shooting up like that should be enough to indicate that this code is producing excessive garbage.

The cause of the garbage problem was pretty obvious in this case, not least because I deliberately added it. In real-world code with this problem it’s usually not that easy. Maybe the program is calling some library code that generates garbage; maybe there’s a line of code left over from debugging that was never removed, and all of a sudden your program takes twice as long to run. I’ve even seen a case where the author of a program generated a massive text file via string concatenation, not realizing that garbage was produced every time that + operator was used – just changing it to use a StringBuilder turned a 5-minute task into a 1-second task (it was a really big file).
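
To illustrate that last example, here’s the shape of the problem (this is just a sketch, not the actual program in question):

using System.Text;

public static class TextExample
{
    // Each += allocates a brand-new string and copies everything built so
    // far, leaving the previous string behind as garbage.
    public static string BuildSlow(string[] lines)
    {
        string result = string.Empty;
        foreach (string line in lines)
            result += line + "\r\n";
        return result;
    }

    // StringBuilder appends into a reusable buffer, producing far less garbage.
    public static string BuildFast(string[] lines)
    {
        var builder = new StringBuilder();
        foreach (string line in lines)
            builder.AppendLine(line);
        return builder.ToString();
    }
}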

Bottom line: If you have some managed code that’s running slowly and you’re not able to profile it, check its garbage collection rate with PerfMon, especially if you’re calling into some external code in a tight loop. Maybe whoever wrote that external code likes string concatenation.
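
If PerfMon isn’t handy, the standard GC.CollectionCount method makes a crude in-process spot-check possible too. This is just a quick sketch of the idea – it isn’t something Aeon actually does:

using System;

// Quick spot-check: sample the GC collection counters around a chunk of work.
public static class GcSpotCheck
{
    public static void Measure(Action work)
    {
        int gen0 = GC.CollectionCount(0);
        int gen1 = GC.CollectionCount(1);
        int gen2 = GC.CollectionCount(2);

        work();

        Console.WriteLine(
            "Collections during work: Gen 0 = {0}, Gen 1 = {1}, Gen 2 = {2}",
            GC.CollectionCount(0) - gen0,
            GC.CollectionCount(1) - gen1,
            GC.CollectionCount(2) - gen2);
    }
}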

Performance Considerations – Overview

Filed under: Aeon, Performance — Greg Divis @ 10:01 am

This is the beginning of a series of posts on some of the challenges I faced in getting Aeon to run at an acceptable speed. Right now, I need to set up some background information so the following posts will make more sense. If you find any of this patronizing, unnecessary, or just too rambling, feel free to skip it.

The goal of Aeon is to simulate the environment of a 386 PC running DOS. By far, the single most taxing component is the CPU emulation. In Aeon, this is given its own thread, which, just like a real CPU, decodes and executes instructions sequentially (yes, I know sequentially isn’t technically accurate for modern processors). Basically, if this part of Aeon is too slow, it doesn’t matter how fast anything else runs. All of the posts in this series will just consider performance as it relates to this critical path. There are plenty of other areas to talk about, though, and I’ll get to them eventually :).
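
To be clear about what I mean by the critical path, here’s roughly the shape of that loop. This is only an illustrative sketch; the real decode and execute code is far more involved:

using System;
using System.Threading;

// Illustrative sketch of the emulation thread: everything inside this loop
// is the critical path this series is concerned with.
public sealed class EmulationLoop
{
    private volatile bool running;

    public void Start(Action emulateOneInstruction)
    {
        this.running = true;
        var thread = new Thread(() =>
        {
            while (this.running)
            {
                // Decode the instruction at CS:IP, execute it, advance IP.
                emulateOneInstruction();
            }
        });
        thread.IsBackground = true;
        thread.Start();
    }

    public void Stop()
    {
        this.running = false;
    }
}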

So let’s get some metrics defined now. I’m largely going to be measuring instruction throughput in MIPS. I’m aware that this isn’t a terribly useful real-world metric, but I find it provides a good indication of the impact of optimizations in my code. For instance, let’s say I run a particular game and it averages about 18 MIPS in Aeon. I then optimize part of the ADD instruction and it averages 20 MIPS in the same part of the same game. That’s a significant relative increase in throughput. It’s a very simple way for me to measure relative speed without resorting to more sophisticated profiling.
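
For what it’s worth, this kind of measurement doesn’t require anything fancy: a counter bumped in the emulation loop and a Stopwatch sampled periodically will do. Here’s an illustrative sketch, not Aeon’s actual measurement code:

using System.Diagnostics;

// Illustrative sketch: estimate MIPS by sampling an instruction counter at
// regular intervals. (Thread-safety is ignored for brevity.)
public sealed class MipsCounter
{
    private readonly Stopwatch stopwatch = Stopwatch.StartNew();
    private long instructionCount;
    private long lastSampleCount;

    // Called from the emulation loop for every emulated instruction.
    public void Increment()
    {
        this.instructionCount++;
    }

    // Called periodically (say, once per second) to read the current rate.
    public double SampleMips()
    {
        long count = this.instructionCount;
        double seconds = this.stopwatch.Elapsed.TotalSeconds;
        double mips = (count - this.lastSampleCount) / seconds / 1000000.0;

        this.lastSampleCount = count;
        this.stopwatch.Reset();
        this.stopwatch.Start();
        return mips;
    }
}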

I mentioned profiling above. I wrote Aeon in C#, and yes, Visual Studio includes some very nice managed code profiling tools which were very useful to me in the beginning for isolating the biggest performance bottlenecks. However, we’re dealing with tens of millions of instructions per second, and each one involves multiple method calls. Were I to instrument this with a managed code profiler now, it would fill up my hard drive before I got any useful information, not to mention destroying the somewhat delicate timing of my hardware timer emulation. I’ve been wanting to add a more specialized profiler to Aeon that I can use to measure how often certain instructions are emulated and how long they take, but it’s been running “good enough” for now, so this hasn’t been a priority.

Finally, I’d like to briefly mention the .NET JIT compiler. Basically, .NET methods are compiled into machine code the first time they are called (a bit of a simplification), so the compiler isn’t able to optimize the code as much as a static compiler can. The upshot to this is that Aeon runs as-is on x86 and x64, and for various reasons that I won’t get into right now, the x64-compiled version is actually a bit faster. Another benefit of this approach is that as future versions of the .NET framework are released, Aeon will take advantage of improvements in the JIT compiler. As an example, Wolfenstein 3D runs on my Phenom II processor at about 19 MIPS with the .NET 3.5 SP1 version of Aeon. When I tested the beta release of .NET 4, the same game managed around 26 MIPS.

Hopefully now you have an idea of where I’m coming from when I’m talking about CPU performance. Next time I’ll dig into some of the more interesting details.

September 23, 2009

Do we really need another DOS emulator?

Filed under: Aeon — Greg Divis @ 2:47 pm

I touched on my reasons for starting this project on my project site, but since this is a blog and all, this seems like a good place to elaborate. Of course, at the time I started this last year, there were already plenty of solutions available for someone looking to run a DOS program or game on Windows. Windows itself will run most simple DOS applications just fine (if it’s one of the 32-bit editions). In cases where this fails, DOSBox has been around for quite a while and can run just about anything.

So clearly I’m not trying to solve a problem here that isn’t already solved. Aeon started out as an experiment to see if I could get acceptable performance out of C# when emulating an entire PC from the early ’90s. It was also an outlet for me to work on a challenging project completely under my control.

Great, so those are perfectly valid reasons for me to write another emulator, but why put it on the Internet and blog about it? Well, I’ve learned a lot of stuff since I started, and I thought other developers might find it interesting – so that’s where the blog comes in. As for posting it on the net, that’s much easier to answer – to show off, of course!

That’s about it for introductory posts. I’ll be talking more next time about Aeon’s architecture and some of the crazy, uh, *features* of the x86 architecture that caused me a lot of grief.

Cheers!

September 22, 2009

Aeon version 0.45

Filed under: Aeon, New Version — Greg Divis @ 11:41 pm

I figured it was about time for another posted build. Grab it here.

Most of the work on this one went into adding support for 32-bit addressing and improving the Sound Blaster emulation. Here’s the complete change list:

  • Rewrote DMA emulation to be more accurate
  • Improved Sound Blaster 2 DSP emulation
  • Implemented most 32-bit addressing modes
  • Text-mode cursor is now displayed

As always, this also includes some small compatibility fixes.

Hello World

Filed under: Uncategorized — Greg Divis @ 11:34 pm

This is where I’m going to post news about Aeon and any other programming-related stuff that interests me. More to come…
