Aeon Emulator Blog

November 10, 2009

Decoding and Decoupling

Filed under: Aeon, Performance — Greg Divis @ 2:16 am

In optimizing code for performance, it’s crucial to know the critical path. Once you know which part of your code most impacts performance, you know where to spend most of your optimization effort. In Aeon, this means focusing on instruction decoding, since every instruction must be decoded before it can be emulated. The x86 is a CISC architecture, which means there are a lot of instructions, with a lot of data packed into a small number of bits. For example, most x86 instructions accept multiple operands, with the operands consisting of any combination of CPU register, memory address, or immediate value. Add to this the fact that memory addressing has implicit register-based offsets encoded in the operand, with two completely different meanings depending on whether the instruction uses 16 or 32-bit addressing, and you have a lot to decode on potentially every instruction.

The more I learned about x86 machine code, the more I wanted to decouple it from the actual instruction implementations. In other words, I wanted to be able to write code that would emulate the function of an instruction without worrying about the form its operands are in, and I wanted this to be fast. Eventually, I did succeed in this…mostly. Here’s what my 16-bit MOV instruction emulator looks like:

[Opcode("89/r rmw,rw|8B/r rw,rmw|8C/r rmw,sreg|8E/r sreg,rmw|A1 ax,moffsw|A3 moffsw,ax|B8+ rw,iw|C7/0 rmw,iw", AddressSize = 16 | 32)]
public static void MoveWord(VirtualMachine vm, out ushort dest, ushort src)
{
    dest = src;
}

Before I get into that giant Opcode attribute on the method, let’s dissect the method itself. The first parameter is present on every instruction, and it contains a reference to the current emulated system – this is mainly used by instructions that need to cause certain side-effects, such as updating processor flags or accessing memory in a specific way; for the MOV instruction it isn’t used, but it’s still required due to the way this method gets invoked. Following the VirtualMachine are the operands for the instruction; in this case, the first one is marked with the out keyword, meaning that the operand is only written to by the method. The second operand is only read from, so it is passed by value. Both operands are 16-bit values, implying that this emulates the 16-bit operand-size version of MOV. Finally, the body of the method simply assigns the value of the second operand to the first. The dest and src parameters could be registers, values in emulated RAM, or immediates, but to MoveWord they are all the same.

One of the things I like about C# is that you can add custom metadata to things like methods and then inspect these at runtime using reflection. In this case, the custom OpcodeAttribute class specifies all of the opcodes and operand formats for each instruction implementation. The | symbol inside the long opcode string above basically means “or,” so all of the substrings between them specify the different MOV opcodes and their operands. Note that all of them take exactly two operands and all of them have a writable first operand (immediate values only ever show up as the second operand). I’ll break down the first form of MOV:

89    Specifies the 1-byte hexadecimal opcode
/r    The byte following the opcode specifies at least one register
rmw   The first operand is either a register or a value in memory of the current processor word-size (16-bit, in this case)
rw    The second operand is a register of the current processor word-size (16-bit)
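
To make the attribute itself concrete, here’s a minimal sketch of what a class like OpcodeAttribute could look like; the member names are my assumptions, and the real class presumably also holds the parsing logic:

[AttributeUsage(AttributeTargets.Method)]
public sealed class OpcodeAttribute : Attribute
{
    public OpcodeAttribute(string opcodeFormat)
    {
        this.OpcodeFormat = opcodeFormat;
    }

    // The |-separated list of encodings, e.g. "89/r rmw,rw|8B/r rw,rmw|...".
    public string OpcodeFormat { get; private set; }

    // Which address sizes the encodings apply to (16, 32, or 16 | 32).
    public int AddressSize { get; set; }
}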

This information, along with the out keyword in the method signature, allows Aeon to generate code for decoding these operands, fetching initial values, and then writing any out or ref values back. Basically, when emulation begins for the first time, all of the opcode methods are enumerated and the OpcodeAttributes on them are parsed into more manageable objects. From this information, a small IL method is generated to decode, call the implementation method, and then write back values if necessary. Because the resulting generated method gets compiled by the JIT like everything else, simple instructions like MOV end up getting automatically inlined, despite the runtime binding.
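
As a rough sketch of that startup phase (the type and member names here are mine, not Aeon’s), the enumeration step might look something like this, with the real work happening in the IL generator:

using System;
using System.Reflection;

static class InstructionBinder
{
    // Scan a type for static opcode methods and register each encoding.
    public static void BindAll(Type container)
    {
        foreach (var method in container.GetMethods(BindingFlags.Public | BindingFlags.Static))
        {
            var attr = (OpcodeAttribute)Attribute.GetCustomAttribute(method, typeof(OpcodeAttribute));
            if (attr == null)
                continue;

            // Each |-separated substring is one opcode/operand form.
            foreach (var form in attr.OpcodeFormat.Split('|'))
            {
                // Parse the form, then emit a small DynamicMethod that
                // decodes the operands, invokes the implementation, and
                // writes back any out/ref values.
            }
        }
    }
}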

Besides the speed, another benefit to this is that it’s really easy for me to add new instructions (provided they don’t encode some type of operand I haven’t dealt with yet). All I need to do is add a static method with the appropriate attribute and the runtime binding handles the rest. The downside is that writing this IL generator was a big pain, and it was very hard to test and nearly impossible to debug. Hmm… I guess it was worth it…

October 6, 2009

Performance Considerations – Low-level Optimization

Filed under: Aeon, Performance — Greg Divis @ 4:09 pm

[Note: All of the disassembly in this post is for the 64-bit x64 architecture. It should be similar enough to 32-bit x86 to be recognizable if you’re already familiar with that. Yes, I know the output of these two JIT compilers can differ significantly, but that’s not addressed here.]

There are many different ways to optimize the performance of code. The ideal solution is to refactor or redesign the offending component, using a more efficient algorithm to perform the same task. Usually this is enough, but once in a while you bump up against a limitation of the compiler or language that you’re using and things get more interesting.

Checking the JIT

The just-in-time compiler in the .NET Common Language Runtime does a good job, but it doesn’t have a lot of time to perform complex optimizations, so sometimes I had to attach a debugger to an optimized build of Aeon just to look at the native machine code it was producing. Fortunately, if you know what settings to change in Visual Studio, it’s pretty easy to do this right from the C# IDE.

Let’s have a look at the disassembly of part of an earlier version of my stosb instruction implementation:

[Disassembly screenshot: Stos1]

This small code fragment increments the DI register if the processor’s direction flag is not set, otherwise it decrements the register. Just like in hardware, the processor flags are stored in a special register in Aeon as a set of bitflags; normally these flags are accessed in code via the Processor.Flags field, but I also added some convenience properties for getting/setting specific flags. For reference, here is the implementation of the Processor.DirectionFlag property getter:

return (this.Flags & EFlags.Direction) != 0;
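
For context, here’s a minimal sketch of the relevant flag definition; the 0x400 value is the x86 direction flag bit and matches the 400h constant discussed below (the other members of the real enum are omitted):

[Flags]
public enum EFlags : uint
{
    // Bit 10 of the x86 FLAGS register: the direction flag used by
    // string instructions such as stosb.
    Direction = 0x400,
    // ...remaining flags omitted...
}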

We can clearly see that the property get method is getting inlined by the JIT, as there are no calls or jumps to external code, and we can see the direction flag (400h) being AND-ed with the flags register on lines 0x3d and 0x43. Also, we can see that lines 0x57-0x61 are the instructions generated for vm.Processor.DI++; and lines 0x66-0x70 are generated for vm.Processor.DI--;

The code for incrementing or decrementing DI is pretty straightforward: load the DI register into eax, add/subtract 1, and store the result back into DI (remember that DI is actually a pointer to the emulated register in memory). That’s great, but why does the actual direction flag test take so many instructions?

An Easy Improvement

It seems that the JIT is being a tad too literal. It has nicely inlined the property, but it also generated code to evaluate the expression in the property getter, expand the result to a boolean value, and then test that result for a nonzero value. I can’t really complain here because this is exactly what I asked it to do, and in most cases it wouldn’t matter; the property is still inlined, which should be efficient enough.

However, this bit of code is in the stosb instruction implementation, an instruction likely to be used repeatedly to write data to memory – it needs to be as efficient as possible. Fortunately, there’s a very simple way to improve it – just check the direction flag directly instead of using the DirectionFlag property to do it:

[Disassembly screenshot: Stos2]

Much better. Now the bit flag is simply tested and the result used directly to determine whether to jump to the else case or execute the increment.
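
In C# terms, the change amounts to something like this, a sketch using the Flags field and EFlags enum from the getter shown earlier:

if ((vm.Processor.Flags & EFlags.Direction) == 0)
    vm.Processor.DI++;
else
    vm.Processor.DI--;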

Knowing When to Stop

By replacing the property evaluation with a bit flag test, I’ve traded a small amount of readability for a slightly reduced number of instructions. When I first made this change, I was unfortunately tempted to try to improve on it further, even though there really aren’t that many instructions left. For instance, I tried factoring out the pointer to the emulated DI register to eliminate an instruction:

[Disassembly screenshot: Stos3]

Ugh. Well, I’ve succeeded in eliminating a mov instruction from both clauses of the if, and replacing them with the mov on line 0x36. (Interestingly, the compiler also chose to use a sub instruction to decrement DI this time instead of add.) However, the same number of instructions is executed as before, since only one or the other clause is run at a time.

There’s a much more significant accomplishment here than eliminating a single instruction – I’ve managed to overcomplicate one of the simplest of all programming tasks: incrementing a number. Giving up readability for some measure of performance can be an acceptable trade-off, but trading readability for no measurable improvement is no trade at all. All you’re getting is some obfuscated C# code.

Needless to say, I left it at my original improvement and called it good. Even that change is questionable – I know it’s better from the disassembly, but it’s difficult to measure how much better it is. The only reason I left it that way is because it’s still quite readable.

The Big Picture

It’s worth noting that this sort of fine-tuning was not the first thing I did to speed up emulation; I’ve actually rewritten the code responsible for decoding opcodes and their operands a couple times before resorting to this. Aeon’s at a point now where there aren’t too many other structural changes I can make to greatly improve CPU emulation performance short of implementing a proper dynamic recompiler. (I do plan to do this eventually…)

Since performance on my test systems has been good enough for me, I’ve since decided to shift to improving compatibility and getting all of the plumbing installed for emulating x86 protected mode. I’m not completely happy with my implementation so far, and there is certainly room for improvement in its performance, but this project is a hobby of mine, so I have to be my own project manager, and further performance work just isn’t on my short-term schedule.

This is the last generic “performance”-related post I’m going to write for a while. The goal here was just to illustrate some of the more basic strategies I used to get reasonable performance without too much development effort. Expect some more diverse and hopefully interesting posts from now on. :)

September 28, 2009

Performance Considerations – Going Unsafe

Filed under: Aeon, Performance — Greg Divis @ 8:44 pm

I like the unsafe keyword in C#. Don’t get me wrong, I don’t like to use it, but I like that it’s called “unsafe.” Not only is it “unsafe” in that it might cause an access violation outright, but using it throws any guarantee of type safety in the Common Language Runtime right out the window. In an unsafe block of code, there’s nothing stopping you from munging data in your process’s address space just like you could in good old C++. Even worse, using unsafe code means you’re probably messing around with native pointers, the size and behavior of which may vary depending on the platform the program is running on. On x86 and x64, the only significant difference in pointers (as they relate here) is their size, but on other platforms there can be even bigger issues with endianness and pointer alignment.

Unsafe code is initially disabled for new C# projects in Visual Studio (with good reason), but there are a couple cases where it’s incredibly useful: interoperability with native code, and spot optimization. Aeon uses unsafe code for both cases, though I tried to avoid using it as much as possible.

Conventional Memory

If you’re unfamiliar with the “real-mode” architecture of an x86 processor, count yourself lucky. I’ll be going into lots more detail about this later on, but for now you just need to know that it has 1MB of addressable RAM. Memory on a PC may be byte-addressable, but that doesn’t necessarily mean you always want to treat it as an array of bytes. Some instructions operate on bytes, others on 16-bit words, and others on 32-bit doublewords. Modeling RAM as a .NET Byte array would be possible, but clumsy and slower than I’d like for anything but byte-value access. In this case, I just want to set aside a block of memory to store arbitrary values, because that’s exactly what it’s modeling.

Fortunately, unsafe allows me to access memory using pointer arithmetic and a specified word size, just like in C++. Consider the following code snippet from Aeon’s PhysicalMemory.GetUInt32 method:

unsafe
{
    return *(uint*)(rawView + addr);
}

In this snippet, rawView is a byte pointer to the allocated block of emulated RAM, and addr is the byte offset of the address to read. Emulated video RAM works in a similar fashion.
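
To make the snippet self-contained, here’s a sketch of how such a class might be set up; everything beyond rawView and GetUInt32 is my assumption, and the real PhysicalMemory class surely does more (such as the video RAM handling mentioned above):

using System;
using System.Runtime.InteropServices;

public sealed unsafe class PhysicalMemory
{
    // 1MB real-mode address space, allocated outside the managed heap so
    // no pinning is required for pointer access.
    private const int ConventionalMemorySize = 1024 * 1024;
    private readonly byte* rawView;

    public PhysicalMemory()
    {
        this.rawView = (byte*)Marshal.AllocHGlobal(ConventionalMemorySize);
    }

    public uint GetUInt32(uint addr)
    {
        return *(uint*)(rawView + addr);
    }

    public void SetUInt32(uint addr, uint value)
    {
        *(uint*)(rawView + addr) = value;
    }
}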

Processor

So far everything is pretty basic, but there is one other area I used unsafe code aside from emulated memory – the CPU. The x86 processor architecture has a fairly small number of registers, but to maintain backward compatibility with 16 and even 8-bit instructions, some of them are a little strange. Take, for instance, the AL, AH, AX, and EAX registers – each of these registers refers to a different part of the same 32-bit value:

[Diagram: RegisterDiag, showing AL, AH, AX, and EAX as overlapping parts of the same 32-bit value]

So why does this matter? Since the x86 is a little-endian architecture, any arbitrary 32-bit value in memory may be treated like this collection of registers. In the constructor for Aeon’s Processor class, I just allocate a small block of memory on a native heap, initialize a bunch of pointers to the different registers within that block, and then use properties to wrap this:

/// <summary>
/// Gets or sets the value of the EAX register.
/// </summary>
public int EAX
{
    get
    {
        unsafe
        {
            return *(int*)PAX;
        }
    }
    set
    {
        unsafe
        {
            *(int*)PAX = value;
        }
    }
}

This has the benefit of automatically keeping the values of AL, AH, AX, and EAX in sync every time one of them changes without the need for any explicit masking and shifting, but it also allows the registers to be accessed efficiently using an index. When decoding instructions, operands which refer to registers use three bits as a register identifier. By also initializing an array of pointers with these identifiers as indices, I can get a pointer to any decoded register very quickly:

return wordRegisterPointers[rmCode];
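
Here wordRegisterPointers would be built once in the Processor constructor. A sketch follows; the PCX through PDI pointer names are my assumption, extending the PAX naming above, with the entries ordered by the standard x86 3-bit register codes:

unsafe
{
    // Indexed by the 3-bit register code from the instruction encoding:
    // AX, CX, DX, BX, SP, BP, SI, DI.
    this.wordRegisterPointers = new ushort*[8]
    {
        (ushort*)PAX, (ushort*)PCX, (ushort*)PDX, (ushort*)PBX,
        (ushort*)PSP, (ushort*)PBP, (ushort*)PSI, (ushort*)PDI
    };
}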

Creating my own backing store for what are essentially members of a class is not something I make a habit out of, and I am not advocating this as “faster” or “better” than using normal C# fields or auto-properties to store data. In fact, in almost all cases doing what I’ve done here is a terrible abuse of unsafe code in C#. However, given the performance advantages in this case, I believe it’s justified, and it’s really nice to have this kind of power while still having all the advantages of C#’s abstraction.

Target Architecture

The whole reason I’ve been able to get away with most of these unsafe “enhancements” is that I’ve limited the processors I want Aeon to run on to x86 and x64 architectures. What I’m really doing is taking advantage of the fact that the emulated processor and the host processor are both byte-addressable and little-endian. I could certainly work around this and, indeed, all of this used to be implemented as some bitwise logic in the register get/set properties. However, placing that much logic inside properties tended to prevent their inlining by the JIT compiler (I’ll get more into JIT issues next time), so the instruction decoding phase was significantly slower.

September 24, 2009

Performance Considerations – Garbage Collection

Filed under: Aeon, Performance — Greg Divis @ 9:06 pm

I’m going to start my series on performance with something that other people have covered many times already, and it’s something that almost always comes up when you’re talking about performance in C# – garbage collection. In my experience, lots of developers don’t understand it, and some think they understand it but don’t. I don’t claim to be an expert on the subject, but I at least have a high-level understanding of how the .NET garbage collector works. If you’re developing something that may be CPU-bound in .NET, do yourself a favor and read a little bit about the garbage collector if you haven’t already.

OK, now we should be on the same page. To sum up the parts relevant to this post: we have a generational GC with three generations plus a large object heap. Allocations smaller than about 85KB come from Gen 0 and are very fast. The GC suspends all of your application threads when it runs and compacts the generational heap (or part of it) to make it contiguous, and it’s adaptive, so it will run more often if you allocate heap memory frequently.
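
As an aside, you can read these same collection counts from inside a process; GC.CollectionCount and GC.MaxGeneration are standard framework APIs:

// Print how many times each generation has been collected so far.
for (int gen = 0; gen <= GC.MaxGeneration; gen++)
    Console.WriteLine("Gen {0}: {1} collections", gen, GC.CollectionCount(gen));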

This is a great design, because most applications only use the CPU for short periods, then sit idle. The GC works behind the scenes and you don’t even notice it. Aeon, however, doesn’t sit idle unless there’s no program loaded. During emulation, it’s constantly decoding instructions and emulating them, so I had to be careful not to create any garbage along this path.

The Garbage Collector and Aeon

One of the nice things about .NET is all of the instrumentation you get for free. I fired up PerfMon and set it to display the total number of Gen 0-2 collections while Aeon runs Wolfenstein 3D:
[PerfMon graph: Aeon GC count]

This is about as close to never running as it gets. There’s a Gen 0 collection every 10 seconds, and that’s about it. A look at heap size confirms this:
[PerfMon graph: Aeon memory usage]

Too Much Garbage

Next we’ll have a look at what would happen if I had created something on the managed heap for each emulated instruction. I added this line to the critical path so it runs on every instruction:

var rubbish = new object();

Now, System.Object is about as small as it gets for an individual allocation on the managed heap, but this line runs millions of times per second. What effect does this have on performance? Let’s have a look at those Gen 0-2 collections again:
[PerfMon graph: Too much rubbish]

Hey, that doesn’t look so bad, right? I mean, it’s going up constantly, sure, but not too quickly. Actually, take a look at the scale – the counter is divided by 1000 before being plotted. Without that scaling, the red line would go straight up. Now let’s look at the heap sizes again:
[PerfMon graph: Lots of collections]

Clearly there’s a lot of activity, and this isn’t the kind of thing you’d like to see happen in a tight loop. There’s one more bit of data to take a look at before we move on, and that’s the percentage of time spent in garbage collections:
[PerfMon graph: GC Time]

It’s a horizontal line, which we should expect since we’re creating garbage at a constant rate, but it seems a little too low to worry about, right? It’s only averaging 3.6% (without the inserted line of code, it stays at 0). Well… maybe, and maybe not. Here’s the actual instruction throughput in the normal and extra garbage cases of Wolfenstein 3D:

            Normal    Garbage Mode    % difference
Aeon MIPS   22        15              32%

So the GC running 4% of the time actually caused a decrease in throughput of 32% – pretty significant, but not necessarily a show-stopper. After all, before a bunch of optimizations I made a while ago, Aeon ran this game at around 15 MIPS and it played at full speed. But wait – when I run it in “garbage mode,” the game is noticeably sluggish and the music even plays back in slow-motion. What’s going on?

Nasty Side-effects

We really have two unsolved issues here. First of all, why does a GC time of 3.6% cause such a large decrease in instruction throughput? Look again at the graphs above – the number of Gen 0 collections is rapidly increasing at a constant rate, but the % time in GC is pretty low. So… we’re collecting a lot of garbage but not spending much time doing it. I guess time spent in the garbage collector isn’t our problem. Actually, the fact that it’s only 3.6% is pretty amazing given all the garbage being thrown at it. (One of the reasons it’s so quick in this case is that all of this garbage is short-lived and never makes it out of Gen 0.)

This is where I have to enter guessing territory. I know that all of the managed .NET threads are suspended while the GC runs. Suspending and resuming a thread has some overhead, which could add up if it’s happening constantly. Another consequence of the GC running and touching regions of memory our program isn’t otherwise concerned with could be a polluted CPU cache. When the collector runs infrequently, this isn’t a big deal, but we’re hammering on it here and could be causing far more cache misses than before. Nothing kills CPU performance like cache misses.

As for the game running sluggishly, I suspect this is also due to the fact that all of the managed threads are getting constantly suspended – including input handling, WPF, FM music synthesis, and everything else. I’m open to corrections on any of this.

Lessons Learned

I may be way off base in my wild speculations for why the performance penalty for excessive collections is so severe, but really that doesn’t even matter. This is an extreme case, and one look at the graph showing GC collections shooting up like that should be enough to indicate that there’s some excessive garbage produced by this code.

The cause of the garbage problem was pretty obvious in this case, not least because I deliberately added it. In real-world code with this problem it’s usually not that easy. Maybe the program is calling some library code that generates garbage; maybe there’s a line of code left in there from debugging that was never removed, and all of a sudden your program takes twice as long to run. I’ve even seen a case where the author of a program generated a massive text file via string concatenation, not realizing that garbage was produced every time that + operator was used – just changing it to use a StringBuilder turned a 5-minute task into a 1-second task (it was a really big file).
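
As a generic illustration of that last anecdote (this is not the actual program’s code, just the pattern):

// Each += copies the entire string built so far into a brand-new string,
// leaving the old one as garbage, so large files mean quadratic copying.
static string BuildSlow(string[] lines)
{
    string text = "";
    foreach (var line in lines)
        text += line + "\r\n";
    return text;
}

// StringBuilder appends into a reusable internal buffer instead,
// producing far less garbage along the way.
static string BuildFast(string[] lines)
{
    var sb = new System.Text.StringBuilder();
    foreach (var line in lines)
        sb.AppendLine(line);
    return sb.ToString();
}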

Bottom line: If you have some managed code that’s running slowly and you’re not able to profile it, check its garbage collection rate with PerfMon, especially if you’re calling into some external code in a tight loop. Maybe whoever wrote that external code likes string concatenation.

Performance Considerations – Overview

Filed under: Aeon, Performance — Greg Divis @ 10:01 am

This is the beginning of a series of posts on some of the challenges I faced in getting Aeon to run at an acceptable speed. Right now, I need to set up some background information so the following posts will make more sense. If you find any of this patronizing, unnecessary, or just too rambling, feel free to skip it.

The goal of Aeon is to simulate the environment of a 386 PC running DOS. By far, the single most taxing component is the CPU emulation. In Aeon, this is given its own thread, which, just like a real CPU, decodes and executes instructions sequentially (yes, I know sequentially isn’t technically accurate for modern processors). Basically, if this part of Aeon is too slow, it doesn’t matter how fast anything else runs. All of the posts in this series will consider performance only as it relates to this critical path. There are plenty of other areas to talk about, though, and I’ll get to them eventually :).
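
As a deliberately simplified sketch of that loop (every name here is a placeholder of mine; the real loop also deals with prefixes, interrupts, and hardware timing):

// Fetch, decode, and execute one instruction at a time on the CPU thread.
while (vm.IsRunning)
{
    var instruction = decoder.Decode(vm);  // the critical path
    instruction.Emulate(vm);
}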

So let’s get some metrics defined now. I’m largely going to be measuring instruction throughput in MIPS. I’m aware that this isn’t a terribly useful real-world metric, but I find it provides a good indication of the impact of optimizations in my code. For instance, let’s say I run a particular game and it averages about 18 MIPS in Aeon. I then optimize part of the ADD instruction and it averages 20 MIPS in the same part of the same game. That’s a significant relative increase in throughput. It’s a very simple way for me to measure relative speed without resorting to more sophisticated profiling.

I mentioned profiling above. I wrote Aeon in C#, and yes, Visual Studio includes some very nice managed code profiling tools, which were very useful to me in the beginning for isolating the biggest performance bottlenecks. However, we’re dealing with tens of millions of instructions per second, and each one involves multiple method calls. Were I to instrument this with a managed code profiler now, it would fill up my hard drive before I got any useful information, not to mention destroy the somewhat delicate timing of my hardware timer emulation. I’ve been wanting to add a more specialized profiler to Aeon that I can use to measure how often certain instructions are emulated and how long they take, but the emulator has been running “good enough” for now, so this hasn’t been a priority.

Finally, I’d like to briefly mention the .NET JIT compiler. Basically, .NET methods are compiled into machine code the first time they are called (a bit of a simplification), so the compiler isn’t able to optimize the code as much as a static compiler can. The upshot to this is that Aeon runs as-is on x86 and x64, and for various reasons that I won’t get into right now, the x64-compiled version is actually a bit faster. Another benefit of this approach is that as future versions of the .NET framework are released, Aeon will take advantage of improvements in the JIT compiler. As an example, Wolfenstein 3D runs on my Phenom II processor at about 19 MIPS with the .NET 3.5 SP1 version of Aeon. When I tested the beta release of .NET 4, the same game managed around 26 MIPS.

Hopefully now you have an idea of where I’m coming from when I’m talking about CPU performance. Next time I’ll dig into some of the more interesting details.
