My Game Boy emulator was both the first emulator I wrote and my first real project written in Rust. Needless to say, there’s a lot of room for improvement. I briefly thought about porting it into my multi-system emulator which doesn’t currently have a Game Boy core, but this also seemed like a good opportunity to go back and clean things up, and it frankly seemed easier to do a complete rewrite while using my v1 as a reference for trickier details (e.g. how the CPU’s DAA instruction behaves for inputs that are officially undefined).
Here We Go Again
The first step of emulating any system is to emulate the CPU. Without a mostly fully-functional CPU, you can’t run any software.
The Game Boy CPU is a Sharp SM83, which is kind of like a Z80-lite, but it’s different enough that you can’t just drop in a Z80 core and expect it to work. For one, the sign and overflow flags are gone, and the remaining flags are in different bits. For another, some instructions have subtle behavior differences, like DAA (Decimal Adjust Accumulator). For one more, the cycle timings are different - SM83’s timings are much simpler since each M-cycle (machine/memory cycle) takes the same number of T-cycles (time/clock cycles) which is not the case for Z80.
Rather than describe exactly how to emulate this CPU (there are gazillions of other resources on how to do that), I’m going to highlight what I did differently the second time around.
No More Parsing
In my first emulator, I separated opcode decoding and opcode execution into two clearly distinct steps. This was enabled by an Instruction
enum that looks like this:
|
|
Opcode decoding is a big match
covering all 245 valid opcodes (with a separate match for the $CB
-prefixed 2-byte opcodes):
|
|
Then there’s another match
for instruction execution:
|
|
While this separation initially seemed nice and clean to me, in retrospect it was kind of pointless as it didn’t really provide any benefit. (I also inlined most of the instruction implementations into that second match
, which is definitely not clean!) Having two matches also (slightly) hurts performance due to needing to branch on opcode twice for every instruction: once on the opcode itself during decoding, and then again on the enum variant when executing.
For v2 I wrote a separate function for every instruction, while grouping extremely similar instructions like ADD A, r
and ADD A, (HL)
together. I then combined the two separate match
expressions into a big match
that directly calls the function for the specified instruction:
|
|
A single instruction function looks like this:
|
|
As is somewhat evident there, another thing I did differently is storing the F register as a struct of bool
s instead of a single u8
:
|
|
I implemented From<Flags> for u8
and From<u8> for Flags
to make it really straightforward to convert to and from the byte representation for the few instructions that need it (e.g. PUSH AF
and POP AF
). This probably makes little difference in terms of performance (maybe even negative difference) but I think it makes the code that deals with flags a lot cleaner.
Bus Interface?
You might have noticed that <B: BusInterface>
generic type and trait bound in the above function for INC r
. That is related to my current solution to dealing with Rust’s ownership and borrow checker rules. (It doesn’t really need to be generic for this particular CPU, but the flexibility is nice if I ever wanted to test it against test suites that aren’t just GB test ROMs.)
This is a problem that is not present in garbage collected languages such as Java, Kotlin, Python, etc. because you can freely store references to values wherever you want and the garbage collector will take care of freeing memory when it is no longer referenced. Rust does not work that way because Rust does not have a garbage collector. You don’t need to manually deallocate memory like you do in C/C++/Zig (heap allocations are freed automatically when the owning value goes out of scope like with C++ smart pointers), but the compiler has a lot of rules in place to ensure that you don’t attempt to access memory that may not be valid.
Rust has two types of references: shared references (&
) and mutable references (&mut
). Shared references only allow read access to the underlying value, while mutable references allow both read and write access.
Without diving too deep, there are two main rules that the compiler’s borrow checker enforces:
- A reference to a value may not outlive the value itself
- An important note is that moving a value immediately invalidates all live references
- Any number of shared references to a value may be live simultaneously, but only one mutable reference may be live at a time, and no shared references can be live while a mutable reference is live
- As a corollary, a value cannot be mutated while there are any live shared references to it, and even the owner may not mutate a value while there is any live reference to it (whether shared or mutable)
There are ways to work around the borrow checker’s limitations without resorting to raw pointers and unsafe
, such as Rc<RefCell<T>>
which wraps a value with a reference-counted pointer that allows for interior mutability, but those do have a small runtime cost due to needing to enforce the borrow checker rules at runtime rather than only at compile time.
Anyway, these ownership rules present a problem for an emulator, where emulated CPU writes can affect every other system component: the graphics processor, the audio processor, things like the Game Boy’s system timer, etc. The CPU can’t just hold mutable references to the other components because that would violate the borrow checker rules as soon as you tried to do anything with the other components directly.
Here’s how I handled the bus in v1, a struct called AddressSpace
that owns RAM and all memory-mapped I/O registers (basically everything that the CPU can read from or write to):
|
|
It has methods that look like this (definitions omitted for brevity):
|
|
There’s a lot that’s not great about this:
- Needing to pass the current PPU state through all of the CPU code in order to read from or write to memory addresses, since the CPU cannot access VRAM or OAM while the PPU is actively using them
- Having separate access methods for the CPU and the PPU
- Needing to pass this “almost god object” to every single component because it owns the memory-mapped I/O registers that configure the PPU, APU, timer, etc., as well as VRAM and OAM which should really be owned by the PPU
While this did spare me from needing to use Rc<RefCell<T>>
everywhere, it’s definitely not ideal.
I’ve since seen two other common solutions to this problem:
- Have the CPU own all other components so that it can freely mutate them during CPU execution, while also allowing each component to own what it should for encapsulation (e.g. the PPU and APU owning their registers and memory)
- Note this approach does not work well for systems that have multiple CPUs with a shared bus, such as the Sega Genesis (68000 + Z80)
- Create a “god object” that contains every component, and pass the god object to every single function instead of trying to encapsulate anything using methods
I didn’t really like either of these approaches. Instead, I did pretty much the same thing I did with my last few CPU emulators (Z80 / 68000 / 65816): for each CPU execution, create a short-lived bus value that contains mutable references to all of the components that the CPU might need to mutate, and pass that in to the CPU as a parameter. The borrow checker is fine with this because the mutable references are all dropped immediately after the CPU finishes execution.
The bus struct looks like this (not currently complete as v2 is still WIP):
|
|
The CPU accesses it through a small handful of methods that delegate to the appropriate component (some omitted here):
|
|
The memory-mapped I/O registers also delegate instead of storing everything in a central struct:
|
|
Running the CPU from the outermost emulator struct looks like this:
|
|
It does make the CPU code a little noisy due to needing to pass a &mut Bus
around to every single function and method, but I personally prefer that to the alternatives I’ve seen. Certainly not claiming there’s not a better way though!
Memory Cycle Accuracy
My first emulator only had instruction-level accuracy. What that means is that for the core emulation loop I execute a single CPU instruction, record how many clock cycles the instruction took, and then run the rest of the components by that many clock cycles. This turns out to be good enough for almost every commercially released Game Boy and Game Boy Color game, but there are a few that don’t work (e.g. Pinball Deluxe), and some demoscene stuff doesn’t work properly.
For round two, I wanted to achieve at least M-cycle-level accuracy, where all components run after every CPU memory access rather than after every CPU instruction. This turned out to be pretty easy to do with how I handled the bus!
I omitted one thing from the Bus
read and write methods above, which is that they actually tick the other components in addition to performing the CPU memory access:
|
|
The idle()
method is necessary because some CPU instructions have “internal” cycles where the instruction does not perform a memory access. For example, RET
(from the Game Boy Complete Technical Reference):
This turned out to work pretty much perfectly, and the v2 passes the mem_timing
test ROMs which validate sub-instruction timing of memory accesses (and which v1 had no hope of ever passing):
Oh Yes, There Were Bugs
Naturally, with a rewrite comes bugs. This is what my first attempt at running blargg’s CPU instructions test ROM looked like:
Well hm! 02 is the interrupts test which is supposed to be one of the harder ones to pass, and yet my CPU passed that one and failed every other test? What?? And how did the test ROM manage to even render any graphical output with such a horribly broken emulated CPU?
So, most of these tests work based on checksums. They initialize CPU and memory to a known state, execute a routine containing the instruction to be tested, and then checksum the final state and compare that to the checksum that would be generated on actual hardware. It turns out that the interrupts test doesn’t use this checksum routine, and neither do the first 5 tests in 01 (“special” instructions).
Of course, the bug was in an instruction that is used in the checksum routine’s inner loop, specifically RRA
(rotate accumulator right through carry). Most instructions were implemented correctly, but the test routine was generating the wrong checksum because of the RRA
bug. Oops. The bug was at least easy to spot once I knew which instruction was problematic:
|
|
Apparently my brain was just not functioning when I wrote this, because I was rotating in using the new carry rather than the existing carry flag. RLA
had the same bug.
After that, almost everything passed! I had 3 remaining bugged instructions that were easy to identify once the test ROM’s checksum routine worked properly:
BIT n, r
was setting the Z flag exactly opposite what it should have beenADD SP, e
was setting the H and C flags based on the upper byte addition (asADD HL, rr
does) rather than the lower byte additionLD HL, SP+e
had the same bug with the H and C flags
I fixed those and all passed:
Continuing On
Now back to work on the PPU, which currently only renders the background layer….