Rewriting My Game Boy Emulator: The CPU

My Game Boy emulator was both the first emulator I wrote and my first real project written in Rust. Needless to say, there’s a lot of room for improvement. I briefly thought about porting it into my multi-system emulator which doesn’t currently have a Game Boy core, but this also seemed like a good opportunity to go back and clean things up, and it frankly seemed easier to do a complete rewrite while using my v1 as a reference for trickier details (e.g. how the CPU’s DAA instruction behaves for inputs that are officially undefined).

Here We Go Again

The first step of emulating any system is to emulate the CPU. Without a mostly fully-functional CPU, you can’t run any software.

The Game Boy CPU is a Sharp SM83, which is kind of like a Z80-lite, but it’s different enough that you can’t just drop in a Z80 core and expect it to work. For one, the sign and overflow flags are gone, and the remaining flags are in different bits. For another, some instructions have subtle behavior differences, like DAA (Decimal Adjust Accumulator). For one more, the cycle timings are different - SM83’s timings are much simpler since each M-cycle (machine/memory cycle) takes the same number of T-cycles (time/clock cycles) which is not the case for Z80.

Rather than describe exactly how to emulate this CPU (there are gazillions of other resources on how to do that), I’m going to highlight what I did differently the second time around.

No More Parsing

In my first emulator, I separated opcode decoding and opcode execution into two clearly distinct steps. This was enabled by an Instruction enum that looks like this:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21


#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Instruction {
    // All 8-bit LD/LDH instructions
    Load(WriteTarget, ReadTarget),
    // LD rr, nn
    LoadRegisterPairImmediate(CpuRegisterPair, u16),
    // LD (nn), SP
    LoadDirectStackPointer(u16),
    ...
    // RES n, r / (HL)
    ResetBit(u8, ModifyTarget),
    // SET n, r / (HL)
    SetBit(u8, ModifyTarget),
    // CCF
    ComplementCarryFlag,
    // SCF
    SetCarryFlag,
    // DAA
    DecimalAdjustAccumulator,
    ...
}

Opcode decoding is a big match covering all 245 valid opcodes (with a separate match for the $CB-prefixed 2-byte opcodes):

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14


match opcode {
    0x00 => Ok((Instruction::NoOp, pc + 1)),
    0x01 | 0x11 | 0x21 | 0x31 => {
        let rr = register_pair_for_other_ops(opcode);
        let nn = address_space.read_address_u16(pc + 1, ppu_state);
        Ok((Instruction::LoadRegisterPairImmediate(rr, nn), pc + 3))
    }
    0x02 => Ok((Instruction::Load(WriteTarget::IndirectBC, ReadTarget::Accumulator), pc + 1)),
    0x03 | 0x13 | 0x23 | 0x33 => {
        let rr = register_pair_for_other_ops(opcode);
        Ok((Instruction::IncRegisterPair(rr), pc + 1))
    }
    ...
}

Then there’s another match for instruction execution:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16


match self {
    Self::Load(write_target, read_target) => {
        let value = read_target.read_value(cpu_registers, address_space, ppu_state);
        write_target.write_value(value, cpu_registers, address_space, ppu_state);
    }
    Self::LoadRegisterPairImmediate(rr, nn) => {
        cpu_registers.set_register_pair(rr, nn);
    }
    Self::LoadDirectStackPointer(nn) => {
        address_space.write_address_u16(nn, cpu_registers.sp, ppu_state);
    }
    Self::LoadStackPointerHL => {
        cpu_registers.sp = cpu_registers.hl();
    }
    ...
}

While this separation initially seemed nice and clean to me, in retrospect it was kind of pointless as it didn’t really provide any benefit. (I also inlined most of the instruction implementations into that second match, which is definitely not clean!) Having two matches also (slightly) hurts performance due to needing to branch on opcode twice for every instruction: once on the opcode itself during decoding, and then again on the enum variant when executing.

For v2 I wrote a separate function for every instruction, while grouping extremely similar instructions like ADD A, r and ADD A, (HL) together. I then combined the two separate match expressions into a big match that directly calls the function for the specified instruction:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15


match opcode {
    // NOP
    0x00 => {}
    // LD rr, u16
    0x01 | 0x11 | 0x21 | 0x31 => self.ld_rr_nn(bus, opcode),
    // INC rr
    0x03 | 0x13 | 0x23 | 0x33 => self.inc_rr(bus, opcode),
    // DEC rr
    0x0B | 0x1B | 0x2B | 0x3B => self.dec_rr(bus, opcode),
    // ADD HL, rr
    0x09 | 0x19 | 0x29 | 0x39 => self.add_hl_rr(bus, opcode),
    // INC r / INC (HL)
    0x04 | 0x0C | 0x14 | 0x1C | 0x24 | 0x2C | 0x34 | 0x3C => self.inc_r(bus, opcode),
    ...
}

A single instruction function looks like this:

1
2
3
4
5
6
7
8
9


// INC r: Increment register
pub(super) fn inc_r<B: BusInterface>(&mut self, bus: &mut B, opcode: u8) {
    let value = self.read_register(bus, opcode >> 3).wrapping_add(1);
    self.write_register(bus, opcode >> 3, value);

    self.registers.f.zero = value == 0;
    self.registers.f.subtract = false;
    self.registers.f.half_carry = value & 0x0F == 0;
}

As is somewhat evident there, another thing I did differently is storing the F register as a struct of bools instead of a single u8:

1
2
3
4
5
6
7


#[derive(Debug, Clone, Copy, Encode, Decode)]
struct Flags {
    zero: bool,
    subtract: bool,
    half_carry: bool,
    carry: bool,
}

I implemented From<Flags> for u8 and From<u8> for Flags to make it really straightforward to convert to and from the byte representation for the few instructions that need it (e.g. PUSH AF and POP AF). This probably makes little difference in terms of performance (maybe even negative difference) but I think it makes the code that deals with flags a lot cleaner.

Bus Interface?

You might have noticed that <B: BusInterface> generic type and trait bound in the above function for INC r. That is related to my current solution to dealing with Rust’s ownership and borrow checker rules. (It doesn’t really need to be generic for this particular CPU, but the flexibility is nice if I ever wanted to test it against test suites that aren’t just GB test ROMs.)

This is a problem that is not present in garbage collected languages such as Java, Kotlin, Python, etc. because you can freely store references to values wherever you want and the garbage collector will take care of freeing memory when it is no longer referenced. Rust does not work that way because Rust does not have a garbage collector. You don’t need to manually deallocate memory like you do in C/C++/Zig (heap allocations are freed automatically when the owning value goes out of scope like with C++ smart pointers), but the compiler has a lot of rules in place to ensure that you don’t attempt to access memory that may not be valid.

Rust has two types of references: shared references (&) and mutable references (&mut). Shared references only allow read access to the underlying value, while mutable references allow both read and write access.

Without diving too deep, there are two main rules that the compiler’s borrow checker enforces:

A reference to a value may not outlive the value itself
- An important note is that moving a value immediately invalidates all live references
Any number of shared references to a value may be live simultaneously, but only one mutable reference may be live at a time, and no shared references can be live while a mutable reference is live
- As a corollary, a value cannot be mutated while there are any live shared references to it, and even the owner may not mutate a value while there is any live reference to it (whether shared or mutable)

There are ways to work around the borrow checker’s limitations without resorting to raw pointers and unsafe, such as Rc<RefCell<T>> which wraps a value with a reference-counted pointer that allows for interior mutability, but those do have a small runtime cost due to needing to enforce the borrow checker rules at runtime rather than only at compile time.

Anyway, these ownership rules present a problem for an emulator, where emulated CPU writes can affect every other system component: the graphics processor, the audio processor, things like the Game Boy’s system timer, etc. The CPU can’t just hold mutable references to the other components because that would violate the borrow checker rules as soon as you tried to do anything with the other components directly.

Here’s how I handled the bus in v1, a struct called AddressSpace that owns RAM and all memory-mapped I/O registers (basically everything that the CPU can read from or write to):

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11


pub struct AddressSpace {
    // GB or GBC
    execution_mode: ExecutionMode,
    cartridge: Cartridge,
    vram: [u8; 16384],
    working_ram: [u8; 32768],
    oam: [u8; 160],
    io_registers: IoRegisters,
    hram: [u8; 127],
    ie_register: u8,
}

It has methods that look like this (definitions omitted for brevity):

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27


impl AddressSpace {
    /// Read the value at the given address from the perspective of the CPU. Returns 0xFF if the
    /// CPU is not able to access the given address because of PPU state.
    pub fn read_address_u8(&self, address: u16, ppu_state: &PpuState) -> u8 {
        ...
    }

    /// Read the OAM/VRAM value at the given address from the perspective of the PPU, bypassing the
    /// CPU access check.
    ///
    /// # Panics
    ///
    /// This method will panic if the address is not an OAM or VRAM address.
    pub fn ppu_read_address_u8(&self, address: u16) -> u8 {
        ...
    } 

    /// Assign a value to the given address from the perspective of the CPU. The write is ignored
    /// if the CPU is not allowed to access the given address due to PPU state.
    pub fn write_address_u8(&mut self, address: u16, value: u8, ppu_state: &PpuState) {
        ...
    }

    pub fn get_io_registers_mut(&mut self) -> &mut IoRegisters {
        &mut self.io_registers
    }
}

There’s a lot that’s not great about this:

Needing to pass the current PPU state through all of the CPU code in order to read from or write to memory addresses, since the CPU cannot access VRAM or OAM while the PPU is actively using them
Having separate access methods for the CPU and the PPU
Needing to pass this “almost god object” to every single component because it owns the memory-mapped I/O registers that configure the PPU, APU, timer, etc., as well as VRAM and OAM which should really be owned by the PPU

While this did spare me from needing to use Rc<RefCell<T>> everywhere, it’s definitely not ideal.

I’ve since seen two other common solutions to this problem:

Have the CPU own all other components so that it can freely mutate them during CPU execution, while also allowing each component to own what it should for encapsulation (e.g. the PPU and APU owning their registers and memory)
- Note this approach does not work well for systems that have multiple CPUs with a shared bus, such as the Sega Genesis (68000 + Z80)
Create a “god object” that contains every component, and pass the god object to every single function instead of trying to encapsulate anything using methods

I didn’t really like either of these approaches. Instead, I did pretty much the same thing I did with my last few CPU emulators (Z80 / 68000 / 65816): for each CPU execution, create a short-lived bus value that contains mutable references to all of the components that the CPU might need to mutate, and pass that in to the CPU as a parameter. The borrow checker is fine with this because the mutable references are all dropped immediately after the CPU finishes execution.

The bus struct looks like this (not currently complete as v2 is still WIP):

1
2
3
4
5
6
7
8


pub struct Bus<'a> {
    pub ppu: &'a mut Ppu,
    pub memory: &'a mut Memory,
    pub cartridge: &'a mut Cartridge,
    pub interrupt_registers: &'a mut InterruptRegisters,
    pub timer: &'a mut GbTimer,
    pub input_state: &'a mut InputState,
}

The CPU accesses it through a small handful of methods that delegate to the appropriate component (some omitted here):

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41


impl<'a> BusInterface for Bus<'a> {
    fn read(&mut self, address: u16) -> u8 {
        match address {
            0x0000..=0x7FFF => self.cartridge.read_rom(address),
            0x8000..=0x9FFF => self.ppu.read_vram(address),
            0xA000..=0xBFFF => self.cartridge.read_ram(address),
            0xC000..=0xFDFF => self.memory.read_main_ram(address),
            0xFE00..=0xFE9F => self.ppu.read_oam(address),
            // Unusable memory
            0xFEA0..=0xFEFF => 0xFF,
            0xFF00..=0xFF7F => self.read_io_register(address),
            0xFF80..=0xFFFE => self.memory.read_hram(address),
            0xFFFF => self.interrupt_registers.read_ie(),
        }
    }

    fn write(&mut self, address: u16, value: u8) {
        match address {
            // The CPU cannot actually write to ROM, but some cartridges have internal
            // registers that respond to writes to ROM addresses
            0x0000..=0x7FFF => self.cartridge.write_rom(address, value),
            0x8000..=0x9FFF => self.ppu.write_vram(address, value),
            0xA000..=0xBFFF => self.cartridge.write_ram(address, value),
            0xC000..=0xFDFF => self.memory.write_main_ram(address, value),
            0xFE00..=0xFE9F => self.ppu.write_oam(address, value),
            // Unusable memory
            0xFEA0..=0xFEFF => {}
            0xFF00..=0xFF7F => self.write_io_register(address, value),
            0xFF80..=0xFFFE => self.memory.write_hram(address, value),
            0xFFFF => self.interrupt_registers.write_ie(value),
        }
    }

    fn highest_priority_interrupt(&self) -> Option<InterruptType> {
        self.interrupt_registers.highest_priority_interrupt()
    }

    fn acknowledge_interrupt(&mut self, interrupt_type: InterruptType) {
        self.interrupt_registers.clear_flag(interrupt_type);
    }
}

The memory-mapped I/O registers also delegate instead of storing everything in a central struct:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30


impl<'a> Bus<'a> {
    fn read_io_register(&self, address: u16) -> u8 {
        match address & 0x7F {
            0x00 => self.input_state.read_joyp(),
            0x04 => self.timer.read_div(),
            0x05 => self.timer.read_tima(),
            0x06 => self.timer.read_tma(),
            0x07 => self.timer.read_tac(),
            0x0F => self.interrupt_registers.read_if(),
            0x40..=0x4B => self.ppu.read_register(address),
            _ => {
                log::warn!("read unimplemented I/O register at {address:04X}");
                0xFF
            }
        }
    }

    fn write_io_register(&mut self, address: u16, value: u8) {
        match address & 0x7F {
            0x00 => self.input_state.write_joyp(value),
            0x04 => self.timer.write_div(),
            0x05 => self.timer.write_tima(value),
            0x06 => self.timer.write_tma(value),
            0x07 => self.timer.write_tac(value),
            0x0F => self.interrupt_registers.write_if(value),
            0x40..=0x4B => self.ppu.write_register(address, value),
            _ => log::warn!("write unimplemented I/O register at {address:04X} value {value:02X}"),
        }
    }
}

Running the CPU from the outermost emulator struct looks like this:

1
2
3
4
5
6
7
8


self.cpu.execute_instruction(&mut Bus {
    ppu: &mut self.ppu,
    memory: &mut self.memory,
    cartridge: &mut self.cartridge,
    interrupt_registers: &mut self.interrupt_registers,
    timer: &mut self.timer,
    input_state: &mut self.input_state,
});

It does make the CPU code a little noisy due to needing to pass a &mut Bus around to every single function and method, but I personally prefer that to the alternatives I’ve seen. Certainly not claiming there’s not a better way though!

Memory Cycle Accuracy

My first emulator only had instruction-level accuracy. What that means is that for the core emulation loop I execute a single CPU instruction, record how many clock cycles the instruction took, and then run the rest of the components by that many clock cycles. This turns out to be good enough for almost every commercially released Game Boy and Game Boy Color game, but there are a few that don’t work (e.g. Pinball Deluxe), and some demoscene stuff doesn’t work properly.

For round two, I wanted to achieve at least M-cycle-level accuracy, where all components run after every CPU memory access rather than after every CPU instruction. This turned out to be pretty easy to do with how I handled the bus!

I omitted one thing from the Bus read and write methods above, which is that they actually tick the other components in addition to performing the CPU memory access:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15


impl<'a> BusInterface for Bus<'a> {
    fn read(&mut self, address: u16) -> u8 {
        self.tick_components();
        // do memory read
    }

    fn write(&mut self, address: u16, value: u8) {
        self.tick_components();
        // do memory write
    }

    fn idle(&mut self) {
        self.tick_components();
    }
}

The idle() method is necessary because some CPU instructions have “internal” cycles where the instruction does not perform a memory access. For example, RET (from the Game Boy Complete Technical Reference):

RET Timing

This turned out to work pretty much perfectly, and the v2 passes the mem_timing test ROMs which validate sub-instruction timing of memory accesses (and which v1 had no hope of ever passing):

Mem Timing Passed

Oh Yes, There Were Bugs

Naturally, with a rewrite comes bugs. This is what my first attempt at running blargg’s CPU instructions test ROM looked like:

So Many Failures

Well hm! 02 is the interrupts test which is supposed to be one of the harder ones to pass, and yet my CPU passed that one and failed every other test? What?? And how did the test ROM manage to even render any graphical output with such a horribly broken emulated CPU?

So, most of these tests work based on checksums. They initialize CPU and memory to a known state, execute a routine containing the instruction to be tested, and then checksum the final state and compare that to the checksum that would be generated on actual hardware. It turns out that the interrupts test doesn’t use this checksum routine, and neither do the first 5 tests in 01 (“special” instructions).

Of course, the bug was in an instruction that is used in the checksum routine’s inner loop, specifically RRA (rotate accumulator right through carry). Most instructions were implemented correctly, but the test routine was generating the wrong checksum because of the RRA bug. Oops. The bug was at least easy to spot once I knew which instruction was problematic:

1
2
3
4
5


pub(super) fn rra(&mut self) {
    let carry = self.registers.a.bit(0);
    self.registers.a = (self.registers.a >> 1) | (u8::from(carry) << 7);
    self.registers.f = Flags { zero: false, subtract: false, half_carry: false, carry };
}

Apparently my brain was just not functioning when I wrote this, because I was rotating in using the new carry rather than the existing carry flag. RLA had the same bug.

After that, almost everything passed! I had 3 remaining bugged instructions that were easy to identify once the test ROM’s checksum routine worked properly:

BIT n, r was setting the Z flag exactly opposite what it should have been
ADD SP, e was setting the H and C flags based on the upper byte addition (as ADD HL, rr does) rather than the lower byte addition
LD HL, SP+e had the same bug with the H and C flags

I fixed those and all passed:

CPU Instrs Passed

Continuing On

Now back to work on the PPU, which currently only renders the background layer….