Rewriting My Game Boy Emulator: The Pixel FIFO

First of all, if you are writing your own Game Boy emulator, you don’t need to emulate the pixel FIFO! Almost nothing depends on it for correct rendering, it’s a lot more complex to implement than a scanline-based renderer, and it’s also significantly more difficult to debug than a scanline-based renderer. I don’t recommend emulating the FIFO unless one of your goals is to be as accurate to hardware as possible.

…Or if you’re interested in the challenge, which is why I did it. My first GB emulator had a pixel-based renderer but I made nearly zero attempt to accurately emulate mid-scanline rendering timings. For the rewrite, I wanted to see if I could do better.

Putting Pixels on the Screen

The Game Boy’s display is an LCD screen with a resolution of 160x144. The Game Boy PPU (picture processing unit) repeatedly draws frames onto the display, roughly 60 times per second. Each frame is drawn directly to the display, line by line. Internally the PPU cycles through 154 lines per frame, 144 lines of active display and 10 lines of VBlank (vertical blanking) during which the PPU is idle and the CPU is able to modify video memory in preparation for rendering the next frame.

The PPU clock speed is the same as the CPU clock speed, roughly 4.194 MHz. Each line takes exactly 456 cycles:

80 cycles to scan OAM (object attribute memory, i.e. the sprite table) in order to determine which sprites should display on the current line and where
A variable number of cycles to render a line of pixels to the LCD display, ranging between about 172 cycles and about 289 cycles
The remaining cycles are HBlank (horizontal blanking) during which the PPU is idle and the CPU can freely modify video memory
- The PPU provides both an HBlank interrupt and a status register so that the CPU does not have to guess when the PPU has reached HBlank

How is the rendering period a variable number of cycles rather than always 160 cycles, the same as the display width? Well, unlike with TV-based consoles such as the NES or the Sega Genesis, the Game Boy’s PPU can control the display’s clock rate because the display is part of the console. If the PPU isn’t ready to push a pixel to the display, it can simply tell the display to pause until it’s ready to render the next pixel.

And this is where things start to get complicated.

Contradictions Everywhere

Before going into any further detail, I feel the need to point out that much of the documentation on this topic is contradictory, and I’m honestly not completely sure how much of what I’m about to describe is really accurate to how hardware works. As an example, The Ultimate Game Boy Talk is overall an excellent overview of the Game Boy hardware, but it gets some major details wrong regarding the pixel FIFO queues: in particular, it claims that background and sprite pixels are merged into a single FIFO queue, when in reality the hardware has separate queues for background/window pixels and sprite pixels that don’t get merged until a pixel is actually rendered to the display (which does actually affect what pixel gets displayed in certain cases).

Now, this is understandable because there isn’t software that depends on getting it 100% right (as far as I know), but still worth noting I think.

Here are some of the sources I referenced, to varying degrees:

FIFO?

Just to be explicit, FIFO stands for first-in-first-out. It’s a type of queue that in software is commonly implemented using data structures such as VecDeque in Rust or ArrayDeque in Java. You sometimes see them implemented using linked lists, but linked list implementations tend to be vastly less performant than implementations that store a circular buffer in an array.

Hardware implementations like in the Game Boy commonly use shift registers.

One Layer at a Time

Let’s start by ignoring that sprites and the window exist, leaving only the background layer.

The PPU has an 8-pixel FIFO queue for background pixels that are soon to be displayed. When it is ready to render a pixel to the display, it simultaneously pops a pixel off the queue and tells the LCD display to advance 1 clock. If it is not ready to render, it tells the LCD to wait.

The PPU also has a pixel fetcher that can fetch a row of 8 pixels from a tile in VRAM, taking 6 cycles to do so: 2 cycles to look up the tile number in the background tile map, 2 cycles to read the low byte of the tile data, and 2 cycles to read the high byte of the tile data. It does not prefetch before rendering starts, so there is a 6-cycle delay at the start of the rendering period as the PPU fetches the first background tile. After that the pixel fetcher runs in parallel to drawing pixels on the screen, pausing for 2 cycles in between each fetch.

You would think this results in the rendering period taking 166 cycles minimum, but the PPU actually fetches the first tile twice, and it “renders” the first copy offscreen before pushing any pixels to the display. This adds another 8 cycles as the PPU pops these 8 pixels off the queue without clocking the LCD. This was probably done to make it easier to handle sprites and the window, which can both be positioned past the left edge of the screen.

This adds up to 174 cycles, not 172 as I mentioned above, and this is a point where documentation disagrees. Pan Docs says 172, while the kevtris doc says it should be 174 cycles but it’s actually 173.5 because the last cycle gets cut short by half a cycle due to a detail of how the display clock is generated. For the purposes of this post I’m going to ignore the half cycle detail and say it’s 174 cycles minimum.

One other thing to consider with the background is fine X scrolling. When the X scroll value is not a multiple of 8, rendering needs to start partway through a tile instead of starting at the left edge of the first tile. The PPU implements this by popping the first (SCX % 8) pixels out of the BG FIFO queue without pushing them to the display, which gets rendering to start at the correct pixel within the first tile. This extends the rendering time by (SCX % 8) cycles, so rendering only the background can take up to 181 cycles (when SCX % 8 is 7).

I thought for a bit about how to emulate this and settled on a state machine. It’s extremely overkill for doing just the background but it will make things a lot easier later when adding in sprites and the window.

Rust’s sum types and pattern matching make state machines pretty convenient to implement compared to some other commonly used languages. Here I define the state type:

1
2
3
4
5
6
7


#[derive(Debug, Clone, Copy)]
enum FifoState {
    // Fetching the offscreen first tile
    InitialBgFetch { dots_remaining: u8 },
    // Rendering a background tile
    RenderingBgTile { dots_remaining: u8, screen_x: u8, fetcher_x: u8 },
}

At the start of each line I initialize the state to InitialBgFetch { dots_remaining: 6 } to handle the 6-cycle delay for the offscreen tile fetch.

I use a VecDeque to store the pixels:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11


#[derive(Debug, Clone, Copy)]
struct BgPixel {
    color: u8,
}

#[derive(Debug, Clone)]
struct PixelFifo {
    bg: VecDeque<BgPixel>,
    y: u8,
    state: FifoState,
}

I then call this method once per PPU cycle during the rendering period:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15


impl PixelFifo {
    pub fn tick(&mut self, vram: &Vram, registers: &Registers, frame_buffer: &mut PpuFrameBuffer) {
        use FifoState::*;

        match self.state {
            InitialBgFetch { dots_remaining } => {
                // Decrement dots_remaining, and change state to RenderingBgTile if it decrements to 0
            }
            RenderingBgTile { dots_remaining, screen_x, fetcher_x } => {
                // Render a pixel to the frame buffer and increment screen_x, using dots_remaining to track when
                // to fetch the next tile and fetcher_x to track which tile to fetch
            }
        }
    }
}

I handle the offscreen tile and fine X scroll by starting screen_x at 0_u8.wrapping_sub(x_scroll % 8) instead of starting it at 0, and only rendering to the frame buffer when screen_x is between 8 and 167 (which is kind of hacky but avoids some annoying conditional logic). Rendering ends when screen_x reaches 168.

We have a background layer!

Super Mario Land Background Only

Zelda Background Only

Sprites

Displaying the background is a step forward, but almost no games are playable without sprite rendering!

First, sprites need to be pre-processed. The OAM scan before the rendering period determines which sprites should render on the current line, up to a max of 10 out of the 40 sprites in the sprite table. The overlap calculation is based purely on the current line number, the sprite’s Y position, and the sprite size (either 8x8 or 8x16, configured globally in the LCDC register). One quirk is that a sprite Y position of 0 is actually 16 pixels above the top edge of the screen, done so that sprites can display partially offscreen.

For rendering, sprites have their own 8-pixel FIFO, and background and sprite pixels get merged at display time to determine what color to display. The merging rules are:

If the sprite pixel is transparent (color ID == 0), always display the background pixel even if it is also transparent
If the sprite pixel is from a sprite with the BG-over-OBJ flag set and the background pixel is not transparent, display the background pixel
Otherwise, display the sprite pixel. To be complete, this covers two remaining cases:
- The sprite pixel is not transparent and it is from a sprite that does not have the BG-over-OBJ flag set
- The sprite pixel is not transparent, and it has the BG-over-OBJ flag set but the background pixel is transparent

In my code I refer to the BG-over-OBJ flag as “low priority” because I find that less confusing. By default sprites always display over the background, but the low priority flag allows non-transparent background pixels to display over sprites on a sprite-by-sprite basis. As an example use case, Metroid 2 sometimes makes the player sprite low priority so that foliage in the background layer will display in front of Samus rather than behind her.

If there are no sprites on a line then the sprite FIFO will contain only transparent pixels, causing background pixels to display for the entire line.

If there are sprites on the line then the PPU does a check at every pixel to see if there is a sprite that begins on that pixel, based on sprite X position. Similar to Y=0 being above the screen, sprites with X=0 have their left edge located 8 pixels to the left of the screen.

When the PPU encounters a sprite, it halts rendering until it can fetch the row of 8 pixels from that sprite that should display on the current line. Which row to fetch is a simple calculation based on scanline, sprite Y position, and whether the sprite has the vertical flip flag set.

The pixel fetcher is shared between background fetches and sprite fetches which makes the timing here tricky. If a background fetch is currently in progress, the sprite fetch must wait until the background fetch finishes…probably. Some documentation says that the first cycle of a sprite fetch can overlap the final cycle of a background fetch (which would add a net 5 cycles), while other documentation says that the sprite fetch always takes at least 6 cycles. The fetch cycles being able to overlap does not make sense to me, so I’m going to assume that sprite fetches always take 6 cycles but that a sprite fetch may need to wait for up to 5 cycles in order for a background fetch to finish. This means that every sprite fetched will add between 6 and 11 cycles to the rendering period while the PPU waits for the sprite fetch to complete.

Once a row of sprite pixels is fetched, the pixels are written into the sprite FIFO, but only in positions that currently contain a transparent pixel. (This is why sprites with a lower X coordinate always display over sprites with a higher X coordinate - that’s the order they’re processed in.) If the sprite’s horizontal flip flag is set then the pixel row is simply reversed before inserting into the FIFO.

Alright, hopefully this is simpler to implement than it was to understand!

I’ll add a new type of FIFO state to account for sprite fetch delays, while factoring out the RenderingBgTile fields into a separate struct (and adding a new field to track whether a sprite fetch was delayed because of the current BG fetch):

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18


#[derive(Debug, Clone, Copy)]
struct RenderingBgTileFields {
    dots_remaining: u8,
    screen_x: u8,
    fetcher_x: u8,
    // Whether or not a sprite fetch was delayed by a BG fetch in the current BG tile
    sprite_fetch_delayed: bool,
}

#[derive(Debug, Clone, Copy)]
enum FifoState {
    // Fetching the offscreen first tile
    InitialBgFetch { dots_remaining: u8 },
    // Rendering a background tile
    RenderingBgTile(RenderingBgTileFields),
    // Fetching a sprite tile
    SpriteFetch { dots_remaining: u8, previous_bg_fields: RenderingBgTileFields },
}

In the pixel FIFO itself, I’ll add new fields for the sprite FIFO and for the list of sprites on the current scanline, sorted by X coordinate:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26


#[derive(Debug, Clone, Copy)]
struct SpriteData {
    x: u8,
    y: u8,
    tile_number: u8,
    palette: u8,
    horizontal_flip: bool,
    vertical_flip: bool,
    low_priority: bool,
}

#[derive(Debug, Clone, Copy)]
struct SpritePixel {
    color: u8,
    palette: u8,
    low_priority: bool,
}

#[derive(Debug, Clone)]
struct PixelFifo {
    bg: VecDeque<BgPixel>,
    sprites: VecDeque<SpritePixel>,
    y: u8,
    state: FifoState,
    scanned_sprites: VecDeque<SpriteData>,
}

scanned_sprites is a VecDeque so that I can pop sprites off the front of the list once they’ve been fetched, which means I only ever need to check the first sprite in the list in order to determine if there’s a sprite that matches the current X coordinate.

I then added logic to the RenderingBgTile handler method to do just that, check if there’s a sprite at the current X coordinate. If there is, fetch the sprite tile, compute the sprite fetch delay based on the current state, and switch the state to SpriteFetch with the appropriate delay. If there was additional delay added from needing to wait for a background fetch to complete, I set the sprite_fetch_delayed flag so that no other sprites can trigger the background fetch wait delay until the next background fetch (in case there are multiple sprites that start within the same background tile).

The SpriteFetch state handler is very simple: decrement dots_remaining by 1, and if it decrements to 0 then restore the previous RenderingBgTile state from the start of the sprite fetch.

The merging logic is fairly straightforward:

1
2
3
4
5


let color = if sprite_pixel.color != 0 && (!sprite_pixel.low_priority || bg_pixel.color == 0) {
    registers.sprite_palettes[sprite_pixel.palette as usize][sprite_pixel.color as usize]
} else {
    registers.bg_palette[bg_pixel.color as usize]
};

(Plus some omitted logic to ensure that the background always shows white when the background is disabled)

After doing that (and fixing some off-by-one errors), we have sprites!

Super Mario Land Sprites

Zelda Sprites

Of course, as the Zelda screenshot makes evident, there’s one thing left.

The Window

The window is sort of a separate layer and sort of part of the background. Games can configure a window region by setting the coordinates of the top-left corner of the window, and when inside the window the PPU will render pixels from the window in place of the background. The window is not affected by scroll values which makes it very useful for things like status bars which should always render in the same position onscreen without scrolling. The window can also be configured to use a separate tile map from the background.

Now, the window in general has some unintuitive and glitchy behaviors, some of which you need to deal with even with a scanline-based renderer. The biggest example is that if the window is enabled early in the frame, disabled mid-frame, and then re-enabled later in the frame, the window starts rendering lines from right where it left off instead of counting the lines where it was disabled and skipping those. This behavior is often referred to as “splitting the window” and a number of games depend on it to render correctly.

There are also strange behaviors when the window’s X position is set to either 0 (7 pixels past the left edge of the screen) or 166 (the rightmost pixel of the screen). WX=0 seems to cause different glitchy behavior on different hardware, and WX=166 seems to cause the window to render for the entire next scanline with some glitchy X scrolling behavior (despite the fact that the window isn’t supposed to scroll).

Ignoring all of that, the simple case is actually quite simple to handle! When the window is in-range vertically and the PPU reaches the left edge of the window, it clears the background FIFO and begins to fetch window tiles in place of background tiles, which it then does for the rest of the scanline. This introduces a 6-cycle delay as the PPU pauses while the pixel fetcher fetches the first window tile, but after that everything works pretty much the same way, only fetching window tiles instead of background tiles.

I’m implementing the window trigger condition as both of the following being true (based on Pan Docs):

At some point earlier in the frame, WY was equal to the scanline while the window was enabled (checked at the start of every line)
- EDIT: Since originally writing this, I tested my implementation against the fairylake test ROM which revealed this check to be incorrect. Removing the “while the window was enabled” part fixed fairylake while also not breaking games that do strange mid-frame things with the window (e.g. Pokemon Gold/Silver)
While rendering a scanline, WX + 1 == screen X while the window is enabled (the +1 is to account for WX=0 being 7 pixels past the left edge of the screen rather than 8)

To implement this, I’ll add another field to the RenderingBgTile state tracking whether it’s currently rendering background tiles or window tiles:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11


#[derive(Debug, Clone, Copy)]
enum BgLayer {
    Background,
    Window { window_line: u8 },
}

#[derive(Debug, Clone, Copy)]
struct RenderingBgTileFields {
    bg_layer: BgLayer,
    ...
}

There’s a new state for the initial window tile fetch:

1
2
3
4
5
6


#[derive(Debug, Clone, Copy)]
enum FifoState {
    // Fetching the first window tile
    InitialWindowFetch { dots_remaining: u8, screen_x: u8, window_line: u8 },
    ...
}

And the pixel FIFO needs a bit of additional state to track whether the window Y condition has triggered this frame, along with the current window line:

1
2
3
4
5
6


#[derive(Debug, Clone)]
struct PixelFifo {
    window_y_triggered: bool,
    window_line_counter: u8,
    ...
}

The main logic changes are:

In the RenderingBgTile handler, if the window Y condition has triggered and WX + 1 == screen_x then clear the background FIFO, fetch window tile 0, switch state to InitialWindowFetch, and increment the window line counter
The InitialWindowFetch handler simply decrements dots_remaining, and when it decrements to 0 it switches back to RenderingBgTile but with bg_layer set to Window
Again in the RenderingBgTile handler, if bg_layer is Window instead of Background then fetch window tiles instead of background tiles, and fetch using the current window line instead of the current scanline

And we have a window!

Zelda Window

Acid Test

Just to validate that everything is working correctly, I ran the dmg-acid2 test ROM:

dmg-acid2

All good! Well, after fixing a bug related to sprites being disabled mid-scanline, anyway.

What Actually Depends on All This?

The most well-known game that depends on having a pixel-based renderer is Prehistorik Man, which makes mid-scanline background palette changes to render the large scrolling text on the title screen:

Prehistorik Man

Another example is the demo Is That a Demo in Your Pocket? which requires fairly precise mid-scanline timing to correctly handle one of its early effects, where it does mid-scanline writes that change the background tile data area from $8000-$8FFF in VRAM to $8800-$97FF. Here’s how my first emulator handles it:

Pocket Demo Wrong

v2 emulator:

Pocket Demo Fixed

Neither of these requires completely emulating the pixel FIFO however. I’m not aware of any software that requires accurately emulating sprite and window delays, for example. Hence my statement at the beginning that it’s really not necessary to emulate the pixel FIFO at a low level unless you’re aiming for very high accuracy!

Game Boy Color

There are a few differences in how the pixel FIFO works on the Game Boy Color, but they’re not massive - they mostly have to do with the new background tile attributes feature, the altered sprite priority behavior, and the altered behavior of the BG/window enabled flag in the LCDC register. I may cover the differences in more detail in a future post, as I want to get regular Game Boy fully working before touching GBC.

Next Time

I will probably do a post on the APU (audio processing unit), though it won’t go into as much detail as I did here. I don’t expect I’ll change my APU implementation as radically as I changed my PPU implementation.

Rewriting My Game Boy Emulator: The Pixel FIFO

See Also: