I recently added 32X support to my Sega Genesis emulator. While I’d say it was definitely easier than Sega CD overall, it wasn’t without difficulty, and much like Sega CD there’s very little public documentation aside from Sega’s poorly translated official docs from the 90s. This post covers some of the issues I ran into: some were dumb oversights, some were long-standing bugs in my Genesis code (mainly my 68000 core), and some were legitimately tricky to figure out.
Learning to Read
Knuckles’ Chaotix was the first 32X game that I got working. After implementing all of the SH-2 instructions and 32X features that I saw it using, it was pretty much fully playable except for one really noticeable bug: in the special stages, the player sprite wasn’t correctly cycling through its frames of animation. It would correctly change sprites based on the player’s speed and whether they were jumping, but the sprite wouldn’t animate aside from that.
This took me an embarrassingly long time to figure out.
The SH-2s in the 32X include a number of hardware units as part of their SH7604 package, including a division unit. Software writes the divisor to one register, writes the dividend to another register (or two registers for a 64-bit dividend), and then waits a certain number of cycles for the operation to complete. Once the division is finished, one register holds the 32-bit quotient and another register holds the 32-bit remainder.
Well, I misinterpreted this snippet from the SH7604 documentation:
I took this to mean that after a 32-bit division operation completes (initiated by writing to DVDNT), the DVDNTH register will simply hold the sign extension of the original 32-bit dividend. That is not correct!
The correct behavior is for DVDNTH to hold the 32-bit remainder, for both 64-bit division and 32-bit division operations. The sign extension sentence simply describes how 32-bit division is implemented in the chip: it sign extends the 32-bit dividend to 64 bits and then begins to execute a 64-bit division operation. Once the division completes, this register should hold the 32-bit remainder. That makes a lot of sense in retrospect, but hindsight is 20/20 of course.
As for why this broke the sprite animation, well…in the special stages, Chaotix determines which player sprite to render by doing (N % F) where N is a frame counter and F is the number of frames in the current animation. It performs this operation by using the division unit to perform a 32-bit division and then reading the remainder from DVDNTH. In my emulator, DVDNTH was always reading either 0 or -1 after 32-bit division operations, so every sprite was stuck on its first frame of animation. Oops.
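To make the difference concrete, here is a minimal Python sketch of a 32-bit division as described above. It is not the emulator's code: the function names are made up for illustration, the quotient is assumed to truncate toward zero, and overflow and divide-by-zero handling are left out.

```python
MASK32 = 0xFFFFFFFF

def to_signed32(value):
    """Interpret a 32-bit register value as a signed integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value

def divide_32(dividend, divisor):
    """Return (DVDNTH, DVDNTL) after a 32-bit division: (remainder, quotient)."""
    n = to_signed32(dividend)
    d = to_signed32(divisor)
    if d == 0:
        raise ZeroDivisionError("overflow/divide-by-zero handling is not modeled")
    # Assumption: the quotient truncates toward zero.
    quotient = abs(n) // abs(d)
    if (n < 0) != (d < 0):
        quotient = -quotient
    remainder = n - quotient * d
    return remainder & MASK32, quotient & MASK32

def buggy_dvdnth(dividend):
    """The misreading: DVDNTH holds only the sign extension of the dividend."""
    return MASK32 if to_signed32(dividend) < 0 else 0
```

With a frame counter of 13 and a 4-frame animation, `divide_32(13, 4)` leaves a remainder of 1 in DVDNTH, while the misread version always reports 0 for a non-negative dividend, which is why the sprite never left its first frame.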
Undocumented Mirrors
Virtua Fighter was the second 32X game that I got working. I immediately saw it doing something weird: it reads from $FFFFFF18 and $FFFFFF1C, which according to the SH7604 documentation are unmapped addresses. Hmm. Stubbing them to return 0 doesn’t work: the game then renders no 3D graphics in-game.
These two addresses come immediately after the division unit registers, which are all mapped to $FFFFFF00-$FFFFFF17. I figured I’d try mirroring these addresses to the nearest division unit registers by masking out A3 so that $FFFFFF18 mirrors $FFFFFF10 (DVDNTH, 32-bit remainder on reads) and $FFFFFF1C mirrors $FFFFFF14 (DVDNTL, 32-bit quotient on reads). This seemed to work, surprisingly enough.
Not sure why the game uses these addresses instead of the standard ones, but anyway…
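The mirroring boils down to clearing A3 for the two undocumented addresses. A small Python sketch follows; the function name and the exact range check are assumptions made for illustration, and only the two addresses the game reads are known to behave this way.

```python
def canonical_divu_address(addr):
    """Map the undocumented $FFFFFF18-$FFFFFF1F range onto $FFFFFF10-$FFFFFF17."""
    if 0xFFFFFF18 <= addr <= 0xFFFFFF1F:
        return addr & 0xFFFFFFF7  # clear A3
    return addr
```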
Buffer Swap
Another Virtua Fighter problem was that once it got in-game, it would constantly flicker between frames that looked correct and frames that looked like this:
Digging into differences between the two frame buffers, I noticed that the two line tables were very different. It looked like the game was writing the correct pixel data into both frame buffers but one of them had a bad line table. The line table is what tells the VDP where the pixel data for each scanline is located in the frame buffer, so the VDP was displaying garbage while rendering from the buffer with the bad line table.
This thankfully did not take too long to figure out. I started logging where line table writes were occurring and noticed that the game tried to write to both line tables during a screen transition, but in my emulator it wrote to the same frame buffer twice. It tried to swap the frame buffers in between the line table writes, but the frame buffer swap hadn’t taken effect yet, which was incorrect.
Normally, games can only swap the frame buffers during vertical blanking, when the VDP is not actively using the front buffer. Games can request a frame buffer swap at any time, but if a swap is requested during active display, it won’t take effect until the start of the next VBlank period (and there’s a register bit games can poll to see when this happens).
Turns out there was an exception I missed: when the 32X VDP is in “blank” mode (in other words off), games can swap the frame buffers at any time because the VDP is never using either one of them. Allowing frame buffer swaps to take effect immediately in blank mode fixed Virtua Fighter.
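A rough Python sketch of that rule, with hypothetical class and field names (this is not the emulator's actual structure):

```python
class FrameBufferControl:
    def __init__(self):
        self.displayed = 0      # buffer the VDP currently reads from
        self.requested = 0      # buffer most recently requested by software
        self.blank_mode = False
        self.in_vblank = False

    def request_swap(self, select):
        self.requested = select & 1
        # The swap applies immediately only if the VDP is not using the buffers.
        if self.in_vblank or self.blank_mode:
            self.displayed = self.requested

    def enter_vblank(self):
        self.in_vblank = True
        # A swap requested during active display takes effect here.
        self.displayed = self.requested

    def leave_vblank(self):
        self.in_vblank = False
```

Software polling for swap completion would compare `displayed` against `requested`.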
First In First Out
Games can transfer data from the Genesis hardware to 32X SDRAM through a fairly convoluted DMA process, which is necessary because the 68000 can’t access SDRAM directly. For the most basic usage, the process goes like this (not always in this order):
- 68000 triggers a command interrupt for one of the SH-2s and gives it whatever information it needs to start the DMA on the 32X side, usually via the 32X communication ports
- 68000 writes to 32X registers to enable and configure the DMA, and then it begins to write data words to a FIFO port
- In parallel, the SH-2 configures one of its DMA channels to read from the other end of the FIFO port and to write to SDRAM
- 68000 continues to write data words to the FIFO port until the DMA is complete
Games basically always use the SH7604’s builtin DMA controller to handle this on the 32X side because it executes in parallel to the SH-2, so the SH-2 can still do useful work while the DMA controller is transferring data. The SH-2’s DREQ0 line is connected to the FIFO so that the DMA controller will automatically start and stop based on when data is available (assuming the DMA is using DMA channel 0).
I bring this up now because of Virtua Racing Deluxe, which booted but then crashed after the title screen.
Looking over 32X register accesses, I noticed something odd: there was a mismatch between 68000 FIFO writes and SH-2 FIFO reads! In the first DMA exchange after the title screen, the two CPUs set up a DMA to transfer 64 words, but then the 68000 wrote 65 words to the FIFO port. The 65th word wasn’t read by the SH-2 until the start of the next DMA exchange, which seemed wrong.
This edge case is actually covered in Sega’s official documentation, I just missed it the first time around:
I think this poorly translated paragraph is saying that the DREQ length counter on the 68000 side should decrement on every write to the FIFO port, and the DMA automatically goes inactive when it decrements to 0. Writes to the FIFO port while no DMA is active are discarded, so that 65th word write shouldn’t go anywhere. This fixed Virtua Racing.
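Under that reading, the 68000 side of the FIFO can be sketched like this in Python. The names are hypothetical, and FIFO capacity and full/empty flags are ignored to keep the example short.

```python
from collections import deque

class DreqFifo:
    def __init__(self):
        self.length = 0       # DREQ length counter, in words
        self.active = False
        self.fifo = deque()   # words waiting to be read by the SH-2 side

    def start(self, length):
        self.length = length
        self.active = length > 0

    def m68k_write(self, word):
        """Return True if the word was accepted, False if it was discarded."""
        if not self.active:
            return False
        self.fifo.append(word & 0xFFFF)
        self.length -= 1
        if self.length == 0:
            self.active = False  # DMA goes inactive on its own
        return True
```

With a 64-word transfer configured, the 65th write returns `False` and never reaches the SH-2.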
Division by Zero
Next up is After Burner Complete, which froze partway through the title screen animation:
It would go in-game if I skipped the animation, but then it would freeze partway through the first stage.
This one was easy because it tripped some of my error logging: this game regularly divides by zero on the 68000! I don’t know why, but it does.
The 68000 has two division instructions: DIVU for unsigned division and DIVS for signed division. They both trigger a divide by zero exception when the divisor is zero, and I wasn’t handling that correctly. I was triggering the exception, but I was pushing the wrong PC value onto the stack, so the game returned to the wrong instruction after the exception and then got stuck. The PC pushed onto the stack should point to the instruction after the DIVU/DIVS that triggered the exception, not the DIVU/DIVS itself:
The next page specifically mentions DIVU/DIVS divide by zero exceptions as being in this exception category.
This bug had been in my 68000 core since I first wrote it, but this was the first time I saw a game actually dividing by zero (that wasn’t caused by a different bug anyway).
As a sidenote, After Burner Complete also depends on the SH7604 user break address registers being R/W. It doesn’t use any of the user break functionality, but the slave SH-2 (ab)uses the break address registers to store audio processing state, and the PWM audio won’t play correctly if those registers aren’t R/W.
Frame Buffer RAM Refresh
Metal Head booted but froze after the SEGA splash screen. It was stuck in an infinite loop waiting for the 32X VDP’s FEN bit to read 1. The FEN bit should only read 1 when frame buffer RAM access is blocked, and that should only happen while an auto fill operation is running, right?
Well…no. The FEN bit also reads 1 for about 40 SH-2 cycles out of every scanline for frame buffer DRAM refresh:
The CPUs are allowed to access the frame buffer while FEN=1 for DRAM refresh, they’ll just stall until the refresh is complete. Metal Head doesn’t depend on this stalling but it does depend on the FEN bit intermittently reading 1.
The game works now.
Off to the Races
Mortal Kombat II got in-game fine, but it would often render glitched frames that looked like this:
This one was tricky to figure out. It’s caused by a race condition between the master SH-2 and the 68000.
When the master SH-2 receives the VBlank interrupt, it immediately starts drawing the next frame. It first zeroes out the frame buffer using a series of VDP auto fills and then starts drawing pixels.
In parallel, when the 68000 receives the VBlank interrupt, it starts reading updated I/O state including current controller inputs. Once it has everything in order, it triggers a command interrupt for the master SH-2 and sends over the new I/O state via 32X DMA.
The problem occurs when the master SH-2 receives updated I/O state while it’s already partway through drawing the next frame and it sees that controller inputs have changed since the last frame. It will get very confused and apparently restart drawing the frame, but at a different vertical position.
The trick (I’m pretty sure) is VDP auto fill timing. Games aren’t supposed to access the frame buffer while an auto fill is in progress; they’re supposed to poll the VDP’s FEN bit to know when the auto fill finishes. Mortal Kombat II does this like it’s supposed to, and I believe the game expects the master SH-2 to still be zeroing out the frame buffer using auto fills when it receives the command interrupt from the 68000. If it hasn’t yet started drawing pixels then it won’t get confused if controller inputs have changed since the last frame.
Sega’s documentation has this to say on auto fill timing:
To be honest, I’m really not sure if this is supposed to be (7 + 3 x length) Mclk cycles or (7 ÷ 3 x length) Sclk cycles. Either timing will fix Mortal Kombat II, and I didn’t find any other games that depend on auto fill timing (beyond it not being too slow). Assuming it’s an Mclk timing is probably better because that’s the shorter of the two possible timings.
There’s a Game That Depends on CPU Cache!?
Like other CPUs of this era (and since), the SH-2 has a CPU cache to speed up instruction and data fetches when a program is repeatedly reading from the same memory addresses.
SH-2’s cache is a shared instruction/data cache with 4KB of cache RAM. Cache entries are 4-way set associative and use a pseudo-LRU algorithm for cache replacement. In other words, specific address bits are used to select a cache entry (A9-4 for SH-2), and each cache entry holds cache lines for up to 4 different addresses. These four sub-entries within each cache entry are called ways.
(The cache can also be configured to a 2-way mode, where each cache entry holds cache lines for up to 2 different addresses - this frees up the other 2KB of cache for the program to use as very fast RAM. That’s not relevant for this issue though.)
Programs can use the highest address bits to control whether memory reads/writes use the cache. If A31-29 are all clear (e.g. $06012345), the memory access will use cache. If A29 is set (e.g. $26012345), the memory access will bypass the cache.
On reads from cached addresses, cache hits will mark the hit way as most recently used within its cache entry, and cache misses will replace the least recently used way after fetching the cache line from memory.
The cache is write-through, which is important for this issue. Cache hits on writes will update the cache entry in addition to writing to memory. Cache misses on writes will not do anything to the cache.
Now, with that background out of the way, I was kind of assuming that no 32X games depend on SH-2 CPU cache for correctness. There aren’t that many 32X games, and there’s no way older (and faster) 32X emulators emulated the SH-2 cache, right?
Well, enter WWF Raw.
All graphics that are supposed to come from the 32X VDP were missing until you went in-game. After a bit of investigation, this was because one of the SH-2s was filling 32X palette RAM with all zeroes instead of the 256-color palette that it was supposed to write, so the 32X VDP was rendering solid black frames until the SH-2 eventually rewrote palette RAM to prepare for rendering in-game graphics.
I looked into the routine where it was writing all these zeroes to palette RAM and discovered the issue pretty quickly: it writes values to cartridge ROM addresses (!), and then it later reads back those values and writes them to palette RAM. Those addresses in ROM are all zeroes, so in my emulator it was writing all zeroes to palette RAM.
On actual hardware, this happens to work the way the game expects it to because the SH-2 has a write-through CPU cache. The writes to ROM addresses will hit in CPU cache and update the cache entries, and then subsequent reads will fetch the values from CPU cache instead of going to cartridge ROM.
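Here is a toy Python model of why that works. It is only a sketch: it handles single bytes, takes ROM offsets rather than full SH-2 addresses, uses a plain least-recently-used list in place of the real replacement bits, and assumes write hits do not change the replacement order.

```python
LINE_SIZE = 16    # bytes per cache line
NUM_ENTRIES = 64  # selected by address bits A9-4
NUM_WAYS = 4

class WriteThroughCache:
    def __init__(self, rom):
        self.rom = rom  # read-only backing store (bytes)
        # Each entry holds up to NUM_WAYS (tag, line) pairs, least recently used first.
        self.entries = [[] for _ in range(NUM_ENTRIES)]

    def _find(self, addr):
        entry = self.entries[(addr >> 4) & 0x3F]
        tag = addr >> 10
        for way in entry:
            if way[0] == tag:
                return entry, way
        return entry, None

    def read_byte(self, addr):
        entry, way = self._find(addr)
        if way is None:
            # Miss: fetch the whole line from ROM, evicting the LRU way if full.
            base = addr & ~(LINE_SIZE - 1)
            way = (addr >> 10, bytearray(self.rom[base:base + LINE_SIZE]))
            if len(entry) == NUM_WAYS:
                entry.pop(0)
        else:
            entry.remove(way)
        entry.append(way)  # mark as most recently used
        return way[1][addr & (LINE_SIZE - 1)]

    def write_byte(self, addr, value):
        entry, way = self._find(addr)
        if way is not None:
            # Write hit: the cached copy is updated.
            way[1][addr & (LINE_SIZE - 1)] = value & 0xFF
        # The write also goes through to memory, but ROM ignores it.
```

A read, a write, and a second read of the same ROM address return the written value on the second read even though ROM itself never changed. The value silently disappears once the line is evicted.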
Now, if your goal is only to get the game working, there are much simpler ways to fix this than fully emulating the CPU cache. You could allow the game to write to cartridge ROM, though that risks breaking badly programmed games that make stray writes to ROM addresses. A more robust solution would be to emulate a pseudo-cache: maintain an array in memory parallel to ROM that is used for cached reads/writes, and reinitialize it to the contents of ROM any time the game purges the CPU cache. Basically, provide a data cache that covers all of ROM where entries never expire unless they’re explicitly purged. That would fix WWF Raw and probably wouldn’t break anything else.
I didn’t want to settle for a partial mostly-works solution, and I also wanted cache emulation anyway to make it possible to semi-accurately emulate memory access timings (which significantly affect SH-2 speed in some games), so I went and implemented the CPU cache as it’s described in the SH7604 manual. Thankfully it is very thorough and also much better translated than Sega’s documentation. WWF Raw is fixed.
Although…
Separate Buses
Some time later I re-tested Metal Head and discovered that it no longer worked. The in-game graphics were extremely wrong:
Since I knew it was working at one point, I used the handy git bisect to figure out which change broke it, and it pointed to the commit where I implemented SH-2 CPU cache. Welp.
Looking over logs, I noticed that it fell apart shortly after a 68000-to-32X DMA. Comparing to the earlier version where it worked correctly, I noticed that the SH-2’s DMA controller was reading different values out of the DMA FIFO: in the earlier version it was reading a number of different values, while in the bugged version it was only reading a single value repeatedly. …Ohh.
The SH-2 was configuring its DMA controller to read from the cached FIFO address ($00004012) instead of the uncached address ($20004012). However, the DMA controller can’t even access the CPU cache! It sits on a different level of the SH-2’s internal bus from the one that has the cache on it. All DMA reads/writes bypass cache regardless of what A31-29 are set to in the DMA source/destination address.
Changing all DMA controller memory accesses to bypass CPU cache fixed Metal Head (again).
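In code terms, the decision of whether an access goes through the cache depends on who is making it, not just on the address. A small Python sketch with hypothetical function names:

```python
def cpu_access_uses_cache(addr):
    """CPU accesses use the cache only when A31-29 are all clear."""
    return (addr >> 29) == 0

def dma_access_uses_cache(addr):
    """DMA controller accesses never go through the cache."""
    return False

def physical_address(addr):
    """Strip A31-29 to get the address actually placed on the external bus."""
    return addr & 0x1FFFFFFF
```

Both `$00004012` and `$20004012` resolve to the same FIFO register; only a CPU access to the first one would ever consult the cache.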
Not Every Unused Opcode Is Illegal
Now to Zaxxon’s Motherbase 2000. This crashed almost immediately due to executing an illegal opcode on the 68000, specifically 0xFF18. If you’ve programmed the 68000 before, you might have already spotted the bug.
The 68000 normally handles illegal opcodes using exception vector number 4 ($000010), but it behaves differently if the highest 4 bits of the opcode are 1010 (0xA) or 1111 (0xF). 1010 illegal opcodes use vector number 10 ($000028) and 1111 illegal opcodes use vector number 11 ($00002C).
68000 documentation refers to these as “line emulator” exceptions. They were intended to reserve ranges of opcodes for use in future 68000-family CPUs, so that software written for later CPUs could behave reasonably if a new opcode was executed on a 68000 that doesn’t support it. I’m not sure exactly why this game uses them, but it depends on handling the line emulator exceptions correctly (at least the line 1111 exceptions).
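The vector selection is simple once you know about it. A Python sketch, assuming the decoder has already decided that the opcode is not a valid instruction (the function names are made up for this example):

```python
def illegal_opcode_vector(opcode):
    """Pick the exception vector number for an unimplemented 16-bit opcode."""
    top_nibble = (opcode >> 12) & 0xF
    if top_nibble == 0xA:
        return 10  # line 1010 emulator
    if top_nibble == 0xF:
        return 11  # line 1111 emulator
    return 4       # ordinary illegal instruction

def vector_address(vector_number):
    """Each vector table entry is a 4-byte handler address."""
    return vector_number * 4
```

For the opcode that tripped up Zaxxon, `illegal_opcode_vector(0xFF18)` is 11, so the handler address is fetched from `$00002C`.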
This is another bug that had been in my 68000 core since I first wrote it. I’m kind of surprised I never saw any Genesis or Sega CD games using this feature. Anyway, implementing this got Zaxxon to boot, but…
Stack Abuse
…Once it got in-game, pretty much nothing worked correctly. The 32X VDP only rendered a single dot in the middle of the screen, no 3D graphics. The background also wasn’t scrolling like it’s supposed to.
This took me the longest to solve out of all the issues mentioned here. Something was going horribly wrong during the game’s boot process, but I couldn’t figure out what. I eventually resorted to the laborious process of comparing CPU traces with Ares, which runs this game correctly.
Even comparing CPU traces to a known-working emulator didn’t help me solve this immediately because the issue seemed to be related to specific memory reads not returning what the game expected. After lots of manual tracing through both CPU and memory operations, I finally found the culprit:
This game pushes the stack pointer onto the stack! Why, I do not know, but it does - and it depends on the correct value getting pushed.
My original implementation of that instruction was doing this:
This does the wrong thing when m==n because the value written should be the value before the decrement, not after. This works correctly:
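As an illustration (not the emulator's real code), here are both behaviors in Python for a pre-decrement longword store of Rm to the address in Rn. Registers are a plain list, and memory is a dict from address to 32-bit value, both chosen purely for the sketch.

```python
MASK32 = 0xFFFFFFFF

def store_predec_buggy(regs, mem, m, n):
    # Decrement first, then read Rm: wrong when m == n.
    regs[n] = (regs[n] - 4) & MASK32
    mem[regs[n]] = regs[m]

def store_predec_fixed(regs, mem, m, n):
    # Latch the source value before touching Rn.
    value = regs[m]
    regs[n] = (regs[n] - 4) & MASK32
    mem[regs[n]] = value
```

With R15 = `$06001000`, pushing R15 onto its own stack should store `$06001000` at `$06000FFC`. The buggy version stores `$06000FFC` instead.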
With that 2-line change, the game now works fine.
Closely Synchronized
While that last bug took the longest to find, this next one might have been the most annoying to fix.
The game in question is Brutal Unleashed: Above the Claw, which Digital Foundry described as being an “absolutely horrible” game and easily the worst fighting game on the 32X. …Worst out of only 5 games, but still.
The problem was that the game would almost always freeze at the end of every fight, right here (after selecting one of the options):
From CPU logging, it was obvious that the master SH-2 was stuck in an infinite loop polling one of the 32X communication ports, but the “why” took a bit more digging. There weren’t any communication port writes leading up to the freeze.
There actually hadn’t been any communication port writes since the start of the fight! The last communication port activity was a few writes from the slave SH-2, to the same communication port that the master SH-2 was polling. In fact, the second-to-last write from the slave SH-2 contained the exact value that the master SH-2 was waiting for. …Oh no.
Yep, this was a CPU synchronization issue. I confirmed this by testing what happened if I ran the two SH-2s completely in lockstep. This tanked emulator performance but fixed the bug.
What happens is that the master SH-2 writes to one of the communication ports, the slave SH-2 sees that write, and then after a bit the slave SH-2 writes to one of the communication ports twice in very quick succession. In parallel, the master SH-2 is polling that communication port, and it needs to see the first write before the slave SH-2 overwrites it with the second value. If the master SH-2 ever misses that first write then it desyncs and the game will freeze after the fight. (This also immediately breaks the background scrolling, since the master SH-2 renders the background during fights.)
I wanted a way to fix this that didn’t require running the two SH-2s in lockstep because that is terrible for performance. The SH-2s execute many more instructions per second than the 68000 or the Z80, and those context switches really add up.
After a few failed attempts at trying to catch up one SH-2 when the other one accessed one of the communication ports, I ended up abandoning that idea and instead tried serializing communication port accesses by time. This worked!
Essentially, whenever one of the two SH-2s reads from or writes to a communication port, it passes along the current time from that CPU’s perspective. The precise “time” value is simply the number of emulated CPU cycles executed since power-on.
Communication port writes are tagged with the CPU’s current cycle count and pushed into a list of writes, sorted by cycle count. On reads, the CPU will get the latest value with a cycle count that is less than or equal to the CPU’s current cycle count. Older writes are pruned off once they’re no longer relevant.
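A stripped-down Python sketch of the idea for a single port. The class and method names are hypothetical, and a real implementation would need to decide when pruning is safe based on how far both CPUs have run.

```python
import bisect

class TimedPort:
    def __init__(self, initial=0):
        # Parallel lists of write times and values, sorted by cycle count.
        self.times = [0]
        self.values = [initial]

    def write(self, cycle, value):
        i = bisect.bisect_right(self.times, cycle)
        self.times.insert(i, cycle)
        self.values.insert(i, value & 0xFFFF)

    def read(self, cycle):
        # Latest write at or before the reader's current cycle count.
        i = bisect.bisect_right(self.times, cycle) - 1
        return self.values[max(i, 0)]

    def prune(self, min_cycle):
        # Drop writes that no reader at or after min_cycle can observe.
        i = bisect.bisect_right(self.times, min_cycle) - 1
        if i > 0:
            del self.times[:i]
            del self.values[:i]
```

If the slave writes at cycles 100 and 104 while running ahead, a master read at cycle 102 still sees the first value instead of the second.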
My implementation of this idea is incredibly janky, but it works well enough that I can have the SH-2s execute batches of 15 instructions at once without causing Brutal Unleashed to freeze. Larger instruction batch sizes are still liable to cause freezing, unfortunately, but I was also seeing diminishing returns from increasing the batch size much more so I didn’t look into it too much.
Multiple Resolutions
The last visual bug that I encountered was in NFL Quarterback Club, where the menus initially looked like this:
What’s happening here is that the game has the Genesis VDP and the 32X VDP rendering in different horizontal resolutions. Every other 32X game generally leaves the Genesis VDP in H40 mode (320px) because the 32X VDP can only render in 320px horizontally, and you’d always want the two VDPs to render in the same resolution, right? Well, not this game, which sets the Genesis VDP to H32 mode (256px) in its menus while still using the 32X VDP to render some graphics. It expects the 320x224 32X VDP output to overlay the 256x224 Genesis VDP output. Ouch.
I initially tried to make this work by upscaling the 256x224 Genesis output to 320x224 before compositing the two frames, but the final image looked pretty bad no matter what I tried in terms of filtering and blending - it was really obvious that some of the image was being poorly upscaled. This was ultimately expected given that 320 is not an integer multiple of 256, but I was kind of hoping the easy solution would work.
What ended up working decently was to render the final video frame in 1280x224 if the two VDPs are rendering in different resolutions, with 1280 being the least common multiple of 256 and 320.
First, I expand the Genesis VDP frame from 256x224 to 1280x224 by repeating every pixel 5 times horizontally. Then, I composite the 32X VDP frame onto that 1280x224 frame while logically repeating every pixel 4 times horizontally. This avoids any of the artifacts associated with non-integer upscaling.
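For one scanline, that looks roughly like the following Python sketch. Pixels are arbitrary values, `None` stands for a transparent 32X pixel, and priority between the two VDPs is deliberately not modeled.

```python
def composite_row(genesis_row, s32x_row):
    """Combine a 256-pixel Genesis row and a 320-pixel 32X row into 1280 pixels."""
    # Repeat every Genesis pixel 5 times: 256 * 5 = 1280.
    out = [pixel for pixel in genesis_row for _ in range(5)]
    # Overlay every opaque 32X pixel 4 times: 320 * 4 = 1280.
    for x, pixel in enumerate(s32x_row):
        if pixel is not None:
            out[4 * x:4 * x + 4] = [pixel] * 4
    return out
```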
The major downside to this approach is that upscaling the horizontal resolution like this makes the image really sharp, especially if filters/shaders are assuming that the output frame’s resolution matches the console’s native resolution. This game only does this in menus though, so you still get 320x224 output during gameplay as long as you don’t always render in 1280x224.
PWM Is Not PCM
This final bug is an audio bug that was caused by me misunderstanding how the 32X PWM chip works.
Games configure the PWM chip by setting a 12-bit “cycle register” value that determines the pulse period, and thus the sample rate. The sample rate is determined by the following formula:
rate = Sclk / ((cycle_register - 1) & 0xFFF)
Where Sclk is the SH-2 clock rate of ~23.01 MHz. The vast majority of games use a cycle register value of 1047 which corresponds to a sample rate of about 22 KHz.
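Plugging numbers into the formula, as a quick Python check. The clock constant is the approximate figure quoted above, and rejecting a zero period is a choice made for this sketch rather than documented hardware behavior.

```python
SCLK_HZ = 23.01e6  # approximate SH-2 clock rate

def pwm_sample_rate(cycle_register):
    period = (cycle_register - 1) & 0xFFF
    if period == 0:
        raise ValueError("a period of zero is not modeled here")
    return SCLK_HZ / period
```

A cycle register of 1047 gives a period of 1046 Sclk cycles, or just under 22,000 samples per second.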
Games play audio through the PWM chip by having one of the CPUs (usually the slave SH-2) push 12-bit pulse width samples to the PWM chip’s FIFO ports. As with the cycle register, the effective sample value is one less than the value pushed into the FIFO. I was interpreting these as PCM samples on a scale from 0 to 4095 and then remapping them to the range [-1, 1].
This produced audio that sounded basically correct, but it was way too quiet, so I added a volume multiplier during final audio mixing:
This worked pretty well…for games that set the sample rate to 22 KHz.
Here are some sound effects from BC Racers, which are supposed to sound like a character eating meat (possible volume warning):
This game uses the PWM chip for sound effects, and it unusually uses different sample rates for different sound effects. The meat-eating sound effect sets the cycle register to 800, which is a sample rate of about 28.8 KHz. The sound effect is there under the popping but it’s extremely soft - it’s barely audible without using a much larger volume multiplier, and that would make other games sound way too loud.
The problem here is that I completely misunderstood what the 12-bit pulse width samples represent, and I happened to cobble something together that only really worked for 22 KHz samples.
The sample doesn’t represent a percentage on a scale from 0 to 4095 - it represents the number of Sclk cycles in each period that the wave should stay at maximum amplitude.
For example, if the cycle register is set to 1047 (sample rate 22 KHz), then only pulse width values from 1 to 1047 are meaningful. Any other value will constantly keep the wave at maximum amplitude. This is why I needed a volume multiplier in my original implementation - I was incorrectly scaling the 22 KHz samples as if they were in the range [0, 4095] instead of [0, 1046].
Fixed implementation:
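A Python sketch of the corrected conversion, based only on the description above: scale the pulse width against the current period instead of against 4095. The function name is hypothetical, and returning silence for a zero period is an assumption.

```python
def pwm_to_pcm(sample, cycle_register):
    """Convert a 12-bit pulse width sample to a PCM value in [-1, 1]."""
    period = (cycle_register - 1) & 0xFFF
    width = (sample - 1) & 0xFFF
    if period == 0:
        return 0.0  # assumption: treat a zero period as silence
    # Widths past the end of the period hold the wave at maximum amplitude.
    width = min(width, period)
    return 2.0 * width / period - 1.0
```

At a cycle register of 1047, a pushed value of 524 is the midpoint and maps to 0.0, while anything from 1047 up maps to 1.0.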
This completely removes the need for a volume multiplier on PWM audio output, which was only covering for the incorrect PWM-to-PCM conversion logic.
There’s still a ton of popping in BC Racers because it doesn’t properly transition between different sound effects (it does just about everything you’re not supposed to do with the PWM chip), but at least you can mostly hear the meat-eating sound effect now:
It still sounds quite bad, but it’s good to have this fixed for better-programmed games that use a sample rate other than 22 KHz. I could put in some hacks to reduce the popping, like this:
But it doesn’t feel right to me to do that, at least not without better understanding how actual hardware works. Some of those hacks directly contradict statements in Sega’s documentation.
That’s All
There were a few other issues I ran into but nothing that I thought deserved its own writeup. WWF Raw and Brutal Unleashed definitely get the sloppy programming awards, although BC Racers is no slouch in that department - the framerate is abysmal in addition to its audio issues.
While the 32X isn’t exactly beloved, it was kind of fun to emulate a system with a small enough library that you can feasibly test every single released game in a reasonable amount of time. The hardware also has way fewer (officially) undocumented edge cases than Sega CD does, which was nice. My 32X emulator’s performance is pretty poor (on my laptop it barely hits 2x speed while fast-forwarding), but other than that I’m pretty happy with how it turned out.