SNES Coprocessors: S-DD1

There are two SNES coprocessors that contain hardware data decompression chips: S-DD1 and SPC7110 (no relation to the SPC700 CPU as far as I know). These chips allow the SNES CPU to read compressed data from ROM without the CPU needing to do any decompression work - the hardware decompresses the data on-the-fly while the CPU is reading it. This post will cover S-DD1 but some of it also applies to SPC7110.

Star Ocean

S-DD1 was used by 2 games: Star Ocean and Street Fighter Alpha 2. Presumably the “DD” stands for Data Decompressor.

Emulating Star Ocean

Star Ocean is one of the most technically impressive RPGs on the SNES, but it never released outside of Japan (at least not until it got a PSP remake many years later). This made it a prime target for an English fan translation.

One big problem was that early SNES emulators couldn’t run Star Ocean because they didn’t support the S-DD1 chip. Or rather, they couldn’t support S-DD1 because the decompression algorithm was not public knowledge and no one had reverse engineered it.

The main author of ZSNES and some others found a solution to this: extract the decompressed graphics from Star Ocean’s cartridge ROM into a separate file called a “graphics pack”, and modify emulators to support loading decompressed data from the graphics pack file whenever the S-DD1 chip would have performed decompression. This removed the need for emulators to support the decompression algorithm.

It’s unclear to me exactly how they were able to extract the graphics in the first place - maybe they were able to do something clever with the cartridge hardware that didn’t require understanding the exact decompression algorithm. Regardless, this made Star Ocean playable in emulators as long as you had the decompressed graphics pack along with a copy of the cartridge ROM. (Although as an aside, Star Ocean’s combat infamously ran way too fast in ZSNES compared to actual hardware…)

Some years later, people were able to fully reverse engineer the S-DD1’s decompression algorithm, and with that emulator developers were able to implement S-DD1 decompression directly in emulators. This made graphics packs no longer necessary.

As an interesting bit of trivia, Star Ocean’s raw cartridge ROM is 6MB while a fully decompressed version is 12MB, twice the size. That gives an idea of how effective the compression was.

The Algorithm

I’m going to discuss S-DD1’s decompression algorithm here but SPC7110’s algorithm is very similar. It’s unfortunately different enough that the implementations must be different, but the ideas behind the algorithms are very similar.

First of all, there is an SFC Development Wiki page on the S-DD1 that has a very in-depth explanation of the algorithm. Additionally, the nocash fullsnes documentation provides pseudocode that implements the algorithm. I would not have gotten anywhere with S-DD1 without both of those resources.

Golomb Coding

S-DD1 uses an adaptive compression algorithm that ultimately compresses a bitstream using a type of run-length encoding. “Adaptive” means that the algorithm dynamically alters its parameters as it processes the bitstream in order to try and achieve optimal compression for each chunk of data.

S-DD1 adapts to its data using a variant of Golomb coding. The coder keeps track of the current MPS (most probable symbol) and the LPS (least probable symbol), and it can encode long runs of the MPS extremely efficiently. In a bitstream, every symbol is either 0 or 1, so one of the two bits is always the MPS and the other bit is always the LPS.

The bitstream is encoded into what are called codewords. For a codeword size N, each codeword represents one of two types of runs:

  • A run of 2N MPSes
  • A run of between 0 and (2N - 1) MPSes followed by an LPS

A full 2N MPS run is encoded using only a single bit, a 0. A run ending in an LPS is encoded using 1+N bits: a 1 bit to indicate that this run ends with the LPS, followed by N bits to specify the number of MPSes before the LPS (between 0 and 2N - 1). The SFC wiki page refers to these as 0-codewords and 1N-codewords.

The codeword size N is adjusted dynamically based on the data. As runs of 2N MPSes are encountered, the Golomb coder will gradually increase the codeword size, up to a maximum of 7 for S-DD1 (MPS runs of length 128). As runs ending in an LPS are encountered, the Golomb coder will gradually decrease the codeword size, potentially all the way down to 0 which results in exactly one “compressed” bit per input bit. When multiple consecutive LPSes are encountered while the codeword size is 0, the coder will swap the MPS and LPS to see if it can start to increase the codeword size by treating the other bit as the MPS.

One note is that the algorithm starts out more adaptive immediately after initialization. It initializes the MPS to the first bit in the bitstream and then begins to rapidly increase the codeword size as it encounters consecutive 2N MPS runs. It stops when it encounters the first LPS, after which it transitions to its normal increase/decrease rate for codeword size.

To get concrete, S-DD1’s Golomb coder effectively has tables for 33 different states. State 0 is the initial state (and not used after initialization), states 1-24 are the normal states, and states 25-32 are the rapid increase states that it uses during initialization. Within 1-24 and 25-32, higher states have higher codeword sizes.

This is the codeword size for each state, from 0 to 7:

1
2
3
4
5
6
7
8
const CODEWORD_SIZE: &[u8; 33] = &[
    // Initial state (0)
    0,
    // Normal states (1-24)
    0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7,
    // Rapid increase states (25-32)
    0, 1, 2, 3, 4, 5, 6, 7
];

These are the “next state” tables based on whether the current codeword is a full 2N MPS run (0-codeword) or a run that ends with the LPS (1N-codeword):

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
const NEXT_STATE_MPS: &[u8; 33] = &[
    // Initial state (0)
    25,
    // Normal states (1-24)
    2, 3, 4, 5, 6, 7, 8, 9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,24,
    // Rapid increase states (25-32)
    26,27,28,29,30,31,32,24
];

const NEXT_STATE_LPS: &[u8; 33] = &[
    // Initial state (0)
    25,
    // Normal states (1-24)
    1, 1, 2, 3, 4, 5, 6, 7, 8, 9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,
    // Rapid increase states (25-32)
    1, 2, 4, 8,12,16,18,22
];

The MPS and LPS values are swapped if the coder encounters an LPS while in state 0 or 1. Once the coder enters the 1-24 range, a full MPS run causes the state to increase by 1 (up to a max of 24) while an LPS causes the state to decrease by 1 (down to a min of 1).

This coding algorithm is good at compressing bitstreams that contain extremely long runs of the same bit, and rather poor for bitstreams that don’t. Which is why S-DD1 combines Golomb coding with a context model.

Context Model

Rather than encoding the data as a completely linear stream of bits, S-DD1 uses what’s called a context model. At a very high level, a context model specializes the encoding algorithm for a particular type of data by tracking some coder state per-context rather than globally. The model defines how to determine the context for each input bit. In S-DD1’s case, it uses a context model specialized for 8x8 tile data in the SNES bitplane format.

The core idea behind S-DD1’s context model is that you can frequently predict the color of a pixel based on the colors of adjacent pixels. And that’s essentially what S-DD1 uses as its context: in the current bitplane, what were the bit values of specific adjacent pixels?

S-DD1’s context model is configurable - each block of data specifies its context configuration in the first 4 bits of the block. The first 2 bits specify whether the data consists of 2bpp tiles, 4bpp tiles, 8bpp tiles, or “other” (e.g. Mode 7 graphical data). “Other” is functionally 1bpp as far as the algorithm is concerned.

The next 2 bits specify which adjacent pixels to look at when computing the context, which can be some combination of the following:

  • 1 previous (pixel directly left)
  • 2 previous (pixel two to the left)
  • 7 previous (pixel above and right)
  • 8 previous (pixel directly above)
  • 9 previous (pixel above and left)

The specific 4 options are:

  • 1, 7, 8, 9 (left / above / above-left / above-right)
  • 1, 8, 9 (left / above / above-left)
  • 1, 7, 8 (left / above / above-right)
  • 1, 2, 8, 9 (left / two left / above / above-left)

The context is formed from those 3/4 previous bit values, plus one extra bit for whether the current bitplane is even or odd. That makes 16 contexts for the 3-pixel configurations and 32 contexts for 4-pixel.

The even/odd tracking is because of how SNES tiles are broken into bitplanes. 8x8 tiles can be either 16 bytes (2bpp), 32 bytes (4bpp), or 64 bytes (8bpp). Each 16-byte chunk stores 2 bitplanes, with the bytes within a chunk alternating between a row of pixels in one bitplane and a row of pixels in the other bitplane. The context uses the extra bit to ensure that these alternating bitplanes use separate contexts instead of stomping on each other and likely causing much worse compression.

Decoding

Now, how does this actually work?

This is where things get pretty unintuitive. The rest of this probably makes more sense from a compression perspective than a decompression perspective. Some of the details might also be due to specifics of how the hardware implementation was designed.

It’s probably easiest to start by listing what the S-DD1 chip actually tracks while it’s decompressing:

  • For each context (0-31): The current MPS value (0 or 1) and the current state value (0-32)
  • For each codeword size (0-7): The number of bits remaining in the current MPS run, and whether the decoder should emit an LPS before reading the next codeword
    • To be very explicit, there are 8 decoders, 1 for each codeword size
  • The previous 9 bits in each of the up-to-8 bitplanes (needed to compute context)

Rather than maintaining a completely separate Golomb decoder state for each context, the chip maintains a separate decoder state for each codeword size. The only things it tracks per-context are the current MPS value and the current state (which determines codeword size). This means that different contexts can and will use bits from the same decoder state if they happen to have the same codeword size at any given moment.

Note that because the same decoder state will likely be accessed by multiple different contexts throughout each run, a single run is not necessarily a stream of constant 0s or constant 1s! The decoder simply emits “MPS” or “LPS”, and the consuming context uses its current MPS value to determine whether that is a 0 or a 1.

State transitions occur when one of the contexts consumes the final bit of a run from one of the 8 decoders, and only the context that consumed the final bit will change its state. The state transition is performed solely based on whether that final bit was MPS (final bit of 0-codeword) or LPS (final bit of 1N-codeword).

To be honest, I don’t really understand why this works well. Every attempt I’ve made at explaining it to myself has ended with “but wait, no…” Maybe it has something to do with how the algorithm handles a bitplane transitioning from a run of 0s to a run of 1s (or vice versa), or maybe it has to do with how the bitplanes don’t each get their own contexts in 4bpp/8bpp mode (Star Ocean mainly uses 4bpp), or maybe it has to do with ease of hardware implementation.

Anyway, the last piece is: how do the 8 Golomb decoders receive their input codewords? That’s actually pretty simple: whenever a context asks one of the decoders for another bit, if the decoder doesn’t have any more bits to emit from the last codeword, it consumes the next codeword from ROM and advances the current ROM pointer. If the next bit in ROM is a 0, the decoder prepares to emit a run of 2N MPSes and advances the ROM pointer by 1 bit. If the next bit in ROM is a 1, the decoder additionally consumes the next N bits to determine the MPS run length, and it prepares to emit a run of that many MPSes followed by an LPS. It then advances the ROM pointer by 1+N bits.

Partial Pseudocode

Here is my attempt at some Python-like pseudocode for reading a single decompressed bit, omitting the bitplane and context calculations:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
# Constants:
#   CODEWORD_SIZES: [u8; 33]
#   NEXT_STATE_MPS: [u8; 33]
#   NEXT_STATE_LPS: [u8; 33]
#
# Fields:
#   context_states: [u8; 32]
#   context_mps: [u8; 32]
#   decoders: [Decoder; 8]

def next_bit(context):
    state = context_states[context]
    codeword_size = CODEWORD_SIZES[state]

    if decoders[codeword_size].bits_remaining == 0:
        # Need more bits, read next codeword from ROM and advance ROM pointer
        decoders[codeword_size].read_next_codeword()

    # Assuming 0 means MPS and 1 means LPS
    decoder_bit = decoders[codeword_size].next_bit()
    result_bit = decoder_bit ^ context_mps[context]

    if decoders[codeword_size].bits_remaining == 0:
        # Consumed last bit, perform state transition (but do not read next codeword from ROM yet)
        if decoder_bit == 0:
            # Perform MPS state transition
            context_states[context] = NEXT_STATE_MPS[state]
        else:
            # Perform LPS state transition
            context_states[context] = NEXT_STATE_LPS[state]

            if state == 0 or state == 1:
                # Encountered LPS while state was 0/1; swap MPS and LPS
                context_mps[context] ^= 1
    else:
        # If did not consume final bit, decoder bit is guaranteed to be MPS
        assert decoder_bit == 0

    return result_bit

And the decoder piece of it, assuming there’s some magic function to consume bits from ROM and advance the pointer:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
def consume_rom_bits(num_bits):
    # Definition omitted
    # Note that num_bits can be 0, in which case this function should return 0 without consuming any bits
    pass

class Decoder:
    def __init__(self, codeword_size):
        self.codeword_size = codeword_size
        self.mps_remaining = 0
        self.bits_remaining = 0

    def read_next_codeword(self):
        first_bit = consume_rom_bits(1)
        if first_bit == 0:
            # 0-codeword; emit run of 2^N MPSes
            self.mps_remaining = 2 ** self.codeword_size
            self.bits_remaining = self.mps_remaining
            return

        # 1N-codeword; run of [0, 2^N - 1] MPSes followed by an LPS
        length_bits = consume_rom_bits(self.codeword_size)

        # The bits-to-length mapping is not as straightforward as treating the bits
        # as a binary number, so use a lookup table
        self.mps_remaining = RUN_LENGTHS[self.codeword_size][length_bits]
        self.bits_remaining = 1 + self.mps_remaining

    def next_bit(self):
        self.bits_remaining -= 1

        if self.mps_remaining != 0:
            # Emit MPS
            self.mps_remaining -= 1
            return 0
        
        # Emit LPS, which should always be the last bit from the previous codeword
        assert self.bits_remaining == 0
        return 1

One note on the run length table is that you can actually use a single 128-entry table as long as you fill the low bits with 1s instead of 0s. For example, if the codeword size is 2, you can use the same lookup table as with a codeword size of 7 by using the 2 length bits as the highest 2 bits and setting the lowest 5 bits to all 1s:

1
let run_table_idx = (length_bits << 5) | 0x1F;

More generally:

1
let run_table_idx = (length_bits << (7 - codeword_size)) | (0x7F >> codeword_size);

SNES Integration

Phew, now to how this thing fits into the SNES.

S-DD1 has an MMC extremely similar to SA-1, which I described here. Unlike SA-1, S-DD1 games actually use the MMC! Well, one of them does, anyway - Star Ocean has a 6MB ROM, so it needs to use the MMC to manage which 4 of its 6 1MB ROM banks are mapped at any given time.

The S-DD1 MMC probably supports both LoROM and HiROM banking just like the SA-1 MMC, but games don’t depend on LoROM banking. Star Ocean only uses HiROM mappings except for the 65816 interrupt vectors in bank $00 (which it never tries to remap), and Street Fighter Alpha 2 only has 4MB of ROM so it doesn’t bother to use the MMC at all. Regardless, the following will work for both games:

  • Map $8000-$FFFF in banks $00-$3F and $80-$BF to the first 4MB of ROM with LoROM address mapping
    • Banks $00-$3F to the first 2MB of ROM and banks $80-$BF to the second 2MB
  • Treat banks $C0-$FF as 4 mappable 1MB HiROM banks, just like SA-1

Street Fighter Alpha 2 has no SRAM but Star Ocean does. It seems to expect its 8KB of SRAM to be mapped to $6000-$7FFF in bank $70, but it’s probably mapped to other locations as well.

S-DD1’s very few internal registers are mapped to $4800-$4807 in the I/O area banks. 4 of them are the MMC bank registers, 2 are for setting up decompression, and the other 2 have unknown functionality - Star Ocean writes to them but it’s unknown what they’re supposed to do.

The interesting part is how the data decompression works from the SNES perspective. First, the SNES CPU writes to S-DD1 registers to tell it that a specific DMA channel is about to read some compressed data. Then, the S-DD1 chip automagically notices when SNES DMA begins on that channel, it converts that channel’s DMA source address to a ROM address using the S-DD1 MMC, and then it initializes the decompressor to start reading from that ROM address. As the SNES DMA controller reads bytes from the cartridge, the S-DD1 chip returns decompressed bytes from its decompressor chip instead of raw ROM bytes, and it continues to do so until it notices that the SNES DMA has ended.

The decompressor obviously operates in terms of bits rather than bytes. Whenever SNES DMA requests another byte, the decompressor simply returns the next 8 decompressed bits, stored in little-endian order within the byte. In an emulator implementation, you probably want to decompress bits in batches and store them in a buffer, to simplify the code if nothing else.

I’m not sure exactly how S-DD1 observes when a DMA starts and ends - maybe it’s listening in on writes to the DMA register addresses? Or maybe there’s some signal that indicates that reads are coming from the DMA controller rather than directly from the 65816?

Next Time

I mentioned SPC7110 early in this post as the other data decompression coprocessor. This post was originally going to cover both S-DD1 and SPC7110, but I realized pretty quickly that was going to be too much because SPC7110 definitely has some interesting/weird things to discuss beyond just how the decompression works. So next time, SPC7110!

updatedupdated2024-02-192024-02-19