PlayStation: The SPU, Part 1 - ADPCM

The PlayStation’s SPU (Sound Processing Unit) is definitely not the most logical next thing to work on after implementing a basic triangle rasterizer, but it’s what I feel like discussing next. With my own projects, I don’t feel like an emulator is really online until it supports audio, so here we go!

Heritage

Having worked on both SNES and PS1 emulators, I cannot discuss the PS1 SPU without mentioning its heritage.

In short, the PS1 SPU is pretty much a V2 of the SNES APU (Audio Processing Unit). It might be hard to imagine nowadays, but Nintendo collaborated with Sony to develop the SNES audio hardware. The SNES APU was designed by Ken Kutaragi, who was also working on a CD-ROM add-on for the SNES before Nintendo canceled the project (likely due to seeing the Sega CD’s poor sales). After Nintendo shafted Sony on the SNES CD, Sony decided to make their own gaming console called the PlayStation, with Kutaragi as the lead hardware designer. Given this history, it’s no surprise that the SNES APU and PS1 SPU are extremely similar chips - they were designed by the same person!

The SNES APU and the PS1 SPU are both ADPCM sound chips that work very similarly, although the PS1 SPU is more advanced in almost every aspect. A light comparison of features:

Feature	SNES APU	PS1 SPU
Sound RAM	64 KB	512 KB
Channels/Voices	8	24
Output Sample Rate	32000 Hz	44100 Hz
Embedded CPU	SPC700 @ 1.024 MHz	None
ADPCM Format	16 samples per 9-byte block	28 samples per 16-byte block
Volume Bit Width	7-bit	15-bit
Panning Volume Envelopes	❌	✅

There are also a few features that don’t lend themselves well to a comparison table. For example, the PS1 SPU’s reverb feature is significantly more flexible than the SNES APU’s echo filter.

One important difference to note is that the PS1 SPU does not have an embedded audio CPU like the SPC700 in the SNES APU. This actually works out nicely for the PS1 - games don’t need to store an audio driver in sound RAM like they did on the SNES, games don’t need to deal with the complexity of communications between the main CPU and the audio CPU, and the PS1 CPU is fast enough that it’s easily capable of driving audio itself.

ADPCM

This post will assume some familiarity with basic concepts of digital audio. I covered some of that in the first section of my post on the Game Boy APU.

Unlike sound chips in earlier consoles (e.g. Game Boy, NES, Genesis), the PS1 SPU is a sample playback chip rather than a wave generator chip. This means that instead of games programming the sound chip at runtime to generate a particular sound, games provide sound wave samples to the sound chip and the sound chip simply plays them (with runtime options for pitch, volume, reverb, etc.).

ADPCM, or adaptive differential pulse-code modulation, is a sound format that applies lossy compression to a stream of PCM samples. Samples are encoded as differences relative to previous samples (the differential part), and the differences can be quantized to a varying number of bits (the adaptive part). What does this actually mean?

The PS1 format specifically compresses signed 16-bit PCM samples into signed 4-bit values. Each block of 28 samples is compressed into 16 bytes: 2 header bytes followed by 14 bytes of compressed 4-bit samples. Compared to the 56 bytes required to store the raw 16-bit samples, that’s a 3.5:1 reduction! This is the main benefit of ADPCM - the audio data takes significantly less space than storing raw CD-quality audio samples, although there is a hit to quality due to the compression being lossy.
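
To put numbers on the space savings, here is a small sketch that computes storage sizes for a mono sound. The function names are my own for illustration, and it assumes a partially filled final block still takes a full 16 bytes:

```rust
/// Block geometry described above: 28 samples are packed into 16 bytes.
const SAMPLES_PER_BLOCK: usize = 28;
const BYTES_PER_BLOCK: usize = 16;

/// Bytes needed to store `samples` mono samples as raw signed 16-bit PCM.
fn raw_pcm_bytes(samples: usize) -> usize {
    samples * 2
}

/// Bytes needed to store `samples` mono samples as ADPCM blocks.
/// Assumption: the last block is padded out to a full 16 bytes.
fn adpcm_bytes(samples: usize) -> usize {
    let blocks = (samples + SAMPLES_PER_BLOCK - 1) / SAMPLES_PER_BLOCK;
    blocks * BYTES_PER_BLOCK
}
```

For one second of 44100 Hz mono audio, `raw_pcm_bytes(44100)` is 88200 while `adpcm_bytes(44100)` is 25200, which is the same 56:16 ratio as a single block.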

The first header byte in each block contains two important values:

Shift: Specifies the bit shift to apply to each signed 4-bit sample in the block
Filter: Specifies a formula to generate decoded 16-bit PCM sample values

The second header byte contains loop flags, which are important for playback but not for decoding.

psx-spx describes the ADPCM format in detail in a section on its CD-ROM page. I think pseudocode or actual code is the easiest way to understand the format. psx-spx has a pseudocode implementation, and here’s my own example:

use std::cmp;

fn decode_adpcm_block(block: &[u8; 16], decoded: &mut [i16; 28], old_sample: &mut i16, older_sample: &mut i16) {
    // First byte is a header byte specifying the shift value (bits 0-3) and the filter value (bits 4-6).
    // A shift value of 13-15 is invalid and behaves the same as shift=9
    let shift = block[0] & 0x0F;
    let shift = if shift > 12 { 9 } else { shift };

    // Filter values can only range from 0 to 4
    let filter = cmp::min(4, (block[0] >> 4) & 0x07);

    // The second byte is another header byte specifying loop flags; ignore that for now

    // The remaining 14 bytes are encoded sample values
    for sample_idx in 0..28 {
        // Read the raw 4-bit sample value from the block.
        // Samples are stored little-endian within a byte
        let sample_byte = block[2 + sample_idx / 2];
        let sample_nibble = (sample_byte >> (4 * (sample_idx % 2))) & 0x0F;

        // Sign extend from 4 bits to 32 bits
        let raw_sample: i32 = (((sample_nibble as i8) << 4) >> 4).into();

        // Apply the shift; a shift value of N is decoded by shifting left (12 - N)
        let shifted_sample = raw_sample << (12 - shift);

        // Apply the filter formula.
        // In real code you can do this with tables instead of a match
        let old = i32::from(*old_sample);
        let older = i32::from(*older_sample);
        let filtered_sample = match filter {
            // 0: No filtering
            0 => shifted_sample,
            // 1: Filter using previous sample
            1 => shifted_sample + (60 * old + 32) / 64,
            // 2-4: Filter using previous 2 samples
            2 => shifted_sample + (115 * old - 52 * older + 32) / 64,
            3 => shifted_sample + (98 * old - 55 * older + 32) / 64,
            4 => shifted_sample + (122 * old - 60 * older + 32) / 64,
            _ => unreachable!("filter was clamped to [0, 4]")
        };

        // Finally, clamp to signed 16-bit
        let clamped_sample = filtered_sample.clamp(-0x8000, 0x7FFF) as i16;
        decoded[sample_idx] = clamped_sample;

        // Update sliding window for filter
        *older_sample = *old_sample;
        *old_sample = clamped_sample;
    }
}

Don’t ask me how to write an encoder for this format, but I think a decoder implementation makes it easier to see how the format manages to compress 16-bit samples into 4-bit samples without sounding absolutely terrible.

The filter applies the D in ADPCM. Unless filter 0 is used, the 4-bit samples are decoded by combining them with previous decoded sample values, so the encoded “sample” values are more like differences than raw samples (although they’re not exact differences due to the filter formulas). This takes advantage of the fact that raw sample values are often very close to the surrounding sample values, so storing differences takes way fewer bits than storing all of the raw samples.

The shift applies the A in ADPCM. A signed 4-bit value can only store differences between -8 and +7, which is a tiny range, but the shift makes it possible to encode larger values at the cost of losing the ability to finely specify smaller differences (within the block). For example, with a shift of 6 the difference values can range from -512 to +448, but each 4-bit value can only represent 1 of 16 specific values within that range:

>>> [i << (12 - 6) for i in range(-8, 8)]
[-512, -448, -384, -320, -256, -192, -128, -64, 0, 64, 128, 192, 256, 320, 384, 448]

The decoding process ultimately produces a series of signed 16-bit PCM samples that are assumed to be at 44100 Hz. The SPU can play at different sample rates, but it accomplishes this by simply playing the decoded samples faster or slower. The sample rate only affects how frequently sample blocks are decoded, not the details of the decoding algorithm. The SPU also performs some interpolation between samples which I’ll discuss towards the end.

Here is an example block from Crash Bandicoot 2’s intro. First, the 16 raw bytes:

48 00 D2 4D EF F0 E3 3C 1F ED F4 2F 2E EF E3 13

From the first byte, we can tell that the shift is 8 and the filter is 4. The second byte is 0 which means no loop flags are set. The remaining 14 bytes contain the signed 4-bit values, which are:

[2, -3, -3, 4, -1, -2, 0, -1, 3, -2, -4, 3, -1, 1, -3, -2, 4, -1, -1, 2, -2, 2, -1, -2, 3, -2, 3, 1]
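
If you want to check the header fields and nibble values yourself, here is a minimal sketch that pulls them out of the 16 raw bytes. `BlockHeader`, `parse_header`, `extract_nibbles`, and `EXAMPLE_BLOCK` are names I made up for this example, and the header is returned raw, without the shift/filter clamping that the decoder applies:

```rust
/// Raw header fields from the first two bytes of a 16-byte ADPCM block.
#[derive(Debug, PartialEq)]
struct BlockHeader {
    shift: u8,
    filter: u8,
    loop_flags: u8,
}

fn parse_header(block: &[u8; 16]) -> BlockHeader {
    BlockHeader {
        shift: block[0] & 0x0F,
        filter: (block[0] >> 4) & 0x07,
        loop_flags: block[1] & 0x07,
    }
}

/// Extract the 28 signed 4-bit values, low nibble first within each byte.
fn extract_nibbles(block: &[u8; 16]) -> [i8; 28] {
    let mut nibbles = [0i8; 28];
    for (i, nibble) in nibbles.iter_mut().enumerate() {
        let byte = block[2 + i / 2];
        let raw = (byte >> (4 * (i % 2))) & 0x0F;
        // Sign extend from 4 bits: move the nibble to the top of the byte,
        // then arithmetic-shift it back down
        *nibble = ((raw << 4) as i8) >> 4;
    }
    nibbles
}

/// The example block from above.
const EXAMPLE_BLOCK: [u8; 16] = [
    0x48, 0x00, 0xD2, 0x4D, 0xEF, 0xF0, 0xE3, 0x3C,
    0x1F, 0xED, 0xF4, 0x2F, 0x2E, 0xEF, 0xE3, 0x13,
];
```

Running `parse_header(&EXAMPLE_BLOCK)` gives shift 8, filter 4, and no loop flags, and `extract_nibbles(&EXAMPLE_BLOCK)` reproduces the list above.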

The previous two decoded samples in this voice happened to be 392 (“old”) and 465 (“older”), from which it’s possible to decode the 28 samples. Given a shift of 8 and a filter of 4, the first decoded sample should be:

>>> floor((2 << (12 - 8)) + (122 * 392 - 60 * 465 + 32) / 64)
343

Then the second decoded sample should be:

>>> floor((-3 << (12 - 8)) + (122 * 343 - 60 * 392 + 32) / 64)
238

For reference, the full list of decoded samples should be this (split into two lines only for readability):

+343 +238 +84  +2   -90  -204 -304 -403 -434 -481 -573 -592 -606 -583
-590 -609 -543 -479 -419 -317 -242 -131 -38  +18  +118 +176 +273 +371
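
As a sanity check, here is a compact table-driven sketch of the decoder (the table variant mentioned in the decoder's comments) that can be run against the example block. The `FILTER_COEFFS` and `decode_block` names and the `(old, older)` tuple are my own choices for this example, not anything from psx-spx:

```rust
// Filter coefficients as (old, older) multipliers; these are the same
// numbers that appear in the match arms of the decoder above
const FILTER_COEFFS: [(i32, i32); 5] = [(0, 0), (60, 0), (115, -52), (98, -55), (122, -60)];

/// Decode one block. `history` holds (old, older) and is updated in place.
fn decode_block(block: &[u8; 16], history: &mut (i16, i16)) -> [i16; 28] {
    let raw_shift = block[0] & 0x0F;
    let shift = if raw_shift > 12 { 9 } else { raw_shift };
    let filter = std::cmp::min(4, (block[0] >> 4) & 0x07) as usize;
    let (k_old, k_older) = FILTER_COEFFS[filter];

    let mut decoded = [0i16; 28];
    for (i, out) in decoded.iter_mut().enumerate() {
        // Low nibble first, sign extended from 4 bits
        let nibble = (block[2 + i / 2] >> (4 * (i % 2))) & 0x0F;
        let raw = i32::from(((nibble << 4) as i8) >> 4);
        let shifted = raw << (12 - shift);

        let (old, older) = (i32::from(history.0), i32::from(history.1));
        // Rust's `/` truncates toward zero, same as the decoder above
        let predicted = (k_old * old + k_older * older + 32) / 64;
        let sample = (shifted + predicted).clamp(-0x8000, 0x7FFF) as i16;

        *out = sample;
        *history = (sample, history.0);
    }
    decoded
}
```

Feeding it the example block with a history of `(392, 465)` reproduces the 28 samples listed above. One detail worth knowing: the listed values depend on the division truncating toward zero. The Python one-liners agree for the first two samples, but flooring the whole expression would give 1 instead of 2 for the fourth sample, where the filter term is negative.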

Memory Map

The SPU has a lot of memory-mapped registers, more than any other hardware device in the PS1. I’m not going to list them all individually here (they’re documented on the psx-spx SPU page), but they fall into a few ranges:

$1F801C00-$1F801D7F: Voice registers
- $1F801C00-$1F801C0F are the voice 0 registers, $1F801C10-$1F801C1F are the voice 1 registers, etc.
$1F801D80-$1F801DBF: Control registers
- This range also includes key on/off registers and a few other voice control registers
$1F801DC0-$1F801DFF: Reverb registers
$1F801E00-$1F801E80: Internal registers
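
In an emulator, these ranges usually turn into a dispatch on the address. A rough sketch, using the ranges exactly as listed above (the `SpuRegion` enum and `classify` function are hypothetical names for illustration):

```rust
#[derive(Debug, PartialEq)]
enum SpuRegion {
    /// Voice index (0-23) and byte offset within that voice's 16-byte window
    Voice { index: u32, offset: u32 },
    Control,
    Reverb,
    Internal,
    Unmapped,
}

fn classify(address: u32) -> SpuRegion {
    match address {
        0x1F80_1C00..=0x1F80_1D7F => {
            // Each voice owns 16 consecutive bytes of register space
            let relative = address - 0x1F80_1C00;
            SpuRegion::Voice { index: relative / 0x10, offset: relative % 0x10 }
        }
        0x1F80_1D80..=0x1F80_1DBF => SpuRegion::Control,
        0x1F80_1DC0..=0x1F80_1DFF => SpuRegion::Reverb,
        0x1F80_1E00..=0x1F80_1E80 => SpuRegion::Internal,
        _ => SpuRegion::Unmapped,
    }
}
```

For example, `classify(0x1F801C14)` is voice 1 at offset 4, and `classify(0x1F801D88)` lands in the control range.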

I’ll mention some of the more important registers where they’re relevant.

One thing to note is that psx-spx describes a number of SPU registers as being 32-bit, but games will typically write to them using two 16-bit writes instead of a single 32-bit write. This is because actual PS1 hardware doesn’t reliably handle 32-bit writes to SPU registers. 32-bit reads (and 8-bit reads) do seem to work properly.

The 512KB of sound RAM doesn’t have much of a memory map. The first 4KB is reserved by the hardware for capture buffers, and the reverb buffer must always be located at the end of sound RAM (if reverb is enabled), but aside from that sound RAM is 508KB of space for games to allocate however they wish.

Data Transfer

ADPCM samples must be stored in sound RAM for the SPU to be able to decode and play them. Sound RAM isn’t mapped directly into the CPU’s memory map - the CPU can only access sound RAM through an SPU data port.

First, relevant registers:

$1F801DA6: Sound RAM data port start address
$1F801DA8: Sound RAM data port (16-bit)

There is also some sort of data transfer control register at $1F801DAC, but I don’t know that anything depends on emulating the weird behavior that psx-spx describes for that register. I’ve only ever seen games write $0004 to it which is the standard value.

Most games transfer ADPCM samples into sound RAM using SPU DMA (DMA4), but some games (and the BIOS) will write sound data directly to $1F801DA8. Data port writes technically send the halfword into a FIFO instead of immediately applying the write to sound RAM, but I don’t think anything depends on emulating the data transfer FIFO. Some games might indirectly depend on it via SPU DMA timing.

Anyway, there’s not much to emulating the data port. Games will write to $1F801DA6 to set the start address and then they’ll either kick off an SPU DMA or start manually writing halfwords to $1F801DA8. Each 16-bit data port write automatically increments the current data port address by 2. Note that SPU DMA operates in words instead of halfwords - each SPU DMA word transfer writes two halfwords to the data port.
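
A minimal sketch of that data port behavior, ignoring the FIFO entirely. The `DataPort` type and its method names are made up for this example, and three things here are assumptions rather than anything stated above: the start address register is in 8-byte units (as psx-spx describes), the address wraps at the end of sound RAM, and a DMA word is split low halfword first:

```rust
const SOUND_RAM_LEN: usize = 512 * 1024;

struct DataPort {
    /// Current byte address in sound RAM
    current_address: usize,
}

impl DataPort {
    fn new() -> Self {
        DataPort { current_address: 0 }
    }

    /// Write to $1F801DA6. Assumption: the value is in 8-byte units.
    fn write_start_address(&mut self, value: u16) {
        self.current_address = usize::from(value) << 3;
    }

    /// Write to $1F801DA8: store one little-endian halfword and advance by 2.
    fn write_halfword(&mut self, sound_ram: &mut [u8], value: u16) {
        let address = self.current_address;
        sound_ram[address] = value as u8;
        sound_ram[address + 1] = (value >> 8) as u8;
        // Assumption: the address wraps around at the end of sound RAM
        self.current_address = (address + 2) % SOUND_RAM_LEN;
    }

    /// One SPU DMA word is two data port writes.
    /// Assumption: the low halfword goes first.
    fn write_dma_word(&mut self, sound_ram: &mut [u8], word: u32) {
        self.write_halfword(sound_ram, word as u16);
        self.write_halfword(sound_ram, (word >> 16) as u16);
    }
}
```

With a start address register value of `0x0200`, the first halfword lands at sound RAM address `0x1000` and the port then points at `0x1002`.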

Games don’t commonly read data out of sound RAM, but it is possible to do so using SPU DMA with the DMA direction set to device-to-RAM instead of RAM-to-device. Manually reading from the data port apparently does not work on actual hardware, so SPU DMA is the only way to read from sound RAM.

Playback

Now, how does the SPU play back decoded ADPCM samples?

Registers

For now I’m going to ignore all of the volume and envelope registers because volume is applied after decoding and interpolation. You can think of the envelope functionality as a self-contained subsystem that generates different volume multipliers, independent of the sample values being decoded.

The most relevant registers for playback are in the voice register range, with a separate set of registers for each of the 24 voices:

$1F801C04 + N*$10: ADPCM sample rate
$1F801C06 + N*$10: ADPCM start address
$1F801C0E + N*$10: ADPCM repeat address

Each voice also has an internal ADPCM current address register that is not memory mapped. Note that all address registers are in 8-byte units, so you’ll need to left shift by 3 to convert the register value to a sound RAM address.
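
Both calculations are simple enough to write down directly. This sketch uses hypothetical names (`VoiceRegister`, `voice_register_address`, `to_sound_ram_address`) that are mine, not from any existing codebase:

```rust
const VOICE_REGISTER_BASE: u32 = 0x1F80_1C00;

/// Offsets of the playback-related registers within a voice's 16-byte window.
#[derive(Clone, Copy)]
enum VoiceRegister {
    SampleRate = 0x04,
    StartAddress = 0x06,
    RepeatAddress = 0x0E,
}

/// Memory-mapped address of a register for voice N (0-23).
fn voice_register_address(voice: u32, register: VoiceRegister) -> u32 {
    assert!(voice < 24, "the SPU has 24 voices");
    VOICE_REGISTER_BASE + voice * 0x10 + register as u32
}

/// Convert an address register value (8-byte units) to a sound RAM byte address.
fn to_sound_ram_address(register_value: u16) -> u32 {
    u32::from(register_value) << 3
}
```

A nice property of the 8-byte units is that the largest 16-bit register value, `0xFFFF`, maps to `0x7FFF8`, which is still inside the 512KB of sound RAM.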

The key on register is also relevant to playback:

$1F801D88: Key on voices 0-15 (0=no change, 1=key on)
$1F801D8A: Key on voices 16-23

If we’re ignoring volume, keying off is not relevant for playback because it only affects the envelopes, but now’s as good a time as any to mention those:

$1F801D8C: Key off voices 0-15 (0=no change, 1=key off)
$1F801D8E: Key off voices 16-23
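
Since games write these as separate 16-bit halves, it's convenient to turn each write into a list of voice indices. A small sketch (the function name and the `high_half` flag are my own invention, and it assumes the unused upper bits of the second register are simply ignored):

```rust
/// Voices selected by a 16-bit write to a key on/off register.
/// `high_half` is false for $1F801D88/$1F801D8C (voices 0-15)
/// and true for $1F801D8A/$1F801D8E (voices 16-23).
fn voices_from_key_write(value: u16, high_half: bool) -> Vec<usize> {
    let (base, count): (usize, usize) = if high_half { (16, 8) } else { (0, 16) };
    (0..count)
        // A set bit means "key on/off this voice"; a clear bit means no change
        .filter(|&bit| (value >> bit) & 1 != 0)
        .map(|bit| base + bit)
        .collect()
}
```

For example, writing `0x0005` to $1F801D88 selects voices 0 and 2, and writing `0x0081` to $1F801D8A selects voices 16 and 23.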

Setup

Prior to starting playback on a voice, software will set the ADPCM sample rate and ADPCM start address in the relevant voice registers. It may also set the ADPCM repeat address, but a lot of software relies on loop flags instead of explicitly setting the repeat address in the register.

Once everything is set up (including the envelope parameters), software will key on the voice by setting the corresponding bit in $1F801D88 (voices 0-15) or $1F801D8A (voices 16-23). Keying on resets both the envelope state and the ADPCM decoding state. Resetting ADPCM decoding state involves doing the following:

Set the ADPCM current address to the ADPCM start address
Decode the first 16-byte / 28-sample block at the new current address
Reset the internal pitch counter to 0

The pitch counter is used to manage playback speed and interpolation, covered in more detail a bit later.

Example code:

struct Voice {
    start_address: u32,
    current_address: u32,
    pitch_counter: u16,
    decode_buffer: [i16; 28],
    envelope: AdsrEnvelope,
    ...
}

impl Voice {
    fn key_on(&mut self, sound_ram: &[u8]) {
        self.envelope.key_on();

        // Reset ADPCM decoding state and decode the first block
        self.current_address = self.start_address;
        self.pitch_counter = 0;
        self.decode_next_block(sound_ram);
    }

    fn decode_next_block(&mut self, sound_ram: &[u8]) {
        // Grab the next 16-byte block
        let block = &sound_ram[self.current_address as usize..(self.current_address + 16) as usize];

        // Decode the 28 samples
        decode_adpcm_block(block, &mut self.decode_buffer, ...);

        ...
    }
}

Looping

The second header byte in each 16-byte block contains three loop flags that are used during playback:

Loop end (Bit 0)
Loop repeat (Bit 1)
Loop start (Bit 2)

As their names imply, these are used to manage looping.

The loop start flag marks the beginning of a loop. When a voice decodes an ADPCM block with the loop start flag set, it updates the ADPCM repeat address register to point to the start of the 16-byte block that had the loop start flag set.

The loop end flag marks the end of a loop. When a voice decodes an ADPCM block with the loop end flag set, it copies the contents of the ADPCM repeat address register to the internal ADPCM current address register. In other words, it jumps to the start of the loop. Note that the voice will still play the block that had the loop end flag set before it starts playing samples from the start of the loop!

The loop repeat flag is only meaningful in blocks where the loop end flag is set. When a voice decodes a block where the loop end flag is set and the loop repeat flag is not set, the voice will immediately silence itself by setting the envelope volume to 0 and keying off the envelope. It will still jump back to the beginning of the loop and start “playing” samples, but nothing will be audible due to the voice having muted itself.

In code, this might look something like this (omitting lots of details):

struct Voice {
    start_address: u32,
    repeat_address: u32,
    current_address: u32,
    envelope: AdsrEnvelope,
    ...
}

impl Voice {
    fn decode_next_block(&mut self, sound_ram: &[u8]) {
        let block = &sound_ram[self.current_address as usize..(self.current_address + 16) as usize];

        decode_adpcm_block(block, ...);

        // Parse loop flags from the second header byte
        let loop_end = block[1] & 1 != 0;
        let loop_repeat = block[1] & (1 << 1) != 0;
        let loop_start = block[1] & (1 << 2) != 0;

        if loop_start {
            // Start of loop, update repeat address
            self.repeat_address = self.current_address;
        }

        if loop_end {
            // End of loop, jump to start of loop
            self.current_address = self.repeat_address;

            if !loop_repeat {
                // End of non-repeating loop, immediately mute the voice
                self.envelope.volume = 0;
                self.envelope.key_off();
            }
        } else {
            // Not end of loop, move to the next 16-byte block
            self.current_address += 16;
        }
    }
}
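
To see the flags in action, here is a self-contained sketch that traces which block addresses get decoded for a short run of blocks. Everything here is illustrative: `LoopState`, `apply_loop_flags`, and `trace` are made-up names, the `muted` field stands in for zeroing the envelope volume and keying off, and the repeat address is assumed to start out equal to the start address:

```rust
const LOOP_END: u8 = 1 << 0;
const LOOP_REPEAT: u8 = 1 << 1;
const LOOP_START: u8 = 1 << 2;

struct LoopState {
    current_address: u32,
    repeat_address: u32,
    muted: bool,
}

/// Apply the loop flags of the block that was just decoded at `current_address`.
fn apply_loop_flags(state: &mut LoopState, flags: u8) {
    if flags & LOOP_START != 0 {
        state.repeat_address = state.current_address;
    }
    if flags & LOOP_END != 0 {
        state.current_address = state.repeat_address;
        if flags & LOOP_REPEAT == 0 {
            state.muted = true;
        }
    } else {
        state.current_address += 16;
    }
}

/// Decode `steps` blocks starting at `base`, where `block_flags[i]` holds the
/// loop flags of the block at `base + 16 * i`. Returns the decoded block
/// addresses in order, plus whether the voice ended up muted.
fn trace(base: u32, block_flags: &[u8], steps: usize) -> (Vec<u32>, bool) {
    // Assumption: the repeat address register initially equals the start address
    let mut state = LoopState { current_address: base, repeat_address: base, muted: false };
    let mut visited = Vec::new();
    for _ in 0..steps {
        visited.push(state.current_address);
        let index = ((state.current_address - base) / 16) as usize;
        apply_loop_flags(&mut state, block_flags[index]);
    }
    (visited, state.muted)
}
```

With flags `[0, LOOP_START, LOOP_END | LOOP_REPEAT]` starting at `0x1000`, the decoded addresses go `0x1000, 0x1010, 0x1020, 0x1010, 0x1020, ...` and the voice stays audible. With `[0, LOOP_END]`, the voice jumps back to `0x1000` after the second block but is muted from that point on.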

Pitch

Now for how to play the samples within a decoded 28-sample block.

This is where timing starts to come into play. The SPU clocks at a rate of 44100 Hz, which should be exactly 1/768 times the CPU clock rate of 33.8688 MHz. This makes SPU timing easy - you can simply clock it once per 768 CPU cycles. Actual hardware likely performs different operations at different times instead of doing everything all at once every 768 CPU cycles, but that isn’t really important for SPU emulation.
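
One simple way to drive this is a cycle accumulator that tells you how many SPU clocks are due after the CPU has run for a while. This is only a sketch with made-up names (`SpuTimer`, `tick`), and how you hook it into your scheduler will depend on your emulator:

```rust
const CPU_CLOCK_HZ: u32 = 33_868_800;
const CPU_CYCLES_PER_SPU_CLOCK: u32 = 768;

struct SpuTimer {
    /// CPU cycles that have elapsed since the last SPU clock
    cycle_accumulator: u32,
}

impl SpuTimer {
    fn new() -> Self {
        SpuTimer { cycle_accumulator: 0 }
    }

    /// Add elapsed CPU cycles and return how many SPU clocks are now due.
    fn tick(&mut self, cpu_cycles: u32) -> u32 {
        self.cycle_accumulator += cpu_cycles;
        let clocks = self.cycle_accumulator / CPU_CYCLES_PER_SPU_CLOCK;
        // Keep the leftover cycles so timing doesn't drift
        self.cycle_accumulator %= CPU_CYCLES_PER_SPU_CLOCK;
        clocks
    }
}
```

The caller would then clock every voice once per returned SPU clock and push one output sample per clock. Note that `CPU_CLOCK_HZ / CPU_CYCLES_PER_SPU_CLOCK` comes out to exactly 44100 with no remainder.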

The ADPCM sample rate determines playback speed, which determines the sound’s pitch. A value of 0x1000 means 44100 Hz which is the SPU’s native sample rate. The maximum allowed sample rate is 0x4000 or 176400 Hz, 4x the native sample rate.

Note that the SPU always outputs samples at 44100 Hz - the ADPCM sample rate only affects the speed at which the voice moves through decoded samples. For example, with a sample rate of 176400 Hz (0x4000), the voice will move forward by 4 samples on every SPU clock. With a sample rate of 22050 Hz (0x0800), the voice will move forward by half a sample on every SPU clock.
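
The relationship between the register value and the effective rate is just a fixed-point multiply, with 0x1000 representing 1.0. A quick sketch (function names are mine) that also shows how many whole samples a voice advances over a number of SPU clocks, assuming the pitch counter starts at 0:

```rust
/// Effective playback rate in Hz for an ADPCM sample rate register value.
fn effective_rate_hz(sample_rate_reg: u16) -> u32 {
    // Values above 0x4000 behave like 0x4000
    let step = std::cmp::min(0x4000, u32::from(sample_rate_reg));
    step * 44_100 / 0x1000
}

/// Whole samples advanced after `clocks` SPU clocks, starting from a pitch counter of 0.
fn samples_advanced(sample_rate_reg: u16, clocks: u32) -> u32 {
    let step = std::cmp::min(0x4000, u32::from(sample_rate_reg));
    // The low 12 bits are the fractional position between samples
    (step * clocks) >> 12
}
```

So `effective_rate_hz(0x0800)` is 22050, and a voice at that rate only moves to the next sample on every second SPU clock: `samples_advanced(0x0800, 1)` is 0 and `samples_advanced(0x0800, 2)` is 1.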

On every SPU clock, the pitch counter is incremented by the sample rate, and then bits 12-15 of the pitch counter are used to advance through samples. Bits 4-11 of the pitch counter are used for interpolation between samples but we’ll ignore that for now.

Ignoring pitch modulation and sample interpolation (and volume), voice clocking looks something like this:

struct Voice {
    sample_rate: u16,
    pitch_counter: u16,
    decode_buffer: [i16; 28],
    current_buffer_idx: u8,
    current_sample: i16,
    ...
}

impl Voice {
    fn clock(&mut self, sound_ram: &[u8]) {
        // Increment pitch counter using the sample rate.
        // Effective sample rate cannot be larger than 0x4000 (176400 Hz)
        let pitch_counter_step = cmp::min(0x4000, self.sample_rate);
        // In a full implementation, pitch modulation would be applied right here
        self.pitch_counter += pitch_counter_step;

        // Step through samples while pitch counter bits 12-15 are non-zero
        while self.pitch_counter >= 0x1000 {
            self.pitch_counter -= 0x1000;
            self.current_buffer_idx += 1;

            // Check if end of block was reached
            if self.current_buffer_idx == 28 {
                self.current_buffer_idx = 0;
                self.decode_next_block(sound_ram);
            }
        }

        // Update current sample.
        // In a full implementation, this is where sample interpolation and voice volume would be applied
        self.current_sample = self.decode_buffer[self.current_buffer_idx as usize];
    }
}

Note that keying on should also reset the new current_buffer_idx field to 0 in addition to the pitch counter.

Now, if you’re willing to put in some terrible hacks, this should be enough to get some audio output! You can (badly) fake volume by applying, say, a 1/4 volume multiplier to each keyed-on voice and a 0 volume multiplier to each keyed-off voice. Something like this:

let mut mixed_sample: i32 = 0;
for voice in &self.voices {
    if !voice.keyed_on {
        continue;
    }

    mixed_sample += i32::from(voice.current_sample / 4);
}

let output_sample = mixed_sample.clamp(-0x8000, 0x7FFF) as i16;
// Then output sample to a 44100 Hz audio stream of signed 16-bit integer samples

Wire that up to an audio API like SDL2 or cpal and you’ll get…well…sound, at least!

Clearly the volume envelopes and reverb are supposed to be doing a lot of work here! (Spoilers: It’s mostly the envelopes.)

Some other audio will sound closer to correct, even with no volume functionality implemented. Here’s a snippet from Mega Man 8’s opening stage with the exact same SPU implementation that produced the previous sound:

Gaussian Interpolation

I mentioned interpolation a few times but skipped over what the SPU actually does to interpolate between samples. Time to cover that!

When a voice’s sample rate is not an integer multiple of 0x1000 (44100 Hz), the pitch counter can end up partway between samples after an SPU clock. To make these different in-between points sound different, the SPU performs Gaussian interpolation between the current sample and the previous three samples, using different weights depending on exactly where the pitch counter is. Note that “previous three” here means the previous three decoded samples, not the previous three output samples.

Hardware performs the interpolation using a 512-entry lookup table of multipliers, full of signed 16-bit integers where each value N represents the multiplier N/0x8000. Bits 4-11 of the pitch counter determine which multipliers are used for which samples. psx-spx has the interpolation table here, as well as the basic interpolation formula:

// 512-entry lookup table
// These are signed 16-bit integers, but they're i32s in this example to avoid casts from i16 to i32
const GAUSSIAN_TABLE: &[i32; 512] = &[...];

// Use pitch counter bits 4-11 as the interpolation index
let interpolation_idx = ((pitch_counter >> 4) & 0xFF) as usize;

// Four most recent samples, sign extended to 32 bits
let samples = [oldest_sample, older_sample, old_sample, current_sample].map(i32::from);

// Perform interpolation; do math using signed 32-bit integers to avoid overflow in intermediate calculations
// The right shifts by 15 are because each multiplier N represents (N / 0x8000)
let mut interpolated = (GAUSSIAN_TABLE[0x0FF - interpolation_idx] * samples[0]) >> 15;
interpolated += (GAUSSIAN_TABLE[0x1FF - interpolation_idx] * samples[1]) >> 15;
interpolated += (GAUSSIAN_TABLE[0x100 + interpolation_idx] * samples[2]) >> 15;
interpolated += (GAUSSIAN_TABLE[interpolation_idx] * samples[3]) >> 15;

And that’s it! This should be performed after every SPU clock, after the pitch counter is incremented (and after stepping forward in the decoded sample buffer if necessary). The lookup table is constructed such that each set of four multipliers sums to less than 0x8000, so the final result is guaranteed to fit in a signed 16-bit integer.

Note that the interpolation requires storing at least the last three samples from the previous block after decoding a new block. You already need the last two samples for filtering, so this is just one more to store.

This is one area where an emulator may wish to implement enhancement options by offering different types of sample interpolation in addition to Gaussian interpolation. This isn’t something I’ve done myself, but you could implement cubic interpolation or something more advanced. It’s pretty subjective what type of interpolation sounds best, and some might find it hard to even notice the difference at all, but it’s something to consider.
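
For the curious, here is roughly what a cubic option could look like. To be clear, this is not what the hardware does and not something I've shipped: it's a Catmull-Rom spline sketch under my own assumptions, namely that the curve runs between the two middle samples (`older` and `old`), that all 12 fractional bits of the pitch counter are used as the position, and that floating-point math is acceptable:

```rust
/// Catmull-Rom interpolation over the four most recent decoded samples.
/// `samples` is [oldest, older, old, current]; the result moves from
/// `older` toward `old` as the fractional part of the pitch counter grows.
fn catmull_rom(samples: [i16; 4], pitch_counter: u16) -> i16 {
    // Fractional position between samples, in [0, 1)
    let t = f64::from(pitch_counter & 0x0FFF) / 4096.0;

    let p0 = f64::from(samples[0]);
    let p1 = f64::from(samples[1]);
    let p2 = f64::from(samples[2]);
    let p3 = f64::from(samples[3]);

    // Standard Catmull-Rom polynomial in Horner form
    let value = p1
        + 0.5 * t * (p2 - p0
            + t * (2.0 * p0 - 5.0 * p1 + 4.0 * p2 - p3
                + t * (3.0 * (p1 - p2) + p3 - p0)));

    // Unlike the Gaussian table, a cubic spline can overshoot, so clamp
    value.round().clamp(-32768.0, 32767.0) as i16
}
```

On a straight ramp like `[0, 10, 20, 30]` this returns 10 at a fraction of 0 and 15 at a fraction of one half (`0x0800`), which is what you'd expect from any reasonable interpolator.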

To Be Continued

And that’s all for the ADPCM decoding! The next post will likely cover volume and envelopes, and maybe reverb.
