SNES & PlayStation Cubic ADPCM Interpolation

It’s been a minute, and I haven’t done a whole lot of emulation work recently, but this is a topic I’ve wanted to do a short post about for a while: a really simple audio enhancement for SNES and PlayStation emulation that works surprisingly well with some games. It’s more noticeable for SNES, but there are some PlayStation games that benefit from this as well!

ADPCM Interpolation?

I’m not going to go into detail on how ADPCM works right here, but I attempted to describe it in a previous post on the PlayStation SPU. The SNES APU works extremely similarly - its ADPCM encoding format is a bit different, but for the purposes of this post that’s (mostly) not important.

One thing that both chips have in common is that “voices” (i.e. audio channels) can play audio samples at a configurable sample rate. The audio chip itself always outputs audio at a fixed sample rate (about 32040 Hz for SNES and 44100 Hz for PS1), and the raw ADPCM samples are encoded at that same sample rate, but the voices don’t necessarily have to play at that sample rate - and they typically don’t. This creates a problem: when a voice steps through its decoded ADPCM samples at some other rate, how should the chip resample that stream so that it still outputs samples at 32040/44100 Hz?

I went into a bit of detail on how the PS1 SPU does this here, and the SNES APU does almost the exact same thing with a different coefficient table. The short version is that when a voice needs to output a 32040/44100 Hz sample, it performs 4-point Gaussian interpolation between the nearest 4 decoded ADPCM samples, choosing the interpolation coefficients based on the distance from the nearest 2 samples.

This works well enough, but when a voice is playing at a very low sample rate, the Gaussian interpolation process can cause the audio to sound very muffled and blurry (as much as the word “blurry” can apply to audio). SNES games in particular often used low sample rate audio to save space - SNES cartridges could only fit a max of 4MB of ROM without using a coprocessor (which few games did), plus the SNES APU only has a measly 64KB of audio RAM shared between the SPC700 (embedded audio CPU) and the actual ADPCM playback chip.

It’s possible to do better! Or different, at least.

A Different Algorithm

First, credit to the paper linked in this blog post for the actual algorithm I’m about to describe here: https://yehar.com/blog/?p=197

There’s a bit of discussion on this StackOverflow post which includes a link to that blog post.

I’ll also note that other SNES emulators have had this enhancement feature for a very long time, although I’m not sure if any PS1 emulators have it.

The paper describes a number of different algorithms for audio resampling, but I’m going to focus on one that I’ve found to work well: 4-point 3rd-order Hermite interpolation, i.e. cubic Hermite interpolation. The 6-point 5th-order version is a bit higher quality, but the 4-point version is much easier to drop into an existing SNES APU or PS1 SPU implementation since you already need 4 points to emulate 4-point Gaussian interpolation.

The paper describes both an “x-form” calculation and a “z-form” calculation. These produce the same result in different ways, but the x-form calculation involves fewer arithmetic operations, so I’m going to use that one.

Suppose you have a function to perform Gaussian interpolation, with an API that looks like this:

fn gaussian_interpolation(samples: [i16; 4], pitch_counter: u16) -> i16 {
    ...
}

pitch_counter is the thing that increments based on the voice’s configured sample rate. Both SNES and PS1 use bits 4-11 of the pitch counter to determine the interpolation coefficients, and they both advance to the next ADPCM sample whenever the highest 4 bits of the pitch counter are non-zero.

The interpolation algorithm has 5 inputs: y0, y1, y2, y3, and x. The 4 y values are the 4 samples to interpolate between, with y0 being the oldest and y3 being the newest. The x value is the current distance between samples on a scale from 0 to 1. In the context of SNES and PS1 emulation, the x value can be computed based on the lower bits of the pitch counter.
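To make the pitch counter bit layout a bit more concrete, here's a rough sketch of how a voice might step its counter and derive both the hardware-style Gaussian index and the x value. The `PitchCounter` type and its method names are made up for illustration, and this is a simplified model that assumes a pitch value of 0x1000 means "advance one ADPCM sample per output sample"; a real implementation has more going on around ADPCM block decoding.

```rust
/// Simplified model of a voice's pitch counter (hypothetical type, not from any emulator).
/// Bits 0-11 are the fractional position between samples; bits 12-15 count whole samples.
struct PitchCounter {
    value: u16,
}

impl PitchCounter {
    /// Adds the voice's pitch value and returns how many ADPCM samples to advance by.
    /// Assumption: 0x1000 corresponds to one ADPCM sample per output sample.
    fn step(&mut self, pitch: u16) -> u16 {
        // Widen to u32 so the addition cannot overflow for any u16 pitch value
        let sum = u32::from(self.value) + u32::from(pitch);
        // Keep only the 12 fractional bits; the upper bits are consumed as whole samples
        self.value = (sum & 0xFFF) as u16;
        (sum >> 12) as u16
    }

    /// Bits 4-11: the 8-bit index that hardware uses to pick Gaussian coefficients
    fn gaussian_index(&self) -> usize {
        usize::from((self.value >> 4) & 0xFF)
    }

    /// All 12 fractional bits mapped to the range [0, 1), usable as the x input
    fn fraction(&self) -> f64 {
        f64::from(self.value & 0xFFF) / 4096.0
    }
}
```

With a pitch of 0x0800 (half of the output sample rate), the first step advances 0 samples and leaves the voice halfway between two samples (index 128, x = 0.5), and the second step advances 1 sample and resets the fraction to 0.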

We can create a function with the same API as the Gaussian interpolation function with a different interpolation algorithm. Here’s mine, which is really just a Rustified version of what’s in the paper:

const MIN: f64 = ...;
const MAX: f64 = ...;

fn hermite_interpolation(samples: [i16; 4], pitch_counter: u16) -> i16 {
    // Convert samples to floating-point
    let [y0, y1, y2, y3] = samples.map(f64::from);

    // Actual hardware only uses bits 4-11 of the pitch counter for interpolation, but
    // we're already going beyond actual hardware, so let's use all 12 of the lowest bits
    let x = f64::from(pitch_counter & 0xFFF) / 4096.0;

    // The actual cubic Hermite algorithm, slightly modified to avoid needing to do any divisions
    let c0 = y1;
    let c1 = 0.5 * (y2 - y0);
    let c2 = y0 - 2.5 * y1 + 2.0 * y2 - 0.5 * y3;
    let c3 = 0.5 * (y3 - y0) + 1.5 * (y1 - y2);
    let result = ((c3 * x + c2) * x + c1) * x + c0;

    // Round to the nearest integer and clamp to a valid range
    result.round().clamp(MIN, MAX) as i16
}

That’s pretty much it! MIN and MAX are different between SNES and PS1 - SNES should clamp to signed 15-bit while PS1 should clamp to signed 16-bit, to match the range of values that actual hardware can produce.
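The clamp isn't just a formality: unlike a plain weighted average, the cubic curve can overshoot the samples it passes through. Here's a small sketch that shows this, using a floating-point copy of the formula above plus a hypothetical `clamp_to_signed_bits` helper. I'm assuming "signed 15-bit" means the two's-complement range -16384..=16383 and "signed 16-bit" means -32768..=32767.

```rust
/// Cubic Hermite interpolation on floating-point values, with x in [0, 1]
/// (same formula as above, minus the rounding and clamping).
fn hermite(y: [f64; 4], x: f64) -> f64 {
    let [y0, y1, y2, y3] = y;
    let c0 = y1;
    let c1 = 0.5 * (y2 - y0);
    let c2 = y0 - 2.5 * y1 + 2.0 * y2 - 0.5 * y3;
    let c3 = 0.5 * (y3 - y0) + 1.5 * (y1 - y2);
    ((c3 * x + c2) * x + c1) * x + c0
}

/// Rounds and clamps to a signed two's-complement range of the given width.
/// Hypothetical helper; `bits` is assumed to be between 1 and 16 so the result fits in i16.
/// bits = 15 gives -16384..=16383, bits = 16 gives -32768..=32767.
fn clamp_to_signed_bits(value: f64, bits: u32) -> i16 {
    let max = f64::from((1_i32 << (bits - 1)) - 1);
    let min = -f64::from(1_i32 << (bits - 1));
    value.round().clamp(min, max) as i16
}
```

For the samples [0, 32767, 32767, 0] at x = 0.5, `hermite` returns 36862.875, which is past the top of the 16-bit range, so `clamp_to_signed_bits(.., 16)` brings it back down to 32767. At the endpoints there's no overshoot: x = 0 returns y1 exactly, and x = 1 returns y2.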

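This isn’t the most efficient thing, but it’s not that big of a deal for something that will only be called once per voice per output sample. At 60 frames per second, that works out to roughly 4,300 calls per frame for the SNES’s 8 voices and roughly 17,600 calls per frame for the PS1’s 24 voices. If you wanted to get fancy you could use AVX instructions to do the interpolation for multiple voices simultaneously, but that’s really not necessary.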

Once you have these two interpolation functions with the same API, it shouldn’t be too difficult to modify your emulator to support switching between the two algorithms.
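As one possible way to wire that up, here's a minimal sketch of a setting that picks between the two functions. The `AdpcmInterpolation` enum, the `InterpolateFn` alias, and the `select` method are all names I made up for this example; your emulator's config plumbing will likely look different.

```rust
/// Signature shared by both interpolation functions
type InterpolateFn = fn([i16; 4], u16) -> i16;

/// User-facing setting (hypothetical name)
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
enum AdpcmInterpolation {
    /// Matches actual hardware, so it is the default here
    #[default]
    Gaussian,
    /// The cubic Hermite enhancement
    CubicHermite,
}

impl AdpcmInterpolation {
    /// Picks which function a voice should call for the current setting
    fn select(self, gaussian: InterpolateFn, hermite: InterpolateFn) -> InterpolateFn {
        match self {
            Self::Gaussian => gaussian,
            Self::CubicHermite => hermite,
        }
    }
}
```

A voice could then call something like `setting.select(gaussian_interpolation, hermite_interpolation)(samples, pitch_counter)` wherever it currently calls the Gaussian function directly. Defaulting to Gaussian keeps the hardware-accurate behavior unless the user opts in.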

Comparisons - SNES

Let’s hear how it sounds!

Here’s a snippet from Stickerbrush Symphony in Donkey Kong Country 2, recorded in my emulator while emulating actual hardware’s Gaussian interpolation:

Here’s the same snippet recorded with cubic Hermite interpolation:

Such a small change, but (at least to my ears) it makes a huge difference! That sounds significantly sharper and clearer!

All three of the Donkey Kong Country games use pretty low sample rate audio, so the difference is more noticeable here than it will be in some other games. I find the difference barely noticeable in the SNES Mega Man X and Final Fantasy games, for instance.

For another example, here’s the voice sample from the beginning of Super Metroid’s intro, first with Gaussian:

And then with Hermite:

I find the enunciations significantly clearer in the Hermite version. Super Metroid’s music also seems to benefit pretty well from cubic Hermite interpolation.

For one final example that’s more “huh” than anything else, here’s the game start sound effect from Contra 3’s title screen, first with Gaussian:

Hermite:

I don’t know about better, but that certainly sounds different! And no, the audio crackle is not a recording issue - there is a noticeable crackle there when using cubic Hermite interpolation.

Comparisons - PS1

While PS1 games weren’t subject to the same size constraints as SNES games, there are still some games that benefit from fancier ADPCM interpolation. As with the SNES comparisons, all of these examples were recorded in my emulator.

One example is actually the BIOS startup sound, which sort of makes sense given that the audio samples had to fit in the 512KB BIOS ROM along with everything else in the BIOS. Here’s the Gaussian version:

Hermite version:

It’s not quite as noticeable as some of the SNES examples, but I personally find the Hermite version noticeably clearer.

For an actual game example, here’s the first part of the title screen music from Mega Man 8, first with Gaussian interpolation:

And with Hermite interpolation:

I think the last part in particular (0:21 on) sounds noticeably better with Hermite interpolation.

Here’s one of Valkyrie Profile’s voice samples recorded with Gaussian interpolation:

Hermite:

The voice samples don’t all sound that much clearer, but the battle grunts in particular do.

Finally, here’s a snippet from one of Valkyrie Profile’s dungeon songs, Gaussian first:

And Hermite:

The difference isn’t as noticeable as some of the other examples (partly because of how heavily Valkyrie Profile uses reverb), but I do think the Hermite version sounds clearer here as well.

It’s worth noting that PS1 games have three different ways of playing audio: through an SPU voice (as all of the above examples do), through a CD digital audio track, and through CD-XA ADPCM sectors (encoded at either 18900 Hz or 37800 Hz).

There’s no real enhancement possible for CD-DA tracks because they’re already 16-bit samples at 44100 Hz, but it might be possible to slightly improve the audio quality in games that use CD-XA ADPCM sectors by using a more intelligent resampling algorithm than actual hardware does (it’s a form of zig-zag interpolation rather than Gaussian). I’d be surprised if a very significant improvement is possible but I haven’t yet tried it myself. Perhaps a topic for a short follow-up post.

Updated 2024-11-08