(Why you shouldn’t try) Implementing WaveNet on a GPU

Rendering natural-sounding speech is a hot problem, with Amazon, Google and Baidu among others rolling out cloud-based solutions for services like voice assistants. Until recently, the state of the art in converting text to speech was WaveNet, from Google DeepMind. (More recently they published a parallel implementation suitable for GPUs; this is what they use in production, but it has a different set of trade-offs, e.g. more complicated training.)

Anyway, the point of this post isn’t to actually implement WaveNet, but to explore how something WaveNet-like runs on a GPU. This came out of a question from my friends at Myrtle Software, who specialise in compiling code to run on FPGAs, typically deployed in a datacentre.

The question in question is: is the GPU implementation in this paper by Baidu really representative of the best you can do on a GPU?

What WaveNet does

The original WaveNet algorithm synthesises audio at 16kHz, one sample at a time. Each sample is the output of a deep neural network whose inputs include (among other things) the previous hundred or so samples. The upshot is that there’s limited parallelism to exploit, since we need to compute sample n completely before any of the work for sample n+1 can kick off. (Though there are dynamic-programming-style optimisations available to avoid doing redundant work.)

Now, an important point about this set-up is quite how “deep” deep is: dozens and dozens of layers of activations. The proxy we’re going to use as a benchmark to approximate WaveNet is to take a 128-element vector and multiply it by eighty different 128×128 matrices in turn. With full 32-bit values that works out to 80 x 128 x 128 x 4 bytes ≈ 5MB of weights we need to access to compute each sample. This is a major problem for generating an efficient implementation.
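
To make that proxy concrete, here’s a minimal single-threaded reference version (my own sketch, with illustrative names like DIM and LAYERS – it’s not code from the Baidu paper or from the kernel discussed below):

// Reference version of the proxy benchmark: one "sample" is the result of
// multiplying a 128-element vector by eighty 128x128 matrices in turn.
#include <cstdio>
#include <vector>

constexpr int DIM    = 128;   // vector / matrix dimension
constexpr int LAYERS = 80;    // serial matrix multiplies per sample

int main()
{
    // 80 * 128 * 128 * 4 bytes ~= 5MB of weights touched per sample
    std::vector<float> weights(LAYERS * DIM * DIM, 0.01f);
    std::vector<float> x(DIM, 1.0f), y(DIM);

    for (int l = 0; l < LAYERS; ++l) {            // strictly serial: layer l+1
        const float* W = &weights[l * DIM * DIM]; // needs layer l's output
        for (int r = 0; r < DIM; ++r) {
            float acc = 0.0f;
            for (int c = 0; c < DIM; ++c)
                acc += W[r * DIM + c] * x[c];
            y[r] = acc;
        }
        x.swap(y);                                // output feeds the next layer
    }
    printf("%zu bytes of weights, x[0] = %f\n",
           weights.size() * sizeof(float), x[0]);
}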

The Baidu paper uses an interesting strategy to combat this problem. They basically say “clearly it would be ridiculous to keep reloading matrix weights all the time. But look – a big GPU actually has a lot of very fast local storage close to the execution units: the register files!”

It’s true that there are a lot of registers in a big discrete GPU. A typical NVIDIA Streaming Multiprocessor (SM) has 65,536 32-bit registers, i.e. a 256KB register file. This means if you have 20 SMs you’ve got 5MB of register space in total – enough to keep all of those matrix coefficients resident in fast local memory.

But you’ve probably spotted the problem here: although we’ve got enough memory overall, there’s no space for any redundancy, so each matrix can only live in one place – a single SM. Because of the serial dependency in the computation, this means that only one SM can be “active” at once. So before we even start we’re limited to one SM in twenty, i.e. an upper bound of 5% compute efficiency.

Baidu results

And in fact it’s worse than that. The strategy feels a bit odd (“it’s not how GPUs are supposed to be programmed”), and perhaps unsurprisingly it turns out that the compiler is really unhappy about the idea of keeping all those weights resident in registers. So things go from bad to worse, and the actual compute efficiency they achieve in the end is more like 0.1%. This is bad.

Can we do better?

TL;DR – yes, but…

The Baidu implementation runs at something like 0.2x real-time. A simple bandwidth calculation shows that even streaming through all the weights over and over again will be faster than that: real-time generation means reading 5MB of weights 16,000 times a second, i.e. 5MB x 16kHz = 80GB/s. (A top-end discrete GPU has bandwidth >300GB/s, and even the ageing and frail 860m in my laptop has 80GB/s.)

So I wrote a CUDA kernel to do the proxy calculation. Here are some of the fiddly details:

A. We can’t efficiently launch ~10,000 kernels/sec, so to minimise our overheads we have to run an uber-kernel. (That is, we launch a small number of thread groups – so they can all be in flight on the GPU at once – and implement blocking/waiting/synchronisation behaviour by hand.)
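
Roughly, the skeleton looks like this (a sketch with made-up names rather than my actual kernel; the interesting parts are filled in under B and C):

// Persistent "uber-kernel" skeleton: launched once, with few enough blocks
// that they are all resident on the GPU, and the loop over samples and
// layers lives inside the kernel instead of on the host.
constexpr int DIM    = 128;
constexpr int LAYERS = 80;

__global__ void uber_kernel(const float*  weights,   // LAYERS x DIM x DIM
                            float*        vec,       // current 128-element vector
                            unsigned int* counter,   // for hand-rolled sync (see C)
                            int           num_samples)
{
    for (int s = 0; s < num_samples; ++s) {
        for (int l = 0; l < LAYERS; ++l) {
            // 1. each block computes its slice of rows of W_l * vec   (see B)
            // 2. every block waits for the others before moving on    (see C)
        }
    }
}

// Host side: one launch covers the whole stream of samples, instead of
// ~10,000 launches per second of audio.
// uber_kernel<<<16, 256>>>(d_weights, d_vec, d_counter, num_samples);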

B. We split each matrix multiplication across the thread groups/SMs. Conceptually, each thread group will be responsible for a few rows of the matrix multiply.
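
Something like this sketch (again with illustrative names; in the real kernel the dot products themselves are also split across threads – see E – rather than one whole row per thread, but the partitioning across blocks is the same idea):

// Each thread group (block) owns a contiguous slice of the 128 output rows.
constexpr int DIM = 128;

__device__ void block_matvec(const float* W,   // this layer's 128x128 weights
                             const float* x,   // input vector, all 128 elements
                             float*       y)   // output vector in global memory
{
    int rows_per_block = DIM / gridDim.x;               // e.g. 128 / 16 = 8
    int row0           = blockIdx.x * rows_per_block;

    // stride the block's threads across its rows
    for (int r = row0 + threadIdx.x; r < row0 + rows_per_block; r += blockDim.x) {
        float acc = 0.0f;
        for (int c = 0; c < DIM; ++c)
            acc += W[r * DIM + c] * x[c];
        y[r] = acc;
    }
}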

C. We have to store our results to main memory and synchronise after every matrix multiply. Using an atomic add and spinning seems as good as anything. (This is a rare case where adding thread-predication and thread-group-synchronisation actually improves performance: just spin on the counter on a single thread/warp.)
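
A sketch of what that barrier might look like (generation is just a running count of barriers passed so far, the counter is never reset so I’m ignoring wraparound, and it relies on every block being resident, which the uber-kernel launch guarantees):

// Grid-wide barrier: one thread per block bumps a global counter, then spins
// until all gridDim.x blocks have arrived; __syncthreads() releases the rest.
__device__ void grid_barrier(volatile unsigned int* counter,
                             unsigned int           generation)
{
    __threadfence();                    // make our results visible to other SMs
    if (threadIdx.x == 0) {             // predication: only one thread spins
        atomicAdd((unsigned int*)counter, 1u);
        while (*counter < generation * gridDim.x) {
            // spin, waiting on the L2-backed counter
        }
    }
    __syncthreads();                    // hold the rest of the block here
}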

D. Streaming the weights is slow and introduces latency, so it’s a win to unwind the loop. Instead of the straightforward:

loop i
{
    [load weights for iteration i]
    ...
    [do work for iteration i]
}

we instead do:

[load weights for iteration 0]

loop i
{
    [load weights for iteration i+1 into temp]
    ...
    [do work for iteration i]
    ...
    [weights = temp]
}

By overlapping the work for iteration i with the loading for iteration i+1 we hide a lot of the memory latency. (And this is enough: unwinding the loop two iterations is slower again.)
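
In CUDA terms the pipelined loop looks roughly like this sketch (illustrative names, and it assumes each thread owns PER_THREAD consecutive weights of every layer, so the total thread count times PER_THREAD covers the whole 128×128 matrix; the barrier and the buffer swap between layers are elided – it isn’t my actual kernel):

// Software-pipelined weight streaming: issue the loads for layer l+1 before
// doing layer l's multiply-adds, so most of the memory latency is hidden.
constexpr int DIM        = 128;
constexpr int LAYERS     = 80;
constexpr int PER_THREAD = 8;   // multiply-adds per thread per layer (see E)

__device__ void pipelined_layers(const float* weights,   // LAYERS x DIM x DIM
                                 const float* x_in,      // this layer's input
                                 float*       y)         // this layer's output (zeroed)
{
    // this thread owns PER_THREAD consecutive weights of every layer's matrix
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * PER_THREAD;
    int row  = base / DIM;      // the output row these weights contribute to
    int col0 = base % DIM;      // the first input element they multiply

    float w[PER_THREAD], w_next[PER_THREAD];

    // [load weights for iteration 0]
    for (int k = 0; k < PER_THREAD; ++k)
        w[k] = weights[base + k];

    for (int l = 0; l < LAYERS; ++l) {
        // [load weights for iteration l+1 into temp]
        if (l + 1 < LAYERS)
            for (int k = 0; k < PER_THREAD; ++k)
                w_next[k] = weights[(l + 1) * DIM * DIM + base + k];

        // [do work for iteration l]: this thread's slice of one dot product
        float partial = 0.0f;
        for (int k = 0; k < PER_THREAD; ++k)
            partial += w[k] * x_in[col0 + k];
        atomicAdd(&y[row], partial);

        // ... grid-wide barrier (see C), then swap the roles of x_in and y
        //     and re-zero the new output vector – elided here ...

        // [weights = temp]
        if (l + 1 < LAYERS)
            for (int k = 0; k < PER_THREAD; ++k)
                w[k] = w_next[k];
    }
}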

E. Not maximising occupancy. My 860m is wide enough that, at full occupancy, each thread only has 4 multiply-adds to do per layer. But it’s actually faster to do 8 multiply-adds per thread and have half as many thread groups to synchronise. (It doesn’t work to halve the number of threads again, though.)

F. And a bunch of minor things. For instance, it’s better to spread the intermediate vector results across several cache lines, so there’s no contention between different SMs trying to write the same line. But the difference this makes is pretty much in the noise.
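
For what it’s worth, the cache-line point just means padding each block’s slice of the intermediate results out to its own line(s), assuming 128-byte cache lines – something like this (illustrative constants only):

// Pad each block's slice of the intermediate vector so that no two SMs
// ever write to the same 128-byte cache line.
constexpr int DIM             = 128;
constexpr int NUM_BLOCKS      = 16;
constexpr int ROWS_PER_BLOCK  = DIM / NUM_BLOCKS;         // 8 floats = 32 bytes
constexpr int FLOATS_PER_LINE = 128 / sizeof(float);      // 32 floats per line
constexpr int PADDED_SLICE    = ((ROWS_PER_BLOCK + FLOATS_PER_LINE - 1)
                                 / FLOATS_PER_LINE) * FLOATS_PER_LINE;   // 32

// element r of block b's slice lives at y_padded[b * PADDED_SLICE + r], so
// the padded vector is NUM_BLOCKS * PADDED_SLICE floats instead of DIM.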

Results

After optimisation, this artificial benchmark ends up running at about 1.5% compute efficiency (at least on the 860m in my laptop). This is terrible, terrible performance. In some sense it’s an order-of-magnitude improvement on the Baidu approach, but we’re not really comparing apples-to-apples. (This is one case where it’s actually a good thing that my GPU is small!)

Why is it so bad? In a word: synchronisation. The only way that different SMs can communicate is via L2, with latency in the hundreds of cycles. At each stage of the inner loop, a single thread only has 8 multiply-adds to do, so performance is dominated by the time spent waiting on L2.

Conclusion

This exercise highlights a fundamental limitation of an NVIDIA-style “big” GPU: we can’t parallelise efficiently across the whole processor if we need frequent synchronisation.

We can see in the Baidu results that a CPU is a better choice in terms of efficiency for this problem. This is because CPUs are optimised for high single-threaded performance, with narrow(ish) vector processors, and the relative cost of synchronising through memory is much lower.

Even the CPU doesn’t look great in terms of absolute time. To minimise latency we need to parallelise (since we’ll never be able to run a small number of floating-point units fast enough), and that means solving the problem of frequent communication across a large parallel array of ALUs.