Anyway, the point of this post isn’t to actually implement WaveNet, but to explore how something-WaveNet-like runs on a GPU. This came out of a question from my friends at Myrtle Software, who specialise in compiling code to run on FPGAs, typically to run in a datacentre.

The question in question is: is the GPU implementation in this paper by Baidu really representative of the best you can do on a GPU?

The original WaveNet algorithm synthesises audio at 16kHz one sample at a time. Each sample is the output of a deep neural network whose inputs include (among other things) the previous about-a-hundred samples. The upshot of this is there’s limited parallelism to exploit, since we need to compute sample n completely before any of the work for sample n+1 can kick off. (Though there are dynamic-programming-style optimisations available to avoid doing redundant work.)

Now, an important point about this set-up is quite how “deep” deep is: dozens and dozens of layers of activations. The proxy we’re going to use as a benchmark to approximate WaveNet is to take a 128-element vector, and apply serial matrix multiplication by eighty different 128×128 matrices in turn. With full 32-bit values this works out to ~5MB of weights we need to access to compute each sample. This is a major problem for generating an efficient implementation.
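To make the proxy concrete, here’s a minimal NumPy sketch of the benchmark (my own code, not from the Baidu paper): one output sample is the result of pushing a 128-vector through eighty 128×128 matrices in strict sequence.

```python
import numpy as np

DEPTH, WIDTH = 80, 128
rng = np.random.default_rng(0)
weights = rng.standard_normal((DEPTH, WIDTH, WIDTH)).astype(np.float32)
x = rng.standard_normal(WIDTH).astype(np.float32)

# Every sample touches every weight once: 80 * 128 * 128 * 4 bytes ~ 5MB.
bytes_per_sample = weights.nbytes

def sample(v):
    # Strictly serial: layer i+1 can't start until layer i has finished.
    for w in weights:
        v = w @ v
    return v

y = sample(x)
```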

The Baidu paper uses an interesting strategy to combat this problem. They basically say “clearly it would be ridiculous to keep reloading matrix weights all the time. But look – a big GPU actually has a lot of very fast local storage close to the execution units: the register files!”

It’s true that there are **a lot** of registers in a big discrete GPU. A typical NVIDIA Streaming Multiprocessor (SM) has 65,536 32-bit registers, ie a 256KB register file. This means if you have 20 SMs you’ve got enough space to keep all of those matrix coefficients resident in fast local memory.

But you’ve probably spotted the problem here: although we’ve got enough memory overall, there’s no space for any redundancy, and each matrix can only live in one place: a single SM. Because of the serial dependency in the computation, this means that only one SM can be “active” at once. So before we start we’re limited to an upper bound of 5% compute efficiency.

And in fact it’s worse than that. The strategy feels a bit odd (“it’s not how GPUs are supposed to be programmed”) and perhaps unsurprisingly, it turns out that the compiler is really unhappy about the idea of keeping all those weights resident in registers. So things go from bad to worse, and the actual compute efficiency they achieve in the end is more like 0.1%. This is bad.

TL;DR – yes, but…

The Baidu implementation runs at something like 0.2x real-time. A simple bandwidth calculation shows that even streaming through all the weights over and over again should beat that: real-time generation needs 5MB × 16kHz = 80GB/s. (A top-end discrete GPU has bandwidth >300GB/s, and even the aging and frail 860m in my laptop has 80GB/s.)
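As a sanity check on that arithmetic (plain Python, nothing GPU-specific):

```python
weight_bytes = 80 * 128 * 128 * 4        # ~5MB of 32-bit weights
sample_rate_hz = 16_000                  # one full pass over the weights per sample
required_bw = weight_bytes * sample_rate_hz
print(required_bw / 1e9)                 # ~84 GB/s with the exact byte count
```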

So I wrote a CUDA kernel to do the proxy calculation. Here are some of the fiddly details:

**A**. We can’t efficiently launch ~10,000 kernels/sec so to minimise our overheads we have to run an uber-kernel. (That is, we launch a small number of thread groups – so they can all be in flight on the GPU at once – and implement blocking/waiting/synchronisation behaviour by hand.)

**B**. We split each matrix multiplication across the thread groups/SMs. Conceptually, each thread group will be responsible for a few rows of the matrix multiply.

**C**. We have to store our results to main memory and synchronise after every matrix multiply. Using an atomic add and spinning seems as good as anything. (This is a rare case where adding thread-predication and thread-group-synchronisation actually *improves* performance: just spin on the counter on a single thread/warp.)
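Points A–C can be emulated on the CPU with threads. This is a sketch of the structure only (my own code, standing in for the actual CUDA kernel): a fixed pool of “thread groups” each owns a few rows of every matrix, and a barrier plays the role of the atomic-add-and-spin.

```python
import threading
import numpy as np

DEPTH, WIDTH, GROUPS = 8, 16, 4
rng = np.random.default_rng(1)
mats = rng.standard_normal((DEPTH, WIDTH, WIDTH))
vec = rng.standard_normal(WIDTH)

# Double-buffered intermediate vectors in "main memory".
buffers = [vec.copy(), np.empty(WIDTH)]
# Stands in for the atomic counter that each group bumps and spins on.
barrier = threading.Barrier(GROUPS)

def group(g):
    rows = slice(g * WIDTH // GROUPS, (g + 1) * WIDTH // GROUPS)
    for i in range(DEPTH):
        src, dst = buffers[i % 2], buffers[(i + 1) % 2]
        dst[rows] = mats[i][rows] @ src   # this group's rows of layer i
        barrier.wait()                    # nobody starts layer i+1 early

threads = [threading.Thread(target=group, args=(g,)) for g in range(GROUPS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
result = buffers[DEPTH % 2]

# Reference: the plain serial chain.
expected = vec.copy()
for m in mats:
    expected = m @ expected
```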

**D**. Streaming the weights is slow and introduces latency, so it’s a win to unwind the loop. Instead of the straightforward:

{ [load weights for iteration i] ... [do work for iteration i] }

we instead do:

[load weights for iteration 0] loop i { [load weights for iteration i+1 into temp] ... [do work for iteration i] ... [weights = temp] }

By overlapping the work for iteration i with the loading for iteration i+1 we hide a lot of the memory latency. (And one step of lookahead is enough: unwinding the loop by two iterations is slower again.)
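In Python the same loop shape looks like this (a functional sketch only, my own code — NumPy won’t actually overlap the copy with the multiply, the point is just that the two loop structures compute the same thing):

```python
import numpy as np

DEPTH, WIDTH = 8, 16
rng = np.random.default_rng(2)
mats = rng.standard_normal((DEPTH, WIDTH, WIDTH))
v0 = rng.standard_normal(WIDTH)

v = v0.copy()
current = mats[0].copy()              # [load weights for iteration 0]
for i in range(DEPTH):
    if i + 1 < DEPTH:
        temp = mats[i + 1].copy()     # [load weights for iteration i+1 into temp]
    v = current @ v                   # [do work for iteration i]
    if i + 1 < DEPTH:
        current = temp                # [weights = temp]

# Reference: the straightforward load-then-work loop.
expected = v0.copy()
for m in mats:
    expected = m @ expected
```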

**E**. Not maximising occupancy. My 860m is wide enough that at full occupancy each thread has only 4 multiply-adds to do. But it’s actually faster to do 8 multiply-adds per thread and have half as many thread groups to synchronise. (It doesn’t work to halve the number of threads again though.)

**F**. And a bunch of minor things. For instance, it’s better to spread the intermediate vector results across several cache lines, so there’s no contention between different SMs trying to write the same line. But the difference this makes is pretty much in the noise.

**Results**

After optimisation, this artificial benchmark ends up running at about 1.5% compute efficiency (at least on the 860m in my laptop). This is terrible, terrible performance. In some sense it’s an order-of-magnitude improvement on the Baidu approach, but we’re not really comparing apples-to-apples. (This is one case where it’s actually a good thing that my GPU is small!)

Why is it so bad? In a word: synchronisation. The only way that different SMs can communicate is via L2, with latency in the hundreds of cycles. At each stage of the inner loop, a single thread only has 8 multiply-adds to do, so performance is dominated by the time spent waiting on L2.

**Conclusion**

This exercise highlights a fundamental limitation of an NVIDIA-style “big” GPU: we can’t parallelise efficiently across the whole processor if we need frequent synchronisation.

We can see in the Baidu results that a CPU is a better choice in terms of efficiency for this problem. This is because CPUs are optimised for high single-threaded performance, with narrow(ish) vector processors, and the relative cost of synchronising through memory is much lower.

Even the CPU doesn’t look great in terms of absolute time. To minimise latency we need to parallelise (since we’ll never be able to run a small number of floating-point units fast enough), and that means solving the problem of frequent communication across a large parallel array of ALUs.

Consider the integral for computing the L1 vector of coefficients $\mathbf{L}$:

$$\mathbf{L} = \frac{1}{4\pi}\int_{S^2} f(s)\, s\,\mathrm{d}s$$

This looks somewhat similar to the definition of irradiance, and if we dot the whole expression with the normal vector $n$ we get:

$$n \cdot \mathbf{L} = \frac{1}{4\pi}\int_{n \cdot s > 0} f(s)\,(n \cdot s)\,\mathrm{d}s - \frac{1}{4\pi}\int_{n \cdot s < 0} f(s)\,(-n \cdot s)\,\mathrm{d}s$$

which yields

$$E(n) - E(-n) = 4\, n \cdot \mathbf{L}$$

This is actually quite surprising — what it says is that the L1 band captures the antipodal difference in irradiance perfectly. If you know the exact irradiance in a direction, and you know the L1 vector, then you can compute the exact irradiance in the opposite direction. Or, to put it another way, the only odd terms in the irradiance expansion are the linear terms in the L1 band.
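This is easy to check numerically. A sketch under the conventions used in this series (irradiance normalised so a unit point light peaks at 4, and the L1 vector computed as the intensity-weighted average direction — the code and names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)

# Random lighting environment: a handful of point lights.
dirs = rng.standard_normal((5, 3))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
energies = rng.uniform(0.1, 1.0, 5)

def irradiance(n):
    # E(n) = 4 * sum of energy * clamped cosine, per this series' conventions.
    return 4.0 * np.sum(energies * np.maximum(0.0, dirs @ n))

L1 = (energies[:, None] * dirs).sum(axis=0)   # the L1 vector

n = np.array([0.3, -0.5, 0.8])
n /= np.linalg.norm(n)

diff = irradiance(n) - irradiance(-n)   # antipodal difference in irradiance
predicted = 4.0 * (L1 @ n)              # what the L1 band predicts
```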

Perhaps this is just a curiosity, but it feels like it might give us a useful clue for improving higher-order SH irradiance models.

Radiance is the incoming light at a point. Since this is a spherical function, it can be approximated by a truncated SH series, and this approximation can be computed by sampling (in, say, your favourite path tracer or real-time global illumination system). However, for shading geometry we need to know outgoing light, also known as irradiance.

For the purposes of this post we assume the perfectly diffuse Lambertian BRDF, which is a constant function. Therefore irradiance is the following hemispherical integral:

$$E(n) = \frac{1}{\pi}\int_{n \cdot s > 0} f(s)\,(n \cdot s)\,\mathrm{d}s$$

where $(n \cdot s)$ is the geometry term.

Computing the SH approximation, we can replace $f$ by its truncated expansion $\tilde{f}$ and compute the SH coefficients of the geometry term independently:

$$E(n) \approx \frac{1}{\pi}\int_{n \cdot s > 0} \left(c_0 + 3\,\mathbf{L} \cdot s\right)(n \cdot s)\,\mathrm{d}s$$

Putting it all together, we find that the irradiance reconstruction is:

$$E(n) \approx c_0 + 2\,\mathbf{L} \cdot n$$

Here the radiance reconstruction constants have been absorbed into the irradiance computation. This means we have only one set of constants to apply to convert the measured radiance coefficients into irradiance.

For quite a while the Enlighten SH solver generated L1 *irradiance* using this linear reconstruction. We considered it an optimisation to fold the conversion into the solver itself. However it’s pretty easy to see this has serious problems. One nice feature of our definition for L1 is that we know $0 \le |\mathbf{L}| \le c_0$. The lower bound is attained when the lighting is completely symmetrical or ambient (with no preferred direction), while the upper bound occurs if the incoming light comes from a single direction. However, the factor of two in the linear radiance-to-irradiance conversion means that for very strongly directional lighting environments, our approximation for irradiance will be negative in some direction. This is clearly undesirable since irradiance can never be negative in reality, and causes obvious visual artefacts in practice.

To improve this situation, we eventually unbaked the conversion constants in the solver (so it once again generated SH radiance), and applied a more complex conversion to irradiance in shader code. The rest of this post contains the motivation and derivation of this improved L1 irradiance model.

We already know that the formula given above is the best possible linear conversion to irradiance. Therefore we need to look for a non-linear alternative. To narrow this down we’re first going to find a better polynomial approximation for the pure directional case $|\mathbf{L}| = c_0$. Once we’re happy with that we can generalise as $|\mathbf{L}|$ varies. Finally, we need to ensure the energy and dynamic range of the outgoing irradiance are correct for our measured radiance values.

For an ideal point light source with energy 1 in direction $d$, Lambertian irradiance is given by:

$$E(n) = 4\max(0,\, d \cdot n)$$

This has a minimum value of 0 (attained on a whole hemisphere), and a maximum value of 4. Compare this to the linear reconstruction for L1 irradiance:

$$E_{\text{lin}}(n) = 1 + 2\, d \cdot n$$

This has a minimum of -1 and a maximum of 3. Although this integrates to the correct energy as written, when we use it in practice in a shader we’ll clamp negative values to 0, with the net result that the total energy of the function will be too high. Therefore the linear reconstruction gives us the wrong energy, and the wrong dynamic range (incorrect maximum value).
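A quick numeric check of those claims, assuming the linear reconstruction for a unit directional light is E(t) = 1 + 2t with t = d·n (my reading of the factor-of-two conversion):

```python
import numpy as np

t = np.linspace(-1.0, 1.0, 200001)
E = 1.0 + 2.0 * t

energy = 2.0 * E.mean()                   # integral under the curve on [-1, 1]
clamped = np.maximum(E, 0.0)              # what a shader actually does
clamped_energy = 2.0 * clamped.mean()     # ~2.25: clamping inflates the energy

lo, hi = E.min(), E.max()                 # -1 and 3
```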

To find our non-linear approximation in the pure directional case, let’s first specify that the approximation is smooth. This makes sense since we want it to be one endpoint of a smoothly varying overall model. To satisfy this, we will try to fit a polynomial function (which will automatically be smooth) in the variable $t = d \cdot n$.

There are constraints on this function that we already know: the total energy (integral under the curve) should be 2, the minimum value 0 and the maximum value 4.

Additionally, we are going to specify derivative conditions at -1 (the point opposite the light source). The true irradiance is 0 on the whole back-facing hemisphere. We can’t attain that with a polynomial, but we can at least specify that $f'(-1) = f''(-1) = 0$.

This gives us five constraints to satisfy, so we can fit a unique quartic $f(t)$. Solving for the coefficients gives us:

$$f(t) = \frac{(1 + t)^3}{2}$$

(the quartic term drops out, and the unique fit is in fact this cubic).

We now know what to do at the two extreme cases. When $|\mathbf{L}| = c_0$ we can plug in our non-linear approximation. Alternatively, if $|\mathbf{L}| = 0$ we have no directional variation at all, so irradiance is simply a constant function. But what should we do in between the two extremes?

For each value in between there is a range of possibilities, from a linear model to something “as non-linear” as our extreme model. It’s reasonable to suppose that our model should become more linear as it becomes more ambient (less directional). So let’s define the model power to be:

$$p = 1 + 2\,\frac{|\mathbf{L}|}{c_0}$$

This varies between 1 and 3 based on the length of the L1 vector. Our final model will then be based on $t = \hat{\mathbf{L}} \cdot n$, that is the dot product with the normalised direction of the L1 vector.

At this point it’s useful to introduce a new variable $q = \frac{1}{2}(1 + t)$, so the model is of the form $a + b\,q^p$ where $a$ and $b$ are constants we need to find.

We have two constraints to apply to find $a$ and $b$: total energy, and dynamic range.

We are assuming that the total energy is 2 (only the length of L1 is varying, not L0). Therefore we want to find $a$ so that $\int_{-1}^{1} (a + b\,q^p)\,\mathrm{d}t = 2$. It follows that $a = 1 - \frac{b}{p + 1}$. Our model is then a linear interpolation between the ambient and directionally-varying terms:

$$f(t) = \left(1 - \frac{b}{p + 1}\right) + b\,q^p$$

The correct dynamic range (maximum – minimum) varies linearly with the length of L1 and equals $4\,|\mathbf{L}|/c_0$. The dynamic range of the model is $b$. Substituting in the definition of $p$ and solving for $b$ gives:

$$b = 2(p - 1)$$

Putting it all together, the final model is:

$$f(t) = a + b\,q^p$$

where

$$q = \frac{1 + t}{2}, \qquad p = 1 + 2\,\frac{|\mathbf{L}|}{c_0}, \qquad b = 2(p - 1), \qquad a = 1 - \frac{b}{p + 1}$$
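Since all the constants follow from the constraints, the model is easy to sanity-check numerically. This sketch assumes my reconstruction of the constants — b = 2(p−1) and a = 1 − b/(p+1) — and verifies the energy and dynamic-range conditions at a few powers:

```python
import numpy as np

def model(t, p):
    # Non-linear L1 irradiance model (constants as reconstructed above).
    b = 2.0 * (p - 1.0)
    a = 1.0 - b / (p + 1.0)
    q = 0.5 * (1.0 + t)
    return a + b * q ** p

t = np.linspace(-1.0, 1.0, 400001)
energies, ranges = [], []
for p in (1.0, 2.0, 3.0):
    E = model(t, p)
    energies.append(2.0 * E.mean())   # integral under the curve, should be 2
    ranges.append(E.max() - E.min())  # dynamic range, should be 2 * (p - 1)
```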

We’ve discussed how to convert SH radiance (incoming light) into SH irradiance (outgoing bounced light), assuming a diffuse Lambertian BRDF. The natural, linear best-fit method generates functions which can attain negative values, which in the case of light is an obviously undesirable property.

In the case of L1 we fix this with a non-linear reconstruction, fitting the model to the measured direction, energy and dynamic range. This model gives very good “bang for the buck”: it produces decent shading quality, despite only using a small number of coefficients (four per colour channel).

Unfortunately, it’s frustratingly unclear how to extend this non-linear reconstruction naturally to higher SH bands. Instead, the most practical current approach is to remove negative values by “de-ringing”; Peter-Pike Sloan presented a robust numerical method for doing this at SIGGRAPH Asia 2017.

However the traditional definition from quantum mechanics (other variations exist) is somewhat intimidating:

$$Y_\ell^m(\theta, \varphi) = \sqrt{\frac{2\ell + 1}{4\pi}\,\frac{(\ell - m)!}{(\ell + m)!}}\; P_\ell^m(\cos\theta)\, e^{im\varphi}$$

The constants here are well-motivated: they are required so that quantum mechanical probability distributions are normalised to 1… but if we blindly plug this definition straight in for a 3D graphics/lighting use case, we end up evaluating lots of trigonometric functions and carrying around complicated factors of $\pi$ that aren’t necessary. In the literature, these constants are often quoted as decimal quantities without explanation, making everything even more confusing. If instead we make a simplifying redefinition, our SH coefficients become easier to interpret and the computations require no trigonometry and fewer magic constants (and thus, hopefully, less debugging…)

The Spherical Harmonic (SH) series of functions is the analogue of the Fourier Series for functions on the surface of a 2-sphere $S^2$. The complete set of functions is an infinite-dimensional basis for functions on the sphere, but in practical use the series is truncated to give an approximation of an arbitrary function by a finite weighted sum of basis functions.

In computer graphics, spherical harmonics are used as a form of compression, which in turn greatly accelerates computations. For instance, the incoming light at a point in space is a spherical function (since it varies with direction), but using SH its approximation can be compactly represented by a handful of coefficients (the weights for the first few basis functions). This allows computations to be performed on a vector of these SH coefficients, rather than the spherical functions themselves.

Spherical harmonics are a good choice for this compressed representation because the SH approximation is invariant under rotation. That is, the SH approximation of a rotated function is the same as the rotated SH approximation of the original function.

The Spherical Harmonic series is composed of separate “bands” of functions, usually denoted L0, L1, L2 etc. The L0 band is a single constant function; the L1 band consists of three linear functions; the L2 band contains five quadratic functions; and so on. For real-time lighting it is unusual to go past the L2 band (which requires nine total coefficients per channel) since the data and computation requirements become large, and the L2 approximation is already pretty accurate. However use cases in physics and astronomy can require up to L3000!
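The band sizes follow a simple pattern, worth writing down once:

```python
# Band l holds 2l + 1 basis functions; L0..L2 therefore needs 9 coefficients
# per colour channel, and truncating after band n needs (n + 1)^2 in general.
band_sizes = [2 * l + 1 for l in range(3)]
total_l2 = sum(band_sizes)
print(band_sizes, total_l2)
```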

We will return to the bands later, but for now let’s start with some definitions.

Given a basis of functions $y_i$, a spherical function $f$ can be written as a weighted sum:

$$f(s) = \sum_{i=0}^{\infty} c_i\, y_i(s)$$

where the $c_i$ are scalar coefficients.

Truncating the sum gives a finite approximation:

$$\tilde{f}(s) = \sum_{i=0}^{n-1} c_i\, y_i(s)$$

where the vector of coefficients $(c_0, \ldots, c_{n-1})$ fully describes the approximation.

Recall that the standard definition of the inner product of spherical functions $f$, $g$ is a spherical integral:

$$\langle f, g \rangle = \int_{S^2} f(s)\, g(s)\, \mathrm{d}s$$

where $\mathrm{d}s$ is the measure over the sphere and $s$ is the variable of integration.

Therefore if the basis is orthogonal we compute the coefficients for a function $f$ with integrals:

$$c_i = \frac{\langle f, y_i \rangle}{\langle y_i, y_i \rangle}$$

The standard definition of the first SH basis function — the single function in L0 — is:

$$y_0(s) = \frac{1}{2\sqrt{\pi}}$$

This definition is motivated by ensuring normalisation over the sphere, that is:

$$\langle y_0, y_0 \rangle = \int_{S^2} \frac{1}{4\pi}\, \mathrm{d}s = 1$$

However for our use case this “extra” factor of $2\sqrt{\pi}$ is redundant. Consider what happens when we approximate a constant function $f(s) = 1$. First we compute the first SH coefficient:

$$c_0 = \langle f, y_0 \rangle = \int_{S^2} \frac{1}{2\sqrt{\pi}}\, \mathrm{d}s = \frac{4\pi}{2\sqrt{\pi}} = 2\sqrt{\pi}$$

Now to construct the SH approximation we form the weighted sum of basis functions, that is we multiply the coefficient and the basis function:

$$\tilde{f}(s) = c_0\, y_0(s) = 2\sqrt{\pi} \cdot \frac{1}{2\sqrt{\pi}} = 1$$

As expected, the function is reconstructed exactly, but the process is clearly more complicated than necessary: we gain a factor of $2\sqrt{\pi}$ in the SH coefficient only to cancel it with the same factor in the basis function.
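The round trip is small enough to verify in a couple of lines (standard definitions, nothing specific to this post):

```python
import math

y00 = 1.0 / (2.0 * math.sqrt(math.pi))   # the standard Y_0^0
area = 4.0 * math.pi                      # surface area of the unit sphere

c = area * y00           # <1, y00>: integral of the basis function over the sphere
reconstructed = c * y00  # weighted sum of (one) basis function

# c = 2*sqrt(pi): the factor appears in the coefficient purely so that it
# can cancel in the reconstruction.
```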

The factor of $4\pi$ arises since it is the surface area of the unit sphere, but that isn’t relevant to our intended use case. It is much simpler to subsume this factor in a new definition of the inner product.

Redefine the inner product:

$$\langle f, g \rangle = \frac{1}{4\pi}\int_{S^2} f(s)\, g(s)\, \mathrm{d}s$$

The idea is to “normalise so the surface area of the sphere is one.” With this new definition, for $f(s) = 1$ we now obtain the more natural outcome that the corresponding SH coefficient is $c_0 = 1$.

In fact this definition results in a 1-to-1 correspondence between the constructive approach of computing SH coefficients via sampling and the theoretical integral form:

$$c_i = \langle f, y_i \rangle \approx \frac{1}{N} \sum_{j=1}^{N} f(s_j)\, y_i(s_j)$$

where the $s_j$ are $N$ samples taken from a uniform spherical distribution.

This means that to compute the $i$th SH coefficient we simply sum the values of the $i$th basis function for given sample directions and divide the result by $N$.
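Here’s what that looks like in practice (my own sketch): uniform directions, and each coefficient is just a sum divided by $N$.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 200_000
s = rng.standard_normal((N, 3))
s /= np.linalg.norm(s, axis=1, keepdims=True)   # uniform directions on the sphere

f = 1.0 + 2.0 * s[:, 2]        # test function: a constant plus a linear term

c0 = (f * 1.0).sum() / N       # coefficient against the constant basis function
c1z = (f * s[:, 2]).sum() / N  # coefficient against the linear-in-z basis function
# Expect c0 -> 1 (the average energy) and c1z -> 2/3 as N grows.
```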

We will abuse the notation $\hat{y}_\ell^m$ for the basis functions, where the subscript $\ell$ is the band, and the superscript $m$ is an index within the band. This is similar to the standard notation, except that we use $\hat{y}$ rather than $y$ to distinguish our renormalised basis functions.

With our renormalised inner product, we can define the first SH basis function:

$$\hat{y}_0(s) = 1$$

This is the only basis function which has a non-zero average value over the sphere. This means that the first SH coefficient for a function represents the average energy over the sphere — the subsequent coefficients and basis functions neither add nor remove energy (instead they simply “move it around”).

Define:

$$\hat{y}_1^x(s) = s_x, \qquad \hat{y}_1^y(s) = s_y, \qquad \hat{y}_1^z(s) = s_z$$

where $s = (s_x, s_y, s_z)$ is the unit direction vector.

Note that we have chosen functions which are not unit-length with respect to our inner product. This is deliberate. We choose to take care of the normalisation in the reconstruction step — that is, when we compute the approximation $\tilde{f}$. If the function we are measuring is radiance (incoming light) then for shading we need to convert it to irradiance (outgoing light) for a given normal direction, and the reconstruction coefficients will get swept into this conversion.

In fact, it is convenient to think of the L1 band as a single vector-valued basis function $\hat{\mathbf{y}}_1(s) = s$, with corresponding SH coefficients $\mathbf{L} = (c_1^x, c_1^y, c_1^z)$.

The direction of $\mathbf{L}$ is the average direction of the function $f$. If $f$ is radiance, then it is the average direction of the incoming light (weighted by intensity).

By construction, the length of $\mathbf{L}$ will vary between zero and $c_0$, the first SH coefficient. The ratio $|\mathbf{L}|/c_0$ is an indication of “how directional” the function $f$ is. If the ratio is one then $f$ is completely directional — all of the energy is at a single point on the sphere. Conversely, if the ratio is zero then $f$ is symmetrical, and the L1 band gives us no directional information at all.
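A small sampled example of that ratio (my own code, assuming the L1 basis functions are the coordinates x, y, z):

```python
import numpy as np

rng = np.random.default_rng(4)
N = 400_000
s = rng.standard_normal((N, 3))
s /= np.linalg.norm(s, axis=1, keepdims=True)

def l1_ratio(f_vals):
    c0 = f_vals.mean()
    L1 = (s * f_vals[:, None]).mean(axis=0)   # the L1 vector
    return np.linalg.norm(L1) / c0

ambient = np.ones(N)                   # no preferred direction at all
lobe = np.maximum(0.0, s[:, 2]) ** 50  # tight lobe around +z

r_ambient = l1_ratio(ambient)   # ~0: symmetrical
r_lobe = l1_ratio(lobe)         # close to 1: strongly directional
```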

There are six different quadratic combinations of the variables $s_x, s_y, s_z$:

$$s_x^2,\quad s_y^2,\quad s_z^2,\quad s_x s_y,\quad s_y s_z,\quad s_z s_x$$

However, since we are on the surface of a 2-sphere we know that $s_x^2 + s_y^2 + s_z^2 = 1$. Therefore there is one linear dependency between these six functions, and we end up with five basis functions in L2.

The simplest way to capture the L2 band is a 3×3 matrix of basis functions:

$$\hat{y}_2^{ij}(s) = s_i s_j - \frac{1}{3}\delta_{ij}$$

with a corresponding matrix of SH coefficients $M_{ij} = \langle f, \hat{y}_2^{ij} \rangle$, where $i, j \in \{x, y, z\}$ and $\delta_{ij}$ is one if $i = j$ and zero otherwise. The negative one-third term is required to ensure each function has zero integral over the sphere, so is orthogonal to the constant basis function.

This 3×3 matrix is symmetric, meaning that $M_{ij} = M_{ji}$, and traceless, meaning that the diagonal elements sum to zero: $M_{xx} + M_{yy} + M_{zz} = 0$. Therefore it does indeed have five degrees of freedom (and an actual implementation would not store nine coefficients!)
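Building that matrix by sampling makes the symmetric/traceless structure visible (a sketch, my own code):

```python
import numpy as np

rng = np.random.default_rng(5)
N = 200_000
s = rng.standard_normal((N, 3))
s /= np.linalg.norm(s, axis=1, keepdims=True)

f = 3.0 * s[:, 0] * s[:, 1] + 1.0        # arbitrary test function

outer = s[:, :, None] * s[:, None, :]    # s_i * s_j per sample
basis = outer - np.eye(3) / 3.0          # subtract delta_ij / 3
M = (basis * f[:, None, None]).mean(axis=0)

trace = np.trace(M)                      # ~0: traceless by construction
```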

Whereas L0 and L1 have relatively simple physical meanings, it is harder to grasp L2 intuitively. The linear L1 band is “antipodal” in the sense that if you add weight in the direction $d$ then it must be balanced by negative weight in the opposite direction $-d$. For the quadratic L2 band, the $d$ and $-d$ directions are the same (since their squares are equal). Instead, to ensure a net-zero overall contribution, if weight is added to the $z$ axis then the balancing negative weight is shared equally in the orthogonal $x$ and $y$ axes.

Although L2 is already a good approximation, in the case of CG lighting this property can have counter-intuitive effects: adding a light may cause “negative light” to appear in some orthogonal direction.

With the above definitions, the SH approximation to the original function is:

$$\tilde{f}(s) = c_0 + 3\,\mathbf{L} \cdot s + \frac{15}{2}\, s^{\mathsf{T}} M s$$

This formulation requires only simple rational constants applied in the reconstruction step.
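A numeric round trip, assuming reconstruction weights of 1, 3 and 15/2 for L0, L1 and L2 (these weights are my own derivation with the renormalised inner product, so treat them as such): projecting a pure quadratic and reconstructing it should give back roughly the same value.

```python
import numpy as np

rng = np.random.default_rng(6)
N = 500_000
s = rng.standard_normal((N, 3))
s /= np.linalg.norm(s, axis=1, keepdims=True)

f = s[:, 0] * s[:, 1]                    # a pure L2 function: f(s) = x * y

c0 = f.mean()
L1 = (s * f[:, None]).mean(axis=0)
outer = s[:, :, None] * s[:, None, :]
M = ((outer - np.eye(3) / 3.0) * f[:, None, None]).mean(axis=0)

def reconstruct(d):
    # Weighted sum of bands with the rational reconstruction constants.
    return c0 + 3.0 * (L1 @ d) + 7.5 * (d @ M @ d)

d = np.array([0.6, 0.8, 0.0])
approx = reconstruct(d)   # should be close to f(d) = 0.6 * 0.8 = 0.48
```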

Everything should be made as simple as possible, but no simpler.

Albert Einstein

We have shown how to redefine the SH basis functions for L0, L1 and L2 in a way that is simpler and more natural for many use cases. This definition removes all trigonometric functions, factors of and square roots, and just leaves the rational constants required for normalisation. The process can be naturally extended to higher SH bands as required.

Although we’ve mentioned lighting a few times in this post, nothing here is lighting-specific. In the next post we’ll discuss a lighting problem: how best to convert from SH radiance (incoming light) to irradiance (outgoing light).
