Double-Dummy Analysis in a Web Browser

Introduction 

In version 1.3 of RealBridge we added double-dummy analysis. This isn’t anything new, and we used the standard piece of open-source software, but we did it in a “non-standard” way, and the details might be useful to others.

In particular, this post explains how to run a specific bit of open-source software – the DDS library, written in C – inside a web browser. It will be of particular interest to anyone developing web-based bridge software, but of more general interest to web developers interested in client-side deployment of software written in non-web-native languages such as C or C++.  

The Double-Dummy Solver library (DDS) 

The DDS library, written by Bo Haglund and Soren Hein (with contributions from others), is a well-optimised open-source double-dummy solver, whose purpose is to evaluate how good different choices of play are during bridge deals. 

A bridge deal has four hands of thirteen cards each, and the four players play as two teams of two, or partnerships. During the play of a deal, one of the players becomes dummy and their hand is placed face-up on the table. The three remaining players can see twenty-six of the cards (their own hand, plus the dummy), and the other twenty-six are hidden. 

This partial information, or uncertainty, is what makes the game interesting – but it also makes analysing the play extremely challenging. A simpler problem – and one that has the advantage of a definite answer – is what the best play would be if you knew the complete layout of all the cards in advance. This is what “double-dummy” refers to – it’s simply an obscure way of saying “everyone knows the whole hand”. 

Under that condition, there’s a single double-dummy result, which is what happens if everyone makes the best choice possible at every turn. 

Given a deal, a trump suit and a declarer, the DDS library can efficiently compute the double-dummy result. Moreover, given the current state of play at any point during a deal – and assuming perfect play thereafter – it can tell you the eventual result of each of the legal cards you could play. That is, it can tell you how good each of the choices is at any point. 

Play recaps with double-dummy analysis 

When reviewing what happened during a game of bridge, players like to be able to check the double-dummy analysis – it can help identify errors, which can help you learn to avoid them in future. Now, not every double-dummy error is actually a mistake – you can make the correct with-the-odds play and lose to an inferior play that happens to work this time. This is part of the fun of the game! With enough luck, anything is possible… 

Nevertheless, double-dummy analysis is a useful tool to have in your bag, and it’s much easier to use software that can compute it in a fraction of a second than try to work it out by hand. 

DDS as a web service 

Existing bridge sites offering double-dummy analysis usually wrap up DDS, or other software with equivalent functionality, as a web service. That is, the client makes a request to a server, the processing happens on the remote server, and then the server sends the results back to the client to display to the user. 

This used to be the only way to expose the capabilities of library software to browsers, but it comes with some drawbacks. The main one is the latency of response: given that a DDS solve of a full hand takes around 50 milliseconds, the round-trip latency to the server and back is likely to be a significant fraction of the total time. If the client and server are far apart – for instance, on different continents – the communication latency will far exceed the processing time. 

Another disadvantage is that the service provider has to provide a server! There is a potential scaling problem inherent in this – it’s likely that double-dummy queries will be an irregular workload (eg, when lots of players finish sessions at the same time, and then all look through the deals). 

There’s another more subtle point about scaling and performance. The DDS solver has some built-in caching of deal-positions and results. When it explores the full “game-tree” of possibilities for a deal, equivalent positions will come up multiple times, and it’s much faster to be able to look up results we’ve already computed than recompute the same thing multiple times. 

Now, the library is clever enough to reuse this information across separate queries. If an incoming query comes from the same deal, with the same trump suit, the solver can simply reuse results already in its cache. This means that in the case that several consecutive queries are about the same contract and line-of-play, the later queries will return near-instantly, with no work to do. 

This means that if, somehow, each client had its own private version of DDS to use, its queries would be much more efficient – since typical use is to make lots of queries about the same deal consecutively. Whereas if we have one centralised server processing requests for lots of users, it’s much harder to efficiently reuse data without doing significant extra engineering. 

All this points towards one thing: what we really want to do is run the DDS library locally, on the client’s machine, separately for each user. While a few years ago this would have been a pipe dream, more modern technologies make it pretty straightforward to get the best of both worlds – faster and more efficient for the user, and simpler and easier for the service provider! 

Compiling with Emscripten 

If we suppose our use case is a webpage or web app, this means we want to run DDS directly in the client’s browser. To do that, we can’t compile the C code to a native binary – because a native binary depends on the OS and hardware it’s compiled for. We need something else, something platform-independent that browsers understand. Fortunately, there are standards and software out there that can help us! 

Emscripten is a tool for turning C or C++ into JavaScript. More specifically, it’s a compiler toolchain that uses LLVM to compile C/C++ to WebAssembly. 

Emscripten itself has been around for a relatively long time – the project started in 2010 – and was a big reason for the standardisation process that created WebAssembly. Because it’s a standard, and supported in all major modern browsers, WebAssembly is now a mature technology. 

Once the toolchain is installed, compiling a simple C program with Emscripten is straightforward: 

emcc main.c 

This will emit two files: a.out.js and a.out.wasm. 

The WASM file is our code, compiled to WebAssembly – instructions for a virtual machine that browsers know how to execute efficiently. The JavaScript file contains the “glue code” that allows us to interact with the compiled functionality from JS. 

We can run our compiled code as a console application via the Node.js runtime. The Emscripten toolchain bundles Node.js with it, so all we need to do is: 

node a.out.js 

Without any further instructions, Emscripten will generate an executable that simply does whatever the C main() function does. The JavaScript glue will execute this main() functionality and then exit. 

This isn’t exactly what we want to achieve – we have library functionality that we want to call on-demand, multiple times. That is, we don’t want the JS module to exit, we want it to stay alive indefinitely. Fortunately, with a bit more work and some handy compiler options we can do this too. 

Compiling DDS with a C-function wrapper 

We want to call two DDS functions from JavaScript: 

  1. SetResources – function to initialise the system.
  2. SolveBoardPBN – function to solve a deal or position.

It’s convenient to write a small C-wrapper to translate parameters into the form expected by DDS. Ours looks like this: 

Simple C wrapper for DDS
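A sketch of such a wrapper is below. It uses the dealPBN/futureTricks structures and the SolveBoardPBN entry point from dds.h; the details (function names, card packing, DDS encodings of trump and first) are illustrative rather than our exact code.

```c
#include <string.h>
#include <emscripten.h>
#include "dds.h"   /* the DDS library header */

/* Initialise DDS: ~50MB of memory, one thread. */
EMSCRIPTEN_KEEPALIVE
void dds_init(void)
{
    SetResources(50, 1);
}

/* Solve one position, writing 13 (card, result) pairs -- 26 ints in
   total -- into out, which the JavaScript side allocates with malloc.
   pbn is the deal in PBN format; trump and first use the DDS encodings. */
EMSCRIPTEN_KEEPALIVE
int do_dds_solve_board(const char *pbn, int trump, int first, int *out)
{
    struct dealPBN dl;
    struct futureTricks fut;

    memset(&dl, 0, sizeof dl);
    dl.trump = trump;
    dl.first = first;
    strncpy(dl.remainCards, pbn, sizeof dl.remainCards - 1);

    /* target = -1, solutions = 3: score every legal card. */
    int ret = SolveBoardPBN(dl, -1, 3, 0, &fut, 0);
    if (ret != RETURN_NO_FAULT)
        return ret;

    for (int i = 0; i < 13; i++) {
        if (i < fut.cards) {
            /* Pack suit and rank into one integer for easy decoding. */
            out[2 * i]     = fut.suit[i] * 16 + fut.rank[i];
            out[2 * i + 1] = fut.score[i];
        } else {
            out[2 * i] = out[2 * i + 1] = -1;
        }
    }
    return ret;
}
```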

Things to note are: 

  • We decorate the C functions we want to call from JavaScript with the EMSCRIPTEN_KEEPALIVE macro (defined in emscripten.h). This tells the compiler not to strip out that function from the compiled output. 
  • All of the work in the do_dds_solve_board function is simple translation of the input and output data. We write the output into an array of 26 integers, since this is easy to read and write on the JavaScript side. 
  • We tell the compiler to generate JavaScript that doesn’t run-and-then-exit. We do this by adding -sNO_EXIT_RUNTIME=1 to the command line. 

We can now try to compile DDS! 

emcc -sNO_EXIT_RUNTIME=1 <dds-source-files> main.c 

We get a few compiler errors, so there are a handful of places we need to modify the DDS code to compile with Emscripten: 

  • There are some string-related functions in Par.cpp that Emscripten complains about. Fortunately, we don’t need any of the functions in Par.cpp for our purposes, so we can simply leave that file out. 
  • In dds.h, Emscripten is fussy about the declarations of structs containing other structs, so we need to add the word struct a few times. 

To actually run the code successfully, we also need to modify the System::GetHardware function. In normal operation, this detects the number of cores and free memory available in the system. The idea is to use all of the resources in the system when solving multiple deals in parallel. 

However, we only want to solve one deal at a time, so we can hard-code the GetHardware function to report a single core and 50MB of memory. This will cause DDS to run in single-threaded mode, with one of its “small” threads. 
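The modification itself is tiny. The exact signature of GetHardware varies between DDS versions, so treat this as an illustrative sketch rather than a patch:

```cpp
// In System.cpp: hard-code the hardware report so DDS runs
// single-threaded with a fixed memory budget.
void System::GetHardware(int& ncores, unsigned long long& kilobytesFree)
{
  ncores = 1;                 // force single-threaded mode
  kilobytesFree = 50 * 1024;  // report 50MB, matching -sINITIAL_MEMORY below
}
```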

To get this to work, we also need to tell Emscripten to increase the amount of memory available to be allocated (from the default of 16MB): 

emcc -sINITIAL_MEMORY=52428800 -sNO_EXIT_RUNTIME=1 \
  <dds-source-files> main.c 

Testing the code from JavaScript 

We’ve successfully compiled the code, but can’t do anything useful with it yet – first we need to tell Emscripten which functions we’re going to call from JavaScript: 

emcc -sINITIAL_MEMORY=52428800 -sNO_EXIT_RUNTIME=1 \
  -sEXPORTED_FUNCTIONS="_free,_malloc,_do_dds_solve_board,_dds_init" \
  -sEXPORTED_RUNTIME_METHODS="getValue,ccall,allocateUTF8" \
  <dds-source-files> main.c 

Here, the EXPORTED_FUNCTIONS are the C functions we want to call. Note that each of them gets a leading underscore character when exported. On the other hand, the EXPORTED_RUNTIME_METHODS are JavaScript functions generated by emcc in the a.out.js file. More specifically, we use getValue to read the 32-bit integers in the generated output array, and we use allocateUTF8 to convert a JavaScript string containing the deal information into a C-style string for input to SolveBoardPBN. 

With all of that done, we can write a simple test case to make sure it works: 

Simple JavaScript test of DDS functionality
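A sketch of such a test is below. The wrapper signature and the trump/leader encodings are illustrative; the deal string is one of the example deals shipped with DDS.

```javascript
// Note: to import the Emscripten output as a module like this,
// it must be compiled with -sMODULARIZE=1.
const createModule = require("./a.out.js");

createModule().then((Module) => {
  Module._dds_init();

  // A deal in PBN format (from the DDS examples).
  const deal =
    "N:QJ6.K652.J85.T98 873.J97.AT764.Q4 K5.T83.KQ9.A7652 AT942.AQ4.32.KJ3";
  const dealPtr = Module.allocateUTF8(deal);
  const outPtr = Module._malloc(26 * 4); // room for 26 32-bit ints

  // Solve in no-trumps with North to lead (DDS encodings assumed).
  Module._do_dds_solve_board(dealPtr, 4, 0, outPtr);

  // Read back the 26 integers as 13 (card, result) pairs.
  for (let i = 0; i < 13; i++) {
    const card = Module.getValue(outPtr + 8 * i, "i32");
    const result = Module.getValue(outPtr + 8 * i + 4, "i32");
    if (card >= 0) console.log(card, result);
  }

  Module._free(dealPtr);
  Module._free(outPtr);
});
```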

As noted in the comment, to import the Emscripten JavaScript as a module, we need to compile it with -sMODULARIZE=1. This is convenient for this little test, which we can run via Node.js on the command line. 

The test does a single double-dummy solve of the given deal, returning 26 integers. Each pair of integers is of the form (card, result): the eventual double-dummy result of playing each possible card. 

Running DDS in a WebWorker 

In practice, the more useful scenario is running DDS inside a webpage. Although it’s fast, the processing for a double-dummy solution isn’t instant, so it would be best to use a separate thread to do the work – that way we don’t cause any hitches on the main processing thread. 

Happily, using a WebWorker makes this very simple. Here we have to compile without -sMODULARIZE=1, but then we can use this useful function at the top of our WebWorker: 

importScripts("./a.out.js"); 

(The importScripts function is part of the WebWorker API.)

This defines a variable Module, and we can subsequently call our functions through it, eg:

Module._dds_init() 

Of course, the exact details of the WebWorker will depend on the app or webpage it’s communicating with. 
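As an illustration, a minimal worker might look like the following (the message format here is made up, and the wrapper signature matches the illustrative one used earlier):

```javascript
// Minimal double-dummy WebWorker sketch.
importScripts("./a.out.js"); // defines the global Module

// Wait for the WASM runtime before accepting work.
Module.onRuntimeInitialized = () => {
  Module._dds_init();
  postMessage({ ready: true });
};

onmessage = (e) => {
  const { pbn, trump, first } = e.data; // hypothetical message shape
  const dealPtr = Module.allocateUTF8(pbn);
  const outPtr = Module._malloc(26 * 4);
  Module._do_dds_solve_board(dealPtr, trump, first, outPtr);

  // Decode the 13 (card, result) pairs and send them back.
  const results = [];
  for (let i = 0; i < 13; i++) {
    const card = Module.getValue(outPtr + 8 * i, "i32");
    if (card >= 0)
      results.push({ card, tricks: Module.getValue(outPtr + 8 * i + 4, "i32") });
  }
  Module._free(dealPtr);
  Module._free(outPtr);
  postMessage({ results });
};
```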

Conclusion 

Incorporating library functionality into web-based software used to require wrapping software as a web-service and providing servers to run it – but with modern compilation tools and browser capabilities we can now run more types of software directly. This can make developing fully-featured web experiences both quicker and simpler.

(Why you shouldn’t try) Implementing WaveNet on a GPU

Rendering natural-sounding speech is a hot problem, with Amazon, Google and Baidu among others rolling out cloud-based solutions for services like voice assistants. Until recently, the state-of-the-art in converting text to speech was WaveNet, from Google DeepMind. (More recently they published a parallel implementation suitable for GPUs; this is what they use in production, but has a different set of trade-offs, eg more complicated training.)

Anyway, the point of this post isn’t to actually implement WaveNet, but to explore how something-WaveNet-like runs on a GPU. This came out of a question from my friends at Myrtle Software, who specialise in compiling code to run on FPGAs, typically to run in a datacentre.

The question in question is: is the GPU implementation in this paper by Baidu really representative of the best you can do on a GPU?

What WaveNet does

The original WaveNet algorithm synthesises audio at 16kHz one sample at a time. Each sample is the output of a deep neural network whose inputs include (among other things) the previous about-a-hundred samples. The upshot of this is there’s limited parallelism to exploit, since we need to compute sample n completely before all of the work for sample n+1 can kick off. (Though there are dynamic-programming-style optimisations available to avoid doing redundant work.)

Now, an important point about this set-up is quite how “deep” deep is: dozens and dozens of layers of activations. The proxy we’re going to use as a benchmark to approximate WaveNet is to take a 128-element vector, and apply serial matrix multiplication by eighty different 128×128 matrices in turn. With full 32-bit values this works out to ~5MB of weights we need to access to compute each sample. This is a major problem for generating an efficient implementation.
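The proxy is simple enough to state directly in code. Here is a plain C reference version (names ours), used only to pin down exactly what the benchmark computes; note the weights total 80 × 128 × 128 × 4 bytes ≈ 5MB, as claimed.

```c
#include <stdlib.h>

#define N 128       /* vector and matrix dimension */
#define LAYERS 80   /* number of serial matrix multiplies */

/* Apply LAYERS dense NxN matrices to an N-vector, strictly in sequence.
   mats holds LAYERS row-major matrices back to back; vec is updated
   in place. Each layer depends on the previous one, which is the
   serial bottleneck discussed in the text. */
void run_chain(const float *mats, float *vec)
{
    float tmp[N];
    for (int l = 0; l < LAYERS; l++) {
        const float *m = mats + (size_t)l * N * N;
        for (int r = 0; r < N; r++) {
            float acc = 0.0f;
            for (int c = 0; c < N; c++)
                acc += m[r * N + c] * vec[c];
            tmp[r] = acc;
        }
        for (int r = 0; r < N; r++)
            vec[r] = tmp[r];
    }
}
```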

The Baidu paper uses an interesting strategy to combat this problem. They basically say “clearly it would be ridiculous to keep reloading matrix weights all the time. But look – a big GPU actually has a lot of very fast local storage close to the execution units: the register files!”

It’s true that there are a lot of registers in a big discrete GPU. A typical NVIDIA Streaming Multiprocessor (SM) has 65,536 32-bit registers, ie a 256KB register file. This means if you have 20 SMs you’ve got enough space to keep all of those matrix coefficients resident in fast local memory.

But you’ve probably spotted the problem here: although we’ve got enough memory overall, there’s no space for any redundancy, and each matrix can only live in one place: a single SM. Because of the serial dependency in the computation, this means that only one SM can be “active” at once. So before we start we’re limited to an upper bound of 5% compute efficiency.

Baidu results

And in fact it’s worse than that. The strategy feels a bit odd (“it’s not how GPUs are supposed to be programmed”) and perhaps unsurprisingly, it turns out that the compiler is really unhappy about the idea of keeping all those weights resident in registers. So things go from bad to worse, and the actual compute efficiency they achieve in the end is more like 0.1%. This is bad.

Can we do better?

TL;DR – yes, but…

The Baidu implementation runs at something like 0.2x real-time. A simple bandwidth calculation shows that even streaming through all the weights over and over again will be faster than that: real-time performance would need 5MB × 16kHz = 80GB/s. (A top-end discrete GPU has bandwidth >300GB/s, and even the aging and frail 860m in my laptop has 80GB/s.)

So I wrote a CUDA kernel to do the proxy calculation. Here are some of the fiddly details:

A. We can’t efficiently launch ~10,000 kernels/sec so to minimise our overheads we have to run an uber-kernel. (That is, we launch a small number of thread groups – so they can all be in flight on the GPU at once – and implement blocking/waiting/synchronisation behaviour by hand.)

B. We split each matrix multiplication across the thread groups/SMs. Conceptually, each thread group will be responsible for a few rows of the matrix multiply.

C. We have to store our results to main memory and synchronise after every matrix multiply. Using an atomic add and spinning seems as good as anything. (This is a rare case where adding thread-predication and thread-group-synchronisation actually improves performance: just spin on the counter on a single thread/warp.)

D. Streaming the weights is slow and introduces latency, so it’s a win to unwind the loop. Instead of the straightforward:

{
    [load weights for iteration i]
    ...
    [do work for iteration i]
}

we instead do:

[load weights for iteration 0]

loop i
{
    [load weights for iteration i+1 into temp]
    ...
    [do work for iteration i]
    ...
    [weights = temp]
}

By overlapping the work for iteration i with the loading for iteration i+1 we hide a lot of the memory latency. (And this is enough: unwinding the loop two iterations is slower again.)

E. Not maximising occupancy. My 860m is wide enough to have enough threads so each one has 4 multiply-adds to do. But it’s actually faster to do 8 multiply-adds per thread and have half as many thread groups to synchronise. (It doesn’t work to halve the number of threads again though.)

F. And a bunch of minor things. For instance, it’s better to spread the intermediate vector results across several cache lines, so there’s no contention between different SMs trying to write the same line. But the difference this makes is pretty much in the noise.

Results

After optimisation, this artificial benchmark ends up running at about 1.5% compute efficiency (at least on the 860m in my laptop). This is terrible, terrible performance. In some sense it’s an order-of-magnitude improvement on the Baidu approach, but we’re not really comparing apples-to-apples. (This is one case where it’s actually a good thing that my GPU is small!)

Why is it so bad? In a word: synchronisation. The only way that different SMs can communicate is via L2, with latency in the hundreds of cycles. At each stage of the inner loop, a single thread only has 8 multiply-adds to do, so performance is dominated by the time spent waiting on L2.

Conclusion

This exercise highlights a fundamental limitation of an NVIDIA-style “big” GPU: we can’t parallelise efficiently across the whole processor if we need frequent synchronisation.

We can see in the Baidu results that a CPU is a better choice in terms of efficiency for this problem. This is because CPUs are optimised for high single-threaded performance, with narrow(ish) vector processors, and the relative cost of synchronising through memory is much lower.

Even the CPU doesn’t look great in terms of absolute time. To minimise latency we need to parallelise (since we’ll never be able to run a small number of floating-point units fast enough), and that means solving the problem of frequent communication across a large parallel array of ALUs.

Odd terms in the SH irradiance expansion

Here’s one last observation about SH irradiance which is perhaps more a curiosity than anything else. Or at least we couldn’t work out how to make it useful!

Consider the integral for computing the L1 vector of coefficients \vec{\boldsymbol{R}}_1:

    \[ \vec{\boldsymbol{R}}_1=\frac{1}{4\pi} \oint \vec{\boldsymbol{\omega}}\, R(\vec{\boldsymbol{\omega}}) \, d\Omega \]

This looks somewhat similar to the definition of irradiance, and if we dot the whole expression by the normal vector we get:

    \[ \vec{\boldsymbol{n}} \! \cdot \! \vec{\boldsymbol{R}}_1=\frac{1}{4\pi}\oint\vec{\boldsymbol{n}}\!\cdot\!\vec{\boldsymbol{\omega}}\, R(\vec{\boldsymbol{\omega}}) \, d\Omega \]

    \[ =\frac{1}{4}\left( \frac{1}{\pi} \int_{H+}\vec{\boldsymbol{n}}\!\cdot\!\vec{\boldsymbol{\omega}}\, R(\vec{\boldsymbol{\omega}}) \, d\Omega +\frac{1}{\pi}\int_{H-}\vec{\boldsymbol{n}}\!\cdot\!\vec{\boldsymbol{\omega}}\, R(\vec{\boldsymbol{\omega}}) \, d\Omega \right) \]

which yields

    \[ \vec{\boldsymbol{n}}\!\cdot\!\vec{\boldsymbol{R}}_1=\frac{1}{4} \left( I(\vec{\boldsymbol{n}}) - I(-\vec{\boldsymbol{n}}) \right) \]

This is actually quite surprising — what it says is that the L1 band captures the antipodal difference in irradiance perfectly. If you know the exact irradiance in a direction, and you know the L1 vector, then you can compute the exact irradiance in the opposite direction. Or, to put it another way, the only odd terms in the irradiance expansion are the linear terms in the L1 band.

Perhaps this is just a curiosity, but it feels like it might give us a useful clue for improving higher-order SH irradiance models.

Converting SH Radiance to Irradiance

In the last post we explained an alternative, simpler way of defining the Spherical Harmonic basis functions. We are now going to use this definition of the L0 and L1 basis functions to investigate a classic lighting use case: converting radiance to irradiance.

Radiance is the incoming light at a point. Since this is a spherical function, it can be approximated by a truncated SH series, and this approximation can be computed by sampling (in, say, your favourite path tracer or real-time global illumination system). However, for shading geometry we need to know outgoing light, also known as irradiance.

For the purposes of this post we assume the perfectly diffuse Lambertian BRDF, which is a constant function. Therefore irradiance is the following hemispherical integral:

    \[ I(\vec{\boldsymbol{n}})=\frac{1}{\pi}\int_H\vec{\boldsymbol{n}}\!\cdot\!\vec{\boldsymbol{\omega}} R(\vec{\boldsymbol{\omega}}) d\Omega \]

where \vec{\boldsymbol{n}}\!\cdot\!\vec{\boldsymbol{\omega}} is the geometry term.

Computing the SH approximation, we can replace R(\vec{\boldsymbol{\omega}}) by R_{sh}(\vec{\boldsymbol{\omega}}) and compute the SH coefficients of the irradiance independently:

    \[ I_0=\frac{1}{\pi}\int_H\vec{\boldsymbol{n}}\!\cdot\!\vec{\boldsymbol{\omega}}\, R_0 \, d\Omega = R_0 \]

    \[ \vec{\boldsymbol{I}}_1=\frac{1}{\pi}\int_H \vec{\boldsymbol{n}}\!\cdot\!\vec{\boldsymbol{\omega}} \, 3 \vec{\boldsymbol{R}}_1\!\cdot\!\vec{\boldsymbol{\omega}} \, d\Omega =2\vec{\boldsymbol{R}}_1 \]

    \[ I_2^{ij}=\frac{1}{\pi}\int_H \vec{\boldsymbol{n}}\!\cdot\!\vec{\boldsymbol{\omega}} \, \tfrac{15}{2}\omega_i \omega_j R_2^{ij} \, d\Omega =\tfrac{15}{8}R_2^{ij} \]

Putting it all together, we find that the irradiance reconstruction is:

    \[ I_{sh}(\vec{\boldsymbol{\omega}})=R_0+2\vec{\boldsymbol{R}}_1\!\cdot\!\vec{\boldsymbol{\omega}}+\tfrac{15}{8} \omega_i \omega_j R_2^{ij}+\ldots \]

Here the radiance reconstruction constants have been absorbed into the irradiance computation. This means we have only one set of constants to apply to convert the measured radiance coefficients into irradiance.

…and we’re done?

For quite a while the Enlighten SH solver generated L1 irradiance using this linear reconstruction. We considered it an optimisation to fold the conversion into the solver itself. However it’s pretty easy to see this has serious problems. One nice feature of our definition for L1 is that we know 0 \le |\vec{\boldsymbol{R}}_1| \le R_0. The lower bound is attained when the lighting is completely symmetrical or ambient (with no preferred direction), while the upper bound occurs if the incoming light comes from a single direction. However, the factor of two in the linear radiance-to-irradiance conversion means that for very strongly directional lighting environments, our approximation for irradiance will be negative in some direction. This is clearly undesirable since irradiance can never be negative in reality, and causes obvious visual artefacts in practice.

Improving L1 irradiance with a non-linear model

To improve this situation, we eventually unbaked the conversion constants in the solver (so it once again generated SH radiance), and applied a more complex conversion to irradiance in shader code. The rest of this post contains the motivation and derivation of this improved L1 irradiance model.

We already know that the formula given above is the best possible linear conversion to irradiance. Therefore we need to look for a non-linear alternative. To narrow this down we’re first going to find a better polynomial approximation for the pure directional case |\vec{\boldsymbol{R}}_1| = R_0. Once we’re happy with that we can generalise as |\vec{\boldsymbol{R}}_1| / R_0 varies. Finally, we need to ensure the energy and dynamic range of the outgoing irradiance are correct for our measured radiance values.

Pure directional irradiance

For an ideal point light source with energy 1 in direction \vec{\boldsymbol{d}}, Lambertian irradiance is given by:

    \[ I(\vec{\boldsymbol{n}}) = \begin{cases} 4 (\vec{\boldsymbol{d}} . \vec{\boldsymbol{n}}) & \text{if}\ \vec{\boldsymbol{d}} . \vec{\boldsymbol{n}} \ge 0 \\ 0 & \text{otherwise} \end{cases} \]

This has a minimum value of 0 (attained on a whole hemisphere), and a maximum value of 4. Compare this to the linear reconstruction for L1 irradiance:

    \[ I_{lin}(\vec{\boldsymbol{n}}) = 1 + 2 (\vec{\boldsymbol{d}}.\vec{\boldsymbol{n}}) \]

This has a minimum of -1 and a maximum of 3. Although this integrates to the correct energy as written, when we use it in practice in a shader we’ll clamp negative values to 0, with the net result that the total energy of the function will be too high. Therefore the linear reconstruction gives us the wrong energy, and the wrong dynamic range (incorrect maximum value).

Best smooth approximation

To find our non-linear approximation in the pure directional case, let’s first specify that the approximation is smooth. This makes sense since we want it to be one endpoint of a smoothly varying overall model. To satisfy this, we will try to fit a polynomial function f (which will automatically be smooth) in the variable x = \vec{\boldsymbol{d}}.\vec{\boldsymbol{n}}.

There are constraints on this function that we already know: the total energy (integral under the curve) should be 2, the minimum value 0 and the maximum value 4.

Additionally, we are going to specify derivative conditions at -1 (the point opposite the light source). The true irradiance is 0 on the whole back-facing hemisphere. We can’t attain that with a polynomial, but we can at least specify that f'(-1) = f''(-1) = 0.

This gives us five constraints to satisfy, so we can fit a unique quartic ax^4 + bx^3 + cx^2 + dx + e. Solving for the coefficients gives us:

    \[ f(x) = \frac{1}{2} (1 + x)^3 \]
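Although we fitted a quartic, the solution is in fact a cubic (the x^4 coefficient vanishes), and it’s easy to verify the five constraints directly:

    \[ f(-1) = 0, \qquad f(1) = 4, \qquad f'(x) = \tfrac{3}{2}(1+x)^2 \;\Rightarrow\; f'(-1) = f''(-1) = 0, \]

    \[ \int_{-1}^{1} \tfrac{1}{2}(1+x)^3 \, dx = \left[ \tfrac{1}{8}(1+x)^4 \right]_{-1}^{1} = 2 \]

Since f'(x) \ge 0, the function is increasing, so the minimum 0 and maximum 4 are attained at the endpoints.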

Generalise as length of L1 vector varies

We now know what to do at the two extreme cases. When |\vec{\boldsymbol{R}}_1 | = R_0 we can plug in our non-linear approximation. Alternatively, if |\vec{\boldsymbol{R}}_1 | = 0 we have no directional variation at all, so irradiance is simply a constant function. But what should we do in between the two extremes? 

For each value in between there is a range of possibilities, from a linear model to something “as non-linear” as our extreme model. It’s reasonable to suppose that our model should become more linear as it becomes more ambient (less directional). So let’s define the model power to be:

    \[ p = 1 + 2 \frac{| \vec{\boldsymbol{R}}_1 |}{R_0} \]

This varies between 1 and 3 based on the length of the L1 vector. Our final model will then be based on (1+x)^p where x = \hat{\boldsymbol{R}_1}.\vec{\boldsymbol{n}}, that is, the dot product with the normalised direction of the L1 vector. 

At this point it’s useful to introduce a new variable q = \frac{1}{2}(1 + x), so the model is of the form a + bq^p where a and b are constants we need to find.

We have two constraints to apply to find a and b: total energy, and dynamic range.

We are assuming that the total energy is 2 (only the length of L1 is varying, not L0). Therefore we want to find b so that \int_{-1}^{1} bq^p dx = 2. It follows that b = p + 1. Our model is then a linear interpolation between the ambient and directionally-varying terms:

    \[ a + (1-a)(p+1)q^p \]
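The normalisation b = p + 1 follows from the substitution q = \frac{1}{2}(1 + x), dx = 2\,dq:

    \[ \int_{-1}^{1} b q^p \, dx = 2b \int_{0}^{1} q^p \, dq = \frac{2b}{p+1} = 2 \quad \Rightarrow \quad b = p + 1 \]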

The correct dynamic range (maximum – minimum) varies linearly with the length of L1 and equals 4 \frac{|\vec{\boldsymbol{R}}_1 |}{R_0}. The dynamic range of the model is (1-a)(p+1). Substituting in the definition of p and solving for a gives:

    \[ a = (1 - \frac{|\vec{\boldsymbol{R}}_1 |}{R_0}) / (1 + \frac{|\vec{\boldsymbol{R}}_1 |}{R_0}) \]

Final improved L1 irradiance model

Putting it all together, the final model is:

    \[ I_{sh}(\vec{\boldsymbol{n}}) = R_0 (a + (1 - a)(p + 1)q^p ) \]

where

    \[ q = \frac{1}{2}(1 + \hat{\boldsymbol{R}_1}.\vec{\boldsymbol{n}}) \]

    \[ p = 1 + 2 \frac{| \vec{\boldsymbol{R}}_1 |}{R_0} \]

    \[ a = (1 - \frac{|\vec{\boldsymbol{R}}_1 |}{R_0}) / (1 + \frac{|\vec{\boldsymbol{R}}_1 |}{R_0}) \]

Conclusion

We’ve discussed how to convert SH radiance (incoming light) into SH irradiance (outgoing bounced light), assuming a diffuse Lambertian BRDF. The natural, linear best-fit method generates functions which can attain negative values, which in the case of light is an obviously undesirable property.

In the case of L1 we fix this with a non-linear reconstruction, fitting the model to the measured direction, energy and dynamic range. This model gives very good “bang for the buck”: it produces decent shading quality, despite only using a small number of coefficients (four per colour channel).

Unfortunately, it’s frustratingly unclear how to extend this non-linear reconstruction naturally to higher SH bands. Instead, the most practical current approach is to remove negative values by “de-ringing”; Peter-Pike Sloan presented a robust numerical method for doing this at SIGGRAPH Asia 2017.

Alternative definition of Spherical Harmonics for Lighting

The Spherical Harmonic (SH) series has proved itself a useful mathematical tool in various domains like quantum mechanics and acoustics, as well as in lighting computations.

However the traditional definition from quantum mechanics (other variations exist) is somewhat intimidating:

    \[ Y_{\ell }^{m}(\theta ,\varphi )=(-1)^{m}{\sqrt {{(2\ell +1) \over 4\pi }{(\ell -m)! \over (\ell +m)!}}}\,P_{\ell }^{m}(\cos {\theta })\,e^{im\varphi } \]

The constants here are well-motivated: they are required so that quantum mechanical probability distributions are normalised to 1… but if we blindly plug this definition straight in for a 3D graphics/lighting use case, we end up evaluating lots of trigonometric functions and carrying around complicated factors of \pi that aren’t necessary. In the literature, these constants are often quoted as decimal quantities without explanation, making everything even more confusing. If instead we make a simplifying redefinition, our SH coefficients become easier to interpret and the computations require no trigonometry and fewer magic constants (and thus, hopefully, less debugging…)

Background

The Spherical Harmonic (SH) series of functions is the analogue of the Fourier Series for functions on the surface of a 2-sphere {S}^2 = \{ (x, y, z) \in \mathbb{R}^3 : x^2 + y^2 + z^2 = 1 \}. The complete set of functions is an infinite-dimensional basis for functions on the sphere, but in practical use the series is truncated to give an approximation of an arbitrary function by a finite weighted sum of basis functions.

In computer graphics, spherical harmonics are used as a form of compression, which in turn greatly accelerates computations. For instance, the incoming light at a point in space is a spherical function (since it varies with direction), but using SH its approximation can be compactly represented by a handful of coefficients (the weights for the first few basis functions). This allows computations to be performed on a vector of these SH coefficients, rather than the spherical functions themselves.

Spherical harmonics are a good choice for this compressed representation because the SH approximation is invariant under rotation. That is, the SH approximation of a rotated function is the same as the rotated SH approximation of the original function.

Spherical Harmonic Bands

The Spherical Harmonic series is composed of separate “bands” of functions, usually denoted L0, L1, L2 etc. The L0 band is a single constant function; the L1 band consists of three linear functions; the L2 band contains five quadratic functions; and so on. For real-time lighting it is unusual to go past the L2 band (which requires nine total coefficients per channel) since the data and computation requirements become large, and the L2 approximation is already pretty accurate. However, use cases in physics and astronomy can require up to L3000!

We will return to the bands later, but for now let’s start with some definitions.

Truncated Weighted Sum

Given a basis B_i (\vec{\boldsymbol{\omega}}) , a spherical function R(\vec{\boldsymbol{\omega}}) can be written as a weighted sum:

    \[ R(\vec{\boldsymbol{\omega}}) = R_0 B_0(\vec{\boldsymbol{\omega}}) + R_1 B_1(\vec{\boldsymbol{\omega}}) + R_2 B_2(\vec{\boldsymbol{\omega}}) + \ldots \]

where R_0, R_1, \ldots are scalar coefficients.

Truncating the sum gives a finite approximation:

    \[ R_{tr}(\vec{\boldsymbol{\omega}}) \, = \, R_0 B_0(\vec{\boldsymbol{\omega}}) + \ldots + R_n B_n(\vec{\boldsymbol{\omega}}) \]

where the vector of coefficients (R_0, \ldots , R_n) fully describes the approximation.

Standard Inner Product

Recall that the standard definition of the inner product of spherical functions R, S is a spherical integral:

    \[ \langle R, S \rangle_{std} = \oint \, R(\vec{\boldsymbol{\omega}}) \, S(\vec{\boldsymbol{\omega}}) \, d\Omega \]

where d\Omega is the measure over the sphere and \vec{\boldsymbol{\omega}} is the variable of integration.

Therefore, if the basis is orthonormal, we can compute the coefficients for a function with integrals:

    \[ R_i = \langle R, B_i \rangle_{std} = \oint \, R(\vec{\boldsymbol{\omega}}) \, B_i(\vec{\boldsymbol{\omega}}) \, d\Omega \quad \quad i = 0, 1, \ldots \]

Constant Basis Function

The standard definition of the first SH basis function — the single function in L0 — is:

    \[ Y_0^0 (\vec{\boldsymbol{\omega}}) = \frac{1}{\sqrt{4\pi}} \]

This definition is motivated by ensuring normalisation over the sphere, that is:

    \[ \langle Y_0^0, Y_0^0 \rangle_{std} \, = \, \oint (Y_0^0)^2 \, d\Omega \, = \, \oint \frac{1}{4\pi} \, d\Omega \, = \, 1 \]

However for our use case this “extra” factor of \sqrt{4 \pi} is redundant. Consider what happens when we approximate a constant function R(\vec{\boldsymbol{\omega}}) = \alpha. First we compute the first SH coefficient:

    \[ R_0 = \oint R(\vec{\boldsymbol{\omega}}) Y_0^0 d\Omega =\oint \alpha \frac{1}{\sqrt{4 \pi}} d\Omega = \sqrt{4\pi} \alpha \]

Now, to construct the SH approximation, we form the weighted sum of basis functions; that is, we multiply the coefficient by the basis function:

    \[ R_{sh}(\vec{\boldsymbol{\omega}}) = R_0 Y_0^0 = \frac{\sqrt{4\pi}\alpha }{\sqrt{4\pi}} = \alpha \]

As expected, the function is reconstructed exactly, but the process is clearly more complicated than necessary: we gain a factor of \sqrt{4 \pi} in the SH coefficient only to cancel it with the same factor in the basis function.

The factor of 4 \pi arises since it is the surface area of the unit sphere, but that isn’t relevant to our intended use case. It is much simpler to subsume this factor in a new definition of the inner product.
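The cancellation described above is easy to check numerically. Here is a minimal Python sketch (the value of alpha is arbitrary, chosen just for illustration):

```python
import math

# For a constant function R(w) = alpha, project onto the standard
# basis function Y_0^0 = 1/sqrt(4*pi) using the standard
# (unnormalised) spherical inner product, then reconstruct.
alpha = 2.5
Y00 = 1.0 / math.sqrt(4.0 * math.pi)

# <R, Y00>_std = integral over the sphere of alpha * Y00 dOmega.
# The integrand is constant, so it is just alpha * Y00 * 4*pi.
R0 = alpha * Y00 * 4.0 * math.pi    # = sqrt(4*pi) * alpha

# Reconstruction: the sqrt(4*pi) factors cancel and alpha is recovered.
reconstructed = R0 * Y00
print(reconstructed)                # alpha, exactly
```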

Normalised Inner Product

Redefine the inner product:

    \[ \langle R, S \rangle \, = \, \frac{1}{4\pi} \oint R(\vec{\boldsymbol{\omega}}) \, S(\vec{\boldsymbol{\omega}}) \, d\Omega \]

The idea is to “normalise so the surface area of the sphere is one.” With this new definition, for R(\vec{\boldsymbol{\omega}}) = \alpha we now obtain the more natural outcome that the corresponding SH coefficient R_0 = \alpha.

In fact this definition results in a 1-to-1 correspondence between the constructive approach of computing SH coefficients via sampling and the theoretical integral form:

    \[ R_0 = \frac{1}{n} \sum_{i=1}^{n} R(\vec{\boldsymbol{\omega}}_i) \quad \mapsto \quad R_0 = \frac{1}{4\pi} \oint R(\vec{\boldsymbol{\omega}}) \, d\Omega \]

where \vec{\boldsymbol{\omega}}_1, \ldots, \vec{\boldsymbol{\omega}}_n are samples taken from a uniform spherical distribution.

This means that to compute the ith SH coefficient we simply sum the products of the function and the ith basis function at the n sample directions and divide the result by n.
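As a sketch of this sampling approach, the following Python snippet estimates R_0 by averaging uniform sphere samples. The test function and sample count are assumptions made for the example:

```python
import math, random

def uniform_sphere_sample(rng):
    # Uniform direction on the unit sphere: z is uniform in [-1, 1],
    # azimuth is uniform in [0, 2*pi).
    z = rng.uniform(-1.0, 1.0)
    phi = rng.uniform(0.0, 2.0 * math.pi)
    s = math.sqrt(1.0 - z * z)
    return (s * math.cos(phi), s * math.sin(phi), z)

def R(w):
    # An illustrative spherical function: a constant plus a linear lobe.
    x, y, z = w
    return 1.0 + 0.5 * z

rng = random.Random(1234)
n = 200_000

# R_0 is just the average value of R over the sphere (B_0^0 = 1).
R0 = sum(R(uniform_sphere_sample(rng)) for _ in range(n)) / n
print(R0)   # close to 1.0: the 0.5*z term integrates to zero
```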

Notation

We will abuse the notation B_l^* for the basis functions, where the subscript l is the band, and the superscript is an index within the band. This is similar to the standard notation, except that we use B rather than Y to distinguish our renormalised basis functions.

L0 Band

With our renormalised inner product, we can define the first SH basis function:

    \[ B_0^0 (\vec{\boldsymbol{\omega}}) = 1 \]

This is the only basis function which has a non-zero average value over the sphere. This means that the first SH coefficient for a function represents the average energy over the sphere — the subsequent coefficients and basis functions neither add nor remove energy (instead they simply “move it around”).

L1 Band

Define:

    \[ B_1^{-1,0,1} = x, y, z \]

Note that we have chosen functions which are not unit-length with respect to our inner product. This is deliberate. We choose to take care of the normalisation in the reconstruction step — that is, when we compute the approximation R_{sh}. If the function we are measuring is radiance (incoming light) then for shading we need to convert it to irradiance (outgoing light) for a given normal direction, and the reconstruction coefficients will get swept into this conversion.

In fact, it is convenient to think of the L1 band as a single vector \vec{\boldsymbol{B}} _1 = (x, y, z) with corresponding SH coefficients \vec{\boldsymbol{R}}_1.

The direction of \vec{\boldsymbol{R}}_1 is the average direction of the function R. If R is radiance, then it is the average direction of the incoming light (weighted by intensity).

By construction, the length of \vec{\boldsymbol{R}}_1 will vary between zero and R_0, the first SH coefficient. The ratio |\vec{\boldsymbol{R}}_1| / R_0 is an indication of “how directional” the function is. If the ratio is one then R is completely directional — all of the energy is at a single point on the sphere. Conversely, if the ratio is zero then R is symmetrical, and the L1 band gives us no directional information at all.
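To make the directionality ratio concrete, here is a sketch that estimates \vec{\boldsymbol{R}}_1 and the ratio |\vec{\boldsymbol{R}}_1| / R_0 by uniform sampling. The function max(z, 0) — light arriving from the +z hemisphere — is a hypothetical choice for the example:

```python
import math, random

def uniform_sphere_sample(rng):
    # Uniform direction on the unit sphere.
    z = rng.uniform(-1.0, 1.0)
    phi = rng.uniform(0.0, 2.0 * math.pi)
    s = math.sqrt(1.0 - z * z)
    return (s * math.cos(phi), s * math.sin(phi), z)

def R(w):
    # One-sided "light": non-zero only in the +z hemisphere.
    return max(w[2], 0.0)

rng = random.Random(42)
n = 200_000
R0 = 0.0
R1 = [0.0, 0.0, 0.0]
for _ in range(n):
    w = uniform_sphere_sample(rng)
    r = R(w)
    R0 += r                      # product with B_0^0 = 1
    for i in range(3):
        R1[i] += r * w[i]        # products with B_1 = (x, y, z)
R0 /= n
R1 = [c / n for c in R1]

ratio = math.sqrt(sum(c * c for c in R1)) / R0
print(R1)     # points along +z: the average direction of the light
print(ratio)  # strictly between 0 and 1: partially directional
```

Working the integrals by hand gives R_0 = 1/4 and \vec{\boldsymbol{R}}_1 = (0, 0, 1/6), so the ratio is 2/3, which the sampled estimate approaches.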

L2 Band

There are six different quadratic combinations of the variables x, y, z:

    \[ x^2, y^2, z^2, xy, yz, xz \]

However, since we are on the surface of a 2-sphere we know that x^2 + y^2 + z^2 = 1. Therefore there is one linear dependency between these six functions, and we end up with five basis functions in L2.

The simplest way to capture the L2 band is a 3×3 matrix:

    \[ B_2^{ij} = \omega _i \omega _j - \tfrac{1}{3} \delta _{ij} \quad \quad \quad i, j \in \{1,2,3\} \]

with a corresponding matrix R_2^{ij} of SH coefficients, where \omega_{1,2,3} = x, y, z and \delta_{ij} is one if i = j and zero otherwise. The negative one-third term is required to ensure each function has zero integral over the sphere, so is orthogonal to the constant basis function.

This 3×3 matrix R_2^{ij} is symmetric, meaning that R_2^{ij} = R_2^{ji}, and traceless, meaning that the diagonal elements sum to zero: R_2^{11} + R_2^{22} + R_2^{33} = 0. Therefore it does indeed have five degrees of freedom (and an actual implementation would not store nine coefficients!)
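A minimal sketch of this matrix construction, verifying the symmetric and traceless properties for an arbitrary unit direction:

```python
# Build the L2 basis matrix B_2^{ij}(w) = w_i w_j - delta_ij / 3
# for a direction w, and check it is symmetric and traceless.
def B2(w):
    return [[w[i] * w[j] - (1.0 / 3.0 if i == j else 0.0)
             for j in range(3)]
            for i in range(3)]

w = (0.6, 0.0, 0.8)     # a unit direction: 0.36 + 0.64 = 1
m = B2(w)

trace = m[0][0] + m[1][1] + m[2][2]
print(trace)            # 0, since x^2 + y^2 + z^2 = 1 on the sphere
# The matrix is symmetric, so only 5 of the 9 entries are independent:
# an implementation need only store the upper triangle and can derive
# one diagonal element from the traceless condition.
```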

Whereas L0 and L1 have relatively simple physical meanings, it is harder to grasp L2 intuitively. The linear L1 band is “antipodal” in the sense that if you add weight in the +x direction then it must be balanced by negative weight in the opposite -x direction. For the quadratic L2 band, the +x and -x directions are the same (since their squares are equal). Instead, to ensure a net-zero overall contribution, if weight is added to the x axis then the balancing negative weight is shared equally in the orthogonal y and z axes.

Although the L2 approximation is generally good, in the case of CG lighting this balancing property can have counter-intuitive effects: adding a light may cause “negative light” to appear in some orthogonal direction.

Reconstruction

With the above definitions, the SH approximation to the original function is:

    \[ R_{sh}(\vec{\boldsymbol{\omega}}) = R_0 + 3 \vec{\boldsymbol{R}}_1 \! \cdot \! \vec{\boldsymbol{\omega}} + \tfrac{15}{2}\omega_i \omega_j R_2^{ij} + \ldots \]

This formulation requires only simple rational constants applied in the reconstruction step.
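Putting the pieces together, the following sketch projects a function onto L0, L1 and L2 by uniform sampling and reconstructs it with the rational constants above. The target function is an assumption for the example, chosen to lie exactly within the first three bands so that reconstruction error comes only from Monte Carlo noise:

```python
import math, random

def uniform_sphere_sample(rng):
    # Uniform direction on the unit sphere.
    z = rng.uniform(-1.0, 1.0)
    phi = rng.uniform(0.0, 2.0 * math.pi)
    s = math.sqrt(1.0 - z * z)
    return (s * math.cos(phi), s * math.sin(phi), z)

def R(w):
    # Constant (L0) + linear lobe (L1) + pure quadratic term (L2).
    x, y, z = w
    return 0.5 + z + x * y

# --- Projection: average products of R with each basis function ---
rng = random.Random(7)
n = 200_000
R0 = 0.0
R1 = [0.0] * 3
R2 = [[0.0] * 3 for _ in range(3)]
for _ in range(n):
    w = uniform_sphere_sample(rng)
    r = R(w)
    R0 += r
    for i in range(3):
        R1[i] += r * w[i]
        for j in range(3):
            R2[i][j] += r * (w[i] * w[j] - (1.0 / 3.0 if i == j else 0.0))
R0 /= n
R1 = [c / n for c in R1]
R2 = [[c / n for c in row] for row in R2]

# --- Reconstruction: R0 + 3 R1.w + (15/2) w_i w_j R2^{ij} ---
def R_sh(w):
    lin = 3.0 * sum(R1[i] * w[i] for i in range(3))
    quad = 7.5 * sum(w[i] * w[j] * R2[i][j]
                     for i in range(3) for j in range(3))
    return R0 + lin + quad

# Compare at a few directions; agreement is limited only by the
# sampling noise in the estimated coefficients.
for w in [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0), (0.6, 0.8, 0.0)]:
    print(R(w), R_sh(w))
```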

Conclusion

Everything should be made as simple as possible, but no simpler.

Albert Einstein

We have shown how to redefine the SH basis functions for L0, L1 and L2 in a way that is simpler and more natural for many use cases. This definition removes all trigonometric functions, factors of \pi and square roots, and just leaves the rational constants required for normalisation. The process can be naturally extended to higher SH bands as required.

Although we’ve mentioned lighting a few times in this post, nothing here is lighting-specific. In the next post we’ll discuss a lighting problem: how best to convert from SH radiance (incoming light) to irradiance (outgoing light).