
A Visual Exploration of Neural Radiance Fields

May 2022

Neural Radiance Fields (NeRFs) provide a substantial step up in our ability to construct interactive, photorealistic 3D objects. The core of this capability comes from a coordinate-based neural representation of low-dimensional signals. The goal of this post is to build a strong intuition for what that means and for the trade-offs involved.

Volume Rendering

Before we dive into the details of how NeRF works, let us first get a better understanding of how 3D objects are represented. A “volume” is a dataset spanning three spatial dimensions, $V : \mathbb{R}^3 \rightarrow F$, and can include several “features” for each point in space - for example, a measure of density or texture. Volumes are not defined only by their surface structure (like meshes, the other common representation) but also by their interior structure, which is why they are a natural fit for semi-transparent objects like clouds, smoke, or liquids. In our case, let’s define the “interesting” features for our volume as a colour $RGB$ and a density $\sigma$.

Creating a 2D image from a volume (known as rendering) consists of computing the “radiance” - the reflection, refraction, and emittance of light from an object, from a source to a particular observer - which can then be used to produce an $RGB$ pixel grid showing what our model looks like.

A common technique for volume rendering (and the one used in NeRF) is Ray Marching. Ray Marching constructs, for each pixel, a ray $\mathbf{r}(t) = \mathbf{o} + t\mathbf{d}$, defined by an origin vector $\mathbf{o}$ and a direction vector $\mathbf{d}$. We then sample along this ray to get both a colour, $c_i$, and a density, $\sigma_i$, at each sample point.
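To make the ray construction concrete, here is a rough NumPy sketch of how per-pixel rays might be built from a pinhole camera and then sampled at linearly spaced depths (as in the 64-point linear sampling mentioned in the footnotes). The function names, camera model, and `near`/`far` bounds are illustrative assumptions rather than any particular implementation’s choices.

```python
import numpy as np

def get_rays(height, width, focal, cam2world):
    """One ray (origin o, direction d) per pixel of a pinhole camera."""
    i, j = np.meshgrid(np.arange(width), np.arange(height), indexing="xy")
    # Pixel directions in camera space (camera looks down the -z axis).
    dirs = np.stack([(i - width * 0.5) / focal,
                     -(j - height * 0.5) / focal,
                     -np.ones_like(i, dtype=np.float64)], axis=-1)
    # Rotate into world space; every ray starts at the camera position.
    rays_d = dirs @ cam2world[:3, :3].T
    rays_o = np.broadcast_to(cam2world[:3, 3], rays_d.shape)
    return rays_o, rays_d

def sample_along_rays(rays_o, rays_d, near=2.0, far=6.0, n_samples=64):
    """Evaluate r(t) = o + t * d at linearly spaced depths t."""
    t = np.linspace(near, far, n_samples)
    points = rays_o[..., None, :] + t[:, None] * rays_d[..., None, :]
    return points, t  # points: (height, width, n_samples, 3)
```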

To form an RGBA sample, a transfer function (from classical volume rendering [1]) is applied to these samples. This transfer function progressively accumulates contributions from the point closest to the eye to the point where the ray exits the volume. We can think of this discretely as:

$$c = \sum_{i=1}^{M} T_i \alpha_i c_i$$

where $T_i = \prod^{i-1}_{j=1} (1 - \alpha_j)$, $\alpha_i = 1 - e^{-\sigma_i \delta_i}$, and $\delta_i$ is the sampling distance.
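Here is a minimal sketch of that discrete compositing step for a single ray, assuming arrays of sampled colours, densities, and sample spacings (the names are mine, chosen to mirror the symbols above):

```python
import numpy as np

def composite(c, sigma, delta):
    """Accumulate colour along one ray: c = sum_i T_i * alpha_i * c_i.

    c:     (M, 3) sampled colours along the ray
    sigma: (M,)   sampled densities
    delta: (M,)   distances between consecutive samples
    """
    alpha = 1.0 - np.exp(-sigma * delta)  # opacity of each segment
    # T_i: how much light survives to reach sample i, i.e. the product of
    # (1 - alpha_j) over all samples j in front of it.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    weights = trans * alpha
    return (weights[:, None] * c).sum(axis=0)
```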

In essence, the transfer function encodes our common-sense understanding that something behind a dense, “non-transparent” object contributes less to an image than something behind a less dense, “transparent” one. The key requirement for the rendering component is that it is differentiable, and hence can be optimised over efficiently [2].

With that out of the way, the innovation behind NeRF has little to do with the rendering approach - instead, it is about how you capture a volume. How do you encode a volume so that it can be queried at any arbitrary point in space, while also capturing non-“matte”, view-dependent surfaces?

Volume Representation

Let us first consider a naive approach to encoding volumes - voxel grids. Voxel grids apply an explicit discretisation to the 3D space, chunking it up into cubes and storing the required metadata in an array (often referencing a material that includes colour, along with viewpoint-specific features which the ray marcher must take into account). The challenge with this internal representation is that it takes no advantage of any natural symmetries, and the storage and rendering requirements explode as $O(N^3)$. We also need to handle viewpoint-specific reflections and lighting separately in the rendering process, which is a challenge in itself.
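To make that scaling concrete, a quick back-of-the-envelope sketch, assuming a single 4-byte RGBA value per voxel (real grids often store considerably more per cell):

```python
# Rough storage cost of a dense voxel grid at increasing resolutions,
# assuming one 4-byte RGBA value per voxel.
for n in (128, 256, 512, 1024):
    print(f"{n}^3 grid: {n ** 3 * 4 / 2 ** 20:,.0f} MiB")
# 128^3 grid: 8 MiB
# 256^3 grid: 64 MiB
# 512^3 grid: 512 MiB
# 1024^3 grid: 4,096 MiB
```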

You can get a good feel for the performance impact by increasing the level of granularity in the Sine-wave Voxel Grid. I encourage you to compare this level of detail and fluidity to our NeRF Lego bulldozer.

Neural Radiance Fields (NeRF)

What is a volume if not a function mapping? Given that there are clear symmetries in any particular volume, we’d like a method that learns to exploit those symmetries. Densely connected neural networks have been found to suit this task well when given sufficient data.

Neural Radiance Fields [3] [4] [5] use densely connected ReLU neural networks (8 layers of 256 units) to represent this volume. The resulting network, $F_\theta$, outputs an $RGB\sigma$ value for each spatial position/viewing direction pair. Implicitly, this encodes all of the material properties that normally have to be manually specified, including lighting. Put more formally, we can see this as:

$$(\underbrace{x, y, z}_\text{Spatial}, \underbrace{\theta, \phi}_\text{Viewpoint}) \rightarrow F_{\theta} \rightarrow (\underbrace{r, g, b}_\text{Color}, \underbrace{\sigma}_\text{Density})$$
Figure 2: The NeRF multi-layer perceptron.
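As a rough sketch of $F_\theta$ (written in PyTorch here; the renderings mentioned in the footnotes used TensorFlow), a naive version feeds all five inputs in at the first layer. The full architecture also uses a skip connection and holds the viewing direction back, as discussed under “Held-out Viewing Directions” below.

```python
import torch
import torch.nn as nn

class SimpleNeRF(nn.Module):
    """Naive F_theta: all five inputs enter at the first layer.
    (The full model also uses a skip connection and holds the viewing
    direction back until the end; see "Held-out Viewing Directions".)"""
    def __init__(self, in_dim=5, hidden=256, depth=8):
        super().__init__()
        layers = [nn.Linear(in_dim, hidden), nn.ReLU()]
        for _ in range(depth - 1):
            layers += [nn.Linear(hidden, hidden), nn.ReLU()]
        self.trunk = nn.Sequential(*layers)
        self.head = nn.Linear(hidden, 4)  # (r, g, b, sigma)

    def forward(self, x):
        out = self.head(self.trunk(x))
        rgb = torch.sigmoid(out[..., :3])   # colours constrained to [0, 1]
        sigma = torch.relu(out[..., 3:])    # densities constrained to be >= 0
        return rgb, sigma
```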

Optimising

Putting these two components together - a compact, continuous volume estimator and a differentiable volume renderer - we have everything needed to construct a fully differentiable pipeline for optimising our model. We can then minimise the squared error between rendered and ground-truth pixels over our parameters.
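A sketch of what that optimisation loop might look like, assuming hypothetical `sample_ray_batch` and `render_rays` helpers that fetch a batch of rays with their ground-truth pixels and composite the model’s outputs along them (the learning rate and step count are illustrative):

```python
import torch

model = SimpleNeRF()  # from the sketch above
optimiser = torch.optim.Adam(model.parameters(), lr=5e-4)  # lr is illustrative

for step in range(200_000):
    # Hypothetical helpers: sample a batch of rays and their ground-truth
    # pixel colours, then render them by querying the model along each ray
    # and compositing (as in the earlier ray-marching sketches).
    rays_o, rays_d, target_rgb = sample_ray_batch()
    pred_rgb = render_rays(model, rays_o, rays_d)
    loss = ((pred_rgb - target_rgb) ** 2).mean()  # squared pixel error
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
```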

One of the most important things to remember is that there is no pre-training involved in basic NeRF models. There is no prior knowledge of the image content, nor any information about common materials or shapes. Instead, each NeRF model is optimised from scratch for each scene we want to synthesise. We can see the impact this has on both training time and data requirements in our training simulator.

Challenges

In addition to the extended training times and substantial data requirements, there are also a few common failure cases that should be highlighted.

Special Sauce

You may notice that the above examples don’t have the same clarity as the baked model. This is no accident: beyond training time alone, state-of-the-art performance when optimising these densely connected networks requires a few additional techniques.

Held-out Viewing Directions

Rather than adding the viewing direction to the input of the very first layer, it is best to hold the directions back from the earlier, spatial layers of the network. This reduces the number of view-dependent (often floating) artifacts that arise from prematurely optimising view-dependent features before the underlying spatial structure has been learned.
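To sketch what this looks like in code, the spatial trunk can predict density on its own, with the viewing direction concatenated only afterwards to predict colour. The class and argument names are mine; the 128-unit colour head follows the original architecture.

```python
import torch
import torch.nn as nn

class ViewDependentHead(nn.Module):
    """Density from spatial features only; colour from features + direction."""
    def __init__(self, hidden=256, view_dim=3, head=128):
        super().__init__()
        self.sigma = nn.Linear(hidden, 1)  # no view input: density is geometric
        self.colour = nn.Sequential(
            nn.Linear(hidden + view_dim, head), nn.ReLU(),
            nn.Linear(head, 3), nn.Sigmoid())

    def forward(self, features, view_dir):
        sigma = torch.relu(self.sigma(features))
        rgb = self.colour(torch.cat([features, view_dir], dim=-1))
        return rgb, sigma
```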

Fourier Transforms

In truth, the input to the neural network isn’t the raw positions and viewing directions; instead, a pre-processing step transforms each input into sinusoidal signals of exponentially increasing frequencies.

$$\begin{bmatrix} \sin(\mathbf{v}), \cos(\mathbf{v}) \\ \sin(2\mathbf{v}), \cos(2\mathbf{v}) \\ \vdots \\ \sin(2^{L-1}\mathbf{v}), \cos(2^{L-1}\mathbf{v}) \end{bmatrix}$$
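A minimal sketch of this encoding. The default $L = 10$ matches the value commonly used for spatial coordinates in the original paper (viewing directions use a smaller $L$), but it is an assumption here rather than something fixed by the formula:

```python
import numpy as np

def positional_encoding(v, L=10):
    """Map each coordinate of v to [sin(2^k v), cos(2^k v)] for k = 0..L-1."""
    freqs = 2.0 ** np.arange(L)            # 1, 2, 4, ..., 2^(L-1)
    angles = v[..., None] * freqs          # shape (..., dims, L)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(*v.shape[:-1], -1)  # shape (..., dims * 2L)
```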

This seemingly simple preprocessing trick leads to substantially better results and removes a tendency to “blur” the resulting images.

Why is this? The network architecture used by NeRF (a densely connected ReLU MLP) is otherwise incapable of modelling signals with fine detail, and fails to represent a signal’s spatial and temporal derivatives, even though these are essential to the physical signals being modelled. This is a similar realisation to the work of Sitzmann et al. [6], although approached through pre-processing rather than by changing the network’s activation function.

A ReLU MLP with Fourier features (of which positional encoding is one type) can represent these high-frequency functions in low-dimensional domains because the ReLU MLP acts as a dot product kernel, and the dot product of Fourier features is stationary.

Hierarchical Sampling

One of the first questions that likely came to mind when discussing classical volume rendering is why we use a uniform sampling scheme, when the vast majority of scenes contain large amounts of empty space (with low densities) and samples behind dense objects contribute diminishing value.

The original NeRF paper [3] handles this by learning two models simultaneously: a coarse-grained model that provides density estimates for particular position/viewpoint pairs, and a fine-grained model identical to the original model. The coarse-grained model’s output is then used to re-weight the sampling for the fine-grained model, which generally produces better results.
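A sketch of how the coarse model’s compositing weights (the $T_i \alpha_i$ terms from the rendering section) can be turned into a new set of sample depths via inverse-transform sampling. The function, argument names, and the 128-sample count are illustrative:

```python
import numpy as np

def resample_from_weights(t_coarse, weights, n_fine=128):
    """Draw extra sample depths where the coarse model placed most weight."""
    pdf = weights / (weights.sum() + 1e-8)  # normalise weights into a pdf
    cdf = np.cumsum(pdf)
    u = np.random.rand(n_fine)              # uniform draws in [0, 1)
    idx = np.searchsorted(cdf, u)           # invert the (piecewise) CDF
    return np.sort(t_coarse[np.clip(idx, 0, len(t_coarse) - 1)])
```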

Conclusion

With this article, you should have gained an overview of the original Neural Radiance Field paper and developed a deeper understanding of how these models work. As we have seen, Neural Radiance Fields offer a flexible framework for encoding 3D scenes for rendering, and while they have several shortcomings, there are significant extensions that both alleviate these issues and expand on the system proposed.

Footnotes

  1. Volume rendering
    Drebin, R.A., Carpenter, L. and Hanrahan, P., 1988. ACM Siggraph Computer Graphics, Vol 22(4), pp. 65—74. ACM New York, NY, USA.

  2. For the full details of how Ray Marching works, I’d recommend 1000 Forms of Bunny’s guide, and you can also see the code used for the TensorFlow renderings which uses a camera projection for acquiring rays and a 64 point linear sample.

  3. Nerf: Representing scenes as neural radiance fields for view synthesis
    Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R. and Ng, R., 2020. European conference on computer vision, pp. 405—421.

  4. Nerf in the wild: Neural radiance fields for unconstrained photo collections
    Martin-Brualla, R., Radwan, N., Sajjadi, M.S., Barron, J.T., Dosovitskiy, A. and Duckworth, D., 2021. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7210—7219.

  5. D-nerf: Neural radiance fields for dynamic scenes
    Pumarola, A., Corona, E., Pons-Moll, G. and Moreno-Noguer, F., 2021. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10318—10327.

  6. Implicit neural representations with periodic activation functions
    Sitzmann, V., Martel, J., Bergman, A., Lindell, D. and Wetzstein, G., 2020. Advances in Neural Information Processing Systems, Vol 33.