This is part of a series of posts breaking down the paper Plenoxels: Radiance Fields without Neural Networks, and providing (hopefully) well-annotated source code to aid in understanding.
The final code has been moved to its own repository at plenoxels-pytorch.
We continue looking at Plenoxels: Radiance Fields without Neural Networks. We have built our volumetric renderer; it is now time to start calculating loss from training images. We start off by trying out some test images, generated by our renderer itself, as a kind of control set.
The code for this article can be found here: Volumetric Rendering Code
Before we begin, however, let's quickly talk about some technical details of the code.
1. How is the camera basis calculated?
We wanted to simplify our calculations for creating the camera coordinate system. Usually, you'd be given a camera basis or a camera transformation matrix explicitly. To make it easier to generate scenes, and to avoid having to calculate an orthogonal basis by hand every time, our camera simply uses two parameters: look_at and camera_center. Together, these immediately give us a vector which points in the direction that the camera is looking, which is the Z-axis in the camera's coordinate system.
We now need to generate two orthogonal vectors from this to complete the coordinate system. First, we need to define the “up-vector”. In most situations, we want the camera to be upright; loosely, the world’s Z-axis should always be up.
In the calculation below, \(x_c\), \(y_c\), and \(z_c\) represent the three basis vectors of the camera, and \(z_w\) represents the world’s Z-axis.
\[z_c.y_c = 0 \\ y_c + \lambda z_c = z_w\]We need to find \(y_c\). We can simply deduce from the above that:
\[z_c.(z_w - \lambda z_c) = 0 \\ z_c.z_w - \lambda z_c.z_c = 0 \\ \lambda = \frac{z_c.z_w}{z_c.z_c} = \frac{z_c.z_w}{ {|z_c|}^2}\]Then we can write \(y_c\) as:
\[y_c = z_w - \frac{z_c.z_w}{ {|z_c|}^2} z_c\]This is what the first part of the camera_basis_from() function does.
Now that we have two axes, the third axis can be calculated by simply taking the cross product of \(z_c\) and \(y_c\) to get \(x_c\). Thus, \((x_c, y_c, z_c)\) constitute an orthogonal basis for the camera, with \(z_c\) pointing in the direction where the camera is looking.
The calculation is depicted in the figure below.
Note that we also normalise these vectors to unit length before returning the basis.
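To make this concrete, here is a minimal PyTorch sketch of the construction described above. The arguments and exact structure of camera_basis_from() in the repository may differ, so treat this as an assumption-laden outline rather than the actual implementation.

```python
import torch

def camera_basis_from(camera_center: torch.Tensor, look_at: torch.Tensor):
    """Sketch of the camera basis construction described above; the real
    camera_basis_from() in the repository may take different arguments."""
    # z_c points from the camera centre towards the look_at point.
    z_c = look_at - camera_center
    z_w = torch.tensor([0.0, 0.0, 1.0])  # the world's Z-axis ("up")

    # y_c = z_w - (z_c . z_w / |z_c|^2) z_c: the component of the world
    # Z-axis orthogonal to z_c, so the camera stays upright.
    y_c = z_w - (torch.dot(z_c, z_w) / torch.dot(z_c, z_c)) * z_c

    # x_c completes the basis via the cross product of z_c and y_c.
    x_c = torch.linalg.cross(z_c, y_c)

    # Normalise all three vectors to unit length before returning them.
    return [v / torch.linalg.norm(v) for v in (x_c, y_c, z_c)]

# Example: a camera at (0, -4, 1) looking at the origin.
# x_c, y_c, z_c = camera_basis_from(torch.tensor([0.0, -4.0, 1.0]), torch.zeros(3))
```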
2. How are the rays on screen calculated?
We have seen above how to calculate the camera basis. We need to now calculate where the focal plane is, because the viewing rays originating from the camera will intersect this plane. In fact, this plane will be important in deciding which viewing rays to actually fire. The normal to the focal plane is instantly given by \(z_c\), the camera's Z-axis. We want to render everything that lies within the bounds of the focal plane screen. This in turn depends upon the width and height of the screen. We can pick any dimensions: in the code, we have chosen a square of size \([-1, 1] \times [-1, 1]\), with a resolution of 100 samples in each direction.
For each of these points \((x_s, y_s)\), we want to translate from screen coordinates in \([-1, 1] \times [-1, 1]\) into world coordinates. This is done by simply scaling the camera's unit basis vectors \(x_c\) and \(y_c\) by the corresponding amounts.
One last thing remains: to translate this result to the actual focal plane, because the camera's basis (any basis, for that matter) is centered at \((0,0,0)\). For this we need the coordinates of the focal point. This is done by simply scaling the unit basis vector \(z_c\) by the focal length of the camera, and translating the result to the camera center \(X_c\). The calculation is summarised below; keep in mind that \(f\), \(x_s\), and \(y_s\) are scalars.
\[X(x_s,y_s) = x_s.x_c + y_s.y_c + (f.z_c + X_c)\]The calculation is visualised in the diagram below.
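As an illustration, here is a rough PyTorch sketch of this lifting step, assuming the unit basis vectors from the previous section; the helper name and loop structure are mine, not necessarily the repository's.

```python
import torch

def screen_points_in_world(camera_center, x_c, y_c, z_c, focal_length, resolution=100):
    """Sketch: lift every screen point (x_s, y_s) in [-1, 1] x [-1, 1] into world
    coordinates using X(x_s, y_s) = x_s * x_c + y_s * y_c + (f * z_c + X_c)."""
    # The focal point sits at X_c + f * z_c; the screen is centred on it.
    focal_point = camera_center + focal_length * z_c

    xs = torch.linspace(-1.0, 1.0, resolution)
    ys = torch.linspace(-1.0, 1.0, resolution)

    points = []
    for x_s in xs:
        for y_s in ys:
            points.append(x_s * x_c + y_s * y_c + focal_point)
    # Each viewing ray then runs from camera_center through one of these points.
    return torch.stack(points)  # shape: (resolution * resolution, 3)
```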
3. How are the RGB values calculated from the spherical harmonics?
In the code, the convention is to use a scale of \([0.0, 1.0]\) for each colour channel. Depending upon the coefficients and the number of ray samples, the calculated colour of any point on the view screen can routinely lie outside \([0,1]\). The paper does not elaborate much on how to clamp these values, beyond using a ReLU. However, the original NeRF paper uses a sigmoid function, which restricts everything to \([0.0, 1.0]\), and that is what we use here: it is easy to implement, differentiable, and gives good results.
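A tiny sketch of this last step; the raw_rgb values here are made up and simply stand in for whatever the ray accumulation produces.

```python
import torch

# raw_rgb stands in for the colour a ray accumulates from the spherical
# harmonic coefficients; it can easily fall outside [0, 1].
raw_rgb = torch.tensor([1.7, -0.3, 0.45])

# A sigmoid squashes every channel into (0, 1) and is differentiable, so
# gradients still flow back to the SH coefficients during optimisation.
rgb = torch.sigmoid(raw_rgb)
print(rgb)  # all three channels now lie strictly between 0 and 1
```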
In preparation for making our data structures more flexible, as well as more amenable to PyTorch's optimisation routines, we will need to refactor our data structure. So far, we have used a simple 3D tensor whose elements represent the voxels. We will continue to maintain this structure for now, but we will introduce a more pointer-based structure to easily reference and store the voxels that our viewing rays actually intersect during each training cycle, because those are the only parameters that we will be optimising every time we train; the other voxel values are frozen.
The renderer will thus need refactoring; however, we will continue to use the original renderer to generate ground truth that we can use to validate that the refactored renderer still generates correct images.
The new data structure is shown below; the structure of one viewing ray is elaborated in this diagram.
Here is a quick explanation of the structures involved.
- The voxel_pointers structure holds one (start, end) tuple per ray.
- Each (start, end) tuple corresponds to one ray and indexes into the voxel_positions structure, which stores the voxels intersected by that ray.
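As a rough sketch (the names follow the diagram above, but the actual layout in the repository may differ), the two structures could fit together like this:

```python
# Sketch only: the actual layout in the repository may differ.
voxel_positions = []  # flat list of voxel parameters (opacity + 27 SH coefficients each)
voxel_pointers = []   # one (start, end) tuple per ray, indexing into voxel_positions

def register_ray(ray_voxels):
    """Record the voxels intersected by one ray and remember their (start, end) span."""
    start = len(voxel_positions)
    voxel_positions.extend(ray_voxels)
    end = len(voxel_positions)
    voxel_pointers.append((start, end))

# Rendering or optimising a single ray then only touches its own slice:
#   start, end = voxel_pointers[ray_index]
#   ray_params = voxel_positions[start:end]
```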
So far, our renderer has simply iterated over a contiguous rectangular area and rendered each point in that area. For training, we will select a stochastic set of rays (5000 in the paper, about 500-1000 in our implementation) and use those to calculate the loss. Thus, our renderer should be prepared to render only specific rays. Some refactoring on this front is also done.
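A minimal sketch of the stochastic ray selection, assuming a 100 x 100 screen and a batch size in our usual range:

```python
import torch

# Sketch: choose a stochastic batch of rays for one training step.
# 100 x 100 screen points give 10,000 candidate rays; we sample roughly
# 500-1000 of them per step (the paper uses 5000).
num_rays, batch_size = 100 * 100, 1000
ray_indices = torch.randperm(num_rays)[:batch_size]

# The refactored renderer is then asked to render only these specific rays
# instead of iterating over the whole rectangular screen area.
```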
Let’s test the optimisation on images generated with the exact renderer which will be used in training. The training scheme is as follows:
- PlenoxelModel is our custom PyTorch model. This will hold our parameters, which are essentially a flat list of the voxels intersected by about 1000 viewing rays. The parameters to be optimised are the opacity and the 27 spherical harmonic coefficients in each voxel.

These results use just one image, so the reconstruction is correct only from that specific viewpoint.
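The following is a minimal sketch of what such a PlenoxelModel and a single optimisation step might look like; the constructor arguments, the render_fn hook, the optimiser, and the MSE loss are assumptions made for illustration, not necessarily what the repository does.

```python
import torch

class PlenoxelModel(torch.nn.Module):
    """Sketch of the model described above; the real constructor arguments
    and renderer hook in the repository may differ."""

    def __init__(self, num_voxels, render_fn):
        super().__init__()
        # One opacity value plus 27 spherical harmonic coefficients per voxel,
        # covering only the voxels intersected by the sampled rays.
        self.voxel_params = torch.nn.Parameter(torch.rand(num_voxels, 28))
        self.render_fn = render_fn  # hypothetical hook into the volumetric renderer

    def forward(self, rays):
        # Render the given rays using the current (trainable) voxel parameters.
        return self.render_fn(rays, self.voxel_params)

# Hypothetical training step (names and optimiser choice are illustrative only):
#   model = PlenoxelModel(num_voxels, render_fn)
#   optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-2)
#   loss = torch.nn.functional.mse_loss(model(sampled_rays), ground_truth_pixels)
#   optimizer.zero_grad(); loss.backward(); optimizer.step()
```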
There is one more thing we should address: the code, as it stood, had some bugs that made the rendering look wonky from certain angles, and not work at all from others. Sometimes bugs may not become apparent until much later; having said that, here are some debugging tips I found helpful. Note that most, if not all, of these are general practices for debugging any code, not just ML code. They become especially important, though, when you are debugging a rendering error which involves thousands of pixels.
- If you do not see grad_fn in the tensor description, or you simply see something like tensor(XXX., requires_grad=True) in the middle of your calculations, go back to the previous step and see which calculation was not done using PyTorch's APIs. In particular, do not wrap outputs using torch.tensor(...).
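A small illustration of this symptom, using a hypothetical SH-coefficient tensor:

```python
import torch

sh_coefficients = torch.rand(27, requires_grad=True)

# Staying inside PyTorch ops keeps the autograd graph intact: the result
# carries a grad_fn (e.g. grad_fn=<SigmoidBackward0>).
colour_ok = torch.sigmoid(sh_coefficients.sum())
print(colour_ok)

# Wrapping an intermediate result in torch.tensor(...) silently detaches it:
# the new tensor has no grad_fn, so nothing upstream ever receives gradients.
colour_broken = torch.tensor(colour_ok.item())
print(colour_broken)
```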
We have come quite far in our implementation of the paper. We will progress to training using multiple training images in the sequel and (probably) add TV regularisation.