NeRD: Neural Reflectance Decomposition from Image Collections

Overview of the method. NeRD decomposes a scene from multiple input images into a neural volume with explicit BRDFs. The information can be extracted to traditional textured meshes.

Abstract

Decomposing a scene into its shape, reflectance, and illumination is a challenging but essential problem in computer vision and graphics. This problem is inherently more challenging when the illumination is not a single light source under laboratory conditions but is instead an unconstrained environmental illumination. Though recent work has shown that implicit representations can be used to model the radiance field of an object, these techniques only enable view synthesis and not relighting. Additionally, evaluating these radiance fields is resource and time-intensive. By decomposing a scene into explicit representations, any rendering framework can be leveraged to generate novel views under any illumination in real-time. NeRD is a method that achieves this decomposition by introducing physically-based rendering to neural radiance fields. Even challenging non-Lambertian reflectances, complex geometry, and unknown illumination can be decomposed to high-quality models.

Publication
In ArXiv

Introduction

NeRD is a novel method that can decompose image collections from multiple views taken under varying or fixed illumination conditions. The object can be rotated, or the camera can turn around the object. The result is a neural volume with an explicit representation of the appearance and illumination in the form of the BRDF and Spherical Gaussian (SG) environment illumination.

The method is based on the general structure of NeRF. However, NeRF encodes the scene to an implicit BRDF representation where a Multi-Layer-Perceptron (MLP) is queried for outgoing view directions at every point. Extracting information from NeRF is therefore not easily done, and the inference time for novel views takes around 30 seconds. Also, NeRF is not capable of relighting an object under any illumination. By introducing physically-based representations for lighting and appearance, NeRD can relighting an object, and information can be extracted from the neural volume. After our extraction process, the result is a regular texture mesh that can be rendered in real-time. See our results where we provide a web-based interactive renderer.

Method

Decomposing the scene requires that the integral over the hemisphere from the rendering equation is decomposed into its parts. Here, we use a simplified version without self-emittance. $$L_o(x,\omega_o) = \int_\Omega L_i(x,\omega_i) f_r(x,\omega_i,\omega_o) (\omega_i \cdot n) d\omega_i$$ Here, $L_o$ is the outgoing radiance for a point $x$ in the direction $\omega_o$. This radiance is calculated by integrating all influences over the hemisphere $\Omega$, which are based on the incoming light $L_i$ for each direction $\omega_i$. The surface behavior is expressed as the BRDF $f_r$, which describes how incoming light $\omega_i$ is directed to the outgoing direction $\omega_o$. Lastly, a cosine term is used $(\omega_i \cdot n)$, which reduces the received light based on the angle towards the light source.

The inverse of this integral is highly ambiguous, and we use several approximations to solve it. We do not use any interreflections or shadowing, which means that we do not compute the incoming radiance recursively. Additionally, our illumination is expressed as SG, which reduces a full continuous integral to - in our case - 24 evaluation of the environment SGs.

Steps of the query process. In a) the volume is constructed by tracing rays from each camera into the scene. Then samples are placed along with the rays in b), and based on the density at each sampling point, additional samples are placed in c). The samples are then evaluated into a BRDF, which is re-rendered using the jointly optimized illumination in d).

Inspired by NeRF the method uses two MLPs, which encode the each position in the volume $\textbf{x} = (x,y,z)$ to a volume density $\sigma$ and a color or BRDF parameters. Figure 1 shows an overview of this optimization process.

Results

Click the images for an interactive 3D visualization.

Real-world

Real-world (Preliminary in the wild)

The images from the Statue of Liberty are collected from Flickr, Unsplash, Youtube and Vimeo videos. In total about 120 images are used for training from various phones, cameras and drones. Overall even the COLMAP registration is not perfect due to the simplistic shared camera model.

Synthetic Examples
Mark Boss
Mark Boss
Ph.D. Student

I’m a Ph.D. student at the University of Tübingen with research interests in the intersection of machine learning and computer graphics.

Related