Fundamentals
These papers address more fundamental problems of view synthesis with NeRF methods.
Linning Xu, Yuanbo Xiangli, Sida Peng, Xingang Pan, Nanxuan Zhao, Christian Theobalt, Bo Dai, Dahua Lin
Figure 1. We perform large urban scene rendering with a novel grid-guided neural radiance field. An
example of our target large urban scene is shown on the left, which spans over 2.7 km² of ground area
captured by over 5k drone images. We show that the rendering results from NeRF-based methods tend to
display noisy artifacts when adapting to large-scale scenes with high-resolution feature grids. Our
proposed two-branch model combines the merits from both approaches and achieves photorealistic novel
view renderings with remarkable improvements over existing methods. Both branches gain significant
enhancements over their individual baselines. (Project page: https://city-super.github.io/gridnerf)
The authors propose a compact multiresolution ground feature plane representation combined with a
positional encoding-based NeRF branch to handle scenes with complex geometry and texture. This integration
allows for efficient rendering of photorealistic novel views with fine details.
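As a rough illustration of how such a two-branch input could be assembled, the sketch below bilinearly samples a learnable ground feature plane and concatenates it with a standard frequency positional encoding. The shapes, resolutions, and function names are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (assumptions, not the authors' code): combine a 2D ground feature
# plane with a frequency positional encoding, as in a grid-guided NeRF branch.
import torch
import torch.nn.functional as F

def positional_encoding(x, num_freqs=6):
    # x: (N, 3) points; returns sin/cos features at octave frequencies.
    freqs = 2.0 ** torch.arange(num_freqs, device=x.device)          # (F,)
    angles = x[..., None] * freqs                                     # (N, 3, F)
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1).flatten(1)

def sample_ground_plane(plane, xyz):
    # plane: (1, C, H, W) feature plane spanning the xy ground extent in [-1, 1].
    # xyz:   (N, 3) query points; only x and y index the plane.
    grid = xyz[None, None, :, :2]                                     # (1, 1, N, 2)
    feats = F.grid_sample(plane, grid, align_corners=True)            # (1, C, 1, N)
    return feats[0, :, 0].t()                                         # (N, C)

C, H, W = 32, 256, 256
plane = torch.nn.Parameter(torch.randn(1, C, H, W) * 0.01)
xyz = torch.rand(1024, 3) * 2 - 1                                     # points in [-1, 1]^3

grid_feat = sample_ground_plane(plane, xyz)                           # (1024, 32)
pe_feat = positional_encoding(xyz)                                    # (1024, 36)
mlp_input = torch.cat([grid_feat, pe_feat], dim=-1)                   # fed to the NeRF MLP
```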
Liao Wang, Qiang Hu, Qihan He, Ziyu Wang, Jingyi Yu, Tinne Tuytelaars, Lan Xu, Minye Wu
Figure 1. Our proposed ReRF utilizes a residual radiance field and a global MLP to enable highly
compressible and streamable radiance field modeling. Our ReRF-based codec scheme and streaming player
give users a rich interactive experience.
ReRF extends NeRF to handle long-duration dynamic scenes in real-time by modeling residuals between
adjacent timestamps in a spatial-temporal feature space with a global coordinate-based MLP feature
decoder. The method uses a compact motion grid and employs a sequential training scheme. ReRF enables
online streaming of free-viewpoint videos (FVVs) of dynamic scenes.
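The sketch below illustrates the general residual idea under simple assumptions (dense residual grids, a toy shared decoder); it is not the ReRF codec itself.

```python
# Minimal sketch (assumptions, not the ReRF implementation): represent frame t's
# feature grid as the previous grid plus a small residual, and decode features
# with a single global MLP shared across all frames.
import torch
import torch.nn as nn

class GlobalDecoder(nn.Module):
    def __init__(self, feat_dim=8):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(),
                                 nn.Linear(64, 4))      # -> (density, rgb)

    def forward(self, feats):
        return self.net(feats)

def reconstruct_grid(base_grid, residuals, t):
    """Accumulate residuals up to frame t: grid_t = grid_0 + sum_{i<=t} res_i."""
    grid = base_grid.clone()
    for i in range(t + 1):
        grid = grid + residuals[i]
    return grid

feat_dim, res = 8, 64
base_grid = torch.zeros(feat_dim, res, res, res)
# Residuals between adjacent timestamps are small and therefore highly compressible.
residuals = [torch.zeros_like(base_grid).normal_(0, 1e-3) for _ in range(5)]

decoder = GlobalDecoder(feat_dim)
grid_t = reconstruct_grid(base_grid, residuals, t=3)
feature = grid_t[:, 10, 20, 30]          # query one voxel's feature
density_rgb = decoder(feature)            # decode with the global MLP
```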
Sicheng Li, Hao Li, Yue Wang, Yiyi Liao, Lu Yu
Figure 1. Illustration. We exploit smooth viewpoint trajectory to accelerate NeRF rendering, achieved by
performing volume rendering at a low resolution and recovering the target image guided by multiple
viewpoints. Our method enables fast rendering with a low memory footprint.
NeRF-MVS improves the efficiency of NeRF rendering by exploiting keyframe prior knowledge, allowing the information
of earlier rendered images to influence later ones. Moreover, the use of a 2D renderer decreases memory
consumption and overhead.
Lingzhi Li, Zhen Shen, Zhongshu Wang, Li Shen, Liefeng Bo
Figure 1. Our compression pipeline realizes 100X compression rate while maintaining the rendering
quality of the original volumetric model.
The VQRF framework compresses volume-grid-based radiance fields to overcome their memory and disk space
issues. The framework includes voxel pruning, trainable vector quantization, and an efficient joint
tuning strategy, achieving strong generalization with negligible loss of visual quality.
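A minimal sketch of the vector-quantization step follows, with assumed tensor shapes and a straight-through estimator standing in for the paper's trainable quantization; the pruning and joint tuning stages are omitted.

```python
# Minimal sketch (assumed shapes, not the VQRF code): quantize per-voxel feature
# vectors against a small learned codebook so that only integer indices plus the
# codebook need to be stored.
import torch

def vector_quantize(voxel_feats, codebook):
    # voxel_feats: (N, D) features of the voxels that survive pruning.
    # codebook:    (K, D) learned code vectors.
    dists = torch.cdist(voxel_feats, codebook)           # (N, K) pairwise distances
    indices = dists.argmin(dim=1)                         # (N,) nearest code per voxel
    quantized = codebook[indices]                         # (N, D) decoded features
    # Straight-through estimator keeps gradients flowing to the original features.
    quantized = voxel_feats + (quantized - voxel_feats).detach()
    return quantized, indices

N, D, K = 100_000, 12, 256
feats = torch.randn(N, D)
codebook = torch.nn.Parameter(torch.randn(K, D))
quantized, indices = vector_quantize(feats, codebook)
# Storage after quantization: N one-byte indices plus a K x D codebook,
# instead of N x D full-precision features.
```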
Shaowen Xie, Hao Zhu, Zhen Liu, Qi Zhang, You Zhou, Xun Cao, Zhan Ma
Figure 1. PSNR of various INRs on 2D image fitting over different training epochs.
DINER solves the spectral bias in implicit neural representation by rearranging input coordinates to
enable improved mapping of the signal to a more uniform distribution. The method demonstrated improved
reconstruction quality and speed when compared with other state-of-the-art algorithms.
Jiawei Yang, Marco Pavone, Yue Wang
Figure 3. Illustration of occlusion regularization. We show 3 training views (solid rectangles) and 2
novel views (dashed rectangles) rendered by a frequency-regularized NeRF. The floaters in the novel views
appear to be near-camera dense fields in the training views (dashed circles) so that we can penalize them
directly without the need for costly novel-view rendering.
FreeNeRF is a simple baseline that surpasses previous state-of-the-art methods in neural view synthesis
performance with sparse inputs. This is achieved by introducing two regularization terms at no additional
computational cost, aimed at controlling the frequency range of the inputs and penalizing near-camera density fields.
The proposed technique presents an effective approach for low-data regimes of neural radiance fields
training.
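A minimal sketch of the two kinds of regularization described above is shown below; the schedule, sample counts, and weighting are illustrative assumptions rather than the paper's exact formulas.

```python
# Minimal sketch of two FreeNeRF-style regularizers (details are assumptions):
# (1) mask out high-frequency positional-encoding bands early in training,
# (2) penalize density close to the camera along each ray.
import torch

def frequency_mask(num_freqs, step, total_steps):
    """Linearly expose more frequency bands as training progresses."""
    visible = int(num_freqs * min(step / total_steps, 1.0)) + 1
    mask = torch.zeros(num_freqs)
    mask[:visible] = 1.0
    return mask  # multiplied element-wise onto the per-frequency encoding

def occlusion_regularization(densities, num_near_samples=10):
    """Penalize density of the first samples along each ray (near the camera).

    densities: (num_rays, num_samples) predicted sigma values, ordered near-to-far.
    """
    return densities[:, :num_near_samples].mean()

num_freqs, step, total_steps = 10, 2_000, 20_000
mask = frequency_mask(num_freqs, step, total_steps)       # only low bands active early on
densities = torch.rand(1024, 128)
loss_occ = occlusion_regularization(densities)
# total_loss = photometric_loss + lambda_occ * loss_occ   (lambda_occ is a hyperparameter)
```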
Ziyu Wan, Christian Richardt, Aljaž Božič, Chao Li, Vijay Rengarajan, Seonghyeon Nam, Xiaoyu Xiang,
Tuotuo Li, Bo Zhu, Rakesh Ranjan, Jing Liao
Figure 1. Our framework efficiently learns the neural duplex radiance field from a NeRF model for
high-quality real-time view synthesis.
NeRF to Mesh (N2M) converts NeRF models to mesh for real-time rendering. The representation uses a duplex
mesh and screen-space convolution for fast rendering. Furthermore, a multi-view distillation optimization
strategy is used to improve performance.
Zhaoshuo Li, Thomas Müller, Alex Evans, Russell H. Taylor, Mathias Unberath, Ming-Yu Liu, Chen-Hsuan Lin
Figure 1. We present Neuralangelo, a framework for high-fidelity 3D surface reconstruction from RGB
images using neural volume rendering, even without auxiliary data such as segmentation or depth. Shown
in the figure is an extracted 3D mesh of a courthouse.
Neuralangelo synthesizes high-quality 3D models from images by first learning a multi-resolution hash-grid
implicit function representing the scene geometry, then using a smoothed, coarse-to-fine representation to
improve the optimization of the hash grid.
Radu Alexandru Rosu, Sven Behnke
Figure 1. Given multi-view images, we recover both high quality geometry as an implicit SDF and
appearance which can be rendered in real-time.
PermutoSDF proposes a permutohedral lattice implicit surface representation that can recover
high-frequency geometric details at a lower cost than voxel hashing. The underlying
approach uses hybrid density and SDF optimization for high-quality surface reconstruction from RGB images.
Ze-Xin Yin, Jiaxiong Qiu, Ming-Ming Cheng, Bo Ren
Figure 1. (a) Though Mip-NeRF 360 can handle unbounded scenes, it still suffers from reflective
surfaces, as the virtual images violate the multi-view consistency, which is of vital importance to
NeRF-based methods. (b) Our method can help conventional NeRF-like methods learn the virtual images with
little extra cost.
MS-NeRF proposes a multi-space approach to address the problem of blurry or distorted renderings caused by
reflective objects in existing NeRF methods. By using feature fields in parallel subspaces, MS-NeRF
shows better performance with minimal additional computational overhead for training and inference.
Fernando Rivas-Manzaneque, Jorge Sierra-Acosta, Adrian Penate-Sanchez, Francesc Moreno-Noguer, Angela
Ribeiro
Figure 1. NeRFLight is able to double the FPS/MB ratio of the second best method while obtaining similar
quality metrics to state-of-the-art.
The authors introduce a lightweight neural radiance field, NeRFLight, which separates the density
field into regions. A different decoder is employed to model the density field of each region while
sharing the same feature grid. This architecture produces a more compact representation than existing
methods.
Youngho Yoon, Kuk-Jin Yoon
Figure 1. Cross-guided optimization between single image superresolution and radiance fields. They
complement weaknesses of one another with their respective strengths by using the SR update module,
rendered train-view RGBs, and uncertainty maps.
The proposed method uses a cross-guided optimization framework that performs multi-view image
super-resolution (MVSR) on train-view images during the optimization process of radiance fields. This
iterative process improves multi-view consistency and high-frequency details, which ultimately leads to
better performance in high-resolution novel view synthesis. The method outperforms existing methods on
various benchmark datasets.
Zhihao Liang, Zhangjin Huang, Changxing Ding, Kui Jia
Figure 2. Overview of HelixSurf: Helix-shaped neural implicit Surface learning. HelixSurf integrates the
neural implicit surface learning (cf. Section 4.1) and PatchMatch based MVS (cf. Section 4.2) in a
robust and efficient manner. We optimize HelixSurf with an iterative intertwined regularization, which
uses the intermediate prediction from one strategy as guidance to regularize the learning/optimization
of the other one; given that MVS predictions are less reliable for textureless surface areas, we
additionally devise a scheme that regularizes the learning on such areas by leveraging the homogeneity
per superpixel in observed multi-view images (cf. Section 4.1.1). We also propose a scheme for point
sampling along rays (cf. Section 4.3), which significantly improves the efficiency. At the inference stage
of HelixSurf, we conduct grid sampling to query the learned SDF values at sampled points and run
Marching Cubes to get the reconstruction results.
HelixSurf combines traditional multi-view stereo with neural implicit surface learning for the
reconstruction of complex scene surfaces, exploiting the complementary benefits of the two methods. The
authors show favorable results with orders-of-magnitude faster computation times, even without relying on
auxiliary training data.
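The inference step described in the caption (grid sampling of the learned SDF followed by Marching Cubes) is a generic recipe; a minimal sketch with an analytic SDF standing in for the learned network might look like this.

```python
# Minimal sketch (generic, not HelixSurf-specific): query an SDF on a dense grid
# and extract the zero level set with Marching Cubes.
import torch
from skimage import measure

def extract_mesh(sdf_fn, resolution=128, bound=1.0):
    # Build a dense grid of query points in [-bound, bound]^3.
    xs = torch.linspace(-bound, bound, resolution)
    grid = torch.stack(torch.meshgrid(xs, xs, xs, indexing="ij"), dim=-1)   # (R, R, R, 3)
    with torch.no_grad():
        sdf = sdf_fn(grid.reshape(-1, 3)).reshape(resolution, resolution, resolution)
    spacing = (2 * bound / (resolution - 1),) * 3
    verts, faces, normals, _ = measure.marching_cubes(sdf.numpy(), level=0.0,
                                                      spacing=spacing)
    verts -= bound  # shift from [0, 2*bound] back to [-bound, bound]
    return verts, faces, normals

# Example with an analytic SDF of a radius-0.5 sphere standing in for the network.
sphere_sdf = lambda p: p.norm(dim=-1) - 0.5
verts, faces, normals = extract_mesh(sphere_sdf, resolution=64)
print(verts.shape, faces.shape)
```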
Zhijie Wu, Yuhe Jin, Kwang Moo Yi
Figure 1. Teaser – We propose neural Fourier filter bank to perform spatial and frequency-wise
decomposition jointly, inspired by wavelets. Our method provides significantly improved reconstruction
quality given the same computation and storage budget, as represented by the PSNR curve and the error
image overlay. Relying only on space partitioning without frequency resolution (InstantNGP) or frequency
encodings without space resolution (SIREN) provides suboptimal performance and convergence. Simply
considering both (ModSine) enhances scalability when applied to larger scenes, but not in terms of
quality and convergence.
NFFB is a Fourier decomposition inspired neural field that separates scene signals into spatial and
frequency components. It outperforms other state-of-the-art models using this paradigm in terms of model
compactness and convergence speed across 2D image fitting, 3D shape reconstruction and neural radiance
fields.
Andreas Meuleman, Yu-Lun Liu, Chen Gao, Jia-Bin Huang, Changil Kim, Min H. Kim, Johannes Kopf
Figure 2. Space parameterization. (a) NDC, used by NeRF for forward-facing scenes, maps a frustum to a
unit cube volume. While a sensible approach for forward-facing cameras, it is only able to represent a
small portion of a scene as the frustum cannot be extended beyond a field of view of 120° or so without
significant distortion. (b) Mip-NeRF360's space contraction squeezes the background and fits the entire
space into a sphere of radius 2. It is designed for inward-facing 360° scenes and cannot scale to long
trajectories. (c) Our approach allocates several radiance fields along the camera trajectory. Each
radiance field maps the entire space to a cube (Equation (5)) and, each having its own center for
contraction (Equation (7)), the high-resolution uncontracted space follows the camera trajectory and
our approach can adapt to any camera path.
The authors present an algorithm for reconstructing the radiance field of a large-scale scene from a
single casually captured video. To address the size of the scene and difficulty of camera pose estimation,
the method uses local radiance fields trained with frames within a temporal window in a progressive
manner. The method is shown to be more robust than existing approaches and performs well even under
moderate pose drifts.
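A minimal sketch of a per-field contraction of unbounded space into a cube around a local center, in the spirit of the caption, is given below; the exact formula is my own stand-in, not Equations (5) and (7) from the paper.

```python
# Minimal sketch (my own formulation, not the paper's equations): contract unbounded
# coordinates into a cube of side 4 around a per-field center, keeping the region
# near the center undistorted.
import torch

def contract_to_cube(x, center):
    """Map R^3 into (-2, 2)^3: identity inside the unit cube around `center`,
    and a 2 - 1/|x| style squash outside, using the L-infinity norm."""
    y = x - center
    norm = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-9)   # L-inf norm per point
    squashed = (2.0 - 1.0 / norm) * (y / norm)
    return torch.where(norm <= 1.0, y, squashed)

centers = torch.tensor([[0.0, 0.0, 0.0],
                        [3.0, 0.0, 0.0]])        # one center per local radiance field
points = torch.tensor([[0.2, 0.1, 0.0],
                       [10.0, 0.0, 0.0]])
for c in centers:
    print(contract_to_cube(points, c))
```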
Zhiqin Chen, Thomas Funkhouser, Peter Hedman, Andrea Tagliasacchi
Figure 1. Teaser – We present a NeRF that can run on a variety of common devices at interactive frame
rates.
MobileNeRF introduces a new representation for NeRFs that relies on the standard polygon rasterization
pipeline instead of traditional ray marching, enabling fast NeRF rendering on common hardware.
Jaehoon Choi, Dongki Jung, Taejae Lee, Sangwook Kim, Youngdong Jung, Dinesh Manocha, Donghwan Lee
Figure 1. Example reconstruction results collected from a smartphone in the wild. (a) Data acquisition
setup. (b) Images captured from a smartphone. (c) A reconstructed mesh. (d) A novel view of textured
mesh. Our proposed method can reconstruct a high-quality geometric mesh with a visually realistic
texture.
The authors propose a pipeline to capture textured meshes using a smartphone by incorporating structure
from motion and neural implicit surface reconstruction. The method allows for high-quality mesh
reconstruction with improved texture synthesis and does not require in-lab environments or accurate masks.
Mingfang Zhang, Jinglu Wang, Xiao Li, Yifei Huang, Yoichi Sato, Yan Lu
Figure 1. We propose the Structural Multiplane Image (S-MPI) representation to bridge the tasks of
neural view synthesis and 3D reconstruction. It consists of a set of posed RGBα images with geometries
approximating the 3D scene. The scene-adaptive S-MPI overcomes the critical limitations of standard MPI,
e.g., discretization artifacts (D) and repeated textures (R), and achieves a better depth map compared
with the previous planar reconstruction method, PlaneFormer.
S-MPI extends MPI by using compact and expressive structures. The authors propose a transformer-based
network to predict S-MPI layers, their masks, poses, and RGBA contexts, handling non-planar regions as a
special case. Experiments show superior performance over MPI-based methods for view synthesis and planar
reconstruction.
Edoardo Mello Rella, Ajad Chhatkuli, Ender Konukoglu, Luc Van Gool
Figure 1: Vector Field (VF) visualization on 2D surfaces. Each column shows the shape to represent in
black, and a zoomed in sample reconstruction of the surface inside the red box. VF can represent shapes
similar to SDF or binary occupancy, but also open surfaces. As the surface normals are directly encoded,
VF can reconstruct very sharp angles through additional regularization when needed.
This paper introduces a new 3D implicit field representation called Vector Field (VF) that encodes unit
vectors in 3D space. Each vector is directed at the closest point on the surface, so the surface normal is
encoded directly; this yields superior accuracy in representing both open and closed shapes, outperforming
other standard methods on several datasets including ShapeNet.
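A minimal sketch of the representation itself, computed here from a sampled point cloud with a k-d tree, is shown below; the paper learns VF with a network, so this discretization is only illustrative.

```python
# Minimal sketch (an assumed discretization, not the paper's training setup):
# compute the Vector Field at query points as unit vectors pointing to the
# closest point of a sampled surface.
import numpy as np
from scipy.spatial import cKDTree

def vector_field(queries, surface_points):
    """queries: (N, 3); surface_points: (M, 3). Returns (N, 3) unit vectors."""
    tree = cKDTree(surface_points)
    dists, idx = tree.query(queries)                        # nearest surface point
    direction = surface_points[idx] - queries               # vector toward the surface
    dists = np.maximum(dists, 1e-9)[:, None]
    return direction / dists                                # unit vectors (the VF)

# Open-surface example: a flat square patch in the z = 0 plane.
u, v = np.meshgrid(np.linspace(-1, 1, 50), np.linspace(-1, 1, 50))
patch = np.stack([u.ravel(), v.ravel(), np.zeros(u.size)], axis=1)

queries = np.array([[0.0, 0.0, 0.5], [0.0, 0.0, -0.3]])
print(vector_field(queries, patch))    # roughly [0, 0, -1] and [0, 0, 1]
```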
Baorui Ma, Junsheng Zhou, Yu-Shen Liu, Zhizhong Han
Figure 2. Visualization of level sets on a cross section. We pursue better gradient consistency in a
field learned from 3D point clouds in (a) and multi-view images in (b). We minimize our level set
alignment loss with NeuralPull in (a) and NeuS in (b), which leads to more accurate SDFs with better
parallelism of level sets and less artifacts in empty space.
The parallelism of level sets is a key factor affecting the inference accuracy of neural SDFs. Toward
better gradient consistency, this work improves the accuracy of neural signed distance functions learned
from point clouds or multi-view images using an adaptive level set alignment loss that constrains gradients
at queries and their projections.
Vishwanath Saragadam, Daniel LeJeune, Jasper Tan, Guha Balakrishnan, Ashok Veeraraghavan, Richard G.
Baraniuk
Figure 1. Wavelet implicit neural representation (WIRE). We propose a new nonlinearity for implicit
neural representations (INRs) based on the continuous complex Gabor wavelet that has high representation
capacity for visual signals. The top row visualizes two commonly used nonlinearities: SIREN with
sinusoidal nonlinearity and Gaussian nonlinearity, and WIRE that uses a continuous complex Gabor
wavelet. WIRE benefits from the frequency compactness of sine, and spatial compactness of a Gaussian
nonlinearity. The bottom row shows error maps for approximating an image with strong edges. SIREN
results in global ringing artifacts while Gaussian nonlinearity leads to compact but large error at
edges. WIRE produces results with the smallest and most spatially compact error. This enables WIRE to
learn representations rapidly and accurately, while being robust to noise and undersampling of data.
WIRE uses complex Gabor wavelets which favorably combine good properties for image processing tasks that
use implicit neural representation such as image denoising, image inpainting, and super-resolution. The
authors argue that WIRE outperforms current state-of-the-art INRs in terms of both accuracy and
robustness.
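A minimal sketch of a complex Gabor wavelet nonlinearity in the spirit of WIRE follows; the frequency and spread parameters, and the complex-valued layer setup, are illustrative assumptions.

```python
# Minimal sketch: a complex Gabor wavelet activation combining a sinusoid
# (frequency compactness) with a Gaussian window (spatial compactness).
import torch
import torch.nn as nn

class GaborActivation(nn.Module):
    def __init__(self, omega0=10.0, s0=10.0):
        super().__init__()
        self.omega0 = omega0   # frequency term, analogous to the sine in SIREN
        self.s0 = s0           # spread term, analogous to a Gaussian window

    def forward(self, x):
        # psi(x) = exp(i * omega0 * x) * exp(-|s0 * x|^2)
        return torch.exp(1j * self.omega0 * x) * torch.exp(-(self.s0 * x).abs() ** 2)

class WireLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim).to(torch.cfloat)
        self.act = GaborActivation()

    def forward(self, x):
        return self.act(self.linear(x))

layer = WireLayer(2, 64)
coords = torch.rand(1024, 2).to(torch.cfloat)   # 2D pixel coordinates cast to complex
features = layer(coords)                         # complex features; a real output head reads .real
```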
Seunghyeon Seo, Donghoon Han, Yeonjin Chang, Nojun Kwak
Figure 1. Comparison with the vanilla mip-NeRF and other regularization methods. Given the same training
batch size and number of iterations, our MixNeRF outperforms mip-NeRF and DietNeRF by a large margin with
comparable or shorter training time. Compared to RegNeRF, ours achieves superior performance with about
42% shorter training time. The size of the circles is proportional to the number of input views,
indicating 3/6/9-view, respectively. More details are provided in Sec. 4.2.
MixNeRF improves the training efficiency of novel-view synthesis by modeling a ray with a mixture density
model. The method also introduces a new task of ray depth estimation as a training objective and
regenerates blending weights to improve robustness.
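The sketch below is a simplified stand-in for the mixture-density idea: it reuses standard volume-rendering blending weights as mixture coefficients of a Laplacian mixture over the observed color and minimizes its negative log-likelihood. The depth-estimation objective and weight regeneration are omitted, and this is not the exact MixNeRF loss.

```python
# Simplified sketch (not the exact MixNeRF objective): Laplacian mixture over RGB
# with volume-rendering blending weights as mixture coefficients.
import torch

def blending_weights(sigmas, deltas):
    """Standard volume-rendering weights w_i = T_i * (1 - exp(-sigma_i * delta_i))."""
    alpha = 1.0 - torch.exp(-sigmas * deltas)                        # (R, S)
    trans = torch.cumprod(torch.cat([torch.ones_like(alpha[:, :1]),
                                     1.0 - alpha + 1e-10], dim=1), dim=1)[:, :-1]
    return trans * alpha

def mixture_nll(gt_rgb, sample_rgbs, weights, scale=0.1):
    """gt_rgb: (R, 3); sample_rgbs: (R, S, 3); weights: (R, S)."""
    # Laplacian log-density of the ground-truth color under each sample's color.
    log_comp = (-torch.abs(gt_rgb[:, None, :] - sample_rgbs) / scale
                - torch.log(torch.tensor(2.0 * scale))).sum(dim=-1)   # (R, S)
    log_mix = torch.logsumexp(torch.log(weights + 1e-10) + log_comp, dim=1)
    return -log_mix.mean()

R, S = 512, 64
sigmas, deltas = torch.rand(R, S), torch.full((R, S), 0.02)
sample_rgbs, gt_rgb = torch.rand(R, S, 3), torch.rand(R, 3)
w = blending_weights(sigmas, deltas)
loss = mixture_nll(gt_rgb, sample_rgbs, w)
```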
Jiaxiong Qiu, Peng-Tao Jiang, Yifan Zhu, Ze-Xin Yin, Ming-Ming Cheng, Bo Ren
Figure 1. 3D object surface reconstruction under high specular reflections (HSR). Top: A real-world scene
captured by a mobile phone. Middle: The state-of-the-art method NeuS fails to reconstruct the target
object (i.e., the Buddha). Bottom: We propose NeuS-HSR, which recovers a more accurate target object
surface than NeuS.
While implicit neural rendering provides high-quality 3D object surfaces, high specular reflections in
scenes through glass cause ambiguities that challenge their reconstruction. NeuS-HSR combats this
challenge by decomposing the rendered image into two appearances: the object and auxiliary plane. The
auxiliary plane appearance is generated by a novel auxiliary plane module that integrates physics
assumptions with neural networks.
Xiaoxiao Long, Cheng Lin, Lingjie Liu, Yuan Liu, Peng Wang, Christian Theobalt, Taku Komura, Wenping
Wang
Figure 1. We show three groups of multi-view reconstruction results generated by our proposed NeuralUDF
and NeuS respectively. Our method is able to faithfully reconstruct high-quality geometries for both
closed and open surfaces, while NeuS can only model shapes as closed surfaces, thus leading to
inconsistent topologies and erroneous geometries.
NeuralUDF proposes utilizing an unsigned distance function representation for surfaces as opposed to a
signed distance function representation. A volume rendering scheme is introduced to learn the neural UDF
representation, allowing for high-quality reconstruction of non-closed shapes with complex topologies.
Andreea Dogaru, Andrei Timotei Ardelean, Savva Ignatyev, Egor Zakharov, Evgeny Burnaev
Figure 1. We propose a new hybrid approach for learning neural implicit surfaces from multi-view images.
In previous methods, the volumetric ray marching training procedure is applied for the whole bounding
sphere of the scene (middle left). Instead, we train a coarse sphere-based surface reconstruction
(middle right) alongside the neural surface to guide the ray sampling and ray marching. As a result, our
method achieves an increased sampling efficiency by pruning empty scene space and better quality of
reconstructions (right).
SphereGuided is a method that makes use of a coarse sphere-based surface representation to increase
efficiency of ray marching in neural distance functions. The sphere-based representation is trained
jointly with the neural distance function. The joint representation leads to improved accuracy for
high-frequency details without requiring additional forward passes.
Bingfan Zhu, Yanchao Yang, Xulong Wang, Youyi Zheng, Leonidas Guibas
Figure 2. (a): When the geometry is correct, observations (the projections on the 2D image plane
represented by squares) of the same surface point from different viewpoints are similar; (b): When the
geometry is incorrect, i.e., a cube is reconstructed as a sphere, the radiance of a surface point (dot
on the sphere) can exhibit large directional view-dependence. However, as long as the view-dependent
radiance function has enough capacity (c in Eq. (1)), volume rendering of the wrong geometry can still
achieve small photometric reconstruction error. Thus, one should constrain the view-dependent capacity
of the radiance function to avoid overfitting; (c): On the other hand, when the surface is non-Lambertian
or the light field is unstable, one should not over-constrain the view-dependent capacity; otherwise, the
geometry may be traded for photometric reconstruction quality.
VDN-NeRF is an approach that normalizes the view-dependence of Neural Radiance Fields without explicitly
modeling the underlying factors. The method distills invariant information already present in the learned
NeRFs and applies it during joint training for view synthesis. This method improves geometry for
non-Lambertian surfaces and dynamic lighting without changing the volume rendering pipeline.
Xiaoxu Meng, Weikai Chen, Bo Yang
Figure 1. We show three groups of surface reconstruction from multi-view images. The front and back
faces are rendered in blue and yellow respectively. Our method (left) is able to reconstruct
high-fidelity and intricate surfaces of arbitrary topologies, including those non-watertight structures,
e.g. the thin single-layer shoulder strap of the top (middle). In comparison, the state-of-the-art NeuS
method (right) can only generate watertight surfaces, resulting in thick, double-layer geometries.
NeAT proposes learning arbitrary surface topologies from multi-view images in a neural rendering
framework. The signed distance function is used to represent the 3D surface with a branch that estimates
surface existence probability. NeAT also includes neural volume rendering and performs well in
reconstructing both watertight and non-watertight surfaces.
Yiming Gao, Yan-Pei Cao, Ying Shan
Figure 1. Examples to illustrate the task of online photorealistic reconstruction of an indoor scene.
The online photorealistic reconstruction of large-scale indoor scenes: given an online input image
stream of a previously unseen scene, the goal is to progressively build and update a scene
representation that allows for high-quality rendering from novel views.
SurfelNeRF combines a neural radiance field with a surfel-based representation to achieve efficient online
reconstruction and high-quality rendering of large-scale indoor scenes. The method employs a flexible
surfel-based representation to store geometric attributes and appearance features extracted from input
images. Results
show that SurfelNeRF outperforms state-of-the-art methods in both feedforward inference and per-scene
optimization settings.
Jingwang Ling, Zhibo Wang, Feng Xu
Figure 1. Our method can reconstruct neural scenes from single-view images captured under multiple
lightings by effectively leveraging a novel shadow ray supervision scheme.
ShadowNeuS introduces shadow ray supervision to NeRF and shows improvements in single-view shape
reconstruction from binary shadow as well as RGB images. The code and data are available at
'https://github.com/gerwang/ShadowNeuS'.
Xiaoshuai Zhang, Abhijit Kundu, Thomas Funkhouser, Leonidas Guibas, Hao Su, Kyle Genova
Figure 1. We propose to represent the scene with a set of local neural radiance fields, named nerflets,
which are trained with only 2D supervision. Our representation is not only useful for 2D tasks such as
novel view synthesis and panoptic segmentation, but also capable of solving 3D-oriented tasks such as 3D
segmentation and scene editing. The key idea is our learned structured decomposition (top right).
Nerflets are a newly proposed set of local neural radiance fields that perform efficient and
structure-aware 3D scene representation from images. Each nerflet contributes to panoptic, density, and
radiance reconstruction, optimized jointly to represent decomposed instances of the scene. Nerflets enable
the extraction of panoptic and photometric renderings from any view and enable tasks such as 3D panoptic
segmentation and interactive editing.
Zhemin Li, Hongxia Wang, Deyu Meng
Figure 1. Overview of the proposed improvement schemes for INR. (a) INR is a fully connected neural network
that maps from a coordinate to a pixel value. (b) INRR is a regularization term represented by an INR which
can capture self-similarity. (c) INR-Z improves the performance of INR by combining the neighboring pixels
with the coordinate as the input of another INR.
INRR uses a learned Dirichlet Energy regularizer to improve the generalization ability of Implicit Neural
Representations (INR), particularly for non-uniformly sampled data. Its effectiveness is demonstrated
through numerical experiments and it can also improve other signal representation methods.
Sara Sabour, Suhani Vora, Daniel Duckworth, Ivan Krasin, David J. Fleet, Andrea Tagliasacchi
Figure 1. NeRF assumes photometric consistency in the observed images of a scene. Violations of this
assumption, as with the images in the top row, yield reconstructed scenes with inconsistent content in
the form of “floaters” (highlighted with ellipses). We introduce a simple technique that produces clean
reconstruction by automatically ignoring distractors without explicit supervision.
Robust NeRF proposes a solution for removing distractors (moving objects, lighting variations, shadows)
from static NeRF scenes during training. The method models the distractors as optimization outliers and
improves NeRF performance on synthetic and real-world datasets.
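A minimal sketch of treating high-residual rays as outliers via a simple trimmed loss is shown below; the paper's actual robust estimator and patch-based masking are more involved, so this only conveys the general idea.

```python
# Simplified sketch (a trimmed loss, not the exact RobustNeRF estimator): treat the
# highest photometric residuals in a batch as distractor outliers and exclude them.
import torch

def trimmed_photometric_loss(pred_rgb, gt_rgb, inlier_fraction=0.8):
    """pred_rgb, gt_rgb: (N, 3). Keep only the lowest-residual rays."""
    residuals = (pred_rgb - gt_rgb).pow(2).sum(dim=-1)           # (N,)
    num_keep = int(inlier_fraction * residuals.numel())
    threshold = residuals.kthvalue(num_keep).values
    inlier_mask = (residuals <= threshold).float().detach()       # no gradient through the mask
    return (inlier_mask * residuals).sum() / inlier_mask.sum().clamp(min=1.0)

pred = torch.rand(4096, 3, requires_grad=True)
gt = torch.rand(4096, 3)
loss = trimmed_photometric_loss(pred, gt)
loss.backward()
```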
Pierre Zins, Yuanlu Xu, Edmond Boyer, Stefanie Wuhrer, Tony Tung
Figure 1. Reconstructions with various methods using 14 images of a model from BlendedMVS.
The authors propose a new optimization framework that combines volumetric integration with local depth
prediction in multi-view 3D shape reconstructions. Their approach demonstrates better geometry estimations
over existing methods, achieving pixel-accuracy while retaining the benefits of volumetric integration.
Botao Ye, Sifei Liu, Xueting Li, Ming-Hsuan Yang
Figure 1. Reconstruction and Plane Segmentation Results. Our method can reconstruct smooth and complete
planar regions by employing the super-plane constraint and further obtain plane segmentation in an
unsupervised manner.
S3PRecon introduces a self-supervised super-plane constraint for neural implicit surface representation
methods. The method improves the reconstruction of texture-less planar regions in indoor scenes without
requiring any other ground truth annotations. The use of super-planes proves to be more effective than
traditional annotated planes.
Yiqun Wang, Ivan Skorokhodov, Peter Wonka
Figure 1. The challenge of using the tri-plane representation directly. First column: reference image.
Second to the fifth column: NeuS, Learning SDF using tri-planes, OURS without self-attention convolution,
and OURS.
PET-NeuS extends NeuS with a tri-plane representation, positional encoding, and learnable self-attention
convolution, mixing MLPs with tri-plane features. These improvements achieve high-fidelity reconstruction
on standard datasets.
Xudong Huang, Wei Li, Jie Hu, Hanting Chen, Yunhe Wang
Figure 2. An overview of the reference-based SR model, which consists of a two-branch backbone and a
fusion module. The backbones extract the feature maps from degradation-dominant (LRRef and novel view
LR) and detail-dominant information (HR Ref and novel view LR) respectively. The two feature maps are
then fused and further refined by a fusion module.
RefSR-NeRF is an end-to-end framework that reconstructs high frequency details of NeRF rendering that are
lost due to resolution explosion, by first generating a low-resolution NeRF model and then using a
high-resolution reference image to reconstruct the high frequency details. The proposed RefSR model learns
the inverse degradation process from NeRF to the target high-resolution image and outperforms NeRF and its
variants in terms of speed, memory usage, and rendering quality.
Zhengqi Li, Qianqian Wang, Forrester Cole, Richard Tucker, Noah Snavely
Figure 1. Recent methods for synthesizing novel views from monocular videos of dynamic scenes, like
HyperNeRF, struggle to render high-quality views from long videos featuring complex camera and scene
motion. We present a new approach that addresses these limitations, illustrated above via an application
to 6DoF video stabilization, where we apply our approach and prior methods on a 30-second, shaky video
clip, and compare novel views rendered along a smoothed camera path (left). On a dynamic scenes dataset
(right), our approach significantly improves rendering fidelity, as indicated by synthesized images and
LPIPS errors computed on pixels corresponding to moving objects (yellow numbers). Please see the
supplementary video for full results.
DynIBAR synthesizes novel views from a video with a volumetric image-based rendering approach, which
synthesizes new viewpoints by aggregating features from nearby views in a scene-motion-aware manner. The
method allows for photo-realistic novel view synthesis from long videos featuring complex scene dynamics.
Peng Wang, Yuan Liu, Zhaoxi Chen, Lingjie Liu, Ziwei Liu, Taku Komura, Christian Theobalt, Wenping Wang
Figure 1. Top: (a) Forward-facing camera trajectory. (b) 360°
F^2-NeRF (Fast-Free-NeRF) is a fast grid-based NeRF for novel view synthesis. It proposes a new
perspective warping approach that handles arbitrary camera trajectories, covering both bounded and
unbounded scenes. The method trains in a few minutes while still achieving high rendering quality.
Bowen Cai, Jinchi Huang, Rongfei Jia, Chengfei Lv, Huan Fu
Figure 1. We show the surface reconstruction results produced by NeuDA and the two baseline methods,
including NeuS and Instant-NeuS. Instant-NeuS is the reproduced NeuS leveraging the multi-resolution hash
encoding technique. We can see NeuDA can promisingly preserve more surface details. Please refer to
Figure 5 for more qualitative comparisons.
NeuDA utilizes hierarchical anchor grids to improve surface reconstruction. These anchor grids encode both
high- and low-frequency geometry and appearance with a simple hierarchical positional encoding method. The
model shows substantial improvement on reconstruction over prior work.
Han Yan, Celong Liu, Chao Ma, Xing Mei
Figure 1. We propose PlenVDB, a sparse volume data structure for accelerating NeRF training and
rendering. Given a set of training views, our method directly optimizes a VDB model. Then a novel view
can be rendered with the model. Two advantages, i.e., fast voxel access for faster speed, and efficient
storage for smaller model size, enable efficient NeRF rendering on mobile devices.
PlenVDB accelerates training and inference in neural radiance fields by using the VDB data structure.
PlenVDB is shown to be more efficient in training, data representation, and rendering than previous
methods and can achieve high FPS on mobile devices.
Deborah Levy, Amit Peleg, Naama Pearl, Dan Rosenbaum, Derya Akkaynak, Simon Korman, Tali Treibitz
Figure 1. NeRFs have not yet tackled scenes in which the medium strongly influences the appearances of
objects, as in the case of underwater imagery. By incorporating a scattering image formation model into
the NeRF rendering equations, we are able to separate the scene into ‘clean’ and backscatter components.
Consequently, we can render photorealistic novel-views with or without the participating medium, in the
latter case recovering colors as if the image was taken in clear air. Results on the Curaçao scene: A
RAW image (left) is brightened and white balanced (WB) for visualization, showing more detail, while
areas further from the camera (top-right corner) are occluded and attenuated by severe backscatter which
is effectively removed in our restored image. Please zoom-in to observe the details.
The authors introduce SeaThru-NeRF for rendering scenes in scattering media such as underwater or foggy
scenes. The architecture also allows the medium's parameters to be learned from the data along with
the scene.
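A minimal sketch of the scattering image-formation idea (an attenuated "clean" color plus a range-dependent backscatter term) is shown below; the coefficients and parameterization are assumptions, not the SeaThru-NeRF model.

```python
# Minimal sketch (standard underwater image-formation terms; shapes and values are
# assumptions): combine an attenuated clean color with distance-dependent backscatter.
import torch

def render_with_medium(clean_rgb, depth, beta_attn, beta_scatter, backscatter_rgb):
    """clean_rgb: (N, 3) object color as if in clear air; depth: (N, 1) range along the ray.
    beta_attn, beta_scatter: (3,) per-channel medium coefficients."""
    direct = clean_rgb * torch.exp(-beta_attn * depth)                  # attenuated signal
    backscatter = backscatter_rgb * (1.0 - torch.exp(-beta_scatter * depth))
    return direct + backscatter

clean = torch.tensor([[0.8, 0.4, 0.3]])
depth = torch.tensor([[5.0]])
beta_attn = torch.tensor([0.40, 0.20, 0.10])       # red attenuates fastest underwater
beta_scatter = torch.tensor([0.15, 0.20, 0.25])
veil = torch.tensor([0.05, 0.25, 0.35])            # blueish backscatter color

observed = render_with_medium(clean, depth, beta_attn, beta_scatter, veil)
# Rendering with the medium removed amounts to returning `clean` directly.
```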
Malte Prinzler, Otmar Hilliges, Justus Thies
Figure 1. Based on sparse input views, we predict depth and feature maps to infer a volumetric scene
representation in terms of a radiance field which enables novel viewpoint synthesis. The depth
information allows us to use input views with high relative distance such that the scene can be captured
more completely and with higher synthesis quality compared to previous state-of-the-art methods.
DINER uses depth and feature maps to guide volumetric scene representation for 3D rendering of novel
views. The method incorporates depth information in feature fusion and uses efficient scene sampling,
resulting in higher synthesis quality even for input views with high disparity.
Daniel Rho, Byeonghyeon Lee, Seungtae Nam, Joo Chan Lee, Jong Hwan Ko, Eunbyung Park
Figure 1. Rate-distortion curves on the NeRF synthetic dataset. The numbers inside parenthesis denote
the axis resolution of grids.
The authors propose a new method for representing wavelet-transformed grid-based neural fields in a more
memory-efficient way. They achieved state-of-the-art compression results while still maintaining
high-quality reconstruction.
Brian K. S. Isaac-Medina, Chris G. Willcocks, Toby P. Breckon
Figure 1. Comparison of Exact-NeRF (ours) with mip-NeRF 360 . Our method is able to both match the
performance and obtain superior depth estimation over a larger depth of field.
Exact-NeRF improves mip-NeRF, which approximates the Integrated Positional Encoding (IPE) using an
expected value of a multivariate Gaussian. Exact-NeRF instead uses a pyramid-based integral formulation to
calculate the IPE exactly, leading to improved accuracy and a natural extension in the case of unbounded
scenes.
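For context, below is a minimal sketch of the Gaussian-based IPE approximation used by mip-NeRF, which Exact-NeRF replaces with an exact pyramid-based integral; the frequency count and shapes are illustrative.

```python
# Minimal sketch of mip-NeRF's Gaussian IPE approximation (what Exact-NeRF replaces).
import torch

def integrated_positional_encoding(mu, diag_sigma, num_freqs=4):
    """mu: (N, 3) mean of the Gaussian approximating a conical frustum.
    diag_sigma: (N, 3) diagonal of its covariance.
    E[sin(2^l x)] under a Gaussian = sin(2^l mu) * exp(-0.5 * 4^l sigma^2)."""
    freqs = 2.0 ** torch.arange(num_freqs, device=mu.device)        # (F,)
    scaled_mu = mu[..., None, :] * freqs[:, None]                    # (N, F, 3)
    scaled_var = diag_sigma[..., None, :] * freqs[:, None] ** 2      # (N, F, 3)
    damping = torch.exp(-0.5 * scaled_var)
    enc = torch.cat([torch.sin(scaled_mu) * damping,
                     torch.cos(scaled_mu) * damping], dim=-1)        # (N, F, 6)
    return enc.flatten(1)                                            # (N, 6 * F)

mu = torch.rand(1024, 3)
diag_sigma = torch.rand(1024, 3) * 0.01
features = integrated_positional_encoding(mu, diag_sigma)
```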
Kun Zhou, Wenbo Li, Yi Wang, Tao Hu, Nianjuan Jiang, Xiaoguang Han, Jiangbo Lu
Figure 1. We propose NeRFLiX, a general NeRF-agnostic restorer that is capable of improving neural view
synthesis quality. The first example is from Tanks and Temples, and the last one is a user scene
captured by a mobile phone. RegNeRF-V3 means the model trained with three input views.
NeRFLiX restores realistic details in synthesized images by leveraging aggregation: it effectively
removes artifacts with a degradation-driven inter-viewpoint mixer trained on large-scale simulated
NeRF-style degradation data. Beyond the artifact
removal, the proposed inter-viewpoint aggregation framework improves the performance and photorealism of
existing NeRF models.
Yu-Tao Liu, Li Wang, Jie Yang, Weikai Chen, Xiaoxu Meng, Bo Yang, Lin Gao
Figure 1. We show comparisons of the input multi-view images (top), watertight surfaces (middle)
reconstructed with the state-of-the-art SDF-based volume rendering method NeuS, and open surfaces (bottom)
reconstructed with our method. Our method is capable of reconstructing high-fidelity shapes with both
open and closed surfaces from multi-view images.
NeUDF is an implicit neural surface representation that uses the unsigned distance function to enable the
reconstruction of arbitrary surfaces from multiple views. It addresses issues arising from the
sigmoid-based weighting used in volume rendering and from the surface orientation ambiguity when rendering
open surfaces.
Runzhao Yang, Tingxiong Xiao, Yuxiao Cheng, Jinli Suo, Qionghai Dai
Figure 1. The scheme of the proposed approach TINC. For target data, we divide the volume into
equal-size blocks via octree partitioning, with some neighboring and far-apart blocks having similar
appearances. Each block can then be represented with an implicit neural function, implemented as an
MLP. After sharing parameters among similar blocks, we achieve a more compact neural network with a
tree-shaped structure. Here we highlight the similar blocks sharing network parameters with the same
color.
TINC is a new compression technique for implicit neural representation (INR) that uses a tree-structured
MLP to fit local regions and extract shared features. This approach improves the compression of INR and
outperforms competing methods.
Zhe Jun Tang, Tat-Jen Cham, Haiyu Zhao
Figure 1. We illustrate two views of the Blender ’Drums’ scene. The surfaces of the drums appear either
translucent or reflective at different angles. As shown, the Ref-NeRF model has severe difficulties
interpolating between the translucent and reflective surfaces as the viewing angle changes. Our method
demonstrates its superiority over NeRF rendering models by producing such accurate view-dependent effects.
In addition, the specularity of the cymbals is rendered much closer to ground truth compared to Ref-NeRF.
ABLE-NeRF is a self-attention-based alternative to traditional volume rendering. It utilizes learnable
embeddings to better represent view-dependent effects, resulting in superior rendering of glossy and
translucent surfaces compared to prior work. ABLE-NeRF outperforms Ref-NeRF on all 3 image quality metrics
in the Blender dataset.
Peng Dai, Yinda Zhang, Xin Yu, Xiaoyang Lyu, Xiaojuan Qi
Figure 1. Our hybrid neural rendering model generates high-fidelity novel view images. Please note the
characters in the book, where the result of Point-NeRF is blurry and the GT is contaminated by blur
artifacts.
The hybrid rendering model proposed integrates image-based and neural 3D representations to jointly
produce view-consistent novel views of large-scale scenes, handling naturally-occurring artifacts such as
motion blur. The authors also propose to simulate blur effects on rendered images with quality-aware
weights, leading to better results for novel view synthesis than competing point-based methods.
Jinguang Tong, Sundaram Muthu, Fahira Afzal Maken, Chuong Nguyen, Hongdong Li
Figure 2. An overview of the ReNeuS framework. The scene is separated into two sub-spaces w.r.t. the
interface. The internal scene is represented by two multi-layer perceptrons (MLPs) for both geometry and
appearance. For neural rendering, we recursively trace a ray through the scene and collect a list of
sub-rays. Our neural implicit representation makes the tracing process controllable. We show a ray
tracing process with the depth of recursion Dre = 3 here. Color accumulation is conducted on irradiance
of the sub-rays in the inverse direction of ray tracing. We optimize the network using the difference
between the rendered image and the ground-truth image. The target mesh can be extracted from the SDF by
Marching Cubes.
ReNeuS proposes a neural rendering approach for reconstructing the 3D geometry of an object confined in a
transparent enclosure. The method models the scene as two subspaces, one inside the transparent enclosure
and another outside it. ReNeuS also presents a hybrid rendering strategy for complex light interactions
and combines volume rendering with ray tracing.
Junli Cao, Huan Wang, Pavlo Chemerys, Vladislav Shakhrai, Ju Hu, Yun Fu, Denys Makoviichuk, Sergey
Tulyakov, Jian Ren
Figure 1. Examples of deploying our approach on mobile devices for real-time interaction with users. Due
to the small model size (8.3MB) and fast inference speed (18 ∼ 26ms per image on iPhone 13), we can
build neural rendering applications where users interact with 3D objects on their devices, enabling
various applications such as virtual try-on. We use publicly available software to make the on-device
application for visualization.
The authors introduce a lightweight and mobile-friendly architecture for real-time neural rendering with
NeLF techniques. The proposed network achieves high quality results while being small and efficient,
outperforming MobileNeRF in terms of speed and storage needs.
Priors and Generative
Priors can either aid reconstruction or be used in a generative manner. In reconstruction, for example,
priors either increase the quality of neural view synthesis or enable reconstruction from sparse image
collections.
Zifan Shi, Yujun Shen, Yinghao Xu, Sida Peng, Yiyi Liao, Sheng Guo, Qifeng Chen, Dit-Yan Yeung
Figure 2. Framework of PoF3D, which consists of a pose-free generator and a pose-aware discriminator.
The pose-free generator maps a latent code to a neural radiance field as well as a camera pose, followed
by a volume renderer (VR) to output the final image. The pose-aware discriminator first predicts a
camera pose from the given image and then uses it as a pseudo label for conditional real/fake
discrimination, indicated by the orange arrow.
PoF3D is a generative model for 3D image synthesis that needs no prior on the 3D object pose. It first
infers the pose from the latent code using an efficient pose learner. Then, it trains the discriminator to
differentiate between synthesized images with the predicted pose as a condition and real images. Jointly
trained in an adversarial manner, PoF3D achieves high-quality 3D-aware image synthesis.
Shangzhan Zhang, Sida Peng, Tianrun Chen, Linzhan Mou, Haotong Lin, Kaicheng Yu, Yiyi Liao, Xiaowei Zhou
Figure 1. Given only a single semantic map as input (first row), our approach optimizes neural fields for
view synthesis of natural scenes. Photorealistic images can be rendered via neural fields (the last two
rows).
The paper presents a method for synthesizing multi-view consistent color images of natural scenes from
single semantic masks, using a semantic field as the intermediate representation. The method outperforms
baseline methods and produces photorealistic, multi-view consistent videos without requiring multi-view
supervision or category-level priors.
Dario Pavllo, David Joseph Tan, Marie-Julie Rakotosaona, Federico Tombari
Figure 1. Given a collection of 2D images representing a specific category (e.g. cars), we learn a model
that can fully recover shape, pose, and appearance from a single image, without leveraging multiple
views during training. The 3D shape is parameterized as a signed distance function (SDF), which
facilitates its transformation to a triangle mesh for further downstream applications.
An end-to-end hybrid NeRF-inversion approach is introduced to recover 3D shape, pose, and appearance from a
single image. Unlike prior art that relies on multiple views during training, the approach recovers the
geometry, pose, and appearance of an object from a single image, and demonstrates state-of-the-art results
on multiple real and synthetic rendering tasks.
Xin Huang, Qi Zhang, Ying Feng, Xiaoyu Li, Xuan Wang, Qing Wang
Figure 1. We propose LIRF to reconstruct radiance fields of unseen scenes for novel view synthesis.
Given that current generalizable NeRF-like methods cast an infinitesimal ray to render a pixel at
different scales, it causes excessive blurring and aliasing. Our method instead reasons about 3D conical
frustums defined by the neighbor rays through the neighbor pixels (as shown in (a)). Our LIRF outputs
the feature of any sample within the conical frustum in a continuous manner (as shown in (b)), which
supports NeRF reconstruction at arbitrary scales. Compared with the previous method, our method can be
generalized to represent the same unseen scene at multiple levels of details (as shown in (c)).
Specifically, given a set of input views at a consistent image scale ×1, LIRF enables our method to both
preserve sharp details in close-up shots (anti-blurring as shown in ×2 and ×4 results) and correctly
render the zoomed-out images (anti-aliasing as shown in ×0.5 results).
LIRF is a generalizable neural rendering approach for novel view rendering that aggregates information
from conical frustums to construct a ray. The method outperforms state-of-the-art methods on novel view
rendering of unseen scenes at arbitrary scales.
Chao-Yuan Wu, Justin Johnson, Jitendra Malik, Christoph Feichtenhofer, Georgia Gkioxari
Figure 1. Multiview Compressive Coding (MCC). (a): MCC encodes an input RGB-D image and uses an
attention-based model to predict the occupancy and color of query points to form the final 3D
reconstruction. (b): MCC generalizes to novel objects captured with iPhones (left) or imagined by DALL·E
2 (middle). It is also general – it works not only on objects but also scenes (right).
MCC is a framework for single-view 3D reconstruction that can learn representations from 3D points of
objects or scenes. Through large-scale training from RGB-D videos, generalization is improved.
Chiheon Kim, Doyup Lee, Saehoon Kim, Minsu Cho, Wook-Shin Han
Figure 1. The reconstructed images of 178×178 ImageNette by TransINR (left) and our generalizable INRs
(right).
The authors propose a generalizable implicit neural representation that can represent complex data
instances by modulating a small set of weights in early MLP layers as an instance pattern composer. The
remaining weights learn composition rules for common representation across instances. The method's
flexibility is shown by testing it on audio, image, and 3D data.
Dogyoon Lee, Minhyeok Lee, Chajin Shin, Sangyoun Lee
Figure 2. Overall pipeline for DP-NeRF. DP-NeRF consists of three stages. (a) The rigid blurring kernel
(RBK) constructs the blurring system using the SE(3) field based on the physical priors. (b) The
adaptive weight proposal (AWP) refines the composition weights using the depth feature $\zeta^p_{i,j}$ of
the samples on the ray of the target pixel $p$, the scene information $s$, and the rigidly transformed ray
directions $r^p_{s;0,\dots,k}$. (c) Finally, the coarse and fine blurred colors, $\hat{B}^p$ and
$\tilde{B}^p$, are composited using the weighted sum of the ray-transformed colors. $L_c$ and $L_f$ denote
the coarse and fine RGB reconstruction losses, respectively.
DP-NeRF is an adaptation of the NeRF model to handle blurred images with physical priors, providing better
geometric and appearance consistency. Specifically, it introduces a rigid blurring kernel and adaptive
weight proposal to refine the color composition error while taking into consideration the relationship
between depth and blur.
Dae-Young Song, HeeKyung Lee, Jeongil Seo, Donghyeon Cho
Figure 2. An overview of our method. (a) First, the hallucinator generates a back-side image $I_B$ from
$I_F$. The depth estimator receives $I_F$ and $I_B$ and estimates a front depth map $D_F$ and a back depth
map $D_B$ simultaneously. (b) With $I_F$, $I_B$, $D_F$, and $D_B$, the 2D encoder extracts a feature map to
be transformed into the pixel-aligned feature. Meanwhile, $D_F$ and $D_B$ are projected to form a depth
volume $V$. A 3D feature map is extracted by the 3D encoder from $V$ and aligned to the voxel-aligned
feature. Both aligned features are concatenated along the channel axis and used by MLPs to estimate the
final occupancy vector for given query points.
DIFu is an implicit-function-based method for reconstructing clothed humans from a single image. It proposes new encoders
and networks to extract features from both 2D and 3D space, utilizing a depth prior to generate a 3D
representation of a human body in addition to an RGB image. The method also estimates colors of 3D points
using a texture inference branch.
Suyi Jiang, Haoran Jiang, Ziyu Wang, Haimin Luo, Wenzheng Chen, Lan Xu
Figure 1. The proposed HumanGen can generate 3D humans with fine-detailed geometry and appearance while
seamlessly lifting various 2D latent editing tools into 3D.
HumanGen presents a 3D human generation scheme with detailed geometry and 360° realistic
free-view rendering. It marries the 3D human generation with priors from 2D generators and 3D
reconstructors through an "anchor image" to disentangle the generation of geometry and appearance and
enhance appearance generation. The effectiveness of the proposed method is demonstrated in extensive
experiments for state-of-the-art 3D human generation.
Titas Anciukevicius, Zexiang Xu, Matthew Fisher, Paul Henderson, Hakan Bilen, Niloy J. Mitra, Paul
Guerrero
Figure 1. We propose a 3D-aware image diffusion model that can be used for monocular 3D reconstruction,
3D-aware inpainting, and unconditional generation, while being trained with only monocular 2D
supervision. Here we show results on ShapeNet and FFHQ.
RenderDiffusion is presented in this paper as the first Diffusion Model for 3D generation from 2D input. A
novel image denoising technique is presented, providing a 3D consistent representation that can be
rendered from any viewpoint. Performance was shown on several datasets for 3D scene inference and
generation, as well as 2D inpainting for editing 3D scenes.
Haochen Wang, Xiaodan Du, Jiahao Li, Raymond A. Yeh, Greg Shakhnarovich
Figure 1. Results for text-driven 3D generation using Score Jacobian Chaining with Stable Diffusion as
the pretrained model.
The paper proposes a novel method to extend 2D models trained on an image generation task to a 3D
voxel-based rendering setup. The method uses a diffusion model to learn gradients, which are
backpropagated through the Jacobian of a differentiable renderer to aggregate 2D scores at multiple camera
viewpoints into a 3D score, allowing 2D models to be repurposed for 3D data generation. A novel estimation
mechanism is proposed to address the issue of distribution mismatch in this application.
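Below is a self-contained toy sketch of the general recipe of pushing a 2D score back through a differentiable renderer; ToyRenderer and ToyDenoiser are stand-ins for a NeRF/voxel renderer and a pretrained diffusion denoiser, and this is not the paper's exact noise-free estimator.

```python
# Toy sketch: aggregate 2D scores from several viewpoints into a gradient on a
# shared 3D representation by backpropagating through a differentiable renderer.
import torch
import torch.nn as nn

class ToyRenderer(nn.Module):
    def forward(self, voxels, camera):
        # Project a tiny 3D volume to an image by summing along one axis.
        return voxels.sum(dim=camera)[None, None]           # (1, 1, H, W)

class ToyDenoiser(nn.Module):
    def forward(self, noisy, sigma):
        return noisy * 0.9                                    # pretend denoising

def score_guided_step(voxels, renderer, denoiser, camera, sigma=0.5):
    image = renderer(voxels, camera)                          # differentiable rendering
    noisy = image + torch.randn_like(image) * sigma
    with torch.no_grad():
        denoised = denoiser(noisy, sigma)
        score_2d = (denoised - noisy) / sigma ** 2            # approximate image-space score
    # Chain rule: push the 2D score back through the renderer's Jacobian.
    image.backward(gradient=-score_2d)

voxels = torch.rand(16, 16, 16, requires_grad=True)
renderer, denoiser = ToyRenderer(), ToyDenoiser()
for camera in (0, 1, 2):                                      # aggregate over viewpoints
    score_guided_step(voxels, renderer, denoiser, camera)
print(voxels.grad.shape)
```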
Senmao Li, Joost van de Weijer, Yaxing Wang, Fahad Shahbaz Khan, Meiqin Liu, Jian Yang
Figure 1. 3D-aware I2I translation: given a view-consistent 3D scene (the input), our method maps it
into a high-quality target-specific image. Our approach produces consistent results across viewpoints.
The paper proposes a novel 3D-aware framework for 3D-consistent multi-class image-to-image translation.
The approach employs a multi-class 3D-aware GAN architecture followed by a 3D-aware I2I translation
system, showing improved results over naive 2D-I2I methods for multi-class I2I translation.
Gal Metzer, Elad Richardson, Or Patashnik, Raja Giryes, Daniel Cohen-Or
Figure 1. Our three text-guided models: a purely text-guided Latent-NeRF, Latent-NeRF with Sketch-Shape
guidance for more exact control over the generated shape, and Latent-Paint for texture generation for
explicit shapes. The top row represents the models’ inputs.
Latent-NeRF combines text-guided image generation with a latent-space diffusion model and introduces a
sketch-based guiding mechanism to generate 3D objects. The authors show its applicability to text-to-3D
and apply latent score distillation to 3D meshes for texture synthesis.
Hao Yang, Lanqing Hong, Aoxue Li, Tianyang Hu, Zhenguo Li, Gim Hee Lee, Liwei Wang
Figure 3. Pipeline of our ContraNeRF. 1) We first use a shared CNN to extract features for input source
views. Then for each source view, we query features from other source views and aggregate them to get
the geometrically enhanced feature maps (Sec.4.1). 2) For each pair of source views, we compute the
contrastive loss using our GeoContrast (Sec.4.2). Specifically, for the pixel in the i-th source view,
we project it to the j-th source view and sample a collection of projections to get positive and
negative samples. Then the weighted contrastive loss is calculated by considering the distance between
the positive sample and the negative samples. 3) Finally, for each ray in the target view, we compute
colors and densities for a set of samples along the ray by aggregating local features from source views,
and accumulate colors and densities to render images (Sec.4.3).
This paper proposes a geometry-aware contrastive learning method to train NeRF for synthetic-to-real
generalization in novel view synthesis. The method enhances geometry awareness with cross-view attention
and produces better results compared to existing methods in terms of image quality.
Mikaela Angelina Uy, Ricardo Martin-Brualla, Leonidas Guibas, Ke Li
Figure 1. SCADE Overview. We present SCADE, a novel technique for NeRF reconstruction under sparse,
unconstrained views for in-the-wild indoor scenes. We leverage generalizable monocular depth priors
and represent the inherent ambiguities of monocular depth by exploiting our ambiguity-aware
depth estimates (left). Our approach accounts for the multimodality of both distributions using our novel
space carving loss that seeks to disambiguate and find the common mode to fuse the information between
different views (middle). SCADE enables better photometric reconstruction especially in highly ambiguous
scenes, e.g. non-opaque surfaces (right).
SCADE introduces per-view depth estimates into NeRF using a continuous, multimodal distribution of depth
estimates predicted with conditional Implicit Maximum Likelihood Estimation (cIMLE). This depth information
helps disambiguate geometry from sparse input views and improves NeRF performance on sparse, unconstrained
indoor scenes.
Yufan Ren, Fangjinhua Wang, Tong Zhang, Marc Pollefeys, Sabine Süsstrunk
Figure 1. Generalizable implicit reconstructions from three views (top). The state-of-the-art method
SparseNeuS produces oversmoothed surfaces (left), while our (VolRecon) reconstructs finer details
(right). Best viewed on a screen when zoomed in.
VolRecon is a neural implicit scene reconstruction method that uses signed ray distance functions (SRDF)
to provide generalizable implicit scene reconstruction. The method combines projection features from
multi-view features and volume features interpolated from a coarse global feature volume. The SRDF values
are computed for sampled points on a ray and then used to render color and depth. The method outperforms
SparseNeuS by 30% in sparse view reconstruction and is comparable to MVSNet in full view reconstruction on
the DTU dataset.
Zhiyuan Ma, Xiangyu Zhu, Guojun Qi, Zhen Lei, Lei Zhang
Figure 1. OTAvatar animation results. The source subjects in HDTF dataset are animated by OTAvatar using
a single portrait as the reference. We use the pose and expression coefficients of 3DMM to represent
motion and drive the avatar. Note that these subjects are not included in the training data of OTAvatar.
OTAvatar creates face avatars from only one reference image and a motion code, achieving controllability,
generalizability, and efficiency through a tri-plane rendering method.
Animesh Karnewar, Andrea Vedaldi, David Novotny, Niloy Mitra
Figure 1. We present HOLODIFFUSION as the first 3D-aware generative diffusion model that produces
3D-consistent images and is trained with only posed image supervision. Here we show a few different
samples generated from models trained on different classes of the CO3D dataset.
This paper introduces a scalable diffusion model that is trained on 2D posed images while operating on 3D
volumes. The proposed image formation model overcomes the limitation of the cubic growth of memory and
compute complexity encountered in 3D modelling. The approach is evaluated on the CO3D dataset, which
demonstrates sample quality and fidelity competitive with existing approaches for 3D generative modelling.
Congyue Deng, Chiyu "Max'' Jiang, Charles R. Qi, Xinchen Yan, Yin Zhou, Leonidas Guibas, Dragomir
Anguelov
Figure 1. From left to right: We present a single-image NeRF synthesis framework for in-the-wild images
without 3D supervision by leveraging general priors from large-scale image diffusion models. Given an
input image, we optimize for a NeRF by minimizing an image distribution loss for arbitrary-view
renderings with the diffusion model conditioned on the input image. We design a two-section semantic
feature as the conditioning input to the diffusion model. The first section is the image caption $s_0$,
which carries the overall semantics; the second section is a text embedding $s_*$ extracted from the input image
with textual inversion, which captures additional visual cues. Our two-section semantic feature provides
an appropriate image prior, allowing the synthesis of a realistic NeRF coherent to the input image.
NeRDi optimizes a NeRF representation by minimizing an image distribution loss on arbitrary-view renderings
with a pretrained image diffusion model. Additionally, the authors propose a geometric loss based on
estimated depth maps and introduce
two-section language guidance to improve multiview coherence and generalization.
Lucy Chai, Richard Tucker, Zhengqi Li, Phillip Isola, Noah Snavely
Figure 1. Our approach enables unconditional synthesis of unbounded 3D nature scenes with a persistent
scene representation (left), using a scene layout grid representing a large-scale terrain model
(depicted above as the checkered ground plane). This representation enables us to generate arbitrary
camera trajectories, such as the six numbered views shown along a cyclic camera path (center). The
persistence inherent to our representation stands in contrast to prior auto-regressive methods that do
not preserve consistency under circular camera trajectories (right); while the two images shown on the
right are at the start and end of a cyclic path, the terrain depicted is completely different. Our
method is trained solely from unposed, single-view landscape photos.
This paper presents a method to synthesize unbounded nature scenes with persistent 3D world modeling. The
representation consists of an extendable, planar scene layout grid, which can be rendered from arbitrary
camera poses via a 3D decoder and volume rendering, and a panoramic skydome. The generative world model is
learned from single-view internet photos.
J. Ryan Shue, Eric Ryan Chan, Ryan Po, Zachary Ankner, Jiajun Wu, Gordon Wetzstein
Figure 1. Our method leverages existing 2D diffusion models for 3D shape generation using hybrid
explicit–implicit neural representations. Top: triplane-based 3D shape diffusion process using our
framework. Bottom: Interpolation between generated shapes.
The paper presents a method for 3D-aware generation of neural fields using diffusion models with triplane
feature representations and continuous occupancy fields. The method outperforms other 3D generation
approaches.
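Triplane-style representations like the one used here are typically queried by projecting a 3D point onto three axis-aligned feature planes, interpolating a feature from each, and aggregating the results before decoding them with a small MLP. The sketch below illustrates only that lookup; the resolution, channel count, and sum aggregation are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def bilinear_sample(plane, u, v):
    """Bilinearly interpolate an (R, R, C) feature plane at coordinates u, v in [0, 1]."""
    R = plane.shape[0]
    x = np.clip(u * (R - 1), 0, R - 1)
    y = np.clip(v * (R - 1), 0, R - 1)
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, R - 1), min(y0 + 1, R - 1)
    wx, wy = x - x0, y - y0
    return ((1 - wx) * (1 - wy) * plane[x0, y0] + wx * (1 - wy) * plane[x1, y0]
            + (1 - wx) * wy * plane[x0, y1] + wx * wy * plane[x1, y1])

def triplane_features(planes, p):
    """Query a triplane at a 3D point p in [0, 1]^3; features are aggregated by
    summation (concatenation is another common choice)."""
    x, y, z = p
    return (bilinear_sample(planes['xy'], x, y)
            + bilinear_sample(planes['xz'], x, z)
            + bilinear_sample(planes['yz'], y, z))

R, C = 64, 16
planes = {k: np.random.randn(R, R, C).astype(np.float32) for k in ('xy', 'xz', 'yz')}
feat = triplane_features(planes, (0.3, 0.7, 0.5))   # (C,) feature fed to a small decoder MLP
```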
Jaehyeok Shim, Changwoo Kang, Kyungdon Joo
Figure 1. Visualization of 3D shape generation for various categories using SDF-Diffusion. SDF-diffusion
gradually removes the Gaussian noise and generates high-resolution 3D shapes in the form of an SDF voxel
by a two-stage framework. Unlike previous methods, which use point clouds, we can generate 3D meshes
without concern for complex post-processing. For visualization purposes, we show less noisy data instead
of the initial Gaussian noise.
SDF-Diffusion proposes a 3D shape generation framework that improves over existing methods by using
denoising diffusion models with continuous 3D representation via signed distance fields. The generative
process is separated into two stages to alleviate memory issues, with the second stage performing
super-resolution. The proposed framework achieves high-fidelity 3D shape generation and competitive
performance on the ShapeNet dataset.
Yongqiang Zhang, Zhipeng Hu, Haoqian Wu, Minda Zhao, Lincheng Li, Zhengxia Zou, Changjie Fan
Figure 6. Rendered 2D images on DTU. The first column shows ground truth 2D views from DTU. The second
and third columns show our method and NeuS.
The accuracy of implicit neural surface rendering can be sensitive to surface tangent direction, leading
to bias. This work proposes rendering with a scaled SDF field and additional supervision through surface
point estimation from multi-view stereo or similar methods, improving performance over existing SDF-based
reconstruction. Evaluation on the DTU dataset demonstrates the superiority of the proposed method.
Zhenyu Zhang, Renwang Chen, Weijian Cao, Ying Tai, Chengjie Wang
Figure 1. (a) Comparison with the graphics-renderer-based method LAP. Our method models geometry details
and photo-realistic texture. (b) Results of neural rendering methods EG3D and our method. Our method
produces high-quality geometry, robust texture modeling under rotation and deformation.
NPF is a generative model that extracts multi-image priors from a photo collection to learn a face
prototype and deform it with appropriate facial expressions and personal idiosyncrasies. It is optimized
to fit a target image in the collection to recover specific details of appearance and geometry, showing
robust 3D face modeling performance on benchmarks compared to state-of-the-art methods.
Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa, Xiaohui Zeng, Xun Huang, Karsten Kreis, Sanja
Fidler, Ming-Yu Liu, Tsung-Yi Lin
Figure 1. Results and applications of Magic3D. Top: high-resolution text-to-3D generation. Magic3D can
generate high-quality and high-resolution 3D models from text prompts. Bottom: high-resolution
prompt-based editing. Magic3D can edit 3D models by fine-tuning with the diffusion prior using a
different prompt. Taking the low-resolution 3D model as the input (left), Magic3D can modify different
parts of the 3D model corresponding to different input text prompts. Together with various creative
controls on the generated 3D models, Magic3D is a convenient tool for augmenting 3D content creation.
Magic3D is a two-stage optimization framework that uses a diffusion prior to obtain a coarse 3D model with
a hash grid structure. This is then used to initialize a fast optimization of a high-resolution textured
3D mesh model. The method is faster and achieves higher resolution while outperforming DreamFusion in user
studies.
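Magic3D, like DreamFusion before it, drives 3D optimization with a frozen text-to-image diffusion prior through differentiable rendering. The sketch below shows the general form of a score-distillation-style update; eps_model, text_emb, and alphas_cumprod are hypothetical stand-ins for a pretrained noise predictor and its schedule, and the weighting is one common choice rather than the paper's exact recipe.

```python
import torch

def score_distillation_step(rendered, eps_model, text_emb, alphas_cumprod):
    """Hedged sketch of one DreamFusion-style score distillation step.
    `rendered` is a differentiable rendering (B, C, H, W) of the 3D model;
    `eps_model(x_t, t, text_emb)` is a hypothetical frozen noise predictor."""
    B = rendered.shape[0]
    t = torch.randint(20, 980, (B,), device=rendered.device)     # random diffusion timestep
    a_t = alphas_cumprod[t].view(B, 1, 1, 1)
    noise = torch.randn_like(rendered)
    x_t = a_t.sqrt() * rendered + (1 - a_t).sqrt() * noise       # forward-diffuse the rendering
    with torch.no_grad():
        eps_pred = eps_model(x_t, t, text_emb)                   # frozen diffusion prior
    w = 1 - a_t                                                  # common weighting choice
    grad = w * (eps_pred - noise)                                # SDS gradient w.r.t. the rendering
    rendered.backward(gradient=grad)                             # flows into the 3D parameters
```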
Norman Müller, Yawar Siddiqui, Lorenzo Porzi, Samuel Rota Bulò, Peter Kontschieder, Matthias Nießner
Figure 7. Qualitative synthesis results of single-view reconstruction on ScanNet. Given a single posed
image and its foreground mask, our method can recover meaningful proposals for the radiance fields
describing the objects (shown in separate rows).
DiffRF is a volumetric radiance field synthesis approach based on denoising diffusion probabilistic
models. It operates on an explicit voxel grid representation and uses a rendering loss to address sample
ambiguity and reduce artifacts. Its multi-view consistent priors enable free-view synthesis, and
conditional generation arises naturally at inference time.
Jamie Wynn, Daniyar Turmukhambetov
Figure 1. Image and depth map rendered from a test view. All NeRF models were trained with 3 views of
the LLFF dataset’s “Room” scene. Our priors encourage NeRF to explain the TV and table geometry with
flat surfaces in the density field, and to explain the view-dependent color changes with the color
field.
The paper introduces a pretrained RGBD patch denoising model that is used as a prior regularizer for
implicit neural representations of 3D scenes. Results show added regularization improves the quality of
generated 3D scenes, particularly for few-view situations.
Yunzhi Zhang, Shangzhe Wu, Noah Snavely, Jiajun Wu
Figure 1. From a single image, our model learns to infer object intrinsics—the distributions of the
geometry, texture, and material of object instances within the image. The model can then generate new
instances of the object type, and it allows us to view the object under different poses and lighting
conditions. Project page at https://cs.stanford.edu/~yzzhang/projects/rose/.
This work proposes a generative model that can learn the intrinsics of an object category from a single
image, including the distribution of geometry, texture, and material. The model can render instances of
the object type in different poses, sizes, and lighting conditions. It shows promising results in various
downstream tasks such as shape and image generation, view synthesis, and relighting.
Seung Wook Kim, Bradley Brown, Kangxue Yin, Karsten Kreis, Katja Schwarz, Daiqing Li, Robin Rombach,
Antonio Torralba, Sanja Fidler
Figure 1. We introduce NeuralField-LDM, a generative model for complex open-world 3D scenes. This figure
contains a panorama constructed from NeuralField-LDM’s generated scene. We visualize different parts of
the scene by placing cameras on them.
NeuralField-LDM generates complex 3D environments using a hierarchical diffusion model. The model first
generates voxel grids for training via an auto-encoder, which is then condensed into a latent
representation via a second auto-encoder. The system achieves a substantial improvement over existing
state-of-the-art scene generation models.
Minjung Son, Jeong Joon Park, Leonidas Guibas, Gordon Wetzstein
Figure 1. SinGRAF generates different plausible realizations of a single 3D scene from a few unposed
input images of that scene. In this example, i.e., the “office 3” scene, we use 100 input images, four
of which are shown in the top row. Next, we visualize four realizations of the 3D scene as panoramas,
rendered using the generated neural radiance fields. Note the variations in scene layout, including
chairs, tables, lamps, and other parts, while staying faithful to the structure and style of the input
images.
SinGRAF synthesizes different plausible realizations of a single 3D scene from only a few unposed input
images of that scene. The authors use a progressive-scale patch discrimination approach during training in
a 3D GAN architecture and show that SinGRAF outperforms related work in both quality and diversity.
Rajhans Singh, Ankita Shukla, Pavan Turaga
Figure 1. Samples generated by our Poly-INR model on the ImageNet dataset at various resolutions. Our
model generates images with high fidelity without using convolution, upsample, or self-attention layers,
i.e., no interaction between the pixels.
Poly-INR is an implicit neural representation for images that does not rely on sinusoidal positional
encoding; instead, it represents the image as a polynomial function of the pixel coordinates. The model is
evaluated on large datasets and performs comparably to state-of-the-art generative models with fewer
trainable parameters.
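To make the "image as a polynomial of coordinates" idea concrete, here is a deliberately simple sketch that fits a low-degree monomial basis to pixel colors by least squares. It is a toy stand-in, not Poly-INR's progressive affine architecture, but it shows how a continuous image can be queried at arbitrary coordinates without sinusoidal positional encodings.

```python
import numpy as np

def polynomial_basis(coords, degree):
    """Monomial basis [x^i * y^j for i + j <= degree] at (N, 2) coordinates in [-1, 1]."""
    x, y = coords[:, 0], coords[:, 1]
    return np.stack([x**i * y**j
                     for i in range(degree + 1)
                     for j in range(degree + 1 - i)], axis=1)

def fit_poly_image(image, degree=12):
    """Toy polynomial image representation: least-squares fit of RGB values
    to a polynomial in pixel coordinates (no positional encoding)."""
    H, W, _ = image.shape
    ys, xs = np.meshgrid(np.linspace(-1, 1, H), np.linspace(-1, 1, W), indexing='ij')
    coords = np.stack([xs.ravel(), ys.ravel()], axis=1)
    A = polynomial_basis(coords, degree)                              # (H*W, n_terms)
    coeffs, *_ = np.linalg.lstsq(A, image.reshape(-1, 3), rcond=None)
    return coeffs, lambda c: polynomial_basis(c, degree) @ coeffs     # continuous decoder

img = np.random.rand(32, 32, 3)                       # stand-in image
coeffs, decode = fit_poly_image(img)
rgb = decode(np.array([[0.0, 0.0]]))                  # query any continuous coordinate
```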
Chen Yang, Peihao Li, Zanwei Zhou, Shanxin Yuan, Bingbing Liu, Xiaokang Yang, Weichao Qiu, Wei Shen
Figure 1. Illustration of the free view synthesis task. Free view synthesis aims at synthesizing
photo-realistic images on both interpolation and extrapolation settings. In this paper, we propose a
novel method, NeRFVS, based on the neural radiance field (NeRF) to achieve indoor scene free view
synthesis. Our NeRFVS significantly reduces the distortions and floaters (as also evidenced by the PSNR
numbers), producing high-quality images on interpolation and extrapolation settings.
NeRFVS improves upon NeRF for navigation in indoor environments by utilizing holistic priors including
depth maps and view coverage information to guide NeRF learning. The approach brings high-fidelity free
navigation results for indoor scenes.
Jianchuan Chen, Wentao Yi, Liqian Ma, Xu Jia, Huchuan Lu
Figure 1. The effect of inaccurately estimated SMPL. Compared with GNR, our method still yields a
reasonable result.
GM-NeRF uses a geometry-guided attention mechanism to overcome misalignment issues for multi-view 2D human
images and neural rendering with partial gradient backpropagation for perceptual supervision, which leads
to high fidelity and better geometric reconstruction in free-viewpoint human image synthesis.
Bokui Shen, Xinchen Yan, Charles R. Qi, Mahyar Najibi, Boyang Deng, Leonidas Guibas, Yin Zhou, Dragomir
Anguelov
Figure 1. Leveraging in-the-wild data for generative assets modeling embodies a scalable approach for
simulation. GINA-3D uses real-world driving data to perform various synthesis tasks for realistic 3D
implicit neural assets. Left: Multi-sensor observations in the wild. Middle: Asset reconstruction and
conditional synthesis. Right: Scene composition with background neural fields.
GINA-3D is a generative model that creates realistic 3D implicit neural assets using a real-world driving dataset of camera and
LiDAR sensors. It uses a learned tri-plane latent structure to overcome challenges caused by occlusions,
lighting variations, and long-tail distributions. GINA-3D achieves state-of-the-art results in diversity
and quality for both generated images and geometries.
Jingxiang Sun, Xuan Wang, Lizhen Wang, Xiaoyu Li, Yong Zhang, Hongwen Zhang, Yebin Liu
Figure 1. Our 3D GAN synthesizes generative, high-quality, and 3D-consistent facial avatars from
unstructured 2D images. Unlike current animatable 3D GANs that only modify yaw-pitch head poses and
facial expressions, our approach enables fine-grained control over full-head rotations, facial
expressions, eye blinks, and gaze directions with strict 3D consistency and a high level of
photorealism. Our approach also provides strong 3D priors for downstream tasks such as 3D-aware
stylization.
This paper presents a novel 3D GAN framework for unsupervised learning of generative, high-quality and
3D-consistent facial avatars from unstructured 2D images. The proposed representation combines both
fine-grained expression control of a mesh-guided explicit deformation and the flexibility of an implicit
volumetric representation. The method demonstrates state-of-the-art 3D-aware synthesis quality and
animation ability in the experiments section.
Dejia Xu, Yifan Jiang, Peihao Wang, Zhiwen Fan, Yi Wang, Zhangyang Wang
Figure 1. Overview of the proposed pipeline. NeuralLift-360 learns to recover a 3D object from a single
reference image y. We derive a prior distillation loss Ldiff for CLIP-guided diffusion prior and utilize
rough depth estimation with the help of a ranking loss Lranking.
The NeuralLift-360 framework can synthesize a plausible 3D object from a single image and provide 360
degree views guided by denoising diffusion models. This technology helps to ease 3D content creation
processes and XR design.
Yu Yin, Kamran Ghasedi, HsiangTao Wu, Jiaolong Yang, Xin Tong, Yun Fu
Figure 1. Image animation results of our method. NeRFInvertor achieves 3D-consistent and ID-preserving
animation (i.e. novel views and expressions) of real subjects given only a single image.
This paper proposes a method for surgically fine-tuning NeRF-GAN models for generating high-fidelity
animation of real subjects using a single image. The method employs 2D loss functions on rendered images
to reduce the identity gap, together with explicit and implicit 3D regularization to remove geometric and visual
artifacts. The results show a high-fidelity, realistic, and consistent 3D animation of real faces.
Dynamic
Capturing dynamic objects is a trend that began in previous conference years. The problem is addressed
either through explicit parametrization or via neural priors.
Kacper Kania, Stephan J. Garbin, Andrea Tagliasacchi, Virginia Estellers, Kwang Moo Yi, Julien Valentin,
Tomasz Trzciński, Marek Kowalski
Figure 2. BlendFields – We implement our approach as a volumetric model, where the appearance (i.e.
radiance) is the sum of the main appearance corrected by blending a small set of K expression-specific
appearances. These appearances are learnt from extreme expressions, and then blended at test-time
according to blend weights computed as a function of the input expression e.
The authors propose a graphics-inspired method that models fine-grained details in face appearance by
blending appearances learned from a sparse set of extreme expressions. It reproduces the appearance of volumetric changes
in a face to create new expressions while preserving fine-grained details. The proposed method is
generalizable to other objects.
Wojciech Zielonka, Timo Bolkart, Justus Thies
Figure 1. Given a short monocular RGB video, our method instantaneously optimizes a deformable neural
radiance field to synthesize a photo-realistic animatable 3D neural head avatar. The neural radiance
field is embedded in a multi-resolution grid around a 3D face model which guides the deformations. The
resulting head avatar can be viewed under novel views and animated at interactive frame rates.
INSTA introduces a method to create photo-realistic digital avatars instantaneously, leveraging a dynamic
neural radiance field around a parametric face model. Unlike prior methods requiring days to reconstruct
an avatar, INSTA reconstructs one in less than 10 minutes using only monocular RGB video on modern GPUs.
Bo He, Xitong Yang, Hanyu Wang, Zuxuan Wu, Hao Chen, Shuaiyi Huang, Yixuan Ren, Ser-Nam Lim, Abhinav
Shrivastava
Figure 1. Comparison of D-NeRV and NeRV when representing diverse videos. NeRV optimizes representation
to every video independently while D-NeRV encodes all videos by a shared model.
D-NeRV is proposed to efficiently scale implicit neural representations to a large number of videos. The
model decouples video-specific and motion information and leverages temporal attention and task-oriented
flow. It yields better compression and recognition accuracy than existing methods such as NeRV.
Chen Geng, Sida Peng, Zhen Xu, Hujun Bao, Xiaowei Zhou
Figure 1. Convergence rate of training. Given a monocular video of a human performer, our model can be
learned in ∼5 minutes to produce photorealistic novel view rendering, which is 100 times faster than
Neural Body.
The authors propose a part-based voxelized human representation and a novel 2D motion parameterization
scheme for learning neural volumetric videos of dynamic humans. Compared to other per-scene optimization
methods, the proposed method is much faster while maintaining competitive rendering quality.
Shangzhe Wu, Ruining Li, Tomas Jakab, Christian Rupprecht, Andrea Vedaldi
Figure 1. Learning Articulated 3D Animals in the Wild. Our method trains on a collection of single-view
images of an animal category and produces a model that can predict the articulated 3D shape of a new
instance from a single test image, which can be animated and relit.
MagicPony combines neural fields and meshes to predict a 3D representation that includes the shape,
articulation, viewpoint, texture, and lighting of an articulated animal, learned purely from in-the-wild
single-view images of the object category. A new viewpoint sampling scheme is introduced to address the
problem of local optima in viewpoint estimation.
Hao Chen, Matt Gwilliam, Ser-Nam Lim, Abhinav Shrivastava
Figure 1. Top: hybrid neural representation with learnable and content-adaptive embedding (ours).
Bottom: video regression for hybrid and implicit neural representations.
HNeRV is an extension to implicit neural representations for videos, which introduces a learnable encoder
to generate content-adaptive embeddings as decoder input and HNeRV blocks to evenly distribute model
parameters. The new architecture outperforms implicit methods in video regression for several metrics and
also on downstream tasks such as video compression and inpainting.
Ricong Huang, Peiwen Lai, Yipeng Qin, Guanbin Li
Figure 1. Comparison between previous explicit, implicit representations and our parametric implicit
representation (PIR). (a) Explicit representations (e.g., 3D face models) have interpretable parameters
but lack expressive power. (b) Implicit representations (e.g., NeRF) have strong expressive power but
are not interpretable. (c) Our PIR takes the best of both approaches and is both interpretable and
expressive, thus paving the way for controllable and high-quality audio-driven facial reenactment.
The authors present a novel approach for audio-driven facial reenactment using a parametric implicit face
representation that combines the best of both explicit and implicit methods. The proposed method
incorporates contextual information into audio-to-expression parameters encoding, uses conditional image
synthesis with a tri-plane structure, and formulates facial reenactment as a conditional image inpainting
problem using a novel data augmentation technique. Results show that the method generates realistic and
high-quality talking heads.
Shishira R Maiya, Sharath Girish, Max Ehrlich, Hanyu Wang, Kwot Sin Lee, Patrick Poirson, Pengxiang Wu,
Chen Wang, Abhinav Shrivastava
Figure 1. Overview of NIRVANA: Prior video INR works perform either pixel-wise or frame-wise prediction.
We instead perform spatio-temporal patch-wise prediction and fit individual neural networks to groups of
frames (clips) which are initialized using networks trained on the previous group. Such an
autoregressive patch-wise approach exploits both spatial and temporal redundancies present in videos
while promoting scalability and adaptability to varying video content, resolution or duration.
NIRVANA proposes a patch-wise neural network architecture to compress videos efficiently by leveraging
temporal redundancy. Quantization is used for efficient parameter storage and improved compression
performance. The network scales up well with resolution and length, demonstrating variable bitrate
compression.
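The autoregressive, clip-wise fitting described above can be sketched as follows: frames are grouped into clips, a small patch-wise network is fitted per clip, and each fit is warm-started from the previous clip's weights to exploit temporal redundancy. The PatchINR architecture, step count, and learning rate below are illustrative assumptions, not NIRVANA's actual configuration.

```python
import copy
import torch
import torch.nn as nn

class PatchINR(nn.Module):
    """Tiny stand-in network mapping (x, y, t) patch coordinates to an RGB patch."""
    def __init__(self, hidden=64, patch=8):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(3, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 3 * patch * patch))
    def forward(self, coords):
        return self.net(coords)

def fit_video_autoregressively(clips, steps=200, lr=1e-3):
    """Fit one network per clip (group of frames), initializing each from the
    previous clip's weights to exploit temporal redundancy (NIRVANA-style sketch)."""
    models, prev = [], None
    for coords, targets in clips:                      # per-clip training pairs
        model = copy.deepcopy(prev) if prev is not None else PatchINR()
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            loss = ((model(coords) - targets) ** 2).mean()
            loss.backward()
            opt.step()
        models.append(model)
        prev = model                                   # warm start for the next clip
    return models
```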
Benjamin Attal, Jia-Bin Huang, Christian Richardt, Michael Zollhoefer, Johannes Kopf, Matthew O'Toole,
Changil Kim
Figure 2. Overview of HyperReel for static scenes. Given a set of images and camera poses, the training
objective is to reconstruct the measured color associated with every ray. (a) For a ray originating at
the camera origin o and traveling in direction ω, we first reparameterize the ray using Plücker
coordinates. (b) A network Eϕ takes this ray as input and outputs the parameters for a set of geometric
primitives {Gk} (such as axis-aligned planes and spheres) and displacement vectors {dk}. (c) To generate
sample points {xk} for volume rendering, we compute the intersections between the ray and the geometric
primitives, and add the displacement vectors to the results. (d) Finally, we perform volume rendering
via Equation 2 to produce a pixel color and supervise training based on the corresponding observation.
HyperReel is introduced as a novel 6-DoF video representation that uses a ray-conditioned sample
prediction network and a compact dynamic volume representation. Its performance surpasses previous and
present methods in quality and small memory requirements, while also producing high-resolution renderings.
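The Plücker reparameterization mentioned above encodes a ray by its normalized direction and its moment vector o × d, which makes the 6-D embedding independent of which point along the ray is chosen as the origin. A minimal sketch:

```python
import numpy as np

def plucker_ray(origin, direction):
    """Parameterize a ray by its Plücker coordinates (d, m) with m = o x d.
    Normalizing the direction makes the 6-vector independent of the chosen origin."""
    d = direction / np.linalg.norm(direction)
    m = np.cross(origin, d)                  # moment vector
    return np.concatenate([d, m])            # 6-D ray embedding fed to the sample network

o = np.array([0.0, 1.5, -2.0])
w = np.array([0.0, -0.3, 1.0])
ray6 = plucker_ray(o, w)
# Any other origin along the same ray yields the same coordinates:
assert np.allclose(ray6, plucker_ray(o + 2.7 * w / np.linalg.norm(w), w), atol=1e-6)
```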
Qi Zhao, M. Salman Asif, Zhan Ma
Figure 1. Results of the proposed DNeRV with 3M parameters for (a) video interpolation on UVG and (b)
video inpainting on DAVIS. The superior performance shows the effectiveness and generalization
capability of DNeRV on neural representation for videos.
DNeRV proposes a new implicit neural representation for video that incorporates explicit frame differences
into the model. This improves the performance of the network on tasks such as video compression,
inpainting, and interpolation.
Zhengming Yu, Wei Cheng, Xian Liu, Wayne Wu, Kwan-Yee Lin
Figure 2. Framework of MonoHuman. In the Shared Bidirectional Deformation module, points in different
observation spaces X_o are deformed to X_c in a shared canonical space by the backward deformation D_b,
and X_c can be deformed back to points X'_o in observation space by the symmetric forward deformation D_f.
A consistency loss L_consis regularizes the motion weights. In the Forward Correspondence Search module,
X_c is deformed to X_i and X_j in two correspondence spaces to obtain features f_i, f_j and colors c_i,
c_j, which are blended into a feature F using weights generated by the Blend MLP; F finally guides the
rendering network.
MonoHuman creates view-consistent and high-fidelity virtual 3D models of human bodies from monocular
videos through a bidirectional deformation field and keyframe information for rendering. The method
achieves superior results compared to state-of-the-art methods that rely on pose-dependent representations
or frame-independent optimization.
Zhiyang Guo, Wengang Zhou, Min Wang, Li Li, Houqiang Li
Figure 1. Given a set of multi-view images capturing a pose sequence of a single hand or two interacting
hands (left), HandNeRF models the scene in a unified manner with neural radiance fields, enabling
rendering of novel hand poses from arbitrary viewing directions (right).
HandNeRF allows the reconstruction of accurate appearance and geometry for interacting hands based on
multi-view images of a single hand or two interacting hands. It employs a pose-driven deformation field to establish
correspondence and optimize a pose-disentangled NeRF, leveraging pose priors to generate pseudo depth maps
and a neural feature distillation method for color optimization. HandNeRF achieves state-of-the-art
results on InterHand2.6M dataset.
Vinoj Jayasundara, Amit Agrawal, Nicolas Heron, Abhinav Shrivastava, Larry S. Davis
Figure 1. Overview of our approach. Pose-independent temporal deformation is used in conjunction with
pose-dependent motion fields (rigid and non-rigid). We choose one of the input frames as the canonical
view, allowing us to use cyclic-consistency for regularization.
FlexNeRF uses a pose-dependent motion field and pose-independent temporal deformations to render
photorealistic free-viewpoint human motion. The method optimizes a canonical time and pose configuration
and incorporates temporal and cyclic consistency constraints and intermediate representation segmentation
loss to provide high-quality outputs with sparse views. The method is an improvement over existing
state-of-the-art methods and is demonstrated on public benchmark datasets and a self-captured fashion
dataset.
Chaoyang Wang, Lachlan Ewen MacDonald, László A. Jeni, Simon Lucey
Figure 1. We propose a method to use optical flow supervision for deformable NeRF. It noticeably
improves novel view synthesis for monocular videos with rapid object motions. In the figure, we
visualize rendered novel view images and depth maps for the first and last frame of the input video.
The paper presents a method that uses optical flow to supervise deformable NeRF. Since such models
typically learn a backward rather than a forward deformation, the authors show that inverting the backward
deformation field simplifies the problem and extends flow supervision to a broader range of backward
deformation fields. The method demonstrates significant improvements in monocular novel view synthesis
with rapid object motion.
Kuan-Chieh Wang, Zhenzhen Weng, Maria Xenochristou, Joao Pedro Araujo, Jeffrey Gu, C. Karen Liu, Serena
Yeung
Figure 1. By leveraging the shared 3D information across videos, NeMo can accurately recover the dynamic
ranges of athletic motion, where existing 3D human mesh recovery (HMR) methods struggle. To illustrate,
VIBE, a baseline video-based HMR method, fails to capture the large step taken by the subject in the
“Baseball Pitch” example, and swaps the arms in “Tennis Serve” (in subfigure A). Furthermore, NeMo
recovers global root trajectory more accurately than existing global HMR method, GLAMR (in subfigure B).
The Neural Motion (NeMo) field introduces a framework for monocular human mesh recovery (HMR) that combines
the shared 3D information across multiple videos into a single optimized motion representation. NeMo
outperforms existing HMR methods, recovering 3D motion in sports better than existing baselines.
Zhiwen Yan, Chen Li, Gim Hee Lee
Figure 1. Comparison of novel views rendered by HyperNeRF (left) and our NeRF-DS (right), on the
“americano” scene in the HyperNeRF dataset (top) and the “basin” scene in our dynamic specular dataset
(bottom). Our NeRF-DS model significantly improves the reconstruction quality by a surface-aware dynamic
NeRF and a mask guided deformation field.
NeRF-DS improves dynamic NeRF by conditioning the neural radiance field function on surface position and
orientation in the observation space and includes a mask of moving objects to guide the deformation field,
leading to better performance on moving specular objects than existing methods.
Sungheon Park, Minjung Son, Seokhwan Jang, Young Chun Ahn, Ji-Yeon Kim, Nahyup Kang
Figure 1. We propose simple yet effective feature interpolation methods for training neural radiance
fields of dynamic scenes based on temporal interpolation. We provide two different feature vector
representations, neural representation (top) and grid representation (bottom), both of which are the
concatenation of static feature vectors and temporally-interpolated dynamic feature vectors. The neural
representation exhibits high-quality rendering performance with small-sized models while the grid
representation shows competitive rendering results with astonishingly fast training speed.
InterpNeRF offers two feature interpolation methods depending on whether the underlying representation is
neural networks or grids. By interpolating space-time features, the method achieves state-of-the-art
performance in both rendering quality and training speed.
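The core idea, concatenating a static feature with a temporally interpolated dynamic feature, can be sketched as below. Storing dynamic features at a handful of keyframe times and linearly interpolating between neighbors is an illustrative simplification of the paper's neural and grid variants.

```python
import numpy as np

def spacetime_feature(static_feat, dynamic_feats, key_times, t):
    """Concatenate a static feature with a temporally interpolated dynamic feature.
    dynamic_feats: (K, C) features stored at K keyframe times; linear interpolation
    between the two keyframes neighboring query time t."""
    i = np.clip(np.searchsorted(key_times, t) - 1, 0, len(key_times) - 2)
    t0, t1 = key_times[i], key_times[i + 1]
    w = (t - t0) / (t1 - t0)
    dyn = (1 - w) * dynamic_feats[i] + w * dynamic_feats[i + 1]
    return np.concatenate([static_feat, dyn])

key_times = np.linspace(0.0, 1.0, 5)
static_feat = np.random.randn(16)
dynamic_feats = np.random.randn(5, 16)
f = spacetime_feature(static_feat, dynamic_feats, key_times, t=0.37)   # fed to the NeRF MLP
```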
Yu-Lun Liu, Chen Gao, Andreas Meuleman, Hung-Yu Tseng, Ayush Saraf, Changil Kim, Yung-Yu Chuang,
Johannes Kopf, Jia-Bin Huang
Figure 1. Robust space-time synthesis from dynamic monocular videos. Our method takes a casually
captured video as input and reconstructs the camera trajectory and dynamic radiance fields. Conventional
SfM systems such as COLMAP fail to recover camera poses even when using ground-truth motion masks. As a
result, existing dynamic radiance field methods that require accurate pose estimation do not work on
these challenging dynamic scenes. Our work tackles this robustness problem and showcases high-fidelity
dynamic view synthesis results on a wide variety of videos.
This paper proposes a new method to reconstruct a dynamic scene by jointly estimating the camera
parameters, static and dynamic radiance fields, without relying on SfM algorithms. The proposed method
provides more robustness on challenging video cases than previous methods and exhibits favorable
performance in quantitative and qualitative experiments.
Jeff Tan, Gengshan Yang, Deva Ramanan
Figure 1. By distilling knowledge from dynamic NeRFs fitted to offline video data at scale, we present a
method to train category-specific real-time video shape predictors, which output temporally-consistent
viewpoint, articulation, and appearance given casual input videos. Our method replaces expensive
test-time optimization with a single forward pass, allowing real-time inference on a RTX-3090 GPU.
Compared to existing model-based methods for reconstructing humans and animals in motion, our method
does not require pre-defined 3D templates or ground-truth 3D data to train. (Project page:
https://jefftan969.github.io/dasr).
The authors introduce a real-time method for reconstructing articulated 3D models from videos. Dynamic
NeRFs fitted offline are distilled into a fast feed-forward network that reconstructs objects of a category
in a single forward pass, without requiring pre-defined 3D templates or ground-truth 3D data.
Chuhan Chen, Matthew O'Toole, Gaurav Bharaj, Pablo Garrido
Figure 1. Main results on test samples. Our method models dynamic 3D deformations as an ensemble of
local deformation fields, centered around 3D facial landmarks, shown as red, blue, and green dots in
this example (details in Sec. 3). Our formulation synthesizes 3D neural heads from 2D videos with
fine-grained geometric details, as shown in column 3 (depth field).
The authors propose a part-based 3D head avatar reconstruction method with local, semantic, rig-like
control, built on implicit shape models; it decomposes the global deformation field into local ones. The
method produces sharper locally controlled deformations than previous implicit monocular approaches and is
especially good at rendering nonlinear deformations of the mouth interior, asymmetric expressions, and
facial details.
Ang Cao, Justin Johnson
Figure 1. HexPlane for Dynamic 3D Scenes. Instead of regressing colors and opacities from a deep MLP, we
explicitly compute features for points in spacetime via HexPlane. Pairing with a tiny MLP, it allows
above 100× speedups with matching quality.
HexPlane is introduced as an explicit 4D scene representation that computes features using six learned
planes. HexPlane outperforms NeRF and leads to high-quality reconstructions and novel view synthesis in
dynamic scenes.
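A hedged sketch of a HexPlane-style lookup: a spacetime point (x, y, z, t) is projected onto six planes, each spatial plane is paired with its complementary space-time plane, the paired features are multiplied element-wise, and the three products are concatenated for a tiny decoder MLP. The nearest-neighbor lookup and fusion details below are simplifications of the paper's design.

```python
import numpy as np

def plane_lookup(plane, u, v):
    """Nearest-neighbor lookup into an (R, R, C) plane (bilinear in practice)."""
    R = plane.shape[0]
    i = int(round(np.clip(u, 0, 1) * (R - 1)))
    j = int(round(np.clip(v, 0, 1) * (R - 1)))
    return plane[i, j]

def hexplane_features(planes, x, y, z, t):
    """Pair each spatial plane with the complementary space-time plane, multiply
    the paired features element-wise, and concatenate the three products."""
    f_xy = plane_lookup(planes['xy'], x, y) * plane_lookup(planes['zt'], z, t)
    f_xz = plane_lookup(planes['xz'], x, z) * plane_lookup(planes['yt'], y, t)
    f_yz = plane_lookup(planes['yz'], y, z) * plane_lookup(planes['xt'], x, t)
    return np.concatenate([f_xy, f_xz, f_yz])   # decoded by a tiny MLP into color/density

R, C = 48, 8
planes = {k: np.random.randn(R, R, C) for k in ('xy', 'xz', 'yz', 'xt', 'yt', 'zt')}
feat = hexplane_features(planes, 0.2, 0.8, 0.5, 0.1)   # (3*C,) feature for (x, y, z) at time t
```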
Yuheng Jiang, Kaixin Yao, Zhuo Su, Zhehao Shen, Haimin Luo, Lan Xu
Figure 1. Our Instant-NVR adopts a separated instant neural representation to achieve photo-realistic
rendering for human-object interacting scenarios.
Instant-NVR proposes an approach for instant volumetric human-object tracking and rendering in complex
environments using a single RGBD camera. It bridges the gap between traditional non-rigid tracking and
recent instant radiance field techniques through a multi-thread tracking-rendering mechanism that uses an
efficient on-the-fly reconstruction scheme. The approach achieves real-time photo-realistic novel view
synthesis under complex human-object interactions.
Heng Yu, Joel Julin, Zoltan A. Milacski, Koichiro Niinuma, Laszlo A. Jeni
Figure 1. Our proposed DyLiN for dynamic 3D scene rendering achieves higher quality than its HyperNeRF
teacher model and the stateof-the-art TiNeuVox model, while being an order of magnitude faster. Right:
DyLiN is of moderate storage size (shown by dot radii). For each method, the relative improvement in
Peak Signal-to-Noise Ratio over NeRF (∆PSNR) is measured for the best-performing scene.
DyLiN extends light field networks to handle non-rigid deformations by learning a deformation field, and
CoDyLiN augments the model with controllable attribute inputs. Both models are trained by knowledge
distillation from existing dynamic NeRF teachers, and they outperform the teacher models as well as
state-of-the-art methods on deformation and attribute-controlled scenarios, respectively.
Ruizhi Shao, Zerong Zheng, Hanzhang Tu, Boning Liu, Hongwen Zhang, Yebin Liu
Figure 1. Given 4 sparse static RGB camera views of a dynamic scene (a), our proposed Tensor4D
decomposition enables multiview reconstruction to achieve fine-grained geometry reconstruction even on
human fingers (b) and temporally consistent novel view synthesis on a 3D holographic display (c,d,e). The
four cameras are placed at the four corners of the display. The proposed method demonstrates a low-cost, portable
and highly immersive telepresence experience.
Tensor4D is a dynamic scene modeling approach that utilizes a 4D spatio-temporal tensor representation.
Tensor4D decomposes the 4D tensor hierarchically by projecting it into three time-aware volumes and nine
compact feature planes to tackle the memory issue. The method is validated on both synthetic and
real-world scenes and achieves high-quality dynamic reconstruction and rendering even from sparse-view
camera rigs or a monocular camera.
Sara Fridovich-Keil, Giacomo Meanti, Frederik Warburg, Benjamin Recht, Angjoo Kanazawa
Figure 1. Planar factorization of d-dimensional spaces. We propose a simple planar factorization for
volumetric rendering that naturally extends to arbitrary-dimensional spaces, and that scales gracefully
with dimension in both optimization time and model size. We show the advantages of our approach on 3D
static volumes, 3D photo collections with varying appearances, and 4D dynamic videos.
K-planes is a white-box model used to represent radiance fields in arbitrary dimensions. The method uses d
choose 2 planes to represent a d-dimensional scene, providing dimension-specific priors and a natural
decomposition of static and dynamic components. K-planes outperforms existing methods with low memory
overhead and fast optimization.
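K-planes generalizes the planar factorization to any dimension d by keeping one feature plane per pair of axes (d choose 2 planes) and combining the sampled features, typically by element-wise (Hadamard) product. The sketch below shows that lookup for a 4D point; the nearest-neighbor sampling and grid sizes are illustrative assumptions.

```python
import numpy as np
from itertools import combinations

def kplanes_features(planes, point):
    """K-planes-style lookup for a d-dimensional point in [0, 1]^d: sample one
    feature per pair of axes and combine them by element-wise product."""
    feat = None
    for (a, b) in combinations(range(len(point)), 2):
        plane = planes[(a, b)]                                   # (R, R, C) grid for axes (a, b)
        R = plane.shape[0]
        i = int(round(np.clip(point[a], 0, 1) * (R - 1)))
        j = int(round(np.clip(point[b], 0, 1) * (R - 1)))
        f = plane[i, j]                                          # nearest lookup; bilinear in practice
        feat = f if feat is None else feat * f                   # Hadamard combination
    return feat

d, R, C = 4, 32, 8                                               # 4D (x, y, z, t) -> 6 planes
planes = {pair: np.random.randn(R, R, C) for pair in combinations(range(d), 2)}
feat = kplanes_features(planes, np.array([0.1, 0.6, 0.4, 0.9]))  # (C,) feature for the decoder
```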
Sida Peng, Yunzhi Yan, Qing Shuai, Hujun Bao, Xiaowei Zhou
Figure 1. The basic idea of dynamic MLP maps. Instead of modeling the volumetric video with a big MLP
network, we exploit a 2D convolutional neural network to dynamically generate 2D MLP maps at each video
frame, where each pixel stores the parameter vector of a small MLP network. This enables us to
represent volumetric videos with a set of small MLP networks, thus significantly improving the rendering
speed.
The paper introduces a novel representation of dynamic volumetric data, the "MLP maps", by encoding
shallow MLP network parameters into a 2D grid. With this representation, real-time view synthesis of
dynamic scenes is possible. This approach improves storage costs and rendering speed compared to previous
methods and achieves state-of-the-art results on NHR and ZJU-MoCap datasets.
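The "MLP map" idea can be sketched as a 2D grid whose every pixel stores the flattened parameters of a tiny MLP; a query fetches the parameters at one pixel and evaluates that network on the sample. In the paper the map is regressed per frame by a CNN; here the map is random and the layer sizes are assumptions.

```python
import numpy as np

def make_mlp_map(H, W, in_dim=3, hidden=8, out_dim=4, rng=np.random):
    """A 2D map whose every pixel stores the parameters of a tiny 2-layer MLP."""
    n_params = in_dim * hidden + hidden + hidden * out_dim + out_dim
    return rng.randn(H, W, n_params) * 0.1, (in_dim, hidden, out_dim)

def eval_mlp_map(mlp_map, dims, uv, x):
    """Fetch the tiny MLP stored at pixel uv and evaluate it on input x."""
    in_dim, hidden, out_dim = dims
    params = mlp_map[uv]
    o = 0
    W1 = params[o:o + in_dim * hidden].reshape(in_dim, hidden); o += in_dim * hidden
    b1 = params[o:o + hidden]; o += hidden
    W2 = params[o:o + hidden * out_dim].reshape(hidden, out_dim); o += hidden * out_dim
    b2 = params[o:o + out_dim]
    h = np.maximum(x @ W1 + b1, 0.0)          # ReLU hidden layer
    return h @ W2 + b2                        # e.g. RGB + density

mlp_map, dims = make_mlp_map(64, 64)          # in the paper a CNN regresses this map per frame
out = eval_mlp_map(mlp_map, dims, (12, 40), np.array([0.3, 0.5, 0.2]))
```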
Editable and Composable
This section covers methods for composing, controlling, or editing NeRFs.
Stephen Tian, Yancheng Cai, Hong-Xing Yu, Sergey Zakharov, Katherine Liu, Adrien Gaidon, Yunzhu Li,
Jiajun Wu
Figure 1. While typically studied visual manipulation settings are carefully controlled environments, we
consider scenarios with varying and even harsh lighting, in addition to novel object configurations,
that are more similar to real-world scenarios.
An object-centric neural scattering function (OSF) is used as a representation for object manipulation in
a model-predictive control framework. OSFs produce better generalization and improved control in
manipulations under varying lighting conditions.
Jun-Kun Chen, Jipeng Lyu, Yu-Xiong Wang
Figure 1. Our NeuralEditor offers native support for general and flexible shape editing of neural
radiance fields via manipulating point clouds. By generating a precise point cloud of the scene with a
novel point cloud-guided NeRF model, our NeuralEditor produces highfidelity rendering results in both
shape deformation and more challenging scene morphing tasks.
NeuralEditor offers 3D point-cloud editing capability by directly editing the explicit point cloud
structured variant of NeRF. It employs K-D trees to facilitate space partitioning, rendering, and
precise derivation of the edited scene’s point clouds. The authors evaluated NeuralEditor against
state-of-the-art algorithms and achieved high-quality rendering results.
Silvan Weder, Guillermo Garcia-Hernando, Aron Monszpart, Marc Pollefeys, Gabriel Brostow, Michael
Firman, Sara Vicente
Figure 1. Removal of unsightly objects. Our method allows for objects to be plausibly removed from NeRF
reconstructions, inpainting missing regions whilst preserving multi-view coherence.
The proposed method enables removal of unwanted objects from NeRFs, using guidance from provided masks.
Confidence-based view selection maintains coherence and consistency of the resulting NeRF.
Yue Chen, Xuan Wang, Xingyu Chen, Qi Zhang, Xiaoyu Li, Yu Guo, Jue Wang, Fei Wang
Figure 1. We decompose a dynamic human into 3D UV Volumes along with a 2D texture. The disentanglement
of appearance and geometry enables us to achieve (a) high-fidelity real-time novel view synthesis guided
by (b) a smooth UV avatar, (c) retexturing of a 3D human by editing a 2D texture, (d) reshaping and (e)
reposing by changing the parameters of a human model while keeping the texture untouched.
UV Volumes is introduced as a way to render a photo-realistic video of a human subject in real time. A UV
volume maps the high-dimensional appearance of the subject into a lower-dimensional space, allowing the use
of a shallower neural network and enabling novel applications such as retexturing.
Richard Liu, Noam Aigerman, Vladimir G. Kim, Rana Hanocka
Figure 1. DA Wand enables interactive decaling through a conditional selection of local distortion-aware
patches. Our method identifies large patches within both developable and high-curvature regions which
induce a low distortion parameterization.
The authors introduce DA Wand, a neural mesh-parameterization framework for interactive workflows: given a
user-selected point on a mesh, it predicts a meaningful, large surrounding patch that admits a
low-distortion parameterization.
Ashkan Mirzaei, Tristan Aumentado-Armstrong, Konstantinos G. Derpanis, Jonathan Kelly, Marcus A.
Brubaker, Igor Gilitschenski, Alex Levinshtein
Figure 1. An example of the inputs and outputs of our 3D inpainting framework. In addition to the images
captured from the scene and their corresponding camera parameters, users are tasked with providing a few
points in a single image to indicate which object they wish to remove from the scene (upper-left inset).
These sparse annotations are then automatically transferred to all other views, and utilized for
multiview mask construction (upper-right inset). The resulting 3D-consistent mask is used in a
perceptual optimization problem that results in 3D scene inpainting (lower row), with rendered depth
from the optimized NeRF shown for each image as an inset.
The paper introduces a method for 3D inpainting with NeRFs: sparse user annotations are lifted to a
3D-consistent segmentation mask, and a perceptual optimization over the NeRF performs view-consistent
removal of undesired objects from the scene. The method is evaluated on a newly introduced benchmark
dataset.
Zhengfei Kuang, Fujun Luan, Sai Bi, Zhixin Shu, Gordon Wetzstein, Kalyan Sunkavalli
Figure 1. We propose PaletteNeRF, a novel method for efficient appearance editing of neural radiance
fields (NeRF). Taking (a) multi-view photos as training input, our approach reconstructs a NeRF and
decomposes its appearance into a set of (b) 3D palette-based color bases. This enables (c) intuitive and
photorealistic recoloring of the scene with 3D consistency across arbitrary views. Further, we show that
(d) our method supports various palette-based editing applications such as illumination modification and
3D photorealistic style transfer.
PaletteNeRF is a method for efficient appearance editing of neural radiance fields (NeRF) in 3D scenes. It
decomposes the appearance of each 3D point into a linear combination of palette-based bases that are
shared across the scene. The method also allows for semantic-aware appearance editing.
Zicheng Zhang, Yinglu Liu, Congying Han, Yingwei Pan, Tiande Guo, Ting Yao
Figure 1. Illustrations of different strategies for photorealistic 3D scene stylization. The 2D PST
method generates disharmonious orange color on the stem and petaline edge. The regions bounded by the
dotted ellipses are inconsistent due to the white spot in the first view. When employing a radiance
field to reconstruct the results of 2D PST method, the disharmony and inconsistency are still retained.
Our LipRF successfully eliminates these downsides, and renders high-quality stylized view syntheses.
LipRF learns a Lipschitz mapping to stylize a pre-trained NeRF and generate photorealistic stylized scenes
from novel views while maintaining cross-view consistency. LipRF first pre-trains a radiance field, then
uses 2D photorealistic style transfer as a prior to learn a Lipschitz network that stylizes the pre-trained
appearance. Adaptive regularization balances reconstruction and stylization, and gradient aggregation keeps
the optimization cost-efficient.
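The Lipschitz network at the heart of LipRF bounds how much the stylization mapping can change neighboring radiance values. The sketch below shows one common way to build such a layer, rescaling each weight row so its norm stays under a learned per-layer bound; the exact normalization used in the paper may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LipschitzLinear(nn.Module):
    """Hedged sketch of a Lipschitz-bounded linear layer: each weight row is rescaled
    so its L1 norm does not exceed a learned per-layer bound softplus(c). The product
    of the per-layer bounds caps the network's overall Lipschitz constant."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_dim, in_dim) * 0.05)
        self.bias = nn.Parameter(torch.zeros(out_dim))
        self.c = nn.Parameter(torch.tensor(1.0))     # learned bound on the layer norm

    def forward(self, x):
        bound = F.softplus(self.c)
        row_norm = self.weight.abs().sum(dim=1, keepdim=True)
        scale = torch.clamp(bound / (row_norm + 1e-8), max=1.0)   # only ever shrink rows
        return F.linear(x, self.weight * scale, self.bias)

net = nn.Sequential(LipschitzLinear(3, 64), nn.ReLU(), LipschitzLinear(64, 3))
out = net(torch.rand(16, 3))     # smooth color transform applied to radiance values
```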
Chengxuan Zhu, Renjie Wan, Yunkai Tang, Boxin Shi
Figure 1. Pipeline of the proposed method. By taking the positional encoding of position and viewing
direction, the “scene MLP” (colored in blue) models the scene with occlusion faithfully, and the
“background MLP” (colored in green) models the background only. To this end, a cost volume is
constructed using images captured from neighboring reference views, guiding the joint optimization of
scene MLP and camera parameters, as we discuss in Sec. 4.1. To supervise the training of background MLP
and remove occlusions, we propose to aggregate the information from scene MLP about whether the output
of background NeRF should be similar to the observed color, by learning a supervision mask from the
weights along the ray using a “mask MLP” (colored in amber), as explained in Sec. 4.2. The reference and
target views constitute the family of neighboring images whose features are warped to construct the cost
volume.
The proposed method, called Occlusion-Aware NeRF, can eliminate occlusions in images by training and
optimizing a neural radiance field with a depth constraint and without external supervision. Results show
effective elimination of occlusions and improved overall image quality.
Marco Toschi, Riccardo De Matteo, Riccardo Spezialetti, Daniele De Gregorio, Luigi Di Stefano, Samuele
Salti
Figure 1. A random subset of scenes from our dataset. Our dataset enables the study of joint novel view
synthesis and relighting from real images, as it provides scenes framed from the same viewpoint under
varying light positions (shown along rows) as well as captured from different viewpoints under the same
light position (shown along columns).
ReNe is a new dataset that captures real-world objects under varying light directions and accompanies this
data with accurate camera and light positions for generating re-lit NeRF renders. This allows for
benchmarking new models or estimating reflective properties with good accuracy.
Chengwei Zheng, Wenbin Lin, Feng Xu
Figure 2. EditableNeRF pipeline. The query point x is first warped into the canonical space by a warp
field and a latent code βt in frame t. Next, we compute the key point weights of this canonical point x′
and use it to calculate a linear combination of all key point positions kt, called weighted key points.
After that, we feed the following NeRF MLP with the weighted key points and x′, then the output density
and color are used for volumetric rendering. In the training stage, optical flow and depth maps are used
to supervise key point positions.
EditableNeRF proposes a way to edit dynamic scenes modeled from a single-camera image sequence, using key
points as handles for editing. The method adapts these key points' positions to handle changes in topology.
Kunhao Liu, Fangneng Zhan, Yiwen Chen, Jiahui Zhang, Yingchen Yu, Abdulmotaleb El Saddik, Shijian Lu,
Eric Xing
Figure 1. Zero-shot 3D Style Transfer. Given a set of Multi-view Content Images of a 3D scene, StyleRF
can transfer arbitrary Reference Styles to the 3D scene in a zero-shot manner, rendering high-quality
Stylized Novel Views with superb multi-view consistency.
StyleRF resolves the three-way dilemma of accurate geometry reconstruction, high-quality stylization, and
being generalizable by performing the style transformation within the feature space of a radiance field.
This leads to high-quality zero-shot style transfer and reliable geometry restoration via volume
rendering. The approach is based on an explicit grid of high-level features.
Yuechen Zhang, Zexin He, Jinbo Xing, Xufeng Yao, Jiaya Jia
Figure 1. Given a pair of one reference view from a radiance field and its stylization, Ref-NPR
propagates style more faithfully to novel views with semantic correspondence compared with
state-of-the-art scene stylization method ARF.
Ref-NPR is a controllable stylization method for 3D scenes that uses radiance fields. It uses a single
stylized 2D view as a reference and exploits semantic correspondences in content images to fill in
occluded regions with perceptually similar styles, resulting in high-quality stylized novel view
sequences.
Chong Bao, Yinda Zhang, Bangbang Yang, Tianxing Fan, Zesong Yang, Hujun Bao, Guofeng Zhang, Zhaopeng Cui
Figure 1. We propose a novel semantic-driven image-based editing approach, which allows users to edit a
photo-realistic NeRF with a single-view image or with text prompts, and renders edited novel views with
vivid details and multi-view consistency.
SINE is a method for editing neural radiance fields using a single input image or text prompts. The method encodes
geometric and texture editing in 3D space by using a prior-guided editing field, which is aided by cyclic
constraints, a color compositing mechanism, and a feature-cluster-based regularization. The experiments on
both synthetic and real-world data show the effectiveness of the proposed method.
Decomposition
In this section, the radiance field of a NeRF is decomposed into geometry, BRDF, and illumination, which
enables consistent relighting under arbitrary illumination.
Haoqian Wu, Zhipeng Hu, Lincheng Li, Yongqiang Zhang, Changjie Fan, Xin Yu
Figure 1. Our method integrates lights through path tracing with Monte Carlo sampling, while Invrender
uses Spherical Gaussians to approximate the overall illumination. In this way, our method simultaneously
optimizes indirect illuminations and materials, and achieves better decomposition of inter-reflections.
The method applies Monte Carlo path tracing (MCPT) and radiance caching to neural radiance fields for
practical inverse rendering, improving the decomposition of geometry, materials, and illumination.
Jingsen Zhu, Yuchi Huo, Qi Ye, Fujun Luan, Jifan Li, Dianbing Xi, Lisha Wang, Rui Tang, Wei Hua, Hujun
Bao, Rui Wang
Figure 1. I2-SDF. Left: State-of-the-art neural implicit surface representation method fails in
reconstructing small objects inside an indoor scene (e.g. lamps and chandeliers), which is resolved by
our bubbling method. Middle and Right: Our intrinsic decomposition and raytracing method enable
photo-realistic scene editing and relighting applications.
I^2-SDF uses a neural SDF-based approach to jointly recover shapes, incident radiance, and materials for
intrinsic indoor scene reconstruction and editing, as well as a bubble loss for small objects and
error-guided adaptive sampling for improved quality on large-scale scenes. Additionally, the method
decomposes the neural radiance field into a spatially-varying material of the scene using Monte Carlo
raytracing and semantic segmentation, enabling accurate photorealistic relighting and editing
applications.
Zongrui Li, Qian Zheng, Boxin Shi, Gang Pan, Xudong Jiang
Figure 5. Framework overview of DANI-Net. DepthMLP takes the positional code as input and outputs depth
w_i. MaterialMLP takes the same positional code as input and outputs ρ^d_i and c^k_i. The shadow s_ij is
calculated based on Eq. (3), and the spatially varying anisotropic specularity ρ^s_ij is obtained through
Eq. (5). Rendered images are generated through Eq. (2). The inverse rendering loss L_IR measures the
rendering error between rendered and observed images, and backpropagates to the light calibration and
shape estimation through the differentiable shadow path (in red and green) and the anisotropic reflectance
path (in blue and green).
DANI-Net is a deep inverse rendering framework for uncalibrated photometric stereo. It incorporates
differentiable handling of shadows, and models anisotropic reflectance, achieving superior and robust
performance on general materials with complex shapes.
Siqi Yang, Xuanning Cui, Yongjie Zhu, Jiajun Tang, Si Li, Zhaofei Yu, Boxin Shi
Figure 1. Illustration of our overall pipeline. For each image, the corresponding camera rays and
sampled 3D points are positionally encoded for MLPs (gray block). Our IntrinsicCNN and LightingCNN
modules derive lighting-(in)dependent intrinsic components and second-order spherical harmonics (SH)
lighting coefficients (green block), while Our NeRF module provides the pseudo labels of intrinsics and
sky mask by volumetric rendering through MLPs (blue block). Our SkyMLP module renders the sky with
viewing direction and extracted lighting (orange block). Given lighting extracted from the
input/reference image, the reconstructed/relighted image is rendered by photometric image formation
along with the rendered sky (yellow block).
The paper presents an outdoor scene relighting method by combining volume rendering using Neural Radiance
Fields (NeRF) with Intrinsic Image Decomposition (IID) obtained through multi-view stereo. The use of NeRF
enables editing the appearance code without explicit photo interpretation. The method achieves improved
results on outdoor scene relighting.
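Second-order spherical harmonics lighting, as used in this pipeline, shades a surface by a 9-term dot product between SH basis values at the surface normal and per-channel lighting coefficients. A minimal sketch, assuming the cosine-lobe convolution is already folded into the coefficients:

```python
import numpy as np

def sh_basis_order2(n):
    """Real spherical harmonics basis up to l=2 (9 terms) at unit normal n."""
    x, y, z = n
    return np.array([
        0.282095,
        0.488603 * y, 0.488603 * z, 0.488603 * x,
        1.092548 * x * y, 1.092548 * y * z,
        0.315392 * (3.0 * z * z - 1.0),
        1.092548 * x * z, 0.546274 * (x * x - y * y),
    ])

def sh_shade(albedo, normal, sh_coeffs):
    """Diffuse shading from second-order SH lighting: albedo * sum_k c_k * Y_k(n).
    sh_coeffs has shape (9, 3) -- one set of 9 coefficients per RGB channel."""
    basis = sh_basis_order2(normal / np.linalg.norm(normal))
    irradiance = basis @ sh_coeffs                 # (3,) per-channel lighting
    return albedo * np.clip(irradiance, 0.0, None)

sh_coeffs = np.zeros((9, 3)); sh_coeffs[0] = [2.5, 2.4, 2.2]; sh_coeffs[2] = [0.6, 0.6, 0.7]
rgb = sh_shade(np.array([0.7, 0.6, 0.5]), np.array([0.0, 0.3, 1.0]), sh_coeffs)
```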
Mingwu Zheng, Haiyu Zhang, Hongyu Yang, Di Huang
Figure 2. Method overview. NeuFace is composed of an appearance module implemented by Spatial MLP and
Integrated Basis MLP, a geometry module achieved by SDF MLP, a physically-based forward rendering, and a
neural photometric calibration module. In forward rendering, the complex appearance reflectance
properties are explicitly decomposed to (a) diffuse albedo, (b) light integral, and (c) BRDFs integral.
Closed-form solutions are provided for the first two terms under learnable Spherical Harmonics (SH)
lighting. For the BRDFs integral term, a low-rank prior is introduced to learn a set of neural bases
which describe the complex material characteristics of facial skin.
NeuFace uses neural rendering with physically meaningful BRDFs to accurately and efficiently capture the
complex reflectance properties of 3D faces, improving rendering quality.
Zian Wang, Tianchang Shen, Jun Gao, Shengyu Huang, Jacob Munkberg, Jon Hasselgren, Zan Gojcic, Wenzheng
Chen, Sanja Fidler
Figure 1. We present FEGR, an approach for reconstructing scene geometry and recovering intrinsic
properties of the scene from posed camera images. Our approach works both for single and
multi-illumination captured data. FEGR enables various downstream applications such as VR and AR where
users may want to control the lighting of the environment and insert desired 3D objects into the scene.
The proposed approach jointly reconstructs geometry, materials, and HDR lighting in a large urban
environment. The method uses a neural field for primary visibility and an explicit mesh for secondary
visibility, enabling photorealistic relighting, physics-based manipulations, and shadow casting.
Zhen Li, Lingli Wang, Mofang Cheng, Cihui Pan, Jiaqi Yang
Figure 1. Given a set of posed sparse-view images for a large-scale scene, we reconstruct global
illumination and SVBRDFs. The recovered properties are able to produce convincing results for several
mixed-reality applications such as material editing, editable novel view synthesis and relighting. Note
that we change roughness of all walls, and albedo of all floors. The detailed specular reflectance shows
that our method successfully decomposes physically-reasonable SVBRDFs and lighting. Please refer to
supplementary videos for more animations.
TexIR enables efficient and realistic multi-view inverse rendering for large indoor scenes. It uses a
hybrid lighting representation called Texture-based Lighting (TBL) and a three-stage material optimization
strategy to recreate the SVBRDFs and global illumination.
Ruichen Zheng, Peng Li, Haoqian Wang, Tao Yu
Figure 1. Given sparse-view RGBD input (left), our method reconstructs detailed geometry (middle left),
albedo (middle right), and visibility for realistic 3D human relighting under different illuminations
(right).
The proposed method enables high-quality 3D reconstruction and relighting of digital humans by
incorporating occupancy, albedo, and visibility fields. The visibility field allows evaluating light
attenuation and resolves occlusion ambiguity, while a novel rendering-inspired loss enforces alignment
between visibility and occupancy fields for end-to-end training.
Ziang Cheng, Junxuan Li, Hongdong Li
Figure 1. Our method reconstructs object geometry and reflectance from unstructured flashlight and
non-flashlight images captured under unknown environment lights. We use a co-located camera-light
configuration that is available with most smartphones. Reconstructed objects can be easily converted to
industry-ready triangle meshes and PBR textures. The image to the right is rendered from our mesh
export.
The authors of "Flash-and-Reflectance: In-the-Wild Photometric Stereo for Unknown Lighting" introduce a
smartphone flashlight as a minimal source of controlled lighting for a multi-view photometric stereo,
which is then disentangled into its flash and reflectance components to enable physically accurate inverse
rendering.
Haian Jin, Isabella Liu, Peijia Xu, Xiaoshuai Zhang, Songfang Han, Sai Bi, Xiaowei Zhou, Zexiang Xu, Hao
Su
Figure 1. Given multi-view captured images of a real scene (a), our approach – TensoIR – is able to
achieve high-quality shape and material reconstruction with high-frequency details (b). This allows us
to render the scene under novel lighting and viewpoints (c), and also change its material properties
(d).
TensoIR proposes a novel inverse rendering approach based on tensor factorization and neural fields
(TensoRF). TensoIR can jointly estimate the radiance field and physical properties, leading to
photo-realistic novel view synthesis. The method can handle input images captured under single or multiple
unknown lighting conditions.
Other
Several works with excellent results in various fields.
Yufei Wang, Yi Yu, Wenhan Yang, Lanqing Guo, Lap-Pui Chau, Alex Kot, Bihan Wen
Figure 1. The comparison between the previous SOTA methods (in blue) and our proposed method (in green).
Different from the previous work where the sampling strategy is hand-crafted or learned by a pre-defined
sampling loss, we learn the sampling and reconstruction process in a unified end-to-end manner. In
addition, the sampling of previous works is in the raw pixel space, which in fact still includes a large
amount of spatial redundancy and precision redundancy. Instead, we conduct sampling in the feature
space, and more compact metadata is obtained for pixels in the feature space via the adaptive
allocation. The saved metadata is annotated in the dashed box.
The paper presents a novel method to compress raw images using sRGB-guided context models. The proposed
framework learns to encode a compact representation of the image in the latent space. This allows for
allocation of bits to the most significant areas of the image. The method shows superior results compared
to other compression methods on both uncompressed sRGB and JPEG images.
Fabio Tosi, Alessio Tonioni, Daniele De Gregorio, Matteo Poggi
Fig. 1. Zero-Shot Generalization Results. On top, predictions by RAFT-Stereo trained with our approach
on user-collected images, without using any synthetic datasets, ground-truth depth or (even) real stereo
pairs. At the bottom, a zoom-in over the Backpack disparity map, showing an unprecedented level of
detail compared to existing strategies not using ground truth, trained with the same data.
The authors propose a framework for training stereo networks without explicit ground truth. Instead, NeRF
rendering is used to generate stereo data, and depth maps are used as proxy labels. The resulting models
outperform existing self-supervised methods on the Middlebury dataset, and approach or outperform
supervised models in zero-shot generalization.
Hyunyoung Jung, Zhuo Hui, Lei Luo, Haitao Yang, Feng Liu, Sungjoo Yoo, Rakesh Ranjan, Denis Demandolx
Figure 1. Predicting 50% downsized images on Sintel, we compare AnyFlow to RAFT. AnyFlow shows clearer
boundaries, accurate shapes, and better detection of small objects.
AnyFlow is a network that estimates optical flow from images of arbitrary resolutions. The approach uses a
continuous coordinate-based flow representation that can generate high-quality flows for small objects and
tiny details from low-resolution inputs.
Wenqi Xian, Aljaž Božič, Noah Snavely, Christoph Lassner
Figure 1. Method Overview. The optical stack leads to light ray distortion and vignetting. We show that
invertible residual networks are a powerful tool to model the distortion for projection and ray casting
across many lenses and in many scenarios. Additionally, we propose a novel type of calibration board
(top left) that can optionally be used to improve calibration accuracy. For evaluation, we propose the
‘SynLens’ dataset to evaluate lens models at scale.
NeuroLens is a neural lens model of distortion and vignetting that supports both point projection and ray
casting, and it can be optimized through either operation. It outperforms standard packages and recent
approaches while being much easier to use and extend.
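The paper's use of invertible residual networks can be pictured with a single residual step y = x + g(x): the forward map distorts 2D points for projection, and the inverse, recovered by fixed-point iteration, undoes the distortion for ray casting. The block below is a generic sketch in that spirit; the layer sizes, contraction factor, and use of spectral normalization are illustrative assumptions, not the paper's architecture:

```python
import torch
import torch.nn as nn

class InvertibleResidualBlock(nn.Module):
    """One invertible residual step y = x + g(x). Keeping g contractive
    (spectral normalisation plus a small scale) guarantees invertibility,
    so the same block can be used forward (projection) and backward (ray casting)."""
    def __init__(self, dim=2, hidden=32, scale=0.5):
        super().__init__()
        self.g = nn.Sequential(
            nn.utils.spectral_norm(nn.Linear(dim, hidden)), nn.Tanh(),
            nn.utils.spectral_norm(nn.Linear(hidden, dim)),
        )
        self.scale = scale                      # keep the residual contractive

    def forward(self, x):                       # distort: x -> y
        return x + self.scale * self.g(x)

    def inverse(self, y, iters=20):             # undistort: y -> x
        x = y.clone()
        for _ in range(iters):                  # fixed-point iteration x = y - scale * g(x)
            x = y - self.scale * self.g(x)
        return x
```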
Ziyan Wang, Giljoo Nam, Tuur Stuyck, Stephen Lombardi, Chen Cao, Jason Saragih, Michael Zollhoefer,
Jessica Hodgins, Christoph Lassner
Figure 2. Method Overview. Our method is comprised of two stages: state compression and dynamic
modeling. In the first stage, we train an autoencoder for hair and head appearance from multiview RGB
images using differentiable volumetric raymarching; at the same time we create an encoding space of hair
states. In the dynamic modeling stage, we sample temporally adjacent hair encodings to train a temporal
transfer module (T2M) that performs the transfer between the two, based on head motion and head-relative
gravity direction.
The neuWigs framework models hair independently of the head to achieve data-driven, realistic hair
animation. The first stage learns a low-dimensional latent space and appearance via an autoencoder, using
multi-view hair segmentation masks to better disentangle hair from head while learning appearance. The
second stage learns a dynamics model that transfers hair states over time based on the learned codes, and
reuses the 3D point-cloud autoencoder from stage one to denoise hair states.
Lin Huang, Chung-Ching Lin, Kevin Lin, Lin Liang, Lijuan Wang, Junsong Yuan, Zicheng Liu
Figure 1. We present Neural Voting Field (NVF), a 3D implicit function for camera-space 3D hand pose
estimation from an RGB image. Given an RGB input (a), for a 3D point in camera space, NVF can predict
its signed distance to hand surface and a set of 4D offset vectors (1D voting weight and 3D directional
vector to each joint). 3D pose is then obtained via dense 3D point-wise voting. Rendered predictions:
(b) Hand mesh generated by Marching Cubes from the predicted signed distances (c-h) Predicted 1D voting
weight from each predicted near-surface 3D point to a joint. (Brighter line means larger weight, darker
or no line means smaller weight) (i) Estimated 3D hand pose.
NVF is a unified method for camera-space 3D hand pose estimation that models dense local evidence and
global hand geometry via point-wise voting entirely in the 3D domain, which helps alleviate common
2D-to-3D ambiguities. It outperforms existing state-of-the-art algorithms for camera-space 3D hand pose
estimation on the FreiHAND dataset.
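To make the voting step concrete, here is a minimal sketch of how per-point votes could be aggregated into joint positions. It assumes the network has already predicted a 1D weight and a 3D offset per joint for each near-surface point; the weighted-average aggregation shown is our illustrative choice, not necessarily the paper's exact rule:

```python
import torch

def aggregate_joints(points, weights, offsets):
    """Dense point-wise voting for 3D joints (illustrative aggregation step).

    points : (N, 3)      near-surface 3D points in camera space.
    weights: (N, J)      predicted 1D voting weight of each point for each joint.
    offsets: (N, J, 3)   predicted 3D offset vector from each point to each joint.
    Returns (J, 3) joint positions as weight-normalised averages of the votes.
    """
    votes = points[:, None, :] + offsets              # (N, J, 3) each point's vote per joint
    w = weights.clamp(min=0)                          # keep weights non-negative
    w = w / (w.sum(dim=0, keepdim=True) + 1e-8)       # normalise over points
    joints = (w[..., None] * votes).sum(dim=0)        # (J, 3)
    return joints
```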
Xiaomeng Xu, Yanchao Yang, Kaichun Mo, Boxiao Pan, Li Yi, Leonidas Guibas
Figure 1. Left: a physical scene is composed of mutually-correlated entities, e.g., points belonging to
the same part or object. We pursue a subspace of the parameters of an implicit scene representation so
that when the scene is perturbed along the gradient of a specific point, a resonance emerges with other
points having high mutual information. For example, if the brightness of a point p on the wall changes,
the other wall points should also change coherently (while points on the ceiling should be unaffected).
Right: a comparison between responses to perturbations along the gradient of the highlighted red point
in the scene. The response from the original NeRF spreads all over the scene, while the localized
response from the proposed JacobiNeRF demonstrates coherence between points with similar semantics.
JacobiNeRF modifies a standard NeRF so that the Jacobians (parameter gradients) of semantically correlated
scene points are aligned during training, encoding correlations in appearance between scene elements. The
authors demonstrate that this leads to superior performance in instance and semantic segmentation tasks.
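The "resonance" probed in the figure can be sketched as follows: perturb the network weights a small step along the gradient of one point's output and record how the outputs at other points change. The snippet below is a generic illustration of that probe only (the model interface and step size are assumptions); the method itself aligns these Jacobians during training, as summarised above:

```python
import torch

def perturbation_response(model, query_point, probe_points, eps=1e-3):
    """Perturb weights along the gradient of one point's output and measure
    how other points respond.

    model        : any torch module mapping (N, 3) points to (N,) scalar outputs
                   (e.g. density or brightness) -- a stand-in for a NeRF branch.
    query_point  : (1, 3) point whose gradient direction we perturb along.
    probe_points : (M, 3) points whose change in output we record.
    """
    params = [p for p in model.parameters() if p.requires_grad]
    out = model(query_point).sum()
    grads = torch.autograd.grad(out, params)          # dF(query)/dtheta

    with torch.no_grad():
        before = model(probe_points).clone()
        for p, g in zip(params, grads):               # theta <- theta + eps * grad
            p.add_(eps * g)
        after = model(probe_points)
        for p, g in zip(params, grads):               # undo the perturbation
            p.sub_(eps * g)
    return (after - before) / eps                     # per-point response
```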
Henry Peters, Yunhao Ba, Achuta Kadambi
Figure 1. Our model reconstructs the training scene more accurately than other architectures. Our model
does not have the noise pattern present in reconstructions from SIREN, nor does it show tiling
artifacts as in ACORN’s prediction.
pCON is a new neural scene representation designed to accurately preserve polarimetric information while
parameterizing a scene. It removes artifacts found in current coordinate network architectures when
reconstructing three polarimetric quantities of interest.
Haotong Lin, Qianqian Wang, Ruojin Cai, Sida Peng, Hadar Averbuch-Elor, Xiaowei Zhou, Noah Snavely
Figure 1. Chronology reconstruction. Given timestamped Internet photos (a) of a landmark that has
changed significantly over the years (e.g., 5Pointz, NYC, the collective graffiti art project shown
above), our method can reconstruct a time-varying 3D model, and render photo-realistic images (b) with
independent control of viewpoint, time (c) and illumination (d). Photos by Flickr users Ryan Brown,
DaniGMX, DJ Leekee, Diff Graff, Lee Smith, James Prochnik, Verity Rollins Photo under CC BY.
The paper introduces a new temporal representation for the reconstruction of photo-realistic renderings of
large-scale landmarks using Internet photos. The space-time radiance field representation includes a
per-image illumination embedding and temporal step function encoding, allowing for independent control of
viewpoint, time, and illumination. The method is validated on a new dataset of four scenes exhibiting
various temporal changes.
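One way to picture the temporal step-function encoding is as a small bank of smoothed step functions with learnable transition instants, so appearance can change abruptly at a few moments in time and stay constant in between. The module below is a minimal sketch under that reading; the number of steps, the sigmoid smoothing, and the normalised-time convention are our assumptions:

```python
import torch
import torch.nn as nn

class StepTimeEncoding(nn.Module):
    """Minimal sketch of a temporal step-function encoding: a timestamp is mapped
    to a vector of soft step functions with learnable transition instants."""
    def __init__(self, num_steps=16, temperature=0.01):
        super().__init__()
        # Transition instants, initialised uniformly over normalised time [0, 1].
        self.shifts = nn.Parameter(torch.linspace(0.0, 1.0, num_steps))
        self.temperature = temperature

    def forward(self, t):
        # t: (N, 1) normalised timestamps -> (N, num_steps) soft step features.
        return torch.sigmoid((t - self.shifts) / self.temperature)
```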
Kushagra Tiwary, Akshat Dave, Nikhil Behari, Tzofi Klinghoffer, Ashok Veeraraghavan, Ramesh Raskar
Figure 1. Objects as radiance-field cameras. We convert everyday objects with unknown geometry (a) into
radiance-field cameras by modeling multi-view reflections (b) as projections of the 5D radiance field of
the environment. We convert the object surface into a virtual sensor to capture this radiance field (c),
which enables depth and radiance estimation of the surrounding environment. We can then query this
radiance field to perform beyond field-of-view novel view synthesis of the environment (d).
The authors propose a method that converts glossy objects with unknown geometry into radiance-field
cameras, enabling imaging beyond the camera's field of view and from seemingly impossible vantage points.
Recovering the environment radiance field also yields depth and radiance estimates of the surroundings as
seen from the object, in addition to beyond-field-of-view novel view synthesis. Training is end-to-end on
multi-view images of the objects.
Haim Sawdayee, Amir Vaxman, Amit H. Bermano
Figure 1. OReX reconstructs smooth 3D shapes (right) from input planar cross-sections (left). Our neural
field-based prior allows smooth interpolation between slices while respecting high-frequency features
and self-similarities.
OReX introduces a neural field interpolation prior, trained on input planes to reconstruct smooth 3D
shapes from cross-sections. The paper also includes a hierarchical input sampling scheme which resolves
high-frequency detail issues.
Thomas Tanay, Aleš Leonardis, Matteo Maggioni
Figure 1. Top: Our Multiplane Features Encoder-Renderer (MPFER) reimagines the MPI pipeline by moving
the multiplane representation to feature space. Bottom: MPFER significantly outperforms existing methods
in multiple challenging scenarios, including here, novel view synthesis from 8 highly degraded inputs.
MPFER is a novel view synthesis method that uses a learnable encoder-renderer pair to denoise multi-view
datasets with a 3D-based approach, outperforming 2D-based alternatives at lower compute cost. The method
extends the MPI framework by moving the multiplane representation to feature space and operates depthwise
and view-wise in an unsupervised manner.
Benran Hu, Junkai Huang, Yichen Liu, Yu-Wing Tai, Chi-Keung Tang
Figure 1. Region proposal results on a NeRF. Top 12 proposals in eight orientations with highest
confidence are visualized. The NeRF is trained from the Living Room scene from INRIA.
NeRF-RPN is a 3D object detection framework that directly operates on the NeRF model. It exploits a novel
voxel representation incorporating multi-scale 3D features to regress bounding boxes of objects without
the need to render the scene at any viewpoint. The framework can be trained end-to-end and does not
require class labels.
Xin Kong, Shikun Liu, Marwan Taher, Andrew J. Davison
Figure 1. vMAP automatically builds an object-level scene model from a real-time RGB-D input stream.
Each object is represented by a separate MLP neural field model, all optimised in parallel via vectorised
training. We use no 3D shape priors, but the MLP representation encourages object reconstruction to be
watertight and complete, even when objects are partially observed or are heavily occluded in the input
images. See for instance the separate reconstructions of the armchairs, sofas and cushions, which were
mutually occluding each other, in this example from Replica.
vMAP is a dense SLAM system that applies a small MLP to each object for efficient mapping without the need
for 3D priors. The system can dynamically add objects to its map and optimise up to 50 individual objects
in a single scene. The authors show significantly better reconstruction quality compared to prior neural
field SLAM systems.
Chung-Yi Weng, Pratul P. Srinivasan, Brian Curless, Ira Kemelmacher-Shlizerman
Figure 1. Given a photo collection of a subject (e.g., Roger Federer) (a), our method PersonNeRF builds
a space of the subject spanned by camera view, body pose, and appearance (b). PersonNeRF enables
traversing this space and exploring unobserved combinations of these attributes (c). Here we render
novel views (top row), various body poses (middle row), and different appearances (bottom row) by
traversing the corresponding axes. Among all of the renderings shown here, only the central image of each
row corresponds to a photo actually observed in the training data (corresponding input photo marked with
a red square). Photo credits to Getty Images.
PersonNeRF constructs a personalized neural volumetric 3D model of a subject from a collection of photos
with sparse observations. The model can render images of the subject from arbitrary pose and appearance
combinations, surpassing previous solutions for free-viewpoint human rendering.
Markus Worchel, Marc Alexa
Figure 3. Shadow mapping suffers from self-shadowing artifacts (“shadow acne”) that are alleviated by
biasing the depths towards the light source in the visibility test. Variance shadow mapping handles
these artifacts implicitly.
The authors introduce the idea of using pre-filtered shadow mapping to generate shadows in differentiable
rendering of triangle meshes. This technique is compared to differentiable light transport simulation and
shown to be orders of magnitude faster while still maintaining similar accuracy. They also demonstrate
that differentiable rasterization without shadows often fails to converge.
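For readers unfamiliar with the "shadow acne" issue mentioned in the caption, the classic shadow-map visibility test with a depth bias looks roughly like this. This is a plain, non-differentiable sketch; the paper's contribution is the pre-filtered, differentiable variant:

```python
import numpy as np

def shadow_visibility(frag_depth_light, shadow_map, uv, bias=5e-3):
    """Classic shadow-map test with a depth bias to suppress 'shadow acne'.

    frag_depth_light: depth of the shaded fragment in light space.
    shadow_map      : (H, W) depths rendered from the light's point of view.
    uv              : (u, v) in [0, 1]^2, fragment position in the shadow map.
    Returns 1.0 if lit, 0.0 if shadowed.
    """
    h, w = shadow_map.shape
    x = min(int(uv[0] * (w - 1)), w - 1)
    y = min(int(uv[1] * (h - 1)), h - 1)
    occluder_depth = shadow_map[y, x]
    # Biasing the fragment depth towards the light avoids false self-shadowing.
    return 1.0 if frag_depth_light - bias <= occluder_depth else 0.0
```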
Wenjie Chang, Yueyi Zhang, Zhiwei Xiong
Figure 1. Our work aims to implicitly represent a scene with a neural network through fewer panoramas
and thus obtain the relative depth measurements. The left diagram illustrates the efficiency and
accuracy of our method against NeRF and a traditional geometric-based Structure from Motion approach.
All methods adopt 3 panoramas with camera positions as inputs and are evaluated over 10 scenes from
Matterport3D. The right shows aerial views of point clouds that are transferred from depth estimations
for better visualization. Compared with state-of-the-art methods, measurements from ours have clearer
edges and more accurate geometry.
The authors propose a framework for more accurate and efficient depth estimation from indoor panoramic
images using neural radiance fields (NeRFs). Two networks, one for learning the Signed Distance Function
and another for learning the radiance field, are used. Additionally, a spherical position embedding scheme
is proposed along with a geometric consistency loss.
Jiahui Huang, Zan Gojcic, Matan Atzmon, Or Litany, Sanja Fidler, Francis Williams
Figure 1: We present Neural Kernel Surface Reconstruction (NKSR) for recovering a 3D surface from an
input point cloud. Trained directly from dense points, our method reaches state-of-the-art
reconstruction quality and scalability. NKSR is also highly generalizable: All the meshes in this figure
are reconstructed using a single trained model.
NKSR reconstructs a 3D implicit surface from large-scale, sparse, and noisy point clouds using Neural
Kernel Fields, whose compactly supported kernel functions enable memory efficiency. The method is designed
to be robust to noise, minimizes training requirements, and handles a wide variety of scenes effectively.
Obin Kwon, Jeongho Park, Songhwai Oh
Figure 2. Localization using Floc. A target observation can be localized by cross-correlation (Conv)
between m and mtrg. Before the cross-correlation, each RNR-Map is forwarded to the CNN Fk and Fq. After
Conv, FE takes the output of the Conv and outputs a heatmap E which highlights the most plausible target
area.
RNR-Map uses a grid-based latent code representation for visual navigation. The proposed localization and
navigation frameworks exploit the visual information encoded in the RNR-Map, achieving significant
improvements on a variety of benchmarks.
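The localization step in the caption boils down to a dense cross-correlation between the encoded map and the encoded target observation, read out as a heatmap. The stripped-down sketch below omits the CNNs F_k and F_q and the heatmap network F_E, and the tensor shapes are illustrative:

```python
import torch
import torch.nn.functional as F

def localize(map_latents, target_latents):
    """Localisation by sliding-window cross-correlation, in the spirit of Floc.

    map_latents   : (1, C, H, W)  latent map of the whole scene.
    target_latents: (1, C, h, w)  latent code of the target observation.
    Returns the (row, col) of the heatmap maximum and the heatmap itself.
    """
    # conv2d with the target as kernel == dense cross-correlation over the map.
    heatmap = F.conv2d(map_latents, target_latents, padding='same')  # (1, 1, H, W)
    flat_idx = heatmap.flatten().argmax()
    H, W = map_latents.shape[-2:]
    return divmod(int(flat_idx), W), heatmap
```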
Allan Zhou, Moo Jin Kim, Lirui Wang, Pete Florence, Chelsea Finn
Figure 1. SPARTN is an offline data augmentation method for behavior cloning eye-in-hand visual policies.
It simulates recovery in a demonstration by using NeRFs to render high-fidelity observations (right) from
noisy states, then generates corrective action labels.
SPARTN is a NeRF-based augmentation method for visual robot grasping policies. It injects noise into
demonstration states, uses NeRF to synthetically render the corresponding observations, and pairs them
with corrective action labels, all without additional expert supervision or environment interaction. The
approach improves success rates for robot grasping tasks in both simulation and real experiments.
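A toy version of the augmentation loop: perturb a demonstrated end-effector pose, render the view from the perturbed pose with a per-demonstration NeRF, and label it with an action that first undoes the perturbation. The helper below is our sketch of that idea; the transform conventions and the render_fn interface are hypothetical:

```python
import numpy as np

def make_corrective_example(T_world_ee, action_T, render_fn, noise_T):
    """Build one augmented (observation, action) pair in the spirit of SPARTN.

    T_world_ee: (4, 4) demonstrated end-effector pose.
    action_T  : (4, 4) demonstrated relative action (next pose in the ee frame).
    render_fn : callable(pose) -> image; here, a per-demo NeRF renderer
                (hypothetical interface).
    noise_T   : (4, 4) small random perturbation applied in the ee frame.
    """
    T_noisy = T_world_ee @ noise_T                  # perturbed ("off-policy") state
    obs = render_fn(T_noisy)                        # NeRF-rendered view from it
    # Corrective action: first undo the perturbation, then take the demo action.
    corrective_action = np.linalg.inv(noise_T) @ action_T
    return obs, corrective_action
```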
Alessandro Ruzzi, Xiangwei Shi, Xi Wang, Gengyan Li, Shalini De Mello, Hyung Jin Chang, Xucong Zhang,
Otmar Hilliges
Figure 1. GazeNeRF consists of a NeRF-based two-stream MLP structure conditioned on a target gaze label
to generate photo-realistic face images. A 3D rotation transformation R is applied to the eye stream of
GazeNeRF.
GazeNeRF proposes a two-stream architecture that predicts volumetric features separately for the face and
eye regions using conditional image-based neural radiance fields. These features are used to redirect gaze
by rigidly transforming the eye features with a 3D rotation matrix and composing them with the original
face features. The approach outperforms previous state-of-the-art 2D gaze redirection methods in terms of
accuracy and identity preservation.
Bowen Wen, Jonathan Tremblay, Valts Blukis, Stephen Tyree, Thomas Muller, Alex Evans, Dieter Fox, Jan
Kautz, Stan Birchfield
Figure 2. Framework overview. First, features are matched between consecutive segmented images, to
obtain a coarse pose estimate (Sec. 3.1). Some of these posed frames are stored in a memory pool, to be
used and refined later (Sec. 3.2). A pose graph is dynamically created from a subset of the memory pool
(Sec. 3.3); online optimization refines all the poses in the graph jointly with the current pose. These
updated poses are then stored back in the memory pool. Finally, all the posed frames in the memory pool
are used to learn a Neural Object Field (in a separate thread) that models both geometry and visual
texture (Sec. 3.4) of the object, while adjusting their previously estimated poses. Together with depth,
the identified correspondences are filtered by a RANSAC-based pose estimator. The pose hypothesis that
maximizes the number of inliers is then selected as the current frame’s coarse pose estimation ξ̃_t.
The paper presents a method for tracking unknown 6-DoF objects from a monocular RGBD video sequence while
simultaneously performing neural 3D reconstruction of the object. The method learns a Neural Object Field,
which is used in parallel with a pose graph optimization process to accumulate a robust 3D representation
that captures both geometry and appearance. The proposed method works for arbitrary rigid objects, even
when visual texture is largely absent.
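The coarse pose step described in the figure caption is, generically, RANSAC over back-projected feature correspondences, keeping the hypothesis with the most inliers. Below is a self-contained sketch using Kabsch alignment on minimal 3-point samples; the thresholds and iteration counts are arbitrary choices, not the paper's:

```python
import numpy as np

def ransac_rigid_pose(P, Q, iters=200, thresh=0.01, rng=np.random.default_rng(0)):
    """Coarse 6-DoF pose from 3D-3D correspondences by RANSAC + Kabsch.
    P, Q: (N, 3) matched, back-projected points in the previous/current frame.
    Returns the 4x4 hypothesis with the largest inlier set (a generic sketch,
    not the paper's exact estimator)."""
    def kabsch(A, B):
        ca, cb = A.mean(0), B.mean(0)
        U, _, Vt = np.linalg.svd((A - ca).T @ (B - cb))
        R = Vt.T @ U.T
        if np.linalg.det(R) < 0:                     # guard against reflections
            Vt[-1] *= -1
            R = Vt.T @ U.T
        T = np.eye(4); T[:3, :3] = R; T[:3, 3] = cb - R @ ca
        return T

    best_T, best_inliers = np.eye(4), 0
    for _ in range(iters):
        idx = rng.choice(len(P), size=3, replace=False)
        T = kabsch(P[idx], Q[idx])
        pred = (T[:3, :3] @ P.T).T + T[:3, 3]
        inliers = int((np.linalg.norm(pred - Q, axis=1) < thresh).sum())
        if inliers > best_inliers:
            best_T, best_inliers = T, inliers
    return best_T
```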
Changan Chen, Alexander Richard, Roman Shapovalov, Vamsi Krishna Ithapu, Natalia Neverova, Kristen
Grauman, Andrea Vedaldi
Figure 4. Qualitative examples. For all binaural audio, we show the left-channel and the right-channel
waveforms side-by-side. Row 1: SoundSpaces-NVAS example where given the source viewpoint and input
audio, the model synthesizes audio for three different target viewpoints (target views are for reference
only). In this case, the active speaker is the male speaker as indicated by the bounding box. For target
viewpoint 1, the view rotates about 90 degrees and the male speaker is on the left side and the
predicted left channel is louder than the right channel. Viewpoint 2 moves away from the speaker and
thus yields lower amplitude compared to the first prediction. For target viewpoint 3, it is completely
located outside of the living room, in which case, the sound could only come from the door open on the
right (louder right channel) and the reverberation also greatly increases due to the vanishing direct
sound. Row 2: Replay-NVAS example where the speaker is located on the left in the source viewpoint and
ends up on the right side, farther from the camera, in target viewpoint 2; the model accordingly predicts
lower amplitude and a louder right channel. On the right side, we show an example of the audio-visual speech
enhancement for the active speaker. The model enhances the speech to largely match with the near-range
audio (target).
ViGAS is a neural rendering approach that synthesizes sound at novel viewpoints from input audio-visual
cues, learning to reason about spatial cues from two large-scale audio-visual datasets. It is the first
approach to tackle the novel-view acoustic synthesis task.
Yunfan Ye, Renjiao Yi, Zhirui Gao, Chenyang Zhu, Zhiping Cai, Kai Xu
Figure 1. We leverage 2D edge detection to directly acquire 3D edge points by learning a neural implicit
field and further reconstructing 3D parametric curves that represent the geometrical shape of the object.
We introduce the details of extracting 3D edges from the proposed Neural Edge Field in Sec. 3.1, and the
coarse-to-fine optimization strategy to reconstruct parametric curves in Sec. 3.2. The whole pipeline is
self-supervised with only 2D supervision.
NEF learns a neural implicit field representing the density distribution of 3D edges, supervised with a
view-based rendering loss. Parametric curves are then extracted from the optimized NEF using an iterative
coarse-to-fine method, and NEF outperforms existing methods on a benchmark of synthetic data.
Zehan Zheng, Danni Wu, Ruisi Lu, Fan Lu, Guang Chen, Changjun Jiang
Figure 1. Common cases of nonlinear motions in autonomous driving scenarios. Spatially uniform linear
interpolation using the middle two frames of the point cloud differs significantly from the actual
situation, so it is necessary to take multiple point clouds into consideration for nonlinear
interpolation.
NeuralPCI is a 4D spatio-temporal neural field that interpolates 3D point clouds by integrating
multi-frame information to handle large nonlinear motions effectively. It achieves state-of-the-art
performance on both NL-Drive and DHB datasets and is applicable to other domains, including extrapolation,
morphing, and auto-labeling. Code is available on Github.
Rohith Agaram, Shaurya Dewan, Rahul Sajnani, Adrien Poulenard, Madhava Krishna, Srinath Sridhar
Figure 1. We present Canonical Field Network (CaFi-Net), a self-supervised method for 3D position and
orientation (pose) canonicalization of objects represented as neural fields. We specifically focus on
neural radiance fields (NeRFs) fitted to raw RGB images of arbitrarily posed objects (left), and obtain
a canonical field (fixed novel view shown on right) where all objects in a category are consistently
positioned and oriented.
CaFi-Net canonicalizes the pose of 3D objects by directly operating on their radiance fields with a
Siamese network architecture that extracts equivariant field features for canonicalization. This allows
consistent category-level canonicalization without instance-wise data labeling.
Fangfu Liu, Chubin Zhang, Yu Zheng, Yueqi Duan
Figure 1. Top: Comparisons between Semantic-NeRF and our method Semantic-Ray. Semantic-NeRF (S-NeRF for
short) needs to train one specific model for each scene, while our Semantic-Ray (S-Ray for short) trains
one unified model on multiple scenes and generalizes to unseen scenes. Bottom: Experimental comparisons
between S-Ray and S-NeRF on generalization ability. We observe that our network S-Ray generalizes quickly
and effectively across diverse unseen scenes, while S-NeRF fails in a new scene. Moreover, our result can
be improved by fine-tuning on more images for only 10 min (2k iterations), which achieves quality
comparable to Semantic-NeRF's result after 100k iterations of per-scene optimization.
S-Ray proposes a cross-reprojection attention network to learn a semantic radiance field from multi-view
reprojections, which reduces computational cost compared to dense attention. Furthermore, S-Ray
demonstrates strong generalization when trained on multiple scenes and tested on unseen ones.
Viktor Rudnev, Mohamed Elgharib, Christian Theobalt, Vladislav Golyanik
Figure 1. The original NeRF approach (first row) produces strong artefacts when processing RGB images
undergoing fast motion. Our approach (second row) operates purely on an event stream and thus produces
significantly better and sharper results, capturing important details such as specularities (see yellow
box).
The authors propose a method for 3D novel view synthesis from a single event camera input. They introduce
a neural radiance field trained in a self-supervised way from events using a tailored ray sampling
strategy. The proposed method produces higher quality and visually denser results in challenging real and
synthetic scenes.
Yawar Siddiqui, Lorenzo Porzi, Samuel Rota Buló, Norman Müller, Matthias Nießner, Angela Dai, Peter
Kontschieder
Figure 1. Given only RGB images of an in-the-wild scene as input, our method optimizes a panoptic
radiance field which can be queried for color, depth, semantics, and instances for any point in space. We
obtain poses for input images with COLMAP, and 2D panoptic segmentation masks using a pretrained
off-the-shelf network. During training, our method lifts these 2D segmentation masks, which are often
noisy and view-inconsistent, into a consistent 3D panoptic radiance field. Once trained, our model is
able to render images and their corresponding panoptic segmentation masks from both existing and novel
viewpoints.
Panoptic Lifting proposes an approach to obtain 3D panoptic volumetric representations without direct 3D
input, using only machine-generated 2D panoptic segmentation masks. A neural field representation
generates the 3D panoptic representation, and inconsistencies between 2D instance IDs across views are
resolved by solving a linear assignment problem. The method is validated on the Hypersim, Replica, and
ScanNet datasets with improved performance over the state of the art.
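The linear-assignment step can be illustrated as matching the 2D masks of a view to the global 3D instance slots by maximising mask overlap. The sketch below uses IoU as the matching score, which is our simplification of the paper's assignment objective:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_instances(rendered_masks, predicted_masks, eps=1e-6):
    """Resolve view-inconsistent 2D instance IDs by linear assignment.

    rendered_masks : (K, H, W) boolean masks of the K global 3D instance slots,
                     rendered into the current view.
    predicted_masks: (M, H, W) boolean masks from the 2D panoptic network.
    Returns a dict mapping each matched 2D mask index to a global instance slot,
    chosen to maximise total IoU.
    """
    K, M = len(rendered_masks), len(predicted_masks)
    iou = np.zeros((M, K))
    for m in range(M):
        for k in range(K):
            inter = np.logical_and(predicted_masks[m], rendered_masks[k]).sum()
            union = np.logical_or(predicted_masks[m], rendered_masks[k]).sum()
            iou[m, k] = inter / (union + eps)
    rows, cols = linear_sum_assignment(-iou)          # negate to maximise IoU
    return {int(m): int(k) for m, k in zip(rows, cols)}
```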
Ben Agro, Quinlan Sykora, Sergio Casas, Raquel Urtasun
Figure 1. Left: Explicit approaches predict whole-scene occupancy and flow on a spatio-temporal grid.
Right: Our implicit approach only predicts occupancy and flow at queried continuous points, focusing on
what matters for downstream planning.
ImplicitO is a unified approach to perception and future prediction for self-driving vehicles. It
represents occupancy and flow with a single neural network that the motion planner can query directly at
continuous spatio-temporal points, avoiding unnecessary computation. The architecture includes an
efficient global attention mechanism that overcomes an otherwise limited receptive field. Results show
that ImplicitO outperforms the state of the art in both urban and highway settings.
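The contrast drawn in the figure, decoding a dense spatio-temporal grid versus querying only the points the planner needs, can be sketched as a small implicit head that takes continuous (x, y, t) queries plus scene context and returns an occupancy logit and a flow vector. All sizes and the pooled-context interface below are placeholders, not the paper's architecture:

```python
import torch
import torch.nn as nn

class OccupancyFlowField(nn.Module):
    """Toy implicit occupancy-and-flow head: evaluated only at the continuous
    (x, y, t) points a planner cares about, instead of a dense grid."""
    def __init__(self, feat_dim=128, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),          # 1 occupancy logit + 2D flow
        )

    def forward(self, scene_feat, queries):
        # scene_feat: (B, feat_dim) pooled scene context for each query
        #             (in practice gathered from a BEV feature map).
        # queries   : (B, 3) continuous (x, y, t) query points.
        out = self.mlp(torch.cat([scene_feat, queries], dim=-1))
        occupancy_logit, flow = out[..., :1], out[..., 1:]
        return occupancy_logit, flow
```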