Attention Beats Concatenation for Conditioning Neural Fields

Authors: Daniel Rebain, Mark J. Matthews, Kwang Moo Yi, Gopal Sharma, Dmitry Lagun, Andrea Tagliasacchi

TMLR 2023

Reproducibility Variable Result LLM Response
Research Type Experimental As we show in our experiments, high-dimensional conditioning is key to modelling complex data distributions, thus it is important to determine what architecture choices best enable this when working on such problems. To this end, we run experiments modelling 2D, 3D, and 4D signals with neural fields, employing concatenation, hyper-network, and attention-based conditioning strategies, a necessary but laborious effort that has not been performed in the literature.
Researcher Affiliation Collaboration Daniel Rebain (University of British Columbia, Google Research); Mark J. Matthews (Google Research); Kwang Moo Yi (University of British Columbia); Gopal Sharma (University of British Columbia); Dmitry Lagun (Google Research); Andrea Tagliasacchi (Google Research, Simon Fraser University)
Pseudocode No The paper describes methods and procedures in paragraph text and references figures, but does not contain a clearly labeled pseudocode or algorithm block.
Open Source Code Yes Secondly, we note the large amount of energy required to run these experiments. We undertook this investigation as a service to the community so that others don't have to, and make our source code available for verifiability.
Open Datasets Yes We test on two datasets, respectively discussed in Section 4.1.1 and Section 4.1.2, with network implementation details detailed in Section A.3. 4.1.1 Tiled MNIST (Figure 5 and Table 1): We design a dataset with controllable complexity to demonstrate how much the performance of an architecture can be affected by the size of the latent code and dimensionality of the data manifold. Loosely inspired by Lin et al. (2018), it consists of images formed by a 16×16 grid of MNIST digits, where each digit is down-scaled to 16×16 pixels, for a total image resolution of 256×256. The digits are chosen randomly from the 60,000 images of the MNIST dataset, creating up to 60,000^(16×16) unique possible combinations. In addition to the Tiled MNIST dataset, we also experiment with the CelebA-HQ dataset introduced by Karras et al. (2018). HUMBI (Yu et al., 2020) is a large multiview dataset of 772 human subjects across a variety of demographics, captured with 107 synchronized HD cameras. SRN Cars and Chairs (Sitzmann et al., 2019): we use the rendered dataset of cars and chairs from ShapeNet (Chang et al., 2015).
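The tiled-image construction quoted above (a 16×16 grid of 16×16-pixel digits yielding a 256×256 image) can be sketched in a few lines. The helper below is a hypothetical illustration, not the paper's code; random arrays stand in for actual down-scaled MNIST digits:

```python
import numpy as np

def tile_digits(digits, grid=16, digit_size=16):
    """Assemble a (grid x grid) array of digit images into one tiled image.

    `digits` has shape (grid*grid, digit_size, digit_size), e.g. MNIST
    digits already down-scaled to 16x16.
    """
    rows = [np.concatenate(digits[r * grid:(r + 1) * grid], axis=1)
            for r in range(grid)]
    # Resulting shape: (grid*digit_size, grid*digit_size), i.e. 256x256.
    return np.concatenate(rows, axis=0)

rng = np.random.default_rng(0)
fake_digits = rng.random((16 * 16, 16, 16)).astype(np.float32)
image = tile_digits(fake_digits)
```

Since each of the 256 grid cells is drawn independently from 60,000 digits, the number of distinct images is 60,000^256, which is what makes the dataset's complexity controllable via the grid size.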
Dataset Splits Yes For the remaining experiments, we focus on the more challenging real-world task of novel view synthesis, one of the main application domains of neural fields. Given one or more images of an object or a scene, this is the task of generating images from novel viewpoints. We experiment with two different neural field-based approaches to novel view synthesis: neural radiance fields (Mildenhall et al., 2020), and light field networks (Sitzmann et al., 2021). Both are analyzed using the following datasets, where we randomly select 10% of views to be held-out from training and used for testing.
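The random 10% view hold-out described in the quote can be illustrated with a minimal sketch; the function name and seed handling here are assumptions for illustration, not taken from the paper's released code:

```python
import numpy as np

def split_views(num_views, test_fraction=0.1, seed=0):
    """Randomly hold out a fraction of view indices for testing."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(num_views)          # shuffle all view indices
    n_test = max(1, int(round(num_views * test_fraction)))
    test_ids = np.sort(perm[:n_test])          # held-out views
    train_ids = np.sort(perm[n_test:])         # remaining training views
    return train_ids, test_ids

train_ids, test_ids = split_views(100)
```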
Hardware Specification No Performing the experiments reported in this paper required a very large amount of compute time: on the order of 23k GPU-hours. Due to the significant expense involved, we chose experimental parameters carefully and focused on architectural choices which appeared most likely to affect the outcome of the experiments, and therefore the conclusions of our analysis. (Note: Only mentions 'GPU-hours', not specific GPU models or other hardware details.)
Software Dependencies No The paper does not explicitly state any specific software dependencies with version numbers (e.g., Python 3.8, PyTorch 1.9, CUDA 11.1).
Experiment Setup Yes For volume rendering, we use the hierarchical sampling approach described by Mildenhall et al. (2020), but without allocating separate coarse and fine networks, instead sampling both coarse and fine values from the same network; specifically, we use 64 fine/importance samples and 128 coarse/uniform samples for each ray, which we found to be the minimal counts required to avoid noticeable artifacts with our data. Training: all novel view synthesis methods are supervised using the pixel-wise reconstruction loss in (2) applied to the training images and rendered pixel values for training views. For all datasets and architectures, training batches consist of 64 instances, with 2 views per instance and 64 pixels sampled per image. For training auto-encoders, we use a batch size of 128 images with 512 pixels per image. For training auto-decoders, we use a batch size of 128 images with 64 pixels per image. All MLPs are ReLU-activated and use the original layer normalization strategy of the method each architecture is based on: concatenation, none (Rebain et al., 2022); hyper-networks, at each layer (Sitzmann et al., 2019); attention, after skip connections (Sajjadi et al., 2022). The concatenation and hyper-network models both consist of 8-layer MLPs in all cases, while the attention models use 5 attention stages with three dense layers after each. All multi-head attention layers use 16 heads and 256-dimensional keys unless otherwise specified.
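As a rough illustration of the three conditioning strategies the setup compares, here is a minimal NumPy sketch of how each one injects a latent code into a field evaluated at a coordinate. All dimensions, weight initializations, and names are illustrative assumptions; the paper's actual models are much deeper (8-layer MLPs, 5 attention stages, 16 heads):

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_z, d_h = 3, 32, 64  # coordinate, latent, and hidden dims (illustrative)

def relu(a):
    return np.maximum(a, 0.0)

# (1) Concatenation: the latent code is appended to the input coordinate.
W1 = rng.normal(size=(d_x + d_z, d_h))
W2 = rng.normal(size=(d_h, 1))
def concat_field(x, z):
    return relu(np.concatenate([x, z]) @ W1) @ W2

# (2) Hyper-network: a network maps z to the field MLP's own weights.
Wh = rng.normal(size=(d_z, (d_x + 1) * d_h)) * 0.01
W2h = rng.normal(size=(d_h, 1))
def hyper_field(x, z):
    params = z @ Wh
    W = params[:d_x * d_h].reshape(d_x, d_h)   # generated weight matrix
    b = params[d_x * d_h:]                     # generated bias
    return relu(x @ W + b) @ W2h

# (3) Attention: the coordinate forms a query over a set of latent tokens,
# so the latent capacity scales with the number of tokens.
n_tok, d_k = 8, 16
Wq = rng.normal(size=(d_x, d_k))
Wk = rng.normal(size=(d_z, d_k))
Wv = rng.normal(size=(d_z, d_h))
Wo = rng.normal(size=(d_h, 1))
def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()
def attn_field(x, z_tokens):  # z_tokens: (n_tok, d_z)
    q = x @ Wq
    weights = softmax((z_tokens @ Wk) @ q / np.sqrt(d_k))
    return relu(weights @ (z_tokens @ Wv)) @ Wo

x = rng.normal(size=d_x)
z = rng.normal(size=d_z)
z_tokens = rng.normal(size=(n_tok, d_z))
outputs = [concat_field(x, z), hyper_field(x, z), attn_field(x, z_tokens)]
```

The key structural difference the paper studies is visible even at this scale: concatenation mixes the latent into a fixed-width first layer, a hyper-network spends the latent on generating weights, and attention lets the field query a variable-size set of latent tokens per coordinate.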