Attention Beats Concatenation for Conditioning Neural Fields

Authors: Daniel Rebain, Mark J. Matthews, Kwang Moo Yi, Gopal Sharma, Dmitry Lagun, Andrea Tagliasacchi

TMLR 2023

Reproducibility Variable Result LLM Response
Research Type Experimental As we show in our experiments, high-dimensional conditioning is key to modelling complex data distributions, thus it is important to determine what architecture choices best enable this when working on such problems. To this end, we run experiments modelling 2D, 3D, and 4D signals with neural fields, employing concatenation, hyper-network, and attention-based conditioning strategies, a necessary but laborious effort that has not been performed in the literature.
Researcher Affiliation Collaboration Daniel Rebain (University of British Columbia, Google Research); Mark J. Matthews (Google Research); Kwang Moo Yi (University of British Columbia); Gopal Sharma (University of British Columbia); Dmitry Lagun (Google Research); Andrea Tagliasacchi (Google Research, Simon Fraser University)
Pseudocode No The paper describes methods and procedures in paragraph text and references figures, but does not contain a clearly labeled pseudocode or algorithm block.
Open Source Code Yes Secondly, we note the large amount of energy required to run these experiments. We undertook this investigation as a service to the community so that others don't have to, and make our source code available for verifiability.
Open Datasets Yes We test on two datasets, respectively discussed in Section 4.1.1 and Section 4.1.2, with network implementation details detailed in Section A.3. 4.1.1 Tiled MNIST (Figure 5 and Table 1): We design a dataset with controllable complexity to demonstrate how much the performance of an architecture can be affected by the size of the latent code and dimensionality of the data manifold. Loosely inspired by Lin et al. (2018), it consists of images formed by a 16×16 grid of MNIST digits, where each digit is down-scaled to 16×16 pixels, for a total image resolution of 256×256. The digits are chosen randomly from the 60,000 images of the MNIST dataset, creating up to 60,000^(16×16) unique possible combinations. In addition to the Tiled MNIST dataset, we also experiment with the CelebA-HQ dataset introduced by Karras et al. (2018). HUMBI (Yu et al., 2020) is a large multiview dataset of 772 human subjects across a variety of demographics, captured with 107 synchronized HD cameras. SRN Cars and Chairs (Sitzmann et al., 2019): we use the rendered dataset of cars and chairs from ShapeNet (Chang et al., 2015).
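The tiled-image construction quoted above (a 16×16 grid of 16×16-pixel digits yielding a 256×256 image) can be sketched in a few lines. The helper below is a hypothetical illustration, not the paper's code; random arrays stand in for actual down-scaled MNIST digits:

```python
import numpy as np

def tile_digits(digits, grid=16, digit_size=16):
    """Assemble a (grid x grid) array of digit images into one tiled image.

    `digits` has shape (grid*grid, digit_size, digit_size), e.g. MNIST
    digits already down-scaled to 16x16.
    """
    rows = [np.concatenate(digits[r * grid:(r + 1) * grid], axis=1)
            for r in range(grid)]
    # Resulting shape: (grid*digit_size, grid*digit_size), i.e. 256x256.
    return np.concatenate(rows, axis=0)

rng = np.random.default_rng(0)
fake_digits = rng.random((16 * 16, 16, 16)).astype(np.float32)
image = tile_digits(fake_digits)
```

Since each of the 256 grid cells is drawn independently from 60,000 digits, the number of distinct images is 60,000^256, which is what makes the dataset's complexity controllable via the grid size.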
Dataset Splits Yes For the remaining experiments, we focus on the more challenging real-world task of novel view synthesis, one of the main application domains of neural fields. Given one or more images of an object or a scene, this is the task of generating images from novel viewpoints. We experiment with two different neural field-based approaches to novel view synthesis: neural radiance fields (Mildenhall et al., 2020), and light field networks (Sitzmann et al., 2021). Both are analyzed using the following datasets, where we randomly select 10% of views to be held-out from training and used for testing.
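The random 10% view hold-out described in the quote can be illustrated with a minimal sketch; the function name and seed handling here are assumptions for illustration, not taken from the paper's released code:

```python
import numpy as np

def split_views(num_views, test_fraction=0.1, seed=0):
    """Randomly hold out a fraction of view indices for testing."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(num_views)          # shuffle all view indices
    n_test = max(1, int(round(num_views * test_fraction)))
    test_ids = np.sort(perm[:n_test])          # held-out views
    train_ids = np.sort(perm[n_test:])         # remaining training views
    return train_ids, test_ids

train_ids, test_ids = split_views(100)
```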
Hardware Specification No Performing the experiments reported in this paper required a very large amount of compute time: on the order of 23k GPU-hours. Due to the significant expense involved, we chose experimental parameters carefully and focused on architectural choices which appeared most likely to affect the outcome of the experiments, and therefore the conclusions of our analysis. (Note: Only mentions 'GPU-hours', not specific GPU models or other hardware details.)
Software Dependencies No The paper does not explicitly state any specific software dependencies with version numbers (e.g., Python 3.8, PyTorch 1.9, CUDA 11.1).
Experiment Setup Yes For volume rendering, we use the hierarchical sampling approach described by Mildenhall et al. (2020), but without allocating separate coarse and fine networks, instead sampling both coarse and fine values from the same network; specifically, we use 64 fine/importance samples and 128 coarse/uniform samples for each ray, which we found to be the minimal counts required to avoid noticeable artifacts with our data. Training: all novel view synthesis methods are supervised using the pixel-wise reconstruction loss in (2) applied to the training images and rendered pixel values for training views. For all datasets and architectures, training batches consist of 64 instances, with 2 views per instance and 64 pixels sampled per image. For training auto-encoders, we use a batch size of 128 images with 512 pixels per image. For training auto-decoders, we use a batch size of 128 images with 64 pixels per image. All MLPs are ReLU-activated and use the original layer normalization strategy of the method each architecture is based on: concatenation, none (Rebain et al., 2022); hyper-networks, at each layer (Sitzmann et al., 2019); attention, after skip connections (Sajjadi et al., 2022). The concatenation and hyper-network models both consist of 8-layer MLPs in all cases, while the attention models use 5 attention stages with three dense layers after each. All multi-head attention layers use 16 heads and 256-dimensional keys unless otherwise specified.
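As a rough illustration of the three conditioning strategies the setup compares, here is a minimal NumPy sketch of how each one injects a latent code into a field evaluated at a coordinate. All dimensions, weight initializations, and names are illustrative assumptions; the paper's actual models are much deeper (8-layer MLPs, 5 attention stages, 16 heads):

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_z, d_h = 3, 32, 64  # coordinate, latent, and hidden dims (illustrative)

def relu(a):
    return np.maximum(a, 0.0)

# (1) Concatenation: the latent code is appended to the input coordinate.
W1 = rng.normal(size=(d_x + d_z, d_h))
W2 = rng.normal(size=(d_h, 1))
def concat_field(x, z):
    return relu(np.concatenate([x, z]) @ W1) @ W2

# (2) Hyper-network: a network maps z to the field MLP's own weights.
Wh = rng.normal(size=(d_z, (d_x + 1) * d_h)) * 0.01
W2h = rng.normal(size=(d_h, 1))
def hyper_field(x, z):
    params = z @ Wh
    W = params[:d_x * d_h].reshape(d_x, d_h)   # generated weight matrix
    b = params[d_x * d_h:]                     # generated bias
    return relu(x @ W + b) @ W2h

# (3) Attention: the coordinate forms a query over a set of latent tokens,
# so the latent capacity scales with the number of tokens.
n_tok, d_k = 8, 16
Wq = rng.normal(size=(d_x, d_k))
Wk = rng.normal(size=(d_z, d_k))
Wv = rng.normal(size=(d_z, d_h))
Wo = rng.normal(size=(d_h, 1))
def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()
def attn_field(x, z_tokens):  # z_tokens: (n_tok, d_z)
    q = x @ Wq
    weights = softmax((z_tokens @ Wk) @ q / np.sqrt(d_k))
    return relu(weights @ (z_tokens @ Wv)) @ Wo

x = rng.normal(size=d_x)
z = rng.normal(size=d_z)
z_tokens = rng.normal(size=(n_tok, d_z))
outputs = [concat_field(x, z), hyper_field(x, z), attn_field(x, z_tokens)]
```

The key structural difference the paper studies is visible even at this scale: concatenation mixes the latent into a fixed-width first layer, a hyper-network spends the latent on generating weights, and attention lets the field query a variable-size set of latent tokens per coordinate.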