Neural Slot Interpreters: Grounding Object Semantics in Emergent Slot Representations

Authors: Bhishma Dedhia, Niraj Jha

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments with a bi-modal object-property and scene retrieval task demonstrate the grounding efficacy and interpretability of correspondences learned by NSI. From a scene representation standpoint, we find that emergent NSI slots that move beyond the image grid by binding to spatial objects facilitate improved visual grounding compared to conventional bounding-box-based approaches. From a data efficiency standpoint, we empirically validate that NSI learns more generalizable representations from a fixed amount of annotation data than the traditional approach.
Researcher Affiliation | Academia | Bhishma Dedhia and Niraj K. Jha, Department of Electrical and Computer Engineering, Princeton University
Pseudocode | Yes | Algorithm 1: Neural Slot Interpreter Contrastive Learning Pseudocode
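The paper's Algorithm 1 is not reproduced here, but the general shape of contrastive alignment between two modalities (e.g., pooled slot embeddings and pooled program embeddings) can be sketched as a symmetric InfoNCE loss. All shapes, variable names, and the pooling scheme below are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical batch: B scenes, each summarized by a D-dimensional
# pooled slot embedding and a pooled program embedding.
B, D = 4, 8
slot_emb = rng.normal(size=(B, D))
prog_emb = rng.normal(size=(B, D))

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE: matched (slot, program) pairs lie on the diagonal."""
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature  # (B, B) cosine-similarity matrix
    labels = np.arange(len(a))
    # slots -> programs: cross-entropy against the diagonal
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    loss_a2b = -log_probs[labels, labels].mean()
    # programs -> slots: the symmetric term
    log_probs_t = logits.T - np.log(np.exp(logits.T).sum(axis=1, keepdims=True))
    loss_b2a = -log_probs_t[labels, labels].mean()
    return (loss_a2b + loss_b2a) / 2

loss = info_nce(slot_emb, prog_emb)
print(loss > 0)  # True: the cross-entropy loss is nonnegative
```

In practice such a loss would be minimized by gradient descent over the encoders producing `slot_emb` and `prog_emb`; this sketch only evaluates the objective once on random inputs.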
Open Source Code | No | The paper does not contain an explicit statement about releasing their source code, nor does it provide a direct link to a code repository.
Open Datasets | Yes | Our experiments encompass different tasks on scenes ranging from synthetic renderings to in-the-wild scenes, viz. (1) CLEVr Hans (Stammer et al., 2020): objects scattered on a plane, (2) CLEVr Tex (Karazija et al., 2021): textured objects placed on textured backgrounds, (3) MOVi-C (Greff et al., 2022): photorealistic objects on real-world surfaces, and (4) MS-COCO 2017 (Lin et al., 2015): a large-scale object detection dataset containing real-world images.
Dataset Splits | Yes | The dataset splits used in this work are detailed in Table 3 ("Dataset splits used in experiments"):
- CLEVr Hans 3: 9,000 train / 2,250 validation / 2,250 test
- CLEVr Hans 7: 21,000 train / 5,250 validation / 5,250 test
- CLEVr Tex: 37,500 train / 2,500 validation / 10,000 test
- MOVi-C: 198,635 train / 35,053 validation / 6,000 test
- MS COCO 2017: 99,676 train / 17,590 validation / 4,952 test
Hardware Specification | Yes | We list the hyperparameters for NSI and other methods used in our experiments, which were all performed on Nvidia A100 GPUs.
Software Dependencies | No | The paper mentions several software components like DINO ViT, PyTorch, MLP, Transformer, Gated Recurrent Unit, Slot Attention, and the Hungarian Algorithm, but it does not specify version numbers for any of these.
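The Hungarian Algorithm mentioned in this row is commonly used in slot-based models to match predicted slots to annotated objects one-to-one at minimum cost. A minimal sketch of that matching problem, using exhaustive search over permutations in place of the Hungarian algorithm (the 3x3 cost matrix is invented for illustration):

```python
from itertools import permutations

# Hypothetical cost matrix: rows are predicted slots, columns are
# annotated objects; cost[i][j] is the mismatch between slot i and
# object j (e.g., negative similarity). Lower cost = better match.
cost = [
    [0.1, 0.9, 0.8],
    [0.8, 0.2, 0.9],
    [0.7, 0.9, 0.1],
]

def min_cost_matching(cost):
    """Brute-force minimum-cost one-to-one assignment (fine for small n)."""
    n = len(cost)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_cost:
            best_perm, best_cost = perm, total
    return best_perm, best_cost

assignment, total = min_cost_matching(cost)
print(assignment)        # (0, 1, 2)
print(round(total, 3))   # 0.4
```

For realistic slot counts, `scipy.optimize.linear_sum_assignment` solves the same problem in polynomial time; the brute-force version above is only for clarity.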
Experiment Setup | Yes | The hyperparameters for ungrounded and HMC matching backbones are given in Table 4. The hyperparameters for the NSI alignment model are listed in Table 5.