Neural Slot Interpreters: Grounding Object Semantics in Emergent Slot Representations
Authors: Bhishma Dedhia, Niraj Jha
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments with a bi-modal object-property and scene retrieval task demonstrate the grounding efficacy and interpretability of correspondences learned by NSI. From a scene representation standpoint, we find that emergent NSI slots that move beyond the image grid by binding to spatial objects facilitate improved visual grounding compared to conventional bounding-box-based approaches. From a data efficiency standpoint, we empirically validate that NSI learns more generalizable representations from a fixed amount of annotation data than the traditional approach. |
| Researcher Affiliation | Academia | Bhishma Dedhia, Niraj K. Jha, Department of Electrical and Computer Engineering, Princeton University, EMAIL |
| Pseudocode | Yes | Algorithm 1 Neural Slot Interpreter Contrastive Learning Pseudocode |
| Open Source Code | No | The paper does not contain an explicit statement about releasing their source code, nor does it provide a direct link to a code repository. |
| Open Datasets | Yes | Our experiments encompass different tasks on scenes ranging from synthetic renderings to in-the-wild scenes, viz. (1) CLEVr Hans (Stammer et al., 2020): objects scattered on a plane, (2) CLEVr Tex (Karazija et al., 2021): textured objects placed on textured backgrounds, (3) MOVi-C (Greff et al., 2022): photorealistic objects on real-world surfaces, and (4) MS-COCO 2017 (Lin et al., 2015): a large-scale object detection dataset containing real-world images. |
| Dataset Splits | Yes | The dataset splits used in this work are detailed in Table 3 (train / validation / test sizes): CLEVr Hans 3: 9000 / 2250 / 2250; CLEVr Hans 7: 21000 / 5250 / 5250; CLEVr Tex: 37500 / 2500 / 10000; MOVi-C: 198635 / 35053 / 6000; MS COCO 2017: 99676 / 17590 / 4952. |
| Hardware Specification | Yes | We list the hyperparameters for NSI and other methods used in our experiments, which were all performed on Nvidia A100 GPUs. |
| Software Dependencies | No | The paper mentions several software components like DINO ViT, PyTorch, MLP, Transformer, Gated Recurrent Unit, Slot Attention, and the Hungarian Algorithm, but it does not specify version numbers for any of these. |
| Experiment Setup | Yes | The hyperparameters for ungrounded and HMC matching backbones are given in Table 4. The hyperparameters for the NSI alignment model are listed in Table 5. |
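The paper's Algorithm 1 gives contrastive-learning pseudocode for aligning slot and program representations, but the full listing is not reproduced in this report. For orientation, here is a minimal, generic sketch of a symmetric InfoNCE contrastive loss over paired embeddings, written in NumPy. All names (`info_nce`, `slot_emb`, `prog_emb`, the `temperature` value) are illustrative assumptions, not the paper's actual implementation; the authoritative procedure is Algorithm 1 in the paper.

```python
import numpy as np

def info_nce(slot_emb, prog_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    slot_emb, prog_emb: (B, D) arrays; row i of each forms a positive pair,
    all other rows in the batch serve as negatives. This is a generic
    CLIP-style sketch, not the paper's exact Algorithm 1.
    """
    # L2-normalize so the dot product is a cosine similarity
    s = slot_emb / np.linalg.norm(slot_emb, axis=1, keepdims=True)
    p = prog_emb / np.linalg.norm(prog_emb, axis=1, keepdims=True)
    logits = s @ p.T / temperature  # (B, B) similarity matrix

    def xent_diag(l):
        # cross-entropy with targets on the diagonal
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # average the slot-to-program and program-to-slot directions
    return 0.5 * (xent_diag(logits) + xent_diag(logits.T))
```

Correctly paired embeddings should yield a lower loss than mismatched ones, which is the property the grounding objective exploits.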