Explicitly Disentangled Representations in Object-Centric Learning
Authors: Riccardo Majellaro, Jonathan Collu, Aske Plaat, Thomas M. Moerland
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The paper reports experiments on a range of object-centric benchmarks: "Our experimental evaluations indicate that, in most cases, DISA introduces the desired disentanglement property, enabling novel generative and compositional capabilities. At the same time, DISA is competitive with or outperforms the baselines at scene decomposition on three well-known synthetic datasets, while achieving significantly improved reconstruction quality." The evaluation covers unsupervised object discovery and image reconstruction. For object discovery, the predicted object masks are compared with the ground truth via the Adjusted Rand Index (ARI) (Rand, 1971; Hubert & Arabie, 1985), computed both including (BG-ARI) and excluding (FG-ARI) the background masks, in line with Locatello et al. (2020); Biza et al. (2023). Reconstruction quality is measured with the mean squared error (MSE). Table 1 summarizes the ARI scores, while Table 2 reports the MSE. |
| Researcher Affiliation | Academia | Riccardo Majellaro, Jonathan Collu, Aske Plaat, Thomas M. Moerland. Leiden Institute of Advanced Computer Science, Leiden University. "Work done while at LIACS, Leiden University." |
| Pseudocode | No | The paper describes the architecture and methods in detail through text and diagrams (e.g., Figure 1), but it does not include a formally labeled 'Pseudocode' or 'Algorithm' block. |
| Open Source Code | Yes | The code is available at https://github.com/riccardomajellaro/disentangled-slot-attention. |
| Open Datasets | Yes | In this section, we evaluate DISA on four well-known multi-object synthetic datasets (Kabra et al., 2019; Karazija et al., 2021): Tetrominoes, Multi-dSprites, CLEVR, and CLEVRTex (Appendix B). For CLEVR, we train on a filtered version of this dataset called CLEVR6. Multi-object datasets: https://github.com/deepmind/multi-object-datasets/, 2019. ClevrTex: A texture-rich benchmark for unsupervised multi-object segmentation. arXiv preprint arXiv:2111.10265, 2021. CLEVR, introduced by Johnson et al. (2017), is a collection of 240×320 images of 3D scenes. Multi-dSprites is a dataset based on dSprites (Matthey et al., 2017). |
| Dataset Splits | Yes | On Tetrominoes, we employ a total of 60K samples for the training set and 320 for the test set, in line with Greff et al. (2019); Locatello et al. (2020). Again [for Multi-dSprites], we employ a total of 60K samples for the training set and 320 for the test set, as in Greff et al. (2019); Locatello et al. (2020). The number of training samples [for CLEVR6], in this case, is 70K, while the test samples are 15K. CLEVRTex (Karazija et al., 2021) consists of 50K 240×320 images (cropped and resized to 128×128 as in CLEVR), 40K of which are training samples while the remaining 10K are equally split into validation and testing sets. |
| Hardware Specification | No | The paper does not provide specific details regarding the hardware used for running the experiments, such as GPU or CPU models. |
| Software Dependencies | No | The paper mentions using 'Adam as optimizer (Kingma & Ba, 2014)', but it does not specify software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | A batch size of 64 and a learning rate of 4×10⁻⁴ (using Adam as optimizer (Kingma & Ba, 2014)) are employed on all datasets for all the architectures. An initial (linear) warm-up for 10K steps and an exponential decay schedule (from start to end) are applied to the learning rate. On Tetrominoes, we perform roughly 100K steps (107 epochs × 938 batches), on Multi-dSprites around 250K (267 epochs × 938 batches), on CLEVR6 nearly 150K (274 epochs × 547 batches), and on CLEVRTex exactly 150K (240 epochs × 625 batches). The mean squared error is used as the reconstruction loss function. The number of iterations in the Slot Attention mechanism is fixed to 3, while the number of slots is set to 4 on Tetrominoes, 6 on Multi-dSprites, 7 on CLEVR6, and 11 on CLEVRTex. We set the dimensionality of the slot vectors to 64 (excluding position and scale factors) in all cases and, in DISA, we define the texture components as the first half and the mask components as the second half (32 each) of a representation. With ISA and DISA, we clip the norm of gradients to 0.05 (as in ISA's original implementation). Finally, on Tetrominoes and Multi-dSprites, we apply the variance regularization only on S^tex with λ = 0.32 from the beginning of training, while on CLEVR6 and CLEVRTex we apply it after the warm-up on both S^tex and S^shape (as in Equation 7) with λ = 0.05. |
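The BG-ARI/FG-ARI distinction used in the evaluation can be sketched in a few lines. This is an illustrative computation only, assuming flattened per-pixel integer segmentation masks and using scikit-learn's `adjusted_rand_score`; the paper's exact evaluation code lives in the linked repository, and the `bg_label` convention is an assumption.

```python
# Sketch of BG-ARI vs FG-ARI for one image, assuming per-pixel integer
# object ids and that background pixels carry a known id in the ground truth.
import numpy as np
from sklearn.metrics import adjusted_rand_score


def ari_scores(true_mask, pred_mask, bg_label=0):
    """Return (BG-ARI, FG-ARI) for flattened segmentation masks.

    BG-ARI scores all pixels; FG-ARI drops ground-truth background
    pixels before scoring, as in Locatello et al. (2020).
    """
    true_mask = np.asarray(true_mask).ravel()
    pred_mask = np.asarray(pred_mask).ravel()
    bg_ari = adjusted_rand_score(true_mask, pred_mask)
    fg = true_mask != bg_label  # keep only foreground pixels
    fg_ari = adjusted_rand_score(true_mask[fg], pred_mask[fg])
    return bg_ari, fg_ari
```

Because ARI is invariant to label permutations, a prediction that relabels every object but preserves the grouping still scores 1.0 on both metrics.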
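The learning-rate schedule in the setup row (linear warm-up for 10K steps, then exponential decay "from start to end") can be written as a small step-to-rate function. This is a minimal sketch: the paper does not state the final decay rate, so `final_factor` below is a hypothetical parameter, and the 150K `total_steps` default matches the CLEVRTex run.

```python
# Hedged sketch of the DISA learning-rate schedule: linear warm-up for
# `warmup_steps`, then exponential decay toward base_lr * final_factor.
# `final_factor` is an assumption; the paper does not give the end rate.
def disa_lr(step, base_lr=4e-4, warmup_steps=10_000,
            total_steps=150_000, final_factor=0.5):
    if step < warmup_steps:
        # Linear warm-up from 0 to base_lr.
        return base_lr * step / warmup_steps
    # Exponential decay over the remaining steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * final_factor ** progress
```

In practice this would be wrapped in an optimizer scheduler (e.g. a per-step lambda), alongside Adam and the 0.05 gradient-norm clipping the setup describes.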