On the Transfer of Object-Centric Representation Learning
Authors: Aniket Rajiv Didolkar, Andrii Zadaianchuk, Anirudh Goyal, Michael Mozer, Yoshua Bengio, Georg Martius, Maximilian Seitzer
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Thus, in this work, we answer the question of whether current real-world capable object-centric methods exhibit similar levels of transferability by introducing a benchmark comprising seven different synthetic and real-world datasets. We analyze the factors influencing performance under transfer and find that training on complex natural images improves generalization to unseen scenarios. Furthermore, inspired by the success of task-specific fine-tuning in foundation models, we introduce a novel fine-tuning strategy to adapt pre-trained vision encoders for the task of object discovery. We find that the proposed approach results in state-of-the-art performance for unsupervised object discovery, exhibiting strong zero-shot transfer to unseen datasets. |
| Researcher Affiliation | Collaboration | (1) MILA & University of Montreal; (2) University of Amsterdam; (3) Google DeepMind; (4) MPI for Intelligent Systems & University of Tübingen |
| Pseudocode | No | The paper describes the Slot Attention mechanism in detail in Section C.2 using prose and mathematical equations but does not present a clearly labeled pseudocode or algorithm block. |
| Open Source Code | No | The paper mentions a project website 'Website: rw-ocrl.github.io/ftdinosaur-paper' but does not explicitly state that the source code for the methodology described in *this* paper is available there. Mentions of GitHub links in the text refer to source code for other, third-party methods (Slot Diffusion, SPOT), not the authors' own implementation. |
| Open Datasets | Yes | To this end, we introduce a benchmark consisting of 7 datasets comprising a diverse range of synthetic and real-world scenes. Using this benchmark, we (1) seek to understand the zero-shot transfer capabilities of existing models, and (2) study the properties of training datasets that influence generalization. The general conclusion we draw from this benchmark is that object-centric models which are trained on naturalistic datasets consisting of a variety of objects, such as COCO (Lin et al., 2014), usually exhibit decent zero-shot generalization. To obtain a test bed that robustly measures zero-shot performance, we gather the evaluation splits of several datasets previously proposed by the object-centric community, with diverse properties and increasing complexity: CLEVRTEX (Karazija et al., 2021), SCANNET and YCB as used in Yang & Yang (2022), and MOVi-C and MOVi-E (Greff et al., 2022). Additionally, we add the challenging ENTITYSEG dataset (Lu et al., 2023), consisting of in-the-wild real-world images with high-quality mask annotations. For analysis, we also use the PASCAL VOC (Everingham et al., 2012) dataset, but do not include it in the zero-shot benchmark as its set of categories is fully included in the COCO dataset used for training. |
| Dataset Splits | Yes | For training, we use the COCO 2017 dataset which consists of 118 287 images. For evaluation, we use 5 000 images from the validation sets. ... ENTITYSEG ... consists of 31 789 images for training and 1 498 images for evaluation. ... PASCAL VOC ... total of 10 582 images for training, where 1 464 are from the segmentation train set and 9 118 are from the SBD dataset (Hariharan et al., 2011). For evaluating object discovery, we use the official instance segmentation validation split with 1 449 images. ... MOVi-C ... 87 633 training images for MOVi-C and 87 741 images for MOVi-E. For evaluation, we use 4 200 frames for MOVi-C and 4 176 frames for MOVi-E from the validation sets in each case. ... SCANNET and YCB ... Each of these datasets consists of 10 000 training images and 2 000 evaluation images. ... CLEVRTEX ... 40 000 images for training and 10 000 each for validation and test. We use the 5 000 images from the validation set for our evaluation. See also Table E.9 for an overview of the number of images per dataset. |
| Hardware Specification | Yes | For training our model, we use a single A100 GPU per run. ... For inference, we use a single A100 GPU for each of the baselines and the proposed approach. |
| Software Dependencies | No | The paper mentions the use of an 'AdamW optimizer' but does not specify version numbers for any programming languages, libraries, or frameworks (e.g., Python, PyTorch, TensorFlow, CUDA). |
| Experiment Setup | Yes | We detail the exact settings in App. C.4. ... As discussed in Sec. 4.1 in the main paper, we found an improved set of hyperparameters that work well for finetuning the pre-trained ViT encoder. We split these into general hyperparameters (G-HPs), affecting all modules of the model, and encoder hyperparameters (E-HPs), only affecting the finetuning of the encoder (see also Table C.8). ... In Table C.8, we list the hyperparameters for the following models mentioned in Sec. 4 and listed in Table 1: (1) DINOSAUR + Training from Random Init., (2) DINOSAUR + FT w/ G-HPs, (3) DINOSAUR + FT w/ G-HPs & E-HPs, (4) DINOSAUR + FT, + Top-k, + High-Res. Finetuning. |
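As the Pseudocode row notes, the paper describes Slot Attention only in prose and equations (Section C.2). For orientation, below is a minimal NumPy sketch of the generic Slot Attention iteration from Locatello et al. (2020), which DINOSAUR-style models build on. This is illustrative only and not the authors' implementation: the projection matrices are random stand-ins for learned weights, and the GRU/MLP slot update is simplified to a plain attention-weighted mean.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention(inputs, num_slots=4, iters=3, seed=0):
    """Simplified Slot Attention sketch (Locatello et al., 2020).

    inputs: (n, d) array of encoder features (e.g., ViT patch tokens).
    Returns final slots (num_slots, d) and attention map (n, num_slots).
    """
    rng = np.random.default_rng(seed)
    n, d = inputs.shape
    # Random projections stand in for the learned q/k/v linear maps.
    Wq = rng.standard_normal((d, d)) / np.sqrt(d)
    Wk = rng.standard_normal((d, d)) / np.sqrt(d)
    Wv = rng.standard_normal((d, d)) / np.sqrt(d)
    k, v = inputs @ Wk, inputs @ Wv
    # Slots initialized from a Gaussian (learned mean/variance in practice).
    slots = rng.standard_normal((num_slots, d))
    attn = None
    for _ in range(iters):
        q = slots @ Wq
        logits = k @ q.T / np.sqrt(d)                  # (n, num_slots)
        # Softmax over the slot axis: slots compete for each input token.
        attn = softmax(logits, axis=1)
        # Normalize over inputs so each slot takes a weighted mean.
        attn = attn / attn.sum(axis=0, keepdims=True)
        # Simplified update: weighted mean of values (no GRU/MLP refinement).
        slots = attn.T @ v
    return slots, attn
```

The key design point visible here is the competition between slots: the softmax runs over the slot axis rather than the input axis, so each input token distributes its attention across slots, which is what drives the decomposition into object-like groups.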