On the Transfer of Object-Centric Representation Learning
Authors: Aniket Rajiv Didolkar, Andrii Zadaianchuk, Anirudh Goyal, Michael Mozer, Yoshua Bengio, Georg Martius, Maximilian Seitzer
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Thus, in this work, we answer the question of whether current real-world capable object-centric methods exhibit similar levels of transferability by introducing a benchmark comprising seven different synthetic and real-world datasets. We analyze the factors influencing performance under transfer and find that training on complex natural images improves generalization to unseen scenarios. Furthermore, inspired by the success of task-specific fine-tuning in foundation models, we introduce a novel fine-tuning strategy to adapt pre-trained vision encoders for the task of object discovery. We find that the proposed approach results in state-of-the-art performance for unsupervised object discovery, exhibiting strong zero-shot transfer to unseen datasets. |
| Researcher Affiliation | Collaboration | (1) MILA & University of Montreal; (2) University of Amsterdam; (3) Google DeepMind; (4) MPI for Intelligent Systems & University of Tübingen |
| Pseudocode | No | The paper describes the Slot Attention mechanism in detail in Section C.2 using prose and mathematical equations but does not present a clearly labeled pseudocode or algorithm block. |
| Open Source Code | No | The paper mentions a project website 'Website: rw-ocrl.github.io/ftdinosaur-paper' but does not explicitly state that the source code for the methodology described in *this* paper is available there. Mentions of GitHub links in the text refer to source code for other, third-party methods (Slot Diffusion, SPOT), not the authors' own implementation. |
| Open Datasets | Yes | To this end, we introduce a benchmark consisting of 7 datasets comprising a diverse range of synthetic and real-world scenes. Using this benchmark, we (1) seek to understand the zero-shot transfer capabilities of existing models, and (2) study the properties of training datasets that influence generalization. The general conclusion we draw from this benchmark is that object-centric models which are trained on naturalistic datasets consisting of a variety of objects, such as COCO (Lin et al., 2014), usually exhibit decent zero-shot generalization. To obtain a test bed that robustly measures zero-shot performance, we gather the evaluation splits of several datasets previously proposed by the object-centric community, with diverse properties and increasing complexity: CLEVRTEX (Karazija et al., 2021), SCANNET and YCB as used in Yang & Yang (2022), and MOVi-C and MOVi-E (Greff et al., 2022). Additionally, we add the challenging ENTITYSEG dataset (Lu et al., 2023), consisting of in-the-wild real-world images with high-quality mask annotations. For analysis, we also use the PASCAL VOC (Everingham et al., 2012) dataset, but do not include it in the zero-shot benchmark as its set of categories is fully included in the COCO dataset used for training. |
| Dataset Splits | Yes | For training, we use the COCO 2017 dataset which consists of 118 287 images. For evaluation, we use 5 000 images from the validation sets. ... ENTITYSEG ... consists of 31 789 images for training and 1 498 images for evaluation. ... PASCAL VOC ... total of 10 582 images for training, where 1 464 are from the segmentation train set and 9 118 are from the SBD dataset (Hariharan et al., 2011). For evaluating object discovery, we use the official instance segmentation validation split with 1 449 images. ... MOVi-C ... 87 633 training images for MOVi-C and 87 741 images for MOVi-E. For evaluation, we use 4 200 frames for MOVi-C and 4 176 frames for MOVi-E from the validation sets in each case. ... SCANNET and YCB ... Each of these datasets consists of 10 000 training images and 2 000 evaluation images. ... CLEVRTEX ... 40 000 images for training and 10 000 each for validation and test. We use the 5 000 images from the validation set for our evaluation. See also Table E.9 for an overview of the number of images per dataset. |
| Hardware Specification | Yes | For training our model, we use a single A100 GPU per run. ... For inference, we use a single A100 GPU for each of the baselines and the proposed approach. |
| Software Dependencies | No | The paper mentions the use of an 'AdamW optimizer' but does not specify version numbers for any programming languages, libraries, or frameworks (e.g., Python, PyTorch, TensorFlow, CUDA). |
| Experiment Setup | Yes | We detail the exact settings in App. C.4. ... As discussed in Sec. 4.1 in the main paper, we found an improved set of hyperparameters that work well for finetuning the pre-trained ViT encoder. We split these into general hyperparameters (G-HPs), affecting all modules of the model, and encoder hyperparameters (E-HPs), only affecting the finetuning of the encoder (see also Table C.8). ... In Table C.8, we list the hyperparameters for the following models mentioned in Sec. 4 and listed in Table 1: (1) DINOSAUR + Training from Random Init., (2) DINOSAUR + FT w/ G-HPs, (3) DINOSAUR + FT w/ G-HPs & E-HPs, (4) DINOSAUR + FT, + Top-k, + High-Res. Finetuning. |
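As the Pseudocode row notes, the paper describes Slot Attention only in prose and equations (Section C.2). For orientation, below is a minimal NumPy sketch of the generic Slot Attention iteration from Locatello et al. (2020), which DINOSAUR-style models build on. This is illustrative only and not the authors' implementation: the projection matrices are random stand-ins for learned weights, and the GRU/MLP slot update is simplified to a plain attention-weighted mean.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention(inputs, num_slots=4, iters=3, seed=0):
    """Simplified Slot Attention sketch (Locatello et al., 2020).

    inputs: (n, d) array of encoder features (e.g., ViT patch tokens).
    Returns final slots (num_slots, d) and attention map (n, num_slots).
    """
    rng = np.random.default_rng(seed)
    n, d = inputs.shape
    # Random projections stand in for the learned q/k/v linear maps.
    Wq = rng.standard_normal((d, d)) / np.sqrt(d)
    Wk = rng.standard_normal((d, d)) / np.sqrt(d)
    Wv = rng.standard_normal((d, d)) / np.sqrt(d)
    k, v = inputs @ Wk, inputs @ Wv
    # Slots initialized from a Gaussian (learned mean/variance in practice).
    slots = rng.standard_normal((num_slots, d))
    attn = None
    for _ in range(iters):
        q = slots @ Wq
        logits = k @ q.T / np.sqrt(d)                  # (n, num_slots)
        # Softmax over the slot axis: slots compete for each input token.
        attn = softmax(logits, axis=1)
        # Normalize over inputs so each slot takes a weighted mean.
        attn = attn / attn.sum(axis=0, keepdims=True)
        # Simplified update: weighted mean of values (no GRU/MLP refinement).
        slots = attn.T @ v
    return slots, attn
```

The key design point visible here is the competition between slots: the softmax runs over the slot axis rather than the input axis, so each input token distributes its attention across slots, which is what drives the decomposition into object-like groups.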