Object-Centric Pretraining via Target Encoder Bootstrapping
Authors: Nikola Đukić, Tim Lebailly, Tinne Tuytelaars
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | When pretrained on 241k images from COCO, OCEBO achieves unsupervised object discovery performance comparable to that of object-centric models with frozen non-object-centric target encoders pretrained on hundreds of millions of images. The code and pretrained models are publicly available at https://github.com/djukicn/ocebo. We start by outlining the implementation details (training datasets, evaluation protocols, model architecture and the training setup) of OCEBO in Section 4.1. In Section 4.2, we demonstrate that OCEBO can be pretrained from scratch on real-world data without slot collapse. We justify the design choices and demonstrate data scalability, further discussing the requirements for suitable pretraining datasets. Finally, we put the performance of OCEBO in context by comparing it to state-of-the-art object-centric approaches that rely on non-object-centric target encoders pretrained on orders of magnitude more data in Section 4.3. |
| Researcher Affiliation | Academia | Nikola Đukić, Tim Lebailly, Tinne Tuytelaars, KU Leuven, EMAIL |
| Pseudocode | No | The paper describes the methodology using textual descriptions and mathematical equations (e.g., equations 1-7) but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code and pretrained models are publicly available at https://github.com/djukicn/ocebo. |
| Open Datasets | Yes | OCEBO is trained on MS COCO (Lin et al., 2015), the most common real-world dataset in the object-centric literature. We use the train2017 COCO split with approximately 118k images. Additionally, we construct a larger dataset of 241k images named COCO+ by combining the train2017 and unlabeled2017 splits. All datasets used are publicly available. |
| Dataset Splits | Yes | We use the train2017 COCO split with approximately 118k images. Additionally, we construct a larger dataset of 241k images named COCO+ by combining the train2017 and unlabeled2017 splits. We use validation splits of each dataset and 11, 24, 7 and 7 slots for MOVi-C, MOVi-E, Pascal VOC and Entity Seg, respectively. |
| Hardware Specification | No | The paper mentions using the LUMI supercomputer for access but does not provide specific hardware details such as GPU/CPU models, memory, or processing power used for the experiments. |
| Software Dependencies | No | The paper describes the use of models like Vision Transformer (ViT) and DINO, but does not provide specific version numbers for software dependencies such as deep learning frameworks (e.g., PyTorch, TensorFlow) or CUDA. |
| Experiment Setup | Yes | We train the model for 300 epochs with an additional mask sharpening stage of 100 epochs. As in DINO, target encoder updates are performed with momentum following a cosine schedule between 0.996 and 1. Scaling temperatures are τ = 0.1 and τt = 0.07, with the latter being linearly increased from the initial 0.04 during a 30-epoch warmup stage. Learning rate is linearly ramped up to the base value of 0.0003 during the first 10 epochs and decayed following a cosine schedule. Finally, we set λoc = λglobal = 1. |
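The schedules quoted in the Experiment Setup row (cosine momentum between 0.996 and 1, linear LR warmup to 3e-4 followed by cosine decay, and a linear 30-epoch warmup of the target temperature from 0.04 to 0.07) can be sketched as plain Python functions. This is an illustrative sketch only, not the authors' implementation: per-epoch granularity and a final LR of 0 are assumptions, and the function names are hypothetical.

```python
import math

def cosine_schedule(start, end, step, total_steps):
    """Cosine interpolation from `start` to `end` over `total_steps`."""
    progress = step / total_steps
    return end + 0.5 * (start - end) * (1 + math.cos(math.pi * progress))

def momentum_at(epoch, total_epochs=300, base=0.996, final=1.0):
    # Target-encoder EMA momentum: cosine schedule from 0.996 to 1 (as in DINO).
    return cosine_schedule(base, final, epoch, total_epochs)

def lr_at(epoch, total_epochs=300, base_lr=3e-4, warmup_epochs=10, final_lr=0.0):
    # Linear ramp to the base value over the first 10 epochs,
    # then cosine decay (final LR of 0 is an assumption).
    if epoch < warmup_epochs:
        return base_lr * epoch / warmup_epochs
    return cosine_schedule(base_lr, final_lr,
                           epoch - warmup_epochs, total_epochs - warmup_epochs)

def target_temp_at(epoch, warmup_epochs=30, start=0.04, final=0.07):
    # Target temperature tau_t: linear increase from 0.04 to 0.07
    # during the 30-epoch warmup stage, then constant.
    if epoch < warmup_epochs:
        return start + (final - start) * epoch / warmup_epochs
    return final
```

The scaling temperature τ = 0.1 and the loss weights λoc = λglobal = 1 quoted above are constants, so they need no schedule.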