Object-Centric Pretraining via Target Encoder Bootstrapping
Authors: Nikola Đukić, Tim Lebailly, Tinne Tuytelaars
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | When pretrained on 241k images from COCO, OCEBO achieves unsupervised object discovery performance comparable to that of object-centric models with frozen non-object-centric target encoders pretrained on hundreds of millions of images. The code and pretrained models are publicly available at https://github.com/djukicn/ocebo. We start by outlining the implementation details (training datasets, evaluation protocols, model architecture and the training setup) of OCEBO in Section 4.1. In Section 4.2, we demonstrate that OCEBO can be pretrained from scratch on real-world data without slot collapse. We justify the design choices and demonstrate data scalability, further discussing the requirements for suitable pretraining datasets. Finally, we put the performance of OCEBO in context by comparing it to state-of-the-art object-centric approaches that rely on non-object-centric target encoders pretrained on orders of magnitude more data in Section 4.3. |
| Researcher Affiliation | Academia | Nikola Đukić, Tim Lebailly, Tinne Tuytelaars, KU Leuven, EMAIL |
| Pseudocode | No | The paper describes the methodology using textual descriptions and mathematical equations (e.g., equations 1-7) but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code and pretrained models are publicly available at https://github.com/djukicn/ocebo. |
| Open Datasets | Yes | OCEBO is trained on MS COCO (Lin et al., 2015), the most common real-world dataset in the object-centric literature. We use the train2017 COCO split with approximately 118k images. Additionally, we construct a larger dataset of 241k images named COCO+ by combining the train2017 and unlabeled2017 splits. All datasets used are publicly available. |
| Dataset Splits | Yes | We use the train2017 COCO split with approximately 118k images. Additionally, we construct a larger dataset of 241k images named COCO+ by combining the train2017 and unlabeled2017 splits. We use validation splits of each dataset and 11, 24, 7 and 7 slots for MOVi-C, MOVi-E, Pascal VOC and Entity Seg, respectively. |
| Hardware Specification | No | The paper mentions using the LUMI supercomputer for access but does not provide specific hardware details such as GPU/CPU models, memory, or processing power used for the experiments. |
| Software Dependencies | No | The paper describes the use of models like Vision Transformer (ViT) and DINO, but does not provide specific version numbers for software dependencies such as deep learning frameworks (e.g., PyTorch, TensorFlow) or CUDA. |
| Experiment Setup | Yes | We train the model for 300 epochs with an additional mask sharpening stage of 100 epochs. As in DINO, target encoder updates are performed with momentum following a cosine schedule between 0.996 and 1. Scaling temperatures are τ = 0.1 and τt = 0.07, with the latter being linearly increased from the initial 0.04 during a 30-epoch warmup stage. Learning rate is linearly ramped up to the base value of 0.0003 during the first 10 epochs and decayed following a cosine schedule. Finally, we set λoc = λglobal = 1. |
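The schedules quoted in the Experiment Setup row (cosine momentum between 0.996 and 1, linear LR warmup to 3e-4 followed by cosine decay, and a linear 30-epoch warmup of the target temperature from 0.04 to 0.07) can be sketched as plain Python functions. This is an illustrative sketch only, not the authors' implementation: per-epoch granularity and a final LR of 0 are assumptions, and the function names are hypothetical.

```python
import math

def cosine_schedule(start, end, step, total_steps):
    """Cosine interpolation from `start` to `end` over `total_steps`."""
    progress = step / total_steps
    return end + 0.5 * (start - end) * (1 + math.cos(math.pi * progress))

def momentum_at(epoch, total_epochs=300, base=0.996, final=1.0):
    # Target-encoder EMA momentum: cosine schedule from 0.996 to 1 (as in DINO).
    return cosine_schedule(base, final, epoch, total_epochs)

def lr_at(epoch, total_epochs=300, base_lr=3e-4, warmup_epochs=10, final_lr=0.0):
    # Linear ramp to the base value over the first 10 epochs,
    # then cosine decay (final LR of 0 is an assumption).
    if epoch < warmup_epochs:
        return base_lr * epoch / warmup_epochs
    return cosine_schedule(base_lr, final_lr,
                           epoch - warmup_epochs, total_epochs - warmup_epochs)

def target_temp_at(epoch, warmup_epochs=30, start=0.04, final=0.07):
    # Target temperature tau_t: linear increase from 0.04 to 0.07
    # during the 30-epoch warmup stage, then constant.
    if epoch < warmup_epochs:
        return start + (final - start) * epoch / warmup_epochs
    return final
```

The scaling temperature τ = 0.1 and the loss weights λoc = λglobal = 1 quoted above are constants, so they need no schedule.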