Multi-Scale Fusion for Object Representation
Authors: Rongzhen Zhao, Vivienne Huiling Wang, Juho Kannala, Joni Pajarinen
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 4 EXPERIMENTS. We experimentally verify the following points: (i) our technique MSF improves mainstream OCL methods, whether transformer-based or diffusion-based; (ii) MSF fuses multi-scale information into the VAE discrete representation and thereby guides OCL better; (iii) how the component designs of MSF contribute to its effectiveness. Results are mostly averaged over three random seeds. Metrics including ARI (Adjusted Rand Index), ARI-fg (foreground ARI), mIoU (mean Intersection-over-Union), and mBO (mean Best Overlap) (Caron et al., 2021) are used to measure OCL's byproduct segmentation accuracy. |
| Researcher Affiliation | Academia | 1 Department of Electrical Engineering and Automation, Aalto University, Finland; 2 Department of Computer Science, Aalto University, Finland; 3 Center for Machine Vision and Signal Analysis, University of Oulu, Finland |
| Pseudocode | Yes | Algorithm 1: Implementation of inter/intra-scale quantized fusion in PyTorch. |
| Open Source Code | Yes | The source code is available at https://github.com/Genera1Z/MultiScaleFusion. |
| Open Datasets | Yes | Datasets are either synthetic images, i.e., ClevrTex; real-world images, i.e., COCO and VOC; or synthetic videos, i.e., MOVi-C/D/E. (Footnote URLs: ClevrTex: https://www.robots.ox.ac.uk/~vgg/data/clevrtex/; COCO: https://cocodataset.org/#panoptic-2020; VOC: http://host.robots.ox.ac.uk/pascal/VOC/voc2012/index.html; MOVi: https://github.com/google-research/kubric/tree/main/challenges/movi) |
| Dataset Splits | No | The datasets used for image OCL include ClevrTex, COCO, and VOC, with the latter two being real-world datasets. For video OCL, the datasets are MOVi-C, D, and E, all of which are synthetic. In line with established practice, our training process comprises two stages. The first stage is pretraining, during which VAE modules are trained on their respective datasets to acquire robust discrete intermediate representations. In the second stage, OCL training utilizes the pretrained VAE representations to guide object-centric learning. |
| Hardware Specification | No | We also appreciate CSC IT Center for Science, Finland, for granting access to the LUMI supercomputer, owned by the EuroHPC Joint Undertaking and hosted by CSC (Finland) in collaboration with the LUMI consortium. Furthermore, we acknowledge the computational resources provided by the Aalto Science-IT project through the Triton cluster. |
| Software Dependencies | No | Automatic mixed precision is utilized, leveraging the PyTorch autocast API. In tandem with this, we use PyTorch's built-in gradient scaler to enable gradient clipping with a maximum norm of 1.0. |
| Experiment Setup | Yes | VAE pretraining: for all datasets, we run 30,000 training iterations, validating every 600 iterations, which yields roughly 50 checkpoints per OCL model per dataset; to save storage, only the final 25 checkpoints are retained. The batch size is 64 for image datasets and 16 for video datasets, in both training and validation. We use the Adam optimizer with an initial learning rate of 2e-3, cosine annealing scheduling, and a linear warmup over the first 1/20 of the total steps. Automatic mixed precision is enabled via the PyTorch autocast API, and PyTorch's built-in gradient scaler is used together with gradient clipping at a maximum norm of 1.0. These settings are uniform across datasets. OCL training: we load the pretrained VAE weights to guide the OCL model, with the VAE part frozen. For all datasets, we run 50,000 training iterations, validating every 1,000 iterations, again retaining only the last 25 of about 50 checkpoints. The batch size for image datasets is 32 for both training and validation; for video datasets it is 8 for training and 4 for validation, to account for the increased time steps during video validation. We use the Adam optimizer with an initial learning rate of 2e-4, a cosine annealing schedule, and a linear warmup over the first 1/20 of the total steps. Automatic mixed precision is again enabled via the PyTorch autocast API, with the gradient scaler applying gradient clipping at a maximum norm of 1.0 for images and 0.02 for videos. For random query initialization, we adjust the σ value of the learned non-shared Gaussian distribution to balance exploration and exploitation: on multi-object datasets, σ starts at 1 and decays to 0 by the end of training via cosine annealing; on single-object datasets, σ remains 0 throughout training. During validation and testing, σ is set to 0 to ensure deterministic performance. |
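The table notes that the paper's Algorithm 1 implements inter/intra-scale quantized fusion in PyTorch. As a rough illustration of the idea only (not the authors' implementation), the sketch below quantizes each scale of a feature pyramid against a shared codebook (intra-scale) and then averages the quantized maps across scales (inter-scale). The function names, the equal averaging weights, and the assumption that all scales are already resized to a common shape are hypothetical; NumPy stands in for PyTorch for portability.

```python
import numpy as np

def quantize(features, codebook):
    """Nearest-neighbor vector quantization.

    features: (n, d) array of feature vectors.
    codebook: (k, d) array of code vectors.
    Returns the quantized features and their code indices.
    """
    # Pairwise squared distances between features and codebook entries.
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d2.argmin(axis=1)
    return codebook[idx], idx

def multi_scale_fusion(feature_pyramid, codebook, weights=None):
    """Hypothetical inter/intra-scale quantized fusion sketch.

    Each scale is quantized against the shared codebook (intra-scale),
    then the quantized maps are averaged across scales (inter-scale).
    All scales are assumed to be resized to the same (n, d) shape.
    """
    quantized = [quantize(f, codebook)[0] for f in feature_pyramid]
    if weights is None:
        weights = [1.0 / len(quantized)] * len(quantized)
    return sum(w * q for w, q in zip(weights, quantized))
```

The fused output has the same shape as a single scale and could serve as the discrete guidance signal the report describes; the authors' actual fusion order and weighting may differ.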
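The learning-rate schedule described in the Experiment Setup row (linear warmup over the first 1/20 of the total steps, then cosine annealing) can be written down directly. The sketch below is a minimal reconstruction from that description; the function name and the exact warmup rounding are assumptions. The same cosine shape, without warmup, describes the σ decay from 1 to 0 used for random query initialization on multi-object datasets.

```python
import math

def lr_at(step, total_steps, base_lr, warmup_frac=1 / 20):
    """Cosine-annealed learning rate with linear warmup.

    Linear warmup over the first `warmup_frac` of steps, then cosine
    decay to zero. Per the report, base_lr is 2e-3 for VAE pretraining
    and 2e-4 for OCL training.
    """
    warmup = max(1, int(total_steps * warmup_frac))
    if step < warmup:
        # Linear ramp from ~0 up to base_lr at the end of warmup.
        return base_lr * (step + 1) / warmup
    # Cosine decay from base_lr down to 0 over the remaining steps.
    progress = (step - warmup) / max(1, total_steps - warmup)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

For the 30,000-iteration pretraining stage this gives a 1,500-step warmup; the schedule is continuous at the warmup boundary, where the rate equals `base_lr`.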