Test-Time Canonicalization by Foundation Models for Robust Perception

Authors: Utkarsh Singhal, Ryan Feng, Stella X. Yu, Atul Prakash

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We evaluate FOCAL across a range of challenging transformations, including 3D-viewpoint, illumination, day-night changes, and 2D rotations. We find that FOCAL improves out-of-distribution performance of foundation models such as CLIP (Radford et al., 2021) on ImageNet (Deng et al., 2009) scale datasets. ... We demonstrate the effectiveness of FOCAL through evaluations on modern models such as CLIP, OV-Seg, and SAM, across diverse datasets including ImageNet, COCO, Objaverse-LVIS, and CO3D.
Researcher Affiliation Academia ¹UC Berkeley, ²University of Michigan. Correspondence to: Utkarsh Singhal <EMAIL>.
Pseudocode No The paper describes the method using text and mathematical equations, but does not include any explicit pseudocode or algorithm blocks.
Open Source Code Yes Our code is available at: https://github.com/sutkarsh/focal.
Open Datasets Yes We demonstrate the effectiveness of FOCAL through evaluations on modern models such as CLIP, OV-Seg, and SAM, across diverse datasets including ImageNet, COCO, Objaverse-LVIS, and CO3D. ... We evaluate chrominance (color) and contrast transformations on CIFAR100 (Krizhevsky et al., 2010) and ImageNet (Deng et al., 2009) with CLIP (Radford et al., 2021). ... We compare against PRLC (Mondal et al., 2023) on their 2D rotation settings (C8) using PRLC-trained ViT (Dosovitskiy et al., 2021) and PRLC-trained ResNet50 (He et al., 2016) models across CIFAR10 (Krizhevsky et al., 2010), CIFAR100 (Krizhevsky et al., 2010), and STL10 (Coates et al., 2011).
Dataset Splits No The paper discusses filtering processes for Objaverse-LVIS and CO3D, and refers to existing settings for 2D rotation experiments (e.g., 'their 2D rotation settings (C8)'). However, it does not explicitly provide specific train/test/validation splits with percentages, sample counts, or detailed methodology for all its experiments within the main text.
Hardware Specification Yes All experiments were done on an RTX 2080Ti GPU except 3D viewpoint, which was done on an RTX 6000 Ada Generation GPU.
Software Dependencies Yes For Objaverse-LVIS, we noticed cases of misleading and overlapping labels and thus filtered out such objects. ... We then pass the crop and the cropped segmentation to gpt-4o-mini-2025-04-16 (OpenAI, 2025) with the following prompt: ... Rendering: For Objaverse-LVIS (Deitke et al., 2023), we generate our base input renders at viewpoints in the upper viewing hemisphere. We sample at an interval of 30 degrees and a radius of 2.2. ... Blender Foundation. Blender: A 3D Modeling and Rendering Package. Blender Foundation, 2022. URL https://www.blender.org. Version 3.2.2.
Experiment Setup Yes Combining energy functions: We minimize the combined energy E_FOCAL(t(x)) over all transformations t ∈ T to find the canonical version of the input image x. This is done by solving the following optimization problem: E_FOCAL(t(x)) = γ1·E_CLIP(t(x)) + γ2·E_diff(t(x)) (Eq. 5), where α, β, γ1, γ2 ∈ ℝ are hyperparameters. ... Bayesian Optimization for Efficient Search: ... We utilize Bayesian Optimization (BO) with a Gaussian Process (GP) using an RBF kernel and the Expected Improvement (EI) acquisition function (Jones et al., 1998) to balance exploration and exploitation. ... (Appendix B.1) For both Objaverse (Deitke et al., 2023) and CO3D (Reizenstein et al., 2021), we use α = 1, β = 0.5 following the 2D experiments (B.3). We also used the diffusion energy (steps 500 to 1000 with stride 100) with a factor of 5. ... (Appendix B.2) We define the color shift transformation using the popular von Kries model ... For initialization, we use random as well as a grid of initial samples. Color uses a uniform 3x3 grid, 6 random points, and 20 iterations. Contrast uses 3 grid points, 4 random points, and 5 iterations. ... (Appendix B.3) For experiments on ImageNet, CIFAR10, CIFAR100, and STL10, we only used the classification energy for computational efficiency. We used α = 1, β = 0.5 for all these settings. ... For segmentation, we used the diffusion energy (steps 50 to 150 with stride 10) with a factor of 0.67 along with a CLIP energy factor of 0.54 and β = 0.2.
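The search procedure quoted above (minimize a combined energy over transformation parameters via a GP with an RBF kernel and Expected Improvement, initialized from a grid plus random points) can be sketched as follows. This is a minimal, numpy-only illustration, not the paper's implementation: the CLIP and diffusion energies are replaced by toy placeholder functions over a single hypothetical transformation parameter theta, and all function names and constants below are illustrative.

```python
import numpy as np
from math import erf

_erf = np.vectorize(erf)

def combined_energy(theta, gamma1=1.0, gamma2=0.5):
    """Toy stand-in for E_FOCAL = gamma1*E_CLIP + gamma2*E_diff.
    The real energies require CLIP and a diffusion model; these
    placeholders just give the search a non-trivial landscape."""
    e_clip = (theta - 0.3) ** 2           # placeholder for E_CLIP(t(x))
    e_diff = np.abs(np.sin(3.0 * theta))  # placeholder for E_diff(t(x))
    return gamma1 * e_clip + gamma2 * e_diff

def rbf_kernel(a, b, length_scale=0.3):
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def gp_posterior(X, y, Xs, jitter=1e-6):
    """Standard zero-mean GP regression posterior via Cholesky solves."""
    y_mean = y.mean()
    K = rbf_kernel(X, X) + jitter * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y - y_mean))
    Ks = rbf_kernel(X, Xs)
    mu = Ks.T @ alpha + y_mean
    v = np.linalg.solve(L, Ks)
    var = 1.0 - np.sum(v ** 2, axis=0)    # k(x, x) = 1 for this RBF
    return mu, np.sqrt(np.maximum(var, 1e-12))

def expected_improvement(mu, sigma, best):
    """EI for minimization: expected value of max(best - f, 0)."""
    z = (best - mu) / sigma
    cdf = 0.5 * (1.0 + _erf(z / np.sqrt(2.0)))
    pdf = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
    return (best - mu) * cdf + sigma * pdf

def canonicalize_1d(n_random=6, n_iter=20, seed=0):
    """BO over a 1D transformation parameter in [-1, 1], initialized
    with a small grid plus random points (mirroring the quoted setup)."""
    rng = np.random.default_rng(seed)
    X = np.concatenate([np.linspace(-1.0, 1.0, 3),
                        rng.uniform(-1.0, 1.0, n_random)])
    y = combined_energy(X)
    candidates = np.linspace(-1.0, 1.0, 201)
    for _ in range(n_iter):
        mu, sigma = gp_posterior(X, y, candidates)
        x_next = candidates[np.argmax(expected_improvement(mu, sigma, y.min()))]
        X = np.append(X, x_next)
        y = np.append(y, combined_energy(x_next))
    i = int(np.argmin(y))
    return X[i], y[i]

theta_star, e_star = canonicalize_1d()
```

In the actual method the argmin over t ∈ T yields the canonicalized image t(x), which is then passed to the downstream model; here the loop simply returns the lowest-energy parameter found.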