Towards an Explainable Comparison and Alignment of Feature Embeddings

Authors: Mohammad Jalali, Bahar Dibaei Nia, Farzan Farnia

ICML 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We provide numerical results demonstrating SPEC's application to compare and align embeddings on large-scale datasets such as ImageNet and MS-COCO. In this section, we first discuss the experimental settings and then apply the SPEC algorithm to compare different image and text embeddings across various large-scale datasets.
Researcher Affiliation | Academia | 1The Chinese University of Hong Kong, 2Sharif University of Technology. Correspondence to: Mohammad Jalali <EMAIL>, Bahar Dibaei Nia <EMAIL>, Farzan Farnia <EMAIL>.
Pseudocode | Yes |
Algorithm 1 Spectral Pairwise Embedding Comparison (SPEC)
1: Input: Sample set {x1, ..., xn}, embeddings ψ1 and ψ2, kernel feature maps ϕ1 and ϕ2
2: Initialize C_ψ1 = 0_{d1×d1}, C_ψ2 = 0_{d2×d2}, C_{ψ1,ψ2} = 0_{d1×d2}
3: for i ∈ {1, ..., n} do
4:   Update C_ψ1 ← C_ψ1 + (1/n) ϕ1(ψ1(xi)) ϕ1(ψ1(xi))⊤
5:   Update C_ψ2 ← C_ψ2 + (1/n) ϕ2(ψ2(xi)) ϕ2(ψ2(xi))⊤
6:   Update C_{ψ1,ψ2} ← C_{ψ1,ψ2} + (1/n) ϕ1(ψ1(xi)) ϕ2(ψ2(xi))⊤
7: end for
8: Construct Γ_{ψ1,ψ2} as in Equation (4)
9: Compute eigenvalues λ_{1:d1+d2} and eigenvectors v_{1:d1+d2} of the non-symmetric matrix Γ_{ψ1,ψ2}
10: for i ∈ {1, ..., d1+d2} do
11:   Map eigenvector u_i = [ϕ1(ψ1(X)) ϕ2(ψ2(X))] v_i
12: end for
13: Output: Eigenvalues λ_1, ..., λ_{d1+d2}; eigenvectors u_1, ..., u_{d1+d2}
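The covariance-accumulation loop of Algorithm 1 can be sketched in NumPy. This is a minimal sketch: the embeddings, kernel feature maps, and data below are toy stand-ins, and the construction of Γ from the paper's Equation (4) is not reproduced in this excerpt, so the sketch stops at the three covariance blocks.

```python
import numpy as np

def spec_covariances(X, psi1, psi2, phi1, phi2):
    """Compute the three covariance matrices accumulated in Algorithm 1.

    X:            iterable of n samples
    psi1, psi2:   embedding maps
    phi1, phi2:   kernel feature maps
    Returns (C1, C2, C12) with shapes (d1, d1), (d2, d2), (d1, d2).
    """
    F1 = np.stack([phi1(psi1(x)) for x in X])  # (n, d1) kernel features
    F2 = np.stack([phi2(psi2(x)) for x in X])  # (n, d2) kernel features
    n = len(F1)
    C1 = F1.T @ F1 / n    # C_ψ1  = (1/n) Σ ϕ1(ψ1(xi)) ϕ1(ψ1(xi))ᵀ
    C2 = F2.T @ F2 / n    # C_ψ2  = (1/n) Σ ϕ2(ψ2(xi)) ϕ2(ψ2(xi))ᵀ
    C12 = F1.T @ F2 / n   # C_{ψ1,ψ2} = (1/n) Σ ϕ1(ψ1(xi)) ϕ2(ψ2(xi))ᵀ
    return C1, C2, C12

# Toy run: identity embeddings with random-feature kernel maps (hypothetical).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
W1, W2 = rng.normal(size=(8, 4)), rng.normal(size=(8, 6))
C1, C2, C12 = spec_covariances(
    X,
    psi1=lambda x: x, psi2=lambda x: x,
    phi1=lambda z: np.cos(z @ W1), phi2=lambda z: np.cos(z @ W2),
)
# Per Algorithm 1, Γ_{ψ1,ψ2} is then assembled from these blocks (Eq. (4) in
# the paper) and its eigenpairs computed, e.g. with np.linalg.eig, since Γ is
# non-symmetric.
```

The streaming updates of lines 4–6 are folded into single matrix products here, which is algebraically identical to accumulating one rank-one term per sample.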
Open Source Code | No | The project page is available at https://mjalali.github.io/SPEC/. (This is a project page, not explicitly a code repository. Per the instructions, project demonstration pages or high-level project overview pages are not sufficient for 'Yes' unless they explicitly host the source code, which is not stated here.)
Open Datasets | Yes | In our experiments on image data, we used four datasets: AFHQ (Choi et al., 2020) (15K animal faces in categories of cats, wildlife, and dogs), FFHQ (Karras et al., 2019) (70K human-face images), ImageNet-1K (Deng et al., 2009) (1.4 million images across 1,000 labels), and MS-COCO 2017 (Lin et al., 2015) (~110K samples of diverse scenes with multiple objects).
Dataset Splits | No | In our experiments on image data, we used four datasets: AFHQ (Choi et al., 2020) (15K animal faces in categories of cats, wildlife, and dogs), FFHQ (Karras et al., 2019) (70K human-face images), ImageNet-1K (Deng et al., 2009) (1.4 million images across 1,000 labels), and MS-COCO 2017 (Lin et al., 2015) (~110K samples of diverse scenes with multiple objects). We used the OpenCLIP GitHub repository (link) and used the MS-COCO 2017 training set, which consists of 120K pairs of texts and images. (The paper lists datasets and mentions a "training set" for MS-COCO and ImageNet, but does not provide explicit details on how these datasets were split into training, validation, and test sets, or reference standard splits with sufficient detail.)
Hardware Specification | Yes | The experiments were performed on two RTX-4090 GPUs.
Software Dependencies | No | The paper mentions using the 'OpenCLIP GitHub repository' and 'PyTorch's eig command', but does not specify version numbers for any key software components such as Python, PyTorch, or CUDA.
Experiment Setup | Yes |
Parameter | Value
accum freq | 1
alignment loss weight | 0.1
batch size | 128
clip alignment contrastive loss weight | 0.9
coca contrastive loss weight | 1.0
distributed | True
epochs | 10
lr | 1e-05
lr scheduler | cosine
model | ViT-B-32
name | Vit-B-32 laion2b e16 freeze 5
precision | amp
pretrained | laion2b e16
seed | 0
wd | 0.2
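The hyperparameters above can be collected into a plain mapping for reference. The values are the report's own; the key names are normalized (underscored) and the dict layout itself is illustrative, not the authors' actual configuration format.

```python
# Fine-tuning hyperparameters as listed in the report (dict form is a sketch,
# not the authors' real config file).
config = {
    "accum_freq": 1,
    "alignment_loss_weight": 0.1,
    "batch_size": 128,
    "clip_alignment_contrastive_loss_weight": 0.9,
    "coca_contrastive_loss_weight": 1.0,
    "distributed": True,
    "epochs": 10,
    "lr": 1e-05,
    "lr_scheduler": "cosine",
    "model": "ViT-B-32",
    "name": "Vit-B-32_laion2b_e16_freeze_5",
    "precision": "amp",
    "pretrained": "laion2b_e16",
    "seed": 0,
    "wd": 0.2,
}
```

Note that the contrastive-loss weights (0.9 for CLIP alignment, 0.1 for the alignment loss) sum to 1.0, consistent with a convex combination of the two objectives.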