Multimodal Unsupervised Domain Generalization by Retrieving Across the Modality Gap

Authors: Christopher Liao, Christian So, Theodoros Tsiligkaridis, Brian Kulis

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We compare against state-of-the-art name-only transfer, source-free DG and zero-shot (ZS) methods on their respective benchmarks and show consistent improvement in accuracy on 20 diverse datasets. Code is available: https://github.com/Chris210634/mudg [...] 4 EXPERIMENTS We experiment with the ViT-B/16 and ViT-L/14 pretrained weights released by Radford et al. (2021) and available through the Python openclip package (Ilharco et al., 2021)."
Researcher Affiliation | Collaboration | Christopher Liao, Boston University, EMAIL; Christian So, Boston University, EMAIL; Theodoros Tsiligkaridis, MIT Lincoln Laboratory, EMAIL; Brian Kulis, Boston University, EMAIL
Pseudocode | Yes | "Algorithm 1 Paired k-means [...] Algorithm 2 MUDG"
Open Source Code | Yes | "Code is available: https://github.com/Chris210634/mudg"
Open Datasets | Yes | "Datasets We experiment with a diverse set of target classification tasks. ImageNet-1K (Russakovsky et al., 2015), Caltech-101 (Li et al., 2022a), Oxford-Pets (Parkhi et al., 2012), Stanford-Cars (Krause et al., 2013), Flowers-102 (Nilsback and Zisserman, 2008), Food-101 (Bossard et al., 2014), FGVC-Aircraft (Maji et al., 2013), SUN-397 (Xiao et al., 2010), Describable-Textures (DTD) (Cimpoi et al., 2013), EuroSAT (Helber et al., 2019), and UCF-101 (an action recognition dataset) (Soomro et al., 2012) in Table 2, and ImageNet-V2 (Recht et al., 2019), ImageNet-Sketch (Wang et al., 2019), ImageNet-A (natural adversarial examples) (Hendrycks et al., 2021b), and ImageNet-R (Hendrycks et al., 2021a) in Table 3 are commonly used by zero-shot papers, while Office-Home (Venkateswara et al., 2017), Terra Incognita (Beery et al., 2018), DomainNet (Peng et al., 2019), VLCS (Torralba and Efros, 2011), and PACS (Li et al., 2017a) are common DG and DA datasets."
Dataset Splits | Yes | (Same supporting passage as the Open Datasets row above.)
Hardware Specification | Yes | "Hardware and Computational Cost We ran experiments on a hybrid computing cluster with A40, A100 and L40S GPUs. All experiments require only one GPU at a time. ViT-B/16 experiments require a GPU with 40 GB of memory; ViT-L/14 experiments require a GPU with 80 GB of memory."
Software Dependencies | No | "We experiment with the ViT-B/16 and ViT-L/14 pretrained weights released by Radford et al. (2021) and available through the Python openclip package (Ilharco et al., 2021). The indexing model is ViT-L/14; we modify FAISS (Douze et al., 2024) to build a search index for the source dataset, LAION-2B-en (Schuhmann et al., 2022)."
Experiment Setup | Yes | "Finetuning Parameters (ViT-B/16 / ViT-L/14): finetune last 3 layers of text and vision encoders; batch size 128 / 64; learning rate 0.00064 / 0.00016; weight decay 1e-5; number of iterations (N) dataset dependent; learning rate decay none; softmax temperature 25; optimizer SGD with momentum 0.9; label smoothing 0; EMA weight averaging β 0.995; text prompt length 3; text prompt initialization 'a photo of'; text prompt learning rate multiplier 10; λ 0.2"
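The Pseudocode row cites Algorithm 1, "Paired k-means". The paired cross-modal variant is not reproduced in this report, but its building block, standard Lloyd's k-means over embedding vectors, can be sketched as follows (the function name and toy data are illustrative assumptions, not the paper's code):

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    # Standard Lloyd's k-means on the rows of X: a building block only,
    # not the paper's paired variant (which couples the two modalities).
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    assign = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # Assign each point to its nearest center (squared Euclidean).
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(axis=1)
        # Recompute each center as its cluster mean; keep the old
        # center if a cluster becomes empty.
        for j in range(k):
            pts = X[assign == j]
            if len(pts):
                centers[j] = pts.mean(axis=0)
    return centers, assign
```

On well-separated data this recovers the natural clustering in a handful of iterations; the paper's Algorithm 1 would run a variant of this assignment/update loop over retrieved image and text embeddings jointly.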
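The Software Dependencies row notes a (modified) FAISS index over LAION-2B-en embeddings from a ViT-L/14 indexing model. At its core such an index answers nearest-neighbor queries by inner product over L2-normalized embeddings; a minimal numpy sketch of that scoring (exact flat search on toy vectors, standing in for FAISS at scale, not the paper's modified pipeline) is:

```python
import numpy as np

def normalize(v):
    # Project embeddings onto the unit sphere so that the
    # inner product equals cosine similarity.
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def retrieve(query_emb, db_embs, k=3):
    # What a flat inner-product index (e.g. faiss.IndexFlatIP) computes:
    # score every database embedding against the query, return top-k.
    scores = normalize(db_embs) @ normalize(query_emb)
    topk = np.argsort(-scores)[:k]
    return topk, scores[topk]
```

In the MUDG setting the query would be a text (label-name) embedding and the database would hold source image embeddings, which is exactly the "retrieving across the modality gap" of the title.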
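The Experiment Setup row lists EMA weight averaging with β = 0.995. A hedged sketch of how such an exponential moving average of weights is typically maintained alongside SGD finetuning (plain Python with toy scalar weights; the update rule shown is the standard one, not code from the paper):

```python
def ema_update(ema_weights, weights, beta=0.995):
    # Exponential moving average of model weights, as in the listed
    # setup (beta = 0.995): ema <- beta * ema + (1 - beta) * current.
    return [beta * e + (1.0 - beta) * w for e, w in zip(ema_weights, weights)]

# Toy run: a single scalar "weight" drifting toward 1.0 under a
# stand-in update rule; the EMA copy trails it smoothly.
w = [0.0]
ema = list(w)
for _ in range(1000):
    w = [x + 0.001 * (1.0 - x) for x in w]  # placeholder for an SGD step
    ema = ema_update(ema, w)
```

The averaged weights, not the raw SGD iterates, are typically used at evaluation time, which tends to stabilize accuracy when finetuning only the last few transformer layers as described above.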