Multimodal Unsupervised Domain Generalization by Retrieving Across the Modality Gap
Authors: Christopher Liao, Christian So, Theodoros Tsiligkaridis, Brian Kulis
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We compare against state-of-the-art name-only transfer, source-free DG and zero-shot (ZS) methods on their respective benchmarks and show consistent improvement in accuracy on 20 diverse datasets. Code is available: https://github.com/Chris210634/mudg [...] 4 EXPERIMENTS We experiment with the ViT-B/16 and ViT-L/14 pretrained weights released by Radford et al. (2021) and available through the Python openclip package (Ilharco et al., 2021). |
| Researcher Affiliation | Collaboration | Christopher Liao (Boston University), Christian So (Boston University), Theodoros Tsiligkaridis (MIT Lincoln Laboratory), Brian Kulis (Boston University) |
| Pseudocode | Yes | Algorithm 1 Paired k-means [...] Algorithm 2 MUDG |
| Open Source Code | Yes | Code is available: https://github.com/Chris210634/mudg |
| Open Datasets | Yes | Datasets We experiment with a diverse set of target classification tasks. ImageNet-1K (Russakovsky et al., 2015), Caltech-101 (Li et al., 2022a), Oxford-Pets (Parkhi et al., 2012), Stanford-Cars (Krause et al., 2013), Flowers-102 (Nilsback and Zisserman, 2008), Food-101 (Bossard et al., 2014), FGVC-Aircraft (Maji et al., 2013), SUN-397 (Xiao et al., 2010), Describable-Textures (DTD) (Cimpoi et al., 2013), EuroSAT (Helber et al., 2019), UCF-101 (an action recognition dataset) (Soomro et al., 2012) in Table 2 and ImageNet-V2 (Recht et al., 2019), ImageNet-Sketch (Wang et al., 2019), ImageNet-A (natural adversarial examples) (Hendrycks et al., 2021b), and ImageNet-R (Hendrycks et al., 2021a) in Table 3 are commonly used by zero-shot papers, while Office-Home (Venkateswara et al., 2017), Terra Incognita (Beery et al., 2018), DomainNet (Peng et al., 2019), VLCS (Torralba and Efros, 2011), and PACS (Li et al., 2017a) are common DG and DA datasets. |
| Dataset Splits | Yes | Datasets We experiment with a diverse set of target classification tasks. ImageNet-1K (Russakovsky et al., 2015), Caltech-101 (Li et al., 2022a), Oxford-Pets (Parkhi et al., 2012), Stanford-Cars (Krause et al., 2013), Flowers-102 (Nilsback and Zisserman, 2008), Food-101 (Bossard et al., 2014), FGVC-Aircraft (Maji et al., 2013), SUN-397 (Xiao et al., 2010), Describable-Textures (DTD) (Cimpoi et al., 2013), EuroSAT (Helber et al., 2019), UCF-101 (an action recognition dataset) (Soomro et al., 2012) in Table 2 and ImageNet-V2 (Recht et al., 2019), ImageNet-Sketch (Wang et al., 2019), ImageNet-A (natural adversarial examples) (Hendrycks et al., 2021b), and ImageNet-R (Hendrycks et al., 2021a) in Table 3 are commonly used by zero-shot papers, while Office-Home (Venkateswara et al., 2017), Terra Incognita (Beery et al., 2018), DomainNet (Peng et al., 2019), VLCS (Torralba and Efros, 2011), and PACS (Li et al., 2017a) are common DG and DA datasets. |
| Hardware Specification | Yes | Hardware and Computational Cost We ran experiments on a hybrid computing cluster with A40, A100, and L40S GPUs. All experiments require only one GPU at a time. ViT-B/16 experiments require a GPU with 40 GB of memory; ViT-L/14 experiments require a GPU with 80 GB of memory. |
| Software Dependencies | No | We experiment with the ViT-B/16 and ViT-L/14 pretrained weights released by Radford et al. (2021) and available through the Python openclip package (Ilharco et al., 2021). The indexing model is ViT-L/14; we modify FAISS (Douze et al., 2024) to build a search index for the source dataset, LAION-2B-en (Schuhmann et al., 2022). |
| Experiment Setup | Yes | Finetuning Parameters (ViT-B/16 / ViT-L/14): finetune last 3 layers of the text and vision encoders; batch size 128 / 64; learning rate 0.00064 / 0.00016; weight decay 1e-5; number of iterations (N) dataset-dependent; learning rate decay none; softmax temperature 25; optimizer SGD (momentum = 0.9); label smoothing 0; EMA weight averaging β = 0.995; text prompt length 3; text prompt initialization "a photo of"; text prompt learning rate multiplier 10; λ = 0.2 |
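The retrieval step underlying MUDG (nearest-neighbor search over CLIP embeddings of the source dataset, which the paper implements with a modified FAISS index over LAION-2B-en) reduces to cosine-similarity top-k search. The dependency-free NumPy sketch below illustrates that search; the function name, array shapes, and toy data are illustrative assumptions, not the authors' code.

```python
import numpy as np

def normalize(x):
    # CLIP embeddings are compared by cosine similarity,
    # so L2-normalize along the feature dimension first.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def retrieve_top_k(query_text_emb, source_image_emb, k=3):
    """Return indices of the k source images closest to each text query.

    Stands in for a FAISS inner-product index over a large
    source set; names and shapes here are illustrative.
    """
    q = normalize(query_text_emb)    # (num_queries, d)
    s = normalize(source_image_emb)  # (num_source, d)
    sims = q @ s.T                   # cosine similarities
    # Sort descending per query and keep the top k indices.
    return np.argsort(-sims, axis=1)[:, :k]

# Toy example: 2 text queries, 5 "source" images, 4-dim embeddings.
rng = np.random.default_rng(0)
queries = rng.normal(size=(2, 4))
source = rng.normal(size=(5, 4))
idx = retrieve_top_k(queries, source, k=3)
print(idx.shape)  # (2, 3)
```

At LAION scale this brute-force matrix product is replaced by an approximate index; the ranking logic is the same.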
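The EMA weight averaging with β = 0.995 listed in the finetuning parameters corresponds to a standard exponential moving average over model weights, applied per parameter after each optimizer step. A minimal sketch, using plain NumPy arrays in place of model parameters:

```python
import numpy as np

def ema_update(ema_weights, current_weights, beta=0.995):
    """One EMA step: ema <- beta * ema + (1 - beta) * current.

    Mirrors the beta = 0.995 setting from the finetuning table;
    in training this runs once per iteration for every parameter.
    """
    return beta * ema_weights + (1.0 - beta) * current_weights

# Toy example: the EMA drifts slowly toward the current weights.
w_ema = np.zeros(3)
w_cur = np.ones(3)
for _ in range(100):
    w_ema = ema_update(w_ema, w_cur)
print(w_ema)  # entries strictly between 0 and 1
```

After t steps from zero toward a constant target, the EMA equals 1 - β^t, which is why a β close to 1 smooths over many iterations.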