Multimodal Unsupervised Domain Generalization by Retrieving Across the Modality Gap
Authors: Christopher Liao, Christian So, Theodoros Tsiligkaridis, Brian Kulis
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We compare against state-of-the-art name-only transfer, source-free DG and zero-shot (ZS) methods on their respective benchmarks and show consistent improvement in accuracy on 20 diverse datasets. Code is available: https://github.com/Chris210634/mudg [...] 4 EXPERIMENTS We experiment with the ViT-B/16 and ViT-L/14 pretrained weights released by Radford et al. (2021) and available through the Python openclip package (Ilharco et al., 2021). |
| Researcher Affiliation | Collaboration | Christopher Liao (Boston University), Christian So (Boston University), Theodoros Tsiligkaridis (MIT Lincoln Laboratory), Brian Kulis (Boston University) |
| Pseudocode | Yes | Algorithm 1 Paired k-means [...] Algorithm 2 MUDG |
| Open Source Code | Yes | Code is available: https://github.com/Chris210634/mudg |
| Open Datasets | Yes | Datasets We experiment with a diverse set of target classification tasks. ImageNet-1K (Russakovsky et al., 2015), Caltech-101 (Li et al., 2022a), Oxford-Pets (Parkhi et al., 2012), Stanford-Cars (Krause et al., 2013), Flowers-102 (Nilsback and Zisserman, 2008), Food-101 (Bossard et al., 2014), FGVC-Aircraft (Maji et al., 2013), SUN-397 (Xiao et al., 2010), Describable-Textures (DTD) (Cimpoi et al., 2013), EuroSAT (Helber et al., 2019), UCF-101 (an action recognition dataset) (Soomro et al., 2012) in Table 2 and ImageNet-V2 (Recht et al., 2019), ImageNet-Sketch (Wang et al., 2019), ImageNet-A (natural adversarial examples) (Hendrycks et al., 2021b), and ImageNet-R (Hendrycks et al., 2021a) in Table 3 are commonly used by zero-shot papers, while Office-Home (Venkateswara et al., 2017), Terra Incognita (Beery et al., 2018), DomainNet (Peng et al., 2019), VLCS (Torralba and Efros, 2011), and PACS (Li et al., 2017a) are common DG and DA datasets. |
| Dataset Splits | Yes | Datasets We experiment with a diverse set of target classification tasks. ImageNet-1K (Russakovsky et al., 2015), Caltech-101 (Li et al., 2022a), Oxford-Pets (Parkhi et al., 2012), Stanford-Cars (Krause et al., 2013), Flowers-102 (Nilsback and Zisserman, 2008), Food-101 (Bossard et al., 2014), FGVC-Aircraft (Maji et al., 2013), SUN-397 (Xiao et al., 2010), Describable-Textures (DTD) (Cimpoi et al., 2013), EuroSAT (Helber et al., 2019), UCF-101 (an action recognition dataset) (Soomro et al., 2012) in Table 2 and ImageNet-V2 (Recht et al., 2019), ImageNet-Sketch (Wang et al., 2019), ImageNet-A (natural adversarial examples) (Hendrycks et al., 2021b), and ImageNet-R (Hendrycks et al., 2021a) in Table 3 are commonly used by zero-shot papers, while Office-Home (Venkateswara et al., 2017), Terra Incognita (Beery et al., 2018), DomainNet (Peng et al., 2019), VLCS (Torralba and Efros, 2011), and PACS (Li et al., 2017a) are common DG and DA datasets. |
| Hardware Specification | Yes | Hardware and Computational Cost We ran experiments on a hybrid computing cluster with A40, A100, and L40S GPUs. All experiments require only one GPU at a time. ViT-B/16 experiments require a GPU with 40 GB of memory; ViT-L/14 experiments require a GPU with 80 GB of memory. |
| Software Dependencies | No | We experiment with the ViT-B/16 and ViT-L/14 pretrained weights released by Radford et al. (2021) and available through the Python openclip package (Ilharco et al., 2021). The indexing model is ViT-L/14; we modify FAISS (Douze et al., 2024) to build a search index for the source dataset, LAION-2B-en (Schuhmann et al., 2022). |
| Experiment Setup | Yes | Finetuning Parameters (ViT-B/16 / ViT-L/14): finetune last 3 layers of the text and vision encoders; batch size 128 / 64; learning rate 0.00064 / 0.00016; weight decay 1e-5; number of iterations (N) dataset-dependent; learning rate decay none; softmax temperature 25; optimizer SGD (momentum = 0.9); label smoothing 0; EMA weight averaging β = 0.995; text prompt length 3; text prompt initialization "a photo of"; text prompt learning rate multiplier 10; λ = 0.2 |
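The retrieval step underlying MUDG (nearest-neighbor search over CLIP embeddings of the source dataset, which the paper implements with a modified FAISS index over LAION-2B-en) reduces to cosine-similarity top-k search. The dependency-free NumPy sketch below illustrates that search; the function name, array shapes, and toy data are illustrative assumptions, not the authors' code.

```python
import numpy as np

def normalize(x):
    # CLIP embeddings are compared by cosine similarity,
    # so L2-normalize along the feature dimension first.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def retrieve_top_k(query_text_emb, source_image_emb, k=3):
    """Return indices of the k source images closest to each text query.

    Stands in for a FAISS inner-product index over a large
    source set; names and shapes here are illustrative.
    """
    q = normalize(query_text_emb)    # (num_queries, d)
    s = normalize(source_image_emb)  # (num_source, d)
    sims = q @ s.T                   # cosine similarities
    # Sort descending per query and keep the top k indices.
    return np.argsort(-sims, axis=1)[:, :k]

# Toy example: 2 text queries, 5 "source" images, 4-dim embeddings.
rng = np.random.default_rng(0)
queries = rng.normal(size=(2, 4))
source = rng.normal(size=(5, 4))
idx = retrieve_top_k(queries, source, k=3)
print(idx.shape)  # (2, 3)
```

At LAION scale this brute-force matrix product is replaced by an approximate index; the ranking logic is the same.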
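The EMA weight averaging with β = 0.995 listed in the finetuning parameters corresponds to a standard exponential moving average over model weights, applied per parameter after each optimizer step. A minimal sketch, using plain NumPy arrays in place of model parameters:

```python
import numpy as np

def ema_update(ema_weights, current_weights, beta=0.995):
    """One EMA step: ema <- beta * ema + (1 - beta) * current.

    Mirrors the beta = 0.995 setting from the finetuning table;
    in training this runs once per iteration for every parameter.
    """
    return beta * ema_weights + (1.0 - beta) * current_weights

# Toy example: the EMA drifts slowly toward the current weights.
w_ema = np.zeros(3)
w_cur = np.ones(3)
for _ in range(100):
    w_ema = ema_update(w_ema, w_cur)
print(w_ema)  # entries strictly between 0 and 1
```

After t steps from zero toward a constant target, the EMA equals 1 - β^t, which is why a β close to 1 smooths over many iterations.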