Aligning Visual Contrastive learning models via Preference Optimization
Authors: Amirabbas Afzali, Borna Khodabandeh, Ali Rasekh, Mahyar JafariNodeh, Sepehr Ranjbar, Simon Gottschalk
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate that models trained using PO outperform standard contrastive learning techniques while retaining their ability to handle adversarial challenges and maintain accuracy on other downstream tasks. This makes our method well-suited for tasks requiring fairness, robustness, and alignment with specific preferences. We evaluate our method for tackling typographic attacks on images and explore its ability to disentangle gender concepts and mitigate gender bias, showcasing the versatility of our approach. |
| Researcher Affiliation | Academia | Amirabbas Afzali & Borna Khodabandeh; Ali Rasekh, L3S Research Center, Leibniz Universität Hannover, Germany; Mahyar Jafari Nodeh, Massachusetts Institute of Technology, USA; Sepehr Kazemi; Simon Gottschalk, L3S Research Center, Leibniz Universität Hannover, Germany |
| Pseudocode | Yes | Algorithm 1 Preference-based contrastive optimization Algorithm 2 Preference optimization for contrastive learning with BMA details |
| Open Source Code | Yes | The code is available on GitHub. |
| Open Datasets | Yes | To evaluate the classification accuracy of our method on both original and typographic images (results in Table 1), we consider 9 datasets: ImageNet-100 (Deng et al., 2009), Caltech101 (Fei-Fei et al., 2004), Oxford Pets (Parkhi et al., 2012), Stanford Cars (Krause et al., 2013), Flowers102 (Nilsback & Zisserman, 2008), FGVC-Aircraft (Maji et al., 2013), DTD (Cimpoi et al., 2014), SUN397 (Xiao et al., 2016) and EuroSAT (Helber et al., 2019). Additionally, for the gender bias results and analysis, we considered the VL-Bias (Zhang et al., 2022) dataset. We also consider Food101 (Bossard et al., 2014) for additional experiments, including saliency maps in Section C. |
| Dataset Splits | No | The paper mentions using several datasets such as ImageNet-100 and Food101 for training, and SUN as a zero-shot dataset, and distinguishes between training and evaluation contexts (e.g., in-domain vs. zero-shot). However, the main text does not explicitly state the percentages or absolute sample counts of the training, validation, and test splits for these datasets. |
| Hardware Specification | Yes | Setup: For all experiments, we used 8 A100 GPUs, each with 40GB of memory. |
| Software Dependencies | No | The paper mentions the Adamax optimizer and a cosine scheduler, but it does not specify version numbers for any programming languages, libraries, or frameworks used in the implementation. |
| Experiment Setup | Yes | For the typographic attack experiments, hyperparameters such as β and λ were selected based on our empirical studies described in the appendix. Specifically, we used β = λ = 1 for DPO, β = λ = 0.01 for IPO, and β = 1.5, λ = 0.01 for KTO. In disentangling gender bias, we also set β = λ = 1. All models were trained for three epochs with a batch size of 512. The learning rate was set to 2×10⁻⁵, and we employed the Adamax optimizer with a linear warmup and a cosine scheduler, setting the warmup ratio to 0.1. We set the random seed to 0 for all experiments to ensure reproducibility. |
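The learning-rate schedule quoted above (linear warmup over the first 10% of steps, then cosine decay from a base rate of 2×10⁻⁵) can be sketched as a standalone function. This is a minimal illustration of the standard warmup-plus-cosine recipe, not the authors' code; the function name and exact decay endpoint (decaying to zero) are assumptions.

```python
import math

def lr_at_step(step, total_steps, base_lr=2e-5, warmup_ratio=0.1):
    """Illustrative linear-warmup + cosine-decay schedule.

    step:         current optimizer step (0-indexed)
    total_steps:  total training steps (epochs * batches per epoch)
    base_lr:      peak learning rate (2e-5 in the reported setup)
    warmup_ratio: fraction of steps used for linear warmup (0.1 reported)
    """
    warmup_steps = int(warmup_ratio * total_steps)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Cosine decay from base_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

In a PyTorch training loop this multiplier would typically be wired in via `torch.optim.lr_scheduler.LambdaLR` around the Adamax optimizer; whether the paper's schedule decays fully to zero is not stated in the excerpt.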