CABIN: Debiasing Vision-Language Models Using Backdoor Adjustments

Authors: Bo Pang, Tingrui Qiao, Caroline Walker, Chris Cunningham, Yun Sing Koh

IJCAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through comprehensive experiments and analyses, we demonstrate that CABIN effectively mitigates biases and improves fairness metrics while preserving the zero-shot strengths of VLMs. The code is available at: https://github.com/ipangbo/causal-debias
Researcher Affiliation | Academia | 1School of Computer Science, University of Auckland, Auckland, New Zealand 2The Liggins Institute, University of Auckland, Auckland, New Zealand 3Research Centre for Māori Health and Development, Massey University, Wellington, New Zealand EMAIL, EMAIL, EMAIL
Pseudocode | No | The paper describes its methods using mathematical formulations and textual descriptions, but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | The code is available at: https://github.com/ipangbo/causal-debias
Open Datasets | Yes | Evaluation Datasets. We use FACET [Gustafson et al., 2023], PATA [Seth et al., 2023], and Flickr30K [Plummer et al., 2015] to evaluate our debiasing method. Traditional face-centric datasets such as FairFace [Karkkainen and Joo, 2021], MS-COCO (MS) [Lin et al., 2014], and Pascal-Sentence (PS) [Rashtchian et al., 2010] are also used to show our method applies to various ranges of datasets and tasks.
Dataset Splits | No | The paper mentions using 'test data Dtest' for attribute distribution estimation and evaluating on several datasets, but it does not specify concrete training/validation/test split percentages, sample counts, or standard splits for the evaluation datasets (FACET, PATA, Flickr30K, FairFace, MS-COCO, Pascal-Sentence). It states only that, for mapper training, 'We randomly sampled 10 million paired image-text data from the dataset'.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory amounts used for the experiments; it refers only generally to 'computational resources'.
Software Dependencies | No | The paper does not list software dependencies with version numbers (e.g., Python, PyTorch, CUDA) needed to replicate the experiments.
Experiment Setup | Yes | To obtain high-confidence results for the model, we set ϵ to 0.5. ... The weighting factor λ balances the alignment loss L_align and the contrastive difference loss L_diff... We evaluate three settings (λ = 0, λ = 0.5, and λ = 1).
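As a concrete reading of the Experiment Setup row, the two reported hyperparameters can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes ϵ acts as a probability threshold for keeping high-confidence predictions and that λ enters as a simple weighted sum of the two losses; both forms, and the function names, are hypothetical.

```python
import numpy as np

def high_confidence_mask(probs: np.ndarray, eps: float = 0.5) -> np.ndarray:
    """Keep samples whose top class probability exceeds eps.

    Assumed reading of the paper's 'we set ϵ to 0.5'; the exact
    filtering rule is not specified in the source.
    """
    return probs.max(axis=1) > eps

def total_loss(l_align: float, l_diff: float, lam: float) -> float:
    """Hypothetical weighted combination L = L_align + λ * L_diff.

    The paper states only that λ balances the alignment loss and the
    contrastive difference loss; the additive form is an assumption.
    """
    return l_align + lam * l_diff

# Sweep the three settings the paper evaluates (λ = 0, 0.5, 1).
for lam in (0.0, 0.5, 1.0):
    print(lam, total_loss(0.8, 0.4, lam))
```

Under this reading, λ = 0 reduces training to the alignment loss alone, which matches the paper's use of the three settings as an ablation over the contrastive difference term.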