Revisiting B2T: Discovering and Mitigating Visual Biases through Keyword Explanations

Authors: Faissal El Kayouhi, Aïda Asma, Joey Laarhoven, Fiona Nagelhout

TMLR 2025

Reproducibility Variable Result LLM Response
Research Type: Experimental. This work aims to reproduce and extend the findings of "Discovering and Mitigating Visual Biases through Keyword Explanation" by Kim et al. (2024). The paper proposes the B2T framework, which detects and mitigates visual biases by extracting keywords from generated captions. By identifying biases in datasets, B2T contributes to the prevention of discriminatory behavior in vision-language models. We investigate the five key claims from the original paper, namely that B2T (i) is able to identify whether a word represents a bias, (ii) can extract these keywords from captions of mispredicted images, (iii) outperforms other bias discovery models, (iv) can improve CLIP zero-shot prompting with the discovered keywords, and (v) identifies labeling errors in a dataset. To reproduce their results, we use the publicly available codebase and our re-implementations. Our findings confirm the first three claims and partially validate the fourth. We reject the fifth claim, as B2T failed to identify pertinent labeling errors. Finally, we extend the original work by optimizing the efficiency of the implementation and assessing the generalizability of B2T on a new dataset.
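Claims (i) and (ii) hinge on ranking caption keywords by how strongly they associate with mispredicted images. A minimal sketch of that idea, assuming a simple frequency-difference score in place of the CLIP similarity score the paper actually uses (the function and example data are ours, not from the codebase):

```python
from collections import Counter

def rank_bias_keywords(wrong_captions, right_captions, keywords):
    """Toy stand-in for B2T's keyword ranking: score each candidate
    keyword by how much more often it appears in captions of
    mispredicted images than in captions of correct ones.
    (B2T itself scores keywords with CLIP similarity.)"""
    wrong = Counter(w for c in wrong_captions for w in c.lower().split())
    right = Counter(w for c in right_captions for w in c.lower().split())
    scores = {
        k: wrong[k] / max(len(wrong_captions), 1)
           - right[k] / max(len(right_captions), 1)
        for k in keywords
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# "forest" dominates captions of mispredicted waterbirds -> likely bias
wrong = ["a bird in a forest", "bird standing in a forest"]
right = ["a bird over the ocean", "a bird on the water"]
print(rank_bias_keywords(wrong, right, ["forest", "water", "bird"]))
```

Keywords that occur mostly in the error set (like "forest" here) surface at the top, mirroring how B2T flags spurious background cues.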
Researcher Affiliation: Academia. Aïda Asma* (EMAIL), University of Amsterdam; Faissal El Kayouhi* (EMAIL), University of Amsterdam; Joey Laarhoven* (EMAIL), University of Amsterdam; Fiona Nagelhout* (EMAIL), University of Amsterdam.
Pseudocode: No. The paper describes the B2T framework and its components (ClipCap, YAKE, CLIP score, B2T-DRO, CLIP zero-shot classification) in text form within the methodology section, but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code: Yes. The authors' codebase is publicly available; however, it is not complete enough to replicate all findings. We discuss which parts of the code were incomplete and our approach to implementing them based on the original paper. Afterward, the methodology for our own additional experiments is presented. [...] Our code is also publicly available.2 https://github.com/Joeyjdl/b2t-reproduction
Open Datasets: Yes. To identify known dataset biases and the applications of B2T keywords, we trained and tested the B2T framework on the CelebA and Waterbirds datasets. Our approach to analyzing biases without a classifier was also performed on these datasets. Both are included in the codebase with pre-trained model checkpoints. CelebA. CelebA, which stands for CelebFaces Attributes, contains 202,599 images of celebrities' faces (Liu et al., 2015). Waterbirds. Waterbirds is a dataset created by Sagawa et al. (2020b) consisting of artificially created images of cropped-out birds (Wah et al., 2011) transferred onto backgrounds from the Places dataset by Zhou et al. (2018). FairFace. FairFace is a dataset created by Karkkainen & Joo (2021) consisting of 108,501 face images collected from the YFCC-100M Flickr dataset. ImageNet. For identifying undiscovered dataset biases, the models were trained and tested on the ImageNet dataset. This is also included in our codebase. The used subset of ImageNet contains over a million images (Russakovsky et al., 2015).
Dataset Splits: Yes. CelebA. CelebA, which stands for CelebFaces Attributes, contains 202,599 images of celebrities' faces (Liu et al., 2015). Each image has 40 binary attribute annotations, such as blonde and not blonde. The attribute "blonde" was employed as a binary target in the work of Kim et al. (2024), with gender as the underlying bias label. This setting is biased against blonde males, who make up only 0.85% of the dataset compared to 14.05% for blonde females. Waterbirds. Waterbirds is a dataset created by Sagawa et al. (2020b) consisting of artificially created images of cropped-out birds (Wah et al., 2011) transferred onto backgrounds from the Places dataset by Zhou et al. (2018). The classes are the type of bird in the image: a waterbird or a landbird. Another attribute is the type of background of the image: a water background or a land background. Waterbirds has a total of 11,788 images. Within the dataset there is a bias against data points with conflicting backgrounds (the groups "waterbird on land" and "landbird on water"), as they collectively represent only 6% of the training set instead of the 50% one might expect in a balanced case. ImageNet. For identifying undiscovered dataset biases, the models were trained and tested on the ImageNet dataset. This is also included in our codebase. The used subset of ImageNet contains over a million images (Russakovsky et al., 2015). In the validation set, there are 50 images per class. This subset was used instead of the entire dataset due to limited computational resources. FairFace. [...] If a group is investigated in an experiment, the data concerning this group will be downsampled to a 3% ratio, similar to the ratios of the minorities in the Waterbirds dataset.
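The FairFace downsampling step can be sketched as follows. This is our reading of "downsampled to a 3% ratio" (the investigated group is shrunk until it makes up 3% of the resulting set, mirroring the minority-group ratios in Waterbirds); the helper name and signature are hypothetical, not from the codebase:

```python
import random

def downsample_group(samples, group_labels, target_group, ratio=0.03, seed=0):
    """Shrink one group so it forms `ratio` of the returned dataset,
    keeping all samples from the other groups (illustrative sketch)."""
    rng = random.Random(seed)
    target = [s for s, g in zip(samples, group_labels) if g == target_group]
    rest = [s for s, g in zip(samples, group_labels) if g != target_group]
    # Solve n_target / (n_target + len(rest)) == ratio for n_target.
    n_target = round(ratio * len(rest) / (1 - ratio))
    kept = rng.sample(target, min(n_target, len(target)))
    return rest + kept
```

With 970 majority samples, for example, the minority group is cut to 30 images so that it contributes 3% of the 1,000 resulting samples.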
Hardware Specification: Yes. The B2T pipeline, without training a classifier, was run on a T4 GPU available via Google Colab. Captioning the CelebA validation set, extracting keywords, and calculating the CLIP scores takes around 30 minutes. For the smaller Waterbirds dataset, this process was performed in 10 minutes. The computationally heavy DRO training was run on an A100 GPU on Snellius, which consumes 512 SBUs per GPU hour for a full node (SURF User Knowledge Base, 2025). Training a DRO model on the CelebA dataset for the given 50 epochs took 4 hours, whilst training the same DRO model for 300 epochs on the Waterbirds dataset took 2 hours. The total SBUs spent to train all models for 3 seeds is about 5542 SBUs, equivalent to 10.8 GPU hours. Using the 2024 carbon intensity of the Netherlands, which is 0.37 kg CO2eq/kWh (Nowtricity, 2025), and the PUE of the SURF HPC datacenter, equal to 1.2 (SURF, 2017), we calculate the total emissions according to equation 3: CO2e = CI · PUE · P · t (3). The A100 GPU has a maximum power consumption of 0.25 kW, and the AMD EPYC 9934 CPU draws 0.21 kW (NVIDIA, 2020), leading to a carbon footprint of 2.21 kg CO2e. The T4 has a power consumption of 0.07 kW (NVIDIA, 2021). We spent approximately 20 hours on this GPU, adding 0.6216 kg CO2e for a total of 2.83 kg CO2e.
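Equation 3 can be checked in a few lines of Python. The helper name is ours; for the T4 we assume its 70 W (0.07 kW) TDP, which is the value that reproduces the reported 0.62 kg CO2e:

```python
def carbon_kg(ci, pue, power_kw, hours):
    """CO2e = CI * PUE * P * t, with CI in kg CO2eq/kWh,
    P in kW, and t in hours; returns kg CO2eq."""
    return ci * pue * power_kw * hours

CI, PUE = 0.37, 1.2                           # NL carbon intensity, SURF PUE
a100 = carbon_kg(CI, PUE, 0.25 + 0.21, 10.8)  # A100 GPU + EPYC CPU, 10.8 h
t4 = carbon_kg(CI, PUE, 0.07, 20)             # T4 on Colab, ~20 h
print(round(a100, 2), round(t4, 2), round(a100 + t4, 2))  # 2.21 0.62 2.83
```

The A100 term combines GPU and CPU draw (0.46 kW) over the 10.8 GPU hours implied by the 5542-SBU budget.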
Software Dependencies: No. The paper mentions several models and algorithms, such as ClipCap (Mokady et al., 2021), the YAKE algorithm (Campos et al., 2020), CLIP (Radford et al., 2021), ResNet-50 (He et al., 2016), and the PyTorch library, but it does not specify version numbers for these software components or libraries.
Experiment Setup: Yes. The hyperparameter values specified by Kim et al. (2024) were used to reproduce their work. The batch size, however, was changed to 512, as we noticed the original authors employed a batch size of only one during captioning. In Section 4 we detail the resulting difference in GPU usage and captioning time. The original paper does not include the seed values used to run the models, so we enhance future reproducibility by reporting the seeds we employed: 32, 16, and 8 for DRO and DRO-B2T on CelebA and Waterbirds. Our codebase includes the test performance for all seeds.
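Aggregating the three seeded runs can be sketched as follows. The training routine here is a placeholder with our own names; a real run would seed torch (torch.manual_seed) and numpy before training and return the actual test accuracy:

```python
import random
import statistics

SEEDS = [32, 16, 8]  # seeds used for the DRO and DRO-B2T runs

def train_and_eval(seed):
    # Placeholder for one DRO training + test run; only the seeding
    # pattern is illustrative, the returned accuracy is synthetic.
    rng = random.Random(seed)
    return 0.85 + rng.uniform(-0.02, 0.02)  # stand-in test accuracy

accs = [train_and_eval(s) for s in SEEDS]
print(f"mean {statistics.mean(accs):.3f} ± std {statistics.stdev(accs):.3f}")
```

Reporting mean and standard deviation over the fixed seed list keeps the published numbers reproducible run-to-run.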