Revealing and Reducing Gender Biases in Vision and Language Assistants (VLAs)
Authors: Leander Girrbach, Stephan Alaniz, Yiran Huang, Trevor Darrell, Zeynep Akata
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We study gender bias in 22 popular open-source VLAs with respect to personality traits, skills, and occupations. Our results show that VLAs replicate human biases likely present in the data, such as real-world occupational imbalances. Similarly, they tend to attribute more skills and positive personality traits to women than to men, and we see a consistent tendency to associate negative personality traits with men. To eliminate the gender bias in these models, we find that fine-tuning-based debiasing methods achieve the best trade-off between debiasing and retaining performance on downstream tasks. We argue for pre-deploying gender bias assessment in VLAs and motivate further development of debiasing strategies to ensure equitable societal outcomes. |
| Researcher Affiliation | Academia | ¹Technical University of Munich, Munich Center for Machine Learning, MDSI; ²Helmholtz Munich; ³UC Berkeley |
| Pseudocode | No | The paper describes methods and processes in narrative text and uses equations for loss functions and importance scores, but it does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/ExplainableML/vla-gender-bias. We ensure the reproducibility of our study by making all prompts and data processing steps needed to obtain our results publicly available. The subset and all code used to derive it are available at https://github.com/ExplainableML/vla-gender-bias. This includes the judgments by InternVL2-40B on occupation-related content in the images. In addition, we will release all prompts used in our work, as well as code to generate them. |
| Open Datasets | Yes | Our image dataset is curated using the images of individuals from FairFace (Karkkainen & Joo, 2021), MIAP (Schumann et al., 2021), Phase (Garcia et al., 2023), and PATA (Seth et al., 2023) datasets, as they contain annotations for gender information, and all except MIAP also contain annotations for ethnicity. The original image data is available from the respective publications, i.e. (Karkkainen & Joo, 2021) for FairFace, (Schumann et al., 2021) for MIAP, (Seth et al., 2023) for PATA, and (Garcia et al., 2023) for Phase. |
| Dataset Splits | Yes | Our VL-Gender evaluation contains 5,000 images, i.e. 1,000 images from each dataset, balanced for the gender and ethnicity attributes where available. Since some methods involve training, we split the traits/skills/occupations into equally sized train/test portions and only use the test portion for evaluating methods. The train portion is exclusively used for training. Likewise, the prompt variations are split into train and test portions, and we use a new set of images from the original datasets for training but reuse the images of our main analysis for evaluation. |
| Hardware Specification | Yes | The authors gratefully acknowledge the Gauss Centre for Supercomputing e.V. (www.gauss-centre.eu) for funding this project by providing computing time on the GCS Supercomputer JUWELS at Jülich Supercomputing Centre (JSC). |
| Software Dependencies | No | The paper mentions several tools and models like 'ChatGPT', 'InternVL2-40B', 'GPT-4V', 'LLaVA', 'InternVL', 'CLIP', 'VLMEvalKit', etc., but it does not specify version numbers for any ancillary software or libraries used for the authors' own implementation. |
| Experiment Setup | Yes | H HYPERPARAMETERS OF DEBIASING METHODS. Full Finetuning: We train all parameters in the transformer blocks of the VLA's LLM. As optimizer, we use stochastic gradient descent with batch size 1 and a learning rate of 0.0001. We train for at most 20,000 steps but stop early if the loss is below 0.05 for 10 consecutive steps. LoRA Finetuning: Hyperparameters for LoRA Finetuning are the same as for Full Finetuning. The LoRAs are applied to all linear layers in the transformer blocks; the LoRA rank is 128, LoRA α is also 128, and we do not apply LoRA dropout. Prompt Tuning: Here, we insert 20 learnable tokens after the BOS token, i.e. also before the image. The learnable tokens are trained by stochastic gradient descent using the same hyperparameters as for Full Finetuning (including early stopping), but the learning rate is 0.001 and the maximum number of training steps is 10,000. |
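The early-stopping rule quoted in the setup row (halt once the loss stays below 0.05 for 10 consecutive steps, or after the step budget is exhausted) can be sketched as a small helper. This is a minimal illustration, not the authors' code; the function name and signature are hypothetical.

```python
def should_stop(loss_history, threshold=0.05, patience=10, max_steps=20000):
    """Early-stopping check matching the paper's description.

    Stops when `max_steps` losses have been recorded, or when the last
    `patience` recorded losses are all strictly below `threshold`.
    """
    if len(loss_history) >= max_steps:
        return True
    if len(loss_history) < patience:
        return False
    # All of the most recent `patience` losses must be below the threshold.
    return all(loss < threshold for loss in loss_history[-patience:])
```

For example, ten consecutive losses of 0.04 trigger a stop, while a single sub-threshold loss after nine higher ones does not.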