Data Matters Most: Auditing Social Bias in Contrastive Vision–Language Models

Authors: Zahraa Al Sahili, Ioannis Patras, Matthew Purver

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We systematically disentangle three design factors (model size, training-data scale, and training-data source) by comparing CLIP and OpenCLIP, two models that share an identical contrastive objective yet differ in encoder width and in the image–text corpora on which they are pre-trained (400M proprietary pairs vs. 400M/2B LAION). Across balanced face-analysis benchmarks, enlarging the encoder reduces gender skew in CLIP but amplifies both gender and racial skew in OpenCLIP; increasing the LAION corpus from 400M to 2B further increases OpenCLIP bias. At matched model and data budgets, substituting proprietary data with LAION improves gender fairness while increasing racial skew, underscoring data source as the primary driver of bias patterns. We also evaluate three post-hoc, test-time debiasing strategies: Bias Prompts, Prompt Array, and SANER.
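The skew comparisons quoted above can be illustrated with a minimal metric. The paper's exact bias formula is not reproduced in this report, so the max/min selection-rate ratio below is an illustrative stand-in, not the authors' metric:

```python
from collections import Counter

def selection_skew(predicted_groups):
    """Illustrative skew score: ratio of the most- to least-frequently
    selected demographic group among a model's top-1 retrievals on a
    balanced benchmark. A perfectly balanced model yields 1.0; larger
    values indicate more skew. NOTE: this max/min ratio is a simplified
    stand-in for the paper's bias measure.
    """
    counts = Counter(predicted_groups)
    return max(counts.values()) / min(counts.values())

# Toy example: top-1 retrievals for a neutral prompt over a balanced set.
print(selection_skew(["male", "male", "male", "female"]))  # 3.0
print(selection_skew(["male", "female"]))                  # 1.0 (balanced)
```

A real audit would feed in the predicted group per image across the FairFace or PATA validation set rather than a toy list.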
Researcher Affiliation | Academia | Zahraa Al Sahili (EMAIL), Queen Mary University of London, UK; Ioannis Patras (EMAIL), Queen Mary University of London, UK; Matthew Purver (EMAIL), Queen Mary University of London, UK, and Institut Jožef Stefan, Slovenia
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks; it describes methodologies in narrative text and mathematical equations.
Open Source Code | Yes | We release code and evaluation scripts to enable transparent, reproducible auditing of future VLMs. Code available at https://github.com/zahraaalsahili/CLIP_Bias.
Open Datasets | Yes | FairFace contains 108 501 cropped, face-only portraits labelled for seven self-identified race categories {White, Black, Indian, East Asian, South East Asian, Middle Eastern, Latino} and binary gender, sampled from Flickr under a CC BY-NC licence (Karkkainen & Joo, 2021). PATA: the Protected Attribute Tag Association (PATA) benchmark comprises 4 934 images of people organised into 24 scenes (e.g., office, lab, sports), each annotated with binary gender (male/female) and five ethno-racial identities {Black, Caucasian, East Asian, Hispanic/Latino, Indian} (Seth et al., 2023).
Dataset Splits | Yes | FairFace contains 108 501 cropped, face-only portraits labelled for seven self-identified race categories {White, Black, Indian, East Asian, South East Asian, Middle Eastern, Latino} and binary gender, sampled from Flickr under a CC BY-NC licence (Karkkainen & Joo, 2021). We draw the validation subset of 10 954 portraits such that every race–gender combination contains 782 images.
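A balanced validation subset of the kind described (a fixed count per race–gender cell) can be sketched as below. The record layout (dicts with `race`/`gender` keys) and the stratified-sampling helper are assumptions for illustration, not the authors' code:

```python
import random
from collections import defaultdict

def balanced_subset(records, per_cell=782, seed=0):
    """Group records by (race, gender) and sample `per_cell` images from
    each cell, mirroring a balanced validation split. Raises if any cell
    is too small to supply `per_cell` items."""
    rng = random.Random(seed)
    cells = defaultdict(list)
    for rec in records:
        cells[(rec["race"], rec["gender"])].append(rec)
    subset = []
    for key, items in sorted(cells.items()):
        if len(items) < per_cell:
            raise ValueError(f"cell {key} has only {len(items)} images")
        subset.extend(rng.sample(items, per_cell))
    return subset

# Toy run: 2 races x 2 genders, 5 images each, 3 sampled per cell.
toy = [{"race": r, "gender": g, "path": f"{r}_{g}_{i}.jpg"}
       for r in ("A", "B") for g in ("m", "f") for i in range(5)]
print(len(balanced_subset(toy, per_cell=3)))  # 12
```

With seven races, two genders, and `per_cell=782`, the same logic yields the balanced FairFace-style validation pool referenced above.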
Hardware Specification | Yes | A single NVIDIA A100 (40 GB) processes the full benchmark in under thirty minutes.
Software Dependencies | No | The paper mentions using CLIP and OpenCLIP models and notes that Prompt Array uses 'the hyperparameters from the AACL-22 code release' and SANER 'follows the SD-XL caption protocol of Hirota et al. (2025)', but does not provide specific version numbers for software libraries or dependencies used in the implementation.
Experiment Setup | Yes | Images are resized to 224×224 for ViT-B/32 and 336×336 for ViT-L/14 to match pre-training. Caption embeddings use the checkpoint-specific temperature τ without test-time augmentation. BP uses the authors' original prompt pairs and the calibrated projection matrix released with the paper. PA is trained for three epochs on 90k FairFace images plus the same number of LAION images, using the hyperparameters from the AACL-22 code release (λ_ITC = 0.05). SANER follows the SD-XL caption protocol of Hirota et al. (2025) and is trained for five epochs on COCO 2017.
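The temperature-scaled zero-shot step mentioned above (caption embeddings scored with the checkpoint-specific τ) works as sketched below. Plain Python lists stand in for the model's image and caption embeddings, and the τ value shown is a common CLIP default, not a value stated in this report:

```python
import math

def zero_shot_probs(image_emb, caption_embs, temperature=0.07):
    """CLIP-style zero-shot scoring: L2-normalise the image and caption
    embeddings, compute cosine similarities, divide by the temperature
    tau, and softmax over the candidate captions. In a real run tau is
    read from the checkpoint, not hard-coded."""
    def normalise(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    img = normalise(image_emb)
    logits = []
    for cap in caption_embs:
        c = normalise(cap)
        sim = sum(a * b for a, b in zip(img, c))  # cosine similarity
        logits.append(sim / temperature)

    # Numerically stable softmax over caption logits.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Toy example: the image embedding aligns with the first caption.
probs = zero_shot_probs([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
print(probs[0] > probs[1])  # True
```

Debiasing methods like Bias Prompts or Prompt Array intervene before this step, modifying the embeddings that feed into the similarity computation.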