Position: The Categorization of Race in ML is a Flawed Premise

Authors: Miriam Doh, Benedikt Höltgen, Piera Riccio, Nuria M Oliver

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To illustrate this phenomenon, we conducted an experiment with Stable Diffusion 3.5 Large (Huggingface, 2024), Midjourney (Midjourney, 2024), and DALL-E 3 (OpenAI, 2024), using the prompts a mixed-race person (MR-PROMPT) and a mulatto person (MU-PROMPT)... To empirically validate this hypothesis, we then generated images using each model under both prompts. Specifically, 30 images per prompt were generated for both Stable Diffusion and Midjourney... From a quantitative perspective, the cosine similarity between the embeddings (extracted using CLIP ViT-B/32) of the two sets of images was 0.8190 for Stable Diffusion and 0.7687 for Midjourney... The mean cosine similarity of embeddings (also extracted using CLIP ViT-B/32) within the CFD-MR dataset is 0.7883, whereas the generated images with the MR-PROMPT exhibit larger internal similarity (0.8694 for Stable Diffusion and 0.8265 for Midjourney)... To illustrate that these psychological findings are also relevant in AI, we tested whether Visual Question Answering (VQA) systems exhibit the same perceptual bias (see Appendix E for the details) as humans. We selected ChatGPT-4 (OpenAI, 2023) and Gemini-2.0 (Google AI, 2024), two widely used VQA models, and presented them with the original FRL stimulus (Levin & Banaji, 2006), testing each model ten times (N=10) to assess response consistency.
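The similarity numbers quoted above (between-set similarity of the two prompts' images, and "internal" similarity within one set) can be sketched as follows. This is a minimal illustration, not the authors' code: `cosine`, `mean_pairwise_similarity`, and `mean_cross_similarity` are hypothetical helper names, and the toy 3-D vectors stand in for real CLIP ViT-B/32 embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_pairwise_similarity(embs):
    """Mean cosine similarity over all unordered pairs within one set
    (the 'internal similarity' of one prompt's generated images)."""
    pairs = [(i, j) for i in range(len(embs)) for j in range(i + 1, len(embs))]
    return sum(cosine(embs[i], embs[j]) for i, j in pairs) / len(pairs)

def mean_cross_similarity(set_a, set_b):
    """Mean cosine similarity over all pairs drawn from two sets
    (how the MR-PROMPT and MU-PROMPT image sets would be compared)."""
    sims = [cosine(u, v) for u in set_a for v in set_b]
    return sum(sims) / len(sims)

# Toy 3-D embeddings in place of real CLIP ViT-B/32 outputs.
mr_set = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1]]
mu_set = [[0.8, 0.3, 0.0], [1.0, 0.0, 0.2]]
print(mean_cross_similarity(mr_set, mu_set))
print(mean_pairwise_similarity(mr_set))
```

In this reading, a cross-set similarity close to the within-set similarity would indicate that the two prompts yield visually near-identical outputs, which is the comparison the paper's 0.8190 / 0.7687 figures support.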
Researcher Affiliation | Academia | 1 ISIA Lab, Université de Mons, Mons, Belgium; 2 IRIDIA Lab, Université Libre de Bruxelles, Ixelles, Belgium; 3 University of Tübingen, Germany; 4 Ellis Alicante, Spain.
Pseudocode | No | The paper describes methods and concepts in narrative text and does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide a repository link or an explicit statement about releasing source code for the methodology described. It refers to third-party models and datasets but not to custom code by the authors.
Open Datasets | Yes | Racial labels such as White, Black, or Asian are indeed widely used in image (Phillips et al., 2000; Ricanek & Tesafaye, 2006; Zhang et al., 2017; Karkkainen & Joo, 2021) and tabular (Kohavi et al., 1996; Angwin et al., 2016) datasets... This seems to be the most common approach in ML and especially in computer vision, as exemplified by the FairFace (Karkkainen & Joo, 2021) dataset... Approach (B) includes a single Mixed-race category, as is the case of the American Community Survey (ACS) Public Use Microdata Sample (PUMS) data (United States Census Bureau), from which the folktables dataset (Ding et al., 2021) is derived... We illustrate this limitation with an example in computer vision, using embeddings extracted with CLIP ViT-B/32 (Radford et al., 2021) from the Chicago Face Database (CFD) and its extension CFD-MR (Ma et al., 2015; 2021)... A clear example of this can be found in ImageNet (Deng et al., 2009)... In particular, Yucer et al. (2022) propose a phenotype-based framework to replace protected attributes like race. This framework considers traits such as skin tone, eyelid type, nose shape, lip shape, hair color, and hair texture, selected based on social behavior (Feliciano, 2016) and medical studies (Fakhro et al., 2015) (see full categories in Appendix F), allowing them to annotate existing public datasets such as VGGFace2 (Cao et al., 2018) and RFW (Wang et al., 2019).
Dataset Splits | No | The paper describes using existing datasets and conducting experiments (e.g., generating images, testing VQA models), but it does not specify any training/validation/test splits for models developed or evaluated by the authors themselves. For the CFD experiment, it states 'ensuring balance across gender and sample size', but this is not a detailed train/test split.
Hardware Specification | No | The paper mentions external models like "Stable Diffusion 3.5 Large", "Midjourney", "DALL-E 3", "CLIP ViT-B/32", "ChatGPT-4", and "Gemini-2.0". However, it does not specify the hardware (e.g., specific GPU or CPU models) used by the authors to run their experiments with these models.
Software Dependencies | Yes | we conducted an experiment with Stable Diffusion 3.5 Large (Huggingface, 2024), Midjourney (Midjourney, 2024), and DALL-E 3 (OpenAI, 2024)... using embeddings extracted with CLIP ViT-B/32 (Radford et al., 2021)... We selected ChatGPT-4 (OpenAI, 2023) and Gemini-2.0 (Google AI, 2024)...
Experiment Setup | Yes | To empirically validate this hypothesis, we then generated images using each model under both prompts. Specifically, 30 images per prompt were generated for both Stable Diffusion and Midjourney... We asked each model the following question: Q1: 'Is there a difference in brightness between the faces? If so, which one is darker?' This query was repeated ten times (N=10) per model to assess response consistency.
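The paper states that each VQA query was repeated ten times "to assess response consistency" but does not name a specific consistency metric. One simple, commonly used choice is the fraction of repeats that agree with the modal answer; the sketch below assumes that definition, and the `consistency_rate` helper and the example answers are illustrative, not taken from the paper.

```python
from collections import Counter

def consistency_rate(responses):
    """Fraction of repeated responses that match the most frequent (modal)
    answer. 1.0 means the model answered identically on every repeat."""
    counts = Counter(responses)
    modal_count = counts.most_common(1)[0][1]
    return modal_count / len(responses)

# Hypothetical answers over N=10 repeats of Q1 for one model.
answers = ["the Black face"] * 8 + ["no difference"] * 2
print(consistency_rate(answers))  # 0.8
```

Under this metric, a rate near 1.0 across the N=10 repeats would justify treating a model's single dominant answer as its stable response to the FRL stimulus.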