LIVS: A Pluralistic Alignment Dataset for Inclusive Public Spaces

Authors: Rashid Mushkani, Shravan Nayak, Hugo Berard, Allison Cohen, Shin Koseki, Hadrien Bertrand

ICML 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | We introduce the Local Intersectional Visual Spaces (LIVS) dataset, a benchmark for multi-criteria alignment, developed through a two-year participatory process with 30 community organizations to support the pluralistic alignment of text-to-image (T2I) models in inclusive urban planning. The dataset encodes 37,710 pairwise comparisons across 13,462 images, structured along six criteria (Accessibility, Safety, Comfort, Invitingness, Inclusivity, and Diversity) derived from 634 community-defined concepts. Using Direct Preference Optimization (DPO), we fine-tune Stable Diffusion XL to reflect multi-criteria spatial preferences and evaluate the LIVS dataset and the fine-tuned model through four case studies.
Researcher Affiliation | Academia | 1 Université de Montréal, 2 Mila - Quebec AI Institute. Correspondence to: Rashid Mushkani <EMAIL>.
Pseudocode | Yes | Algorithm 1: Selecting the 4 Most Diverse Images Using CLIP Similarity Scores
Open Source Code | No | The paper mentions that "An open-source annotation platform was developed to facilitate this process" but does not provide a specific link to the source code for this platform or for the main methodology (DPO fine-tuning, model evaluation).
Open Datasets | Yes | The LIVS dataset, including citizen-provided self-identification markers (with consent), is available for research purposes at mid-space.one. This release aims to establish a benchmark for pluralistic alignment in text-to-image generation and supports both criterion-specific and user-specific customization.
Dataset Splits | Yes | We collected 35,510 multi-criteria preference annotations, each covering one to three criteria, to fine-tune a Stable Diffusion XL model using Direct Preference Optimization (DPO) (Wallace et al., 2023; Rafailov et al., 2024). We then tested the fine-tuned model with 2,200 additional annotations, comparing it to the baseline model.
Hardware Specification | Yes | Hardware: Single NVIDIA A100 80GB GPU
Software Dependencies | No | The paper mentions using "Stable Diffusion XL", "Direct Preference Optimization (DPO)", and "GPT-4o", but does not provide specific version numbers for these or for other software libraries or dependencies used in the implementation.
Experiment Setup | Yes | We fine-tuned Stable Diffusion XL using Direct Preference Optimization (DPO; Rafailov et al. 2024), closely following the original hyperparameters: Batch Size: 64; Learning Rate: 1 × 10^-8 with 20% linear warmup; Beta (β): 5,000; Training Steps: 500 for smaller subsets, 1,500 when combining the entire preference dataset.
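The paper's Algorithm 1 selects the 4 most diverse images using CLIP similarity scores. The paper does not reproduce the algorithm here, but a common way to do this is greedy farthest-point selection over CLIP image embeddings; the sketch below illustrates that heuristic on random vectors standing in for CLIP features (the paper's exact procedure may differ).

```python
import numpy as np

def select_diverse(embeddings: np.ndarray, k: int = 4) -> list[int]:
    """Greedily pick k items whose pairwise cosine similarity is low.

    A sketch of farthest-point selection; the paper's Algorithm 1
    may differ in seeding or tie-breaking.
    """
    # Normalize rows so dot products are cosine similarities.
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = emb @ emb.T
    # Seed with the image least similar to all others on average.
    selected = [int(np.argmin(sim.mean(axis=1)))]
    while len(selected) < k:
        # Each candidate's worst-case similarity to the chosen set.
        max_sim = sim[:, selected].max(axis=1)
        max_sim[selected] = np.inf  # never re-pick a chosen image
        selected.append(int(np.argmin(max_sim)))
    return selected

# Toy demo: 12 random 512-d vectors in place of real CLIP embeddings.
rng = np.random.default_rng(0)
fake_clip = rng.normal(size=(12, 512))
picks = select_diverse(fake_clip, k=4)
print(picks)
```

In practice the embeddings would come from a CLIP image encoder; the greedy step keeps each new pick maximally dissimilar from everything already selected.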
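The DPO objective used to fine-tune the diffusion model (Wallace et al., 2023) compares the policy's denoising error against a frozen reference model on the preferred versus rejected image. The sketch below shows the core pairwise loss on scalar per-sample errors, omitting the timestep weighting from the full Diffusion-DPO formulation; the function and variable names are illustrative, not from the paper.

```python
import math

def log_sigmoid(x: float) -> float:
    # Numerically stable log(sigmoid(x)); with beta = 5,000 a naive
    # exp() would overflow for even small error margins.
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def diffusion_dpo_loss(err_w_theta: float, err_w_ref: float,
                       err_l_theta: float, err_l_ref: float,
                       beta: float = 5000.0) -> float:
    """Pairwise Diffusion-DPO loss on noise-prediction errors.

    err_*_theta: policy model's squared denoising error on the
    preferred (w) / rejected (l) image; err_*_ref: same for the
    frozen reference model. Timestep weighting is omitted here.
    """
    margin = (err_w_theta - err_w_ref) - (err_l_theta - err_l_ref)
    return -log_sigmoid(-beta * margin)

# Improving on the preferred image relative to the reference
# (and worsening on the rejected one) should lower the loss.
good = diffusion_dpo_loss(0.0100, 0.0101, 0.0102, 0.0101)
bad = diffusion_dpo_loss(0.0102, 0.0101, 0.0100, 0.0101)
print(good, bad)
```

The large β (5,000) reflects that per-pixel denoising errors differ only slightly between policy and reference, so the margin must be scaled up before the sigmoid.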
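The reported schedule is a learning rate of 1 × 10^-8 with 20% linear warmup. A minimal sketch of that schedule follows; holding the rate constant after warmup is an assumption, since the paper only states the warmup fraction.

```python
def lr_at_step(step: int, total_steps: int = 500,
               base_lr: float = 1e-8, warmup_frac: float = 0.20) -> float:
    """Linear warmup to base_lr over the first 20% of steps.

    After warmup the rate is held constant (an assumption; the paper
    specifies only '20% linear warmup').
    """
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr

# With the 500-step setting, warmup covers the first 100 steps.
print(lr_at_step(0), lr_at_step(50), lr_at_step(250))
```

The same function covers the 1,500-step run on the full preference dataset by passing `total_steps=1500`, which stretches warmup to the first 300 steps.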