Multimodal Cultural Safety: Evaluation Framework and Alignment Strategies
Authors: Haoyi Qiu, Kung-Hsiang Huang, Ruichen Zheng, Jiao Sun, Nanyun Peng
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate 21 leading LVLMs, including mixture-of-experts models (e.g., Llama-4-Maverick) and reasoning models (e.g., o1 and Gemini-2.5-Pro). Results reveal significant cultural safety gaps: the best-performing model achieves only 61.79% in awareness and 37.73% in compliance. Our results further show that increasing reasoning capacity improves cultural alignment but does not fully resolve the issue. To improve model performance, we develop two enhancement strategies: supervised fine-tuning with culturally grounded, open-ended data and preference tuning with contrastive response pairs that highlight safe versus unsafe behaviors. These methods substantially improve GPT-4o's cultural awareness (+60.14%) and compliance (+55.2%), with only minimal performance reduction on general multimodal understanding benchmarks. |
| Researcher Affiliation | Collaboration | 1University of California, Los Angeles; 2Salesforce AI Research; 3Google DeepMind |
| Pseudocode | No | The paper describes methods through narrative text and figures illustrating data pipelines (e.g., Figure 2 and Figure 4), but it does not contain explicit pseudocode blocks or algorithms in a structured code-like format. |
| Open Source Code | No | The paper cites the technical reports of several open-source models it evaluates (e.g., Qwen2.5-VL technical report, arXiv:2502.13923, 2025; InternLM2 technical report, arXiv:2403.17297, 2024), but it does not state that the authors' own implementation code is publicly available, nor does it link to a code repository. |
| Open Datasets | Yes | We conduct supervised fine-tuning using the CVQA dataset (Romero et al., 2024), a benchmark for culturally grounded visual question answering. It contains over 10,000 human-validated multiple-choice image-question (MCQ) pairs spanning 39 country-language combinations and 10 thematic categories. https://huggingface.co/datasets/afaji/cvqa. All data here is free to use for research purposes. CROSS builds on validated text-only cultural norms from SafeWorld (Yin et al., 2024) and CASA (Qiu et al., 2025), extending them with visually grounded queries paired with real-world images. |
| Dataset Splits | No | The paper reports the total size of the CROSS benchmark (1,284 image-query pairs) and of the constructed training datasets (1,581 and 2,374 examples) used for fine-tuning, but it does not provide explicit train/validation/test splits for these datasets (e.g., percentages or counts for each set), which would be needed for conventional reproduction. |
| Hardware Specification | Yes | The training configurations for different fine-tuning and optimization strategies specify the number and type of GPUs: '1-2 A100 80GB' for GPT-4o SFT and DPO, and '2-4 A100 80GB' for InternVL2.5. |
| Software Dependencies | No | The paper mentions using specific versions of large vision-language models (e.g., gpt-4o-2024-08-06, gemini-2.5-pro-preview-03-25) and OpenAI's fine-tuning API, but it does not provide version numbers for ancillary software components such as programming languages (e.g., Python), libraries (e.g., PyTorch), or CUDA. |
| Experiment Setup | Yes | Table 13 details training configurations for supervised fine-tuning (SFT) and Direct Preference Optimization (DPO), including '# of Epochs 1' for all methods, 'Batch Size 1' for GPT-4o and 'Batch Size 8' for InternVL2.5, 'Learning Rate 5e-4' and 'LR Multiplier 2' for GPT-4o SFT, and 'Beta 0.1' for GPT-4o DPO. |
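The 'Beta 0.1' entry in the table above refers to the temperature of the standard DPO objective. As a minimal sketch of what that hyperparameter controls (this is the generic DPO loss, not the authors' implementation; the example log-probabilities are made up for illustration):

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one contrastive pair (chosen = safe, rejected = unsafe).

    beta=0.1 matches the GPT-4o DPO setting reported in the table.
    Inputs are summed log-probabilities of each response under the
    policy being tuned and the frozen reference model.
    """
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(margin)), computed in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# If the policy favors the safe response more than the reference does,
# the margin is positive and the loss drops below log(2) ~= 0.693.
loss = dpo_loss(-10.0, -14.0, -11.0, -12.0)  # margin = 0.1 * (1 - (-2)) = 0.3
```

A larger beta sharpens the penalty for preferring the unsafe response over the safe one; at the reported beta=0.1 the implicit reward stays close to the reference model, which is consistent with the goal of preserving general multimodal capabilities.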