Multimodal Cultural Safety: Evaluation Framework and Alignment Strategies
Authors: Haoyi Qiu, Kung-Hsiang Huang, Ruichen Zheng, Jiao Sun, Nanyun Peng
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate 21 leading LVLMs, including mixture-of-experts models (e.g., Llama-4-Maverick) and reasoning models (e.g., o1 and Gemini-2.5-Pro). Results reveal significant cultural safety gaps: the best-performing model achieves only 61.79% in awareness and 37.73% in compliance. Our results further show that increasing reasoning capacity improves cultural alignment but does not fully resolve the issue. To improve model performance, we develop two enhancement strategies: supervised fine-tuning with culturally grounded, open-ended data and preference tuning with contrastive response pairs that highlight safe versus unsafe behaviors. These methods substantially improve GPT-4o's cultural awareness (+60.14%) and compliance (+55.2%), with only minimal performance reduction on general multimodal understanding benchmarks. |
| Researcher Affiliation | Collaboration | 1University of California, Los Angeles; 2Salesforce AI Research; 3Google DeepMind |
| Pseudocode | No | The paper describes methods through narrative text and figures illustrating data pipelines (e.g., Figure 2 and Figure 4), but it does not contain explicit pseudocode blocks or algorithms in a structured code-like format. |
| Open Source Code | No | The paper cites the technical reports of several open-source models it evaluates (e.g., Qwen2.5-VL technical report, arXiv:2502.13923, 2025; InternLM2 technical report, arXiv:2403.17297, 2024), but it does not state that the authors' own implementation code is publicly available, nor does it link to a code repository. |
| Open Datasets | Yes | We conduct supervised fine-tuning using the CVQA dataset (Romero et al., 2024), a benchmark for culturally grounded visual question answering. It contains over 10,000 human-validated multiple-choice image-question (MCQ) pairs spanning 39 country-language combinations and 10 thematic categories. https://huggingface.co/datasets/afaji/cvqa. All data here is free to use for research purposes. CROSS builds on validated text-only cultural norms from SafeWorld (Yin et al., 2024) and CASA (Qiu et al., 2025), extending them with visually grounded queries paired with real-world images. |
| Dataset Splits | No | The paper reports the total size of the CROSS benchmark (1,284 image-query pairs) and of the constructed training datasets (1,581 and 2,374 examples) used for fine-tuning, but it does not provide explicit train/validation/test splits for these datasets (e.g., percentages or counts for each set), which would be needed for conventional reproduction. |
| Hardware Specification | Yes | The training configurations for different fine-tuning and optimization strategies specify the number and type of GPUs: '1-2 A100 80GB' for GPT-4o SFT and DPO, and '2-4 A100 80GB' for InternVL2.5. |
| Software Dependencies | No | The paper mentions using specific versions of large vision-language models (e.g., gpt-4o-2024-08-06, gemini-2.5-pro-preview-03-25) and OpenAI's fine-tuning API, but it does not provide version numbers for ancillary software components such as programming languages (e.g., Python), libraries (e.g., PyTorch), or CUDA. |
| Experiment Setup | Yes | Table 13 details training configurations for supervised fine-tuning (SFT) and Direct Preference Optimization (DPO), including '# of Epochs 1' for all methods, 'Batch Size 1' for GPT-4o and 'Batch Size 8' for InternVL2.5, 'Learning Rate 5e-4' and 'LR Multiplier 2' for GPT-4o SFT, and 'Beta 0.1' for GPT-4o DPO. |
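The 'Beta 0.1' entry in the table above refers to the temperature of the standard DPO objective. As a minimal sketch of what that hyperparameter controls (this is the generic DPO loss, not the authors' implementation; the example log-probabilities are made up for illustration):

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one contrastive pair (chosen = safe, rejected = unsafe).

    beta=0.1 matches the GPT-4o DPO setting reported in the table.
    Inputs are summed log-probabilities of each response under the
    policy being tuned and the frozen reference model.
    """
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(margin)), computed in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# If the policy favors the safe response more than the reference does,
# the margin is positive and the loss drops below log(2) ~= 0.693.
loss = dpo_loss(-10.0, -14.0, -11.0, -12.0)  # margin = 0.1 * (1 - (-2)) = 0.3
```

A larger beta sharpens the penalty for preferring the unsafe response over the safe one; at the reported beta=0.1 the implicit reward stays close to the reference model, which is consistent with the goal of preserving general multimodal capabilities.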