Do LLMs have Consistent Values?
Authors: Naama Rozen, Liat Bezalel, Gal Elidan, Amir Globerson, Ella Daniel
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our findings reveal that standard prompting methods fail to produce human-consistent value correlations. However, we demonstrate that a novel prompting strategy (referred to as "Value Anchoring") significantly improves the alignment of LLM value correlations with human data. Furthermore, we analyze the mechanism by which Value Anchoring achieves this effect. These results not only deepen our understanding of value representation in LLMs but also introduce new methodologies for evaluating consistency and human-likeness in LLM responses, highlighting the importance of explicit value prompting for generating human-aligned outputs. |
| Researcher Affiliation | Collaboration | Naama Rozen (Tel-Aviv University, EMAIL); Liat Bezalel (Tel-Aviv University, EMAIL); Gal Elidan (Google Research and Hebrew University, EMAIL); Amir Globerson (Google Research and Tel-Aviv University, EMAIL); Ella Daniel (Tel-Aviv University, EMAIL) |
| Pseudocode | No | The paper describes the data analysis process in Section 3.2, but does not present any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code and data are provided as supplementary files in the submission. |
| Open Datasets | Yes | The data is from the study of 49 cultural groups. The total number of participants was 53,472; the mean age was 34.2 (SD = 15.8), with 59% females. This dataset is publicly available through the Open Science Framework as described in Schwartz and Cieciuch (2022). |
| Dataset Splits | No | The paper describes generating 300 personas per model and prompt, and uses an existing human dataset as a benchmark. However, it does not specify any training, validation, or test dataset splits for the LLM experiments or evaluation in the traditional machine learning sense, as the LLMs are being probed, not trained on a dataset for a specific task. |
| Hardware Specification | No | The paper lists the LLM models used (e.g., GPT-4-0314, Gemini 1.0 Pro, Llama 3.1 8B), but does not provide any specific hardware details such as GPU models, CPU types, or memory specifications used to run these models or experiments. |
| Software Dependencies | No | The Python and R code used to generate our prompt sets and analyses is available on the Open Review website. However, no specific version numbers for Python, R, or any libraries/dependencies are provided. |
| Experiment Setup | Yes | Each model was presented with each of our five prompt variants (see Section 3.1) 300 times, for a total of 1,500 runs per model. The prompts included gender-specific versions, with appropriate variants assigned based on the experimental condition. We conducted these experiments under two separate conditions: once with the temperature parameter set to 0.0 and once with it set to 0.7. |
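The experiment setup described in the table (five prompt variants, 300 repetitions each per model, run separately at two temperature settings) can be sketched as a simple loop. This is an illustrative reconstruction, not the authors' actual code: `query_model` and the prompt variant names are hypothetical placeholders for the real API calls and prompt texts.

```python
# Sketch of the experimental protocol: 5 prompt variants x 300 repetitions
# = 1,500 runs per model, executed once per temperature condition.
# `query_model` is a hypothetical stand-in for the real model API call.

PROMPT_VARIANTS = [f"variant_{i}" for i in range(5)]  # placeholder prompt texts
RUNS_PER_VARIANT = 300
TEMPERATURES = [0.0, 0.7]  # the two separate experimental conditions


def query_model(model: str, prompt: str, temperature: float) -> str:
    """Placeholder for querying the model under test."""
    return f"{model} | {prompt} | T={temperature}"


def run_condition(model: str, temperature: float) -> list[str]:
    """Run one temperature condition: every variant, 300 times each."""
    responses = []
    for prompt in PROMPT_VARIANTS:
        for _ in range(RUNS_PER_VARIANT):
            responses.append(query_model(model, prompt, temperature))
    return responses


# Each condition yields 5 x 300 = 1,500 runs for a given model.
results = {t: run_condition("example-model", t) for t in TEMPERATURES}
```

In practice the placeholder `query_model` would be replaced by calls to the respective model APIs (e.g. GPT-4-0314, Gemini 1.0 Pro, Llama 3.1 8B), with gender-specific prompt variants assigned according to the experimental condition, as noted in the table.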