Do LLMs have Consistent Values?
Authors: Naama Rozen, Liat Bezalel, Gal Elidan, Amir Globerson, Ella Daniel
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our findings reveal that standard prompting methods fail to produce human-consistent value correlations. However, we demonstrate that a novel prompting strategy (referred to as "Value Anchoring") significantly improves the alignment of LLM value correlations with human data. Furthermore, we analyze the mechanism by which Value Anchoring achieves this effect. These results not only deepen our understanding of value representation in LLMs but also introduce new methodologies for evaluating consistency and human-likeness in LLM responses, highlighting the importance of explicit value prompting for generating human-aligned outputs. |
| Researcher Affiliation | Collaboration | Naama Rozen (Tel-Aviv University, EMAIL); Liat Bezalel (Tel-Aviv University, EMAIL); Gal Elidan (Google Research and Hebrew University, EMAIL); Amir Globerson (Google Research and Tel-Aviv University, EMAIL); Ella Daniel (Tel-Aviv University, EMAIL) |
| Pseudocode | No | The paper describes the data analysis process in Section 3.2, but does not present any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code and data are provided as supplementary files in the submission. |
| Open Datasets | Yes | The data is from the study of 49 cultural groups. The total number of participants was 53,472; the mean age was 34.2 (SD = 15.8), with 59% females. This dataset is publicly available through the Open Science Framework as described in Schwartz and Cieciuch (2022). |
| Dataset Splits | No | The paper describes generating 300 personas per model and prompt, and uses an existing human dataset as a benchmark. However, it does not specify any training, validation, or test dataset splits for the LLM experiments or evaluation in the traditional machine learning sense, as the LLMs are being probed, not trained on a dataset for a specific task. |
| Hardware Specification | No | The paper lists the LLM models used (e.g., GPT-4-0314, Gemini 1.0 Pro, Llama 3.1 8B), but does not provide any specific hardware details such as GPU models, CPU types, or memory specifications used to run these models or experiments. |
| Software Dependencies | No | The Python and R code used to generate our prompt sets and analyses is available on the Open Review website. However, no specific version numbers for Python, R, or any libraries/dependencies are provided. |
| Experiment Setup | Yes | Each model was presented with each of our five prompt variants (see Section 3.1) 300 times, for a total of 1,500 runs per model. The prompts included gender-specific versions, with appropriate variants assigned based on the experimental condition. We conducted these experiments under two separate conditions: once with the temperature parameter set to 0.0 and once with it set to 0.7. |
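The experiment setup described in the table (five prompt variants, 300 repetitions each per model, run separately at two temperature settings) can be sketched as a simple loop. This is an illustrative reconstruction, not the authors' actual code: `query_model` and the prompt variant names are hypothetical placeholders for the real API calls and prompt texts.

```python
# Sketch of the experimental protocol: 5 prompt variants x 300 repetitions
# = 1,500 runs per model, executed once per temperature condition.
# `query_model` is a hypothetical stand-in for the real model API call.

PROMPT_VARIANTS = [f"variant_{i}" for i in range(5)]  # placeholder prompt texts
RUNS_PER_VARIANT = 300
TEMPERATURES = [0.0, 0.7]  # the two separate experimental conditions


def query_model(model: str, prompt: str, temperature: float) -> str:
    """Placeholder for querying the model under test."""
    return f"{model} | {prompt} | T={temperature}"


def run_condition(model: str, temperature: float) -> list[str]:
    """Run one temperature condition: every variant, 300 times each."""
    responses = []
    for prompt in PROMPT_VARIANTS:
        for _ in range(RUNS_PER_VARIANT):
            responses.append(query_model(model, prompt, temperature))
    return responses


# Each condition yields 5 x 300 = 1,500 runs for a given model.
results = {t: run_condition("example-model", t) for t in TEMPERATURES}
```

In practice the placeholder `query_model` would be replaced by calls to the respective model APIs (e.g. GPT-4-0314, Gemini 1.0 Pro, Llama 3.1 8B), with gender-specific prompt variants assigned according to the experimental condition, as noted in the table.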