DailyDilemmas: Revealing Value Preferences of LLMs with Quandaries of Daily Life

Authors: Yu Ying Chiu, Liwei Jiang, Yejin Choi

ICLR 2025

Reproducibility checklist (variable, result, and supporting excerpt):
Research Type: Experimental. "We present DAILYDILEMMAS, a dataset of 1,360 moral dilemmas encountered in everyday life. ... With DAILYDILEMMAS, we evaluate LLMs on these dilemmas to determine what action they will choose and the values represented by these action choices. Then, we analyze values through the lens of five theoretical frameworks... we find LLMs are most aligned with self-expression over survival in World Values Survey and care over loyalty in Moral Foundations Theory. ... Finally, we find that end users cannot effectively steer such prioritization using system prompts."
Researcher Affiliation: Academia. "University of Washington EMAIL"
Pseudocode: No. The paper describes its synthetic data generation pipeline in Figure 1 and Section 2.2 as structured steps ((1) Formulate Moral Dilemma; (2) Imagine Negative Consequences; (3) Capture Perspectives). However, these steps are given in prose only, without a code-like format or an explicit "Pseudocode" or "Algorithm" label as the schema requires.
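For illustration only, the three prose steps could be arranged as a pipeline skeleton. Every name below is hypothetical (the paper publishes no such code), and `call_llm` is a stand-in for the GPT-4 call the authors describe:

```python
# Hypothetical skeleton of the three-step generation pipeline the paper
# describes in prose (Fig. 1, Sec. 2.2). `call_llm` is a placeholder for
# an actual GPT-4 API call; here it just echoes the prompt.
def call_llm(prompt: str) -> str:
    """Placeholder for a real model call."""
    return f"<response to: {prompt}>"

def generate_dilemma(topic: str) -> dict:
    """Run the three prose steps in sequence for one topic."""
    # (1) Formulate Moral Dilemma
    dilemma = call_llm(f"Formulate a daily-life moral dilemma about {topic}.")
    # (2) Imagine Negative Consequences
    consequences = call_llm(f"Imagine negative consequences of each action in: {dilemma}")
    # (3) Capture Perspectives
    perspectives = call_llm(f"Capture the parties and values involved in: {dilemma}")
    return {"dilemma": dilemma, "consequences": consequences, "perspectives": perspectives}

record = generate_dilemma("workplace honesty")
```

This is a sketch of the described control flow only; the authors' actual prompts are in their Appendix A.5.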
Open Source Code: Yes. https://github.com/kellycyy/daily_dilemmas
Open Datasets: Yes. "We present DAILYDILEMMAS, a dataset of 1,360 moral dilemmas encountered in everyday life." https://hf.co/datasets/kellycyy/daily_dilemmas
Dataset Splits: No. The paper describes generating 50,000 moral dilemmas, filtering them, and then sampling 80 dilemmas per topic to form the final set of 1,360. This covers dataset construction, but no train/validation/test splits are specified for the LLM evaluation experiments: the full dataset is used for evaluation rather than being split to train and test a model proposed by the authors.
Hardware Specification: No. The paper mentions using LLMs such as "GPT-4-turbo", "Claude-3-haiku", "Llama-3 70b", and "Mixtral-8x7B" for evaluation and "GPT-4" for data generation, and discusses "server-side indeterminism from LLM providers". This indicates the authors relied on external LLM APIs, but they do not specify any hardware used for their own experiments or analysis.
Software Dependencies: No. The paper mentions the "NLTK library (Wordnet, Conceptnet, Synnet)" and the "Open AI embedding model (text-embedding-3-small)". While NLTK is a software library, no version number is given; the OpenAI embedding model is a specific hosted model rather than an ancillary software dependency the user would install and pin to a version.
Experiment Setup: Yes. "We apply GPT-4 to generate daily-life moral dilemma situations with value conflicts, as shown in Fig. 1. Technical details and prompts are in Appendix A.5. ... To ensure the models' generations are reliable (and feasible within our limited budget for calling external APIs), we use greedy decoding for all the model response generation. Therefore, all the models we tested should consistently generate the same response (i.e., same decision for choosing the binary dilemma situation; same involved values generated for each dilemma). ... Our task requires the model to accurately describe the relevant parties and values and hence our choice of temperature (0) is optimal for this task. Additionally, we also explored temperatures higher than zero earlier in the project but they led to generations that sometimes did not follow the expected output structure, making it hard to automatically parse the responses into the corresponding values. ... We designed a system prompt modulation experiment with the GPT-4-turbo model, based on the principles stated in the OpenAI Model Spec. ... The detailed prompts are provided in Table 14 in Appendix A.11."
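The greedy-decoding setup the authors describe amounts to fixing the sampling temperature to 0 in every API request. A minimal sketch of such a request payload (the prompt text and helper function here are illustrative, not the authors' actual code):

```python
# Sketch of a deterministic (greedy) chat-completion request payload,
# matching the paper's stated setup: temperature 0 so repeated calls
# should yield the same decision and the same value list per dilemma.
def build_request(model: str, dilemma: str) -> dict:
    """Assemble a chat-completion payload with greedy decoding."""
    return {
        "model": model,
        "temperature": 0,  # greedy decoding: always pick the argmax token
        "messages": [
            {"role": "system",
             "content": "Choose an action for the dilemma and list the values involved."},
            {"role": "user", "content": dilemma},
        ],
    }

payload = build_request("gpt-4-turbo",
                        "Should I report a coworker's minor mistake?")
```

Note that, as the paper itself observes, temperature 0 does not guarantee bit-identical outputs across calls because of server-side indeterminism at the LLM providers.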