DailyDilemmas: Revealing Value Preferences of LLMs with Quandaries of Daily Life
Authors: Yu Ying Chiu, Liwei Jiang, Yejin Choi
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present DAILYDILEMMAS, a dataset of 1,360 moral dilemmas encountered in everyday life. ... With DAILYDILEMMAS, we evaluate LLMs on these dilemmas to determine what action they will choose and the values represented by these action choices. Then, we analyze values through the lens of five theoretical frameworks... we find LLMs are most aligned with self-expression over survival in World Values Survey and care over loyalty in Moral Foundations Theory. ... Finally, we find that end users cannot effectively steer such prioritization using system prompts. |
| Researcher Affiliation | Academia | University of Washington |
| Pseudocode | No | The paper describes the 'synthetic data generation pipeline' in Figure 1 and Section 2.2 as three structured steps: (1) Formulate Moral Dilemma, (2) Imagine Negative Consequences, (3) Capture Perspectives. However, these steps are described in prose and are not presented in a code-like format or under an explicit 'Pseudocode' or 'Algorithm' label as requested by the schema. |
| Open Source Code | Yes | https://github.com/kellycyy/daily_dilemmas |
| Open Datasets | Yes | We present DAILYDILEMMAS, a dataset of 1,360 moral dilemmas encountered in everyday life. ... https://hf.co/datasets/kellycyy/daily_dilemmas |
| Dataset Splits | No | The paper describes generating 50,000 moral dilemmas, filtering them, and then stratifying and sampling 80 dilemmas per topic to form the final dataset of 1,360 dilemmas. This describes dataset construction only; no train/validation/test splits are specified for the LLM evaluation experiments. The LLMs are evaluated on the full dataset, which is used for evaluation rather than being split for training and testing a model proposed by the authors. |
| Hardware Specification | No | The paper mentions using LLMs like 'GPT-4-turbo', 'Claude-3-haiku', 'Llama-3 70b', and 'Mixtral-8x7B' for evaluation and 'GPT-4' for data generation, and discusses 'server-side indeterminism from LLM providers'. This indicates that the authors used external LLM APIs, but they do not specify any hardware they used for running their own experiments or analysis. |
| Software Dependencies | No | The paper mentions using the 'NLTK library (Wordnet, Conceptnet, Synnet)' and the 'OpenAI embedding model (text-embedding-3-small)'. While NLTK is a software library, no version number is provided. The OpenAI embedding model is a specific hosted model accessed via API, not an ancillary software dependency with a version number that the user would install and configure. |
| Experiment Setup | Yes | We apply GPT-4 to generate daily-life moral dilemma situations with value conflicts, as shown in Fig. 1. Technical details and prompts are in Appendix A.5. ... To ensure the models' generations are reliable (and feasible within our limited budget for calling external APIs), we use greedy decoding for all model response generation. Therefore, all the models we tested should consistently generate the same response (i.e., the same decision when choosing between the binary dilemma situations and the same involved values generated for each dilemma). ... Our task requires the model to accurately describe the relevant parties and values, and hence our choice of temperature (0) is optimal for this task. Additionally, we explored temperatures higher than zero earlier in the project, but they led to generations that sometimes did not follow the expected output structure, making it hard to automatically parse the responses into the corresponding values. ... We designed a system prompt modulation experiment with the GPT-4-turbo model, based on the principles stated in the OpenAI Model Spec. ... The detailed prompts are provided in Table 14 in Appendix A.11. |
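The greedy-decoding setup quoted above can be sketched as a request builder that pins `temperature=0` so repeated evaluation calls should yield the same action choice and values. This is a minimal illustrative sketch, not the authors' code: the model name, prompt wording, and `build_eval_request` helper are assumptions; only the temperature-0 choice comes from the paper.

```python
# Sketch of the deterministic (greedy) evaluation setup described in the
# Experiment Setup row. Prompt text and helper names are illustrative
# assumptions, not the authors' exact implementation.

def build_eval_request(dilemma: str, model: str = "gpt-4-turbo") -> dict:
    """Build chat-completion kwargs for a binary dilemma judgment.

    temperature=0 requests greedy decoding, so repeated calls should
    return the same decision and the same parsed values (modulo the
    server-side indeterminism the paper notes for external APIs).
    """
    prompt = (
        "You face the following everyday moral dilemma:\n"
        f"{dilemma}\n"
        "Choose exactly one action: 'to_do' or 'not_to_do', "
        "then list the values supporting your choice."
    )
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,  # greedy decoding for reproducible outputs
    }

# The kwargs would then be passed to an LLM client, e.g.:
#   client.chat.completions.create(**build_eval_request(dilemma))
req = build_eval_request("Tell a white lie to spare a friend's feelings?")
```

A temperature above zero, as the authors report, risks free-form responses that no longer match the expected output structure and so cannot be parsed automatically into values.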