DailyDilemmas: Revealing Value Preferences of LLMs with Quandaries of Daily Life

Authors: Yu Ying Chiu, Liwei Jiang, Yejin Choi

ICLR 2025

Reproducibility checklist (variable, result, and supporting excerpt):
Research Type: Experimental. "We present DAILYDILEMMAS, a dataset of 1,360 moral dilemmas encountered in everyday life. ... With DAILYDILEMMAS, we evaluate LLMs on these dilemmas to determine what action they will choose and the values represented by these action choices. Then, we analyze values through the lens of five theoretical frameworks... we find LLMs are most aligned with self-expression over survival in World Values Survey and care over loyalty in Moral Foundations Theory. ... Finally, we find that end users cannot effectively steer such prioritization using system prompts."
Researcher Affiliation: Academia. "University of Washington EMAIL"
Pseudocode: No. The paper describes its synthetic data generation pipeline in Figure 1 and Section 2.2 as structured steps ((1) Formulate Moral Dilemma; (2) Imagine Negative Consequences; (3) Capture Perspectives). However, these steps are given in prose only, without a code-like format or an explicit "Pseudocode" or "Algorithm" label as the schema requires.
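For illustration only, the three prose steps could be arranged as a pipeline skeleton. Every name below is hypothetical (the paper publishes no such code), and `call_llm` is a stand-in for the GPT-4 call the authors describe:

```python
# Hypothetical skeleton of the three-step generation pipeline the paper
# describes in prose (Fig. 1, Sec. 2.2). `call_llm` is a placeholder for
# an actual GPT-4 API call; here it just echoes the prompt.
def call_llm(prompt: str) -> str:
    """Placeholder for a real model call."""
    return f"<response to: {prompt}>"

def generate_dilemma(topic: str) -> dict:
    """Run the three prose steps in sequence for one topic."""
    # (1) Formulate Moral Dilemma
    dilemma = call_llm(f"Formulate a daily-life moral dilemma about {topic}.")
    # (2) Imagine Negative Consequences
    consequences = call_llm(f"Imagine negative consequences of each action in: {dilemma}")
    # (3) Capture Perspectives
    perspectives = call_llm(f"Capture the parties and values involved in: {dilemma}")
    return {"dilemma": dilemma, "consequences": consequences, "perspectives": perspectives}

record = generate_dilemma("workplace honesty")
```

This is a sketch of the described control flow only; the authors' actual prompts are in their Appendix A.5.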
Open Source Code: Yes. https://github.com/kellycyy/daily_dilemmas
Open Datasets: Yes. "We present DAILYDILEMMAS, a dataset of 1,360 moral dilemmas encountered in everyday life." https://hf.co/datasets/kellycyy/daily_dilemmas
Dataset Splits: No. The paper describes generating 50,000 moral dilemmas, filtering them, and then sampling 80 dilemmas per topic to form the final set of 1,360. This covers dataset construction, but no train/validation/test splits are specified for the LLM evaluation experiments: the full dataset is used for evaluation rather than being split to train and test a model proposed by the authors.
Hardware Specification: No. The paper mentions using LLMs such as "GPT-4-turbo", "Claude-3-haiku", "Llama-3 70b", and "Mixtral-8x7B" for evaluation and "GPT-4" for data generation, and discusses "server-side indeterminism from LLM providers". This indicates the authors relied on external LLM APIs, but they do not specify any hardware used for their own experiments or analysis.
Software Dependencies: No. The paper mentions the "NLTK library (Wordnet, Conceptnet, Synnet)" and the "Open AI embedding model (text-embedding-3-small)". While NLTK is a software library, no version number is given; the OpenAI embedding model is a specific hosted model rather than an ancillary software dependency the user would install and pin to a version.
Experiment Setup: Yes. "We apply GPT-4 to generate daily-life moral dilemma situations with value conflicts, as shown in Fig. 1. Technical details and prompts are in Appendix A.5. ... To ensure the models' generations are reliable (and feasible within our limited budget for calling external APIs), we use greedy decoding for all the model response generation. Therefore, all the models we tested should consistently generate the same response (i.e., same decision for choosing the binary dilemma situation; same involved values generated for each dilemma). ... Our task requires the model to accurately describe the relevant parties and values and hence our choice of temperature (0) is optimal for this task. Additionally, we also explored temperatures higher than zero earlier in the project but they led to generations that sometimes did not follow the expected output structure, making it hard to automatically parse the responses into the corresponding values. ... We designed a system prompt modulation experiment with the GPT-4-turbo model, based on the principles stated in the OpenAI Model Spec. ... The detailed prompts are provided in Table 14 in Appendix A.11."
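The greedy-decoding setup the authors describe amounts to fixing the sampling temperature to 0 in every API request. A minimal sketch of such a request payload (the prompt text and helper function here are illustrative, not the authors' actual code):

```python
# Sketch of a deterministic (greedy) chat-completion request payload,
# matching the paper's stated setup: temperature 0 so repeated calls
# should yield the same decision and the same value list per dilemma.
def build_request(model: str, dilemma: str) -> dict:
    """Assemble a chat-completion payload with greedy decoding."""
    return {
        "model": model,
        "temperature": 0,  # greedy decoding: always pick the argmax token
        "messages": [
            {"role": "system",
             "content": "Choose an action for the dilemma and list the values involved."},
            {"role": "user", "content": dilemma},
        ],
    }

payload = build_request("gpt-4-turbo",
                        "Should I report a coworker's minor mistake?")
```

Note that, as the paper itself observes, temperature 0 does not guarantee bit-identical outputs across calls because of server-side indeterminism at the LLM providers.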