reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Language Model Alignment in Multilingual Trolley Problems

Authors: Zhijing Jin, Max Kleiman-Weiner, Giorgio Piatti, Sydney Levine, Jiarui Liu, Fernando Gonzalez Adauto, Francesco Ortu, András Strausz, Mrinmaya Sachan, Rada Mihalcea, Yejin Choi, Bernhard Schölkopf

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We evaluate the moral alignment of LLMs with human preferences in multilingual trolley problems. ... Our analysis explores the alignment of 19 different LLMs with human judgments, capturing preferences across six moral dimensions... Our findings reveal that very few LLMs demonstrate overall alignment with human preferences.
Researcher Affiliation	Academia	1Max Planck Institute for Intelligent Systems, Tübingen 2ETH Zürich 3University of Toronto 4University of Washington 5Allen Institute for AI (AI2) 6Carnegie Mellon University 7University of Trieste 8University of Michigan
Pseudocode	No	The paper describes the methodology and prompt construction in detail but does not contain any structured pseudocode or algorithm blocks.
Open Source Code	Yes	Our code and data are at https://github.com/causal NLP/multi TP.
Open Datasets	Yes	We introduce our Multilingual Trolley Problems (MULTITP) dataset... Our code and data are at https://github.com/causal NLP/multi TP. ... The original Moral Machine dataset was collected by Awad et al. (2018) and is available here: https://osf.io/3hvt2/
Dataset Splits	No	The paper states the MULTITP dataset comprises 98,440 trolley problem vignettes and uses the Moral Machine dataset as ground truth, but it does not specify explicit training, validation, or test splits for models, as the paper evaluates pre-trained LLMs rather than training new models on split data.
Hardware Specification	No	The paper lists VRAM requirements for various LLM models being evaluated (Table 4) and mentions estimated API costs (Table 5), but it does not specify the exact hardware (e.g., GPU models, CPU types) used by the authors to run their experiments or evaluations.
Software Dependencies	No	The paper mentions using the googletrans Python package for translations but does not specify its version. It also lists the specific LLM models evaluated but does not provide version numbers for ancillary software or libraries used for the evaluation process itself.
Experiment Setup	Yes	For reproducibility, we fix the random seed and set the temperature to zero for the generation. ... We employ the token-forcing method (Wei et al., 2023; Carlini et al., 2023): Q: [Vignette Description] A: If the self-driving car has to make a decision, between the two choices, it should save...