Uncovering Gaps in How Humans and LLMs Interpret Subjective Language

Authors: Erik Jones, Arjun Patrawala, Jacob Steinhardt

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate TED by measuring how well the failures it uncovers predict downstream behavior in two settings: output-editing and inference-steering. ... We include the full quantitative results in Table 1, and find that for nearly every failure type, semantic thesaurus, and model, TED's average success rate is always higher than the semantic-only baseline, and is frequently much higher.
Researcher Affiliation | Academia | Erik Jones, Arjun Patrawala, & Jacob Steinhardt, UC Berkeley
Pseudocode | No | No, the paper describes the method "THESAURUS ERROR DETECTION (TED)" in Section 3 and its instantiation in Section 4 using descriptive text and mathematical formulations (e.g., Equation 1), but it does not present a clearly labeled pseudocode block or algorithm.
Open Source Code | Yes | Code is available at https://github.com/arjunpat/thesaurus-error-detector
Open Datasets | Yes | The exhaustive list of ethical questions is made available in the code
Dataset Splits | Yes | To minimize overlap between training and test datasets, we find it effective to prompt GPT-4 to generate 200 ethical questions, saving 100 for training semantic embeddings and 100 for testing them in the output-editing failures test.
Hardware Specification | Yes | Inference occurs on single A100 40 GB with a temperature = 1, while gradients are computed on an 80 GB A100.
Software Dependencies | No | No, the paper mentions using "vLLM" and "Hugging Face transformers library (Wolf et al., 2019)" and "PyTorch" but does not provide specific version numbers for any of these software dependencies.
Experiment Setup | Yes | We average n = 100 prompts to construct the embeddings, and set τ_sim = 0.93 and τ_dis = 0.1 for Mistral on the unexpected edits and inadequate updates respectively. ... For Llama 3 we set τ_sim = 0.98 and τ_dis = 0.5. ... Inference occurs on single A100 40 GB with a temperature = 1
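
The Experiment Setup entry quotes per-model thresholds and an averaging step but not how they are applied. The sketch below is one plausible reading, not the authors' implementation: it assumes that a phrase embedding is the element-wise mean of per-prompt vectors, that two phrases are compared by cosine similarity, and that a pair is a candidate for an unexpected edit when the similarity is at least τ_sim and a candidate for an inadequate update when it is at most τ_dis. The names `Thresholds`, `THRESHOLDS`, `mean_embedding`, `cosine_similarity`, and `flag_pair` are hypothetical; only the four threshold values come from the text above. The comparison against a semantic thesaurus, which the paper also uses, is left out.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Thresholds:
    tau_sim: float  # similarity threshold quoted for unexpected edits
    tau_dis: float  # dissimilarity threshold quoted for inadequate updates


# Threshold values as quoted in the Experiment Setup entry.
# The dictionary keys are illustrative labels, not identifiers from the paper.
THRESHOLDS = {
    "mistral": Thresholds(tau_sim=0.93, tau_dis=0.1),
    "llama3": Thresholds(tau_sim=0.98, tau_dis=0.5),
}


def mean_embedding(per_prompt: Sequence[Sequence[float]]) -> list:
    """Average per-prompt vectors element-wise (the text uses n = 100 prompts)."""
    if not per_prompt:
        raise ValueError("need at least one per-prompt vector")
    n = len(per_prompt)
    dim = len(per_prompt[0])
    return [sum(vec[i] for vec in per_prompt) / n for i in range(dim)]


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; ASSUMED here as the comparison between embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def flag_pair(similarity: float, t: Thresholds) -> Optional[str]:
    """Label a phrase pair by its similarity score.

    ASSUMPTION: the direction of each comparison (>= tau_sim, <= tau_dis)
    is a guess at how the thresholds are used, not taken from the paper.
    """
    if similarity >= t.tau_sim:
        return "unexpected_edit_candidate"
    if similarity <= t.tau_dis:
        return "inadequate_update_candidate"
    return None
```

Under these assumptions, a pair with similarity 0.95 would be flagged with the Mistral thresholds but not with the stricter Llama 3 value of τ_sim = 0.98, which illustrates why the thresholds are reported per model.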