Uncovering Gaps in How Humans and LLMs Interpret Subjective Language
Authors: Erik Jones, Arjun Patrawala, Jacob Steinhardt
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate TED by measuring how well the failures it uncovers predict downstream behavior in two settings: output-editing and inference-steering. ... We include the full quantitative results in Table 1, and find that for nearly every failure type, semantic thesaurus, and model, TED's average success rate is always higher than the semantic-only baseline, and is frequently much higher. |
| Researcher Affiliation | Academia | Erik Jones , Arjun Patrawala , & Jacob Steinhardt UC Berkeley EMAIL |
| Pseudocode | No | The paper describes the method "THESAURUS ERROR DETECTION (TED)" in Section 3 and its instantiation in Section 4 using descriptive text and mathematical formulations (e.g., Equation 1), but it does not present a clearly labeled pseudocode block or algorithm. |
| Open Source Code | Yes | Code is available at https://github.com/arjunpat/thesaurus-error-detector |
| Open Datasets | Yes | The exhaustive list of ethical questions is made available in the code |
| Dataset Splits | Yes | To minimize overlap between training and test datasets, we find it effective to prompt GPT-4 to generate 200 ethical questions, saving 100 for training semantic embeddings and 100 for testing them in the output-editing failures test. |
| Hardware Specification | Yes | Inference occurs on single A100 40 GB with a temperature = 1, while gradients are computed on an 80 GB A100. |
| Software Dependencies | No | The paper mentions using "vLLM", the "Hugging Face transformers library (Wolf et al., 2019)", and "PyTorch", but does not provide specific version numbers for any of these software dependencies. |
| Experiment Setup | Yes | We average n = 100 prompts to construct the embeddings, and set τsim = 0.93 and τdis = 0.1 for Mistral on the unexpected edits and inadequate updates respectively. ... For Llama 3 we set τsim = 0.98 and τdis = 0.5. ... Inference occurs on single A100 40 GB with a temperature = 1 |
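The Experiment Setup row above describes building an operational notion of phrase similarity by averaging embeddings over n = 100 prompts and thresholding cosine similarity at τsim (similar) or τdis (dissimilar). The sketch below illustrates that thresholding step only; it is a minimal illustration, not the paper's released code, and the function names, input format (plain numpy arrays), and the idea of flagging a failure when the operational and semantic judgments disagree are assumptions based on the excerpts quoted in the table.

```python
import numpy as np

def mean_embedding(per_prompt_embeddings: np.ndarray) -> np.ndarray:
    """Average per-prompt embeddings (shape: n_prompts x dim) into one
    operational embedding for a phrase, per the n = 100 setup above."""
    return per_prompt_embeddings.mean(axis=0)

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Thresholds quoted for Mistral: tau_sim = 0.93, tau_dis = 0.1.
def operationally_similar(emb_a, emb_b, tau_sim=0.93):
    return cosine(emb_a, emb_b) >= tau_sim

def operationally_dissimilar(emb_a, emb_b, tau_dis=0.1):
    return cosine(emb_a, emb_b) <= tau_dis

def flag_failure(emb_a, emb_b, semantically_similar: bool) -> bool:
    """Illustrative clash check (an assumption, not the paper's exact rule):
    flag a failure when the operational judgment from embeddings
    contradicts the semantic-thesaurus judgment."""
    if semantically_similar:
        return operationally_dissimilar(emb_a, emb_b)
    return operationally_similar(emb_a, emb_b)
```

Averaging over many sampled outputs before comparing (rather than comparing single generations) makes the operational similarity estimate less sensitive to the temperature = 1 sampling noted in the Hardware row.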