reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Uncovering Gaps in How Humans and LLMs Interpret Subjective Language

Authors: Erik Jones, Arjun Patrawala, Jacob Steinhardt

ICLR 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We evaluate TED by measuring how well the failures it uncovers predict downstream behavior in two settings: output-editing and inference-steering. ... We include the full quantitative results in Table 1, and find that for nearly every failure type, semantic thesaurus, and model, TED's average success rate is always higher than the semantic-only baseline, and is frequently much higher.
Researcher Affiliation	Academia	Erik Jones, Arjun Patrawala, & Jacob Steinhardt, UC Berkeley
Pseudocode	No	No, the paper describes the method "THESAURUS ERROR DETECTION (TED)" in Section 3 and its instantiation in Section 4 using descriptive text and mathematical formulations (e.g., Equation 1), but it does not present a clearly labeled pseudocode block or algorithm.
Open Source Code	Yes	Code is available at https://github.com/arjunpat/thesaurus-error-detector
Open Datasets	Yes	The exhaustive list of ethical questions is made available in the code
Dataset Splits	Yes	To minimize overlap between training and test datasets, we find it effective to prompt GPT-4 to generate 200 ethical questions, saving 100 for training semantic embeddings and 100 for testing them in the output-editing failures test.
Hardware Specification	Yes	Inference occurs on single A100 40 GB with a temperature = 1, while gradients are computed on an 80 GB A100.
Software Dependencies	No	No, the paper mentions using "vLLM" and "Hugging Face transformers library (Wolf et al., 2019)" and "PyTorch" but does not provide specific version numbers for any of these software dependencies.
Experiment Setup	Yes	We average n = 100 prompts to construct the embeddings, and set τ_sim = 0.93 and τ_dis = 0.1 for Mistral on the unexpected edits and inadequate updates respectively. ... For Llama 3 we set τ_sim = 0.98 and τ_dis = 0.5. ... Inference occurs on single A100 40 GB with a temperature = 1
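
For readers attempting a rerun, the settings quoted in the Dataset Splits and Experiment Setup rows can be collected in one place. The following is a minimal sketch, not the authors' code: the dictionary keys, the function names `split_questions` and `average_embedding`, the fixed shuffle seed, and the placeholder question strings are all assumptions made for illustration. Only the numeric values (200 questions split 100/100, n = 100 prompts per embedding, the four thresholds, and temperature 1) come from the rows above.

```python
import random
from statistics import fmean

# Numeric values quoted in the table above; key names are hypothetical.
REPORTED_SETTINGS = {
    "n_prompts_per_embedding": 100,
    "temperature": 1.0,
    "thresholds": {
        "mistral": {"tau_sim": 0.93, "tau_dis": 0.1},
        "llama3": {"tau_sim": 0.98, "tau_dis": 0.5},
    },
}


def split_questions(questions, n_train=100, seed=0):
    """Split questions into disjoint train and test lists.

    Shuffling with a fixed seed is an assumption; the table only states
    that 100 questions are kept for training and 100 for testing.
    """
    shuffled = list(questions)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


def average_embedding(per_prompt_vectors):
    """Element-wise mean of equal-length vectors, one vector per prompt."""
    return [fmean(column) for column in zip(*per_prompt_vectors)]


# Placeholder strings standing in for the 200 generated ethical questions.
questions = [f"question_{i}" for i in range(200)]
train_questions, test_questions = split_questions(questions)
```

The actual questions and the way the thresholds are applied to each failure type should be taken from the linked repository rather than from this sketch.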