Answer, Assemble, Ace: Understanding How LMs Answer Multiple Choice Questions

Authors: Sarah Wiegreffe, Oyvind Tafjord, Yonatan Belinkov, Hanna Hajishirzi, Ashish Sabharwal

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We experiment on two challenging real-world multiple-choice datasets from LLM benchmarks: HellaSwag (Zellers et al., 2019) and MMLU (Hendrycks et al., 2021). Details are given in Appendix A.1. We additionally use a prototypical-colors dataset (Norlund et al., 2021) to create a synthetic 4-way task disentangling dataset-specific knowledge from the ability to perform symbol binding: Copying Colors from Context (Colors). Our findings, summarized in Figure 1, are: 1. When models are correct, they both encode information needed to predict the correct answer symbol and promote answer symbols in the vocabulary space in a very similar fashion across tasks, even when their overall task performance varies.
Researcher Affiliation | Collaboration | Allen Institute for AI, University of Washington, Technion
Pseudocode | No | The paper describes methodological steps in paragraph text (e.g., Section 4.1, Activation Patching, details steps 1-4) but does not include any explicitly labeled or structured pseudocode or algorithm blocks.
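Although the paper gives the activation-patching procedure only in prose, the general technique it names can be sketched in a few lines. The "model" below is a toy two-function pipeline, not the paper's LMs; all names are illustrative assumptions:

```python
# Minimal activation-patching sketch on a toy two-layer "model".
# Technique: (1) run a clean input and cache an intermediate activation,
# (2) run a corrupted input as a baseline, (3) re-run the corrupted input
# with the cached clean activation patched in, (4) measure how much the
# output recovers toward the clean run.

def layer1(x):          # stand-in for an early transformer layer
    return [2 * v for v in x]

def layer2(h):          # stand-in for the rest of the network
    return sum(h)

def run(x, patch=None):
    h = layer1(x)
    if patch is not None:       # step 3: overwrite the intermediate activation
        h = patch
    return layer2(h)

clean, corrupted = [1, 2, 3], [0, 0, 0]
clean_h = layer1(clean)                       # step 1: cache clean activation
base_out = run(corrupted)                     # step 2: corrupted baseline
patched_out = run(corrupted, patch=clean_h)   # step 3: patched run
effect = patched_out - base_out               # step 4: patching effect
print(base_out, patched_out, effect)          # 0 12 12
```

In a real LM the patch would be applied at a specific layer and token position (e.g., via forward hooks) and the effect measured on answer-symbol logits, but the four-step structure is the same.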
Open Source Code | Yes | Code is available at https://github.com/allenai/understanding_mcqa, and the Colors dataset at https://huggingface.co/datasets/sarahwie/copycolors_mcqa. We fixed random seeds to ensure full reproducibility of our experiments.
Open Datasets | Yes | We experiment on two challenging real-world multiple-choice datasets from LLM benchmarks: HellaSwag (Zellers et al., 2019) and MMLU (Hendrycks et al., 2021). Details are given in Appendix A.1. We additionally use a prototypical-colors dataset (Norlund et al., 2021) to create a synthetic 4-way task disentangling dataset-specific knowledge from the ability to perform symbol binding: Copying Colors from Context (Colors). The Colors dataset (https://huggingface.co/datasets/sarahwie/copycolors_mcqa) is also open-sourced.
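The Colors construction described above, where the answer is a prototypical color that appears verbatim in the context, can be illustrated with a minimal sketch. The prompt template and color lists here are illustrative assumptions, not the released dataset:

```python
import random

def make_instance(obj, color, distractors, rng):
    """Build a 4-way multiple-choice item whose answer string is
    copied directly from the context, testing symbol binding rather
    than dataset-specific knowledge."""
    choices = [color] + rng.sample(distractors, 3)
    rng.shuffle(choices)
    labels = "ABCD"
    lines = [f"{obj} are {color}."]                  # color given in context
    lines.append(f"Question: What color are {obj}?")
    lines += [f"{l}. {c}" for l, c in zip(labels, choices)]
    return "\n".join(lines), labels[choices.index(color)]

rng = random.Random(0)  # fixed seed, mirroring the paper's reproducibility note
prompt, gold = make_instance(
    "bananas", "yellow", ["red", "green", "blue", "purple"], rng)
print(prompt)
print("Answer:", gold)
```

A correct model only needs to bind the in-context color string to its answer symbol, which is what makes the task a clean probe of symbol binding.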
Dataset Splits | Yes | HellaSwag (Zellers et al., 2019): we sample a fixed set of 1000 instances from the test set used in our experiments, and 3 random training-set instances to serve as in-context examples. MMLU (Hendrycks et al., 2021): we sample a fixed set of 1000 instances from the test set, and a fixed set of 3 in-context example instances from the 5 provided for each topical area. Colors: we use 3 instances as in-context examples and the remaining 105 as our test set.
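A hedged sketch of the sampling step described above (function and variable names are assumptions; the paper fixes only the split sizes and the use of fixed random seeds):

```python
import random

def sample_split(test_pool, train_pool, n_test=1000, n_icl=3, seed=0):
    """Sample a fixed evaluation set and in-context examples.

    Mirrors the reported setup: 1000 fixed test instances each for
    HellaSwag and MMLU with 3 in-context examples; Colors instead uses
    3 in-context examples and the remaining 105 items as the test set.
    """
    rng = random.Random(seed)     # fixed seed -> same split every run
    test = rng.sample(test_pool, min(n_test, len(test_pool)))
    icl = rng.sample(train_pool, n_icl)
    return test, icl

# toy integer pools standing in for real dataset instances
test, icl = sample_split(list(range(5000)), list(range(100)))
print(len(test), len(icl))        # 1000 3
```

Because the seed is fixed, repeated calls reproduce the identical split, which is the property the paper's reproducibility claim rests on.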
Hardware Specification | No | The paper does not provide any specific details about the hardware used, such as GPU or CPU models or cloud computing instance types.
Software Dependencies | No | The paper mentions open-sourcing its code but does not specify any software dependencies with version numbers (e.g., Python, PyTorch, or other libraries).
Experiment Setup | Yes | We experiment on the base (OLMo 0724 7B) and instruction-tuned (OLMo 0724 7B Instruct) versions of the most recent (0724) release of the OLMo model (Groeneveld et al., 2024); these have 32 layers with 32 attention heads per layer. We experiment with the smallest Llama 3.1 models: the Llama 3.1 8B base model and Llama 3.1 8B Instruct (Dubey et al., 2024); these also have 32 layers with 32 attention heads per layer. We include the base and instruct versions of the 0.5B and 1.5B Qwen 2.5 models (Yang et al., 2024); the 1.5B model has 28 layers with 12 attention heads per layer. For each dataset instance, we construct four versions in which we vary the location of the correct answer string, and thus y ... We additionally include prompts Q/Z/R/X and 1/2/3/4. We fixed random seeds to ensure full reproducibility of our experiments.
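The four answer-position variants described above can be sketched as follows. The prompt template and helper name are illustrative assumptions; the paper additionally swaps in the Q/Z/R/X and 1/2/3/4 label sets, which this helper supports via its `labels` argument:

```python
def answer_position_variants(question, correct, distractors, labels="ABCD"):
    """Build four versions of one instance, placing the correct answer
    string at each of the four positions so the gold label y cycles
    through all four answer symbols."""
    variants = []
    for pos in range(4):
        choices = list(distractors)
        choices.insert(pos, correct)      # correct answer at position pos
        body = "\n".join(f"{l}. {c}" for l, c in zip(labels, choices))
        variants.append((f"{question}\n{body}\nAnswer:", labels[pos]))
    return variants

vs = answer_position_variants(
    "What color are bananas?", "yellow", ["red", "green", "blue"])
for prompt, gold in vs:
    print(gold)      # A, then B, C, D
```

Evaluating all four variants controls for positional bias: a model that merely prefers a particular answer symbol, rather than binding the correct string to its symbol, cannot score well on all four.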