Answer, Assemble, Ace: Understanding How LMs Answer Multiple Choice Questions

Authors: Sarah Wiegreffe, Oyvind Tafjord, Yonatan Belinkov, Hanna Hajishirzi, Ashish Sabharwal

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We experiment on two challenging real-world multiple-choice datasets from LLM benchmarks: HellaSwag (Zellers et al., 2019) and MMLU (Hendrycks et al., 2021). Details are given in Appendix A.1. We additionally use a prototypical-colors dataset (Norlund et al., 2021) to create a synthetic 4-way task disentangling dataset-specific knowledge from the ability to perform symbol binding: Copying Colors from Context (Colors). Our findings, summarized in Figure 1, are: 1. When models are correct, they both encode information needed to predict the correct answer symbol and promote answer symbols in the vocabulary space in a very similar fashion across tasks, even when their overall task performance varies.
Researcher Affiliation | Collaboration | Allen Institute for AI, University of Washington, Technion
Pseudocode | No | The paper describes methodological steps in paragraph text (e.g., Section 4.1, Activation Patching, details steps 1-4) but does not include any explicitly labeled or structured pseudocode or algorithm blocks.
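Although the paper gives the activation-patching procedure only in prose, the general technique it names can be sketched in a few lines. The "model" below is a toy two-function pipeline, not the paper's LMs; all names are illustrative assumptions:

```python
# Minimal activation-patching sketch on a toy two-layer "model".
# Technique: (1) run a clean input and cache an intermediate activation,
# (2) run a corrupted input as a baseline, (3) re-run the corrupted input
# with the cached clean activation patched in, (4) measure how much the
# output recovers toward the clean run.

def layer1(x):          # stand-in for an early transformer layer
    return [2 * v for v in x]

def layer2(h):          # stand-in for the rest of the network
    return sum(h)

def run(x, patch=None):
    h = layer1(x)
    if patch is not None:       # step 3: overwrite the intermediate activation
        h = patch
    return layer2(h)

clean, corrupted = [1, 2, 3], [0, 0, 0]
clean_h = layer1(clean)                       # step 1: cache clean activation
base_out = run(corrupted)                     # step 2: corrupted baseline
patched_out = run(corrupted, patch=clean_h)   # step 3: patched run
effect = patched_out - base_out               # step 4: patching effect
print(base_out, patched_out, effect)          # 0 12 12
```

In a real LM the patch would be applied at a specific layer and token position (e.g., via forward hooks) and the effect measured on answer-symbol logits, but the four-step structure is the same.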
Open Source Code | Yes | Code is available at https://github.com/allenai/understanding_mcqa, and the Colors dataset at https://huggingface.co/datasets/sarahwie/copycolors_mcqa. We fixed random seeds to ensure full reproducibility of our experiments.
Open Datasets | Yes | We experiment on two challenging real-world multiple-choice datasets from LLM benchmarks: HellaSwag (Zellers et al., 2019) and MMLU (Hendrycks et al., 2021). Details are given in Appendix A.1. We additionally use a prototypical-colors dataset (Norlund et al., 2021) to create a synthetic 4-way task disentangling dataset-specific knowledge from the ability to perform symbol binding: Copying Colors from Context (Colors). The Colors dataset (https://huggingface.co/datasets/sarahwie/copycolors_mcqa) is also open-sourced.
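The Colors construction described above, where the answer is a prototypical color that appears verbatim in the context, can be illustrated with a minimal sketch. The prompt template and color lists here are illustrative assumptions, not the released dataset:

```python
import random

def make_instance(obj, color, distractors, rng):
    """Build a 4-way multiple-choice item whose answer string is
    copied directly from the context, testing symbol binding rather
    than dataset-specific knowledge."""
    choices = [color] + rng.sample(distractors, 3)
    rng.shuffle(choices)
    labels = "ABCD"
    lines = [f"{obj} are {color}."]                  # color given in context
    lines.append(f"Question: What color are {obj}?")
    lines += [f"{l}. {c}" for l, c in zip(labels, choices)]
    return "\n".join(lines), labels[choices.index(color)]

rng = random.Random(0)  # fixed seed, mirroring the paper's reproducibility note
prompt, gold = make_instance(
    "bananas", "yellow", ["red", "green", "blue", "purple"], rng)
print(prompt)
print("Answer:", gold)
```

A correct model only needs to bind the in-context color string to its answer symbol, which is what makes the task a clean probe of symbol binding.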
Dataset Splits | Yes | HellaSwag (Zellers et al., 2019): we sample a fixed set of 1000 instances from the test set used in our experiments, and 3 random training-set instances to serve as in-context examples. MMLU (Hendrycks et al., 2021): we sample a fixed set of 1000 instances from the test set, and a fixed set of 3 in-context example instances from the 5 provided for each topical area. Colors: we use 3 instances as in-context examples and the remaining 105 as our test set.
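A hedged sketch of the sampling step described above (function and variable names are assumptions; the paper fixes only the split sizes and the use of fixed random seeds):

```python
import random

def sample_split(test_pool, train_pool, n_test=1000, n_icl=3, seed=0):
    """Sample a fixed evaluation set and in-context examples.

    Mirrors the reported setup: 1000 fixed test instances each for
    HellaSwag and MMLU with 3 in-context examples; Colors instead uses
    3 in-context examples and the remaining 105 items as the test set.
    """
    rng = random.Random(seed)     # fixed seed -> same split every run
    test = rng.sample(test_pool, min(n_test, len(test_pool)))
    icl = rng.sample(train_pool, n_icl)
    return test, icl

# toy integer pools standing in for real dataset instances
test, icl = sample_split(list(range(5000)), list(range(100)))
print(len(test), len(icl))        # 1000 3
```

Because the seed is fixed, repeated calls reproduce the identical split, which is the property the paper's reproducibility claim rests on.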
Hardware Specification | No | The paper does not provide any specific details about the hardware used, such as GPU or CPU models or cloud computing instance types.
Software Dependencies | No | The paper mentions open-sourcing its code but does not specify any software dependencies with version numbers (e.g., Python, PyTorch, or other libraries).
Experiment Setup | Yes | We experiment on the base (OLMo 0724 7B) and instruction-tuned (OLMo 0724 7B Instruct) versions of the most recent (0724) release of the OLMo model (Groeneveld et al., 2024); these have 32 layers with 32 attention heads per layer. We experiment with the smallest Llama 3.1 models: the Llama 3.1 8B base model and Llama 3.1 8B Instruct (Dubey et al., 2024); these also have 32 layers with 32 attention heads per layer. We include the base and instruct versions of the 0.5B and 1.5B Qwen 2.5 models (Yang et al., 2024); the 1.5B model has 28 layers with 12 attention heads per layer. For each dataset instance, we construct four versions in which we vary the location of the correct answer string, and thus y ... We additionally include prompts Q/Z/R/X and 1/2/3/4. We fixed random seeds to ensure full reproducibility of our experiments.
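The four answer-position variants described above can be sketched as follows. The prompt template and helper name are illustrative assumptions; the paper additionally swaps in the Q/Z/R/X and 1/2/3/4 label sets, which this helper supports via its `labels` argument:

```python
def answer_position_variants(question, correct, distractors, labels="ABCD"):
    """Build four versions of one instance, placing the correct answer
    string at each of the four positions so the gold label y cycles
    through all four answer symbols."""
    variants = []
    for pos in range(4):
        choices = list(distractors)
        choices.insert(pos, correct)      # correct answer at position pos
        body = "\n".join(f"{l}. {c}" for l, c in zip(labels, choices))
        variants.append((f"{question}\n{body}\nAnswer:", labels[pos]))
    return variants

vs = answer_position_variants(
    "What color are bananas?", "yellow", ["red", "green", "blue"])
for prompt, gold in vs:
    print(gold)      # A, then B, C, D
```

Evaluating all four variants controls for positional bias: a model that merely prefers a particular answer symbol, rather than binding the correct string to its symbol, cannot score well on all four.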