Benchmarking and Understanding Compositional Relational Reasoning of LLMs

Authors: Ruikang Ni, Da Xiao, Qingye Meng, Xiangyu Li, Shihui Zheng, Hongliang Liang

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate existing LLMs, e.g., the open-source Llama-2/3 (7B-70B) and closed-source GPT-3.5/4, on GAR to show that it is challenging for these LLMs despite appearing simple. Scaling helps, but the compositionality gap increases, revealing a fundamental deficiency of these LLMs in CRR. To understand how LLMs solve GAR tasks, we use attribution patching to discover the core circuits reused by Vicuna-33B across different tasks, as well as a set of vital attention heads. Intervention experiments show that the correct functioning of these heads significantly impacts task performance.
Researcher Affiliation | Collaboration | 1 Beijing University of Posts and Telecommunications, 2 Colorful Clouds Technology Co., Ltd., 3 ICBC UBS Asset Management. EMAIL, EMAIL, EMAIL
Pseudocode | No | The paper describes methods like 'step-wise patching similar to Wang et al. (2023)' and 'integrated gradients-based attribution patching', but does not present these or any other procedures in a structured pseudocode or algorithm block.
Open Source Code | Yes | Dataset and code: https://github.com/Caiyun-AI/GAR
Open Datasets | Yes | We propose a new synthetic benchmark called Generalized Associative Recall (GAR) by integrating and generalizing the essence of several tasks in MI, e.g., associative recall, knowledge recall, and indirect object identification (IOI), in a unified framework. GAR consists of a set of automatically generated tasks with varying forms (e.g., affirmative/negative, generation/classification) and difficulties. ... Dataset and code: https://github.com/Caiyun-AI/GAR
Dataset Splits | Yes | The GAR dataset consists of 192 generation tasks and 192 classification tasks, totaling 4608 examples, with 8/16 examples per generation/classification task. To obtain better performance, all examples are formatted as in-context one-shot learning.
Hardware Specification | No | The paper mentions evaluating LLMs such as Llama-2/3 7B-70B and GPT-3.5/4, and using Vicuna-33B for mechanistic interpretability studies. However, it does not provide specific details about the hardware (e.g., GPU models, CPU types, or memory) used to run these evaluations or studies.
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers, such as programming languages, libraries, or frameworks used for the experiments or analysis.
Experiment Setup | No | The paper mentions that 'all examples are formatted as in-context one-shot learning' and describes methods like 'attribution patching' and using 'KL divergence as the metric to compute gradient', but it does not specify concrete hyperparameters for training or fine-tuning models (e.g., learning rate, batch size, epochs, optimizer settings) or other detailed system-level experimental setup information.
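For context on the method named in the table: attribution patching approximates the effect of activation patching with a first-order Taylor expansion, scoring each component as (corrupt activation minus clean activation) times the gradient of the patching metric taken on the clean run. The following is a minimal sketch on a toy linear readout, not the paper's actual model or metric (the paper uses Vicuna-33B attention heads and KL divergence); the toy setup makes the approximation exact, which lets us check it against ground-truth patching:

```python
import numpy as np

# Toy "model": the output metric is a weighted sum of per-head activations.
# In a real LLM these would be attention-head outputs; here they are scalars.
rng = np.random.default_rng(0)
n_heads = 8
w = rng.normal(size=n_heads)  # fixed readout weights (assumption for the toy)

def head_activations(x):
    # Stand-in for running the model and caching per-head activations.
    return np.tanh(x)

def metric(acts):
    # Stand-in for the patching metric (e.g., KL divergence or a logit diff).
    return float(w @ acts)

a_clean = head_activations(rng.normal(size=n_heads))
a_corrupt = head_activations(rng.normal(size=n_heads))

# Gradient of the metric w.r.t. each head activation on the clean run.
# For this linear readout the gradient is simply w.
grad = w

# Attribution patching: first-order estimate of the effect of patching
# each head's corrupt activation into the clean run.
attr = (a_corrupt - a_clean) * grad

# Ground-truth patching effect for comparison (patch one head at a time).
exact = np.array([
    metric(np.where(np.arange(n_heads) == h, a_corrupt, a_clean)) - metric(a_clean)
    for h in range(n_heads)
])

# Because the readout is linear in the activations, the estimate is exact here;
# in a real transformer it is only a linear approximation.
assert np.allclose(attr, exact)
```

In practice the appeal of attribution patching is cost: one clean forward pass, one corrupt forward pass, and one backward pass score every component at once, instead of one patched forward pass per component.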
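The paper does not give GAR's exact data format in the text above; as a loose illustration of the kind of associative-recall-style item GAR generalizes, with the in-context one-shot formatting the table mentions, here is a sketch (the token vocabulary, item layout, and helper name are assumptions, not the benchmark's actual format):

```python
import random

def make_example(rng):
    # Hypothetical associative-recall item: bind keys to values,
    # then query one key; the answer is its bound value.
    keys = rng.sample("ABCDEFGH", 3)
    vals = rng.sample("12345678", 3)
    pairs = " ".join(f"{k} {v}" for k, v in zip(keys, vals))
    q = rng.randrange(3)
    return f"{pairs} {keys[q]}", vals[q]

rng = random.Random(0)
demo_ctx, demo_ans = make_example(rng)   # solved demonstration
test_ctx, test_ans = make_example(rng)   # item the model must complete

# One-shot formatting: one completed example, then the unsolved test item.
prompt = f"{demo_ctx} {demo_ans}\n{test_ctx}"
```

A model that has learned the recall pattern should continue `prompt` with `test_ans`; scoring that continuation per task is one plausible way the 8/16 examples per task could be evaluated.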