Benchmarking and Understanding Compositional Relational Reasoning of LLMs

Authors: Ruikang Ni, Da Xiao, Qingye Meng, Xiangyu Li, Shihui Zheng, Hongliang Liang

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate existing LLMs, e.g., the open-source Llama-2/3 (7B-70B) and closed-source GPT-3.5/4, on GAR to show that it is challenging for these LLMs despite appearing simple. Scaling helps, but the compositionality gap increases, revealing a fundamental deficiency of these LLMs in CRR. To understand how LLMs solve GAR tasks, we use attribution patching to discover the core circuits reused by Vicuna-33B across different tasks, as well as a set of vital attention heads. Intervention experiments show that the correct functioning of these heads significantly impacts task performance.
Researcher Affiliation | Collaboration | 1 Beijing University of Posts and Telecommunications, 2 Colorful Clouds Technology Co., Ltd., 3 ICBC UBS Asset Management. EMAIL, EMAIL, EMAIL
Pseudocode | No | The paper describes methods like 'step-wise patching similar to Wang et al. (2023)' and 'integrated gradients-based attribution patching', but does not present these or any other procedures in a structured pseudocode or algorithm block.
Open Source Code | Yes | Dataset and code: https://github.com/Caiyun-AI/GAR
Open Datasets | Yes | We propose a new synthetic benchmark called Generalized Associative Recall (GAR) by integrating and generalizing the essence of several tasks in MI, e.g., associative recall, knowledge recall, and indirect object identification (IOI), in a unified framework. GAR consists of a set of automatically generated tasks with varying forms (e.g., affirmative/negative, generation/classification) and difficulties. ... Dataset and code: https://github.com/Caiyun-AI/GAR
Dataset Splits | Yes | The GAR dataset consists of 192 generation tasks and 192 classification tasks, totaling 4608 examples, with 8/16 examples per generation/classification task. To obtain better performance, all examples are formatted as in-context one-shot learning.
Hardware Specification | No | The paper mentions evaluating LLMs such as Llama-2/3 7B-70B and GPT-3.5/4, and using Vicuna-33B for mechanistic interpretability studies. However, it does not provide specific details about the hardware (e.g., GPU models, CPU types, or memory) used to run these evaluations or studies.
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers, such as programming languages, libraries, or frameworks used for the experiments or analysis.
Experiment Setup | No | The paper mentions that 'all examples are formatted as in-context one-shot learning' and describes methods like 'attribution patching' and using 'KL divergence as the metric to compute gradient', but it does not specify concrete hyperparameters for training or fine-tuning models (e.g., learning rate, batch size, epochs, optimizer settings) or other detailed system-level experimental setup information.
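For context on the method named in the table: attribution patching approximates the effect of activation patching with a first-order Taylor expansion, scoring each component as (corrupt activation minus clean activation) times the gradient of the patching metric taken on the clean run. The following is a minimal sketch on a toy linear readout, not the paper's actual model or metric (the paper uses Vicuna-33B attention heads and KL divergence); the toy setup makes the approximation exact, which lets us check it against ground-truth patching:

```python
import numpy as np

# Toy "model": the output metric is a weighted sum of per-head activations.
# In a real LLM these would be attention-head outputs; here they are scalars.
rng = np.random.default_rng(0)
n_heads = 8
w = rng.normal(size=n_heads)  # fixed readout weights (assumption for the toy)

def head_activations(x):
    # Stand-in for running the model and caching per-head activations.
    return np.tanh(x)

def metric(acts):
    # Stand-in for the patching metric (e.g., KL divergence or a logit diff).
    return float(w @ acts)

a_clean = head_activations(rng.normal(size=n_heads))
a_corrupt = head_activations(rng.normal(size=n_heads))

# Gradient of the metric w.r.t. each head activation on the clean run.
# For this linear readout the gradient is simply w.
grad = w

# Attribution patching: first-order estimate of the effect of patching
# each head's corrupt activation into the clean run.
attr = (a_corrupt - a_clean) * grad

# Ground-truth patching effect for comparison (patch one head at a time).
exact = np.array([
    metric(np.where(np.arange(n_heads) == h, a_corrupt, a_clean)) - metric(a_clean)
    for h in range(n_heads)
])

# Because the readout is linear in the activations, the estimate is exact here;
# in a real transformer it is only a linear approximation.
assert np.allclose(attr, exact)
```

In practice the appeal of attribution patching is cost: one clean forward pass, one corrupt forward pass, and one backward pass score every component at once, instead of one patched forward pass per component.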
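The paper does not give GAR's exact data format in the text above; as a loose illustration of the kind of associative-recall-style item GAR generalizes, with the in-context one-shot formatting the table mentions, here is a sketch (the token vocabulary, item layout, and helper name are assumptions, not the benchmark's actual format):

```python
import random

def make_example(rng):
    # Hypothetical associative-recall item: bind keys to values,
    # then query one key; the answer is its bound value.
    keys = rng.sample("ABCDEFGH", 3)
    vals = rng.sample("12345678", 3)
    pairs = " ".join(f"{k} {v}" for k, v in zip(keys, vals))
    q = rng.randrange(3)
    return f"{pairs} {keys[q]}", vals[q]

rng = random.Random(0)
demo_ctx, demo_ans = make_example(rng)   # solved demonstration
test_ctx, test_ans = make_example(rng)   # item the model must complete

# One-shot formatting: one completed example, then the unsolved test item.
prompt = f"{demo_ctx} {demo_ans}\n{test_ctx}"
```

A model that has learned the recall pattern should continue `prompt` with `test_ans`; scoring that continuation per task is one plausible way the 8/16 examples per task could be evaluated.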