DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph

Authors: Zhehao Zhang, Jiaao Chen, Diyi Yang

Venue: NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We apply our DARG framework to diverse reasoning tasks in four domains with 15 state-of-the-art LLMs. Experimental results show that almost all LLMs experience a performance decrease with increased complexity, and certain LLMs exhibit significant drops."
Researcher Affiliation | Academia | Zhehao Zhang (Dartmouth College), Jiaao Chen (Georgia Institute of Technology), Diyi Yang (Stanford University)
Pseudocode | Yes | "Algorithm 1: Algorithm of DARG" (a hedged sketch of this loop appears after the table)
Open Source Code | Yes | "The code is available at https://github.com/SALT-NLP/DARG."
Open Datasets | Yes | "For each of the tasks, we utilized the most used datasets: GSM8K [19] for math reasoning, BBQ [2] for social reasoning, the BBH Navigate [91] dataset for spatial reasoning, and BBH Dyck Language for symbolic reasoning." (a loading example follows the table)
Dataset Splits | Yes | "We construct a hold-out validation set, which contains 0.05% of the data points from each complexity dimension generated by DARG, and the others are used for training." (a split sketch follows the table)
Hardware Specification | Yes | "Other models are run locally on a machine with an Nvidia A100 GPU with 40 GB of GPU memory and a 12-core CPU."
Software Dependencies | Yes | "For fine-tuning and subsequent inference, we employ LitGPT [3] along with its default hyperparameters (learning_rate=0.0003, weight_decay=0.02, beta1=0.9, beta2=0.95, max_norm=None, min_lr=6e-05, epochs=5) and LoRA [37]." (collected into a config dict after the table)
Experiment Setup | Yes | "For graph construction and graph-to-text decoding, we set the temperature to 1. For all evaluation experiments, we set the temperature to 0.1 to ensure reproducibility and the top_p to 0.95. For fine-tuning and subsequent inference, we employ LitGPT [3] along with its default hyperparameters and LoRA [37]." (decoding settings illustrated after the table)
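
The Pseudocode row above points to Algorithm 1 of the paper. Below is a minimal sketch of that loop, assuming the pipeline described in the paper's abstract (reasoning-graph construction, complexity-increasing perturbation, graph-to-text decoding, and code-augmented label verification); the callables and their signatures are hypothetical placeholders, not the API of the SALT-NLP/DARG repository.

```python
from typing import Any, Callable, Iterable, List

def darg_generate(
    benchmark: Iterable[Any],
    complexity_dims: List[str],
    build_graph: Callable[[Any], Any],
    perturb: Callable[[Any, str], Any],
    decode: Callable[[Any], Any],
    verify: Callable[[Any], bool],
) -> List[Any]:
    """Illustrative paraphrase of DARG's Algorithm 1 (not the released code)."""
    new_points = []
    for point in benchmark:
        graph = build_graph(point)        # reasoning-graph construction
        for dim in complexity_dims:       # e.g. width, depth, numerical value
            harder = perturb(graph, dim)  # complexity-increasing graph edit
            candidate = decode(harder)    # graph-to-text decoding
            if verify(candidate):         # code-augmented label check
                new_points.append(candidate)
    return new_points
```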
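Of the datasets listed above, GSM8K is straightforward to pull from the Hugging Face Hub; the snippet below shows one way to do it, assuming the hub identifier gsm8k with its standard main config (BBQ and the BBH subsets ship through their own repositories and are not shown).

```python
from datasets import load_dataset

# GSM8K math-reasoning benchmark; "main" is its standard configuration.
gsm8k = load_dataset("gsm8k", "main", split="test")

# Each record pairs a word problem with a worked answer.
print(gsm8k[0]["question"])
print(gsm8k[0]["answer"])
```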
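The hold-out construction (0.05% of the points from each complexity dimension) amounts to a stratified split. A minimal sketch, assuming the generated points are grouped in a dict keyed by complexity dimension (a hypothetical layout, not the repository's format):

```python
import random

def holdout_split(points_by_dim, frac=0.0005, seed=0):
    """Hold out `frac` (0.05% per the report) of the points from each
    complexity dimension for validation; the rest are used for training."""
    rng = random.Random(seed)
    train, val = [], []
    for points in points_by_dim.values():
        shuffled = list(points)
        rng.shuffle(shuffled)
        k = max(1, int(len(shuffled) * frac))  # keep at least one point
        val.extend(shuffled[:k])
        train.extend(shuffled[k:])
    return train, val
```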
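For quick reference, the reported LitGPT/LoRA fine-tuning hyperparameters are collected below as a plain Python dict; this is not a LitGPT config file, and LitGPT's own option names may differ by version.

```python
# Reported LitGPT/LoRA fine-tuning hyperparameters, as stated in the table.
# Reference only; map these onto LitGPT's own config or CLI.
FINETUNE_HPARAMS = {
    "learning_rate": 3e-4,
    "weight_decay": 0.02,
    "beta1": 0.9,          # Adam beta1
    "beta2": 0.95,         # Adam beta2
    "max_norm": None,      # no gradient clipping
    "min_lr": 6e-5,        # learning-rate schedule floor
    "epochs": 5,
}
```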
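The sampling settings map directly onto standard chat-completion parameters. A minimal sketch using the OpenAI Python client as a stand-in (the paper evaluates 15 LLMs, so the client, model name, and prompt here are placeholders):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Generation phase (graph construction / graph-to-text): temperature = 1.
gen = client.chat.completions.create(
    model="gpt-4",  # placeholder; the paper evaluates 15 different LLMs
    messages=[{"role": "user", "content": "..."}],
    temperature=1.0,
)

# Evaluation phase: temperature = 0.1 and top_p = 0.95 for reproducibility.
eval_resp = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "..."}],
    temperature=0.1,
    top_p=0.95,
)
```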