DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph

Authors: Zhehao Zhang, Jiaao Chen, Diyi Yang

Venue: NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We apply our DARG framework to diverse reasoning tasks in four domains with 15 state-of-the-art LLMs. Experimental results show that almost all LLMs experience a performance decrease with increased complexity, and certain LLMs exhibit significant drops."
Researcher Affiliation | Academia | Zhehao Zhang (Dartmouth College), Jiaao Chen (Georgia Institute of Technology), Diyi Yang (Stanford University)
Pseudocode | Yes | "Algorithm 1: Algorithm of DARG" (a hedged sketch of this loop appears after the table)
Open Source Code | Yes | "The code is available at https://github.com/SALT-NLP/DARG."
Open Datasets | Yes | "For each of the tasks, we utilized the most used datasets: GSM8K [19] for math reasoning, BBQ [2] for social reasoning, the BBH Navigate [91] dataset for spatial reasoning, and BBH Dyck Language for symbolic reasoning." (a loading example follows the table)
Dataset Splits | Yes | "We construct a hold-out validation set, which contains 0.05% of the data points from each complexity dimension generated by DARG, and the others are used for training." (a split sketch follows the table)
Hardware Specification | Yes | "Other models are run locally on a machine with an Nvidia A100 GPU with 40 GB of GPU memory and a 12-core CPU."
Software Dependencies | Yes | "For fine-tuning and subsequent inference, we employ LitGPT [3] along with its default hyperparameters (learning_rate=0.0003, weight_decay=0.02, beta1=0.9, beta2=0.95, max_norm=None, min_lr=6e-05, epochs=5) and LoRA [37]." (collected into a config dict after the table)
Experiment Setup | Yes | "For graph construction and graph-to-text decoding, we set the temperature to 1. For all evaluation experiments, we set the temperature to 0.1 to ensure reproducibility and the top_p to 0.95. For fine-tuning and subsequent inference, we employ LitGPT [3] along with its default hyperparameters and LoRA [37]." (decoding settings illustrated after the table)
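
The Pseudocode row above points to Algorithm 1 of the paper. Below is a minimal sketch of that loop, assuming the pipeline described in the paper's abstract (reasoning-graph construction, complexity-increasing perturbation, graph-to-text decoding, and code-augmented label verification); the callables and their signatures are hypothetical placeholders, not the API of the SALT-NLP/DARG repository.

```python
from typing import Any, Callable, Iterable, List

def darg_generate(
    benchmark: Iterable[Any],
    complexity_dims: List[str],
    build_graph: Callable[[Any], Any],
    perturb: Callable[[Any, str], Any],
    decode: Callable[[Any], Any],
    verify: Callable[[Any], bool],
) -> List[Any]:
    """Illustrative paraphrase of DARG's Algorithm 1 (not the released code)."""
    new_points = []
    for point in benchmark:
        graph = build_graph(point)        # reasoning-graph construction
        for dim in complexity_dims:       # e.g. width, depth, numerical value
            harder = perturb(graph, dim)  # complexity-increasing graph edit
            candidate = decode(harder)    # graph-to-text decoding
            if verify(candidate):         # code-augmented label check
                new_points.append(candidate)
    return new_points
```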
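Of the datasets listed above, GSM8K is straightforward to pull from the Hugging Face Hub; the snippet below shows one way to do it, assuming the hub identifier gsm8k with its standard main config (BBQ and the BBH subsets ship through their own repositories and are not shown).

```python
from datasets import load_dataset

# GSM8K math-reasoning benchmark; "main" is its standard configuration.
gsm8k = load_dataset("gsm8k", "main", split="test")

# Each record pairs a word problem with a worked answer.
print(gsm8k[0]["question"])
print(gsm8k[0]["answer"])
```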
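The hold-out construction (0.05% of the points from each complexity dimension) amounts to a stratified split. A minimal sketch, assuming the generated points are grouped in a dict keyed by complexity dimension (a hypothetical layout, not the repository's format):

```python
import random

def holdout_split(points_by_dim, frac=0.0005, seed=0):
    """Hold out `frac` (0.05% per the report) of the points from each
    complexity dimension for validation; the rest are used for training."""
    rng = random.Random(seed)
    train, val = [], []
    for points in points_by_dim.values():
        shuffled = list(points)
        rng.shuffle(shuffled)
        k = max(1, int(len(shuffled) * frac))  # keep at least one point
        val.extend(shuffled[:k])
        train.extend(shuffled[k:])
    return train, val
```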
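For quick reference, the reported LitGPT/LoRA fine-tuning hyperparameters are collected below as a plain Python dict; this is not a LitGPT config file, and LitGPT's own option names may differ by version.

```python
# Reported LitGPT/LoRA fine-tuning hyperparameters, as stated in the table.
# Reference only; map these onto LitGPT's own config or CLI.
FINETUNE_HPARAMS = {
    "learning_rate": 3e-4,
    "weight_decay": 0.02,
    "beta1": 0.9,          # Adam beta1
    "beta2": 0.95,         # Adam beta2
    "max_norm": None,      # no gradient clipping
    "min_lr": 6e-5,        # learning-rate schedule floor
    "epochs": 5,
}
```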
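The sampling settings map directly onto standard chat-completion parameters. A minimal sketch using the OpenAI Python client as a stand-in (the paper evaluates 15 LLMs, so the client, model name, and prompt here are placeholders):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Generation phase (graph construction / graph-to-text): temperature = 1.
gen = client.chat.completions.create(
    model="gpt-4",  # placeholder; the paper evaluates 15 different LLMs
    messages=[{"role": "user", "content": "..."}],
    temperature=1.0,
)

# Evaluation phase: temperature = 0.1 and top_p = 0.95 for reproducibility.
eval_resp = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "..."}],
    temperature=0.1,
    top_p=0.95,
)
```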