ℵuto∃∨∧L: Autonomous Evaluation of LLMs for Truth Maintenance and Reasoning Tasks
Authors: Rushang Karia, Daniel Bramblett, Daksh Dobhal, Siddharth Srivastava
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical analysis shows that an LLM's performance on ℵuto∃∨∧L is highly indicative of its performance on a diverse array of other benchmarks focusing on translation and reasoning tasks, making it a valuable autonomous evaluation paradigm in settings where hand-curated datasets can be hard to obtain and/or update. |
| Researcher Affiliation | Academia | Rushang Karia, Daniel Bramblett, Daksh Dobhal, Siddharth Srivastava — School of Computing and Augmented Intelligence, Arizona State University |
| Pseudocode | Yes | Algorithm 1 Dataset Generation. 1: Inputs: CFG G, vocabulary V, branching factor n, tree depth depth, sample count sample_count, and categorization metric m. 2: Outputs: set of FS expressions φ. 3: N ← {0: [None]}, N_t ← ∅. 4: for d = 1, 2, ..., depth do 5: N′ ← sampleN(N[d−1], n) 6: for ν ∈ N′ do 7: N_ν, T_ν ← generateNChildren(ν, G, n) 8: N[d] += N_ν 9: N_t ← N_t ∪ T_ν 10: end for 11: end for 12: M ← categorizeExpressionsIntoDict(N_t, m) 13: φ ← ∅ 14: for k ∈ keys(M) do 15: M_k ← sampleCFGExpressions(M[k], sample_count) 16: φ_k ← buildFSExpressions(M_k, V) 17: φ ← φ ∪ φ_k 18: end for 19: Return: φ |
| Open Source Code | Yes | ℵuto∃∨∧L is open-source¹, is written in Python 3, includes several pre-computed datasets, and is easily customizable for adding new datasets, prompts, LLMs, etc. ¹The code for this project is available at: https://github.com/AAIR-lab/autoeval. |
| Open Datasets | Yes | The ℵuto∃∨∧L core benchmark uses four CFGs (Fig. 2) for producing five datasets comprising FL strings. We provide 5 datasets, with 2 generated from the FOL CFG and 1 each for the rest. ℵuto∃∨∧L is open-source¹, is written in Python 3, includes several pre-computed datasets, and is easily customizable for adding new datasets, prompts, LLMs, etc. |
| Dataset Splits | No | The paper describes how datasets are generated and sampled for evaluation, e.g., 'We sampled 500 strings for each complexity level.' and 'We generated 10 batches for each dataset, resulting in approximately 20k samples for each dataset with an equal distribution for each operator number.' However, it does not specify explicit training/test/validation splits, because it focuses on evaluating pre-trained LLMs rather than training models. |
| Hardware Specification | Yes | The open-source models LLama3-8B-Instruct and Mistral-v0.2-7B-Instruct were locally hosted on a server with a 13th Gen Intel(R) Core(TM) i9-13900K and an Nvidia RTX 4090 GPU, using the models' default parameters with a temperature of 1. Similarly, Phi-3-medium-4k-instruct was locally hosted on a server using an Nvidia A100-SXM4-80GB GPU. Verification was performed on an AMD EPYC machine with 128 cores. The larger open-source models, Yi-1.5-34B-Instruct and Llama-3-70B-Instruct, were run on Arizona State University's Sol supercomputer (Jennewein et al., 2023). |
| Software Dependencies | Yes | We ran our experiments using Python 3.10.13 with package versions shown in Table 2. Table 2: Python package versions used for empirical evaluation: openai 1.45.0, anthropic 0.26.1, transformers 4.41.1, nltk 3.8.1, backoff 2.2.1, Faker 25.2.0, tqdm 4.66.4, tiktoken 0.6.0, networkx 3.3. |
| Experiment Setup | Yes | The closed-source models (GPT3.5-turbo, GPT-4o, and GPT-4o-mini) were accessed via their APIs using a temperature of 0.1. The open-source models LLama3-8B-Instruct and Mistral-v0.2-7B-Instruct were locally hosted on a server with a 13th Gen Intel(R) Core(TM) i9-13900K and an Nvidia RTX 4090 GPU, using the models' default parameters with a temperature of 1. Table 1: Hyperparameters used for producing the five datasets: depth = 40 (maximum depth of the CFG tree); n = 200 (branching factor of the produced CFG tree); sample_count = 50 (number of CFGs for each metric value to select). |
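
Algorithm 1 quoted in the table grows a CFG parse tree level by level, collects fully terminal strings, buckets them by a categorization metric, and samples a fixed count per bucket. Below is a minimal runnable sketch of that flow using a toy propositional-logic grammar and an operator-count metric; the grammar, helper names, and default hyperparameters are illustrative assumptions, not the authors' implementation.

```python
import random
from collections import defaultdict

# Toy propositional-logic CFG (illustrative): nonterminal -> productions.
TOY_CFG = {"E": [["E", "and", "E"], ["E", "or", "E"], ["not", "E"], ["p"]]}

def is_terminal(node, cfg):
    """True if the token list contains no nonterminals."""
    return not any(sym in cfg for sym in node)

def expand(node, cfg, rng):
    """Replace the leftmost nonterminal with a randomly chosen production."""
    for i, sym in enumerate(node):
        if sym in cfg:
            return node[:i] + rng.choice(cfg[sym]) + node[i + 1:]
    return node

def generate_dataset(cfg, start="E", depth=5, n=20, sample_count=3, seed=0):
    rng = random.Random(seed)
    frontier, terminals = [[start]], []
    for _ in range(depth):                              # grow tree level by level
        frontier = rng.sample(frontier, min(n, len(frontier)))
        next_frontier = []
        for node in frontier:
            for _ in range(n):                          # branching factor n
                child = expand(node, cfg, rng)
                if is_terminal(child, cfg):
                    terminals.append(child)
                else:
                    next_frontier.append(child)
        if not next_frontier:
            break
        frontier = next_frontier
    buckets = defaultdict(list)                         # categorize by metric
    for expr in terminals:                              # metric: operator count
        buckets[sum(t in ("and", "or", "not") for t in expr)].append(expr)
    return {k: rng.sample(v, min(sample_count, len(v)))  # sample_count per bucket
            for k, v in buckets.items()}

dataset = generate_dataset(TOY_CFG)
```

Each key of the returned dictionary is a metric value (here, the number of logical operators), mirroring how the paper stratifies its five datasets by expression complexity.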
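
When reproducing the environment, it can help to verify installed package versions against the pins in Table 2 before running experiments. The stdlib-only helper below is a convenience sketch, not part of the ℵuto∃∨∧L codebase; only the pin dictionary comes from the paper.

```python
from importlib.metadata import version, PackageNotFoundError

# Version pins taken from Table 2 of the paper.
PINNED = {
    "openai": "1.45.0", "anthropic": "0.26.1", "transformers": "4.41.1",
    "nltk": "3.8.1", "backoff": "2.2.1", "Faker": "25.2.0",
    "tqdm": "4.66.4", "tiktoken": "0.6.0", "networkx": "3.3",
}

def check_pins(pinned):
    """Return {package: installed_version_or_None} for every mismatched pin."""
    mismatches = {}
    for pkg, want in pinned.items():
        try:
            got = version(pkg)
        except PackageNotFoundError:
            got = None  # package not installed at all
        if got != want:
            mismatches[pkg] = got
    return mismatches
```

Running `check_pins(PINNED)` returns an empty dictionary only when every package matches the paper's recorded version, which narrows down environment drift as a source of non-reproducibility.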