ℵuto∃∨∧L: Autonomous Evaluation of LLMs for Truth Maintenance and Reasoning Tasks
Authors: Rushang Karia, Daniel Bramblett, Daksh Dobhal, Siddharth Srivastava
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical analysis shows that an LLM's performance on ℵuto∃∨∧L is highly indicative of its performance on a diverse array of other benchmarks focusing on translation and reasoning tasks, making it a valuable autonomous evaluation paradigm in settings where hand-curated datasets can be hard to obtain and/or update. |
| Researcher Affiliation | Academia | Rushang Karia, Daniel Bramblett, Daksh Dobhal, Siddharth Srivastava — School of Computing and Augmented Intelligence, Arizona State University |
| Pseudocode | Yes | Algorithm 1 Dataset Generation. 1: Inputs: CFG G, vocabulary V, branching factor n, tree depth depth, sample count sample_count, and categorization metric m. 2: Outputs: set of FS expressions φ. 3: N ← {0: [None]}, N_t ← ∅. 4: for d = 1, 2, ..., depth do 5: N′ ← sampleN(N[d−1], n) 6: for ν ∈ N′ do 7: N_ν, T_ν ← generateNChildren(ν, G, n) 8: N[d] += N_ν 9: N_t ← N_t ∪ T_ν 10: end for 11: end for 12: M ← categorizeExpressionsIntoDict(N_t, m) 13: φ ← ∅ 14: for k ∈ keys(M) do 15: M_k ← sampleCFGExpressions(M[k], sample_count) 16: φ_k ← buildFSExpressions(M_k, V) 17: φ ← φ ∪ φ_k 18: end for 19: Return: φ |
| Open Source Code | Yes | ℵuto∃∨∧L is open-source¹, is written in Python 3, includes several pre-computed datasets, and is easily customizable for adding new datasets, prompts, LLMs, etc. ¹The code for this project is available at: https://github.com/AAIR-lab/autoeval. |
| Open Datasets | Yes | The ℵuto∃∨∧L core benchmark uses four CFGs (Fig. 2) for producing five datasets comprising FL strings. We provide 5 datasets, with 2 generated from the FOL CFG and 1 each for the rest. ℵuto∃∨∧L is open-source¹, is written in Python 3, includes several pre-computed datasets, and is easily customizable for adding new datasets, prompts, LLMs, etc. |
| Dataset Splits | No | The paper describes how datasets are generated and sampled for evaluation, e.g., 'We sampled 500 strings for each complexity level.' and 'We generated 10 batches for each dataset, resulting in approximately 20k samples for each dataset with an equal distribution for each operator number.' However, it does not specify explicit training/test/validation splits, because it focuses on evaluating pre-trained LLMs rather than training models. |
| Hardware Specification | Yes | The open-source models LLama3-8B-Instruct and Mistral-v0.2-7B-Instruct were locally hosted on a server with a 13th Gen Intel(R) Core(TM) i9-13900K and an Nvidia RTX 4090 GPU, using the models' default parameters with a temperature of 1. Similarly, Phi-3-medium-4k-instruct was locally hosted on a server using an Nvidia A100-SXM4-80GB GPU. Verification was performed on an AMD EPYC machine with 128 cores. The larger open-source models, Yi-1.5-34B-Instruct and Llama-3-70B-Instruct, were run on Arizona State University's Sol supercomputer (Jennewein et al., 2023). |
| Software Dependencies | Yes | We ran our experiments using Python 3.10.13 with package versions shown in Table 2. Table 2: Python package versions used for empirical evaluation: openai 1.45.0, anthropic 0.26.1, transformers 4.41.1, nltk 3.8.1, backoff 2.2.1, Faker 25.2.0, tqdm 4.66.4, tiktoken 0.6.0, networkx 3.3. |
| Experiment Setup | Yes | The closed-source models (GPT3.5-turbo, GPT-4o, and GPT-4o-mini) were accessed via their APIs using a temperature of 0.1. The open-source models LLama3-8B-Instruct and Mistral-v0.2-7B-Instruct were locally hosted on a server with a 13th Gen Intel(R) Core(TM) i9-13900K and an Nvidia RTX 4090 GPU, using the models' default parameters with a temperature of 1. Table 1: Hyperparameters used for producing the five datasets: depth = 40 (maximum depth of the CFG tree); n = 200 (branching factor of the produced CFG tree); sample_count = 50 (number of CFGs for each metric value to select). |
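
Algorithm 1 quoted in the table grows a CFG parse tree level by level, collects fully terminal strings, buckets them by a categorization metric, and samples a fixed count per bucket. Below is a minimal runnable sketch of that flow using a toy propositional-logic grammar and an operator-count metric; the grammar, helper names, and default hyperparameters are illustrative assumptions, not the authors' implementation.

```python
import random
from collections import defaultdict

# Toy propositional-logic CFG (illustrative): nonterminal -> productions.
TOY_CFG = {"E": [["E", "and", "E"], ["E", "or", "E"], ["not", "E"], ["p"]]}

def is_terminal(node, cfg):
    """True if the token list contains no nonterminals."""
    return not any(sym in cfg for sym in node)

def expand(node, cfg, rng):
    """Replace the leftmost nonterminal with a randomly chosen production."""
    for i, sym in enumerate(node):
        if sym in cfg:
            return node[:i] + rng.choice(cfg[sym]) + node[i + 1:]
    return node

def generate_dataset(cfg, start="E", depth=5, n=20, sample_count=3, seed=0):
    rng = random.Random(seed)
    frontier, terminals = [[start]], []
    for _ in range(depth):                              # grow tree level by level
        frontier = rng.sample(frontier, min(n, len(frontier)))
        next_frontier = []
        for node in frontier:
            for _ in range(n):                          # branching factor n
                child = expand(node, cfg, rng)
                if is_terminal(child, cfg):
                    terminals.append(child)
                else:
                    next_frontier.append(child)
        if not next_frontier:
            break
        frontier = next_frontier
    buckets = defaultdict(list)                         # categorize by metric
    for expr in terminals:                              # metric: operator count
        buckets[sum(t in ("and", "or", "not") for t in expr)].append(expr)
    return {k: rng.sample(v, min(sample_count, len(v)))  # sample_count per bucket
            for k, v in buckets.items()}

dataset = generate_dataset(TOY_CFG)
```

Each key of the returned dictionary is a metric value (here, the number of logical operators), mirroring how the paper stratifies its five datasets by expression complexity.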
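
When reproducing the environment, it can help to verify installed package versions against the pins in Table 2 before running experiments. The stdlib-only helper below is a convenience sketch, not part of the ℵuto∃∨∧L codebase; only the pin dictionary comes from the paper.

```python
from importlib.metadata import version, PackageNotFoundError

# Version pins taken from Table 2 of the paper.
PINNED = {
    "openai": "1.45.0", "anthropic": "0.26.1", "transformers": "4.41.1",
    "nltk": "3.8.1", "backoff": "2.2.1", "Faker": "25.2.0",
    "tqdm": "4.66.4", "tiktoken": "0.6.0", "networkx": "3.3",
}

def check_pins(pinned):
    """Return {package: installed_version_or_None} for every mismatched pin."""
    mismatches = {}
    for pkg, want in pinned.items():
        try:
            got = version(pkg)
        except PackageNotFoundError:
            got = None  # package not installed at all
        if got != want:
            mismatches[pkg] = got
    return mismatches
```

Running `check_pins(PINNED)` returns an empty dictionary only when every package matches the paper's recorded version, which narrows down environment drift as a source of non-reproducibility.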