Active Task Disambiguation with LLMs
Authors: Katarzyna Kobalczyk, Nicolás Astorga, Tennison Liu, Mihaela van der Schaar
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results demonstrate that this form of question selection leads to more effective task disambiguation in comparison to approaches relying on reasoning solely within the space of questions. |
| Researcher Affiliation | Academia | Katarzyna Kobalczyk, Nicolás Astorga, Tennison Liu, & Mihaela van der Schaar, DAMTP, University of Cambridge |
| Pseudocode | Yes | Algorithm 1: estimate EIG(q_j, {a_{i,j}}_{i=1}^N). Require: a question q_j and a set of N answers {a_{i,j}}_{i=1}^N. {a_1, ..., a_n} ← unique answers in {a_{i,j}}_{i=1}^N; for k ∈ {1, ..., n} do: n_k ← \|{i : a_{i,j} = a_k, i ∈ [N]}\|; p_k ← n_k / N; end for; return −Σ_{k=1}^n p_k log(p_k) |
| Open Source Code | Yes | Code for reproducing the experimental results of section 4.2 is made available at: https://github.com/kasia-kobalczyk/active-task-disambiguation. The repository also includes generated programs and queries with GPT-3.5-turbo and GPT-4o-mini. |
| Open Datasets | Yes | We evaluate all question-generating strategies on the HumanEval benchmark containing simple coding problems (Chen et al., 2021), and the more challenging APPS (Hendrycks et al., 2021) benchmark with competition-level coding challenges. |
| Dataset Splits | Yes | For each task, we run the iterative requirement querying for 10 iterations across 5 seeds. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models used for running experiments. |
| Software Dependencies | No | The paper mentions using specific Large Language Models (GPT-3.5-turbo, GPT-4o-mini, Llama3-8B, Llama3-70B) and an external Python interpreter, but does not provide specific version numbers for any software dependencies or libraries. |
| Experiment Setup | Yes | To reduce the sampling costs, we play the game for 10 instead of the original 20 rounds. We restrict the game to the category of animals. Here, the set of acceptable solutions H are singletons {h*}, where h* represents a single animal name that player A may think about. Player A is simulated with GPT-4o-mini prompted to answer questions about the ground-truth animal h*. For solution and question generation we use N = 20 and M = 5, respectively. |
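The EIG estimator quoted in the Pseudocode row reduces to the Shannon entropy of the empirical answer distribution for a candidate question. A minimal Python sketch of that computation (the function name `estimate_eig` is ours, not from the paper; we assume answers are hashable strings):

```python
from collections import Counter
import math

def estimate_eig(answers):
    """Estimate the expected information gain of a question q_j as the
    Shannon entropy of the empirical distribution over its N sampled
    answers {a_{i,j}}, following Algorithm 1 of the paper.
    """
    counts = Counter(answers)   # n_k: occurrences of each unique answer a_k
    total = len(answers)        # N: total number of sampled answers
    # H = -sum_k p_k log p_k, with p_k = n_k / N
    return -sum((n / total) * math.log(n / total) for n in counts.values())
```

A question whose sampled answers split evenly (e.g. 10 "yes" and 10 "no") attains the maximum entropy log 2 ≈ 0.693, while a question every simulated user answers identically scores 0, so it carries no disambiguating information.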