Active Task Disambiguation with LLMs
Authors: Katarzyna Kobalczyk, Nicolás Astorga, Tennison Liu, Mihaela van der Schaar
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results demonstrate that this form of question selection leads to more effective task disambiguation in comparison to approaches relying on reasoning solely within the space of questions. |
| Researcher Affiliation | Academia | Katarzyna Kobalczyk, Nicolás Astorga, Tennison Liu, & Mihaela van der Schaar, DAMTP, University of Cambridge |
| Pseudocode | Yes | Algorithm 1: estimate EIG(q_j, {a_{i,j}}_{i=1}^N). Require: a question q_j and a set of N answers {a_{i,j}}_{i=1}^N. {a_1, ..., a_n} ← unique answers in {a_{i,j}}_{i=1}^N; for k ∈ {1, ..., n} do: n_k ← \|{i : a_{i,j} = a_k, i ∈ [N]}\|; p_k ← n_k / N; end for; return −Σ_{k=1}^n p_k log(p_k) |
| Open Source Code | Yes | Code for reproducing the experimental results of section 4.2 is made available at: https://github.com/kasia-kobalczyk/active-task-disambiguation. The repository also includes generated programs and queries with GPT-3.5-turbo and GPT-4o-mini. |
| Open Datasets | Yes | We evaluate all question-generating strategies on the HumanEval benchmark containing simple coding problems (Chen et al., 2021), and the more challenging APPS (Hendrycks et al., 2021) benchmark with competition-level coding challenges. |
| Dataset Splits | Yes | For each task, we run the iterative requirement querying for 10 iterations across 5 seeds. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models used for running experiments. |
| Software Dependencies | No | The paper mentions using specific Large Language Models (GPT-3.5-turbo, GPT-4o-mini, Llama3-8B, Llama3-70B) and an external Python interpreter, but does not provide specific version numbers for any software dependencies or libraries. |
| Experiment Setup | Yes | To reduce the sampling costs, we play the game for 10 instead of the original 20 rounds. We restrict the game to the category of animals. Here, the set of acceptable solutions H are singletons {h*}, where h* represents a single animal name that player A may think about. Player A is simulated with GPT-4o-mini prompted to answer questions about the ground-truth animal h*. For solution and question generation we use N = 20 and M = 5, respectively. |
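The EIG estimator quoted in the Pseudocode row reduces to the Shannon entropy of the empirical answer distribution for a candidate question. A minimal Python sketch of that computation (the function name `estimate_eig` is ours, not from the paper; we assume answers are hashable strings):

```python
from collections import Counter
import math

def estimate_eig(answers):
    """Estimate the expected information gain of a question q_j as the
    Shannon entropy of the empirical distribution over its N sampled
    answers {a_{i,j}}, following Algorithm 1 of the paper.
    """
    counts = Counter(answers)   # n_k: occurrences of each unique answer a_k
    total = len(answers)        # N: total number of sampled answers
    # H = -sum_k p_k log p_k, with p_k = n_k / N
    return -sum((n / total) * math.log(n / total) for n in counts.values())
```

A question whose sampled answers split evenly (e.g. 10 "yes" and 10 "no") attains the maximum entropy log 2 ≈ 0.693, while a question every simulated user answers identically scores 0, so it carries no disambiguating information.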