DeLLMa: Decision Making Under Uncertainty with Large Language Models
Authors: Ollie Liu, Deqing Fu, Dani Yogatama, Willie Neiswanger
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate our procedure on multiple realistic decision-making environments, demonstrating that DeLLMa can consistently enhance the decision-making performance of leading language models, and achieve up to a 40% increase in accuracy over competing methods. Additionally, we show how performance improves when scaling compute at test time, and carry out human evaluations to benchmark components of DeLLMa. |
| Researcher Affiliation | Academia | Department of Computer Science, University of Southern California. EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm 1 STATEFORECAST. Input: LLM M, user prompt P = (G, A, C), plausibility score mapping V, latent factors {f_1, ..., f_k}, and plausible values {f_1^{1:ℓ}, ..., f_k^{1:ℓ}}. For i = 1 to k: π_i(· \| C) ← {} # verbalized probability scores; [v_1, ..., v_ℓ] ← M(P, f_i, f_i^{1:ℓ}); for j = 1 to ℓ: π_i(f_i^j \| C) ← V[v_j]; π_i(· \| C) ← Normalize(π_i(· \| C)). Return π_LLM(f_1, ..., f_k \| C) := ∏_{i=1}^{k} π_i(· \| C). Algorithm 2 UTILITYELICITATION. Input: LLM M, user prompt P = (G, A, C), proposal distribution π_LLM(θ \| C), sample size s, minibatch size b, and overlap proportion q. # Sample fixed states: for each a ∈ A, S_A ← S_A ∪ {(θ_i, a) : θ_i ∼ π_LLM, 1 ≤ i ≤ s/\|A\|}; S_A ← Shuffle(S_A); Ω ← {} # pairwise comparisons; for i = 1 to s with step b(1 − q): R ← M(P, (θ_i, a_i), ..., (θ_{i+b}, a_{i+b})) # rank the minibatch; Ω ← Ω ∪ FormatRank(R). Return U(·, ·) := BradleyTerry(Ω) ∈ R^s. |
| Open Source Code | Yes | Equal Contribution. Project website and code available at https://dellma.github.io/. Our implementations for the zero-shot, self-consistency, Chain-of-Thought, and DeLLMa methods are included in the supplementary material. |
| Open Datasets | Yes | We collect bi-annual reports published by the United States Department of Agriculture (USDA) that provide analysis of supply-and-demand conditions in the U.S. fruit markets (www.ers.usda.gov/publications/pub-details/?pubid=107539). ...We additionally supplement these natural language contexts with USDA-issued price and yield statistics in California (www.nass.usda.gov/Quick_Stats). We collect historical stock prices as the context for this problem. ...These historical monthly prices are collected via Yahoo Finance (finance.yahoo.com) manually by the authors. |
| Dataset Splits | No | No specific training/test/validation dataset splits (e.g., percentages or sample counts for main model evaluation) are explicitly provided in the paper. The paper describes creating 120 decision problem instances for each domain, but how these instances are split for training and testing of the overall DeLLMa framework is not detailed. The described 'sample size' and 'minibatch size' relate to internal sampling within the DeLLMa framework for utility elicitation, not a general train/test split for the experiment. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU types) used to run the experiments. It mentions using specific LLMs like GPT-4, Claude-3, and Gemini 1.5, which implies API calls, but not the local hardware setup by the authors. |
| Software Dependencies | No | The paper mentions that reproducing results requires "OpenAI, Anthropic, and Gemini API access," but it does not provide specific version numbers for any software dependencies, libraries, or APIs used in their implementation. |
| Experiment Setup | Yes | For DeLLMa-Pairs and DeLLMa-Top1, we allocate a per-action sample size of 64 and a minibatch size of 32. We set the overlap proportion q to 25% for the Agriculture dataset and 50% for the Stocks dataset due to budget constraints. For DeLLMa-Naive, we fix a total sample size of 50. Zero-Shot: only the goal G, the action space A, and the context C are provided; we adopt a greedy decoding process by setting temperature = 0. Self-Consistency (SC) (Wang et al., 2022): we use the same prompt as in zero-shot, but with temperature = 0.5 to generate a set of K responses, then take the majority vote of the K responses. |
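The STATEFORECAST step (Algorithm 1 above) maps verbalized plausibility labels to numeric scores via the mapping V, normalizes them into a per-factor distribution, and factorizes the joint state forecast as a product of marginals. A minimal sketch, assuming an illustrative label-to-score mapping (the exact labels and weights are not specified here and are stand-ins, as is the hard-coded LLM output):

```python
# Plausibility-score mapping V: verbalized label -> numeric score.
# These labels and weights are illustrative assumptions, not the paper's.
V = {"very unlikely": 1, "unlikely": 2, "somewhat likely": 3,
     "likely": 4, "very likely": 5}

def forecast_factor(labels):
    """Map verbalized labels to scores and normalize into a distribution."""
    scores = [V[label] for label in labels]
    total = sum(scores)
    return [s / total for s in scores]

def joint_probability(per_factor_probs, assignment):
    """pi_LLM factorizes over latent factors: product of the marginals."""
    p = 1.0
    for probs, idx in zip(per_factor_probs, assignment):
        p *= probs[idx]
    return p

# Example: two latent factors, each with three plausible values,
# with labels standing in for the LLM's verbalized scores.
labels_f1 = ["likely", "unlikely", "somewhat likely"]
labels_f2 = ["very likely", "likely", "very unlikely"]
per_factor = [forecast_factor(labels_f1), forecast_factor(labels_f2)]
p = joint_probability(per_factor, (0, 1))  # P(f1 = value 0, f2 = value 1)
```

Each marginal sums to one by construction, so the product over factors is a valid joint distribution over states.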
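In UTILITYELICITATION (Algorithm 2 above), state-action samples are ranked in minibatches of size b that overlap by a fraction q, and each ranking is expanded into pairwise comparisons for the Bradley-Terry fit. A sketch of that scheduling, where `rank_to_comparisons` stands in for the paper's FormatRank and the LLM ranking call itself is omitted:

```python
# Minibatch windows of size b stepping by b*(1-q), so consecutive
# minibatches share a q-fraction of samples (overlap proportion q).
def minibatches(s, b, q):
    """Yield (start, end) index windows over s samples."""
    step = max(1, int(b * (1 - q)))
    return [(i, min(i + b, s)) for i in range(0, s - b + 1, step)]

def rank_to_comparisons(ranking):
    """Expand a ranked list of sample indices (best first)
    into (winner, loser) pairs for a Bradley-Terry fit."""
    return [(ranking[i], ranking[j])
            for i in range(len(ranking))
            for j in range(i + 1, len(ranking))]

# Example with the paper's Stocks settings:
# per-action sample size 64, minibatch size 32, overlap q = 0.5.
windows = minibatches(64, 32, 0.5)  # consecutive windows share 16 samples
```

The overlap is what links rankings across minibatches: shared samples appear in multiple comparisons, so utilities elicited in different minibatches end up on a common scale.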
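Algorithm 2 returns U := BradleyTerry(Ω), i.e. per-sample utilities fitted from the pairwise comparisons. A minimal sketch using the standard minorization-maximization update for Bradley-Terry (an illustrative implementation, not the paper's code):

```python
def bradley_terry(comparisons, n, iters=200):
    """Fit Bradley-Terry scores from (winner, loser) index pairs.

    comparisons: list of (winner, loser) pairs over items 0..n-1.
    Returns a list of n nonnegative scores summing to n.
    """
    u = [1.0] * n
    wins = [0] * n
    for w, _ in comparisons:
        wins[w] += 1
    for _ in range(iters):
        new_u = []
        for i in range(n):
            # MM update: u_i = W_i / sum over i's pairings of 1/(u_i + u_j)
            denom = 0.0
            for w, l in comparisons:
                if i == w or i == l:
                    denom += 1.0 / (u[w] + u[l])
            new_u.append(wins[i] / denom if denom > 0 else u[i])
        total = sum(new_u)
        u = [x * n / total for x in new_u]  # normalize for identifiability
    return u
```

With comparisons `[(0, 1), (0, 2), (1, 2)]` the fitted scores are ordered 0 > 1 > 2, matching the implied ranking.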