DeLLMa: Decision Making Under Uncertainty with Large Language Models
Authors: Ollie Liu, Deqing Fu, Dani Yogatama, Willie Neiswanger
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate our procedure on multiple realistic decision-making environments, demonstrating that DeLLMa can consistently enhance the decision-making performance of leading language models, and achieve up to a 40% increase in accuracy over competing methods. Additionally, we show how performance improves when scaling compute at test time, and carry out human evaluations to benchmark components of DeLLMa. |
| Researcher Affiliation | Academia | Department of Computer Science, University of Southern California. EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm 1 STATEFORECAST. Input: LLM M, user prompt P = (G, A, C), plausibility score mapping V, latent factors {f_1, ..., f_k}, and plausible values {f_1^{1:ℓ}, ..., f_k^{1:ℓ}}. For i = 1 to k: π_i(· \| C) ← {} # verbalized probability scores; [v_1, ..., v_ℓ] ← M(P, f_i, f_i^{1:ℓ}); for j = 1 to ℓ: π_i(f_i^j \| C) ← V[v_j]; π_i(· \| C) ← Normalize(π_i(· \| C)). Return π_LLM(f_1, ..., f_k \| C) := ∏_{i=1}^{k} π_i(· \| C). Algorithm 2 UTILITYELICITATION. Input: LLM M, user prompt P = (G, A, C), proposal distribution π_LLM(θ \| C), sample size s, minibatch size b, and overlap proportion q. # Sample fixed states: for each a ∈ A, S_A ← S_A ∪ {(θ_i, a) : θ_i ∼ π_LLM, 1 ≤ i ≤ s/\|A\|}; S_A ← Shuffle(S_A); Ω ← {} # pairwise comparisons; for i = 1 to s with step b(1 − q): R ← M(P, (θ_i, a_i), ..., (θ_{i+b}, a_{i+b})) # rank the minibatch; Ω ← Ω ∪ FormatRank(R). Return U(·, ·) := BradleyTerry(Ω) ∈ R^s. |
| Open Source Code | Yes | Equal Contribution. Project website and code available at https://dellma.github.io/. Our implementations for the zero-shot, self-consistency, Chain-of-Thought, and DeLLMa methods are included in the supplementary material. |
| Open Datasets | Yes | We collect bi-annual reports published by the United States Department of Agriculture (USDA) that provide analysis of supply-and-demand conditions in the U.S. fruit markets (www.ers.usda.gov/publications/pub-details/?pubid=107539). ...We additionally supplement these natural language contexts with USDA-issued price and yield statistics in California (www.nass.usda.gov/Quick_Stats). We collect historical stock prices as the context for this problem. ...These historical monthly prices are collected via Yahoo Finance (finance.yahoo.com) manually by the authors. |
| Dataset Splits | No | No specific training/test/validation dataset splits (e.g., percentages or sample counts for main model evaluation) are explicitly provided in the paper. The paper describes creating 120 decision problem instances for each domain, but how these instances are split for training and testing of the overall DeLLMa framework is not detailed. The described 'sample size' and 'minibatch size' relate to internal sampling within the DeLLMa framework for utility elicitation, not a general train/test split for the experiment. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU types) used to run the experiments. It mentions using specific LLMs like GPT-4, Claude-3, and Gemini 1.5, which implies API calls, but not the local hardware setup by the authors. |
| Software Dependencies | No | The paper mentions that reproducing results requires "OpenAI, Anthropic, and Gemini API access," but it does not provide specific version numbers for any software dependencies, libraries, or APIs used in their implementation. |
| Experiment Setup | Yes | For DeLLMa-Pairs and DeLLMa-Top1, we allocate a per-action sample size of 64 and a minibatch size of 32. We set the overlap proportion q to 25% for the Agriculture dataset and 50% for the Stocks dataset due to budget constraints. For DeLLMa-Naive, we fix a total sample size of 50. Zero-Shot: only the goal G, the action space A, and the context C are provided; we adopt a greedy decoding process by setting temperature = 0. Self-Consistency (SC) (Wang et al., 2022): we use the same prompt as in zero-shot, but with temperature = 0.5 to generate a set of K responses, then take the majority vote of the K responses. |
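The STATEFORECAST step (Algorithm 1 above) maps verbalized plausibility labels to numeric scores via the mapping V, normalizes them into a per-factor distribution, and factorizes the joint state forecast as a product of marginals. A minimal sketch, assuming an illustrative label-to-score mapping (the exact labels and weights are not specified here and are stand-ins, as is the hard-coded LLM output):

```python
# Plausibility-score mapping V: verbalized label -> numeric score.
# These labels and weights are illustrative assumptions, not the paper's.
V = {"very unlikely": 1, "unlikely": 2, "somewhat likely": 3,
     "likely": 4, "very likely": 5}

def forecast_factor(labels):
    """Map verbalized labels to scores and normalize into a distribution."""
    scores = [V[label] for label in labels]
    total = sum(scores)
    return [s / total for s in scores]

def joint_probability(per_factor_probs, assignment):
    """pi_LLM factorizes over latent factors: product of the marginals."""
    p = 1.0
    for probs, idx in zip(per_factor_probs, assignment):
        p *= probs[idx]
    return p

# Example: two latent factors, each with three plausible values,
# with labels standing in for the LLM's verbalized scores.
labels_f1 = ["likely", "unlikely", "somewhat likely"]
labels_f2 = ["very likely", "likely", "very unlikely"]
per_factor = [forecast_factor(labels_f1), forecast_factor(labels_f2)]
p = joint_probability(per_factor, (0, 1))  # P(f1 = value 0, f2 = value 1)
```

Each marginal sums to one by construction, so the product over factors is a valid joint distribution over states.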
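In UTILITYELICITATION (Algorithm 2 above), state-action samples are ranked in minibatches of size b that overlap by a fraction q, and each ranking is expanded into pairwise comparisons for the Bradley-Terry fit. A sketch of that scheduling, where `rank_to_comparisons` stands in for the paper's FormatRank and the LLM ranking call itself is omitted:

```python
# Minibatch windows of size b stepping by b*(1-q), so consecutive
# minibatches share a q-fraction of samples (overlap proportion q).
def minibatches(s, b, q):
    """Yield (start, end) index windows over s samples."""
    step = max(1, int(b * (1 - q)))
    return [(i, min(i + b, s)) for i in range(0, s - b + 1, step)]

def rank_to_comparisons(ranking):
    """Expand a ranked list of sample indices (best first)
    into (winner, loser) pairs for a Bradley-Terry fit."""
    return [(ranking[i], ranking[j])
            for i in range(len(ranking))
            for j in range(i + 1, len(ranking))]

# Example with the paper's Stocks settings:
# per-action sample size 64, minibatch size 32, overlap q = 0.5.
windows = minibatches(64, 32, 0.5)  # consecutive windows share 16 samples
```

The overlap is what links rankings across minibatches: shared samples appear in multiple comparisons, so utilities elicited in different minibatches end up on a common scale.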
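Algorithm 2 returns U := BradleyTerry(Ω), i.e. per-sample utilities fitted from the pairwise comparisons. A minimal sketch using the standard minorization-maximization update for Bradley-Terry (an illustrative implementation, not the paper's code):

```python
def bradley_terry(comparisons, n, iters=200):
    """Fit Bradley-Terry scores from (winner, loser) index pairs.

    comparisons: list of (winner, loser) pairs over items 0..n-1.
    Returns a list of n nonnegative scores summing to n.
    """
    u = [1.0] * n
    wins = [0] * n
    for w, _ in comparisons:
        wins[w] += 1
    for _ in range(iters):
        new_u = []
        for i in range(n):
            # MM update: u_i = W_i / sum over i's pairings of 1/(u_i + u_j)
            denom = 0.0
            for w, l in comparisons:
                if i == w or i == l:
                    denom += 1.0 / (u[w] + u[l])
            new_u.append(wins[i] / denom if denom > 0 else u[i])
        total = sum(new_u)
        u = [x * n / total for x in new_u]  # normalize for identifiability
    return u
```

With comparisons `[(0, 1), (0, 2), (1, 2)]` the fitted scores are ordered 0 > 1 > 2, matching the implied ranking.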