Adaptive Self-improvement LLM Agentic System for ML Library Development

Authors: Genghan Zhang, Weixin Liang, Olivia Hsu, Kunle Olukotun

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our results show improvements of up to 3.9× over a baseline single LLM. ... Putting all these together, our system solves up to 96% of the tasks in our benchmark and achieves up to a 3.9× improvement over a baseline single LLM, as shown in Figure 2. ... We construct a set of tasks to measure the adaptive self-improvement agentic system proposed in Section 3 and Section 4.
Researcher Affiliation | Academia | Department of Computer Science, Stanford University, USA. Correspondence to: Genghan Zhang <EMAIL>.
Pseudocode | Yes | Algorithm 1: Adaptive self-improvement learning
Input: X: task set, m: adaptive granularity
Require: θ: LLM agentic system, r: reward from verifier, σ: filter function, β: selection function
 1: D ← ∅
 2: t ← 0                                  ▷ iteration
 3: repeat
 4:   E ← β(D, m)                          ▷ stratification
 5:   for ej ∈ E do
 6:     dj ← [e0, e1, ..., ej]             ▷ selection
 7:     // Parallel sampling
 8:     Ct ← {E_{y∼pθ(y|xi,dj)}[r(xi, y)] | xi ∈ X}
 9:     Bt ← {(xi, y) | r(xi, y) = 1, xi ∈ X}
10:     St ← {xi | ci > 0, ci ∈ Ct}
11:     if Bt ≠ ∅ then
12:       D ← D ∪ σ(Bt, Ct, St)            ▷ filtering
13:       X ← X \ St
14:       t ← t + 1
15:       break
16:     end if
17:     t ← t + 1
18:   end for
19: until X = ∅ ∨ (dj = D ∧ B_{t−1} = ∅)
Output: Solutions D
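The learning loop above can be sketched in plain Python. This is a minimal sketch, not the authors' implementation: `sample`, `verify`, `select_experiences`, and `filter_solutions` are hypothetical stand-ins for drawing y ∼ pθ(y|x, dj), the verifier reward r, the selection function β, and the filter σ, and sampling is done sequentially rather than in parallel.

```python
def adaptive_self_improvement(tasks, sample, verify, select_experiences,
                              filter_solutions, m, max_iters=100):
    """Sketch of the adaptive self-improvement loop (Algorithm 1).

    sample(x, demos) stands in for y ~ p_theta(y | x, d_j),
    verify(x, y) for the binary verifier reward r, select_experiences
    for the stratified selection beta, and filter_solutions for sigma.
    """
    solutions = []         # D: accumulated solved (task, program) pairs
    unsolved = set(tasks)  # X: tasks not yet solved
    t = 0
    while unsolved and t < max_iters:
        experiences = select_experiences(solutions, m)   # E = beta(D, m)
        for j in range(len(experiences)):
            demos = experiences[: j + 1]                 # d_j: growing demo prefix
            # Parallel sampling in the paper; sequential here for simplicity
            batch = [(x, sample(x, demos)) for x in unsolved]
            solved = [(x, y) for x, y in batch if verify(x, y) == 1]  # B_t
            t += 1
            if solved:
                solutions += filter_solutions(solved)    # D <- D ∪ sigma(...)
                unsolved -= {x for x, _ in solved}       # X <- X \ S_t
                break
    return solutions
```

A toy run would pass a trivial `sample` and `verify` pair and watch the unsolved set shrink as verified solutions accumulate.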
Open Source Code | Yes | The example code is public at https://github.com/zhang677/PCL-lite
Open Datasets | No | We construct a set of tasks to measure the adaptive self-improvement agentic system proposed in Section 3 and Section 4. This benchmark should cover a diverse set of popular ML operators and specialized functions. In total, we collect 26 tasks covering 8 groups of ML operators in common LLM model architectures, as shown in Figure 3. We provide a reference implementation for each task.
Dataset Splits | No | Algorithm 1 Adaptive self-improvement learning: input X: task set, m: adaptive granularity ... until X = ∅ ... We choose pass@k (Chen et al., 2021) as the metric for task completion.
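For reference, the pass@k metric cited in this row is conventionally computed with the unbiased estimator from Chen et al. (2021); a minimal sketch:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n total samples of which c are
    correct, passes the verifier (Chen et al., 2021)."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)
```

With the paper's setup of 64 samples per task, `pass_at_k(64, c, k)` would estimate completion for any k ≤ 64.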
Hardware Specification | No | Four models are: claude-3-5-sonnet-20241022 of the Anthropic API (Claude 3.5 Sonnet), gpt-4o-2024-11-20 of the OpenAI API (GPT-4o), deepseek-chat of the DeepSeek API (DeepSeek-V3), and Meta-Llama-3.1-405B-Instruct-Turbo of the Together AI API (Llama 3.1-405B).
Software Dependencies | No | Users program in PyTorch to express their ML operators... The other verifier checks the affine type constraint by performing static analysis on the abstract syntax tree of the STeP program with the Python ast module (Ronacher, 2008). ... all prompts are formatted in YAML because structural prompts generally benefit (He et al., 2024).
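To illustrate the kind of ast-based static analysis the quote describes: the sketch below flags names that are read more than once, a toy stand-in for an affine-usage rule. `check_affine_use` and its single-read rule are assumptions for illustration; the paper's actual STeP affine-type verifier is more involved.

```python
import ast
from collections import Counter

def check_affine_use(source: str) -> list[str]:
    """Hypothetical affine-usage check: report names that are read
    more than once. Real affine typing tracks consumption per scope;
    this only demonstrates walking Python's AST with the ast module."""
    tree = ast.parse(source)
    reads = Counter(
        node.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load)
    )
    return [name for name, n in reads.items() if n > 1]
```

For example, `check_affine_use("a = f(x)\nb = g(x)")` flags `x`, since it is consumed twice.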
Experiment Setup | Yes | In Section 6.1 we sample 64 times for each temperature of 0.4, 0.7, and 1.0, recording the best result. In Section 6.2, we sample 64 times at temperature 0.7 on Claude 3.5 Sonnet to control variables. Four models are: claude-3-5-sonnet-20241022 of the Anthropic API (Claude 3.5 Sonnet), gpt-4o-2024-11-20 of the OpenAI API (GPT-4o), deepseek-chat of the DeepSeek API (DeepSeek-V3), and Meta-Llama-3.1-405B-Instruct-Turbo of the Together AI API (Llama 3.1-405B). Maximum output tokens are set to 1024 and the seed for GPT-4o is 42.
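The sampling protocol in this row (64 samples at each of several temperatures, best result kept) can be sketched generically. `sample_fn` and `verify` are hypothetical stand-ins for the model API call and the paper's verifier, not actual API signatures.

```python
def best_over_temperatures(sample_fn, verify, task,
                           temperatures=(0.4, 0.7, 1.0), n_samples=64):
    """Sample n_samples completions per temperature and keep the best
    per-temperature success rate, mirroring the described protocol."""
    best = 0.0
    for temp in temperatures:
        outs = [sample_fn(task, temperature=temp) for _ in range(n_samples)]
        rate = sum(verify(task, y) for y in outs) / n_samples
        best = max(best, rate)
    return best
```

The controlled-variable runs in Section 6.2 correspond to calling this with `temperatures=(0.7,)` on a single model.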