Adaptive Self-improvement LLM Agentic System for ML Library Development
Authors: Genghan Zhang, Weixin Liang, Olivia Hsu, Kunle Olukotun
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our results show improvements of up to 3.9× over a baseline single LLM. ... Putting all these together, our system solves up to 96% of the tasks in our benchmark and achieves up to a 3.9× improvement over a baseline single LLM, as shown in Figure 2. ... We construct a set of tasks to measure the adaptive self-improvement agentic system proposed in Section 3 and Section 4. |
| Researcher Affiliation | Academia | 1Department of Computer Science, Stanford University, USA. Correspondence to: Genghan Zhang <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Adaptive self-improvement learning. Input X: task set, m: adaptive granularity. Require: θ: LLM agentic system, r: reward from verifier, σ: filter function, β: selection function. 1: D ← ∅ 2: t ← 0 (iteration) 3: repeat 4: E ← β(D, m) (stratification) 5: for e_j ∈ E do 6: d_j ← [e_0, e_1, ..., e_j] (selection) 7: // Parallel sampling 8: C_t ← {E_{y∼p_θ(y\|x_i, d_j)}[r(x_i, y)] \| x_i ∈ X} 9: B_t ← {(x_i, y) \| r(x_i, y) = 1, x_i ∈ X} 10: S_t ← {x_i \| c_i > 0, c_i ∈ C_t} 11: if B_t ≠ ∅ then 12: D ← D ∪ σ(B_t, C_t, S_t) (filtering) 13: X ← X \ S_t 14: t ← t + 1 15: break 16: end if 17: t ← t + 1 18: end for 19: until X = ∅ ∨ (d_j = D ∧ B_{t−1} = ∅). Output Solutions: D |
| Open Source Code | Yes | 1The example code is public at https://github.com/zhang677/PCL-lite |
| Open Datasets | No | We construct a set of tasks to measure the adaptive self-improvement agentic system proposed in Section 3 and Section 4. This benchmark should cover a diverse set of popular ML operators and specialized functions. In total, we collect 26 tasks covering 8 groups of ML operators in common LLM model architectures, as shown in Figure 3. We provide a reference implementation for each task. |
| Dataset Splits | No | Algorithm 1 Adaptive self-improvement learning, input X: task set, m: adaptive granularity... until X = ∅ ... We choose pass@k (Chen et al., 2021) as the metric for task completion. |
| Hardware Specification | No | Four models are: claude-3-5-sonnet-20241022 of Anthropic API (Claude 3.5 Sonnet), gpt-4o-2024-11-20 of OpenAI API (GPT-4o), deepseek-chat of DeepSeek API (DeepSeek-V3), and Meta-Llama-3.1-405B-Instruct-Turbo of Together AI API (Llama 3.1-405B). |
| Software Dependencies | No | Users program PyTorch to express their ML operators... The other verifier checks the affine type constraint by performing static analysis on the abstract syntax tree of the STeP program with the Python ast module (Ronacher, 2008). ... all prompts are formatted in YAML because structural prompts generally benefit (He et al., 2024). |
| Experiment Setup | Yes | In Section 6.1 we sample 64 times for each temperature of 0.4, 0.7, and 1.0, recording the best result. In Section 6.2, we sample 64 times at temperature 0.7 on Claude 3.5 Sonnet to control variables. Four models are: claude-3-5-sonnet-20241022 of Anthropic API (Claude 3.5 Sonnet), gpt-4o-2024-11-20 of OpenAI API (GPT-4o), deepseek-chat of DeepSeek API (DeepSeek-V3), and Meta-Llama-3.1-405B-Instruct-Turbo of Together AI API (Llama 3.1-405B). Maximum output tokens are set as 1024 and the seed for GPT-4o is 42. |
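The Algorithm 1 pseudocode quoted in the Pseudocode row can be sketched in Python to make the control flow concrete. This is a minimal, hypothetical rendering: `sample`, `reward`, `sigma` (filter), and `beta` (stratification) are stand-in callables supplied by the caller, not the paper's actual implementation, and sampling is done sequentially rather than in parallel for simplicity.

```python
def adaptive_self_improvement(tasks, m, sample, reward, sigma, beta, n_samples=8):
    """Sketch of adaptive self-improvement learning (Algorithm 1).

    tasks: task set X; m: adaptive granularity; sample(x, demos) draws one
    candidate solution from the LLM agentic system; reward(x, y) is the 0/1
    verifier score; sigma filters new successes; beta stratifies the solved
    pool D into experience strata. All callables are hypothetical stand-ins.
    """
    D = []               # accumulated solved (task, solution) experiences
    X = list(tasks)      # unsolved tasks
    t = 0
    while X:
        E = beta(D, m)                       # stratification of experiences
        progressed = False
        for j in range(len(E) or 1):         # iterate over strata (empty D -> no demos)
            demos = E[: j + 1]               # selection: cumulative demo prefix
            B, S, C = [], [], {}
            for x in X:
                ys = [sample(x, demos) for _ in range(n_samples)]
                rs = [reward(x, y) for y in ys]
                C[x] = sum(rs) / n_samples   # empirical expected reward
                B += [(x, y) for y, r in zip(ys, rs) if r == 1]
                if C[x] > 0:
                    S.append(x)
            t += 1
            if B:                            # new successes: filter, absorb, restart
                D += sigma(B, C, S)
                X = [x for x in X if x not in S]
                progressed = True
                break
        if not progressed:                   # full demo set solved nothing: stop
            break
    return D
```

A toy run: with `sample` that deterministically solves each task and `sigma` that deduplicates, the loop solves everything in one round and returns the filtered experience pool.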
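The report notes that pass@k (Chen et al., 2021) is the task-completion metric, with 64 samples drawn per task. The standard unbiased estimator from that paper can be sketched as:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    Given n samples per task, of which c pass the verifier, returns the
    probability that at least one of k randomly drawn samples passes:
    1 - C(n - c, k) / C(n, k).
    """
    if n - c < k:
        return 1.0  # fewer failures than k draws: a success is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Computing `1 - C(n-c, k)/C(n, k)` directly, rather than averaging over random subsets, avoids the high variance of naively drawing k of the n samples.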