Adaptive Self-improvement LLM Agentic System for ML Library Development

Authors: Genghan Zhang, Weixin Liang, Olivia Hsu, Kunle Olukotun

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our results show improvements of up to 3.9× over a baseline single LLM. ... Putting all these together, our system solves up to 96% of the tasks in our benchmark and achieves up to a 3.9× improvement over a baseline single LLM, as shown in Figure 2. ... We construct a set of tasks to measure the adaptive self-improvement agentic system proposed in Section 3 and Section 4.
Researcher Affiliation | Academia | Department of Computer Science, Stanford University, USA. Correspondence to: Genghan Zhang <EMAIL>.
Pseudocode | Yes | Algorithm 1: Adaptive self-improvement learning
Input: X: task set, m: adaptive granularity
Require: θ: LLM agentic system, r: reward from verifier, σ: filter function, β: selection function
 1: D ← ∅
 2: t ← 0                                  ▷ iteration
 3: repeat
 4:   E ← β(D, m)                          ▷ stratification
 5:   for ej ∈ E do
 6:     dj ← [e0, e1, ..., ej]             ▷ selection
 7:     // Parallel sampling
 8:     Ct ← {E_{y∼pθ(y|xi,dj)}[r(xi, y)] | xi ∈ X}
 9:     Bt ← {(xi, y) | r(xi, y) = 1, xi ∈ X}
10:     St ← {xi | ci > 0, ci ∈ Ct}
11:     if Bt ≠ ∅ then
12:       D ← D ∪ σ(Bt, Ct, St)            ▷ filtering
13:       X ← X \ St
14:       t ← t + 1
15:       break
16:     end if
17:     t ← t + 1
18:   end for
19: until X = ∅ ∨ (dj = D ∧ B_{t−1} = ∅)
Output: Solutions D
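The learning loop above can be sketched in plain Python. This is a minimal sketch, not the authors' implementation: `sample`, `verify`, `select_experiences`, and `filter_solutions` are hypothetical stand-ins for drawing y ∼ pθ(y|x, dj), the verifier reward r, the selection function β, and the filter σ, and sampling is done sequentially rather than in parallel.

```python
def adaptive_self_improvement(tasks, sample, verify, select_experiences,
                              filter_solutions, m, max_iters=100):
    """Sketch of the adaptive self-improvement loop (Algorithm 1).

    sample(x, demos) stands in for y ~ p_theta(y | x, d_j),
    verify(x, y) for the binary verifier reward r, select_experiences
    for the stratified selection beta, and filter_solutions for sigma.
    """
    solutions = []         # D: accumulated solved (task, program) pairs
    unsolved = set(tasks)  # X: tasks not yet solved
    t = 0
    while unsolved and t < max_iters:
        experiences = select_experiences(solutions, m)   # E = beta(D, m)
        for j in range(len(experiences)):
            demos = experiences[: j + 1]                 # d_j: growing demo prefix
            # Parallel sampling in the paper; sequential here for simplicity
            batch = [(x, sample(x, demos)) for x in unsolved]
            solved = [(x, y) for x, y in batch if verify(x, y) == 1]  # B_t
            t += 1
            if solved:
                solutions += filter_solutions(solved)    # D <- D ∪ sigma(...)
                unsolved -= {x for x, _ in solved}       # X <- X \ S_t
                break
    return solutions
```

A toy run would pass a trivial `sample` and `verify` pair and watch the unsolved set shrink as verified solutions accumulate.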
Open Source Code | Yes | The example code is public at https://github.com/zhang677/PCL-lite
Open Datasets | No | We construct a set of tasks to measure the adaptive self-improvement agentic system proposed in Section 3 and Section 4. This benchmark should cover a diverse set of popular ML operators and specialized functions. In total, we collect 26 tasks covering 8 groups of ML operators in common LLM model architectures, as shown in Figure 3. We provide a reference implementation for each task.
Dataset Splits | No | Algorithm 1 Adaptive self-improvement learning: input X: task set, m: adaptive granularity ... until X = ∅ ... We choose pass@k (Chen et al., 2021) as the metric for task completion.
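For reference, the pass@k metric cited in this row is conventionally computed with the unbiased estimator from Chen et al. (2021); a minimal sketch:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n total samples of which c are
    correct, passes the verifier (Chen et al., 2021)."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)
```

With the paper's setup of 64 samples per task, `pass_at_k(64, c, k)` would estimate completion for any k ≤ 64.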
Hardware Specification | No | Four models are: claude-3-5-sonnet-20241022 of the Anthropic API (Claude 3.5 Sonnet), gpt-4o-2024-11-20 of the OpenAI API (GPT-4o), deepseek-chat of the DeepSeek API (DeepSeek-V3), and Meta-Llama-3.1-405B-Instruct-Turbo of the Together AI API (Llama 3.1-405B).
Software Dependencies | No | Users program in PyTorch to express their ML operators... The other verifier checks the affine type constraint by performing static analysis on the abstract syntax tree of the STeP program with the Python ast module (Ronacher, 2008). ... all prompts are formatted in YAML because structural prompts generally benefit (He et al., 2024).
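To illustrate the kind of ast-based static analysis the quote describes: the sketch below flags names that are read more than once, a toy stand-in for an affine-usage rule. `check_affine_use` and its single-read rule are assumptions for illustration; the paper's actual STeP affine-type verifier is more involved.

```python
import ast
from collections import Counter

def check_affine_use(source: str) -> list[str]:
    """Hypothetical affine-usage check: report names that are read
    more than once. Real affine typing tracks consumption per scope;
    this only demonstrates walking Python's AST with the ast module."""
    tree = ast.parse(source)
    reads = Counter(
        node.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load)
    )
    return [name for name, n in reads.items() if n > 1]
```

For example, `check_affine_use("a = f(x)\nb = g(x)")` flags `x`, since it is consumed twice.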
Experiment Setup | Yes | In Section 6.1 we sample 64 times for each temperature of 0.4, 0.7, and 1.0, recording the best result. In Section 6.2, we sample 64 times at temperature 0.7 on Claude 3.5 Sonnet to control variables. Four models are: claude-3-5-sonnet-20241022 of the Anthropic API (Claude 3.5 Sonnet), gpt-4o-2024-11-20 of the OpenAI API (GPT-4o), deepseek-chat of the DeepSeek API (DeepSeek-V3), and Meta-Llama-3.1-405B-Instruct-Turbo of the Together AI API (Llama 3.1-405B). Maximum output tokens are set to 1024 and the seed for GPT-4o is 42.
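The sampling protocol in this row (64 samples at each of several temperatures, best result kept) can be sketched generically. `sample_fn` and `verify` are hypothetical stand-ins for the model API call and the paper's verifier, not actual API signatures.

```python
def best_over_temperatures(sample_fn, verify, task,
                           temperatures=(0.4, 0.7, 1.0), n_samples=64):
    """Sample n_samples completions per temperature and keep the best
    per-temperature success rate, mirroring the described protocol."""
    best = 0.0
    for temp in temperatures:
        outs = [sample_fn(task, temperature=temp) for _ in range(n_samples)]
        rate = sum(verify(task, y) for y in outs) / n_samples
        best = max(best, rate)
    return best
```

The controlled-variable runs in Section 6.2 correspond to calling this with `temperatures=(0.7,)` on a single model.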