Improving Parallel Program Performance with LLM Optimizers via Agent-System Interfaces
Authors: Anjiang Wei, Allen Nie, Thiago S. F. X. Teixeira, Rohan Yadav, Wonchan Lee, Ke Wang, Alex Aiken
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate that mappers optimized by LLM-powered agents not only match but often surpass expert-written mappers, achieving up to 1.34x speedup across nine benchmarks. ... Empirical Evaluation of Performance: Our agent-based solution achieves up to 1.34x speedup across nine benchmarks |
| Researcher Affiliation | Collaboration | (1) Stanford University; (2) Intel; (3) NVIDIA; (4) Nanjing University. |
| Pseudocode | Yes | We show how we use Trace to incorporate the feedback from the execution to update the agent, with a PyTorch-like syntax. (Figure A2) High-level structure of the Trace-based agent template, where functions annotated with @bundle(trainable=True) define the search space that the LLM optimizer updates during mapper generation. (Figure A3) |
| Open Source Code | No | The paper does not contain a specific link to the source code for the methodology described, nor does it explicitly state that the code is being released. It references the 'Trace' framework but not its own implementation. |
| Open Datasets | Yes | Our evaluation utilizes a suite of 9 benchmarks, including 3 scientific computing workloads and 6 well-known matrix multiplication algorithms. Circuit is a simulation benchmark that models electrical circuit behavior by simulating currents and voltages across interconnected nodes and wires (Bauer et al., 2012). Stencil simulates a 2D grid where each point's value is updated based on a stencil pattern determined by its neighbors (Van der Wijngaart & Mattson, 2014). Pennant models unstructured mesh Lagrangian staggered-grid hydrodynamics, commonly used for simulating compressible flow (Ferenbaugh, 2015). |
| Dataset Splits | No | The paper focuses on performance optimization of mappers for parallel programs on benchmarks, rather than traditional machine learning tasks involving training, validation, and test datasets. Therefore, specific dataset split information is not applicable or provided. |
| Hardware Specification | Yes | Experiments are conducted on one node with two Intel 10-core E5-2640 v4 CPUs, 256G main memory, and four NVIDIA Tesla P100 GPUs. |
| Software Dependencies | No | The paper mentions using 'gpt-4o-2024-08-06' and the 'Trace' framework, but does not specify version numbers for general programming languages (like Python) or other libraries required to replicate the experiments. |
| Experiment Setup | Yes | running 10 iterations per application. To account for stochastic output, we repeated the process 5 times and report the average. ... The agent takes two inputs: server specifications and application metadata. Server specifications detail the hardware configuration, including the number of CPUs and GPUs per node, as well as the total node count. Application metadata provides information on task names and the associated data arguments accessed by each task. |
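
The experiment setup in the last row can be sketched in Python as follows. This is a minimal illustration and not the authors' implementation: the names `ServerSpec`, `TaskMetadata`, `run_protocol`, and `fake_evaluate` are hypothetical, the task and region names are placeholders, and scoring each run by its best iteration before averaging is an assumption (the table only says the process was repeated 5 times and the average reported). The hardware numbers come from the Hardware Specification row.

```python
import random
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Tuple


@dataclass(frozen=True)
class ServerSpec:
    """Hypothetical container for the server specification input."""
    cpus_per_node: int
    gpus_per_node: int
    num_nodes: int


@dataclass(frozen=True)
class TaskMetadata:
    """Hypothetical container for one task's application metadata."""
    name: str
    data_args: Tuple[str, ...]


def run_protocol(
    evaluate: Callable[[int, int], float],
    num_runs: int = 5,
    num_iterations: int = 10,
) -> float:
    """Repeat the optimization num_runs times, num_iterations each.

    Assumption: a run is scored by its best iteration, and the
    reported number is the mean of those per-run scores.
    """
    best_per_run = []
    for run in range(num_runs):
        scores = [evaluate(run, it) for it in range(num_iterations)]
        best_per_run.append(max(scores))
    return mean(best_per_run)


# One node with two CPUs and four GPUs, as in the hardware row.
spec = ServerSpec(cpus_per_node=2, gpus_per_node=4, num_nodes=1)

# Placeholder task and data-argument names, for illustration only.
tasks = [TaskMetadata(name="task_a", data_args=("region_x", "region_y"))]


def fake_evaluate(run: int, iteration: int) -> float:
    """Stand-in for 'generate a mapper, execute it, measure throughput'."""
    rng = random.Random(run * 100 + iteration)
    return 1.0 + rng.random()


average_best = run_protocol(fake_evaluate)
print(f"average of per-run best scores: {average_best:.3f}")
```

In a real setup, `fake_evaluate` would be replaced by a call that asks the LLM optimizer for a new mapper and measures the application's performance with it; the seeded random stand-in only keeps the sketch self-contained.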