Diffusion On Syntax Trees For Program Synthesis

Authors: Shreyas Kapur, Erik Jenner, Stuart Russell

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct experiments on four domain-specific graphics languages... Figure 4 shows the performance of our method compared to the baseline methods. In both the CSG2D and Tiny SVG environments, our tree diffusion policy rollouts significantly outperform the policies of previous methods. Our policy combined with beam search further improves performance, solving problems with fewer calls to the renderer than all other methods. Figure 6 shows successful qualitative examples of our system alongside outputs of baseline methods. We additionally show how our system can write graphics programs for hand-drawn sketches.
Researcher Affiliation | Academia | Shreyas Kapur, Erik Jenner & Stuart Russell, University of California, Berkeley
Pseudocode | Yes | Algorithm 1 shows the high-level pseudocode for how we find the first set of mutations to transform tree A into tree B.
Open Source Code | No | The paper states: 'Our sketch algorithm can be found in our codebase', but it does not provide a direct link to a code repository, specific file names in supplementary material, or an explicit statement of code release with a concrete access method. The provided URL (https://tree-diffusion.github.io) is explicitly for 'Video results'.
Open Datasets | No | The paper describes generating its own data: 'For Tiny SVG we used a held-out test set of randomly generated expressions and their images.' and 'The training data was generated by using our Constrained Sample function, which randomly samples programs from grammars.' It does not provide concrete access information (links, DOIs, or citations to publicly available versions) for these generated datasets.
Dataset Splits | No | The paper mentions: 'For Tiny SVG we used a held-out test set of randomly generated expressions and their images.' and 'We give the methods n = 256 images from the test set'. While a test set is implied, the paper does not specify the size of the training set, the percentage split between training/validation/test, or the methodology for how a fixed dataset was divided, as training data is described as an 'infinite stream of random expressions'.
Hardware Specification | Yes | For the environments we tested, we trained for 3 days on a single Nvidia A6000 GPU. To make it more fair for LLaVA, we fine-tuned it using the authors' original fine-tuning script alongside the default suggested hyperparameters. To test LLaVA, we provide it with the test image and rejection sample its output 100 times. We used LLaVA-1.5-7B, and trained on a single A100 GPU.
Software Dependencies | No | The paper mentions several software components and their publication years or sources: 'We implemented these languages using the Lark (Lark Contributors, 2014) and Iceberg (Iceberg Contributors, 2023) Python libraries', 'implement our architecture in PyTorch (Ansel et al., 2024)', 'NF-ResNet26 (Brock et al., 2021) implementation from the open-sourced library by Wightman (2019)'. However, it does not provide specific version numbers (e.g., Python 3.8, PyTorch 1.9) for these dependencies, only the year of their associated publication or release.
Experiment Setup | Yes | Our decoder-only transformer (Vaswani et al., 2017; Radford et al., 2019) uses 8 layers, 16 heads, with an embedding size of 256. We use batch size 32 and optimize with Adam (Kingma & Ba, 2014) with a learning rate of 3 × 10⁻⁴. All models, including baselines, were trained for 1 million steps. The image embeddings are of the same size as the transformer embeddings. We use 4 prefix tokens for the image embeddings. We used a maximum context size of 512 tokens. For both environments, we sampled expressions with at most 8 primitives. ... For tree diffusion search, we used a beam size of 64, with a maximum node expansion budget of 5000 nodes. The value network is a small 2-layer MLP with 128 units in each layer that takes in the output of the vision encoder, and outputs a single scalar predicting the edit distance in program space between two images. We train this for 100,000 steps.
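The quoted setup can be sketched in PyTorch (the paper's stated framework). Only the hyperparameters (8 layers, 16 heads, embedding size 256, Adam with learning rate 3e-4, 512-token context, 4 image-prefix tokens, a 2-layer 128-unit value MLP) come from the quote; the module wiring, vocabulary size, and the value network's input width are our assumptions, not the authors' code:

```python
import torch
import torch.nn as nn

# Hyperparameters from the quoted experiment setup.
D_MODEL, N_HEADS, N_LAYERS = 256, 16, 8
CONTEXT, N_PREFIX = 512, 4
VOCAB = 1024  # assumption: vocabulary size is not reported

class DecoderOnlyPolicy(nn.Module):
    """Decoder-only transformer conditioned on image-prefix tokens (sketch)."""

    def __init__(self):
        super().__init__()
        self.tok_emb = nn.Embedding(VOCAB, D_MODEL)
        self.pos_emb = nn.Embedding(CONTEXT, D_MODEL)
        layer = nn.TransformerEncoderLayer(
            d_model=D_MODEL, nhead=N_HEADS, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=N_LAYERS)
        self.head = nn.Linear(D_MODEL, VOCAB)

    def forward(self, tokens, image_prefix):
        # image_prefix: (B, N_PREFIX, D_MODEL), produced by the vision encoder;
        # its embeddings share the transformer's embedding size, per the quote.
        B, T = tokens.shape
        x = self.tok_emb(tokens) + self.pos_emb(torch.arange(T))
        x = torch.cat([image_prefix, x], dim=1)
        # Causal mask so each position attends only to earlier positions.
        L = x.size(1)
        mask = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
        return self.head(self.blocks(x, mask=mask))

# Value network: 2-layer MLP, 128 units per layer, scalar output predicting
# edit distance in program space. Input width D_MODEL is an assumption.
value_net = nn.Sequential(
    nn.Linear(D_MODEL, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 1),
)

model = DecoderOnlyPolicy()
opt = torch.optim.Adam(model.parameters(), lr=3e-4)
```

This is a shape-level sketch only; the paper's actual conditioning, tokenization, and search code are not public per the Open Source Code row above.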