Can Large Language Models Understand Symbolic Graphics Programs?
Authors: Zeju Qiu, Weiyang Liu, Haiwen Feng, Zhen Liu, Tim Xiao, Katherine Collins, Joshua B. Tenenbaum, Adrian Weller, Michael J. Black, Bernhard Schölkopf
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate commercial and open-source LLMs on our benchmark to assess their ability to reason about visual output of programs, finding that LLMs considered stronger at reasoning generally perform better. Lastly, we introduce a novel method to improve this ability, Symbolic Instruction Tuning (SIT), in which the LLM is finetuned with pre-collected instruction data on symbolic graphics programs. Interestingly, we find that SIT not only improves the LLM's understanding of symbolic programs, but it also improves general reasoning ability on various other benchmarks. |
| Researcher Affiliation | Academia | Zeju Qiu1, Weiyang Liu1,2,*, Haiwen Feng1, Zhen Liu1, Tim Z. Xiao1, Katherine M. Collins2, Joshua B. Tenenbaum3, Adrian Weller2, Michael J. Black1, Bernhard Schölkopf1 — 1Max Planck Institute for Intelligent Systems, Tübingen; 2University of Cambridge; 3MIT |
| Pseudocode | No | The paper describes a step-by-step reasoning process demonstrated by an LLM in Figure 3, but it does not contain any pseudocode or algorithm blocks for the authors' own methodology. |
| Open Source Code | Yes | Project page: sgp-bench.github.io. To facilitate future research, our symbolic instruction data is also made publicly available. |
| Open Datasets | Yes | Our SVG data are sampled from the Kaggle SVG Icons dataset, and we build our SGP-Bench (SVG) using text prompts from F.1. The original data from the Kaggle SVG Icons dataset is crawled from SVGrepo. Our CAD (3D) sequences are sampled from the DeepCAD [105] dataset, which contains around 180k manually constructed CAD sequences that are originally from the ABC dataset [47]. Our CAD (3D-complex) sequences are sampled from the Fusion360 Reconstruction Dataset [101]... Our CAD (2D) sequences are sampled from the SketchGraphs [86] dataset... The MNIST SVG data is sampled from the Kaggle MNIST-SVG dataset. Following this idea, we construct the first semantic description dataset for symbolic graphics programs... our symbolic instruction data is also made publicly available. |
| Dataset Splits | No | The paper describes the construction of several datasets (SGP-Bench SVG, CAD, SGP-MNIST, SIT data) and how they are used for evaluation or fine-tuning, but it does not explicitly define standard train/validation/test splits with percentages or sample counts for its primary SGP-Bench evaluation. |
| Hardware Specification | Yes | The vLLM inference engine is deployed on a node with 8 NVIDIA H100 80GB GPUs. We use the unsloth framework to finetune the base models Llama3-8b-instruct and Gemma-1.1-7b-it. For both models, we use the exact same training setting: we finetune the base models with LoRA [36] on 1 NVIDIA H100 80GB GPU. For both fine-tuning methods, we train on 8 NVIDIA H100 80GB GPUs. |
| Software Dependencies | No | We adopt implementations from other projects to build our SGP-Bench. We follow the implementation of MathVista for querying GPT or the open-sourced Llama3.1-8B and perform LLM-based answer extraction, vLLM for efficient model inference, and simple-evals for a unified benchmarking framework. We use the unsloth framework to finetune the base models Llama3-8b-instruct and Gemma-1.1-7b-it. We use the PEFT framework to test different fine-tuning methods. We employ the widely-used lm-evaluation-harness to obtain the results on a variety of LLM benchmarks. The paper mentions several frameworks and models (MathVista, vLLM, simple-evals, unsloth, PEFT, lm-evaluation-harness) but does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | For both models, we use the exact same training setting: we finetune the base models with LoRA [36] on 1 NVIDIA H100 80GB GPU with learning rate 2e-4, batch size of 2, and for 1 epoch. For both fine-tuning methods, we train on 8 NVIDIA H100 80GB GPUs with learning rate 1e-4, per-device batch size of 1, and for 1 epoch. |
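The two training settings quoted in the Experiment Setup row can be summarized in a small configuration sketch. This is a minimal illustration only: the key names follow common Hugging Face-style conventions (`learning_rate`, `per_device_train_batch_size`, `num_train_epochs`), which is an assumption — the paper does not state the exact configuration keys used with the unsloth or PEFT frameworks.

```python
# Hedged sketch of the two reported training settings (hyperparameter
# names are assumed Hugging Face-style conventions, not taken from the paper).

# Setting 1: SIT fine-tuning of Llama3-8b-instruct / Gemma-1.1-7b-it with LoRA
sit_lora_config = {
    "method": "LoRA",
    "num_gpus": 1,                      # 1x NVIDIA H100 80GB
    "learning_rate": 2e-4,
    "per_device_train_batch_size": 2,
    "num_train_epochs": 1,
}

# Setting 2: comparison of fine-tuning methods via the PEFT framework
peft_comparison_config = {
    "num_gpus": 8,                      # 8x NVIDIA H100 80GB
    "learning_rate": 1e-4,
    "per_device_train_batch_size": 1,
    "num_train_epochs": 1,
}

# Effective batch size = per-device batch size x number of GPUs
effective_batch = (peft_comparison_config["per_device_train_batch_size"]
                   * peft_comparison_config["num_gpus"])
print(effective_batch)
```

Note that under these assumed settings the multi-GPU run has an effective batch size of 8 (1 per device across 8 GPUs), compared with 2 for the single-GPU LoRA run.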