Visually Descriptive Language Model for Vector Graphics Reasoning

Authors: Zhenhailong Wang, Joy Hsu, Xingyao Wang, Kuan-Hao Huang, Manling Li, Jiajun Wu, Heng Ji

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical experiments show that VDLM leads to significant improvements in state-of-the-art LMMs, such as GPT-4o, across various low-level multimodal perception and reasoning tasks on rasterized vector graphics. Additionally, we provide extensive analyses of VDLM's performance, showing that our framework offers improved interpretability due to its disentangled perception and reasoning processes.
Researcher Affiliation | Academia | 1University of Illinois Urbana-Champaign, 2Stanford University, 3Texas A&M University, 4Northwestern University
Pseudocode | No | The paper describes steps in regular paragraph text or as conceptual modules (e.g., Figure 2) but does not contain a dedicated pseudocode block or algorithm section.
Open Source Code | No | The paper mentions using third-party tools and models such as "Mistral-7b (Jiang et al., 2023)", "Megatron-LLM (Cano et al., 2023)", and "VTracer (2024)", but it does not provide a direct link or explicit statement about releasing the source code for their own methodology (VDLM).
Open Datasets | Yes | We leverage VGBench (Zou et al., 2024), a benchmark originally proposed for evaluating LLMs in understanding and generating vector graphics codes. ... Shapeworld (Kuhnle & Copestake, 2017) dataset on spatial relations... NLVR: The Natural Language for Visual Reasoning dataset (Suhr et al., 2017)... Geoclidean (Hsu et al., 2022) dataset...
Dataset Splits | Yes | Our final dataset contains 160K SVG, PVD pairs. More details can be found in Appendix C. ... The detailed configuration can be found in Table 4. Per-task counts (# Training Instances / # Eval Instances): Line or Angle 10K / 1K; Angle Classification 10K / 1000; Length Comparison 10K / 1000; Clevr QA 36K / 1000; Shapeworld Scene 15K / 100; Maze Scene 10K / 600.
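The per-task split counts quoted above can be tabulated and sanity-checked as follows (an illustrative sketch; the dictionary structure and variable names are ours, the numbers are the ones quoted from the paper's configuration table):

```python
# Per-task training/eval instance counts as quoted from the paper's
# detailed configuration (Table 4). The data structure is illustrative.
splits = {
    "Line or Angle":        {"train": 10_000, "eval": 1_000},
    "Angle Classification": {"train": 10_000, "eval": 1_000},
    "Length Comparison":    {"train": 10_000, "eval": 1_000},
    "Clevr QA":             {"train": 36_000, "eval": 1_000},
    "Shapeworld Scene":     {"train": 15_000, "eval": 100},
    "Maze Scene":           {"train": 10_000, "eval": 600},
}

# Aggregate totals across the listed downstream tasks.
total_train = sum(s["train"] for s in splits.values())
total_eval = sum(s["eval"] for s in splits.values())
print(total_train, total_eval)  # 91000 4700
```

Note these downstream-task counts are separate from the 160K SVG, PVD pairs used for the SVG-to-PVD model itself.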
Hardware Specification | Yes | We use the Megatron-LLM (Cano et al., 2023) library for efficient LLM fine-tuning and the entire training process can be done in 16 hours on 4 NVIDIA A100-40GB GPUs.
Software Dependencies | Yes | GPT-4V model version: gpt-4-1106-vision-preview. GPT-4o model version: gpt-4o-2024-05-13. GPT-4 (text-only) model version: gpt-4-0125-preview.
Experiment Setup | Yes | We fine-tune a pretrained Mistral-7b (Jiang et al., 2023) model on the synthesized PVD 160K dataset to perform SVG-to-PVD generation. We conduct full-parameter fine-tuning for 3 epochs with a learning rate of 1e-5. The training objective is a standard Language Modeling loss on the generated PVD tokens as follows: ...
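The excerpt cuts off before the loss equation. A standard autoregressive language-modeling objective of the kind described has the usual form (the notation below is ours, not reproduced from the paper):

$$
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\!\left(y_t \mid y_{<t},\, x\right)
$$

where $x$ is the input SVG code sequence, $y_1, \dots, y_T$ are the target PVD tokens, and the loss is computed only over the generated PVD tokens, as the quote states.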