Sketch2Diagram: Generating Vector Diagrams from Hand-Drawn Sketches

Authors: Itsumi Saito, Haruto Yoshida, Keisuke Sakaguchi

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our evaluations reveal the limitations of state-of-the-art vision and language models (VLMs), positioning SKETIkZ as a key benchmark for future research in sketch-to-diagram conversion. Along with SKETIkZ, we present IMGTIkZ, an image-to-TikZ model that integrates a 6.7B-parameter code-specialized open-source large language model (LLM) with a pretrained vision encoder. Despite its relatively compact size, IMGTIkZ performs comparably to GPT-4o. This success is driven by our two data augmentation techniques and a multi-candidate inference strategy. Our findings open promising directions for future research in sketch-to-diagram conversion and broader image-to-code generation tasks. SKETIkZ is publicly available.1
Researcher Affiliation | Academia | Itsumi Saito*, Haruto Yoshida*, Keisuke Sakaguchi*; *Tohoku University, RIKEN AIP
Pseudocode | No | The paper provides Python code listings (Listing 1 and Listing 2) for its image augmentation pipelines, but these are actual code snippets rather than abstract pseudocode or algorithm blocks.
Open Source Code | No | The paper states "SKETIkZ is publicly available," with a footnote pointing to https://sketikz.github.io/ for the dataset. However, there is no explicit statement or link providing access to the source code for the IMGTIkZ model or the methodology described in the paper.
Open Datasets | Yes | To address this gap, we introduce SKETIkZ, a new dataset designed for benchmarking sketch-to-diagram generation. SKETIkZ comprises 3,231 pairs of hand-drawn sketches and their corresponding TikZ codes. ... SKETIkZ is publicly available.1 (Footnote 1: https://sketikz.github.io/). Datasets used in stage 2 training ... We also used existing pairs of TikZ code and images (No. 8), excluding data with arXiv IDs that overlap with our collected dataset.
Dataset Splits | Yes | We aligned sketches I_s with corresponding TikZ codes Y_r and reference images I_r, creating a dataset of 2,585 training, 323 validation, and 323 test samples.
Hardware Specification | Yes | We used 8 A100 GPUs for training IMGTIkZ, and 1 H100 GPU for inference. ... The training was conducted using four H100 80G GPUs (for D-SigLIP). ... We trained the model using an NVIDIA A100 GPU (for the diagram image classification model).
Software Dependencies | Yes | We used pdflatex from TeX Live 2023 to compile generated TikZ code into a diagram image. ... We used the gpt-4o-2024-05-13 version for GPT-4o, the gpt-4o-mini-2024-07-18 version for GPT-4o mini, the claude-3-5-sonnet-20240620 version for Claude 3.5, and the llama3-llava-next-8b version, which is trained on the 8B Llama 3 model, for LLaVA-NeXT. ... We used the text-embedding-3-small version (for OpenAI's text embedding model). ... We used the google/siglip-so400m-patch14-384 version of SigLIP as the vision encoder.
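The compile step mentioned above (pdflatex turning generated TikZ code into a diagram image) can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: the `standalone` document wrapper, the pdflatex flags, and the helper names `wrap_tikz`/`compile_tikz` are all assumptions, since the paper does not specify its LaTeX preamble.

```python
import pathlib
import subprocess

# Assumed wrapper: the paper does not state which document class it uses.
TEMPLATE = r"""\documentclass[tikz]{standalone}
\begin{document}
%s
\end{document}
"""

def wrap_tikz(tikz_code: str) -> str:
    """Embed a generated TikZ snippet in a compilable standalone document."""
    return TEMPLATE % tikz_code

def compile_tikz(tikz_code: str, workdir: str) -> pathlib.Path:
    """Compile the wrapped snippet with pdflatex (TeX Live); returns the PDF path.

    Raises subprocess.CalledProcessError if the generated code does not compile,
    which is how a pipeline like the paper's could detect invalid candidates.
    """
    tex = pathlib.Path(workdir) / "diagram.tex"
    tex.write_text(wrap_tikz(tikz_code))
    subprocess.run(
        ["pdflatex", "-interaction=nonstopmode", "-halt-on-error", tex.name],
        cwd=workdir, check=True, capture_output=True,
    )
    return tex.with_suffix(".pdf")
```

A failed compile surfaces as an exception, so callers can simply retry with a fresh sample.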
Experiment Setup | Yes | We set the LoRA tuning parameters for training to r = 128 and α = 256. Stage 1 training was conducted with a batch size of 256 for 6,000 steps. Stage 2 training used a batch size of 128 for 1 epoch. ... The maximum number of attempts M for iterative sampling was set to 5, and the number of candidates K for multi-candidate generation was set to 20. ... The sampling temperature was set to 0.6. ... Table 8 (configuration for IMGTIkZ model training): model max length = 4096; num train epochs = 1; batch size = 16; gradient accumulation steps = 8; mm projector lr = 2e-5.
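The inference settings quoted above (K = 20 candidates, up to M = 5 compile attempts per candidate) can be sketched as a generic loop. The callables `generate_fn`, `compiles_fn`, and `score_fn` are hypothetical stand-ins for the model's sampler (temperature 0.6 in the paper), a pdflatex compile check, and the candidate ranker; the paper does not spell out its selection logic at this level of detail.

```python
def multi_candidate_inference(generate_fn, compiles_fn, score_fn,
                              k=20, max_attempts=5):
    """Collect up to k compilable candidates, then return the best-scoring one.

    generate_fn() -> str       samples one TikZ program from the model
    compiles_fn(code) -> bool  e.g. whether pdflatex accepts the code
    score_fn(code) -> float    e.g. similarity between rendered and input images
    """
    candidates = []
    for _ in range(k):
        # Iterative sampling: resample until the candidate compiles,
        # giving up after max_attempts tries (M in the paper).
        for _ in range(max_attempts):
            code = generate_fn()
            if compiles_fn(code):
                candidates.append(code)
                break
    if not candidates:
        return None  # every sample failed to compile
    return max(candidates, key=score_fn)
```

Separating the compile check from the scorer keeps the loop agnostic to how candidates are ranked.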