Vector Grimoire: Codebook-based Shape Generation under Raster Image Supervision
Authors: Marco Cipriano, Moritz Feuerpfeil, Gerard de Melo
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the effectiveness of our method by fitting GRIMOIRE for closed filled shapes on MNIST and Emoji, and for outline strokes on icon and font data, surpassing previous image-supervised methods in generative quality and the vector-supervised approach in flexibility. |
| Researcher Affiliation | Academia | Marco Cipriano\*, Moritz Feuerpfeil\*, Gerard de Melo (\*equal contribution; Hasso Plattner Institute). Correspondence to: Marco Cipriano <EMAIL>. |
| Pseudocode | No | The paper describes the methodology in narrative text and with diagrams (Figure 2, Figure 3), but it does not contain a formally labeled pseudocode block or algorithm. |
| Open Source Code | Yes | 4. We release the code of this work to the research community1. 1https://github.com/potpov/Vector_Grimoire |
| Open Datasets | Yes | We experiment on four datasets (see Section A.1). MNIST. We conduct our initial experiments on the MNIST dataset (LeCun et al., 1998). Fonts. For our experiments on fonts, we use a subset of the SVG-Fonts dataset (Lopes et al., 2019). FIGR-8. We validate our method on more complex data and further use a subset of FIGR-8 (Clouâtre & Demers, 2019). Emoji. For our preliminary experiments with segmentation-guided patch extraction, we use a subset of standard emoji images (emoji dataset, 2022). |
| Dataset Splits | Yes | For MNIST, the patches are obtained by tiling each image in a 6 × 6 grid. For Fonts, we use 80%, 10%, and 10% for training, testing, and validation respectively. For FIGR-8, we select 90% for training, 5% for validation, and 5% for testing. For Emoji, we focus on images that primarily depict faces, selecting 107 for training and 20 for testing. |
| Hardware Specification | Yes | Training the VSQ module on six NVIDIA H100 takes approximately 48, 15, and 12 hours for MNIST, FIGR-8, and Fonts, respectively; the ART module takes considerably fewer resources, requiring around 8 hours depending on the configuration. These values were obtained across 20 generations on one NVIDIA H100. |
| Software Dependencies | No | The paper mentions using AdamW optimization, a Ranger scheduler, a pre-trained BERT encoder (Devlin et al., 2018), and CLIP with a ViT-16 backend, but it does not specify concrete version numbers for any software libraries, programming languages, or specific frameworks like PyTorch or TensorFlow. |
| Experiment Setup | Yes | We use AdamW optimization and train the VSQ module for one epoch for Fonts and FIGR-8 and five epochs for MNIST. We use a learning rate of λ = 2 × 10⁻⁵, while the auto-regressive Transformer is trained for 30 epochs with λ = 6 × 10⁻⁴. The Transformer has a context length of 512. |
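The hyperparameters quoted in the Experiment Setup and Dataset Splits rows can be collected into a single configuration for reproduction attempts. The sketch below is only an illustrative stand-in, assuming a dict-based layout; the key names (`vsq`, `art`, `lr_for`) are hypothetical and do not come from the released GRIMOIRE code.

```python
# Illustrative training configuration assembled from the values quoted
# in the table above (AdamW for both modules; VSQ at lr 2e-5, the
# auto-regressive Transformer (ART) at lr 6e-4 with context length 512).
# All identifiers here are hypothetical stand-ins, not the authors' API.
TRAIN_CONFIG = {
    "optimizer": "AdamW",
    "vsq": {
        "lr": 2e-5,
        # Epochs per dataset, as reported in the paper.
        "epochs": {"MNIST": 5, "Fonts": 1, "FIGR-8": 1},
    },
    "art": {
        "lr": 6e-4,
        "epochs": 30,
        "context_length": 512,
    },
    # Train/validation/test fractions quoted in the Dataset Splits row.
    "splits": {
        "Fonts": (0.80, 0.10, 0.10),
        "FIGR-8": (0.90, 0.05, 0.05),
    },
}

def lr_for(module: str) -> float:
    """Look up the learning rate for a named module ('vsq' or 'art')."""
    return TRAIN_CONFIG[module]["lr"]
```

For example, `lr_for("art")` returns `6e-4`, and each split tuple sums to 1.0, which is a cheap sanity check before wiring the values into an actual training loop.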