Vector Grimoire: Codebook-based Shape Generation under Raster Image Supervision

Authors: Marco Cipriano, Moritz Feuerpfeil, Gerard de Melo

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the effectiveness of our method by fitting GRIMOIRE for closed filled shapes on MNIST and Emoji, and for outline strokes on icon and font data, surpassing previous image-supervised methods in generative quality and the vector-supervised approach in flexibility.
Researcher Affiliation | Academia | Marco Cipriano*, Moritz Feuerpfeil*, Gerard de Melo (*equal contribution; Hasso Plattner Institute). Correspondence to: Marco Cipriano <EMAIL>.
Pseudocode | No | The paper describes the methodology in narrative text and with diagrams (Figure 2, Figure 3), but it does not contain a formally labeled pseudocode block or algorithm.
Open Source Code | Yes | We release the code of this work to the research community: https://github.com/potpov/Vector_Grimoire
Open Datasets | Yes | We experiment on four datasets (see Section A.1). MNIST. We conduct our initial experiments on the MNIST dataset (LeCun et al., 1998). Fonts. For our experiments on fonts, we use a subset of the SVG-Fonts dataset (Lopes et al., 2019). FIGR-8. We validate our method on more complex data and further use a subset of FIGR-8 (Clouâtre & Demers, 2019). Emoji. For our preliminary experiments with segmentation-guided patch extraction, we use a subset of standard emoji images (emoji dataset, 2022).
Dataset Splits | Yes | For MNIST, the patches are obtained by tiling each image in a 6 × 6 grid. For Fonts, we use 80%, 10%, and 10% for training, testing, and validation respectively. For FIGR-8, we select 90% for training, 5% for validation, and 5% for testing. For Emoji, we focus on images that primarily depict faces, selecting 107 for training and 20 for testing.
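The fractional splits quoted above can be reproduced with a simple shuffle-and-partition helper. This is an illustrative sketch, not code from the released repository; the function name `split_dataset` and the fixed seed are assumptions.

```python
import random

def split_dataset(items, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle a dataset and partition it into train/val/test slices.

    Defaults mirror the Fonts split (80/10/10); the FIGR-8 split
    would use train_frac=0.90, val_frac=0.05.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    items = list(items)
    rng.shuffle(items)
    n_train = int(len(items) * train_frac)
    n_val = int(len(items) * val_frac)
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

train, val, test = split_dataset(range(1000))
# sizes: 800 / 100 / 100
```

The remainder after the train and validation slices becomes the test set, so the three parts always cover the whole dataset even when the fractions do not divide it exactly.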
Hardware Specification | Yes | Training the VSQ module on six NVIDIA H100 takes approximately 48, 15, and 12 hours for MNIST, FIGR-8, and Fonts, respectively; the ART module takes considerably fewer resources, requiring around 8 hours depending on the configuration. These values were obtained across 20 generations on one NVIDIA H100.
Software Dependencies | No | The paper mentions using AdamW optimization, a Ranger scheduler, a pre-trained BERT encoder (Devlin et al., 2018), and CLIP with a ViT-16 backend, but it does not specify concrete version numbers for any software libraries, programming languages, or specific frameworks like PyTorch or TensorFlow.
Experiment Setup | Yes | We use AdamW optimization and train the VSQ module for one epoch for Fonts and FIGR-8 and five epochs for MNIST. We use a learning rate of λ = 2 × 10⁻⁵, while the auto-regressive Transformer is trained for 30 epochs with λ = 6 × 10⁻⁴. The Transformer has a context length of 512.
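The hyperparameters quoted in this row can be collected into a plain configuration dictionary, which is one way a reproduction attempt might pin them down. This is a hedged sketch: the keys (`vsq`, `art`, `epochs`) are illustrative names, not identifiers from the released code.

```python
# Hyperparameters as quoted from the paper; structure and key names
# are assumptions for illustration, not the authors' config format.
config = {
    "vsq": {                      # vector-quantization module
        "optimizer": "AdamW",
        "lr": 2e-5,               # λ = 2 × 10⁻⁵
        "epochs": {"Fonts": 1, "FIGR-8": 1, "MNIST": 5},
    },
    "art": {                      # auto-regressive Transformer
        "optimizer": "AdamW",
        "lr": 6e-4,               # λ = 6 × 10⁻⁴
        "epochs": 30,
        "context_length": 512,
    },
}

print(config["art"]["lr"], config["art"]["context_length"])
```

Keeping the two modules' settings in separate sub-dictionaries reflects that the paper trains them in distinct stages with different learning rates and epoch counts.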