Decomposition of Graphic Design with Unified Multimodal Model

Authors: Hui Nie, Zhao Zhang, Yutao Cheng, Maoke Yang, Gonglei Shi, Qingsong Xie, Jie Shao, Xinglong Wu

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate the effectiveness of the proposed method. The paper includes sections such as "6. Training Datasets", "7. Experiments", "7.3. Quantitative Results", and "7.4. Qualitative Results", along with ablation studies and comparisons to baselines.
Researcher Affiliation | Collaboration | The authors are affiliated with the "University of Chinese Academy of Sciences" (academic), "ByteDance Intelligent Creation, China", and the "OPPO AI Center" (both industry).
Pseudocode | No | The paper describes the DeaM pipeline textually and with a diagram in Figure 2, but it does not contain explicit pseudocode or algorithm blocks.
Open Source Code | Yes | The code is available at https://github.com/witnessai/DeaM.
Open Datasets | Yes | To facilitate a more open and transparent comparison with other methods, the authors evaluate on the publicly available academic Crello dataset (Yamaguchi, 2021), now known as VistaCreate, which provides a collection of visual designs originating from an online design tool (https://huggingface.co/datasets/cyberagent/crello).
Dataset Splits | Yes | The test set of this dataset contains over 2,000 images.
Hardware Specification | Yes | We use 16 NVIDIA A800 GPUs for training.
Software Dependencies | No | The paper mentions several software components and models, including VQ-GAN, InternLM2-7B, ResNet, the CLIP vision encoder, DINOv2, and GPT-4. While InternLM2-7B is cited with a year (Team, 2023), specific version numbers for key software components (such as Python, PyTorch, CUDA, or explicit model versions) are not provided.
Experiment Setup | Yes | The training process of DeaM is divided into three phases: VQ-GAN training, instruction tuning, and decoder training. ... We trained the VQ-GAN model with a downsampling ratio of f = 16. ... we set the input resolution for semantically rich natural images to 192x192 and for semantically sparse decorative elements to 128x128. ... We use 16 NVIDIA A800 GPUs for training.
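As a quick sanity check on the setup numbers above, a downsampling ratio of f = 16 determines how many visual tokens the VQ-GAN produces per image at each stated resolution. The sketch below is illustrative arithmetic only; `grid_tokens` is a hypothetical helper and is not part of the paper's released code.

```python
# Token-grid arithmetic for a VQ-GAN with downsampling ratio f = 16.
# `grid_tokens` is a hypothetical helper, not from the DeaM codebase.

def grid_tokens(height: int, width: int, f: int = 16) -> tuple[int, int, int]:
    """Return (grid_h, grid_w, total_tokens) for an f-times downsampled image."""
    assert height % f == 0 and width % f == 0, "resolution must be divisible by f"
    grid_h, grid_w = height // f, width // f
    return grid_h, grid_w, grid_h * grid_w

# Natural images at 192x192 -> 12x12 grid = 144 tokens per image.
print(grid_tokens(192, 192))  # (12, 12, 144)

# Decorative elements at 128x128 -> 8x8 grid = 64 tokens per image.
print(grid_tokens(128, 128))  # (8, 8, 64)
```

Under these assumptions, the lower 128x128 resolution for decorative elements cuts the per-image token count by more than half relative to natural images, which is consistent with treating them as semantically sparse.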