Decomposition of Graphic Design with Unified Multimodal Model
Authors: Hui Nie, Zhao Zhang, Yutao Cheng, Maoke Yang, Gonglei Shi, Qingsong Xie, Jie Shao, Xinglong Wu
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate the effectiveness of the proposed method. The paper includes sections such as "6. Training Datasets", "7. Experiments", "7.3. Quantitative Results", and "7.4. Qualitative Results", along with ablation studies and comparisons to baselines. |
| Researcher Affiliation | Collaboration | The authors are affiliated with the University of Chinese Academy of Sciences (academic), ByteDance Intelligent Creation, China, and the OPPO AI Center (both industry). |
| Pseudocode | No | The paper describes the DeaM pipeline textually and with a diagram in Figure 2, but it does not contain explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code is available at https://github.com/witnessai/DeaM. |
| Open Datasets | Yes | To facilitate a more open and transparent comparison with other methods, we utilize a publicly available academic dataset, Crello, for evaluation. Crello (Yamaguchi, 2021), now known as VistaCreate, provides a collection of visual designs originating from an online design tool. Dataset: https://huggingface.co/datasets/cyberagent/crello |
| Dataset Splits | Yes | The test set of this dataset contains over 2,000 images. |
| Hardware Specification | Yes | We use 16 NVIDIA A800 GPUs for training. |
| Software Dependencies | No | The paper mentions several software components and models such as VQ-GAN, InternLM2-7B, ResNet, the CLIP vision encoder, DINOv2, and GPT-4. While InternLM2-7B is cited with a year (Team, 2023), specific version numbers for key software dependencies (such as Python, PyTorch, or CUDA) and explicit model versions are not provided. |
| Experiment Setup | Yes | The training process of DeaM is divided into three phases: VQ-GAN training, instruction tuning, and decoder training. ... We trained the VQ-GAN model with a downsampling ratio of f = 16. ... we set the input resolution for semantically rich natural images to 192×192 and for semantically sparse decorative elements to 128×128. ... We use 16 NVIDIA A800 GPUs for training. |
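The experiment setup above pairs a VQ-GAN downsampling ratio of f = 16 with two input resolutions, which determines the size of the discrete token grid each image is quantized into. A minimal arithmetic sketch of that relationship (the `token_grid` helper is hypothetical, not from the paper's code):

```python
def token_grid(height: int, width: int, f: int = 16) -> tuple[int, int]:
    """Return the (rows, cols) token grid produced by a VQ-GAN
    with downsampling ratio f for a given input resolution."""
    assert height % f == 0 and width % f == 0, "resolution must be divisible by f"
    return height // f, width // f

# Resolutions reported in the setup:
natural = token_grid(192, 192)      # semantically rich natural images -> (12, 12)
decorative = token_grid(128, 128)   # sparse decorative elements -> (8, 8)
```

This is why the sparser decorative elements get the smaller 128×128 resolution: they need fewer tokens (an 8×8 grid versus 12×12) to be represented adequately.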