UniMC: Taming Diffusion Transformer for Unified Keypoint-Guided Multi-Class Image Generation
Authors: Qin Guo, Ailing Zeng, Dongxu Yue, Ceyuan Yang, Yang Cao, Hanzhong Guo, Fei Shen, Wei Liu, Xihui Liu, Dan Xu
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate the high quality of HAIG-2.9M and the effectiveness of UNIMC, particularly in heavy occlusions and multi-class scenarios. 5. Experiments Implementation Details. ... For evaluation, we utilize the testing set of HAIG-2.9M. ... Evaluation Metrics. We adopt commonly-used metrics for comprehensive comparisons from five perspectives: 1) Image Quality. FID (Heusel et al., 2017) and KID (Bińkowski et al., 2018) reflect quality and diversity; 2) Text-Image Alignment. CLIP (Radford et al., 2021) text-image similarity is reported; 3) Class Accuracy. ... 4) Pose Accuracy. ... 5) Human Subjective Evaluation. |
| Researcher Affiliation | Collaboration | 1The Hong Kong University of Science and Technology 2Tencent 3Peking University 4The Chinese University of Hong Kong 5The University of Hong Kong 6National University of Singapore. Correspondence to: Dan Xu <EMAIL>. |
| Pseudocode | No | The paper describes the UNIMC framework and its components (unified keypoint encoder, timestep-aware keypoint modulator) in Section 3 and Figure 3, but does not present a structured pseudocode or algorithm block. |
| Open Source Code | No | The paper does not contain an explicit statement about the release of source code for the methodology described, nor does it provide a direct link to a code repository. |
| Open Datasets | Yes | We propose HAIG-2.9M, a large-scale, high-quality, and diverse dataset designed for keypoint-guided human and animal image generation. Our License: Creative Common CC-BY 4.0 license. In A.1. Licenses section, it lists multiple image websites and datasets with their URLs and licenses, such as Pexels (Pexels, 2024): Creative Commons CC0 license. https://www.pexels.com/ and SA-1B8 (Kirillov et al., 2023): SA-1B Dataset Research License. https://ai.meta.com/datasets/segment-anything/. |
| Dataset Splits | Yes | Dataset Split. Detailed statistics for each subset of the dataset are provided in Tab. 2. First, for the testing set, we select 40 images for each class, ensuring that each class of images contains multiple classes. Then, we split the remaining images into training and validation sets at an approximately 20 : 1 ratio. We adopt a class-level partition to ensure the class proportions are balanced between the training and validation sets. The training set comprises 745K images and 2.7M instances, while the validation set consists of 39K images and 145K instances. Table 2. Split of HAIG-2.9M. Training Set 745,828 2,725,484... Validation Set 39,342 145,504... Testing Set 1,224 3,785... |
| Hardware Specification | Yes | We train at 1024×1024 resolution for 8K steps with a batch size of 256 using 8 A800 GPUs. |
| Software Dependencies | No | The paper mentions PIXART-α-1024px as the backbone model and the AdamW optimizer, but does not provide specific version numbers for any software libraries or programming languages used in the implementation. |
| Experiment Setup | Yes | Implementation Details. We use PIXART-α-1024px (Chen et al., 2024c) as backbone. We use the AdamW optimizer (Loshchilov & Hutter, 2017) with a weight decay of 0.03 and a fixed learning rate of 2e-5; we only train the unified keypoint encoder and the timestep-aware keypoint modulator. We train at 1024×1024 resolution for 8K steps with a batch size of 256 using 8 A800 GPUs. During training, we drop the bounding box condition with 50% probability, the keypoint condition with 15% probability, and the prompt with 10% probability. |
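
The condition-dropping schedule quoted above (bounding box 50%, keypoints 15%, prompt 10%) is a standard classifier-free-guidance-style training trick. A minimal sketch of how such independent per-condition dropping might look is below; all names (`DROP_PROBS`, `drop_conditions`) are illustrative assumptions, not taken from the paper's code, and a real implementation would typically substitute a learned null embedding rather than `None`.

```python
import random

# Drop probabilities as reported in the paper's implementation details.
DROP_PROBS = {
    "bbox": 0.50,       # bounding-box condition dropped 50% of the time
    "keypoints": 0.15,  # keypoint condition dropped 15% of the time
    "prompt": 0.10,     # text prompt dropped 10% of the time
}

def drop_conditions(sample: dict, rng: random.Random) -> dict:
    """Return a copy of `sample` with each condition independently nulled out."""
    out = dict(sample)
    for key, p in DROP_PROBS.items():
        if key in out and rng.random() < p:
            out[key] = None  # placeholder; real code might use a learned null embedding
    return out

# Sanity check: simulate many draws and measure empirical drop rates.
rng = random.Random(0)
n = 100_000
dropped = {k: 0 for k in DROP_PROBS}
for _ in range(n):
    s = drop_conditions({"bbox": 1, "keypoints": 1, "prompt": "a dog"}, rng)
    for k in DROP_PROBS:
        if s[k] is None:
            dropped[k] += 1
rates = {k: dropped[k] / n for k in DROP_PROBS}
```

Because each condition is dropped independently, the model sees every subset of conditions during training, which is what enables guidance with any combination of box, keypoint, and text inputs at inference time.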