RealisID: Scale-Robust and Fine-Controllable Identity Customization via Local and Global Complementation
Authors: Zhaoyang Sun, Fei Du, Weihua Chen, Fan Wang, Yaxiong Chen, Yi Rong, Shengwu Xiong
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments and ablation studies indicate the effectiveness of RealisID and verify its ability to fulfill all the requirements mentioned above. |
| Researcher Affiliation | Collaboration | (1) Wuhan University of Technology; (2) DAMO Academy, Alibaba Group; (3) Hupan Laboratory; (4) Shanghai AI Laboratory; (5) Interdisciplinary Artificial Intelligence Research Institute, Wuhan College |
| Pseudocode | No | The paper describes the methodology using textual explanations and mathematical equations, but does not include any explicit pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide an explicit statement about the release of its source code, nor does it include a link to a code repository. |
| Open Datasets | Yes | The trainable parameters in our RealisID model are learned from the publicly available CosmicMan dataset (Li et al. 2024a), which comprises 2 million image-text pairs of single individuals. Our evaluation data consists of 40 unseen identities obtained from another CelebA-HQ (Karras et al. 2017) dataset. |
| Dataset Splits | No | The paper states that the CosmicMan dataset (Li et al. 2024a) is used for training and that 40 unseen identities from the CelebA-HQ dataset (Karras et al. 2017) are used for evaluation. However, it does not specify the splits in detail: there are no percentages or sample counts for training, validation, and testing, and no description of how the 40 unseen identities were selected, which limits reproducibility. |
| Hardware Specification | Yes | The framework is optimized on 8 NVIDIA H20 GPUs, using the Adam optimizer with a batch size of 16, a learning rate of 1e-5, and a weight decay of 1e-2. |
| Software Dependencies | No | The paper mentions using MTCNN, BiSeNet, MediaPipe, SDXL-1.0, and IP-Adapter. However, it does not provide specific version numbers for these software dependencies or for any other libraries needed to replicate the experiment. |
| Experiment Setup | Yes | During the training phase, we follow the learning strategy of IP-Adapter (Ye et al. 2023), which randomly drops either the image prompt (i.e., ID embedding), the text prompt, or both with a probability of 0.05. The hyperparameter λ in Eq. (7) is set to 1.0. The framework is optimized on 8 NVIDIA H20 GPUs, using the Adam optimizer with a batch size of 16, a learning rate of 1e-5, and a weight decay of 1e-2. For inference, we adopt the same delayed subject conditioning technique as in (Xiao et al. 2023). We set λt = 7.5 and λi = 5.0 in Eq. (8), and use a 30-step DDIM (Song, Meng, and Ermon 2020) sampler to generate the target images. |
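The IP-Adapter-style prompt-dropping strategy quoted in the Experiment Setup row can be sketched as below. This is a minimal illustration, not the authors' code: the paper only states that the image prompt, the text prompt, or both are dropped with probability 0.05, so the mutually exclusive three-branch layout, the function name `drop_prompts`, and the null-embedding placeholders are assumptions.

```python
import random


def drop_prompts(text_emb, image_emb, null_text, null_image, p=0.05, rng=random):
    """Randomly replace conditioning embeddings with null (unconditional)
    embeddings during training, enabling classifier-free guidance at
    inference. With probability p drop the text prompt, with probability p
    drop the image (ID) prompt, and with probability p drop both.

    The three-way mutually exclusive split is an assumed reading of the
    paper's description of the IP-Adapter learning strategy.
    """
    r = rng.random()
    if r < p:                       # drop the text prompt only
        text_emb = null_text
    elif r < 2 * p:                 # drop the image (ID) prompt only
        image_emb = null_image
    elif r < 3 * p:                 # drop both prompts
        text_emb, image_emb = null_text, null_image
    # otherwise (probability 1 - 3p): keep both prompts unchanged
    return text_emb, image_emb
```

In a real training loop the embeddings would be tensors and the null embeddings learned or zeroed vectors; here plain placeholders suffice to show the control flow.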