RealisID: Scale-Robust and Fine-Controllable Identity Customization via Local and Global Complementation

Authors: Zhaoyang Sun, Fei Du, Weihua Chen, Fan Wang, Yaxiong Chen, Yi Rong, Shengwu Xiong

AAAI 2025

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments and ablation studies indicate the effectiveness of RealisID and verify its ability to fulfill all the requirements mentioned above. |
| Researcher Affiliation | Collaboration | Wuhan University of Technology; DAMO Academy, Alibaba Group; Hupan Laboratory; Shanghai AI Laboratory; Interdisciplinary Artificial Intelligence Research Institute, Wuhan College |
| Pseudocode | No | The paper describes the methodology using textual explanations and mathematical equations, but does not include any explicit pseudocode or algorithm blocks. |
| Open Source Code | No | The paper neither states that its source code will be released nor links to a code repository. |
| Open Datasets | Yes | The trainable parameters in our RealisID model are learned from the publicly available CosmicMan dataset (Li et al. 2024a), which comprises 2 million image-text pairs of single individuals. Our evaluation data consists of 40 unseen identities obtained from another CelebA-HQ (Karras et al. 2017) dataset. |
| Dataset Splits | No | The paper states that the CosmicMan dataset (Li et al. 2024a) is used for training and that 40 unseen identities from the CelebA-HQ dataset (Karras et al. 2017) are used for evaluation. However, it does not give explicit split details, such as percentages or sample counts for training, validation, and testing, or the methodology for selecting the unseen identities, which limits reproducibility. |
| Hardware Specification | Yes | The framework is optimized on 8 NVIDIA H20 GPUs, using the Adam optimizer with a batch size of 16, a learning rate of 1e-5, and a weight decay of 1e-2. |
| Software Dependencies | No | The paper mentions using MTCNN, BiSeNet, MediaPipe, SDXL-1.0, and IP-Adapter. However, it does not provide version numbers for these software dependencies or for any other libraries needed to replicate the experiments. |
| Experiment Setup | Yes | During the training phase, we follow the learning strategy of IP-Adapter (Ye et al. 2023) that randomly drops either the image prompt (i.e., ID embedding) or the text prompt or both of them with a probability of 0.05. The hyperparameter λ in Eq. (7) is set to 1.0. The framework is optimized on 8 NVIDIA H20 GPUs, using the Adam optimizer with a batch size of 16, a learning rate of 1e-5, and a weight decay of 1e-2. For inference, we adopt the same delayed subject conditioning technique as in (Xiao et al. 2023). We set λ_t = 7.5 and λ_i = 5.0 in Eq. (8), and use a 30-step DDIM (Song, Meng, and Ermon 2020) sampler to generate the target images. |
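The Experiment Setup row lists concrete hyperparameters (0.05 conditioning-dropout probability, λ_t = 7.5, λ_i = 5.0 for guidance in Eq. (8)). The sketch below is a minimal, framework-free illustration of how those two pieces could be wired up; the function names, the independent-dropout approximation, and the particular dual-scale guidance decomposition are assumptions for illustration, not the paper's released implementation.

```python
import random

# Hyperparameters reported in the paper's experiment setup.
DROP_PROB = 0.05   # conditioning-dropout probability (IP-Adapter strategy)
LAMBDA_T = 7.5     # text guidance scale, lambda_t in Eq. (8)
LAMBDA_I = 5.0     # image (ID) guidance scale, lambda_i in Eq. (8)


def drop_prompts(id_embed, text_embed, p=DROP_PROB, rng=random):
    """Conditioning dropout for classifier-free guidance training.

    Zeroes out the image prompt (ID embedding) and/or the text prompt,
    each with probability p. Note: the paper drops either prompt or
    both with probability 0.05; independent drops are an approximation.
    """
    if rng.random() < p:
        id_embed = [0.0] * len(id_embed)
    if rng.random() < p:
        text_embed = [0.0] * len(text_embed)
    return id_embed, text_embed


def guided_noise(eps_uncond, eps_text, eps_full,
                 lambda_t=LAMBDA_T, lambda_i=LAMBDA_I):
    """One plausible dual-scale guidance combination for Eq. (8).

    Blends the unconditional, text-conditioned, and fully conditioned
    noise predictions with separate text and image guidance scales.
    """
    return [eu + lambda_t * (et - eu) + lambda_i * (ef - et)
            for eu, et, ef in zip(eps_uncond, eps_text, eps_full)]
```

In a real pipeline these lists would be tensors of noise predictions from the denoising UNet, and `guided_noise` would be called once per DDIM step (30 steps in the paper's setup).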