InstantPortrait: One-Step Portrait Editing via Diffusion Multi-Objective Distillation

Authors: Zhixin Lai, Keqiang Sun, Fu-Yun Wang, Dhritiman Sagar, Erli Ding

ICLR 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | Extensive comparison with prior models demonstrates that IPNet is superior in identity preservation, text fidelity, and inference speed. In the evaluation, IPNet is compared with top state-of-the-art portrait image editing models on identity preservation, text fidelity, and image quality, as well as model size and inference steps, and outperforms them by a substantial margin, demonstrating the effectiveness of the Diffusion Multi-Objective Distillation approach. The paper outlines the experiment setup in Section 4.1, compares IPNet with state-of-the-art (SOTA) models in Section 4.2, and conducts ablation studies in Section 4.3 to evaluate component effectiveness (Table 1: quantitative comparison against state-of-the-art models).
Researcher Affiliation | Collaboration | 1 Snap Inc; 2 The Chinese University of Hong Kong. Contact: EMAIL, EMAIL, {dhritiman.sagar}@snapchat.com
Pseudocode | No | The paper describes its methods using mathematical equations and structured prose, but does not include any explicitly labeled pseudocode blocks or algorithms formatted as code.
Open Source Code | No | The paper states: "We have made significant efforts to ensure the reproducibility of the results presented in this paper. Detailed instructions for dataset generation are provided in Appendix B. Model training procedures are outlined in Section 3, while the main experiments and evaluation are discussed in Section 4. Additional experimental results can be found in Appendix E. Moreover, a demo video is also available for download in the Supplementary Materials." However, it neither provides a link to the source code for the described methodology nor states that the code is included in the supplementary materials.
Open Datasets | Yes | For evaluation, the authors select 100 images from FFHQ (Karras et al., 2019) and 40 prompts from their dataset, generating 4000 validation data pairs.
Dataset Splits | Yes | For training, the authors use their dataset outlined in Appendix B; using the prompt sets described in Section B.1, they generate 10 million image pairs within this framework. For evaluation, they select 100 images from FFHQ (Karras et al., 2019) and 40 prompts from their dataset, generating 4000 validation data pairs.
Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments, such as GPU models or CPU types.
Software Dependencies | No | The paper does not explicitly mention any specific software dependencies or library versions used in the implementation.
Experiment Setup | Yes | IDE-Net is trained with time steps sampled in the range [0, 999], using an Annealing Identity Loss L_aid weight of 0.7 to balance identity preservation and image quality. Building on IDE-Net, IPNet distillation is divided into three stages:
- High time step ([400, 800]): focuses on structure and pose alignment, with the Identity Distillation Loss L_distill weight set to 1 to stabilize early training.
- Middle time step ([200, 400]): reduces the L_distill weight to 0.3, emphasizing the Adversarial Loss L_adv to improve image quality and mitigate artifacts while maintaining structure and pose alignment.
- Low time step ([150, 200]): refines details and style with a DDIM inversion-based L_distill and the Face-Style Enhancing Triplet Loss L_triplet, using a large batch size of 2048 to enhance style consistency and stabilize training.
More training details are summarized in Appendix G.

Table 6: Training Parameters
Model | Time step | Loss function | Training batch size | Training steps
IDE-Net | [0, 999] | L_dm + 0.7 L_aid | 256 | 40k
IPNet (high time step) | [400, 800] | L_adv + 1 L_distill + 0 L_triplet | 256 | 25k
IPNet (middle time step) | [200, 400] | L_adv + 0.3 L_distill + 0 L_triplet | 256 | 15k
IPNet (low time step) | [150, 200] | L_adv + 0.3 L_distill + 1 L_triplet | 2048 | 1k
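The staged training schedule quoted above can be summarized as data. The following is a minimal, hypothetical Python sketch of that schedule: the stage names, the `DistillStage` container, and `total_loss` are illustrative inventions; only the time-step ranges, loss weights, batch sizes, and step counts come from the paper's Table 6.

```python
# Hypothetical sketch of the three-stage IPNet distillation schedule (Table 6).
# Only the numbers are from the paper; all names here are illustrative.
from dataclasses import dataclass

@dataclass(frozen=True)
class DistillStage:
    t_range: tuple      # time steps are sampled in [lo, hi]
    w_distill: float    # weight on the Identity Distillation Loss L_distill
    w_triplet: float    # weight on the Face-Style Enhancing Triplet Loss L_triplet
    batch_size: int
    steps: int          # number of training steps

# High -> middle -> low time-step curriculum.
IPNET_SCHEDULE = {
    "high":   DistillStage((400, 800), w_distill=1.0, w_triplet=0.0, batch_size=256,  steps=25_000),
    "middle": DistillStage((200, 400), w_distill=0.3, w_triplet=0.0, batch_size=256,  steps=15_000),
    "low":    DistillStage((150, 200), w_distill=0.3, w_triplet=1.0, batch_size=2048, steps=1_000),
}

def total_loss(stage_name: str, l_adv: float, l_distill: float, l_triplet: float) -> float:
    """Combine per-stage losses as L_adv + w_distill * L_distill + w_triplet * L_triplet."""
    s = IPNET_SCHEDULE[stage_name]
    return l_adv + s.w_distill * l_distill + s.w_triplet * l_triplet
```

Freezing the dataclass keeps the schedule immutable during training; under these assumptions, e.g. `total_loss("high", 0.5, 1.0, 1.0)` weighs the distillation term fully and ignores the triplet term.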