EMControl: Adding Conditional Control to Text-to-Image Diffusion Models via Expectation-Maximization
Authors: He Wang, Longquan Dai, Jinhui Tang
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments We implement EMControl across various conditions, including canny edge (Canny 1986), depth map (Yang et al. 2024b), normal map (Vasiljevic et al. 2019), M-LSD lines (Gu et al. 2022), HED edge (Xie and Tu 2015), semantic segmentation (Cheng et al. 2022), skeleton (Cao et al. 2017), sketch, object location (Redmon et al. 2016) and style guidance (Radford et al. 2021). In this section, we present the generated results and provide a comparison with existing methods to demonstrate the effectiveness of our approach. |
| Researcher Affiliation | Academia | He Wang, Longquan Dai*, Jinhui Tang Nanjing University of Science and Technology, China |
| Pseudocode | Yes | Algorithm 1: EMControl Sampling |
| Open Source Code | No | The paper does not provide any explicit statements about the availability of open-source code or links to a code repository. |
| Open Datasets | Yes | We trained our model on approximately 156,000 images from COCO2017 (Lin et al. 2014), covering a range of tasks. [...] For the aspect of style guidance, we further integrated approximately 81,000 images from Wiki-Art (Tan et al. 2019). |
| Dataset Splits | Yes | To evaluate the performance of different methods, we used the COCO2017 validation set comprising 5,000 image-text pairs. |
| Hardware Specification | Yes | During training, the model commenced with the SDv1.5 checkpoint and was trained for 20 hours on a single NVIDIA RTX3090 GPU. |
| Software Dependencies | No | A batch size of 1 was utilized alongside the AdamW optimizer at a learning rate of 1e-5, where the inputs, including images and condition, were scaled down to 512×512 pixels. For EMControl sampling, we employed the DDPM (Ho, Jain, and Abbeel 2020) scheduler across 20 time steps. This text mentions specific optimizers and schedulers, but not their software versions or the versions of underlying libraries such as PyTorch/TensorFlow, Python, or CUDA. |
| Experiment Setup | Yes | Experimental Setup Our model for the latent forward network Aθ(zt, t) is based on U-Net (Ronneberger, Fischer, and Brox 2015). During training, the model commenced with the SDv1.5 checkpoint and was trained for 20 hours on a single NVIDIA RTX3090 GPU. A batch size of 1 was utilized alongside the AdamW optimizer at a learning rate of 1e-5, where the inputs, including images and condition, were scaled down to 512×512 pixels. For EMControl sampling, we employed the DDPM (Ho, Jain, and Abbeel 2020) scheduler across 20 time steps. |
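The training configuration quoted above (batch size 1, AdamW at lr 1e-5, 512×512 inputs) can be sketched as a minimal PyTorch training step. This is an illustrative approximation, not the authors' code: the tiny `Conv2d` stands in for the paper's U-Net-based latent network Aθ(zt, t), which in the real setup is initialized from the SDv1.5 checkpoint, and the 64×64×4 latent shape assumes the standard Stable Diffusion VAE downsampling factor of 8.

```python
import torch
import torch.nn.functional as F

# Hypothetical stand-in for the U-Net-based latent forward network A_theta(z_t, t);
# the paper initializes the real model from the SDv1.5 checkpoint.
model = torch.nn.Conv2d(4, 4, kernel_size=3, padding=1)

# Hyperparameters as reported: batch size 1, AdamW optimizer, learning rate 1e-5.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# A 512x512 input image corresponds to a 64x64x4 latent under the SD VAE
# (downsampling factor 8); the target here is a dummy noise tensor.
latent = torch.randn(1, 4, 64, 64)
target = torch.randn_like(latent)

# One training step: predict, compute MSE loss, update weights.
loss = F.mse_loss(model(latent), target)
loss.backward()
optimizer.step()
optimizer.zero_grad()
```

The reported sampler (a DDPM scheduler run for 20 steps) would sit outside this training loop; its exact integration with the EMControl sampling algorithm is specific to the paper and is not reproduced here.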