FlexControl: Computation-Aware Conditional Control with Differentiable Router for Text-to-Image Generation

Authors: Zheng Fang, Lichuan Xiang, Xu Cai, Kaicheng Zhou, Hongkai Wen

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Through comprehensive experiments on both UNet and DiT architectures on different control methods, we show that our method can upgrade existing controllable generative models in certain key aspects of interest. As evidenced by both quantitative and qualitative evaluations, FlexControl preserves or enhances image fidelity while also reducing computational overhead by selectively activating the most relevant blocks to control. [...] 4. Experiment 4.1. Quantitative comparison 4.2. Qualitative comparison 4.3. Ablation study
Researcher Affiliation Collaboration ¹Department of Computer Science, University of Warwick, Coventry, UK; ²Collov Labs. Correspondence to: Hongkai Wen <EMAIL>.
Pseudocode Yes A. Pseudo-code of Our Algorithm. In this section, we give the pseudo-code algorithm of our FlexControl. The specific inference procedure is shown in Algorithm 1, and the training procedure is shown in Algorithm 2.
Open Source Code No The code will soon be available at https://github.com/Daryu-Fan/FlexControl.
Open Datasets Yes We evaluate FlexControl against state-of-the-art methods across different conditions: depth map (MultiGen-20M, (Zhao et al., 2024a)), canny edge (LLAVA-558K, (Liu et al., 2024)), segmentation mask (ADE20K, (Zhou et al., 2017)), etc. [...] Depth map. In this application, we use MultiGen-20M proposed by (Zhao et al., 2024a) as training data, which is a subset of LAION-Aesthetics (Schuhmann et al., 2022) and contains over 2 million depth-image-caption pairs, and 5K test samples.
Dataset Splits Yes Depth map. In this application, we use MultiGen-20M proposed by (Zhao et al., 2024a) as training data, which is a subset of LAION-Aesthetics (Schuhmann et al., 2022) and contains over 2 million depth-image-caption pairs, and 5K test samples. [...] Segmentation mask. For the segmentation mask, we use the ADE20K (Zhou et al., 2017) dataset for model training. This dataset contains a total of 27K segmentation-image pairs, 25K for training and 2K for testing.
Hardware Specification Yes The speed is measured on a single Nvidia RTX 2080 Ti GPU. [...] The models based on SD1.5 and SD3.0 are trained with 2 and 8 Nvidia A100 (40G) GPUs, respectively. [...] The diffusion speed is measured on a single Nvidia A100 (40G) GPU.
Software Dependencies No We implement FlexControl based on SD1.5 (Stability, 2022) and SD3.0 (Esser et al., 2024a). [...] we further use DeepSpeed (Rajbhandari et al., 2020) Zero-2 to accelerate the training process, the resolution of 1024×1024 is used, and the batch size and gradient accumulation steps are set to 4 and 8.
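The quoted excerpt names DeepSpeed ZeRO-2 with batch size 4 and gradient accumulation 8 for the SD3.0 models. As a rough sketch only, a minimal DeepSpeed config mirroring those quoted values could look like the dictionary below; the field names follow DeepSpeed's public config schema, and any value not quoted in the paper (e.g. the fp16 flag here) is an assumption:

```python
# Minimal sketch of a DeepSpeed ZeRO stage-2 config (not the authors' file).
# Only batch size 4, gradient accumulation 8, and ZeRO-2 come from the paper;
# the remaining fields are illustrative assumptions.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,   # quoted batch size
    "gradient_accumulation_steps": 8,      # quoted accumulation steps
    "zero_optimization": {"stage": 2},     # ZeRO stage 2, as in the paper
    "fp16": {"enabled": True},             # assumed mixed-precision setting
    "optimizer": {
        "type": "AdamW",                   # optimizer named in the setup
        "params": {"lr": 1e-5},            # quoted learning rate
    },
}
```

Such a dictionary would typically be passed to `deepspeed.initialize(config=ds_config, ...)` alongside the model and optimizer parameters.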
Experiment Setup Yes Training settings. During the training procedure, we uniformly use the AdamW optimizer with a learning rate of 1×10⁻⁵. For SD1.5-based models, half-precision floating-point (Float16) is used for mixed-precision training, original images and conditional images are resized to 512×512, and batch size and gradient accumulation steps are set to 4 and 32, respectively. When turning to SD3.0, we further use DeepSpeed (Rajbhandari et al., 2020) Zero-2 to accelerate the training process, the resolution of 1024×1024 is used, and the batch size and gradient accumulation steps are set to 4 and 8. We set the maximum training iterations to 50k and 25k for the models based on SD1.5 and SD3.0, respectively. For the threshold parameter T required by the Gumbel-Sigmoid activation function in the router unit, we set it to 0.5; the hyperparameter λC in the loss function Lθ is set to 0.5, and the value of γ depends on the target sparsity.
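The setup above mentions a Gumbel-Sigmoid activation with threshold T = 0.5 inside the router unit. As a rough illustration only (not the authors' implementation), a Gumbel-Sigmoid gate can be sketched as below; the function name, temperature parameter `tau`, and the NumPy formulation are all assumptions:

```python
import numpy as np

def gumbel_sigmoid(logits, tau=1.0, threshold=0.5, seed=None):
    """Illustrative Gumbel-Sigmoid gate (assumed formulation, not the paper's code).

    Adds Gumbel noise to the router logits, squashes them with a
    temperature-scaled sigmoid, and binarizes against `threshold`
    (T = 0.5 in the quoted setup) to decide which blocks to activate.
    """
    rng = np.random.default_rng(seed)
    logits = np.asarray(logits, dtype=float)
    # Sample two Gumbel(0, 1) variates via -log(-log(U)), U ~ Uniform(0, 1);
    # their difference follows a logistic distribution.
    u1 = rng.uniform(1e-9, 1.0, size=logits.shape)
    u2 = rng.uniform(1e-9, 1.0, size=logits.shape)
    noise = -np.log(-np.log(u1)) + np.log(-np.log(u2))
    # Soft (differentiable) gate values in (0, 1).
    soft = 1.0 / (1.0 + np.exp(-(logits + noise) / tau))
    # Hard 0/1 routing decision from thresholding the soft gate.
    hard = (soft > threshold).astype(float)
    return soft, hard
```

In a straight-through setup, the hard 0/1 decision would drive block selection at inference while gradients flow through the soft values during training.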