Improving the Diffusability of Autoencoders

Authors: Ivan Skorokhodov, Sharath Girish, Benran Hu, Willi Menapace, Yanyu Li, Rameen Abdal, Sergey Tulyakov, Aliaksandr Siarohin

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate our approach on both image and video autoencoders, including Flux AE (Black Forest Labs, 2023), Cosmos Tokenizer (Agarwal et al., 2025), CogVideoX-AE (Hong et al., 2022), and LTX-AE (HaCohen et al., 2024), consistently demonstrating improved LDM performance on ImageNet-1K (Deng et al., 2009) 256², reducing FID by 19% for DiT-XL, and Kinetics-700 (Carreira et al., 2019) 17×256², reducing FVD by at least 44%.
Researcher Affiliation | Collaboration | ¹Snap Inc., ²Carnegie Mellon University. Correspondence to: Ivan Skorokhodov <EMAIL>, Aliaksandr Siarohin <EMAIL>.
Pseudocode | No | The paper describes methods using mathematical equations and structured steps in paragraph form, but does not contain explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present procedures formatted like code.
Open Source Code | Yes | The source code is available at https://github.com/snap-research/diffusability.
Open Datasets | Yes | We validate our approach on both image and video autoencoders... demonstrating improved LDM performance on ImageNet-1K (Deng et al., 2009) 256², reducing FID by 19% for DiT-XL, and Kinetics-700 (Carreira et al., 2019) 17×256², reducing FVD by at least 44%.
Dataset Splits | Yes | For image models, we use 50,000 samples without any optimization for class balancing. To evaluate autoencoders, we used PSNR, SSIM, LPIPS, and FID metrics computed on 512 samples from ImageNet and Kinetics-700 for image and video autoencoders, respectively.
Hardware Specification | Yes | Our models were trained in the FSDP (Zhao et al., 2023) framework with the full sharding strategy on a single node of 8 NVIDIA A100 80GB GPUs or 8 NVIDIA H100 80GB GPUs (depending on their availability in our computational cluster).
Software Dependencies | No | The paper mentions frameworks and optimizers like the 'FSDP (Zhao et al., 2023) framework' and the 'AdamW (Loshchilov, 2017) optimizer', but does not provide specific version numbers for programming languages or key software libraries required to replicate the experiments.
Experiment Setup | Yes | All the LDM models are trained for 400k steps with 10k warmup steps of the learning rate from 0 to 0.0003, followed by its gradual decay towards 0.00001. We used a weight decay of 0.01 and the AdamW (Loshchilov, 2017) optimizer with beta coefficients of 0.9 and 0.99. We used gradient clipping with a norm of 16 for all the DiT models. Other hyperparameters for autoencoder training are provided in Table 5.
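The dataset-splits row quotes PSNR among the autoencoder reconstruction metrics. As a minimal illustration only (this is the standard PSNR definition, not the paper's evaluation code), PSNR over a reference/reconstruction pair can be computed as:

```python
import numpy as np

def psnr(ref: np.ndarray, rec: np.ndarray, data_range: float = 1.0) -> float:
    """Peak signal-to-noise ratio: 10 * log10(data_range^2 / MSE)."""
    mse = np.mean((ref.astype(np.float64) - rec.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(data_range ** 2 / mse)

# Toy check: a reconstruction uniformly off by 0.1 on [0, 1] images.
ref = np.zeros((8, 8))
rec = np.full((8, 8), 0.1)
print(round(psnr(ref, rec), 2))  # 20.0
```

Higher is better; the paper pairs it with SSIM, LPIPS, and FID, which capture perceptual rather than pixel-wise error.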
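The experiment-setup row describes a learning-rate schedule: 10k warmup steps from 0 to 3e-4, then gradual decay to 1e-5 over the 400k-step run. A hedged sketch of such a schedule follows; the decay shape is an assumption (linear), since the quoted text only says "gradual":

```python
def lr_at(step: int,
          total_steps: int = 400_000,
          warmup_steps: int = 10_000,
          peak_lr: float = 3e-4,
          final_lr: float = 1e-5) -> float:
    """Linear warmup to peak_lr, then decay toward final_lr.

    The warmup/peak/final values come from the quoted setup; the
    linear decay shape is an assumption for illustration.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr + (final_lr - peak_lr) * progress

print(lr_at(5_000))    # mid-warmup: 0.00015
print(lr_at(10_000))   # peak: 0.0003
print(lr_at(400_000))  # final: ~1e-05
```

The remaining quoted settings (AdamW with betas 0.9/0.99, weight decay 0.01, gradient-clipping norm 16) would be passed to the optimizer alongside this per-step learning rate.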