Improving the Diffusability of Autoencoders
Authors: Ivan Skorokhodov, Sharath Girish, Benran Hu, Willi Menapace, Yanyu Li, Rameen Abdal, Sergey Tulyakov, Aliaksandr Siarohin
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate our approach on both image and video autoencoders, including Flux AE (Black Forest Labs, 2023), Cosmos Tokenizer (Agarwal et al., 2025), CogVideoX-AE (Hong et al., 2022), and LTX-AE (HaCohen et al., 2024), consistently demonstrating improved LDM performance on ImageNet-1K (Deng et al., 2009) 256², reducing FID by 19% for DiT-XL, and Kinetics-700 (Carreira et al., 2019) 17×256², reducing FVD by at least 44%. |
| Researcher Affiliation | Collaboration | 1Snap Inc. 2Carnegie Mellon University. Correspondence to: Ivan Skorokhodov <EMAIL>, Aliaksandr Siarohin <EMAIL>. |
| Pseudocode | No | The paper describes methods using mathematical equations and structured steps in paragraph form, but does not contain explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present procedures formatted like code. |
| Open Source Code | Yes | The source code is available at https://github.com/snap-research/diffusability. |
| Open Datasets | Yes | We validate our approach on both image and video autoencoders... demonstrating improved LDM performance on ImageNet-1K (Deng et al., 2009) 256², reducing FID by 19% for DiT-XL, and Kinetics-700 (Carreira et al., 2019) 17×256², reducing FVD by at least 44%. |
| Dataset Splits | Yes | For image models, we use 50,000 samples without any optimization for class balancing. To evaluate autoencoders, we used PSNR, SSIM, LPIPS and FID metrics computed on 512 samples from ImageNet and Kinetics-700 for image and video autoencoders, respectively. |
| Hardware Specification | Yes | Our models were trained in the FSDP (Zhao et al., 2023) framework with the full sharding strategy on a single node of 8 NVIDIA A100 80GB GPUs or 8 NVIDIA H100 80GB GPUs (depending on their availability in our computational cluster). |
| Software Dependencies | No | The paper mentions frameworks and optimizers like 'FSDP (Zhao et al., 2023) framework' and 'AdamW (Loshchilov, 2017) optimizer', but does not provide specific version numbers for programming languages or key software libraries required to replicate the experiments. |
| Experiment Setup | Yes | All the LDM models are trained for 400k steps with 10k warmup steps of the learning rate from 0 to 0.0003 and then its gradual decay towards 0.00001. We used weight decay of 0.01 and AdamW (Loshchilov, 2017) optimizer with beta coefficients of 0.9 and 0.99. We used gradient clipping with the norm of 16 for all the DiT models. Other hyperparameters for autoencoders training are provided in Table 5. |
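The learning-rate schedule quoted in the Experiment Setup row (10k linear warmup from 0 to 3e-4, then gradual decay toward 1e-5 over 400k total steps) can be sketched as a small helper. Note the paper says only "gradual decay", so the linear decay shape below is an assumption, and `learning_rate` is a hypothetical helper name, not code from the paper's repository.

```python
def learning_rate(step: int,
                  total_steps: int = 400_000,
                  warmup_steps: int = 10_000,
                  peak_lr: float = 3e-4,
                  final_lr: float = 1e-5) -> float:
    """Sketch of the schedule described in the paper's setup:
    linear warmup from 0 to peak_lr over warmup_steps, then
    decay (assumed linear here) to final_lr by total_steps."""
    if step < warmup_steps:
        # linear warmup: 0 -> peak_lr
        return peak_lr * step / warmup_steps
    # decay: peak_lr -> final_lr over the remaining steps
    frac = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr + frac * (final_lr - peak_lr)
```

Under these assumptions, `learning_rate(0)` is 0, `learning_rate(10_000)` returns the 3e-4 peak, and `learning_rate(400_000)` returns the 1e-5 floor.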