Effective Diffusion Transformer Architecture for Image Super-Resolution

Authors: Kun Cheng, Lei Yu, Zhijun Tu, Xiao He, Liyu Chen, Yong Guo, Mingrui Zhu, Nannan Wang, Xinbo Gao, Jie Hu

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate that DiT-SR outperforms the existing training-from-scratch SR methods significantly, and even beats some of the prior-based methods built on pretrained Stable Diffusion, proving the superiority of the diffusion transformer in image super-resolution. [...] We evaluate the proposed model on the ×4 real-world SR task. The training data comprises LSDIR (Li et al. 2023), DIV2K (Agustsson and Timofte 2017), DIV8K (Gu et al. 2019), Outdoor Scene Training (Wang et al. 2018), Flickr2K (Timofte et al. 2017) and the first 10K face images from FFHQ (Karras, Laine, and Aila 2019) datasets. [...] We adopt reference-based metrics, including PSNR and LPIPS (Zhang et al. 2018), to evaluate the performance of different models. Additionally, non-reference metrics such as CLIPIQA (Wang, Chan, and Loy 2023), MUSIQ (Ke et al. 2021), and MANIQA (Yang et al. 2022), which are more consistent with human perception in generative SR, are also employed.
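The reference-based metric PSNR quoted above has a standard closed-form definition that can be sketched in a few lines. This is a generic illustration, not the authors' evaluation code; the function name and `max_val` default are our assumptions.

```python
import numpy as np

def psnr(ref: np.ndarray, img: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio between a reference and a restored image.

    PSNR = 10 * log10(max_val^2 / MSE); higher is better, inf for identical inputs.
    """
    mse = np.mean((ref.astype(np.float64) - img.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

# Example: a uniform error of 1 gray level against an 8-bit reference
# gives MSE = 1 and thus PSNR = 10 * log10(255^2) ≈ 48.13 dB.
print(psnr(np.zeros((4, 4)), np.ones((4, 4))))
```

LPIPS, CLIPIQA, MUSIQ, and MANIQA are learned metrics and require their respective pretrained networks, so they cannot be reduced to a formula like this.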
Researcher Affiliation | Collaboration | 1. State Key Laboratory of Integrated Services Networks, Xidian University; 2. Huawei Noah's Ark Lab; 3. Consumer Business Group, Huawei; 4. Chongqing Key Laboratory of Image Cognition, Chongqing University of Posts and Telecommunications
Pseudocode | No | The paper describes the methodology using text and diagrams but does not contain a clearly labeled pseudocode block or algorithm.
Open Source Code | Yes | Code: https://github.com/kunncheng/DiT-SR
Open Datasets | Yes | The training data comprises LSDIR (Li et al. 2023), DIV2K (Agustsson and Timofte 2017), DIV8K (Gu et al. 2019), Outdoor Scene Training (Wang et al. 2018), Flickr2K (Timofte et al. 2017) and the first 10K face images from FFHQ (Karras, Laine, and Aila 2019) datasets. We partition LSDIR into a training set with 82991 images and a test set with 2000 images. [...] Furthermore, we utilize two real-world datasets: RealSR (Cai et al. 2019), which comprises 100 real images captured by Canon 5D3 and Nikon D810 cameras, and RealSet65 (Yue, Wang, and Loy 2024), including 65 low-resolution images collected from widely used datasets and the internet.
Dataset Splits | Yes | We partition LSDIR into a training set with 82991 images and a test set with 2000 images. Following LDM (Rombach et al. 2022), HR images in our training set are randomly cropped to 256×256 and the degradation pipeline of Real-ESRGAN (Wang et al. 2021) is used to synthesize LR/HR pairs. The test set images are center-cropped to 512×512 and subjected to the same degradation pipeline used in the training stage to create a synthetic dataset, named LSDIR-Test.
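The center-crop step in the test-set preparation quoted above can be sketched as follows. The helper name and H×W×C array layout are illustrative assumptions, and the Real-ESRGAN degradation applied afterwards is omitted here.

```python
import numpy as np

def center_crop(img: np.ndarray, size: int = 512) -> np.ndarray:
    """Center-crop an H x W (x C) image array to size x size.

    Assumes both spatial dimensions are at least `size`.
    """
    h, w = img.shape[:2]
    top = (h - size) // 2
    left = (w - size) // 2
    return img[top:top + size, left:left + size]

# A 600x700 RGB image is cropped to the central 512x512 window.
crop = center_crop(np.zeros((600, 700, 3), dtype=np.uint8))
print(crop.shape)  # (512, 512, 3)
```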
Hardware Specification | Yes | We train the proposed model for 300K iterations with a batch size of 64 using 8 NVIDIA Tesla V100 GPUs.
Software Dependencies | No | The paper mentions "The optimizer is Adam (Kingma and Ba 2014)" but does not provide specific version numbers for software libraries, frameworks, or programming languages used.
Experiment Setup | Yes | We train the proposed model for 300K iterations with a batch size of 64 using 8 NVIDIA Tesla V100 GPUs. The optimizer is Adam (Kingma and Ba 2014), and the learning rate is 5e-5. The FFT window size p is empirically set to 8 (Wallace 1991; Kong et al. 2023).
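The reported hyperparameters can be collected into a single configuration sketch for reproduction. The dictionary keys are our own illustrative names; the values are exactly those quoted from the paper.

```python
# Hedged sketch of the reported training setup (not the authors' config file).
train_config = {
    "optimizer": "Adam",          # Kingma and Ba 2014
    "learning_rate": 5e-5,
    "batch_size": 64,
    "total_iterations": 300_000,
    "num_gpus": 8,                # NVIDIA Tesla V100
    "fft_window_size": 8,         # window size p, set empirically
}
```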