End-to-end Training for Text-to-Image Synthesis using Dual-Text Embeddings
Authors: Yeruru Asrar Ahmed, Anurag Mittal
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | A comprehensive set of experiments on three text-to-image benchmark datasets (Oxford-102, Caltech-UCSD Birds, and MS-COCO) reveals that learning two separate embeddings gives better results than using a shared one, and that this approach performs favourably in comparison with methods that use text representations from a pre-trained text encoder trained using a discriminative approach, as verified in Section 4.3.4. |
| Researcher Affiliation | Academia | Yeruru Asrar Ahmed EMAIL Department of Computer Science and Engineering Indian Institute of Technology Madras Anurag Mittal EMAIL Department of Computer Science and Engineering Indian Institute of Technology Madras |
| Pseudocode | No | The paper describes the model architecture and methodology in detail across sections 3 and 3.1-3.4, including mathematical formulations of losses and components, but does not present a distinct, structured pseudocode block or algorithm. |
| Open Source Code | No | The paper does not contain any explicit statement about releasing source code, nor does it provide a link to a code repository or mention code in supplementary materials for the methodology described. |
| Open Datasets | Yes | DTE-GAN is evaluated on three datasets, namely, 1) Caltech-UCSD birds (CUB) (Welinder et al., 2010), 2) Oxford-102 flowers (Nilsback & Zisserman, 2008), and 3) MS COCO (Lin et al., 2014b) datasets. |
| Dataset Splits | Yes | For the CUB and Oxford-102 datasets, we follow a setup similar to StackGAN (Zhang et al., 2017b). Ten captions are provided for each image in both datasets (Reed et al., 2016a). The MS-COCO dataset consists of around 80k training and 40k validation images; for every image, there are 5 captions provided with the dataset. |
| Hardware Specification | Yes | Due to computational constraints, our model is trained on a single NVIDIA 1080Ti GPU with 12 GB of memory... The model is trained for 600 epochs on CUB and Oxford-102 datasets (takes 4 days in 2 NVIDIA 1080Ti GPUs) and 120 epochs for COCO dataset (takes 7 days in 2 NVIDIA 1080Ti GPUs). |
| Software Dependencies | No | Implementation of the models is done using the PyTorch framework (Paszke et al., 2019) and optimising the network using the Adam optimiser (Kingma & Ba, 2015) with the following hyperparameters: β1 = 0.5, β2 = 0.999, batch size = 24, learning rate = 0.0002, λ1 = 1, λ2 = 1 and λ3 = 1. While PyTorch and the Adam optimiser are mentioned with citations, specific version numbers for these software components are not provided (e.g., 'PyTorch 1.9'). |
| Experiment Setup | Yes | Implementation of the models is done using the PyTorch framework (Paszke et al., 2019) and optimising the network using the Adam optimiser (Kingma & Ba, 2015) with the following hyperparameters: β1 = 0.5, β2 = 0.999, batch size = 24, learning rate = 0.0002, λ1 = 1, λ2 = 1 and λ3 = 1. Spectral Normalisation (Miyato et al., 2018) is used for all convolutions and fully connected layers in the generator and discriminator. The model is trained for 600 epochs on the CUB and Oxford-102 datasets... and 120 epochs for the COCO dataset. |