Improving Diffusion Models for Scene Text Editing with Dual Encoders
Authors: Jiabao Ji, Guanhua Zhang, Zhaowen Wang, Bairu Hou, Zhifei Zhang, Brian L. Price, Shiyu Chang
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our approach on five datasets and demonstrate its superior performance in terms of text correctness, image naturalness, and style controllability. Our code is publicly available at https://github.com/UCSB-NLP-Chang/DiffSTE. |
| Researcher Affiliation | Collaboration | 1University of California, Santa Barbara 2Adobe Research |
| Pseudocode | No | The paper describes the methodology, dual-encoder design, and instruction tuning framework in prose, but it does not contain a formally labeled pseudocode or algorithm block. |
| Open Source Code | Yes | Our code is publicly available at https://github.com/UCSB-NLP-Chang/DiffSTE. |
| Open Datasets | Yes | As described in Section 3.2, we collect 1.3M examples by combining the synthetic dataset (Synthetic) and three real-world datasets (ArT Chng et al. (2019), COCOText Gomez et al. (2017), and TextOCR Singh et al. (2021)) for instruction tuning. For the Synthetic dataset, we randomly pick 100 font families from the Google Fonts library1 and 954 XKCD colors2 for text rendering. |
| Dataset Splits | Yes | We randomly select 200 images from each dataset for validation and 1000 images for testing. |
| Hardware Specification | Yes | In total, the training has 80k steps, which requires approximately two days of training time using eight Nvidia-V100 gpus. |
| Software Dependencies | No | The paper mentions software like "diffusers" and specific "stable-diffusion-inpainting" models, but it does not provide explicit version numbers for these software components. |
| Experiment Setup | Yes | The batch size is set to 256. We use the AdamW optimizer with a fixed learning rate 5e-5 to train the model for 15 epochs. |
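The training hyperparameters reported across the rows above can be collected into a single configuration sketch. This is a minimal illustration assembled from the paper's stated values; the dictionary keys and the `summarize` helper are hypothetical and not taken from the authors' repository.

```python
# Hedged sketch of the reported training setup for DiffSTE.
# Values are quoted from the paper; key names are illustrative only.
TRAIN_CONFIG = {
    "batch_size": 256,
    "optimizer": "AdamW",
    "learning_rate": 5e-5,      # fixed rate, no schedule reported
    "epochs": 15,
    "total_steps": 80_000,      # reported total training steps
    "hardware": "8x NVIDIA V100",
    "training_time": "~2 days",
}

def summarize(cfg: dict) -> str:
    """Render a configuration dictionary as a one-line summary string."""
    return ", ".join(f"{key}={value}" for key, value in cfg.items())
```

A quick call such as `summarize(TRAIN_CONFIG)` yields a compact record of the setup, which is handy when logging reproduction attempts against the reported numbers.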