Improving Diffusion Models for Scene Text Editing with Dual Encoders

Authors: Jiabao Ji, Guanhua Zhang, Zhaowen Wang, Bairu Hou, Zhifei Zhang, Brian L. Price, Shiyu Chang

TMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our approach on five datasets and demonstrate its superior performance in terms of text correctness, image naturalness, and style controllability. Our code is publicly available at https://github.com/UCSB-NLP-Chang/DiffSTE.
Researcher Affiliation | Collaboration | University of California, Santa Barbara; Adobe Research
Pseudocode | No | The paper describes the methodology, dual-encoder design, and instruction-tuning framework in prose, but it does not contain a formally labeled pseudocode or algorithm block.
Open Source Code | Yes | Our code is publicly available at https://github.com/UCSB-NLP-Chang/DiffSTE.
Open Datasets | Yes | As described in Section 3.2, we collect 1.3M examples by combining the synthetic dataset (Synthetic) and three real-world datasets (ArT Chng et al. (2019), COCOText Gomez et al. (2017), and TextOCR Singh et al. (2021)) for instruction tuning. For the Synthetic dataset, we randomly pick 100 font families from the Google Fonts library and 954 XKCD colors for text rendering.
Dataset Splits | Yes | We randomly select 200 images from each dataset for validation and 1000 images for testing.
Hardware Specification | Yes | In total, the training has 80k steps, which requires approximately two days of training time using eight NVIDIA V100 GPUs.
Software Dependencies | No | The paper mentions software like "diffusers" and specific "stable-diffusion-inpainting" models, but it does not provide explicit version numbers for these components.
Experiment Setup | Yes | The batch size is set to 256. We use the AdamW optimizer with a fixed learning rate of 5e-5 to train the model for 15 epochs.
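As a quick consistency check of the quoted training setup, the reported figures (1.3M examples, batch size 256, 15 epochs) can be cross-checked against the stated 80k total steps. This is a minimal arithmetic sketch using only numbers quoted above; the step count is approximate since the paper does not state whether partial batches are dropped.

```python
# Numbers taken from the quoted paper text above.
DATASET_SIZE = 1_300_000  # 1.3M instruction-tuning examples
BATCH_SIZE = 256
EPOCHS = 15

# Approximate optimizer steps implied by the setup (dropping the partial batch).
steps_per_epoch = DATASET_SIZE // BATCH_SIZE
total_steps = steps_per_epoch * EPOCHS

print(steps_per_epoch)  # ~5k steps per epoch
print(total_steps)      # ~76k, broadly consistent with the reported 80k
```

The implied ~76k steps is within about 5% of the reported 80k, so the quoted batch size, epoch count, and dataset size are mutually consistent.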