Revisiting Tampered Scene Text Detection in the Era of Generative AI

Authors: Chenfan Qu, Yiwu Zhong, Fengjun Guo, Lianwen Jin

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive experiments on both the proposed OSTF benchmark and the widely-used Tampered-IC13 (Wang et al. 2022) benchmark. Our method demonstrates strong generalization ability in these experiments. For example, the proposed method leads to a gain of 27.88 mean F-score on open-set generalization in the OSTF benchmark. Moreover, the zero-shot version of our method even outperforms the full-shot version of the previous SOTA method UPOCR (Peng et al. 2023b) by 10.46 mean IoU on the Tampered-IC13 benchmark.
Researcher Affiliation | Collaboration | Chenfan Qu (1), Yiwu Zhong (2), Fengjun Guo (3,4), Lianwen Jin (1,4)*; (1) South China University of Technology; (2) The Chinese University of Hong Kong; (3) Intsig Information Co., Ltd; (4) INTSIG-SCUT Joint Lab on Document Analysis and Recognition. EMAIL, EMAIL
Pseudocode | No | The paper describes methods through figures and textual explanations (e.g., Figure 3 for the Texture Jitter pipeline, Figure 5 for the DAF framework) but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code: https://github.com/qcf-568/OSTF
Open Datasets | Yes | Datasets: https://github.com/qcf-568/OSTF. We have curated a comprehensive, high-quality dataset, featuring texts tampered by eight text editing models, to thoroughly assess open-set generalization capabilities. We manually construct a comprehensive, high-quality new benchmark for tampered scene text detection, termed OSTF, which includes text tampered by various latest text editing methods and cross source-dataset evaluation settings.
Dataset Splits | Yes | Evaluation Settings. As shown in Table 2, there are 9 sessions in our dataset (ICDAR2013 tampered by 7 methods, TextOCR tampered by UDiffText, ICDAR2017 and ReCTS tampered by TextDiffuser). To evaluate both closed-set performance and open-set generalization, the models are trained on one session of the training set and tested on all nine sessions of the testing set. As a result, there are 9 × 9 = 81 test settings, enabling three evaluation protocols: cross tampering methods, cross source datasets, and cross both tampering methods and source datasets. Table 2 provides detailed statistics with 'train' and 'test' splits for images and text instances for each session.
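The train-on-one, test-on-all protocol above can be sketched as a simple Cartesian product over sessions. This is a minimal illustration, not the benchmark's code: the session identifiers are placeholders, and the real benchmark further splits the off-diagonal pairs by whether the two sessions share a source dataset and/or a tampering method.

```python
from itertools import product

# Nine placeholder session identifiers standing in for the OSTF sessions.
sessions = [f"session_{i}" for i in range(1, 10)]

# Every (train_session, test_session) pair: 9 x 9 = 81 test settings.
settings = list(product(sessions, sessions))
assert len(settings) == 81

def protocol(train_session, test_session):
    """Coarsely classify a (train, test) pair: the diagonal measures
    closed-set performance, everything else open-set generalization."""
    if train_session == test_session:
        return "closed-set"
    return "open-set"
```

Of the 81 settings, 9 lie on the diagonal (closed-set) and 72 probe open-set generalization across tampering methods, source datasets, or both.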
Hardware Specification | No | The paper mentions training deep models but does not provide specific hardware details such as GPU/CPU models, memory specifications, or types of computing resources used for the experiments.
Software Dependencies | No | The paper mentions using the AdamW optimizer, Swin-Transformer as the backbone, and built-in functions of mmsegmentation, but it does not provide specific version numbers for any of these software dependencies.
Experiment Setup | Yes | We first pretrain our model for 12 epochs with the proposed Texture Jitter... The AdamW optimizer (Loshchilov and Hutter 2017) with a learning rate initialized at 6e-5 and decaying to 1e-6 is used in the experiments. We then fine-tune the model using also the training sets... for 15k iterations with a batch size of 8. We adopt Swin-Transformer (Small) (Liu et al. 2021) as the backbone... The input image is resized so that the shortest edge is 1024 and the longest edge does not exceed 1536.
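The resize rule quoted above can be sketched as an aspect-ratio-preserving rescale with a cap on the long side. This is a minimal sketch under assumed semantics (short edge targeted at 1024 unless that would push the long edge past 1536); the helper name and rounding are illustrative, and the paper's pipeline presumably relies on mmsegmentation's built-in resize transform.

```python
def target_size(height, width, short_edge=1024, long_edge=1536):
    """Scale (height, width) uniformly so the short side reaches
    `short_edge`, shrinking the scale if the long side would
    otherwise exceed `long_edge`."""
    scale = min(short_edge / min(height, width),
                long_edge / max(height, width))
    return round(height * scale), round(width * scale)
```

For example, a 720 × 1280 image gets scale min(1024/720 ≈ 1.422, 1536/1280 = 1.2) = 1.2, yielding 864 × 1536, so the long-edge cap binds; a square 1000 × 1000 image is scaled by 1.024 to 1024 × 1024.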