ConText: Driving In-context Learning for Text Removal and Segmentation

Authors: Fei Zhang, Pei Zhang, Baosong Yang, Fei Huang, Yanfeng Wang, Ya Zhang

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive results on several benchmarks demonstrate the general superiority and effectiveness of our method compared to all baseline V-ICL generalists and specialists, yielding new state-of-the-art (SOTA) performance on both text removal (+4.50 PSNR) and segmentation (+3.34% fg IoU). The code is available at https://github.com/Ferenas/ConText.
Researcher Affiliation | Collaboration | (1) Shanghai Jiao Tong University (Yanfeng Wang and Ya Zhang are with the School of Artificial Intelligence), (2) Shanghai Innovation Institute, (3) Tongyi Lab, Alibaba Group.
Pseudocode | No | The paper describes the methodology in Section 4, "Method", using textual descriptions and diagrams (Figure 3), but no explicit pseudocode or algorithm blocks are provided.
Open Source Code | Yes | The code is available at https://github.com/Ferenas/ConText.
Open Datasets | Yes | For text segmentation, we, following the majority of the pipelines (Yu et al., 2023a; Wang et al., 2023c; Yu et al., 2024; Ye et al., 2024), adopt four datasets with high-quality pixel-level labels: HierText (Long et al., 2022), TotalText (Ch'ng & Chan, 2017), ICDAR13 FST (Karatzas et al., 2013), and TextSeg (Xu et al., 2021). For text removal, we follow the prevailing pipelines (Du et al., 2023b; Peng et al., 2024a) and adopt two datasets: SCUT-EnsText (Liu et al., 2020), and SCUT-Syn (Zhang et al., 2019)...
Dataset Splits | Yes | 1. HierText: a fine-grained real-world segmentation benchmark with 8,281 training, 1,724 validation, and 1,634 test samples. We use all the training samples during the training stage and evaluate the model on the validation set. 2. TextSeg: a large-scale finely annotated text segmentation dataset with 4,024 images of scene text and design text; the training, validation, and test sets contain 2,646, 340, and 1,038 samples, respectively. 3. TotalText: a prevailing small-scale text segmentation dataset; the training and validation sets contain 1,255 and 300 samples, respectively. 4. FST: a prevailing small-scale English text segmentation dataset; the training and validation sets contain 229 and 233 samples, respectively. 5. SCUT-EnsText: a real-world scene text removal dataset comprising 2,749 training and 813 test samples. 6. SCUT-Syn: a purely synthetic scene text removal dataset comprising 8,000 training and 800 test samples.
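For reproduction sanity checks, the reported splits can be tabulated in code. The dictionary below is a hypothetical helper (not from the paper); `None` marks a split the paper does not report.

```python
# Dataset splits as reported in the paper: (train, val, test); None = not reported.
SPLITS = {
    "HierText":     (8281, 1724, 1634),
    "TextSeg":      (2646, 340, 1038),
    "TotalText":    (1255, 300, None),
    "ICDAR13-FST":  (229, 233, None),
    "SCUT-EnsText": (2749, None, 813),
    "SCUT-Syn":     (8000, None, 800),
}

# Consistency check: TextSeg splits should sum to the reported 4,024 images.
assert sum(SPLITS["TextSeg"]) == 4024
```

The TextSeg assertion is the one split whose total the paper states explicitly, so it doubles as a check against transcription errors.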
Hardware Specification | Yes | We adopt 16 A100 GPUs (80GB memory) to implement the training procedure, leading to a total batch size of 64.
Software Dependencies | No | We use the AdamW optimizer (Kingma & Ba, 2015) and a cosine learning rate scheduler, accompanied with a base learning rate of 0.0001, and weight decay of 0.1. Our model adopts vision transformers (ViT) (Dosovitskiy et al., 2020) as the backbone.
Experiment Setup | Yes | We use the AdamW optimizer (Kingma & Ba, 2015) and a cosine learning rate scheduler, with a base learning rate of 0.0001 and weight decay of 0.1. The training epoch is set to 150, and the batch size is set to 2 with a two-step gradient accumulation. [...] The weight for the removal reconstruction loss is set to 0.3, and 1 for the pixel-level supervision loss L_pix. The reconstruction loss is smooth-L1 for both the segmentation mask and the removal image. The probability of self-prompting is set to 0.2.
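The reported hyperparameters can be collected into a single config for reproduction. This is a minimal sketch assuming the loss combination is a plain weighted sum of the two reported terms; the config keys and the `total_loss` helper are hypothetical names, not from the paper's code.

```python
# Training hyperparameters as reported in the paper (names are illustrative).
TRAIN_CFG = {
    "optimizer": "AdamW",
    "base_lr": 1e-4,
    "weight_decay": 0.1,
    "lr_schedule": "cosine",
    "epochs": 150,
    "per_device_batch": 2,
    "grad_accum_steps": 2,
    "loss_weights": {"removal_reconstruction": 0.3, "pixel_supervision": 1.0},
    "reconstruction_loss": "smooth_l1",
    "self_prompt_prob": 0.2,
}

def total_loss(l_removal: float, l_pix: float) -> float:
    """Weighted objective as reported: 0.3 * L_removal + 1.0 * L_pix."""
    w = TRAIN_CFG["loss_weights"]
    return w["removal_reconstruction"] * l_removal + w["pixel_supervision"] * l_pix
```

In a PyTorch reproduction these values would map directly onto `torch.optim.AdamW(..., lr=1e-4, weight_decay=0.1)`, `CosineAnnealingLR`, and `SmoothL1Loss` for both reconstruction terms.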