Instruct2See: Learning to Remove Any Obstructions Across Distributions
Authors: Junhang Li, Yu Guo, Chuhua Xian, Shengfeng He
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on both in-distribution and out-of-distribution obstacles show that Instruct2See consistently achieves strong performance and generalization in obstruction removal, regardless of whether the obstacles were present during the training phase. Code and dataset are available at https://jhscut.github.io/Instruct2See. |
| Researcher Affiliation | Academia | (1) School of Computer Science and Engineering, South China University of Technology; (2) School of Computing and Information Systems, Singapore Management University; (3) School of Navigation, Wuhan University of Technology. Correspondence to: Chuhua Xian and Shengfeng He <EMAIL, EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Instruct2See Model Inference |
| Open Source Code | Yes | Code and dataset are available at https://jhscut.github.io/Instruct2See. |
| Open Datasets | Yes | Code and dataset are available at https://jhscut.github.io/Instruct2See. Datasets. We utilize 3,984 images for model training. For the fence obstacle, we select 897 clear images from the BSD dataset (Martin et al., 2001) and the UCID dataset (Schaefer & Stich, 2003), and generate paired data using the fence synthesis method from (Du et al., 2018). Additionally, 987 clear images from the Flickr24K dataset (Zhang et al., 2018) and 987 flare images from the Flare7K dataset (Dai et al., 2022) are used to create flare image pairs. We also include 2,100 training image pairs from the VRDS dataset (Wu et al., 2023). For unseen obstructions, we sourced 100 test images each from the rain streak dataset (Yang et al., 2017), snowy dataset (Liu et al., 2018), and stroke dataset (Lugmayr et al., 2022). |
| Dataset Splits | Yes | Datasets. We utilize 3,984 images for model training. For the fence obstacle, we select 897 clear images from the BSD dataset (Martin et al., 2001) and the UCID dataset (Schaefer & Stich, 2003), and generate paired data using the fence synthesis method from (Du et al., 2018). Additionally, 987 clear images from the Flickr24K dataset (Zhang et al., 2018) and 987 flare images from the Flare7K dataset (Dai et al., 2022) are used to create flare image pairs. We also include 2,100 training image pairs from the VRDS dataset (Wu et al., 2023). For testing, we apply the same synthesis strategy to create a fence test dataset with 100 image pairs. Moreover, a flare test dataset with another 100 image pairs is used. Additionally, 500 raindrop test image pairs are included. For unseen obstructions, we sourced 100 test images each from the rain streak dataset (Yang et al., 2017), snowy dataset (Liu et al., 2018), and stroke dataset (Lugmayr et al., 2022). |
| Hardware Specification | Yes | Our Instruct2See framework is implemented in PyTorch 1.12.0 and trained on a system equipped with 2 AMD EPYC 7543 32-Core CPUs and 8 NVIDIA L40 GPUs. |
| Software Dependencies | Yes | Our Instruct2See framework is implemented in PyTorch 1.12.0 and trained on a system equipped with 2 AMD EPYC 7543 32-Core CPUs and 8 NVIDIA L40 GPUs. We utilize the CLIP ViT-B/32 model. For obstructions like rain streaks and snow, which are more challenging to segment, we employ a U-Net-based model (Ronneberger et al., 2015) to generate the initial mask. For other obstructions, we use the Segment Anything Model 2 (SAM2) (Ravi et al., 2024). |
| Experiment Setup | Yes | Our Instruct2See framework is implemented in PyTorch 1.12.0 and trained on a system equipped with 2 AMD EPYC 7543 32-Core CPUs and 8 NVIDIA L40 GPUs. We train the model using the AdamW optimizer (β1 = 0.9, β2 = 0.999, weight decay of 1 × 10−4) and L1 loss, over 300K iterations. The initial learning rate is set to 3 × 10−4. A progressive learning strategy is employed, starting with a patch size of 128 × 128 and a batch size of 1. The patch size is progressively updated to 128 × 128, 160 × 160, 192 × 192, and 256 × 256 at iterations 115,000, 80,000, 60,000, and 45,000, respectively. We also apply horizontal and vertical flips for data augmentation. |
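The progressive-learning quote above is ambiguous about whether the iteration counts are milestones or per-stage durations; since 115,000 + 80,000 + 60,000 + 45,000 equals the stated 300K total, one plausible reading is that each count is the number of iterations spent at that patch size. The sketch below encodes that assumed interpretation (the stage table and `patch_size_at` helper are hypothetical, not from the paper):

```python
# Hypothetical sketch of the progressive patch-size schedule quoted in the
# Experiment Setup row. ASSUMPTION: the four iteration counts are per-stage
# durations (they sum to the stated 300K total), not absolute milestones.

STAGES = [  # (duration in iterations, square patch side length)
    (115_000, 128),
    (80_000, 160),
    (60_000, 192),
    (45_000, 256),
]


def patch_size_at(iteration: int) -> int:
    """Return the training patch size assumed to be in effect at `iteration`."""
    boundary = 0
    for duration, size in STAGES:
        boundary += duration
        if iteration < boundary:
            return size
    return STAGES[-1][1]  # beyond 300K iterations, keep the final patch size
```

Under this reading, training runs at 128 × 128 for the first 115K iterations, then steps up to 160, 192, and finally 256 for the last 45K iterations; the paper's own code release should be consulted to confirm the exact schedule.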