MaskRIS: Semantic Distortion-aware Data Augmentation for Referring Image Segmentation
Authors: Minhyun Lee, Seungho Lee, Song Park, Dongyoon Han, Byeongho Heo, Hyunjung Shim
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments demonstrate that MaskRIS can easily be applied to various RIS models, outperforming existing methods in both fully supervised and weakly supervised settings. Finally, MaskRIS achieves new state-of-the-art performance on RefCOCO, RefCOCO+, and RefCOCOg datasets. |
| Researcher Affiliation | Collaboration | Minhyun Lee, AI Center, Samsung Electronics; Seungho Lee, AI Center, Samsung Electronics; Song Park; Dongyoon Han, NAVER AI Lab; Byeongho Heo, NAVER AI Lab; Hyunjung Shim, Korea Advanced Institute of Science & Technology (KAIST) |
| Pseudocode | No | The paper describes the MaskRIS framework, input masking strategy, and Distortion-aware Contextual Learning using mathematical formulations and textual descriptions, but does not include any explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/naver-ai/maskris. |
| Open Datasets | Yes | Datasets. We evaluate our method using three popular benchmarks in RIS: RefCOCO, RefCOCO+, and RefCOCOg. RefCOCO (Yu et al., 2016), all built on the MS COCO dataset (Lin et al., 2014). ... To further examine cross-dataset generalization beyond COCO-style images, we additionally evaluate MaskRIS on RefClef, a subset of the ImageCLEF dataset with more diverse natural scenes and object categories (see Appendix A.1 for details). |
| Dataset Splits | Yes | We evaluate our method using three popular benchmarks in RIS: RefCOCO, RefCOCO+, and RefCOCOg. ... On RefCOCO+, our method leads by 1.37%p, 2.76%p, and 1.93%p on the validation, test A, and test B splits, respectively. Even on the challenging RefCOCOg dataset, MaskRIS still outperforms CARIS by 0.89%p and 1.05%p on the validation and test splits. ... For this dataset [RefCOCOg], we report results on the UMD partition (Yu et al., 2016), following the previous studies (Wang et al., 2022; Liu et al., 2023b; Kim et al., 2023b). |
| Hardware Specification | No | The paper mentions that "Some parts of experiments are based on the NAVER Smart Machine Learning (NSML) (Kim et al., 2018) platform." in the Acknowledgments. However, it does not specify any details about the CPU, GPU models, memory, or other specific hardware configurations used for the experiments. |
| Software Dependencies | No | Most of our experimental results are based on CARIS (Liu et al., 2023c). For the image encoder, we used the Swin-Base Transformer (Liu et al., 2021b), pre-trained on ImageNet-22k (Deng et al., 2009), and for the text encoder, we employed BERT-Base (Devlin et al., 2018). The maximum length of the text is set to 20 words. We used the AdamW (Loshchilov & Hutter, 2017) optimizer with a weight decay of 0.01. ... While various software components like Swin-Base Transformer, BERT-Base, and AdamW are mentioned, specific version numbers for these or other critical software dependencies (e.g., Python, PyTorch, CUDA) are not provided. |
| Experiment Setup | Yes | Designed as a plug-and-play training strategy, we strictly follow the original training settings and hyperparameters, such as learning rate, epochs, and batch size, without modification. Notably, we primarily implemented our method on CARIS (Liu et al., 2023c), a leading SoTA method, unless stated otherwise. Images are resized to 448 × 448 for both training and testing. For image masking, we set 32 as the patch size. ... We used the AdamW (Loshchilov & Hutter, 2017) optimizer with a weight decay of 0.01. We applied different learning rates of 1e-5 and 1e-4 to encoders and the others, respectively, with a polynomial learning rate schedule with a power of 0.9. The model was trained for 50 epochs with a batch size of 16, and the input images were resized to 448 × 448. ... While our default setting uses λ = 0.5, we also provide a detailed sensitivity analysis of this ratio in Appendix A.5, showing that MaskRIS is robust across a wide range of choices. |
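The Experiment Setup row specifies 448 × 448 inputs, a 32 × 32 masking patch size, and a default masking ratio λ = 0.5. A minimal NumPy sketch of what such patch-level input masking could look like is shown below; the function name `random_patch_mask` and its exact masking scheme are illustrative assumptions, not the authors' released implementation (see the linked GitHub repository for that).

```python
import numpy as np

def random_patch_mask(image, patch_size=32, mask_ratio=0.5, rng=None):
    """Zero out a random subset of non-overlapping square patches.

    Hypothetical sketch of patch-level input masking with the paper's
    reported settings (448x448 input, 32x32 patches, ratio lambda = 0.5);
    the actual MaskRIS code may differ in detail.
    """
    rng = rng or np.random.default_rng()
    c, h, w = image.shape
    gh, gw = h // patch_size, w // patch_size  # 14 x 14 patch grid for 448/32
    n_mask = int(gh * gw * mask_ratio)
    # Choose patches to mask uniformly at random, without replacement.
    idx = rng.choice(gh * gw, size=n_mask, replace=False)
    keep = np.ones(gh * gw, dtype=image.dtype)
    keep[idx] = 0.0
    # Upsample the patch-level keep mask to pixel resolution and apply it.
    mask = np.kron(keep.reshape(gh, gw),
                   np.ones((patch_size, patch_size), dtype=image.dtype))
    return image * mask[None]

img = np.ones((3, 448, 448), dtype=np.float32)
masked = random_patch_mask(img, rng=np.random.default_rng(0))
print(masked.mean())  # 0.5: exactly half the pixels are zeroed
```

With λ = 0.5 on a 14 × 14 patch grid, exactly 98 of 196 patches are masked, so half of the pixels survive regardless of the random seed.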