Explore In-Context Segmentation via Latent Diffusion Models

Authors: Chaoyang Wang, Xiangtai Li, Henghui Ding, Lu Qi, Jiangning Zhang, Yunhai Tong, Chen Change Loy, Shuicheng Yan

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experiments validate the effectiveness of our approach, demonstrating comparable or even stronger results than previous specialist or visual foundation models. We conduct extensive ablation studies and compare our method with previous works to demonstrate its effectiveness."
Researcher Affiliation | Collaboration | Chaoyang Wang1, Xiangtai Li2,3*, Henghui Ding4, Lu Qi5, Jiangning Zhang6, Yunhai Tong1, Chen Change Loy3, Shuicheng Yan2,3 — 1School of Intelligence Science and Technology, Peking University, China; 2Skywork AI, Singapore; 3Nanyang Technological University, Singapore; 4Institute of Big Data, Fudan University, China; 5Wuhan University, China; 6Zhejiang University, China
Pseudocode | No | The paper describes its methods with equations and text, but provides no structured pseudocode or algorithm blocks.
Open Source Code | No | A project page (https://wang-chaoyang.github.io/project/refldmseg) is provided, but it is a general project page: it neither links to a code repository nor states unambiguously that the source code for the described methodology is or will be released.
Open Datasets | Yes | "We adopt several popular datasets as part of our benchmark (Tab. 2), including PASCAL (Everingham et al. 2010), COCO (Lin et al. 2014), DAVIS-16 (Perazzi et al. 2016) and VSPW (Miao et al. 2021)."
Dataset Splits | Yes | "Table 2: Details of the combined datasets. We choose two image semantic segmentation (ISS) datasets, one video object segmentation (VOS) dataset, and one video semantic segmentation (VSS) dataset. (N) means the equivalent number of images." The table provides specific numbers for the 'Train' and 'Val' splits of each dataset.
Hardware Specification | No | The paper does not provide specific hardware details, such as exact GPU/CPU models, processor types, or memory amounts, used for running the experiments.
Software Dependencies | No | The paper mentions the "SD 1.5 model as the initialization" and that "Alpha-CLIP ViT-L is adopted as the prompt encoder", which are specific models. However, it does not list ancillary software dependencies with version numbers (e.g., Python, PyTorch, or CUDA versions) needed to replicate the experiments.
Experiment Setup | Yes | "Our model is jointly trained on the combined dataset for 160K iterations with a batch size of 64. We employ an AdamW optimizer... We set the CFG coefficients for query and instructions as 1.5 and 7, respectively."
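
The quoted setup uses two separate classifier-free-guidance (CFG) scales, 1.5 for the query and 7 for the instructions. The excerpt does not give the combination formula, but a common two-condition CFG scheme (in the style popularized by InstructPix2Pix) can be sketched as follows; the `eps` denoiser interface and the use of `None` as the null (dropped) condition are assumptions for illustration, not the paper's actual implementation:

```python
import numpy as np

def dual_cfg_noise(eps, z, query, instr, s_q=1.5, s_i=7.0):
    """Combine three noise predictions with two CFG scales.

    eps(latent, query_cond, instr_cond) -> noise estimate is a
    hypothetical denoiser; None stands for a dropped condition.
    s_q and s_i mirror the paper's reported coefficients (1.5, 7).
    """
    e_uncond = eps(z, None, None)   # both conditions dropped
    e_query = eps(z, query, None)   # query condition only
    e_full = eps(z, query, instr)   # query + instruction conditions
    # Start from the unconditional estimate and add two guidance
    # directions, each scaled by its own coefficient.
    return (e_uncond
            + s_q * (e_query - e_uncond)
            + s_i * (e_full - e_query))
```

At the paper's reported scales, the instruction direction (7) dominates the query direction (1.5), so sampling is steered primarily by the instruction signal.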