Find and Perceive: Tell Visual Change with Fine-Grained Comparison
Authors: Feixiao Lv, Rui Wang, Lihua Jing, Lijun Liu
IJCAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we conduct extensive experiments on four change captioning datasets, and experimental results show that our proposed method F&P outperforms existing change caption methods and achieves new state-of-the-art performance. |
| Researcher Affiliation | Academia | 1Institute of Information Engineering, CAS, Beijing, China 2School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China |
| Pseudocode | No | The paper describes its methodology through text and a block diagram (Figure 2), but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code, a link to a code repository, or mention of code in supplementary materials. |
| Open Datasets | Yes | We perform our main evaluation on two commonly used datasets, Birds-to-Words dataset [Forbes et al., 2019] and CLEVR-Change [Park et al., 2019] to verify the effectiveness of our method. In addition, we also compare our method with other methods on two additional datasets, Spot-the-Diff [Jhamtani and Berg-Kirkpatrick, 2018] and Image Editing-Request [Tan et al., 2019] to verify the generality of our method. |
| Dataset Splits | No | The paper mentions that 'Early-stop is applied on the main metric to avoid overfitting' which implies a validation set, but it does not specify explicit percentages or counts for training, validation, and test splits for any of the datasets used. |
| Hardware Specification | No | The paper describes model components like ResNet101, Transformer blocks, attention heads, and layer numbers, but does not specify any particular hardware (e.g., GPU models, CPU types) used for running the experiments. |
| Software Dependencies | No | The paper mentions tools and models like 'word2vec [Mikolov et al., 2013]', 'ResNet101 [He et al., 2016]', 'Transformer blocks [Vaswani et al., 2017]', and 'CLIP features [Guo et al., 2022]', but does not provide specific version numbers for these or other software libraries/frameworks used for implementation. |
| Experiment Setup | Yes | For Transformer blocks, the attention head is set to 8, and layer number is set to 3 for multi-layer Transformer, 2 for fine-grained feature learning, 2 for different enhancement. To ensure stable and progressively refined pseudo label selection, we apply fixed thresholds to attention weights in each iteration (0.04 in the first and 0.06 in the second). These threshold values are determined based on experimental performance. ... In the fine-tuning stage, the learning rate is set as 3e-5. Early-stop is applied on the main metric to avoid overfitting. |
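The iterative pseudo-label selection described in the setup above can be sketched as follows. Only the two thresholds (0.04 and 0.06) come from the paper; the function name, input format, and example data are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of fixed-threshold pseudo-label selection over two
# iterations. The thresholds are taken from the reported setup; all
# other names and shapes are assumptions for illustration.

THRESHOLDS = [0.04, 0.06]  # fixed attention-weight threshold per iteration

def select_pseudo_labels(attention_weights, iteration):
    """Keep positions whose attention weight exceeds the iteration's
    fixed threshold; the second iteration applies a stricter cutoff."""
    threshold = THRESHOLDS[iteration]
    return [i for i, w in enumerate(attention_weights) if w > threshold]

# Hypothetical attention weights from one layer.
weights = [0.01, 0.05, 0.07, 0.03]
first_pass = select_pseudo_labels(weights, iteration=0)   # w > 0.04 -> [1, 2]
second_pass = select_pseudo_labels(weights, iteration=1)  # w > 0.06 -> [2]
```

Raising the threshold between iterations, as the paper describes, progressively restricts selection to the most confidently attended regions.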