Revisiting Change Captioning from Self-supervised Global-Part Alignment
Authors: Feixiao Lv, Rui Wang, Lihua Jing
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments show our method achieves the state-of-the-art results on four datasets. |
| Researcher Affiliation | Academia | ¹Institute of Information Engineering, CAS, Beijing, China; ²School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China. EMAIL, EMAIL, EMAIL |
| Pseudocode | No | The paper includes mathematical formulations and flowcharts (Figure 2 and 3), but does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any explicit statements or links indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | Datasets Birds-to-Words dataset (Forbes et al. 2019) consists of 41k sentences that describe fine-grained changes... CLEVR-Change dataset (Park, Darrell, and Rohrbach 2019) is a large-scale synthetic dataset... Spot-the-Diff dataset (Jhamtani and Berg-Kirkpatrick 2018) includes 13,192 aligned image pairs... Image Editing Request dataset (Tan et al. 2019) includes 3,939 aligned image pairs... |
| Dataset Splits | Yes | Birds-to-Words dataset (Forbes et al. 2019) consists of 41k sentences... This leads to 12,890/1,556/1,604 captions for train/val/test splits. |
| Hardware Specification | Yes | Both training and inference are implemented with PyTorch (Paszke et al. 2019) on RTX 3090 GPU. |
| Software Dependencies | No | The paper mentions "PyTorch (Paszke et al. 2019)", "EVA-ViT-g/14 (Fang et al. 2023)", and "Vicuna-7B (Chiang et al. 2023)". While these are software/models, specific version numbers (e.g., PyTorch 1.9, Python 3.x, CUDA 11.x) for the libraries and environment are not provided. |
| Experiment Setup | Yes | All hidden size is 512. Both training and inference are implemented with PyTorch (Paszke et al. 2019) on RTX 3090 GPU. We apply EVA-ViT-g/14 (Fang et al. 2023) and Vicuna-7B (Chiang et al. 2023) as image encoder and LLM, respectively. The above models without the proposed GPTA and SSFEC constitute our baseline. The head and layer numbers are set to 8 and 2 for the Input Representation step, and to 8 and 4 for the Self-supervised Fusion Change Encoding step on the four datasets, respectively. During training, we use the Adam optimizer (Kingma and Ba 2014) to minimize the aforementioned losses, and all parameters except the MCA adapter are frozen. |
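
The quoted setup (Adam optimizer, everything frozen except an adapter, hidden size 512) can be sketched in PyTorch. This is a minimal illustration only: the module names (`backbone`, `mca_adapter`), the toy architecture, and the learning rate are assumptions, not the authors' actual implementation.

```python
# Hedged sketch of the quoted training setup: freeze all parameters
# except an adapter module, then give only the adapter to Adam.
# "mca_adapter" and the toy layers are hypothetical stand-ins.
import torch
from torch import nn

class TinyModel(nn.Module):
    def __init__(self, hidden=512):  # the paper reports a hidden size of 512
        super().__init__()
        self.backbone = nn.Linear(hidden, hidden)     # stand-in for the frozen encoder/LLM
        self.mca_adapter = nn.Linear(hidden, hidden)  # the only trainable part

model = TinyModel()
for name, param in model.named_parameters():
    # Freeze everything whose name is not under the adapter module.
    param.requires_grad = "mca_adapter" in name

trainable = [p for p in model.parameters() if p.requires_grad]
# Adam per the paper (Kingma and Ba 2014); the learning rate is a guess.
optimizer = torch.optim.Adam(trainable, lr=1e-4)
```

Passing only the trainable parameters to the optimizer mirrors the "all parameters except MCA adapter are frozen" statement; frozen parameters then accumulate no gradients and receive no updates.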