Revisiting Change Captioning from Self-supervised Global-Part Alignment

Authors: Feixiao Lv, Rui Wang, Lihua Jing

AAAI 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | "Extensive experiments show our method achieves the state-of-the-art results on four datasets."
Researcher Affiliation | Academia | ¹Institute of Information Engineering, CAS, Beijing, China; ²School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China. EMAIL, EMAIL, EMAIL
Pseudocode | No | The paper includes mathematical formulations and flowcharts (Figures 2 and 3), but does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain any explicit statements or links indicating that the source code for the described methodology is publicly available.
Open Datasets | Yes | "Birds-to-Words dataset (Forbes et al. 2019) consists of 41k sentences that describe fine-grained changes... CLEVR-Change dataset (Park, Darrell, and Rohrbach 2019) is a large-scale synthetic dataset... Spot-the-Diff dataset (Jhamtani and Berg-Kirkpatrick 2018) includes 13,192 aligned image pairs... Image Editing Request dataset (Tan et al. 2019) includes 3,939 aligned image pairs..."
Dataset Splits | Yes | "Birds-to-Words dataset (Forbes et al. 2019) consists of 41k sentences... This leads to 12,890/1,556/1,604 captions for train/val/test splits."
Hardware Specification | Yes | "Both training and inference are implemented with PyTorch (Paszke et al. 2019) on an RTX 3090 GPU."
Software Dependencies | No | The paper mentions "PyTorch (Paszke et al. 2019)", "EVA-ViT-g/14 (Fang et al. 2023)", and "Vicuna-7B (Chiang et al. 2023)". While these are software packages and models, specific version numbers for the libraries and environment (e.g., PyTorch 1.9, Python 3.x, CUDA 11.x) are not provided.
Experiment Setup | Yes | "All hidden sizes are 512. Both training and inference are implemented with PyTorch (Paszke et al. 2019) on an RTX 3090 GPU. We apply EVA-ViT-g/14 (Fang et al. 2023) and Vicuna-7B (Chiang et al. 2023) as the image encoder and LLM, respectively. The above models without the proposed GPTA and SSFEC constitute our baseline. The head and layer numbers are set to 8 and 2 for the Input Representation step, and to 8 and 4 for the Self-supervised Fusion Change Encoding step, on all four datasets. During training, we use the Adam optimizer (Kingma and Ba 2014) to minimize the aforementioned losses, and all parameters except the MCA adapter are frozen."
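The training recipe quoted above (freeze everything except the MCA adapter, then optimize the remaining parameters with Adam) can be sketched in PyTorch. This is a minimal illustration only: the `MCAAdapter` module and the backbone stand-in are hypothetical placeholders, since the paper's actual architecture is not reproduced here.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the paper's MCA adapter; its real internal
# structure is not specified in this report.
class MCAAdapter(nn.Module):
    def __init__(self, hidden_size: int = 512):  # "All hidden sizes are 512"
        super().__init__()
        self.proj = nn.Linear(hidden_size, hidden_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual adapter: pass-through plus a learned correction.
        return x + self.proj(x)

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        # Stand-in for the frozen image encoder / LLM stack.
        self.backbone = nn.Linear(512, 512)
        self.mca_adapter = MCAAdapter(512)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.mca_adapter(self.backbone(x))

model = Model()

# Freeze all parameters except those of the MCA adapter.
for name, param in model.named_parameters():
    param.requires_grad = "mca_adapter" in name

# Adam only sees the trainable (adapter) parameters.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-4)
```

Filtering the parameter list before constructing the optimizer (rather than relying on `requires_grad` alone) keeps optimizer state from being allocated for the frozen backbone.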