Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Region-aware Difference Distilling with Attribute-guided Contrastive Regularization for Change Captioning
Authors: Rong Li, Liang Li, Jiehua Zhang, Qiang Zhao, Hongkui Wang, Chenggang Yan
AAAI 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Promising results on three datasets demonstrate that our method outperforms the state-of-the-art change captioning methods. Extensive experiments demonstrate the superiority of our method, achieving state-of-the-art results on three public datasets. Also, the paper includes sections like 'Experiments', 'Performance Comparison', 'Ablation Study', and 'Qualitative Analysis'. |
| Researcher Affiliation | Academia | 1Hangzhou Dianzi University, Hangzhou, China; 2Institute of Computing Technology, Chinese Academy of Sciences; 3School of Software Engineering, Xi'an Jiaotong University; 4Lishui Institute of Hangzhou Dianzi University. All listed institutions are universities or public research institutes. |
| Pseudocode | No | The paper describes the methodology using text, equations, and a diagram (Figure 2), but it does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any explicit statement about the release of source code for the described methodology, nor does it provide a link to a code repository. |
| Open Datasets | Yes | CLEVR-Change (Park, Darrell, and Rohrbach 2019) is a synthetic dataset with 79,606 image pairs and 493,735 captions... LEVIR-CC (Liu et al. 2022a) is a remote sensing dataset containing 10,077 image pairs and 50,385 captions... Spot-the-Diff (Jhamtani and Berg-Kirkpatrick 2018) is a surveillance dataset with 13,192 image pairs... |
| Dataset Splits | Yes | CLEVR-Change [...] is split into 67,660 training, 3,976 validation, and 7,970 testing images. LEVIR-CC [...] with 6,815 images for training, 1,333 for validation, and 1,929 for testing. Spot-the-Diff [...] divided into training, validation, and testing sets in an 8:1:1 ratio. |
| Hardware Specification | No | The paper mentions using a pre-trained ResNet-101 as the visual backbone and the Adam optimizer, but it does not provide any specific details about the hardware used for running experiments. |
| Software Dependencies | No | The paper mentions using 'spaCy (Honnibal and Montani 2023)' for extracting noun and verb segments, but does not provide a specific version number for spaCy or any other software dependencies. |
| Experiment Setup | Yes | We use pre-trained ResNet-101 (He et al. 2016) as the visual backbone to extract image features into a dimension of 14 × 14 × 1024, and then these features are projected into a lower dimension of 512. The hidden size of the model and word embedding size are set to 512 and 300. We use Adam (Kingma and Ba 2014) optimizer to minimize the loss of Equation (16). |
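
The split sizes quoted in the Dataset Splits row can be cross-checked against the image-pair totals in the Open Datasets row. The sketch below is only a sanity check on the figures quoted in the table. The dictionary layout and the `split_report` helper are illustrative names, not anything from the paper. Spot-the-Diff is left out because the table gives only an 8:1:1 ratio, not exact counts.

```python
# Figures are copied from the table above; the structure and names here
# are hypothetical and exist only for this check.
DATASETS = {
    "CLEVR-Change": {"total_pairs": 79_606, "splits": (67_660, 3_976, 7_970)},
    "LEVIR-CC": {"total_pairs": 10_077, "splits": (6_815, 1_333, 1_929)},
}


def split_report(datasets):
    """Return, per dataset, whether train+val+test equals the stated total
    and the fraction of the data that falls in each split."""
    report = {}
    for name, info in datasets.items():
        train, val, test = info["splits"]
        total = train + val + test
        report[name] = {
            "sum_matches_total": total == info["total_pairs"],
            "fractions": tuple(round(n / total, 3) for n in (train, val, test)),
        }
    return report


if __name__ == "__main__":
    for name, row in split_report(DATASETS).items():
        print(name, row)
```

For both datasets the three split sizes add up to the stated number of image pairs, so the quoted figures are internally consistent.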