Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

ViPCap: Retrieval Text-Based Visual Prompts for Lightweight Image Captioning

Authors: Taewhan Kim, Soeun Lee, Si-Woo Kim, Dong-Jin Kim

AAAI 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Experimental results demonstrate that ViPCap significantly outperforms prior lightweight captioning models in efficiency and effectiveness, demonstrating its potential as a plug-and-play solution. In this section, we conduct experiments to show the effectiveness of ViPCap over existing methods. First, we evaluate the advantages of our proposed model over the baseline in Table 1 for the in-domain and out-of-domain settings; we then apply the ViP module to the text-only training model to identify its role in Table 2. In Table 3, we evaluate the performance of the ViP module when sampling from different probability distributions. Moreover, we test the plug-and-play solution with different models in Table 4 and Table 5, and with different prompt styles in Table 6. We conduct ablation studies to evaluate the effects of the various components of our work, including the design of the ViP module, the additional vector, and the feature fusion network strategy.
Researcher Affiliation Academia Taewhan Kim, Soeun Lee, Si-Woo Kim, Dong-Jin Kim*, Hanyang University, South Korea, EMAIL
Pseudocode No The paper describes the proposed method in prose and through diagrams (Figure 3 and Figure 4) but does not include any structured pseudocode or algorithm blocks.
Open Source Code No The paper does not contain an unambiguous statement or a direct link to a source-code repository indicating that the authors are releasing the code for the methodology described in this paper. Phrases like "We release our code..." or specific repository URLs are absent.
Open Datasets Yes Our approach achieves superior performance on the COCO dataset (Lin et al. 2015) compared to our baseline model, SmallCap, and it significantly improves performance over previous lightweight models on the NoCaps dataset (Agrawal et al. 2019). We conduct experiments on image captioning benchmarks, i.e., the COCO dataset (Lin et al. 2015), NoCaps (Agrawal et al. 2019), and Flickr30k (Plummer et al. 2016).
Dataset Splits Yes For COCO and Flickr30k, we follow the Karpathy split (Karpathy and Fei-Fei 2015) used in image captioning. We evaluate our model on the COCO and Flickr30k test sets and the NoCaps validation and test sets, as well as in the cross-domain experiments.
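The Karpathy split referenced above is conventionally distributed as a JSON file whose image entries each carry a "split" field ("train", "restval", "val", or "test"), with "restval" merged into training. A minimal sketch of that bucketing, using hypothetical inline data in place of the real dataset_coco.json:

```python
# Hypothetical in-memory stand-in for Karpathy's dataset_coco.json.
karpathy = {
    "images": [
        {"filename": "COCO_1.jpg", "split": "train"},
        {"filename": "COCO_2.jpg", "split": "restval"},
        {"filename": "COCO_3.jpg", "split": "val"},
        {"filename": "COCO_4.jpg", "split": "test"},
    ]
}

def bucket_by_split(data, merge_restval=True):
    """Group image filenames by Karpathy split; 'restval' is
    conventionally folded into the training set."""
    buckets = {}
    for img in data["images"]:
        split = img["split"]
        if split == "restval" and merge_restval:
            split = "train"
        buckets.setdefault(split, []).append(img["filename"])
    return buckets

print(bucket_by_split(karpathy))
```

The filenames and helper name are illustrative; only the split-field layout and the restval convention come from standard usage of the Karpathy split.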
Hardware Specification Yes ViPCap requires 14M training parameters and is trained on a single NVIDIA 6000 GPU with a batch size of 128.
Software Dependencies No The paper mentions several tools and models like "GPT-2 (Radford et al. 2018)", "CLIP encoder (Radford et al. 2021)", and "FAISS (Johnson, Douze, and Jegou 2017)", but it does not specify software versions for programming languages, libraries, or frameworks used for implementation (e.g., Python, PyTorch, TensorFlow versions).
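The paper's retrieval step (CLIP embeddings indexed with FAISS) can be illustrated with a plain cosine-similarity search standing in for the FAISS index; the datastore captions and embedding vectors below are hypothetical, not from the paper:

```python
import math

# Hypothetical datastore of caption texts with (toy) embedding vectors.
# In the paper this is a FAISS index over CLIP embeddings; a brute-force
# cosine-similarity search is used here as an illustrative stand-in.
datastore = [
    ("a dog runs on the beach",    [0.9, 0.1, 0.0]),
    ("a man rides a surfboard",    [0.1, 0.9, 0.1]),
    ("two dogs play in the sand",  [0.8, 0.2, 0.1]),
    ("a plate of food on a table", [0.0, 0.1, 0.9]),
]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def retrieve_captions(image_embedding, k=3):
    """Return the k captions most similar to the image embedding
    (the paper retrieves k=3 captions per image)."""
    scored = sorted(datastore,
                    key=lambda item: cosine(image_embedding, item[1]),
                    reverse=True)
    return [caption for caption, _ in scored[:k]]

query = [0.85, 0.15, 0.05]  # hypothetical image embedding
print(retrieve_captions(query, k=3))
```

A real implementation would replace the brute-force loop with a FAISS index over normalized CLIP features, but the retrieval semantics are the same.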
Experiment Setup Yes ViPCap requires 14M training parameters and is trained on a single NVIDIA 6000 GPU with a batch size of 128. During training, the ViP module uses a patch size of M=200, and the hyperparameter α is set to 5. The model selects three retrieved captions per image (k=3) from the COCO datastore due to CLIP's 77-token context length limit. The FFN and cross-attention layer include a 12-head cross-attention layer with a single-layer block. To reduce the computational cost, we scale the dimension of the cross-attention layers from 64 to 16.
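The reported scaling of the cross-attention dimension from 64 to 16 can be put in perspective with a back-of-the-envelope parameter count, assuming a standard multi-head cross-attention layer (Q, K, V, and output projections) with a model width of 768 (GPT-2's hidden size) and the 12 heads stated above; the exact projection shapes in ViPCap may differ:

```python
def cross_attention_params(d_model, num_heads, head_dim):
    """Approximate parameter count of one multi-head cross-attention
    layer: four projections of size d_model x (num_heads * head_dim),
    bias terms ignored."""
    inner = num_heads * head_dim
    return 4 * d_model * inner

d_model, heads = 768, 12          # assumed GPT-2 width; 12 heads per the paper
full   = cross_attention_params(d_model, heads, 64)  # head dimension 64
scaled = cross_attention_params(d_model, heads, 16)  # scaled to 16, as reported
print(full, scaled, full // scaled)  # prints 2359296 589824 4
```

Under these assumptions the dimension scaling cuts the layer's projection parameters by a factor of 4, consistent with the paper's emphasis on keeping the model lightweight (14M trainable parameters overall).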