Leveraging MLLM Embeddings and Attribute Smoothing for Compositional Zero-Shot Learning

Authors: Xudong Yan, Songhe Feng, Yang Zhang, Jian Yang, Yueguan Lin, Haojun Fei

IJCAI 2025

Reproducibility assessment — each entry lists the variable, the result, and the LLM's response:
Research Type: Experimental — "Extensive experiments demonstrate that our method achieves state-of-the-art performance on three challenging datasets. The supplementary material and source code will be available at https://github.com/xud-yan/Trident." [...] Evidence sections: 4 Experiment; 4.1 Experiment Setup; 4.2 Results and Discussion; 4.3 Ablation Study.
Researcher Affiliation: Collaboration — 1. School of Computer Science and Technology, Beijing Jiaotong University; 2. Qifu Technology (contact: EMAIL, {yangjian1, linyueguan, feihaojun}-EMAIL).
Pseudocode: No — The paper describes the methodology in prose and mathematical formulations within the 'Approach' section (Section 3), but it does not contain a clearly labeled 'Pseudocode' or 'Algorithm' block.
Open Source Code: Yes — "The supplementary material and source code will be available at https://github.com/xud-yan/Trident."
Open Datasets: Yes — "We evaluate our model on three challenging CZSL datasets: MIT-States [Isola et al., 2015], C-GQA [Naeem et al., 2021], and VAW-CZSL [Saini et al., 2022]. The common data splits are presented in Table 1."
Dataset Splits: Yes — "We evaluate our model on three challenging CZSL datasets: MIT-States [Isola et al., 2015], C-GQA [Naeem et al., 2021], and VAW-CZSL [Saini et al., 2022]. The common data splits are presented in Table 1."

Table 1: Summary statistics of the datasets used in our experiments.

                          Train         Validation           Test
            |A|   |O|   |Cs|   |X|   |Cs|   |Cu|   |X|   |Cs|   |Cu|   |X|
MIT-States  115   245   1262   30k   300    300    10k   400    400    13k
C-GQA       413   674   5592   27k   1252   1040   7k    888    923    5k
VAW-CZSL    440   541   1252   72k   2121   2322   10k   2449   2470   11k
Hardware Specification: No — The paper states: "We use the visual encoder of LLaVA v1.5, ViT-Large-Patch14-336px, as our frozen visual backbone." and "TRIDENT and all baseline models are trained with the batch size of 128 for 50 epochs under the PyTorch framework [Paszke et al., 2019]". This specifies the model architecture and training parameters, but no concrete hardware details (e.g., GPU model, CPU, memory) are provided.
Software Dependencies: No — The paper mentions training "under the PyTorch framework [Paszke et al., 2019]" and uses "LLaVA v1.5" and "GPT-3.5 [OpenAI, 2023]". While these name the software and models used, specific version numbers for PyTorch or other libraries are not provided.
Experiment Setup: Yes — "We use the visual encoder of LLaVA v1.5, ViT-Large-Patch14-336px, as our frozen visual backbone. TRIDENT and all baseline models are trained with a batch size of 128 for 50 epochs under the PyTorch framework [Paszke et al., 2019]. The number of global features is set to 6, 2, and 4 for the three datasets, respectively, and the number of local features is twice that of the global features. The label smoothing factor is set to 0.09, 0.03, and 0.03 for the three datasets, respectively. The number of auxiliary attributes generated for each composition is set to 3. We train TRIDENT with the Adam optimizer, a weight decay of 5e-5, and learning rates of 1.5e-6 for the word embedding and 2e-4 for the other modules. We decay the learning rate by a factor of 0.1 at epochs 30 and 40. The temperature variable of the cosine similarity δ is set to 0.05. The weighting coefficients γ_ortho, γ_comp, and γ_pri are set to 0.1, 1, and 0.25, respectively."
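The reported optimization settings can be sketched in PyTorch. This is a minimal illustration only: `word_embedding` and `other_modules` are hypothetical placeholders standing in for TRIDENT's actual components, which the paper does not specify at this level of detail.

```python
import torch
from torch import nn

# Illustrative placeholder modules; TRIDENT's real architecture is not
# reproduced here, only the reported optimization settings.
word_embedding = nn.Embedding(1000, 300)
other_modules = nn.Linear(300, 300)

# Adam with weight decay 5e-5 and per-module learning rates:
# 1.5e-6 for the word embedding, 2e-4 for the other modules.
optimizer = torch.optim.Adam(
    [
        {"params": word_embedding.parameters(), "lr": 1.5e-6},
        {"params": other_modules.parameters(), "lr": 2e-4},
    ],
    weight_decay=5e-5,
)

# Decay all learning rates by a factor of 0.1 at epochs 30 and 40 (of 50 total).
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[30, 40], gamma=0.1
)

def cosine_logits(img_feats: torch.Tensor, txt_feats: torch.Tensor,
                  delta: float = 0.05) -> torch.Tensor:
    """Cosine similarity between feature sets, scaled by temperature delta = 0.05."""
    img = nn.functional.normalize(img_feats, dim=-1)
    txt = nn.functional.normalize(txt_feats, dim=-1)
    return img @ txt.T / delta
```

Per-parameter-group learning rates are the standard PyTorch way to train a word embedding more gently than the rest of the model, and `MultiStepLR` reproduces the "decay by 0.1 at epochs 30 and 40" schedule when stepped once per epoch.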