Seeing the Unseen: Composing Outliers for Compositional Zero-Shot Learning
Authors: Chenchen Jing, Mingyu Liu, Hao Chen, Yuling Xi, Xingyuan Bu, Dong Gong, Chunhua Shen
IJCAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on three datasets show the effectiveness of our method in both the closed-world setting and the open-world setting. |
| Researcher Affiliation | Collaboration | Chenchen Jing (1,2), Mingyu Liu (3), Hao Chen (3), Yuling Xi (3), Xingyuan Bu (4), Dong Gong (5), Chunhua Shen (1,2) — 1: College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou, China; 2: Zhejiang Key Laboratory of Visual Information Intelligent Processing, Hangzhou, China; 3: Zhejiang University, China; 4: Alibaba Group; 5: The University of New South Wales |
| Pseudocode | No | The paper describes the method and architecture using natural language and figures (e.g., Figure 2: Overview of our method), but it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any explicit statement about releasing source code or provide a link to a code repository. It mentions using CLIP as a backbone but not providing their own implementation code. |
| Open Datasets | Yes | We conduct experiments on three widely used datasets, UT-Zappos [Yu and Grauman, 2014], MIT-States [Isola et al., 2015], and C-GQA [Naeem et al., 2021]. |
| Dataset Splits | Yes | UT-Zappos is a fine-grained dataset consisting of 116 shoe classes composed of 16 attributes (e.g., rubber) and 12 objects (e.g., sandal). The dataset is split into 83 seen and 15/18 unseen compositions for training and validation/testing. MIT-States consists of 53,753 crawled web images labeled with 1,962 attribute-object pairs. The dataset contains 1,262 seen and 300/400 unseen compositions for training and validation/testing, respectively. C-GQA contains over 9,000 common compositions and is split into 5,592 seen and 1,040/923 unseen compositions for training and validation/testing, respectively. |
| Hardware Specification | No | The paper mentions using ResNet and CLIP (ViT-L/14) as backbones but does not specify any particular hardware like GPU models, CPU types, or memory used for experiments. |
| Software Dependencies | No | The paper mentions using backbones like ResNet [He et al., 2016] and CLIP [Radford et al., 2021] but does not provide specific version numbers for any software, libraries, or programming languages used. |
| Experiment Setup | Yes | For the CLIP backbone, the training epochs for each dataset are set as 5/15 for the two stages, respectively. In the first stage, the hyper-parameters α1, α2, and α3 are set as (0.1, 0.1, 5.0) for UT-Zappos, (0.01, 0.01, 1.0) for MIT-States, and (0.1, 0.5, 1.0) for C-GQA, respectively. For the ResNet backbone, the training epochs for each dataset are set as 50/100 for the two stages, respectively. The hyper-parameters α1, α2, and α3 are set as (5.0, 0.1, 5.0) for UT-Zappos, (5.0, 0.1, 1.0) for MIT-States, and (0.1, 1.0, 1.0) for C-GQA, respectively. |
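The per-backbone, per-dataset hyper-parameters reported in the Experiment Setup row can be collected into a single lookup table, which makes the reported configuration easier to scan and reuse. This is a minimal sketch; the dictionary structure and the `get_setup` helper are our own illustrative names, not part of the authors' released code (none is available), and only the numeric values are taken from the paper.

```python
# Hyper-parameters transcribed from the paper's experiment setup.
# Structure and names are illustrative; values come from the quoted text.
EXPERIMENT_SETUP = {
    "CLIP": {
        "epochs": (5, 15),  # stage 1 / stage 2 training epochs
        "alphas": {  # (alpha1, alpha2, alpha3) per dataset
            "UT-Zappos": (0.1, 0.1, 5.0),
            "MIT-States": (0.01, 0.01, 1.0),
            "C-GQA": (0.1, 0.5, 1.0),
        },
    },
    "ResNet": {
        "epochs": (50, 100),
        "alphas": {
            "UT-Zappos": (5.0, 0.1, 5.0),
            "MIT-States": (5.0, 0.1, 1.0),
            "C-GQA": (0.1, 1.0, 1.0),
        },
    },
}

def get_setup(backbone: str, dataset: str):
    """Return ((stage1, stage2) epochs, (alpha1, alpha2, alpha3)) for a pair."""
    cfg = EXPERIMENT_SETUP[backbone]
    return cfg["epochs"], cfg["alphas"][dataset]

epochs, alphas = get_setup("CLIP", "UT-Zappos")
```

A reproduction attempt would still need the pieces the paper omits (optimizer, learning rate, batch size, hardware), which is consistent with the "No" entries above.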