Learning from True-False Labels via Multi-modal Prompt Retrieving

Authors: Zhongnian Li, Jinghao Xu, Peng Ying, Meng Wei, Xinzheng Xu

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results demonstrate the effectiveness of the proposed TFL setting and MRP learning method. The code to reproduce the experiments is at github.com/Tranquilxu/TMP. 4. Experiments. 4.1. Experimental Setup. Dataset. The efficacy of our method was evaluated on five distinct multi-class image classification datasets that feature both coarse-grained (CIFAR-100 (Krizhevsky et al., 2009), Tiny-ImageNet (Le & Yang, 2015) and Caltech-101 (Fei-Fei et al., 2004)) and fine-grained (Food-101 (Bossard et al., 2014) and Stanford Cars (Krause et al., 2013)) classification in different domains. Table 2. Comparison results on VLMs-based TFLs in terms of classification accuracy (the higher, the better).
Researcher Affiliation | Academia | (1) School of Computer Science and Technology, China University of Mining and Technology, Xuzhou, China; (2) Mine Digitization Engineering Research Center of the Ministry of Education, China University of Mining and Technology, Xuzhou, China; (3) State Key Lab. for Novel Software Technology, Nanjing University, Nanjing, China. Correspondence to: Xinzheng Xu <EMAIL>.
Pseudocode | Yes | A.4. Overall algorithm procedure. Algorithm 1 illustrates the overall algorithm procedure. Through this process, we can learn a high-quality linear classifier and a convolutional-based prompt network. Algorithm 1: TFL learning via MPR. Input: the TF-labeled training set D_TF = {(x_i, (ȳ_i, s_i))}_{i=1}^{N}; the convolutional-based prompt network g_cnn(·); a matrix M whose elements are all 1; the CLIP image encoder g_I(·); the number of epochs T; the Stability Optimization hyperparameter m. Output: model parameter θ_1 for the linear classifier; model parameter θ_2 for g_cnn(·).
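Algorithm 1 itself depends on CLIP features and the prompt network, but the core idea of learning from TF labels can be sketched in isolation. Below is a minimal sketch assuming precomputed features, a plain linear classifier, and a plausible surrogate loss (standard cross-entropy toward ȳ when s = 1, and a negative-label term −log(1 − p_ȳ) when s = 0). This illustrates TF-label supervision generically; it is not the paper's exact objective, and all names here are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def tfl_loss(W, X, y_bar, s):
    """Surrogate TFL risk: cross-entropy toward y_bar when s = 1,
    -log(1 - p_{y_bar}) when s = 0 (push the false label down)."""
    p = softmax(X @ W)
    p_bar = p[np.arange(len(X)), y_bar]
    return np.where(s == 1,
                    -np.log(p_bar + 1e-12),
                    -np.log(1.0 - p_bar + 1e-12)).mean()

def tfl_grad(W, X, y_bar, s):
    """Analytic gradient of the surrogate risk w.r.t. W."""
    p = softmax(X @ W)
    n = len(X)
    onehot = np.zeros_like(p)
    onehot[np.arange(n), y_bar] = 1.0
    p_bar = p[np.arange(n), y_bar][:, None]
    g_true = p - onehot                                      # s = 1 case
    g_false = p_bar / (1.0 - p_bar + 1e-12) * (onehot - p)   # s = 0 case
    g = np.where(s[:, None] == 1, g_true, g_false) / n
    return X.T @ g

# Tiny synthetic run: 3 classes, recoverable linear signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))
y_true = rng.integers(0, 3, size=300)
X[np.arange(300), y_true] += 3.0              # class signal in first 3 dims
y_bar = rng.integers(0, 3, size=300)          # random candidate labels
s = (y_bar == y_true).astype(int)             # truth flag

W = np.zeros((8, 3))
loss_before = tfl_loss(W, X, y_bar, s)
for _ in range(300):                          # plain gradient descent
    W -= 0.1 * tfl_grad(W, X, y_bar, s)
loss_after = tfl_loss(W, X, y_bar, s)
```

Even though no sample carries a directly observed true label with certainty, minimizing this surrogate drives the classifier toward the ground-truth classes, which is the premise the paper's TFL setting builds on.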
Open Source Code | Yes | The code to reproduce the experiments is at github.com/Tranquilxu/TMP.
Open Datasets | Yes | The efficacy of our method was evaluated on five distinct multi-class image classification datasets that feature both coarse-grained (CIFAR-100 (Krizhevsky et al., 2009), Tiny-ImageNet (Le & Yang, 2015) and Caltech-101 (Fei-Fei et al., 2004)) and fine-grained (Food-101 (Bossard et al., 2014) and Stanford Cars (Krause et al., 2013)) classification in different domains. A.5. The details of datasets. CIFAR-100 (Krizhevsky et al., 2009): a coarse-grained dataset comprising 60,000 color images divided into 100 classes. Tiny-ImageNet (Le & Yang, 2015): a coarse-grained dataset consisting of 100,000 color images divided into 200 classes. Caltech-101 (Fei-Fei et al., 2004): a coarse-grained dataset comprising images from 101 object categories and a background category. Food-101 (Bossard et al., 2014): a fine-grained dataset in the food domain, comprising 101,000 images divided into 101 food categories. Stanford Cars (Krause et al., 2013): a fine-grained dataset in the car domain, comprising 16,185 images categorized into 196 car classes.
Dataset Splits | Yes | For each dataset, the label of each image in the training set is replaced with the True-False Label (TFL), and the labels in the test set remain unchanged from the ground-truth labels. More information related to the datasets is shown in Appendix A.5. CIFAR-100 (Krizhevsky et al., 2009): each image is given in a 32×32×3 format, and each class contains 500 training images and 100 test images. Tiny-ImageNet (Le & Yang, 2015): each image is given in a 64×64×3 format, and each class contains 500 training images, 50 validation images and 50 test images. Food-101 (Bossard et al., 2014): each class contains 750 training images and 250 test images. Stanford Cars (Krause et al., 2013): the data is divided into a roughly 50-50 train/test split with 8,144 training images and 8,041 testing images.
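The label replacement described above can be simulated in a few lines. A minimal sketch, assuming TFLs are generated by drawing a candidate label uniformly at random and setting the flag s by comparison with the ground truth (a common simulation protocol for weak-label benchmarks; the paper's exact generation procedure may differ, and `make_true_false_labels` is a hypothetical helper name):

```python
import numpy as np

def make_true_false_labels(y_true, num_classes, rng):
    """Simulate True-False Labels: each training sample gets a randomly
    drawn candidate label y_bar and a flag s = 1 iff y_bar is correct."""
    y_true = np.asarray(y_true)
    y_bar = rng.integers(0, num_classes, size=len(y_true))
    s = (y_bar == y_true).astype(int)
    return y_bar, s

rng = np.random.default_rng(0)
y_true = rng.integers(0, 100, size=1000)   # e.g. CIFAR-100-style labels
y_bar, s = make_true_false_labels(y_true, 100, rng)
```

Note that under uniform sampling only about 1/num_classes of the training samples carry a "true" flag, which is what makes the setting weakly supervised; the test set keeps its ground-truth labels, as stated above.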
Hardware Specification | Yes | Unless otherwise noted, all models are trained for 50 epochs with a batch size of 256 on a single NVIDIA RTX 4090 GPU. Figure 4. Comparison of training cost between TMP and weakly supervised learning methods. The numbers in the figure represent the average time (in seconds) required to train each method for a single epoch. The experiments utilize the CIFAR-100 and Caltech-101 datasets, conducted on a single NVIDIA RTX 4090 GPU.
Software Dependencies | No | To ensure fair comparisons, for all experiments, we use CLIP with ViT-L/14 as the vision backbone, and employ the AdamW optimizer (Loshchilov & Hutter, 2019) for the linear classifier with an initial learning rate of 1e-3, a weight decay parameter set to 0.9, and a minimum learning rate of 5e-6.
Experiment Setup | Yes | Implementation details. To ensure fair comparisons, for all experiments, we use CLIP with ViT-L/14 as the vision backbone, and employ the AdamW optimizer (Loshchilov & Hutter, 2019) for the linear classifier with an initial learning rate of 1e-3, a weight decay parameter set to 0.9, and a minimum learning rate of 5e-6. Unless otherwise noted, all models are trained for 50 epochs with a batch size of 256 on a single NVIDIA RTX 4090 GPU. In our experiments, we employ the AdamW optimizer for the convolutional-based prompt network with an initial learning rate of 8e-2, a weight decay parameter set to 0.01, and a minimum learning rate of 5e-4. The hyperparameters K_T and K_I are set to 15 and 5, respectively. The size of matrix M is set to 224×224×1.
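The setup above quotes an initial and a minimum learning rate for each optimizer, which implies a decaying schedule, but the excerpt does not say which one. The sketch below assumes standard cosine annealing from the initial to the minimum rate over the stated 50 epochs; `cosine_lr` is a hypothetical helper, not taken from the paper's code.

```python
import math

def cosine_lr(epoch, total_epochs=50, lr_init=1e-3, lr_min=5e-6):
    """Cosine annealing: lr_init at epoch 0, decaying to lr_min at the end."""
    t = min(max(epoch, 0), total_epochs) / total_epochs
    return lr_min + 0.5 * (lr_init - lr_min) * (1.0 + math.cos(math.pi * t))

# Linear-classifier schedule from the quoted setup: 1e-3 -> 5e-6 over 50 epochs.
lrs = [cosine_lr(e) for e in range(51)]
```

The same helper with `lr_init=8e-2, lr_min=5e-4` would cover the prompt-network optimizer quoted above, if it indeed follows the same annealing scheme.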