CLImage: Human-Annotated Datasets for Complementary-Label Learning
Authors: Hsiu-Hsuan Wang, Mai Tan Ha, Nai-Xuan Ye, Wei-I Lin, Hsuan-Tien Lin
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our efforts resulted in the creation of four datasets: CLCIFAR10, CLCIFAR20, CLMicroImageNet10, and CLMicroImageNet20, derived from well-known classification datasets CIFAR10, CIFAR100, and TinyImageNet200. These datasets represent the very first real-world CLL datasets, namely CLImage, which are publicly available at: https://github.com/ntucllab/CLImage_Dataset. Through extensive benchmark experiments, we discovered a notable decrease in performance when transitioning from synthetically labeled datasets to real-world datasets. We investigated the key factors contributing to the decrease with a thorough dataset-level ablation study. |
| Researcher Affiliation | Academia | Hsiu-Hsuan Wang (EMAIL), Department of Computer Science and Information Engineering, National Taiwan University |
| Pseudocode | Yes | An algorithmic description of the protocol is as follows. For each image x: 1. Uniformly sample four labels without replacement from the label set [K]. 2. Ask the annotator to select one complementary label ȳ from the four sampled labels. 3. Add the pair (x, ȳ) to the complementary dataset. |
| Open Source Code | Yes | These datasets represent the very first real-world CLL datasets, namely CLImage, which are publicly available at: https://github.com/ntucllab/CLImage_Dataset. |
| Open Datasets | Yes | Our efforts resulted in the creation of four datasets: CLCIFAR10, CLCIFAR20, CLMicroImageNet10, and CLMicroImageNet20, derived from well-known classification datasets CIFAR10, CIFAR100, and TinyImageNet200. These datasets represent the very first real-world CLL datasets, namely CLImage, which are publicly available at: https://github.com/ntucllab/CLImage_Dataset. |
| Dataset Splits | Yes | The CLCIFAR10 and CLCIFAR20 datasets each contain 50,000 training instances and 10,000 testing instances. For the CLMicroImageNet datasets, CLMicroImageNet10 has 5,000 training instances and 500 testing instances, whereas CLMicroImageNet20 includes 10,000 training instances and 1,000 testing instances. The learning rate was selected from {10⁻³, 5×10⁻⁴, 10⁻⁴, 5×10⁻⁵, 10⁻⁵} using a 10% hold-out validation set. |
| Hardware Specification | Yes | The experiments were run with Tesla V100-SXM2. |
| Software Dependencies | No | Then, we trained a ResNet18 (He et al., 2016) model using the baseline methods mentioned above on the single CLL dataset using the Adam optimizer for 300 epochs without learning rate scheduling. |
| Experiment Setup | Yes | Then, we trained a ResNet18 (He et al., 2016) model using the baseline methods mentioned above on the single CLL dataset using the Adam optimizer for 300 epochs without learning rate scheduling. Detailed results from the ablation study on various neural network architectures, which further justify our choice of ResNet18 as the backbone, are available in Appendix A.6. The training settings included a fixed weight decay of 10⁻⁴ and a batch size of 512. The experiments were run with Tesla V100-SXM2. For better generalization, we applied standard data augmentation techniques (Random Horizontal Flip, Random Crop) and normalization to each image. The learning rate was selected from {10⁻³, 5×10⁻⁴, 10⁻⁴, 5×10⁻⁵, 10⁻⁵} using a 10% hold-out validation set. |
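The three-step annotation protocol quoted in the Pseudocode row can be sketched as follows. This is a minimal illustration, not the authors' collection code; `annotator` is a hypothetical callable standing in for the human worker who picks one label the image does not belong to.

```python
import random

def collect_complementary_label(image, annotator, num_classes=10, num_choices=4):
    """One round of the CLImage annotation protocol (sketch).

    `annotator(image, candidates)` is a hypothetical stand-in for a human
    who returns one label from `candidates` that the image is NOT.
    """
    # Step 1: uniformly sample four labels without replacement from [K].
    candidates = random.sample(range(num_classes), num_choices)
    # Step 2: the annotator selects one complementary label from the four.
    cl = annotator(image, candidates)
    if cl not in candidates:
        raise ValueError("annotator must choose from the sampled candidates")
    # Step 3: the (image, complementary label) pair joins the dataset.
    return image, cl
```

Sampling without replacement guarantees the four candidates are distinct, so the annotator always has four different labels to rule out.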
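The learning-rate selection described in the Dataset Splits and Experiment Setup rows (grid of five rates, 10% hold-out validation set) can be sketched in pure Python. Here `evaluate` is a hypothetical stand-in for training ResNet18 with Adam at a given rate and scoring it on the validation split; only the split and the selection loop are shown.

```python
import random

# Learning-rate grid reported in the paper.
LEARNING_RATES = [1e-3, 5e-4, 1e-4, 5e-5, 1e-5]

def holdout_split(n_samples, holdout_frac=0.1, seed=0):
    """Shuffle indices and carve off a 10% hold-out validation set."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    n_val = int(n_samples * holdout_frac)
    return idx[n_val:], idx[:n_val]  # (train indices, validation indices)

def select_learning_rate(evaluate, n_samples):
    """Return the rate whose model scores best on the hold-out set.

    `evaluate(train_idx, val_idx, lr) -> float` is a hypothetical stand-in
    for one full training run followed by validation scoring.
    """
    train_idx, val_idx = holdout_split(n_samples)
    return max(LEARNING_RATES, key=lambda lr: evaluate(train_idx, val_idx, lr))
```

For CLCIFAR10's 50,000 training instances this reserves 5,000 images for validation and trains on the remaining 45,000 at each candidate rate.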