Iterative Self-Training with Class-Aware Text-to-Image Synthesis for Visual Task Learning

Authors: Xiang Zhang, Wanqing Zhao, Pengyang Li, Ying Liu, Hangzai Luo, Sheng Zhong, Jinye Peng, Jianping Fan

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on PASCAL VOC (Everingham et al. 2010) and MSCOCO (Lin et al. 2014) show that IST-CATS outperforms most existing synthetic, semi-supervised, and weakly-supervised methods in object detection and semantic segmentation. Ablation Studies: To evaluate the effectiveness of IST-CATS, we conduct ablation experiments on the PASCAL VOC dataset.
Researcher Affiliation | Academia | Northwest University, Xi'an, China {xiangz@, zhaowq@, lipengyang@stumail., liuying6@stumail., hzluo@, szhong@, pjy@, jfan@}nwu.edu.cn
Pseudocode | No | The paper describes the methodology using textual explanations and diagrams (e.g., Figure 2) but does not include any explicit pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain an explicit statement about releasing source code or a link to a code repository.
Open Datasets | Yes | Extensive experiments on PASCAL VOC (Everingham et al. 2010) and MSCOCO (Lin et al. 2014) show that IST-CATS outperforms most existing synthetic, semi-supervised, and weakly-supervised methods in object detection and semantic segmentation. To evaluate the performance of our IST-CATS in semantic segmentation tasks, we utilize category information from the PASCAL VOC 2012 and MSCOCO datasets as inputs for our class-aware text-to-image synthesis framework, resulting in the creation of the Syn-VOC and Syn-COCO datasets.
Dataset Splits | Yes | The Syn-VOC dataset consists of 51,924 images with 20 object classes, which are further divided into 46,731 training images and 5,193 validation images. The Syn-COCO includes 154,092 images with 80 object classes, and it is split into 138,682 training images and 15,410 validation images. Regarding object detection tasks, our evaluation focused on the PASCAL VOC 2007 and 2012 test sets, as well as the MSCOCO val set.
Hardware Specification | Yes | In our experiment, the models are trained on an NVIDIA RTX 2080 Ti GPU using PyTorch.
Software Dependencies | No | In our experiment, the models are trained on an NVIDIA RTX 2080 Ti GPU using PyTorch. We employ the pretrained ResNet101 on ImageNet as the backbone for the segmentation network (i.e., DeepLabv3+ (Chen et al. 2018)). For object detection, we utilize YOLOv5x (Jocher et al. 2021) as our detector. The paper mentions software such as PyTorch, ResNet101, DeepLabv3+, and YOLOv5x, but does not provide specific version numbers for these.
Experiment Setup | Yes | The network is trained with mini-batch stochastic gradient descent (SGD) using a batch size of 8, weight decay of 0.0002, and momentum of 0.9 over 60 epochs. We apply data augmentation techniques such as random horizontal flipping and random cropping, which resized the images to 513 × 513. The initial learning rate for DeepLabv3+ is set to 4e-3 and decreases gradually using polynomial decay with a power of 0.9. For object detection... we opt for a batch size of 16 and initialize the learning rate to 0.00334, alongside a weight decay of 0.00025 and momentum of 0.74832. Input images are resized to 512 × 512 pixels, and training lasts for 50 epochs.
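
The polynomial learning-rate decay quoted in the Experiment Setup row can be illustrated with a minimal sketch. It is not the authors' code (none is released): the function name `poly_lr` is hypothetical, and it assumes the common form lr = base_lr * (1 - step / total_steps) ^ power, applied once per epoch. The paper excerpt does not say whether the decay is stepped per epoch or per iteration, so that choice is an assumption.

```python
# Sketch of polynomial learning-rate decay using the segmentation
# hyperparameters quoted above (initial LR 4e-3, power 0.9, 60 epochs).
# ASSUMPTION: the schedule is evaluated once per epoch; the source does not
# state the step granularity.

BASE_LR = 4e-3   # initial learning rate for DeepLabv3+ (from the source)
POWER = 0.9      # polynomial decay power (from the source)
EPOCHS = 60      # number of training epochs (from the source)


def poly_lr(base_lr: float, step: int, total_steps: int, power: float = POWER) -> float:
    """Return base_lr * (1 - step / total_steps) ** power (hypothetical helper)."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp the step so the base of the power never becomes negative.
    step = min(max(step, 0), total_steps)
    return base_lr * (1.0 - step / total_steps) ** power


# Print the learning rate at a few epochs to show the shape of the decay.
for epoch in (0, 15, 30, 45, 59):
    print(f"epoch {epoch:2d}: lr = {poly_lr(BASE_LR, epoch, EPOCHS):.6f}")
```

With a power below 1, the rate falls slightly slower than linearly for most of training and then drops quickly toward zero in the final epochs. The same helper could be evaluated per iteration by passing iteration counts instead of epoch counts.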