Semantic-Space-Intervened Diffusive Alignment for Visual Classification

Authors: Zixuan Li, Lei Meng, Guoqing Chao, Wei Wu, Yimeng Yang, Xiaoshuo Yan, Zhuang Qi, Xiangxu Meng

IJCAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that SeDA achieves stronger cross-modal feature alignment, leading to superior performance over existing methods across multiple scenarios. Extensive experiments are conducted on the general dataset NUS-WIDE, the domain-specific dataset VIREO Food-172, and the video dataset MSRVTT, including performance comparisons, ablation studies, in-depth analysis, and case studies.
Researcher Affiliation | Academia | Zixuan Li¹, Lei Meng¹, Guoqing Chao², Wei Wu¹, Yimeng Yang¹, Xiaoshuo Yan¹, Zhuang Qi¹, Xiangxu Meng¹. ¹School of Software, Shandong University, Jinan, China; ²School of Computer Science and Technology, Harbin Institute of Technology, Weihai, China. EMAIL, EMAIL, EMAIL
Pseudocode | No | The paper describes the proposed method (SeDA) in detail, including its modules (PFIN, DSL, DST) and their mathematical formulations and optimization processes. However, it does not present any explicitly labeled 'Pseudocode' or 'Algorithm' block with structured steps.
Open Source Code | No | The paper does not contain an unambiguous statement from the authors about releasing their code for the described methodology, nor does it provide a direct link to a source-code repository.
Open Datasets | Yes | VIREO Food-172 [Chen and Ngo, 2016]: a single-label dataset with 110,241 food images in 172 categories and an average of three text descriptions per image. NUS-WIDE [Chua et al., 2009]: a multi-label dataset of 203,598 images (after filtering) in 81 categories, with textual tags from a 1000-word vocabulary. MSRVTT [Xu et al., 2016]: a video dataset with 10,000 YouTube clips and 200,000 captions.
Dataset Splits | Yes | VIREO Food-172: 66,071 training and 33,154 test images. NUS-WIDE: 121,962 training and 81,636 test images. MSRVTT: 7,010 videos for training and 2,990 for testing.
Hardware Specification | Yes | Our experiments were conducted on a single NVIDIA Tesla V100 GPU.
Software Dependencies | Yes | Our experiments were conducted on a single NVIDIA Tesla V100 GPU, using PyTorch 1.10.2, and the batch size is 64.
Experiment Setup | Yes | In this experiment, we chose Adam as the optimizer for the model, with a weight decay of 1e-4. The learning rate for all neural networks was set between 1e-4 and 5e-5, and decayed to half of its original value every four training epochs. For the loss weights mentioned in the training strategy, we selected α1 and α2 between 0.1 and 2.0, the time step T between 900 and 1500, and the staged step t between 0 and 500, while β and γ were chosen from [0.5, 1.0, 1.5, 2.0]. Our experiments were conducted on a single NVIDIA Tesla V100 GPU, using PyTorch 1.10.2, and the batch size was 64.
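The reported setup (Adam with weight decay 1e-4, base learning rate between 5e-5 and 1e-4, halved every four epochs, batch size 64) can be sketched in plain Python. This is an illustrative reconstruction of the schedule and search ranges, not the authors' code; `lr_at_epoch` and the `search_space` dictionary are hypothetical names:

```python
def lr_at_epoch(base_lr: float, epoch: int, step: int = 4, factor: float = 0.5) -> float:
    """Learning rate after `epoch` epochs, multiplied by `factor` every `step` epochs
    (a step-decay schedule, as described in the experiment setup)."""
    return base_lr * factor ** (epoch // step)

# Hyperparameter ranges as reported in the paper's setup (illustrative layout).
search_space = {
    "base_lr": (5e-5, 1e-4),        # learning-rate range for all networks
    "weight_decay": 1e-4,           # Adam weight decay
    "batch_size": 64,
    "alpha1, alpha2": (0.1, 2.0),   # loss weights
    "T": (900, 1500),               # diffusion time step
    "t_staged": (0, 500),           # staged step
    "beta, gamma": [0.5, 1.0, 1.5, 2.0],
}

for epoch in (0, 4, 8):
    print(f"epoch {epoch}: lr = {lr_at_epoch(1e-4, epoch):.1e}")
```

In a PyTorch 1.10 training loop, the same schedule would typically be expressed with `torch.optim.Adam(..., weight_decay=1e-4)` and `torch.optim.lr_scheduler.StepLR(optimizer, step_size=4, gamma=0.5)`.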