CP-DETR: Concept Prompt Guide DETR Toward Stronger Universal Object Detection

Authors: Qibo Chen, Weizhong Jin, Jianyue Ge, Mengdi Liu, Yuchao Yan, Jian Jiang, Li Yu, Xuanjiang Guo, Shuchang Li, Jianzhong Chen

AAAI 2025

Reproducibility Variable | Result | LLM Response

Research Type | Experimental | CP-DETR demonstrates superior universal detection performance across a broad spectrum of scenarios. For example, our Swin-T backbone model achieves 47.6 zero-shot AP on LVIS, and the Swin-L backbone model achieves 32.2 zero-shot AP on ODinW35. Furthermore, our visual prompt generation method achieves 68.4 AP on COCO val by interactive detection, and the optimized prompt achieves 73.1 fully-shot AP on ODinW13.

Researcher Affiliation | Industry | China Mobile (Zhejiang) Research & Innovation Institute EMAIL

Pseudocode | No | The paper describes its methods through architectural diagrams (Figure 1, Figure 2) and mathematical equations, but it does not contain any structured pseudocode or algorithm blocks.

Open Source Code | No | The paper does not contain any explicit statement about releasing source code, nor does it provide a link to a code repository.

Open Datasets | Yes | For the object level, we use publicly available detection datasets, which contain the Objects365 (Shao et al. 2019) (O365), Open Images (Kuznetsova et al. 2020) (OI), V3Det (Wang et al. 2023), LVIS (Gupta, Dollar, and Girshick 2019), and COCO (Lin et al. 2014) datasets. For grounding or REC data, we used the GoldG (Kamath et al. 2021), RefCOCO/+/g (Yu et al. 2016; Mao et al. 2016), Visual Genome (Krishna et al. 2017) (VG), and PhraseCut (Wu et al. 2020) datasets.

Dataset Splits | No | The paper lists the datasets used for training and evaluation (e.g., O365, V3Det, GoldG, COCO, LVIS, ODinW) and refers to standard evaluation benchmarks (COCO, LVIS, ODinW, RefC), implying their standard test splits, but it does not explicitly describe how the combined training data or any specific dataset was split into training, validation, and test sets, nor does it specify exact percentages or sample counts for these splits.

Hardware Specification | Yes | In all experiments, we use AdamW as the optimizer with weight decay set to 1e-4 and set a minibatch to 32 on 8 A100 40GB GPUs.

Software Dependencies | No | The paper mentions specific models such as "CLIP-L" and "Swin-Tiny and Swin-Large" and the AdamW optimizer, but it does not specify version numbers for core software components such as programming languages (e.g., Python), deep learning frameworks (e.g., PyTorch, TensorFlow), or CUDA libraries.

Experiment Setup | Yes | In all experiments, we use AdamW as the optimizer with weight decay set to 1e-4 and set a minibatch to 32 on 8 A100 40GB GPUs. In pre-training, the learning rate was set to 1e-5 for the text encoder and image backbone and 1e-4 for the rest of the modules, and a decay of 0.1 was applied at 80% and 90% of the total training steps. In visual prompt training, the O365, V3Det, GoldG, and OI datasets are used, the learning rate of the visual prompt encoder is set to 1e-4, and training is performed for 0.5M iterations. In the optimized prompt, the learning rate of the embedding layer is set to 5e-2, the total number of training epochs is 24, and a decay of 0.1 is applied at 80% of the total training steps.
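The step-decay schedule quoted above (multiply the learning rate by 0.1 when training reaches 80% and 90% of the total steps) can be sketched in plain Python. This is a minimal illustration, not the authors' code; the function name `lr_at_step` and the milestone fractions passed to it are our own, with the values (base LR 1e-4, decay factor 0.1, milestones at 80%/90%) taken from the setup described in the paper.

```python
def lr_at_step(step, total_steps, base_lr, milestones=(0.8, 0.9), gamma=0.1):
    """Step-decay schedule: multiply base_lr by gamma at each
    milestone, where milestones are fractions of total_steps."""
    lr = base_lr
    for frac in milestones:
        if step >= int(frac * total_steps):
            lr *= gamma
    return lr

# Pre-training recipe from the paper: LR 1e-4 for most modules
# (1e-5 for the text encoder and image backbone), decayed by 0.1
# at 80% and 90% of the total training steps.
total = 1_000_000  # hypothetical step count for illustration
early = lr_at_step(100_000, total, 1e-4)   # before any milestone: 1e-4
mid = lr_at_step(850_000, total, 1e-4)     # past 80%: decayed once
late = lr_at_step(950_000, total, 1e-4)    # past 90%: decayed twice
print(early, mid, late)
```

The optimized-prompt stage follows the same pattern with a single milestone: `lr_at_step(step, total, 5e-2, milestones=(0.8,))` reproduces its decay of 0.1 at 80% of training.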