YOLO-RD: Introducing Relevant and Compact Explicit Knowledge to YOLO by Retriever-Dictionary

Authors: Hao-Tang Tsui, Chien-Yao Wang, Hong-Yuan Mark Liao

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The experiments show that using the RD significantly improves model performance, achieving more than a 3% increase in mean Average Precision for object detection with less than a 1% increase in model parameters. Beyond YOLO, the RD module improves the effectiveness of two-stage models and DETR-based architectures, such as Faster R-CNN and Deformable DETR.
Researcher Affiliation | Academia | Hao-Tang Tsui, Chien-Yao Wang & Hong-Yuan Mark Liao, Institute of Information Science, Academia Sinica.
Pseudocode | Yes | A.6 Pseudo Code of Full Training Process of Retriever-Dictionary Model. Algorithm 1: Train a model with Retriever-Dictionary. Data: Dataset with images and bounding boxes. Result: Trained model with Retriever-Dictionary. // Initialization of the Dictionary
Open Source Code | Yes | Code is released at https://github.com/henrytsui000/YOLO.
Open Datasets | Yes | We primarily validated the method on the Microsoft COCO dataset (Lin et al., 2014), training on the COCO 2017 train set and evaluating on the COCO 2017 validation set... For the classification task, we used the CIFAR-100 dataset with the YOLOv9-classify model, using top-1 and top-5 accuracy as metrics... Using pre-trained weights from the MSCOCO dataset, we trained the model on the VOC (Everingham et al., 2010) dataset with three learning-rate schedules: 10-epoch fast training, 100-epoch full tuning, and training from scratch.
Dataset Splits | Yes | We primarily validated the method on the Microsoft COCO dataset (Lin et al., 2014), training on the COCO 2017 train set and evaluating on the COCO 2017 validation set... For object detection and segmentation, we respectively used mAP and mAP@.5 as evaluation metrics... For the classification task, we used the CIFAR-100 dataset... Latency was measured on a single Nvidia 3090 GPU without any external acceleration tools, using milliseconds per batch with a batch size of 32 on the MSCOCO validation set.
Hardware Specification | Yes | All experiments were conducted using 8 Nvidia V100 GPUs... The classification task on CIFAR-100 (Krizhevsky et al., 2009) took 2 hours on a single Nvidia 4090 GPU for 100 epochs. Latency was measured on a single Nvidia 3090 GPU without any external acceleration tools, using milliseconds per batch with a batch size of 32 on the MSCOCO validation set.
Software Dependencies | No | The paper names frameworks but no software versions: testing on YOLOv7, YOLOv9, Faster R-CNN, and Deformable DETR... We also trained a modified Faster R-CNN, based on the mm-detection framework, for a maximum of 120 epochs over 3 days.
Experiment Setup | Yes | In the main series of experiments, we trained a modified YOLOv7 model, which included the addition of a Retriever-Dictionary module, for 300 epochs in 2 days. The YOLOv9-based model was trained for 500 epochs over 5 days. We also trained a modified Faster R-CNN, based on the mm-detection framework, for a maximum of 120 epochs over 3 days. For Deformable DETR, we trained for approximately 120 epochs over 7 days. The classification task on CIFAR-100 (Krizhevsky et al., 2009) took 2 hours on a single Nvidia 4090 GPU for 100 epochs. Latency was measured on a single Nvidia 3090 GPU without any external acceleration tools, using milliseconds per batch with a batch size of 32 on the MSCOCO validation set.
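The Pseudocode row above refers to Algorithm 1, which trains a model after initializing a dictionary and retrieving knowledge from it per feature. As a rough illustration of the retrieve-and-combine idea only, here is a minimal numpy sketch: a linear retriever scores a dictionary of atoms against a backbone feature and adds the softmax-weighted combination back to the feature. All shapes, names, and the softmax retrieval rule are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

class RetrieverDictionary:
    """Hypothetical sketch of a Retriever-Dictionary (RD) forward pass."""

    def __init__(self, feat_dim: int, num_atoms: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        # Dictionary of compact "explicit knowledge" atoms. Initialized
        # randomly here; the paper initializes the dictionary from data.
        self.dictionary = rng.standard_normal((num_atoms, feat_dim))
        # Linear retriever that scores each atom's relevance to a feature.
        self.retriever = rng.standard_normal((feat_dim, num_atoms)) * 0.1

    def __call__(self, feat: np.ndarray) -> np.ndarray:
        scores = feat @ self.retriever                  # (num_atoms,)
        scores -= scores.max()                          # numerical stability
        coeffs = np.exp(scores) / np.exp(scores).sum()  # softmax weights
        knowledge = coeffs @ self.dictionary            # retrieved knowledge
        return feat + knowledge                         # enriched feature

rd = RetrieverDictionary(feat_dim=8, num_atoms=32)
backbone_feat = np.ones(8)
enriched = rd(backbone_feat)
print(enriched.shape)  # (8,)
```

In this toy form the enriched feature keeps the backbone feature's dimensionality, which is consistent with the paper's claim that the module adds under 1% to the parameter count: only the retriever projection and the dictionary atoms are new parameters.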