Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty, so scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
CR2PQ: Continuous Relative Rotary Positional Query for Dense Visual Representation Learning
Authors: Shaofeng Zhang, Qiang Zhou, Sitong Wu, Haoru Tan, Zhibin Wang, Jinfa Huang, Junchi Yan
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our extensive experiments on standard datasets demonstrate state-of-the-art (SOTA) results. Compared to the previous SOTA method (PQCL), our approach achieves significant improvements on COCO: with 300 epochs of pretraining, CR2PQ obtains 3.4% mAP^bb and 2.1% mAP^mk improvements for detection and segmentation tasks, respectively. Furthermore, CR2PQ exhibits faster convergence, achieving 10.4% mAP^bb and 7.9% mAP^mk improvements over SOTA with just 40 epochs of pretraining. |
| Researcher Affiliation | Collaboration | 1Sch. of Computer Science & Sch. of Artificial Intelligence, Shanghai Jiao Tong University, 2INF Tech Co., Ltd., 3CUHK, 4HKU, 5Peking University. |
| Pseudocode | Yes | Code 1 shows the implementation of how to compute the relative coordinates matrix of rp B. Listing 1: Computing Relative Coordinates |
| Open Source Code | Yes | Code: https://github.com/Sherrylone/PQRoPE |
| Open Datasets | Yes | We conduct self-supervised pre-training on the ImageNet-1K (Deng et al., 2009) training set with 1,000 classes, as used in SSL for both MIM (He et al., 2021) and contrastive learning (Chen et al., 2020a). We also transfer the encoder pre-trained by CR2PQ to the MS-COCO (Lin et al., 2014) and ADE20K (Zhou et al., 2017) datasets. |
| Dataset Splits | Yes | MS COCO (Lin et al., 2014) is a large-scale object detection, segmentation, and captioning dataset: in particular, train 2017 and val 2017 splits contain 118K and 5K images, respectively. ... ADE20K (Zhou et al., 2017), which contains 150 fine-grained semantic categories and 25K training data. |
| Hardware Specification | Yes | The experiments are performed on a workstation with 32 V100 GPUs by default (if not otherwise specified). ... Specifically, we pre-train the ViT-Large with 800 epochs with batch size 2048, distributed on 16 A100 GPUs with the base learning rate 1.5e-4. |
| Software Dependencies | No | We follow the basic configuration of mmdetection (Chen et al., 2019) for fine-tuning Mask R-CNN (He et al., 2017) with FPN (Lin et al., 2017) under the standard 1x schedule. ... We follow all the configurations of mmsegmentation (Contributors, 2020) for fine-tuning Semantic FPN (Lin et al., 2017) with 40K iterations and an input resolution of 512 × 512. The paper mentions software tools like 'mmdetection' and 'mmsegmentation' along with their corresponding citations, but it does not specify exact version numbers for these or any other software components used. |
| Experiment Setup | Yes | In line with CAE (Chen et al., 2022), we train with AdamW (Loshchilov & Hutter, 2018) and a batch size of 2048, distributed over 32 GPUs using ViT-S/16 (batch size per GPU is 64). For ViT-B, the learning rate is linearly ramped up during the first 40 epochs to its base value determined with the following linear scaling rule (Chen et al., 2020a): blr = 1.5e-4, Batch Size = 2048, and lr = blr × Batch Size / 256. For ViT-S, we set blr as 1.75e-4. After warmup, we decay the learning rate with a cosine schedule (Loshchilov & Hutter, 2016). |
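The Pseudocode row refers to a listing that builds a relative-coordinates matrix between patch positions. The paper's actual Listing 1 is not reproduced here, but the core idea of pairwise relative coordinates on a patch grid can be sketched as follows; the function name and NumPy-based formulation are illustrative assumptions, not the released implementation:

```python
import numpy as np

def relative_coordinates(h, w):
    """Pairwise relative (dy, dx) offsets between all h*w patch positions.

    Returns an (h*w, h*w, 2) array where entry [i, j] is the grid coordinate
    of patch i minus the grid coordinate of patch j. Illustrative sketch only;
    the paper's listing operates on the coordinates used by the rotary
    positional query, which may be continuous rather than integer-valued.
    """
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    coords = np.stack([ys.ravel(), xs.ravel()], axis=-1)   # (h*w, 2)
    # Broadcasting produces every pairwise difference in one step.
    return coords[:, None, :] - coords[None, :, :]          # (h*w, h*w, 2)
```

Note the antisymmetry `rc[i, j] == -rc[j, i]`, which is what lets a rotary-style encoding depend only on relative position.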
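The Experiment Setup row fully specifies the learning-rate schedule: linear warmup over the first 40 epochs to a peak value given by the scaling rule lr = blr × Batch Size / 256, followed by cosine decay. A minimal sketch of that schedule, with an illustrative function name (the defaults match the ViT-B numbers quoted above, blr = 1.5e-4 and batch size 2048):

```python
import math

def lr_at_epoch(epoch, total_epochs, warmup_epochs=40,
                blr=1.5e-4, batch_size=2048):
    """Per-epoch learning rate: linear warmup, then cosine decay to zero."""
    # Linear scaling rule from the setup: lr = blr * batch_size / 256.
    peak_lr = blr * batch_size / 256
    if epoch < warmup_epochs:
        # Ramp linearly from peak_lr / warmup_epochs up to peak_lr.
        return peak_lr * (epoch + 1) / warmup_epochs
    # Cosine decay from peak_lr toward 0 over the remaining epochs.
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))
```

With these defaults the peak rate is 1.5e-4 × 2048 / 256 = 1.2e-3; for ViT-S, pass `blr=1.75e-4` instead.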