Reverse Region-to-Entity Annotation for Pixel-Level Visual Entity Linking
Authors: Zhengfei Xu, Sijia Zhao, Yanchao Hao, Xiaolong Liu, Lili Li, Yuyang Yin, Bo Li, Xi Chen, Xin Xin
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that models trained on this dataset improved accuracy by 18 points compared to zero-shot models. Additionally, the semantic tokenization method achieved a 5-point accuracy improvement over the trained baseline. |
| Researcher Affiliation | Collaboration | 1. School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China; 2. Platform and Content Group, Tencent, Beijing, China |
| Pseudocode | No | The paper describes the methodology and model architecture with figures and textual descriptions, but it does not contain a dedicated section or block labeled "Pseudocode" or "Algorithm" with structured steps. |
| Open Source Code | No | The paper provides a GitHub link specifically labeled "Datasets https://github.com/NP-NET-research/PL-VEL". It does not explicitly state that the source code for the methodology described in the paper is available at this link or any other location. |
| Open Datasets | Yes | We have constructed the Mask OVEN-Wiki dataset through an entirely automatic reverse region-entity annotation framework. This dataset contains over 5 million annotations aligning pixel-level regions with entity-level labels, which will advance visual understanding towards fine-grained. Datasets https://github.com/NP-NET-research/PL-VEL |
| Dataset Splits | Yes | Table 2: Statistics of the Mask OVEN-Wiki (the Train, Val, and Test columns each report two sub-split counts, shown here as pairs).<br>SEEN entities: Train 7,943 / 2,470; Val 1,604 / 199; Test 7,943 / 2,339; Wiki 8,733; Human 2,015.<br>SEEN examples: Train 4,464,748 / 23,514; Val 51,906 / 588; Test 291,327 / 7,460; Wiki 8,733; Human 12,057.<br>UNSEEN entities: Train 0 / 0; Val 1,588 / 433; Test 7,944 / 3,096; Wiki 1,956,412; Human 2,429.<br>UNSEEN examples: Train 0 / 0; Val 56,549 / 1,406; Test 316,817 / 7,979; Wiki 1,956,412; Human 11,100.<br>Total examples: Train 4,464,748 / 23,514; Val 108,455 / 1,964; Test 608,144 / 15,439; Wiki 1,965,145; Human 23,157. |
| Hardware Specification | No | The paper mentions "computational resource constraints" but does not specify any particular hardware components such as GPU or CPU models used for experiments. |
| Software Dependencies | No | The paper mentions several models and approaches, such as "ConvNeXt CLIP", "FastSAM", "LoRA", and "Vicuna", and cites their respective papers, but it does not provide specific version numbers for these software components or for other key libraries/frameworks (e.g., Python, PyTorch versions). |
| Experiment Setup | Yes | The pre-train stage used about 2 million wiki split samples. Due to computational resource constraints and the large size of the dataset (approximately 4.5 million samples), we limited the number of annotated samples per entity to fewer than 50 during the fine-tuning stage. As a result, we used about 7% of the total samples (approximately 0.3 million) in the fine-tuning stage. In addition, all input images were uniformly preprocessed to 512 × 512. The length of the ALD code is limited to 4 tokens. We have implemented a two-stage training strategy for our model. The vision encoder ConvNeXt CLIP (Liu et al. 2022) and the semantic tokenizer FastSAM (Zhao et al. 2023) remain frozen, while the mask-aware visual extractor M and the visual-language projector are fully fine-tuned. The base LLM is fine-tuned with the LoRA (Hu et al. 2022) approach. Both stages employ autoregressive language modeling loss to predict the next token (Liu et al. 2023a). |
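The fine-tuning subsampling described in the Experiment Setup row (keeping at most 50 annotated samples per entity) can be sketched in plain Python. The function name, random seeding, and `(entity_id, annotation)` data layout below are assumptions for illustration, not the authors' released code:

```python
import random
from collections import defaultdict

def cap_samples_per_entity(samples, max_per_entity=50, seed=0):
    """Keep at most `max_per_entity` annotations for each entity.

    `samples` is a list of (entity_id, annotation) pairs. Entities with
    more annotations than the cap are randomly downsampled; the rest
    are kept unchanged.
    """
    rng = random.Random(seed)  # fixed seed for a reproducible subsample
    by_entity = defaultdict(list)
    for entity_id, annotation in samples:
        by_entity[entity_id].append(annotation)
    capped = []
    for entity_id, annotations in by_entity.items():
        if len(annotations) > max_per_entity:
            annotations = rng.sample(annotations, max_per_entity)
        capped.extend((entity_id, a) for a in annotations)
    return capped
```

Applied to a long-tailed dataset like the ~4.5M-sample train split, this kind of cap shrinks the head entities while leaving tail entities intact, which is consistent with the paper's reported reduction to roughly 0.3 million fine-tuning samples.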
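Since the base LLM is fine-tuned with LoRA while the vision backbone stays frozen, a minimal dependency-free sketch of the LoRA idea (a frozen weight W plus a trainable low-rank update B·A) may clarify the setup. The class and method names below are hypothetical and use plain Python lists rather than tensors; this is not the paper's implementation:

```python
class LoRALinear:
    """Frozen linear layer W with a trainable low-rank correction B @ A."""

    def __init__(self, W, A, B, alpha=8):
        self.W = W                   # frozen base weight, shape (out, in)
        self.A = A                   # trainable down-projection, shape (r, in)
        self.B = B                   # trainable up-projection, shape (out, r)
        self.scale = alpha / len(A)  # LoRA scaling factor alpha / r

    def forward(self, x):
        def matvec(M, v):
            # Dense matrix-vector product over plain lists.
            return [sum(m * v_j for m, v_j in zip(row, v)) for row in M]

        base = matvec(self.W, x)                 # frozen path
        delta = matvec(self.B, matvec(self.A, x))  # low-rank update path
        return [b + self.scale * d for b, d in zip(base, delta)]
```

With B initialized to zeros, the layer reproduces the frozen base weight exactly, which is why LoRA fine-tuning starts from the pretrained model's behavior and only gradually learns the task-specific correction.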