Reverse Region-to-Entity Annotation for Pixel-Level Visual Entity Linking

Authors: Zhengfei Xu, Sijia Zhao, Yanchao Hao, Xiaolong Liu, Lili Li, Yuyang Yin, Bo Li, Xi Chen, Xin Xin

AAAI 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Experimental results show that models trained on this dataset improved accuracy by 18 points compared to zero-shot models. Additionally, the semantic tokenization method achieved a 5-point accuracy improvement over the trained baseline.
Researcher Affiliation Collaboration (1) School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China; (2) Platform and Content Group, Tencent, Beijing, China
Pseudocode No The paper describes the methodology and model architecture with figures and textual descriptions, but it does not contain a dedicated section or block labeled "Pseudocode" or "Algorithm" with structured steps.
Open Source Code No The paper provides a GitHub link specifically labeled "Datasets https://github.com/NP-NET-research/PL-VEL". It does not explicitly state that the source code for the methodology described in the paper is available at this link or any other location.
Open Datasets Yes We have constructed the Mask OVEN-Wiki dataset through an entirely automatic reverse region-entity annotation framework. This dataset contains over 5 million annotations aligning pixel-level regions with entity-level labels, which will advance visual understanding toward fine-grained recognition. Datasets https://github.com/NP-NET-research/PL-VEL
Dataset Splits Yes Table 2: Statistics of the Mask OVEN-Wiki (each Train/Val/Test cell lists Entity split / Query split counts):

                    Train                 Val               Test               Wiki Set    Human Set
# SEEN entities     7,943 / 2,470         1,604 / 199       7,943 / 2,339      8,733       2,015
# SEEN examples     4,464,748 / 23,514    51,906 / 588      291,327 / 7,460    8,733       12,057
# UNSEEN entities   0 / 0                 1,588 / 433       7,944 / 3,096      1,956,412   2,429
# UNSEEN examples   0 / 0                 56,549 / 1,406    316,817 / 7,979    1,956,412   11,100
# Total examples    4,464,748 / 23,514    108,455 / 1,964   608,144 / 15,439   1,965,145   23,157
Hardware Specification No The paper mentions "computational resource constraints" but does not specify any particular hardware components such as GPU or CPU models used for experiments.
Software Dependencies No The paper mentions several models and approaches like "ConvNeXt CLIP", "FastSAM", "LoRA", and "Vicuna" and cites their respective papers, but it does not provide specific version numbers for these software components or other key libraries/frameworks (e.g., Python, PyTorch versions).
Experiment Setup Yes The pre-train stage used about 2 million wiki split samples. Due to computational resource constraints and the large size of the dataset (approximately 4.5 million samples), we limited the number of annotated samples per entity to fewer than 50 during the fine-tuning stage. As a result, we used about 7% of the total samples (approximately 0.3 million) in the fine-tuning stage. In addition, all input images were uniformly preprocessed to 512×512. The length of the ALD code is limited to 4 tokens. We have implemented a two-stage training strategy for our model. The vision encoder ConvNeXt CLIP (Liu et al. 2022) and the semantic tokenizer FastSAM (Zhao et al. 2023) remain frozen, while the mask-aware visual extractor M and the visual-language projector are fully fine-tuned. The base LLM is fine-tuned with the LoRA (Hu et al. 2022) approach. Both stages employ autoregressive language modeling loss to predict the next token (Liu et al. 2023a).
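The per-entity sampling cap described in this setup (fewer than 50 annotated samples per entity during fine-tuning) can be sketched in plain Python. This is a minimal illustration, not the authors' code; the function name, the `"entity"` dict key, and the fixed seed are assumptions:

```python
import random
from collections import defaultdict

def cap_per_entity(samples, max_per_entity=50, seed=0):
    """Keep at most `max_per_entity` annotations per entity.

    `samples` is assumed to be a list of dicts with an "entity" key.
    Capping per-entity counts is how the paper reduces the ~4.5M
    training annotations to roughly 0.3M for the fine-tuning stage.
    """
    rng = random.Random(seed)  # fixed seed for a reproducible subsample
    by_entity = defaultdict(list)
    for s in samples:
        by_entity[s["entity"]].append(s)
    kept = []
    for group in by_entity.values():
        if len(group) > max_per_entity:
            group = rng.sample(group, max_per_entity)
        kept.extend(group)
    return kept
```

With 7,943 seen entities and a cap of 50, this bounds the fine-tuning pool at under 0.4 million samples, consistent with the ~0.3 million figure reported above.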