PixelWorld: Towards Perceiving Everything as Pixels

Authors: Zhiheng Lyu, Xueguang Ma, Wenhu Chen

TMLR 2025

Reproducibility assessment — each entry lists the variable, the assessed result, and the supporting excerpt from the paper.
Research Type: Experimental. Excerpt: "Experiments across multiple benchmarks show that PEAP achieves comparable performance to token-based approaches on semantic understanding tasks, suggesting that vision transformers can partially capture global textual semantics without explicit tokenization. In PixelWorld, we select 10 widely used benchmarks covering a diverse range of modalities and task scenarios. For each dataset, we construct both traditional token-based and pixel-based (PEAP) input formats using image synthesis and OCR techniques (see Table 1). We then evaluate vision-language models of varying scales, from Qwen2VL-2B to GPT-4o."
Researcher Affiliation: Academia. Zhiheng Lyu (1,2), Xueguang Ma (1), Wenhu Chen (1,2) — (1) University of Waterloo, (2) Vector Institute, Toronto.
Pseudocode: No. The paper describes the PEAP-Fast algorithm conceptually in paragraph form (e.g., "To reduce redundancy in visual inputs, we propose PEAP-Fast, which first identifies empty patches via a simple variance-based threshold...") but does not provide a clearly labeled pseudocode or algorithm block.
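Since the paper gives PEAP-Fast only in prose, the variance-based empty-patch filter it describes might be sketched as follows. This is our reconstruction, not the authors' code; the patch size and threshold values are illustrative assumptions.

```python
import numpy as np

def drop_empty_patches(image, patch_size=28, var_threshold=1e-3):
    """Sketch of a PEAP-Fast-style filter: tile the image into square
    patches and keep only those whose pixel variance exceeds a threshold,
    i.e. patches that plausibly contain rendered text rather than blank
    background. Returns the (row, col) offsets of the kept patches."""
    h, w = image.shape[:2]
    kept = []
    for y in range(0, h - patch_size + 1, patch_size):
        for x in range(0, w - patch_size + 1, patch_size):
            patch = image[y:y + patch_size, x:x + patch_size]
            if patch.var() > var_threshold:  # non-empty patch
                kept.append((y, x))
    return kept
```

In practice the kept patches (rather than the full grid) would be fed to the vision encoder, which is where the reported inference-time savings come from.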
Open Source Code: Yes. Excerpt: "The benchmark and code are publicly released to facilitate standardized comparison and future research on multimodal perception."
Open Datasets: Yes. Excerpt: "In PixelWorld, we select 10 widely used benchmarks covering a diverse range of modalities and task scenarios. For each dataset, we construct both traditional token-based and pixel-based (PEAP) input formats using image synthesis and OCR techniques (see Table 1)."

| Dataset | Size | Task | Modality Transfer | Split |
|---|---|---|---|---|
| GLUE (Wang, 2018) | 59,879 | Natural language understanding | Synthesis | test |
| SuperGLUE (Sarlin et al., 2020) | 19,294 | Natural language understanding | Synthesis | test |
| MMLU-Pro (Wang et al., 2024b) | 12,032 | Domain knowledge and reasoning | Synthesis | test |
| ARC (Clark et al., 2018) | 3,548 | Science question answering | Synthesis | test |
| GSM8K (Cobbe et al., 2021) | 1,319 | Math problem solving | Synthesis | test |
| MBPP (Austin et al., 2021) | 757 | Programming tasks | Synthesis | test |
| TableBench (Wu et al., 2024) | 888 | Table data understanding and analysis | Synthesis | test |
| MathVerse (Zhang et al., 2025) | 788 | Math and visual reasoning | Natural | test |
| MMMU-Pro (Yue et al., 2024) | 1,730 | Multimodal reasoning | Synthesis | test |
| SlidesVQA (Tanaka et al., 2023) | 2,136 | Multimodal question answering | OCR | test |
| Wiki-SS (Ma et al., 2024) | 3,000 | Multimodal retrieval question answering | OCR | train |
Dataset Splits: Yes. Excerpt: "Table 1: Overview of datasets categorized by modality, usage, size, and split. Modality Transfer means the method used to adapt the dataset into the counterpart modality. For OCR, we adopt the result from the original datasets. For Wiki-SS-QA, since the positive document of the test set is not released, we randomly subsample 3,000 training data points for evaluation."
Hardware Specification: Yes. Excerpt: "Table 3: Inference time (s) of Qwen2VL-7B on the SuperGLUE dataset with a single A100 server, by PEAP and PEAP-Fast."
Software Dependencies: No. The paper mentions using the Python package dataframe_image for rendering structured data but does not provide specific version numbers for any software dependencies.
Experiment Setup: Yes. Excerpt: "By default, we employ Direct Prompting; however, for more complex and mathematical datasets such as MBPP (Austin et al., 2021), MMLU-Pro (Wang et al., 2024b), and MathVerse (Zhang et al., 2025), we adopt Chain-of-Thought (CoT) prompting to enhance performance. All evaluations are conducted in a zero-shot manner to mitigate potential performance degradation caused by the sensitivity of instruction-tuned large models to few-shot prompting. ... Image widths were adaptively adjusted between 512 and 1024 pixels based on text length, with a fixed height of 256 pixels. Font sizes ranged from 15 to 25 points, and padding varied from 5 to 30 pixels. To enhance robustness, we applied various types of noise, including radial, horizontal, vertical, and multi-Gaussian noise, as well as high-frequency Gaussian noise to simulate distortions commonly introduced by real-world cameras."
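The text-to-image synthesis step described in that excerpt could be sketched as below. This is a minimal reconstruction under stated assumptions: the width heuristic (`min_w + 2 * len(text)`), the default font, and the single additive Gaussian-noise stage are our simplifications, not the paper's exact pipeline (which also varies font size between 15 and 25 pt and applies radial, horizontal, vertical, and multi-Gaussian noise).

```python
import numpy as np
from PIL import Image, ImageDraw, ImageFont

def render_text(text, min_w=512, max_w=1024, height=256,
                padding=10, noise_std=8.0, seed=0):
    """Render a text string onto a grayscale canvas, PixelWorld-style:
    width grows with text length but is clamped to [min_w, max_w],
    height is fixed, and additive Gaussian noise mimics camera
    distortion. The width heuristic here is an illustrative assumption."""
    width = int(np.clip(min_w + 2 * len(text), min_w, max_w))
    img = Image.new("L", (width, height), color=255)  # white background
    draw = ImageDraw.Draw(img)
    font = ImageFont.load_default()  # paper varies font size 15-25 pt
    draw.text((padding, padding), text, fill=0, font=font)
    # Additive Gaussian noise, clipped back into valid pixel range.
    arr = np.asarray(img, dtype=np.float32)
    rng = np.random.default_rng(seed)
    arr += rng.normal(0.0, noise_std, arr.shape)
    return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))
```

A long input saturates at the 1024-pixel cap, e.g. `render_text("x" * 300)` yields a 1024x256 image, while short prompts stay near the 512-pixel floor.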