Are Large Vision Language Models Good Game Players?

Authors: Xinyu Wang, Bohan Zhuang, Qi Wu

ICLR 2025

Reproducibility assessment (variable: result, with the supporting LLM response):

Research Type: Experimental
  LLM response: "Based on this framework, we conduct extensive experiments that explore the limitations of current LVLMs, such as handling long structured outputs and perceiving detailed and dense elements. Code and data are publicly available at https://github.com/xinkewang/LVLM-Playground."

Researcher Affiliation: Academia
  LLM response: "(1) The University of Adelaide, Australia; (2) Zhejiang University, China"

Pseudocode: No
  LLM response: "The paper describes methodologies and calculations using formulas but does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks."

Open Source Code: Yes
  LLM response: "Code and data are publicly available at https://github.com/xinkewang/LVLM-Playground."

Open Datasets: Yes
  LLM response: "Code and data are publicly available at https://github.com/xinkewang/LVLM-Playground."

Dataset Splits: Yes
  LLM response: "For the Perceiving, Question Answering, and Rule-Following tasks, we utilized the simulator to generate 2,000 samples for each, followed by offline evaluation. For the End-to-End playing task, we conducted online evaluations, running 100 gameplays per model."

Hardware Specification: No
  LLM response: "The paper does not provide specific hardware details (e.g., GPU models, CPU types) used for running the experiments. It only lists the models that were evaluated."

Software Dependencies: No
  LLM response: "The paper mentions support for 'commercial models, such as the OpenAI API, and open-source models, like those from the Hugging Face Transformers library' and uses the Stockfish engine for the chess AI, but it does not provide specific version numbers for these or other software components."

Experiment Setup: Yes
  LLM response: "All models were evaluated under the same conditions, including identical settings for maximum new tokens and task prompts."
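The experiment-setup finding above (identical maximum-new-tokens and task-prompt settings across all evaluated models) can be illustrated with a minimal sketch. This is not the paper's code: the config values, model names, prompt template, and the `build_eval_requests` helper are all assumptions for illustration, showing one way to guarantee every model sees the same prompt and decoding settings.

```python
# Illustrative sketch only: one shared decoding config applied to every
# (model, sample) pair, so no model is evaluated under different settings.
# All names and values below are hypothetical, not taken from the paper.

SHARED_GENERATION_CONFIG = {"max_new_tokens": 512, "temperature": 0.0}

def build_eval_requests(models, task_prompt, samples):
    """Pair every model with every sample, using the same prompt
    template and the same shared decoding configuration."""
    return [
        {
            "model": model,
            "prompt": task_prompt.format(sample=sample),
            **SHARED_GENERATION_CONFIG,
        }
        for model in models
        for sample in samples
    ]

# Example: 2 hypothetical models x 3 hypothetical board samples -> 6 requests,
# all carrying identical max_new_tokens and temperature settings.
requests = build_eval_requests(
    models=["model-a", "model-b"],
    task_prompt="Describe the board state: {sample}",
    samples=["board_001", "board_002", "board_003"],
)
```

Centralizing the decoding settings in one dictionary makes the "same conditions" claim auditable: a reviewer can check a single definition rather than each model's call site.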