Graph World Model
Authors: Tao Feng, Yexin Wu, Guanyu Lin, Jiaxuan You
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on 6 tasks from diverse domains, including multi-modal generation and matching, recommendation, graph prediction, multi-agent, retrieval-augmented generation, and planning and optimization, show that the same GWM outperforms or matches domain-specific baselines' performance, benefits from multi-hop structures, and demonstrates strong zero-shot/few-shot capabilities on unseen new tasks. |
| Researcher Affiliation | Academia | Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL, USA. |
| Pseudocode | No | The paper describes methods using mathematical equations and textual explanations, for example: $h_v^{(l)} = f_v\left(\mathrm{CONCAT}\left(h_v^{(l-1)}, \{h_u^{(l-1)}, u \in N(v)\}\right)\right)$ (1). However, it does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured steps in a code-like format. |
| Open Source Code | Yes | Our code for GWM is released at https://github.com/ulab-uiuc/GWM. |
| Open Datasets | Yes | We employ LongBench v2 (Bai et al., 2024), a benchmark specifically designed to test long-context understanding and reasoning. For the multi-modal generation task, we sample figure-caption pairs for training and evaluation. GWM and baseline models are tasked with reconstructing the figures from the provided captions. For the multi-modal matching task, we sample cited-citing pairs using the reference relationship, namely the macro command \ref. In the recommendation task, we conduct extensive evaluations using three Amazon datasets extensively recognized in prior research (McAuley et al., 2015), specifically: Baby, Sports and Outdoors, and Clothing, Shoes, and Jewelry. In the traditional graph prediction task, we evaluate GWM on Cora, PubMed, and HIV datasets. In the multi-agent collaboration task, we evaluate GWM on the AgentClinic (Schmidgall et al., 2024) benchmark, specifically AgentClinic-NEJM, collected from the New England Journal of Medicine (NEJM) case challenges. We employ the expert strategy dataset from the text-based embodied task framework, ALFWorld (Shridhar et al., 2020; Yang et al., 2024). |
| Dataset Splits | Yes | We divide all datasets into training, validation, and test sets in an 8:1:1 ratio. We partition AgentClinic-NEJM into training, validation, and test sets using a 4:1:1 split ratio. For this task, we divided the dataset into training, validation, and test sets in an 8:1:1 ratio. Our total sample size is 10,000, and it is divided into training, validation, and test sets in an 8:1:1 ratio. |
| Hardware Specification | Yes | All the experiments are conducted on NVIDIA A6000 GPUs. |
| Software Dependencies | No | The paper mentions software components such as Llama-3-8B, SD-v15, LLaVA-1.5-7B, CLIP, BERT, and Adam optimizer, but does not provide specific version numbers for underlying programming languages (e.g., Python), frameworks (e.g., PyTorch, TensorFlow), or other libraries, which are critical for full reproducibility. |
| Experiment Setup | Yes | Specifically, for the LLM module, we uniformly use Llama-3-8B, and for stable diffusion, we use SD-v1.5. For the image-to-text model used in GWM-T, we use LLaVA-1.5-7B. The image encoder and text decoder used in GWM-E are CLIP and BERT models, respectively. In addition, our multi-hop projector uses an n-hop MLP to aggregate features from different hops, where n-hop refers to the number of neighborhood hops of the graph nodes used. To ensure the training efficiency of the models, we set the maximum token length for all models at 2k. We use the Adam optimizer (Diederik, 2014) for model training and gradually decay the learning rate with a LambdaLR scheduler. Table 12: Hyper-parameter configuration for model training (Optimizer, Adam epsilon, Adam (β1, β2), Weight decay, Batch size per GPU, Gradient Accumulation, Epochs, Resolution, Learning rate). |
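The message-passing update quoted in the Pseudocode row, $h_v^{(l)} = f_v(\mathrm{CONCAT}(h_v^{(l-1)}, \{h_u^{(l-1)}, u \in N(v)\}))$, can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: $f_v$ is stood in for by an element-wise mean over the node's own features and its neighbors' features, whereas GWM uses learned functions.

```python
def message_passing_layer(h, adj):
    """One GNN layer in the spirit of Eq. (1).
    h: dict node -> feature vector (list of floats)
    adj: dict node -> list of neighbor nodes
    """
    new_h = {}
    for v, feat in h.items():
        # Gather the node's own features plus all neighbor features
        # (the CONCAT(h_v, {h_u}) step of Eq. (1)).
        gathered = [feat] + [h[u] for u in adj.get(v, [])]
        # Stand-in for f_v: element-wise mean of the gathered vectors.
        dim = len(feat)
        new_h[v] = [sum(vec[i] for vec in gathered) / len(gathered)
                    for i in range(dim)]
    return new_h

# Toy 3-node path graph: 0 - 1 - 2
h0 = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
adj = {0: [1], 1: [0, 2], 2: [1]}
h1 = message_passing_layer(h0, adj)
```

Stacking this layer $l$ times gives each node an $l$-hop receptive field, which is what the multi-hop projector described in the Experiment Setup row aggregates over.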
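The 8:1:1 split described in the Dataset Splits row can be reproduced with a few lines of stdlib Python. The ratio and the 10,000-sample total come from the paper; the shuffling and the fixed seed are illustrative assumptions, since the paper does not specify how samples are assigned.

```python
import random

def split_811(items, seed=0):
    """Shuffle and split into train/val/test with an 8:1:1 ratio.
    The seed and shuffle are assumptions for reproducibility."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_train = int(n * 0.8)
    n_val = int(n * 0.1)
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test

# With the paper's 10,000 total samples this yields 8000/1000/1000.
train, val, test = split_811(range(10000))
```

The same helper covers AgentClinic-NEJM by substituting a 4:1:1 ratio for the slice boundaries.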
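The Experiment Setup row says the learning rate is "gradually decayed with a LambdaLR scheduler", i.e. the rate is multiplied each step by a user-supplied factor of the epoch. A minimal sketch of such a schedule follows; the exponential form and the 0.95 decay constant are assumptions, as the paper does not state the actual lambda.

```python
def lambda_lr_factor(epoch, decay=0.95):
    """Multiplicative LR factor per epoch, LambdaLR-style.
    The exponential decay and the 0.95 constant are illustrative
    assumptions, not values reported in the paper."""
    return decay ** epoch

base_lr = 1e-4
# Learning rate over the first three epochs under this schedule.
lrs = [base_lr * lambda_lr_factor(e) for e in range(3)]
```

In PyTorch this corresponds to wrapping the Adam optimizer with `torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lambda_lr_factor)`.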