Closed-Loop Long-Horizon Robotic Planning via Equilibrium Sequence Modeling
Authors: Jinghan Li, Zhicheng Sun, Yadong Mu
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our method is evaluated on the Virtual Home-Env benchmark, showing advanced performance with improved scaling w.r.t. inference-time computation. Code is available at https://github.com/Singularity0104/equilibrium-planner. [...] 4. Experiments [...] Table 1: Performance on Virtual Home-Env without correction. Our planner achieves state-of-the-art performance in most evaluations. |
| Researcher Affiliation | Academia | 1Peking University, China. Correspondence to: Yadong Mu <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Inference of Equilibrium Planner |
| Open Source Code | Yes | Code is available at https://github.com/Singularity0104/equilibrium-planner. |
| Open Datasets | Yes | Our method is evaluated on the Virtual Home-Env benchmark (Puig et al., 2018; Liao et al., 2019), demonstrating its advantageous performance with better scaling w.r.t. inference computation than tree-based alternatives. |
| Dataset Splits | Yes | We randomly divide the Virtual Home-Env dataset into a training set and a test set in a 50:50 ratio. To analyze the generalizability of our method, we mainly study the following three subsets of the test set: novel scene set, novel task set, and novel scene and task set. Overall, the dataset contains 735 training trajectories, 468 trajectories within the novel task set, 95 trajectories within the novel scene set, and 62 trajectories within the novel scene and task set. |
| Hardware Specification | No | The paper discusses 'Inference TFLOPS' and 'KV cache' for speeding up inference in Figure 5a and section B.3 respectively, but it does not specify any particular hardware components like GPU models (e.g., NVIDIA A100, RTX 2080 Ti) or CPU models used for the experiments. |
| Software Dependencies | No | Our implementation is consistent with the baseline methods by finetuning from Llama 3 8B (Dubey et al., 2024). The paper mentions the specific LLM (Llama 3 8B) used but does not provide specific version numbers for ancillary software dependencies like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | The equilibrium planner is finetuned for 6 iterations with a learning rate of 0.0002. [...] For the world model, we collect all interacting experiences between the planner and the environment, including plans and feedback, and finetune it for 5 epochs using the same learning rate of 0.0002. [...] A greedy LLM sampling strategy is used in later refinement steps until convergence. [...] The ratio of environmental interactions to world model calls is currently set to 1:1. |
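The random 50:50 train/test split described in the Dataset Splits row could be reproduced along these lines; this is a minimal hedged sketch (the function name, seed, and trajectory count are illustrative, not taken from the paper's released code):

```python
import random

def split_dataset(trajectories, seed=0):
    """Randomly split a list of trajectories into train/test at a 50:50 ratio,
    mirroring the split described for Virtual Home-Env. Seed and helper name
    are assumptions for illustration only."""
    rng = random.Random(seed)
    shuffled = trajectories[:]      # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    mid = len(shuffled) // 2        # 50:50 ratio
    return shuffled[:mid], shuffled[mid:]

# With 1470 trajectories, a 50:50 split yields 735 training trajectories,
# matching the training-set size quoted above.
train, test = split_dataset(list(range(1470)))
```

Note that the three reported test subsets (468 + 95 + 62 trajectories) are overlapping or partial views of the test half, so they need not sum to its full size.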