COMBO: Compositional World Models for Embodied Multi-Agent Cooperation
Authors: Hongxin Zhang, Zeyuan Wang, Qiushi Lyu, Zheyuan Zhang, Sunli Chen, Tianmin Shu, Behzad Dariush, Kwonjoon Lee, Yilun Du, Chuang Gan
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our methods on three challenging benchmarks with 2-4 agents. The results show our compositional world model is effective and the framework enables the embodied agents to cooperate efficiently with different agents across various tasks and an arbitrary number of agents, showing the promising future of our proposed methods. More videos can be found at https://umass-embodied-agi.github.io/COMBO/. |
| Researcher Affiliation | Collaboration | Hongxin Zhang (1), Zeyuan Wang (2), Qiushi Lyu (3), Zheyuan Zhang (1), Sunli Chen (1), Tianmin Shu (4), Behzad Dariush (5), Kwonjoon Lee (5), Yilun Du (6), Chuang Gan (1); (1) University of Massachusetts Amherst, (2) IIIS, Tsinghua University, (3) Peking University, (4) Johns Hopkins University, (5) Honda Research Institute USA, (6) MIT |
| Pseudocode | Yes | **Algorithm 1** COMBO Planning Procedure for Agent i.<br>1: **Input:** estimated world state s_0 from o_i, task goal G<br>2: **Sub-modules:** Action Proposer AP(s, G), Intent Tracker IT(s, G), Compositional World Model CWM(s, a), Outcome Evaluator OE(s, G)<br>3: **Parameters:** Action Proposals P, Planning Beams B, Rollout Depths D<br>4: plans ← [[s_0]]<br>5: new_plans ← [[s_0]]<br>6: **for** d = 1 … D **do**<br>7: plans ← new_plans[1…B] # keep only the B plan beams with the best scores<br>8: new_plans ← []<br>9: **for** plan **in** plans **do**<br>10: s ← plan[-1] # get the last image state in the plan beam<br>11: a_{i,1:P} ← AP(s, G) # generate P different action proposals<br>12: a_{-i} ← IT(s, G) # infer other agents' possible actions<br>13: **for** p = 1 … P **do**<br>14: a ← (a_{i,p}, a_{-i})<br>15: s_next ← CWM(s, a) # simulate next state conditioned on joint actions<br>16: new_plans.append(plan + s_next)<br>17: **end for**<br>18: **end for**<br>19: new_plans ← sorted(new_plans, OE(s, G)) # sort plans by the score of the final state<br>20: **end for**<br>21: plan ← new_plans[1] # return the plan with the best score |
| Open Source Code | No | More videos can be found at https://umass-embodied-agi.github.io/COMBO/. |
| Open Datasets | Yes | We also adapt the 2D-Fetch Q challenge from Wang et al. (2022) with a visual observation space and high-level action space to evaluate our method. |
| Dataset Splits | No | Dataset Collection We collect random rollouts with a scripted planner and generate 107k videos of TDW-Game and 50k videos of TDW-Cook. For the Intent Tracker, we collected 40k short rollouts consisting of three images of consecutive observations and a textual description of the next actions of all the agents, converted by a template given the action history. For the Outcome Evaluator, we collected 138k data consisting of one image of the observation and a textual description of the state of each object in the image and the heuristic score, converted by a template given each object's location. |
| Hardware Specification | Yes | Compute We train the world model for 50k steps in the first stage with a batch size of 384 on 192 V100 GPUs in 1 day. Then, we fine-tune the model for 25k steps in the second stage with a batch size of 120 on 120 V100 GPUs in 1 day. Both the inpainting model and the super-resolution model are trained for 60k steps with a batch size of 288 on 24 V100 GPUs in 1 day. Compute For each planning sub-module, we finetune LLaVA-1.5-7B with LoRA for one epoch with a batch size of 144 on 18 V100 GPUs in about 3 hours. |
| Software Dependencies | No | The video diffusion model of the compositional world model is built upon the AVDC (Ko et al., 2023) codebase, with the architectural modifications of introducing a cross-attention layer to the text condition in the ResNet block and replacing the Perceiver with an MLP to enhance the text conditioning. More details are in Appendix B. All vision-language models used in the main experiments are LLaVA-v1.5-7B. We finetune LLaVA-1.5-7B (Liu et al., 2023a) with LoRA (Hu et al., 2021) for one epoch to obtain one shared model for each submodule across all tasks and cooperators. |
| Experiment Setup | Yes | The planning parameters are set to Action Proposals P=3, Planning Beams B=3, and Rollout Depths D=3 unless otherwise specified. We train the world model for 50k steps in the first stage with a batch size of 384 on 192 V100 GPUs in 1 day. Then, we fine-tune the model for 25k steps in the second stage with a batch size of 120 on 120 V100 GPUs in 1 day. Both the inpainting model and the super-resolution model are trained for 60k steps with a batch size of 288 on 24 V100 GPUs in 1 day. We use DDIM sampling across the experiments with guidance weight 5 for the text-guided video diffusion model. |
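
The planning procedure in Algorithm 1 is a beam search over world-model rollouts. A minimal Python sketch of that loop is given below, assuming the four sub-modules are supplied as callables; all names here (`combo_plan`, the stub signatures) are illustrative, not the authors' implementation.

```python
from typing import Any, Callable, List

def combo_plan(
    s0: Any,                      # estimated world state from agent i's observation
    goal: str,                    # task goal G
    action_proposer: Callable,    # AP(s, G) -> candidate actions for agent i
    intent_tracker: Callable,     # IT(s, G) -> inferred actions of the other agents
    world_model: Callable,        # CWM(s, a) -> simulated next state
    outcome_evaluator: Callable,  # OE(s, G) -> scalar score of a state
    num_proposals: int = 3,       # P
    num_beams: int = 3,           # B
    depth: int = 3,               # D
) -> List[Any]:
    new_plans: List[List[Any]] = [[s0]]
    for _ in range(depth):
        # keep only the B plan beams with the best final-state scores
        plans = new_plans[:num_beams]
        new_plans = []
        for plan in plans:
            s = plan[-1]                        # last simulated state in this beam
            own_actions = action_proposer(s, goal)[:num_proposals]
            others = intent_tracker(s, goal)    # other agents' likely actions
            for a_i in own_actions:
                joint = (a_i, others)           # joint action of all agents
                s_next = world_model(s, joint)  # roll the world model forward
                new_plans.append(plan + [s_next])
        # sort candidate plans by the evaluator's score of their final state
        new_plans.sort(key=lambda p: outcome_evaluator(p[-1], goal), reverse=True)
    return new_plans[0]                         # best-scoring plan
```

With integer states and a world model that simply adds the chosen action, the search greedily accumulates the largest proposal at every depth, which makes the beam-pruning behavior easy to verify in isolation.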
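
The setup row mentions DDIM sampling with guidance weight 5 for the text-guided video diffusion model. The sketch below shows deterministic DDIM (eta = 0) with classifier-free guidance at that weight; the model interface, schedule, and function names are assumptions for illustration, not the paper's code.

```python
import numpy as np

def guided_eps(model, x, t, cond, uncond, w=5.0):
    """Classifier-free guidance: push the conditional noise prediction away
    from the unconditional one by weight w (the paper uses w = 5)."""
    e_c = model(x, t, cond)
    e_u = model(x, t, uncond)
    return e_u + w * (e_c - e_u)

def ddim_sample(model, shape, cond, uncond, alphas_bar, w=5.0, seed=0):
    """Deterministic DDIM sampling over an increasing alpha-bar schedule
    (traversed from the noisiest step down); `model` predicts eps(x_t, t, text)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)              # start from pure noise
    for i in reversed(range(len(alphas_bar))):
        ab_t = alphas_bar[i]
        ab_prev = alphas_bar[i - 1] if i > 0 else 1.0
        eps = guided_eps(model, x, i, cond, uncond, w)
        # predict the clean sample, then step toward it along the schedule
        x0 = (x - np.sqrt(1.0 - ab_t) * eps) / np.sqrt(ab_t)
        x = np.sqrt(ab_prev) * x0 + np.sqrt(1.0 - ab_prev) * eps
    return x
```

Because eta = 0 makes every update deterministic, repeated runs with the same seed and schedule reproduce the same sample, which is convenient when comparing guidance weights.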