COMBO: Compositional World Models for Embodied Multi-Agent Cooperation
Authors: Hongxin Zhang, Zeyuan Wang, Qiushi Lyu, Zheyuan Zhang, Sunli Chen, Tianmin Shu, Behzad Dariush, Kwonjoon Lee, Yilun Du, Chuang Gan
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our methods on three challenging benchmarks with 2-4 agents. The results show our compositional world model is effective and the framework enables the embodied agents to cooperate efficiently with different agents across various tasks and an arbitrary number of agents, showing the promising future of our proposed methods. More videos can be found at https://umass-embodied-agi.github.io/COMBO/. |
| Researcher Affiliation | Collaboration | Hongxin Zhang (1), Zeyuan Wang (2), Qiushi Lyu (3), Zheyuan Zhang (1), Sunli Chen (1), Tianmin Shu (4), Behzad Dariush (5), Kwonjoon Lee (5), Yilun Du (6), Chuang Gan (1); (1) University of Massachusetts Amherst, (2) IIIS, Tsinghua University, (3) Peking University, (4) Johns Hopkins University, (5) Honda Research Institute USA, (6) MIT |
| Pseudocode | Yes | **Algorithm 1** COMBO Planning Procedure for Agent i.<br>1: **Input:** estimated world state s_0 from o_i, task goal G<br>2: **Sub-modules:** Action Proposer AP(s, G), Intent Tracker IT(s, G), Compositional World Model CWM(s, a), Outcome Evaluator OE(s, G)<br>3: **Parameters:** Action Proposals P, Planning Beams B, Rollout Depths D<br>4: plans ← [[s_0]]<br>5: new_plans ← [[s_0]]<br>6: **for** d = 1 … D **do**<br>7: plans ← new_plans[1…B] # keep only the B plan beams with the best scores<br>8: new_plans ← []<br>9: **for** plan **in** plans **do**<br>10: s ← plan[-1] # get the last image state in the plan beam<br>11: a_{i,1:P} ← AP(s, G) # generate P different action proposals<br>12: a_{-i} ← IT(s, G) # infer other agents' possible actions<br>13: **for** p = 1 … P **do**<br>14: a ← (a_{i,p}, a_{-i})<br>15: s_next ← CWM(s, a) # simulate next state conditioned on joint actions<br>16: new_plans.append(plan + s_next)<br>17: **end for**<br>18: **end for**<br>19: new_plans ← sorted(new_plans, OE(s, G)) # sort plans by the score of the final state<br>20: **end for**<br>21: plan ← new_plans[1] # return the plan with the best score |
| Open Source Code | No | More videos can be found at https://umass-embodied-agi.github.io/COMBO/. |
| Open Datasets | Yes | We also adapt the 2D-Fetch Q challenge from Wang et al. (2022) with a visual observation space and high-level action space to evaluate our method. |
| Dataset Splits | No | Dataset Collection We collect random rollouts with a scripted planner and generate 107k videos of TDW-Game and 50k videos of TDW-Cook. For the Intent Tracker, we collected 40k short rollouts consisting of three images of consecutive observations and a textual description of the next actions of all the agents, converted by a template given the action history. For the Outcome Evaluator, we collected 138k data consisting of one image of the observation and a textual description of the state of each object in the image and the heuristic score, converted by a template given each object's location. |
| Hardware Specification | Yes | Compute We train the world model for 50k steps in the first stage with a batch size of 384 on 192 V100 GPUs in 1 day. Then, we fine-tune the model for 25k steps in the second stage with a batch size of 120 on 120 V100 GPUs in 1 day. Both the inpainting model and the super-resolution model are trained for 60k steps with a batch size of 288 on 24 V100 GPUs in 1 day. Compute For each planning sub-module, we finetune LLaVA-1.5-7B with LoRA for one epoch with a batch size of 144 on 18 V100 GPUs in about 3 hours. |
| Software Dependencies | No | The video diffusion model of the compositional world model is built upon the AVDC (Ko et al., 2023) codebase, with the architectural modifications of introducing a cross-attention layer to the text condition in the ResNet block and replacing the Perceiver with an MLP to enhance the text conditioning. More details are in Appendix B. All vision-language models used in the main experiments are LLaVA-v1.5-7B. We finetune LLaVA-1.5-7B (Liu et al., 2023a) with LoRA (Hu et al., 2021) for one epoch to obtain one shared model for each submodule across all tasks and cooperators. |
| Experiment Setup | Yes | The planning parameters are set to Action Proposals P=3, Planning Beams B=3, and Rollout Depths D=3 unless otherwise specified. We train the world model for 50k steps in the first stage with a batch size of 384 on 192 V100 GPUs in 1 day. Then, we fine-tune the model for 25k steps in the second stage with a batch size of 120 on 120 V100 GPUs in 1 day. Both the inpainting model and the super-resolution model are trained for 60k steps with a batch size of 288 on 24 V100 GPUs in 1 day. We use DDIM sampling across the experiments with guidance weight 5 for the text-guided video diffusion model. |
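
The planning procedure in Algorithm 1 is a beam search over world-model rollouts. A minimal Python sketch of that loop is given below, assuming the four sub-modules are supplied as callables; all names here (`combo_plan`, the stub signatures) are illustrative, not the authors' implementation.

```python
from typing import Any, Callable, List

def combo_plan(
    s0: Any,                      # estimated world state from agent i's observation
    goal: str,                    # task goal G
    action_proposer: Callable,    # AP(s, G) -> candidate actions for agent i
    intent_tracker: Callable,     # IT(s, G) -> inferred actions of the other agents
    world_model: Callable,        # CWM(s, a) -> simulated next state
    outcome_evaluator: Callable,  # OE(s, G) -> scalar score of a state
    num_proposals: int = 3,       # P
    num_beams: int = 3,           # B
    depth: int = 3,               # D
) -> List[Any]:
    new_plans: List[List[Any]] = [[s0]]
    for _ in range(depth):
        # keep only the B plan beams with the best final-state scores
        plans = new_plans[:num_beams]
        new_plans = []
        for plan in plans:
            s = plan[-1]                        # last simulated state in this beam
            own_actions = action_proposer(s, goal)[:num_proposals]
            others = intent_tracker(s, goal)    # other agents' likely actions
            for a_i in own_actions:
                joint = (a_i, others)           # joint action of all agents
                s_next = world_model(s, joint)  # roll the world model forward
                new_plans.append(plan + [s_next])
        # sort candidate plans by the evaluator's score of their final state
        new_plans.sort(key=lambda p: outcome_evaluator(p[-1], goal), reverse=True)
    return new_plans[0]                         # best-scoring plan
```

With integer states and a world model that simply adds the chosen action, the search greedily accumulates the largest proposal at every depth, which makes the beam-pruning behavior easy to verify in isolation.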
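
The setup row mentions DDIM sampling with guidance weight 5 for the text-guided video diffusion model. The sketch below shows deterministic DDIM (eta = 0) with classifier-free guidance at that weight; the model interface, schedule, and function names are assumptions for illustration, not the paper's code.

```python
import numpy as np

def guided_eps(model, x, t, cond, uncond, w=5.0):
    """Classifier-free guidance: push the conditional noise prediction away
    from the unconditional one by weight w (the paper uses w = 5)."""
    e_c = model(x, t, cond)
    e_u = model(x, t, uncond)
    return e_u + w * (e_c - e_u)

def ddim_sample(model, shape, cond, uncond, alphas_bar, w=5.0, seed=0):
    """Deterministic DDIM sampling over an increasing alpha-bar schedule
    (traversed from the noisiest step down); `model` predicts eps(x_t, t, text)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)              # start from pure noise
    for i in reversed(range(len(alphas_bar))):
        ab_t = alphas_bar[i]
        ab_prev = alphas_bar[i - 1] if i > 0 else 1.0
        eps = guided_eps(model, x, i, cond, uncond, w)
        # predict the clean sample, then step toward it along the schedule
        x0 = (x - np.sqrt(1.0 - ab_t) * eps) / np.sqrt(ab_t)
        x = np.sqrt(ab_prev) * x0 + np.sqrt(1.0 - ab_prev) * eps
    return x
```

Because eta = 0 makes every update deterministic, repeated runs with the same seed and schedule reproduce the same sample, which is convenient when comparing guidance weights.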