Learning Policy Committees for Effective Personalization in MDPs with Diverse Tasks
Authors: Luise Ge, Michael Lanier, Anindya Sarkar, Bengisu Guresti, Chongjie Zhang, Yevgeniy Vorobeychik
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments on MuJoCo and Meta-World show that the proposed approach outperforms state-of-the-art multi-task, meta-, and task clustering baselines in training, generalization, and few-shot learning, often by a large margin. |
| Researcher Affiliation | Academia | Department of Computer Science & Engineering, Washington University in St. Louis. Correspondence to: Luise Ge <EMAIL>. |
| Pseudocode | Yes | The full pseudocode for the Greedy Intersection Algorithm (GIA) is provided as Algorithm 1. Algorithm 1: Greedy Intersection. Input: $T = \{\theta_i\}_{i=1}^{N}$, $\epsilon > 0$, $K \geq 1$. Output: parameter cover $C$. |
| Open Source Code | Yes | Our code is available at https://github.com/CERL-WUSTL/PACMAN/. |
| Open Datasets | Yes | Our experiments on MuJoCo and Meta-World show that the proposed approach outperforms state-of-the-art multi-task, meta-, and task clustering baselines in training, generalization, and few-shot learning, often by a large margin. |
| Dataset Splits | Yes | MuJoCo: We selected two commonly used MuJoCo environments... use 100 tasks for training and another 100 for testing (in both zero-shot and few-shot settings)... Meta-World: We focus on the set of robotic manipulation tasks in MT50, of which we use 30 for training and 20 for testing. |
| Hardware Specification | Yes | To illustrate, our Meta-World experiments show that training a single policy for 1 million steps necessitates approximately 40 hours using an A40 GPU. |
| Software Dependencies | No | The text references a specific LLM model, Phi-3 Mini-128k Instruct (Microsoft, 2024), but does not specify programming languages, libraries, or frameworks with version numbers used for the implementation of the proposed method. |
| Experiment Setup | Yes | For clustering, we use K = 3, ϵ = .6, and use the gradient-based approach initialized with the result of the Greedy Intersection algorithm. For few-shot learning, we fine-tune all methods for 100 epochs. Meta-World ... We use K = 3 and ϵ = .7. Performance is a moving average success rate for the last 2000 evaluation episodes over 3 seeds. |
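
The Experiment Setup row describes performance as a moving average success rate over the last 2000 evaluation episodes, reported over 3 seeds. A minimal sketch of one way to compute such a number is shown below. The function names are hypothetical, and the aggregation order (a windowed average per seed, then a mean across seeds) is an assumption; the paper's code may pool episodes differently.

```python
from statistics import mean


def windowed_success_rate(successes, window=2000):
    """Fraction of successes among the last `window` episodes.

    `successes` is a sequence of 0/1 outcomes in evaluation order.
    If fewer than `window` episodes exist, all of them are used.
    """
    if not successes:
        raise ValueError("no evaluation episodes recorded")
    tail = successes[-window:]
    return sum(tail) / len(tail)


def performance_over_seeds(per_seed_successes, window=2000):
    """Mean across seeds of each seed's windowed success rate (assumed order)."""
    return mean([windowed_success_rate(s, window) for s in per_seed_successes])


# Synthetic outcomes for three seeds (illustrative data, not from the paper).
seeds = [
    [0] * 1000 + [1] * 2000,               # last 2000 episodes all succeed
    [1] * 1000 + [0] * 1000 + [1] * 1000,  # last 2000 are half failures
    [0, 1] * 1500,                         # alternating outcomes
]
score = performance_over_seeds(seeds)
```

Averaging per seed first keeps each seed equally weighted even if the seeds logged different numbers of episodes.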