Learning Policy Committees for Effective Personalization in MDPs with Diverse Tasks

Authors: Luise Ge, Michael Lanier, Anindya Sarkar, Bengisu Guresti, Chongjie Zhang, Yevgeniy Vorobeychik

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our experiments on MuJoCo and Meta-World show that the proposed approach outperforms state-of-the-art multi-task, meta-, and task clustering baselines in training, generalization, and few-shot learning, often by a large margin."
Researcher Affiliation | Academia | "Department of Computer Science & Engineering, Washington University in St. Louis. Correspondence to: Luise Ge <EMAIL>."
Pseudocode | Yes | "The full pseudocode for the Greedy Intersection Algorithm (GIA) is provided as Algorithm 1. Algorithm 1 (Greedy Intersection) — Input: T = {θ_i}_{i=1}^N, ϵ > 0, K ≥ 1; Output: parameter cover C."
Open Source Code | Yes | "Our code is available at https://github.com/CERL-WUSTL/PACMAN/."
Open Datasets | Yes | "Our experiments on MuJoCo and Meta-World show that the proposed approach outperforms state-of-the-art multi-task, meta-, and task clustering baselines in training, generalization, and few-shot learning, often by a large margin."
Dataset Splits | Yes | "MuJoCo: We selected two commonly used MuJoCo environments... use 100 tasks for training and another 100 for testing (in both zero-shot and few-shot settings)... Meta-World: We focus on the set of robotic manipulation tasks in MT50, of which we use 30 for training and 20 for testing."
Hardware Specification | Yes | "To illustrate, our Meta-World experiments show that training a single policy for 1 million steps necessitates approximately 40 hours using an A40 GPU."
Software Dependencies | No | The text references a specific LLM, Phi-3 Mini-128k Instruct (Microsoft, 2024), but does not specify the programming languages, libraries, or frameworks (with version numbers) used to implement the proposed method.
Experiment Setup | Yes | "For clustering, we use K = 3, ϵ = 0.6, and use the gradient-based approach initialized with the result of the Greedy Intersection algorithm. For few-shot learning, we fine-tune all methods for 100 epochs. Meta-World ... We use K = 3 and ϵ = 0.7. Performance is a moving average success rate for the last 2000 evaluation episodes over 3 seeds."
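The report quotes an Algorithm 1 that greedily builds a size-K "parameter cover" C from candidate task parameters T = {θ_i}. The paper's exact coverage criterion is not reproduced here, so the following is only a minimal sketch of a greedy set-cover routine in that spirit; the `covers` predicate (which tasks a candidate θ serves within tolerance ϵ) and the function name are assumptions, not the authors' implementation.

```python
def greedy_intersection(thetas, covers, K):
    """Hypothetical sketch of a greedy parameter-cover construction.

    thetas: list of candidate parameters (e.g., one per training task)
    covers: callable mapping a candidate theta to the set of task indices
            it covers (assumed: tasks within an eps-tolerance of theta)
    K:      maximum committee size
    """
    uncovered = set(range(len(thetas)))
    C = []
    while uncovered and len(C) < K:
        # Greedy step: pick the candidate covering the most uncovered tasks.
        best = max(thetas, key=lambda th: len(covers(th) & uncovered))
        gain = covers(best) & uncovered
        if not gain:  # no candidate helps any remaining task
            break
        C.append(best)
        uncovered -= gain
    return C

# Toy usage: candidates 0, 1, 2 each cover task indices within distance 1.
committee = greedy_intersection(
    [0, 1, 2],
    lambda th: {i for i in range(3) if abs(i - th) <= 1},
    K=2,
)
```

Under the quoted setup the result of such a routine would then seed the gradient-based clustering step (K = 3).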
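The evaluation protocol reports "a moving average success rate for the last 2000 evaluation episodes." A minimal sketch of that statistic, assuming binary per-episode outcomes (the function name and interface are illustrative, not from the paper's code):

```python
from collections import deque

def moving_success_rate(outcomes, window=2000):
    """Running success rate over the most recent `window` episodes.

    outcomes: iterable of 0/1 episode results (1 = success)
    Returns one rate per episode, averaged over at most `window` results.
    """
    buf = deque(maxlen=window)  # automatically drops episodes older than `window`
    rates = []
    for o in outcomes:
        buf.append(o)
        rates.append(sum(buf) / len(buf))
    return rates
```

Averaging such curves over 3 seeds would then give the reported performance numbers.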