Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Residual-MPPI: Online Policy Customization for Continuous Control

Authors: Pengcheng Wang, Chenran Li, Catherine Weaver, Kenta Kawamoto, Masayoshi Tomizuka, Chen Tang, Wei Zhan

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Through our experiments, we demonstrate that the proposed Residual-MPPI algorithm can accomplish the few-shot/zero-shot online policy customization task effectively, including customizing the champion-level racing agent, Gran Turismo Sophy (GT Sophy) 1.0, in the challenging car racing scenario, Gran Turismo Sport (GTS) environment. ... Our experiments in MuJoCo demonstrate that our method can effectively achieve zero-shot policy customization with a provided offline trained dynamics model. ... In this section, we evaluate the performance of the proposed algorithms in different environments selected from MuJoCo (Todorov et al., 2012). In Sec. 4.1, we provide the configurations of our experiments, including the settings of policy customization tasks in different environments, baselines, and evaluation metrics. In Sec. 4.2, we present and analyze the experimental results.
Researcher Affiliation Collaboration 1Department of Mechanical Engineering, University of California, Berkeley. 2Department of Computer Science, University of Texas at Austin. 3Sony Research Inc., Japan.
Pseudocode Yes Algorithm 1 Residual-MPPI. Input: Current state x_0; Output: Action sequence Û = (û_0, û_1, ..., û_{T-1}). Require: System dynamics F; Number of samples K; Planning horizon T; Prior policy π; Disturbance covariance matrix Σ; Add-on reward r_R; Temperature scalar λ; Discounted factor γ
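The quoted inputs (dynamics F, K samples, horizon T, prior policy π, disturbance covariance, add-on reward r_R, temperature λ, discount γ) can be sketched as a single planning step in NumPy. This is a minimal illustrative sketch, not the authors' implementation: `F`, `prior_mean`, `prior_logprob`, and `r_add` are assumed callables, the disturbance covariance is simplified to isotropic noise `sigma`, and `omega` stands in for the prior log-likelihood weight ω listed in the paper's planning tables.

```python
import numpy as np

def residual_mppi_step(x0, F, prior_mean, prior_logprob, r_add,
                       K=128, T=15, sigma=0.3, lam=1.0, gamma=0.99,
                       omega=1.0, rng=None):
    """One planning step of a Residual-MPPI-style controller (sketch).

    F(x, u)            -> next state (learned or known dynamics)
    prior_mean(x)      -> mean action of the prior policy pi
    prior_logprob(x,u) -> log pi(u | x)
    r_add(x, u)        -> add-on reward r_R
    """
    if rng is None:
        rng = np.random.default_rng(0)

    # Nominal plan: roll the prior policy forward from x0.
    nominal = []
    x = np.asarray(x0, dtype=float)
    for _ in range(T):
        u = prior_mean(x)
        nominal.append(u)
        x = F(x, u)
    nominal = np.asarray(nominal)                    # shape (T, u_dim)
    u_dim = nominal.shape[1]

    # Sample K perturbed action sequences and score each rollout with the
    # residual objective: discounted prior log-likelihood plus add-on reward.
    eps = rng.normal(0.0, sigma, size=(K, T, u_dim))
    returns = np.zeros(K)
    for k in range(K):
        x = np.asarray(x0, dtype=float)
        for t in range(T):
            u = nominal[t] + eps[k, t]
            returns[k] += gamma**t * (omega * prior_logprob(x, u) + r_add(x, u))
            x = F(x, u)

    # MPPI update: exponentially weighted average of the perturbations.
    w = np.exp((returns - returns.max()) / lam)
    w /= w.sum()
    return nominal + np.tensordot(w, eps, axes=1)    # customized plan (T, u_dim)
```

In this sketch the customized plan stays anchored to the prior policy (through both the nominal rollout and the log-likelihood term) while the softmax weighting tilts it toward trajectories with high add-on reward, which mirrors the residual-objective idea.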
Open Source Code Yes Code for MuJoCo experiments is included in the supplementary and will be open-sourced upon acceptance. Demo videos and code are available on our website: https://sites.google.com/view/residual-mppi.
Open Datasets No The paper primarily uses simulation environments (MuJoCo, Gran Turismo Sport) for experiments. While these environments generate data, there is no mention of external, pre-existing, or publicly available datasets used or released by the authors for training or evaluation, nor are specific links or citations provided for such datasets.
Dataset Splits No The paper conducts experiments in simulation environments (MuJoCo, Gran Turismo Sport) where data is generated through interaction. It refers to 'data collection' for dynamics model training (2K steps for MuJoCo, 2,000 and 100 laps for GTS) but does not provide specific train/test/validation splits for any pre-existing or collected datasets in the traditional sense. The evaluation results are given as means and standard deviations over running episodes or laps, which describes evaluation methodology rather than dataset partitioning.
Hardware Specification Yes All the experiments were conducted on Ubuntu 22.04 with Intel Core i9-9920X CPU @ 3.50GHz × 24, NVIDIA GeForce RTX 2080 Ti, and 125 GB RAM. ... All the GTS experiments were conducted on PlayStation 5 (PS5) and Ubuntu 20.04 with 12th Gen Intel Core i9-12900F × 24, NVIDIA GeForce RTX 3090, and 126 GB RAM.
Software Dependencies No The prior policies were constructed using Soft Actor-Critic (SAC) with the Stable-Baselines3 (Raffin et al., 2021) implementation. ... All the experiments were conducted on Ubuntu 22.04 ... All the GTS experiments were conducted on PlayStation 5 (PS5) and Ubuntu 20.04. The paper names Stable-Baselines3 and the Ubuntu versions, but does not provide version numbers for Stable-Baselines3 or other key libraries and frameworks.
Experiment Setup Yes Table 3: RL Prior Policy Training Hyperparameters (Hidden Layers (256, 256), Activation ReLU, γ 0.99, Learning Rate 3e-4, Batch Size 256, Optimizer Adam, etc.). Table 4: MuJoCo Offline Dynamics Training Hyperparameters (Hidden Layers (256, 256, 256, 256), Activation Mish, Learning Rate 1e-5, Batch Size 256, Optimizer Adam, etc.). Table 5: Planning Hyperparameters in MuJoCo Tasks (Horizon, Samples, Noise std., ω, γ, λ with specific values for each environment). Table 6: GTS Offline Dynamics Training Hyperparameters (History Length 8, Hidden Layers (2048, 2048, 2048), Activation Mish, Learning Rate 1e-5, Batch Size 256, Optimizer Adam, etc.). Table 7: Planning Hyperparameters in GTS (Horizon, Samples, Noise std., Top Ratio, ω, γ, λ with specific values). Table 8: Residual-SAC Training Hyperparameters (Hidden Layers (2048, 2048, 2048), Activation ReLU, Learning Rate 1e-4, Batch Size 256, Optimizer Adam, etc.).
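Since the paper trains its prior policies with SAC via Stable-Baselines3, the Table 3 hyperparameters quoted above map naturally onto SAC constructor arguments. The fragment below is an illustrative mapping only, not the authors' training script; the environment object and the final `SAC(...)` call are assumed and shown commented out.

```python
# Illustrative mapping of the quoted Table 3 prior-policy hyperparameters
# onto Stable-Baselines3 SAC keyword arguments (not the authors' script).
sac_prior_kwargs = dict(
    policy="MlpPolicy",
    learning_rate=3e-4,        # Table 3: Learning Rate 3e-4
    batch_size=256,            # Table 3: Batch Size 256
    gamma=0.99,                # Table 3: discount factor γ = 0.99
    policy_kwargs=dict(
        net_arch=[256, 256],   # Table 3: Hidden Layers (256, 256)
        # Activation ReLU (Table 3) is already SB3's default for SAC.
    ),
)

# Usage (assumes stable_baselines3 is installed and `env` exists):
# from stable_baselines3 import SAC
# model = SAC(env=env, **sac_prior_kwargs)
```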