D2 Actor Critic: Diffusion Actor Meets Distributional Critic

Authors: Lunjun Zhang, Shuo Han, Hanrui Lyu, Bradly C. Stadie

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In these experiments, we evaluate the performance of D2AC across a range of domains to assess its generality, efficiency, and behavioral characteristics. We also investigate how D2AC compares to strong model-based baselines in selected tasks, aiming to understand how effectively it bridges the gap between model-free and model-based approaches.
Researcher Affiliation | Academia | Lunjun Zhang, Department of Computer Science, University of Toronto; Shuo Han, Department of Statistics, Northwestern University; Hanrui Lyu, Department of Statistics, Northwestern University; Bradly C. Stadie, Department of Statistics, Northwestern University (email addresses omitted).
Pseudocode | Yes | Algorithm 1 (Policy Optimization under Tanh Action Squashing): procedure Update(ϕ, α | s, a, σ, Q) [...]; Algorithm 2 (D2 Actor Critic): procedure ActorCritic(D, θ, ϕ, α)
Open Source Code | No | Videos of our policy in action can be found here¹. ¹https://d2ac-actor-critic.github.io/ (This is a project/demo page, not an explicit code repository.)
Open Datasets | Yes | We evaluate D2AC across three primary domains: (i) the DeepMind Control Suite (Tassa et al., 2018), which consists mainly of locomotion tasks with dense rewards, (ii) the Multi-Goal RL environments from Plappert et al. (2018), which include robotic manipulation tasks with sparse rewards, and (iii) a predator-prey environment inspired by biological survival dynamics (Lai et al., 2024), which emphasizes adaptive behavior in dynamic and high-stakes scenarios.
Dataset Splits | Yes | We train three agents (TD-MPC2, SAC, and our proposed D2AC) for 50,000 steps on Map Level 5 and evaluate: [...] (iii) Zero-Shot Transfer to unseen Map Level 9 (Table 2).
Hardware Specification | No | Figure 10: Wall-clock runtime comparison against TD-MPC2 and SAC on a single GPU. The figure highlights the differing computational profiles across tasks of varying complexity. While D2AC is substantially faster than the model-based TD-MPC2, its initial learning speed is surpassed by the simpler SAC on some tasks, illustrating a trade-off between computational overhead and the capacity for sustained learning on complex problems. (Only "single GPU" is mentioned, without a specific model, which is insufficient.)
Software Dependencies | No | The paper does not explicitly state software versions for key dependencies such as Python, PyTorch, or other libraries. It mentions only the optimizer (AdamW) and the non-linearity (Layer Normalization + ReLU), without versions for them or for the frameworks they belong to.
Experiment Setup | Yes | Table 4 ("Hyperparameters for D2AC") provides specific values for: batch size; optimizer; learning rates for the policy, the critic, and the temperature α; α initialization; weight decay; number of hidden layers (all networks); number of hidden units per layer; non-linearity; γ; λ_ent; Polyak coefficient for the target network; target network update interval; ratio between environment and optimization steps; initial random trajectories; number of parallel workers; replay buffer size; [V_min, V_max]; number of bins (size of the support z_q); σ_min, σ_max, σ_data; ρ; M; M_train; noise-level conditioning; and embedding size for noise-level conditioning.
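The "Tanh Action Squashing" named in Algorithm 1 is a standard device (familiar from SAC-style actor-critics) for bounding actions in (-1, 1) while keeping a tractable log-probability. A minimal pure-Python sketch of the idea, assuming a one-dimensional Gaussian policy; the function and variable names here are illustrative, not the authors' code:

```python
import math
import random

def squashed_gaussian_sample(mean, std, rng=random):
    """Sample a tanh-squashed Gaussian action and its corrected log-prob.

    Illustrative sketch of tanh action squashing; not the paper's code.
    """
    u = rng.gauss(mean, std)          # pre-squash Gaussian sample
    a = math.tanh(u)                  # action bounded in (-1, 1)
    # Gaussian log-density of the pre-squash sample u
    logp_u = -0.5 * ((u - mean) / std) ** 2 - math.log(std * math.sqrt(2.0 * math.pi))
    # change-of-variables correction for the squashing:
    # log p(a) = log p(u) - log(1 - tanh(u)^2)
    logp_a = logp_u - math.log(1.0 - a * a + 1e-6)
    return a, logp_a
```

The small 1e-6 term keeps the log finite when the action saturates near ±1, a common numerical guard in such implementations.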
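The hyperparameter names quoted from Table 4 can be collected into a single configuration object. A sketch of such a layout follows; every value is left as a None placeholder because the excerpt quotes the parameter names but not their settings, with only the optimizer (AdamW) and non-linearity stated elsewhere in the paper's text:

```python
# Placeholder config mirroring the hyperparameter names quoted from Table 4.
# All None values are stand-ins; the actual numbers are in the paper.
d2ac_config = {
    "batch_size": None,
    "optimizer": "AdamW",                 # stated in the paper
    "lr_policy": None,
    "lr_critic": None,
    "lr_temperature_alpha": None,
    "alpha_init": None,
    "weight_decay": None,
    "num_hidden_layers": None,            # all networks
    "hidden_units_per_layer": None,
    "nonlinearity": "LayerNorm + ReLU",   # stated in the paper
    "gamma": None,
    "lambda_ent": None,
    "polyak_target": None,
    "target_update_interval": None,
    "env_vs_opt_step_ratio": None,
    "initial_random_trajectories": None,
    "num_parallel_workers": None,
    "replay_buffer_size": None,
    "v_min_v_max": None,
    "num_bins": None,                     # size of the support z_q
    "sigma_min": None,
    "sigma_max": None,
    "sigma_data": None,
    "rho": None,
    "M": None,
    "M_train": None,
    "noise_level_conditioning": None,
    "noise_cond_embed_size": None,
}
```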