D2 Actor Critic: Diffusion Actor Meets Distributional Critic
Authors: Lunjun Zhang, Shuo Han, Hanrui Lyu, Bradly C. Stadie
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In these experiments, we evaluate the performance of D2AC across a range of domains to assess its generality, efficiency, and behavioral characteristics. We also investigate how D2AC compares to strong model-based baselines in selected tasks, aiming to understand how effectively it bridges the gap between model-free and model-based approaches. |
| Researcher Affiliation | Academia | Lunjun Zhang, Department of Computer Science, University of Toronto; Shuo Han, Department of Statistics, Northwestern University; Hanrui Lyu, Department of Statistics, Northwestern University; Bradly C. Stadie, Department of Statistics, Northwestern University |
| Pseudocode | Yes | Algorithm 1: Policy Optimization (under Tanh Action Squashing), procedure Update(ϕ, α \| s, a, σ, Q) [...]; Algorithm 2: D2 Actor Critic, procedure ActorCritic(D, θ, ϕ, α) |
| Open Source Code | No | Videos of our policy in action can be found at https://d2ac-actor-critic.github.io/ (a project/demo page, not an explicit code repository). |
| Open Datasets | Yes | We evaluate D2AC across three primary domains: (i) the Deep Mind Control Suite (Tassa et al., 2018), which consists mainly of locomotion tasks with dense rewards, (ii) the Multi-Goal RL environments from Plappert et al. (2018), which include robotic manipulation tasks with sparse rewards, and (iii) a predator prey environment inspired by biological survival dynamics (Lai et al., 2024), which emphasizes adaptive behavior in dynamic and high-stakes scenarios. |
| Dataset Splits | Yes | We train three agents (TD-MPC2, SAC, and our proposed D2AC) for 50,000 steps on Map Level 5 and evaluate: [...] (iii) Zero-Shot Transfer to unseen Map Level 9 (Table 2). |
| Hardware Specification | No | Figure 10: Wall-clock runtime comparison against TD-MPC2 and SAC on a single GPU. The figure highlights the differing computational profiles across tasks of varying complexity. While D2AC is substantially faster than the model-based TD-MPC2, its initial learning speed is surpassed by the simpler SAC on some tasks, illustrating a trade-off between computational overhead and the capacity for sustained learning on complex problems. (Only mentions a "single GPU" without specifying the model, which is insufficient.) |
| Software Dependencies | No | The paper does not state versions for key dependencies such as Python, PyTorch, or other libraries. It only mentions the optimizer (AdamW) and architectural choices (Layer Normalization + ReLU), without the versions of the frameworks they belong to. |
| Experiment Setup | Yes | Table 4: Hyperparameters for D2AC provides specific values for batch size, optimizer, learning rate for policy, learning rate for critic, learning rate for temperature α, α initialization, weight decay, number of hidden layers (all networks), number of hidden units per layer, non-linearity, γ, λ_ent, Polyak for target network, target network update interval, ratio between env vs optimization steps, initial random trajectories, number of parallel workers, replay buffer size, [V_min, V_max], number of bins (size of the support z_q), σ_min, σ_max, σ_data, ρ, M, M_train, noise-level conditioning, and embedding size for noise-level conditioning. |
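The pseudocode row above references Algorithm 1, "Policy Optimization (under Tanh Action Squashing)". The paper's own procedure is not reproduced here, but the standard change-of-variables correction behind tanh squashing can be sketched as follows; the function name, arguments, and NumPy formulation are our assumptions, not the authors' code.

```python
import numpy as np

def tanh_squashed_log_prob(mu, log_std, noise):
    """Sample a tanh-squashed Gaussian action and its log-probability.

    Hypothetical sketch of the tanh action-squashing correction: for
    u ~ N(mu, std) and a = tanh(u), log pi(a) subtracts the Jacobian
    term log(1 - tanh(u)^2) from the Gaussian log-density of u.
    """
    pre_tanh = mu + np.exp(log_std) * noise            # u ~ N(mu, std)
    action = np.tanh(pre_tanh)                         # a = tanh(u), bounded in (-1, 1)
    # Gaussian log-density of the pre-squash sample u
    log_prob = -0.5 * noise**2 - log_std - 0.5 * np.log(2 * np.pi)
    # Numerically stable log(1 - tanh(u)^2) = 2*(log 2 - u - softplus(-2u))
    log_prob -= 2 * (np.log(2) - pre_tanh - np.log1p(np.exp(-2 * pre_tanh)))
    return action, log_prob.sum(axis=-1)
```

At mu = 0, log_std = 0, noise = 0 the Jacobian term vanishes (sech²(0) = 1), so the log-probability reduces to the standard-normal log-density at zero, a quick sanity check for the correction.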
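The hyperparameter row lists a fixed value range [V_min, V_max] and a number of bins for the support z_q, which indicates a categorical (C51-style) distributional critic. A minimal sketch of the standard categorical projection such a critic relies on is given below; the function and argument names are our assumptions, and the paper's exact target construction may differ.

```python
import numpy as np

def categorical_projection(next_probs, rewards, dones, gamma, v_min, v_max, n_bins):
    """Project a Bellman-shifted categorical value distribution onto fixed atoms.

    Hypothetical sketch of the C51-style projection implied by the
    [V_min, V_max] and number-of-bins hyperparameters: each atom is
    shifted by r + gamma * z, clipped to the support, and its mass is
    split between the two neighbouring bins.
    """
    support = np.linspace(v_min, v_max, n_bins)        # fixed atoms z_q
    dz = (v_max - v_min) / (n_bins - 1)                # atom spacing
    proj = np.zeros_like(next_probs)
    for i in range(next_probs.shape[0]):
        # Bellman-shift each atom, then clip to the support range
        tz = np.clip(rewards[i] + gamma * (1.0 - dones[i]) * support, v_min, v_max)
        b = (tz - v_min) / dz                          # fractional bin index
        lo, hi = np.floor(b).astype(int), np.ceil(b).astype(int)
        for j in range(n_bins):
            if lo[j] == hi[j]:                         # lands exactly on an atom
                proj[i, lo[j]] += next_probs[i, j]
            else:                                      # split mass between neighbours
                proj[i, lo[j]] += next_probs[i, j] * (hi[j] - b[j])
                proj[i, hi[j]] += next_probs[i, j] * (b[j] - lo[j])
    return proj
```

For a terminal transition (done = 1) with reward 0, every atom collapses onto the support point closest to 0, and the projected distribution still sums to one; this conservation of probability mass is the key invariant of the projection.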