D2 Actor Critic: Diffusion Actor Meets Distributional Critic

Authors: Lunjun Zhang, Shuo Han, Hanrui Lyu, Bradly C. Stadie

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In these experiments, we evaluate the performance of D2AC across a range of domains to assess its generality, efficiency, and behavioral characteristics. We also investigate how D2AC compares to strong model-based baselines in selected tasks, aiming to understand how effectively it bridges the gap between model-free and model-based approaches.
Researcher Affiliation | Academia | Lunjun Zhang, Department of Computer Science, University of Toronto; Shuo Han, Department of Statistics, Northwestern University; Hanrui Lyu, Department of Statistics, Northwestern University; Bradly C. Stadie, Department of Statistics, Northwestern University (email addresses omitted).
Pseudocode | Yes | Algorithm 1 (Policy Optimization under Tanh Action Squashing): procedure Update(ϕ, α | s, a, σ, Q) [...]; Algorithm 2 (D2 Actor Critic): procedure ActorCritic(D, θ, ϕ, α)
Open Source Code | No | Videos of our policy in action can be found here¹. ¹https://d2ac-actor-critic.github.io/ (This is a project/demo page, not an explicit code repository.)
Open Datasets | Yes | We evaluate D2AC across three primary domains: (i) the DeepMind Control Suite (Tassa et al., 2018), which consists mainly of locomotion tasks with dense rewards, (ii) the Multi-Goal RL environments from Plappert et al. (2018), which include robotic manipulation tasks with sparse rewards, and (iii) a predator-prey environment inspired by biological survival dynamics (Lai et al., 2024), which emphasizes adaptive behavior in dynamic and high-stakes scenarios.
Dataset Splits | Yes | We train three agents (TD-MPC2, SAC, and our proposed D2AC) for 50,000 steps on Map Level 5 and evaluate: [...] (iii) Zero-Shot Transfer to unseen Map Level 9 (Table 2).
Hardware Specification | No | Figure 10: Wall-clock runtime comparison against TD-MPC2 and SAC on a single GPU. The figure highlights the differing computational profiles across tasks of varying complexity. While D2AC is substantially faster than the model-based TD-MPC2, its initial learning speed is surpassed by the simpler SAC on some tasks, illustrating a trade-off between computational overhead and the capacity for sustained learning on complex problems. (Only "single GPU" is mentioned, without a specific model, which is insufficient.)
Software Dependencies | No | The paper does not explicitly state software versions for key dependencies such as Python, PyTorch, or other libraries. It mentions only the optimizer (AdamW) and the non-linearity (Layer Normalization + ReLU), without versions for them or for the frameworks they belong to.
Experiment Setup | Yes | Table 4 ("Hyperparameters for D2AC") provides specific values for: batch size; optimizer; learning rates for the policy, the critic, and the temperature α; α initialization; weight decay; number of hidden layers (all networks); number of hidden units per layer; non-linearity; γ; λ_ent; Polyak coefficient for the target network; target network update interval; ratio between environment and optimization steps; initial random trajectories; number of parallel workers; replay buffer size; [V_min, V_max]; number of bins (size of the support z_q); σ_min, σ_max, σ_data; ρ; M; M_train; noise-level conditioning; and embedding size for noise-level conditioning.
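The "Tanh Action Squashing" named in Algorithm 1 is a standard device (familiar from SAC-style actor-critics) for bounding actions in (-1, 1) while keeping a tractable log-probability. A minimal pure-Python sketch of the idea, assuming a one-dimensional Gaussian policy; the function and variable names here are illustrative, not the authors' code:

```python
import math
import random

def squashed_gaussian_sample(mean, std, rng=random):
    """Sample a tanh-squashed Gaussian action and its corrected log-prob.

    Illustrative sketch of tanh action squashing; not the paper's code.
    """
    u = rng.gauss(mean, std)          # pre-squash Gaussian sample
    a = math.tanh(u)                  # action bounded in (-1, 1)
    # Gaussian log-density of the pre-squash sample u
    logp_u = -0.5 * ((u - mean) / std) ** 2 - math.log(std * math.sqrt(2.0 * math.pi))
    # change-of-variables correction for the squashing:
    # log p(a) = log p(u) - log(1 - tanh(u)^2)
    logp_a = logp_u - math.log(1.0 - a * a + 1e-6)
    return a, logp_a
```

The small 1e-6 term keeps the log finite when the action saturates near ±1, a common numerical guard in such implementations.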
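The hyperparameter names quoted from Table 4 can be collected into a single configuration object. A sketch of such a layout follows; every value is left as a None placeholder because the excerpt quotes the parameter names but not their settings, with only the optimizer (AdamW) and non-linearity stated elsewhere in the paper's text:

```python
# Placeholder config mirroring the hyperparameter names quoted from Table 4.
# All None values are stand-ins; the actual numbers are in the paper.
d2ac_config = {
    "batch_size": None,
    "optimizer": "AdamW",                 # stated in the paper
    "lr_policy": None,
    "lr_critic": None,
    "lr_temperature_alpha": None,
    "alpha_init": None,
    "weight_decay": None,
    "num_hidden_layers": None,            # all networks
    "hidden_units_per_layer": None,
    "nonlinearity": "LayerNorm + ReLU",   # stated in the paper
    "gamma": None,
    "lambda_ent": None,
    "polyak_target": None,
    "target_update_interval": None,
    "env_vs_opt_step_ratio": None,
    "initial_random_trajectories": None,
    "num_parallel_workers": None,
    "replay_buffer_size": None,
    "v_min_v_max": None,
    "num_bins": None,                     # size of the support z_q
    "sigma_min": None,
    "sigma_max": None,
    "sigma_data": None,
    "rho": None,
    "M": None,
    "M_train": None,
    "noise_level_conditioning": None,
    "noise_cond_embed_size": None,
}
```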