DRDT3: Diffusion-Refined Decision Test-Time Training Model

Authors: Xingshuai Huang, Di Wu, Benoit Boulet

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In experiments on multiple tasks in the D4RL benchmark, our DT3 model without diffusion refinement demonstrates improved performance over standard DT, while DRDT3 further achieves superior results compared to state-of-the-art DT-based and offline RL methods." "Experiments on extensive tasks from the D4RL benchmark (Fu et al., 2020) demonstrate the superior performance of our proposed DT3 and DRDT3 over conventional offline RL and DT-based methods."
Researcher Affiliation | Academia | Xingshuai Huang (EMAIL), Di Wu (EMAIL), and Benoit Boulet (EMAIL), Department of Electrical and Computer Engineering, McGill University
Pseudocode | Yes | Algorithm 1 (Training of DRDT3) and Algorithm 2 (Inference of DRDT3)
Open Source Code | No | The paper does not state that source code for the described methodology is publicly available, nor does it provide a link to a code repository. It refers to other papers for baselines, but not to its own implementation.
Open Datasets | Yes | "We conduct experiments to evaluate our proposed DRDT3 on the commonly used D4RL benchmark (Fu et al., 2020) using an AMD Ryzen 7 7700X 8-Core Processor with a single NVIDIA GeForce RTX 4080 GPU."
Dataset Splits | No | The paper describes the characteristics and sources of the D4RL datasets used (e.g., "Medium datasets contain one million samples collected using a behavior policy...", "Medium-Expert datasets consist of two million samples..."), but it does not specify how these datasets were further split into training, validation, or testing sets. It refers to using the D4RL benchmark datasets, but not to the specific splits used in the authors' experiments.
Hardware Specification | Yes | "We conduct experiments to evaluate our proposed DRDT3 on the commonly used D4RL benchmark (Fu et al., 2020) using an AMD Ryzen 7 7700X 8-Core Processor with a single NVIDIA GeForce RTX 4080 GPU."
Software Dependencies | No | The paper does not provide version numbers for any software dependencies, such as programming languages, libraries, or frameworks. It mentions the GPT-2 model and the Optuna hyperparameter optimization framework, both without versions.
Experiment Setup | Yes | "When implementing our proposed DRDT3, we train it for 50 epochs with 2000 gradient updates per epoch. The learning rate and batch size are designated as 0.0003 and 2048, respectively. To process historical sub-trajectories with the DT3 module, we set the context length as 6. The Attention TTT block used in the DT3 module consists of 1-layer self-attention and 1-layer TTT with embedding dimensions of 128. We empirically set ζ = 0.2 based on the results."
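For reference, the hyperparameters quoted above can be collected into a single configuration sketch. This is illustrative only: the field names are hypothetical (the authors' implementation is not released), and only the values come from the paper's reported setup.

```python
# Hypothetical configuration sketch assembling the DRDT3 hyperparameters
# reported in the paper's experiment setup. Field names are illustrative,
# not taken from the authors' (unreleased) implementation.
from dataclasses import dataclass


@dataclass
class DRDT3Config:
    epochs: int = 50                # training epochs
    updates_per_epoch: int = 2000   # gradient updates per epoch
    learning_rate: float = 3e-4     # reported as 0.0003
    batch_size: int = 2048
    context_length: int = 6         # historical sub-trajectory length for DT3
    n_attention_layers: int = 1     # self-attention layers in the Attention TTT block
    n_ttt_layers: int = 1           # TTT layers in the Attention TTT block
    embed_dim: int = 128
    zeta: float = 0.2               # coefficient reported as ζ = 0.2, set empirically

    @property
    def total_updates(self) -> int:
        """Total gradient updates over the full training run."""
        return self.epochs * self.updates_per_epoch


cfg = DRDT3Config()
print(cfg.total_updates)  # 100000
```

Collecting the values this way makes the overall training budget explicit: 50 epochs at 2000 updates each is 100,000 gradient updates in total.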