Temporal Difference Flows

Authors: Jesse Farebrother, Matteo Pirotta, Andrea Tirinzoni, Rémi Munos, Alessandro Lazaric, Ahmed Touati

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Empirically, we validate TD-Flow across a diverse set of domains on both generative metrics and downstream tasks, including policy evaluation. Moreover, integrating TD-Flow with recent behavior foundation models for planning over policies demonstrates substantial performance gains, underscoring its promise for long-horizon decision-making. We now present a series of experiments to assess the efficacy of our TD-based flow and diffusion approaches against baselines employing Generative Adversarial Networks (Goodfellow et al., 2014) and β-Variational Auto-Encoders (Higgins et al., 2017). We benchmark 22 tasks spanning 4 domains (Maze, Walker, Cheetah, Quadruped) from the DeepMind Control Suite (Tunyasuvunakool et al., 2020)."
Researcher Affiliation | Collaboration | "Jesse Farebrother 1,2, Matteo Pirotta 3, Andrea Tirinzoni 3, Rémi Munos 3, Alessandro Lazaric 3, Ahmed Touati 3. Work done at Meta. 1 McGill University, 2 Mila – Québec AI Institute, 3 FAIR at Meta. Correspondence to: Jesse Farebrother <EMAIL>, Ahmed Touati <EMAIL>."
Pseudocode | Yes | "We provide further implementation details and pseudo-code for all TD-Flow methods in Appendix C.3.1." (Algorithm 1: Template for TD-Flow algorithms)
Open Source Code | No | The paper contains no explicit statement about releasing open-source code, nor a link to a code repository.
Open Datasets | Yes | "GHM training proceeds in an off-policy manner where we learn the successor measure of a TD3 policy using transition data from the ExoRL dataset (Yarats et al., 2022); specifically, we use a dataset of 10M transitions collected by a random network distillation policy (Burda et al., 2019)."
Dataset Splits | No | The paper mentions using the ExoRL dataset for training but does not specify any training, validation, or test splits. The evaluation protocol generates samples from the ground-truth successor measure rather than splitting an existing dataset into train/validation/test sets.
Hardware Specification | No | The paper does not describe the specific hardware (e.g., GPU or CPU models, or cloud computing instance types) used to run its experiments.
Software Dependencies | No | The paper names several software components, architectures, and optimizers (e.g., Flow Matching, DDPM, U-Net, AdamW) and cites their respective papers, but provides no version numbers for any software libraries or dependencies, which a reproducible setup would require.
Experiment Setup | Yes | Appendix C.4, titled 'Hyperparameters', provides detailed tables (Tables 5–8) listing hyperparameter values for training the various models, including ODE dt (0.1), discretization steps (1,000), embedding dimension (256), block dimensions (512, 512, 512 or 1024, 1024, 1024), optimizer parameters (AdamW, β1 = 0.9, β2 = 0.999, ε = 10^-4), learning rate (10^-4), weight decay (10^-3 or 10^-2), gradient steps (3M or 8M), batch size (1024), and target network EMA (10^-3 or 10^-4).
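For readers wanting to reproduce the setup, the reported values can be collected into a single configuration object. The sketch below is an illustrative transcription of the Appendix C.4 values quoted above, not code from the paper; all key names are hypothetical, and where the paper reports two values the smaller configuration is used with the alternative noted in a comment.

```python
# Hypothetical config dict transcribing the hyperparameters the paper
# reports in Appendix C.4 (Tables 5-8); key names are our own invention.
td_flow_hparams = {
    "ode_dt": 0.1,                   # ODE integration step size
    "discretization_steps": 1_000,   # flow/diffusion discretization steps
    "embedding_dim": 256,
    "block_dims": (512, 512, 512),   # larger models: (1024, 1024, 1024)
    "optimizer": {
        "name": "AdamW",
        "beta1": 0.9,
        "beta2": 0.999,
        "eps": 1e-4,
        "learning_rate": 1e-4,
        "weight_decay": 1e-3,        # 1e-2 in some configurations
    },
    "gradient_steps": 3_000_000,     # 8M in some configurations
    "batch_size": 1024,
    "target_network_ema": 1e-3,      # 1e-4 in some configurations
}
```

Collecting the values this way makes it easy to diff configurations across the paper's model variants, e.g. swapping in the larger block dimensions or the longer training schedule.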