Temporal Difference Flows

Authors: Jesse Farebrother, Matteo Pirotta, Andrea Tirinzoni, Rémi Munos, Alessandro Lazaric, Ahmed Touati

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Empirically, we validate TD-Flow across a diverse set of domains on both generative metrics and downstream tasks, including policy evaluation. Moreover, integrating TD-Flow with recent behavior foundation models for planning over policies demonstrates substantial performance gains, underscoring its promise for long-horizon decision-making. We now present a series of experiments to assess the efficacy of our TD-based flow and diffusion approaches against baselines employing Generative Adversarial Networks (Goodfellow et al., 2014) and β-Variational Auto-Encoders (Higgins et al., 2017). We benchmark 22 tasks spanning 4 domains (Maze, Walker, Cheetah, Quadruped) from the DeepMind Control Suite (Tunyasuvunakool et al., 2020)."
Researcher Affiliation | Collaboration | "Jesse Farebrother 1,2, Matteo Pirotta 3, Andrea Tirinzoni 3, Rémi Munos 3, Alessandro Lazaric 3, Ahmed Touati 3. Work done at Meta. 1 McGill University, 2 Mila – Québec AI Institute, 3 FAIR at Meta. Correspondence to: Jesse Farebrother <EMAIL>, Ahmed Touati <EMAIL>."
Pseudocode | Yes | "We provide further implementation details and pseudo-code for all TD-Flow methods in Appendix C.3.1." (Algorithm 1: Template for TD-Flow algorithms)
Open Source Code | No | The paper contains no explicit statement about releasing open-source code, nor a link to a code repository.
Open Datasets | Yes | "GHM training proceeds in an off-policy manner where we learn the successor measure of a TD3 policy using transition data from the ExoRL dataset (Yarats et al., 2022); specifically, we use a dataset of 10M transitions collected by a random network distillation policy (Burda et al., 2019)."
Dataset Splits | No | The paper mentions using the ExoRL dataset for training but does not specify any training, validation, or test splits. The evaluation protocol generates samples from the ground-truth successor measure rather than splitting an existing dataset into train/validation/test sets.
Hardware Specification | No | The paper does not describe the specific hardware (e.g., GPU or CPU models, or cloud computing instance types) used to run its experiments.
Software Dependencies | No | The paper names several software components, architectures, and optimizers (e.g., Flow Matching, DDPM, U-Net, AdamW) and cites their respective papers, but provides no version numbers for any software libraries or dependencies, which a reproducible setup would require.
Experiment Setup | Yes | Appendix C.4, titled 'Hyperparameters', provides detailed tables (Tables 5–8) listing hyperparameter values for training the various models, including ODE dt (0.1), discretization steps (1,000), embedding dimension (256), block dimensions (512, 512, 512 or 1024, 1024, 1024), optimizer parameters (AdamW, β1 = 0.9, β2 = 0.999, ε = 10^-4), learning rate (10^-4), weight decay (10^-3 or 10^-2), gradient steps (3M or 8M), batch size (1024), and target network EMA (10^-3 or 10^-4).
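For readers wanting to reproduce the setup, the reported values can be collected into a single configuration object. The sketch below is an illustrative transcription of the Appendix C.4 values quoted above, not code from the paper; all key names are hypothetical, and where the paper reports two values the smaller configuration is used with the alternative noted in a comment.

```python
# Hypothetical config dict transcribing the hyperparameters the paper
# reports in Appendix C.4 (Tables 5-8); key names are our own invention.
td_flow_hparams = {
    "ode_dt": 0.1,                   # ODE integration step size
    "discretization_steps": 1_000,   # flow/diffusion discretization steps
    "embedding_dim": 256,
    "block_dims": (512, 512, 512),   # larger models: (1024, 1024, 1024)
    "optimizer": {
        "name": "AdamW",
        "beta1": 0.9,
        "beta2": 0.999,
        "eps": 1e-4,
        "learning_rate": 1e-4,
        "weight_decay": 1e-3,        # 1e-2 in some configurations
    },
    "gradient_steps": 3_000_000,     # 8M in some configurations
    "batch_size": 1024,
    "target_network_ema": 1e-3,      # 1e-4 in some configurations
}
```

Collecting the values this way makes it easy to diff configurations across the paper's model variants, e.g. swapping in the larger block dimensions or the longer training schedule.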