Temporal Difference Flows
Authors: Jesse Farebrother, Matteo Pirotta, Andrea Tirinzoni, Rémi Munos, Alessandro Lazaric, Ahmed Touati
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we validate TD-Flow across a diverse set of domains on both generative metrics and downstream tasks, including policy evaluation. Moreover, integrating TD-Flow with recent behavior foundation models for planning over policies demonstrates substantial performance gains, underscoring its promise for long-horizon decision-making. We now present a series of experiments to assess the efficacy of our TD-based flow and diffusion approaches against baselines employing Generative Adversarial Networks (Goodfellow et al., 2014) and β-Variational Auto-Encoders (Higgins et al., 2017). We benchmark 22 tasks spanning 4 domains (Maze, Walker, Cheetah, Quadruped) from the DeepMind Control Suite (Tunyasuvunakool et al., 2020). |
| Researcher Affiliation | Collaboration | Jesse Farebrother 1 2, Matteo Pirotta 3, Andrea Tirinzoni 3, Rémi Munos 3, Alessandro Lazaric 3, Ahmed Touati 3. Work done at Meta. 1 McGill University, 2 Mila Québec AI Institute, 3 FAIR at Meta. Correspondence to: Jesse Farebrother <EMAIL>, Ahmed Touati <EMAIL>. |
| Pseudocode | Yes | We provide further implementation details and pseudo-code for all TD-Flow methods in Appendix C.3.1. Algorithm 1 Template for TD-Flow algorithms |
| Open Source Code | No | The paper does not contain any explicit statement about the release of open-source code or provide a link to a code repository. |
| Open Datasets | Yes | GHM training proceeds in an off-policy manner where we learn the successor measure of a TD3 policy using transition data from the ExoRL dataset (Yarats et al., 2022); specifically, we use a dataset of 10M transitions collected by a random network distillation policy (Burda et al., 2019). |
| Dataset Splits | No | The paper mentions using the ExoRL dataset for training but does not specify any training, validation, or test splits for it. The evaluation protocol instead generates samples from the ground-truth successor measure; no existing dataset is split into train/validation/test sets. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU, CPU models, or cloud computing instance types) used for running its experiments. |
| Software Dependencies | No | The paper mentions several software components, architectures, and optimizers (e.g., Flow Matching, DDPM, U-Net, AdamW) and cites their respective papers, but does not provide version numbers for any software libraries or dependencies, which would be required for a reproducible setup. |
| Experiment Setup | Yes | Appendix C.4, titled 'Hyperparameters', provides detailed tables (Tables 5, 6, 7, and 8) listing hyperparameter values for training the various models, including ODE dt (0.1), discretization steps (1,000), embedding dimension (256), block dimensions (512, 512, 512 or 1024, 1024, 1024), optimizer parameters (AdamW: β1 = 0.9, β2 = 0.999, ε = 10^-4), learning rate (10^-4), weight decay (10^-3 or 10^-2), gradient steps (3M or 8M), batch size (1024), and target network EMA (10^-3 or 10^-4). |
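For readers attempting a reimplementation, the hyperparameters quoted above can be collected into a single configuration sketch. This is a minimal, hedged assembly: the key names below are hypothetical (the paper does not publish code), and only the values come from the reported Appendix C.4 tables.

```python
# Hypothetical configuration dictionary assembled from the hyperparameters
# reported in Appendix C.4 (Tables 5-8). Key names are illustrative only;
# the values are those quoted in the report above.
TD_FLOW_CONFIG = {
    "ode_dt": 0.1,                  # ODE integration step size
    "discretization_steps": 1_000,  # time discretization steps
    "embedding_dim": 256,
    "block_dims": (512, 512, 512),  # larger variant: (1024, 1024, 1024)
    "optimizer": {
        "name": "AdamW",
        "beta1": 0.9,
        "beta2": 0.999,
        "eps": 1e-4,
        "learning_rate": 1e-4,
        "weight_decay": 1e-3,       # 1e-2 in some configurations
    },
    "gradient_steps": 3_000_000,    # 8M in some configurations
    "batch_size": 1024,
    "target_network_ema": 1e-3,     # 1e-4 in some configurations
}
```

Where the paper reports two values (e.g., weight decay, gradient steps, target EMA), the sketch keeps one and notes the alternative in a comment; the per-domain assignment would need to be read off the individual appendix tables.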