Robust Autonomy Emerges from Self-Play

Authors: Marco Cusumano-Towner, David Hafner, Alexander Hertzberg, Brody Huval, Aleksei Petrenko, Eugene Vinitsky, Erik Wijmans, Taylor W. Killian, Stuart Bowers, Ozan Sener, Philipp Kraehenbuehl, Vladlen Koltun

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental The resulting policy achieves state-of-the-art performance on three independent autonomous driving benchmarks. The policy outperforms the prior state of the art when tested on recorded real-world scenarios, amidst human drivers, without ever seeing human data during training.
Researcher Affiliation Industry Apple. Correspondence to: Philipp Krähenbühl <EMAIL>.
Pseudocode Yes Algorithm 1 Advantage filtering
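The paper's Algorithm 1 (advantage filtering) is reported only as pseudocode; a minimal sketch of the idea, assuming the filter keeps transitions whose advantage magnitude exceeds a fraction η of the largest advantage magnitude in the batch (the threshold η = 0.01·Amax reported in Table A3), could look like this. Function and variable names are illustrative, not from the paper:

```python
import numpy as np

def advantage_filter(advantages: np.ndarray, eta: float = 0.01) -> np.ndarray:
    """Return a boolean mask selecting transitions whose advantage
    magnitude is at least eta * max |advantage| over the batch."""
    a_max = np.abs(advantages).max()          # A_max over the batch
    return np.abs(advantages) >= eta * a_max  # keep "surprising" transitions

adv = np.array([0.5, -0.002, 3.0, 0.01, -2.0])
mask = advantage_filter(adv, eta=0.01)
# threshold is 0.01 * 3.0 = 0.03, so the 0.002 and 0.01 entries are dropped
```

Filtering low-advantage transitions concentrates gradient updates on informative experience, which matters at the batch sizes the paper reports.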
Open Source Code No The paper does not provide an explicit statement about releasing source code for the methodology or a direct link to a code repository. It refers to using third-party codebases like Stable Baselines.
Open Datasets Yes We test the GIGAFLOW policy in three leading independent third-party benchmarks: CARLA (Dosovitskiy et al., 2017), nuPlan (Caesar et al., 2022), and the Waymo Open Motion Dataset (Ettinger et al., 2021) (through the Waymax simulator (Gulino et al., 2023)).
Dataset Splits Yes The nuPlan benchmark consists of a training, validation and held-out test set... We evaluate GIGAFLOW on the Val14 benchmark... The full Waymo Open Motion Dataset (WOMD) 1.2.0 validation set consists of 44 097 scenarios, each 8 s long, running at 10 Hz.
Hardware Specification Yes GIGAFLOW is capable of simulating and learning from 4.4 billion state transitions (7.2 million km of driving, or 42 years of continuous driving experience) per hour on a single 8-GPU node... On an 8-GPU A100 node, the policy allows inference throughput of 7.4 million decisions per second during experience collection at a batch size of 2.6 million, and eight gradient updates per second in the training phase with a batch size of 256 000.
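The reported throughput figures are internally consistent, which can be checked with a short calculation. The 0.3 s simulation step is inferred from Table A3 (1200 steps per 360 s episode); everything else comes from the quoted numbers:

```python
# Sanity check of the reported throughput: 4.4 billion transitions/hour
# should correspond to ~42 years of driving and ~7.2 million km,
# assuming a 0.3 s simulation step (1200 steps per 360 s episode).
steps_per_hour = 4.4e9
sim_seconds = steps_per_hour * 0.3              # simulated seconds per wall-clock hour
years = sim_seconds / (365.25 * 24 * 3600)      # ~42 years of continuous driving
km = 7.2e6
avg_speed_kmh = km / (sim_seconds / 3600)       # ~20 km/h average agent speed
```

The implied average speed of roughly 20 km/h is plausible for dense urban driving scenarios, which lends credibility to the quoted figures.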
Software Dependencies No GIGAFLOW is a batched simulator (Makoviychuk et al., 2021; Freeman et al., 2021; Petrenko et al., 2021; Shacklett et al., 2021), implemented in PyTorch (Ansel et al., 2024)... Agents are trained using a version of Proximal Policy Optimization (PPO) (Schulman et al., 2017) derived from the Stable Baselines codebase (Raffin et al., 2021). While software is mentioned with citations, specific version numbers are not provided.
Experiment Setup Yes Table A3 provides our final list of training hyperparameters: training batch size 256 000; batch size per GPU 32 000; rollout length 128; num. PPO epochs 3; discount factor γ 0.999; λ_GAE 0.95; max. episode length 1200 steps (360 s); PPO clipping ratio 0.2; value function clipping none; initial LR α(0) 5×10⁻⁴; LR schedule cosine; entropy coefficient 0.01; value loss coefficient 0.5; max. grad. norm 0.5; advantage normalization enabled; adv. filtering threshold η 0.01·Amax (Alg. 1); inference & training precision 16-bit AMP; model weights initialization orthogonal, zero bias.
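The table specifies a cosine learning-rate schedule with initial LR 5×10⁻⁴ but not its endpoint. A minimal sketch of a standard cosine decay to zero, which is the common convention and an assumption here:

```python
import math

def cosine_lr(step: int, total_steps: int, lr0: float = 5e-4) -> float:
    """Cosine decay from lr0 at step 0 to 0 at total_steps.
    The decay-to-zero endpoint is an assumption, not stated in the paper."""
    return 0.5 * lr0 * (1.0 + math.cos(math.pi * step / total_steps))

cosine_lr(0, 1000)     # initial LR, 5e-4
cosine_lr(500, 1000)   # halfway: 2.5e-4
```

A warmup phase or a nonzero floor, if used, is not reported and would change this shape.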