Outcome-based Reinforcement Learning to Predict the Future
Authors: Benjamin Turtel, Danny Franklin, Kris Skotheim, Luke Hewitt, Philipp Schoenegger
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On the evaluation side, we collect a novel dataset of 1,265 questions on Polymarket, with accompanying news headlines up to a given prediction date, taking several measures to avoid temporal leakage (a common issue when backtesting the accuracy of forecasting models, see Section 2.1). We assess accuracy with the soft-Brier score and calibration with expected calibration error (ECE). Finally, we quantify economic value by converting each forecast into a set of hypothetical trades and comparing realised profits with those of the frontier reasoning model o1 as a benchmark. |
| Researcher Affiliation | Collaboration | Benjamin Turtel, Danny Franklin, Kris Skotheim (Lightning Rod Labs); Luke Hewitt (Stanford University); Philipp Schoenegger (London School of Economics and Political Science) |
| Pseudocode | No | The paper describes the GRPO, Modified-GRPO, and ReMax algorithms using mathematical equations and descriptive text, but no explicit pseudocode or algorithm blocks are provided. |
| Open Source Code | No | The paper refers to 'Lightning Rod Labs proprietary Foresight Learning framework' and uses an 'open-weight Qwen 2.5-14B base checkpoint', but provides no link to, or statement about, open-sourcing its own RL pipeline or model implementations. |
| Open Datasets | Yes | The full dataset (including questions, relevant news headlines to aid prediction, model reasoning traces and predictions, and the true final resolution) is available at: https://huggingface.co/datasets/LightningRodLabs/outcome-rl-test-dataset |
| Dataset Splits | Yes | Overall, we find that on the 1,265-question hold-out set, a seven-run ReMax ensemble attains a Brier of 0.190 [0.178, 0.203] and an ECE of 0.062... We begin by collecting an ordered training dataset of 10,000 training questions from Polymarket... Finally, for a second set of experiments, we generate 100,000 additional training questions... We reserved the same held-out test set of 1,265 questions for all models. |
| Hardware Specification | Yes | All experiments ran on a single 8-GPU node. The GRPO (10k), ReMax, Modified-GRPO, DPO, and the large-scale GRPO-100k runs used eight NVIDIA H100 GPUs for approximately three days. |
| Software Dependencies | No | The paper mentions optimiser details like 'AdamW (β₁ = 0.9, β₂ = 0.999, ϵ = 1×10⁻⁸, no weight decay)' and 'bfloat16 precision', but does not provide specific version numbers for software libraries or dependencies such as Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | Across algorithms we keep the optimisation scaffold identical: AdamW (β₁ = 0.9, β₂ = 0.999, ϵ = 1×10⁻⁸, no weight decay), bfloat16 precision, global grad-norm clip 1.0, and an entropy bonus coefficient of 0.001. We modify only the levers each method cares about. GRPO uses an actor learning rate of 1×10⁻⁶, an initial KL penalty of 0.005, PPO ratio-clip ϵ = 0.20, and G = 4 roll-outs per prompt; Modified-GRPO is identical except that it drops the σ division in the advantage to isolate the effect of normalisation. ReMax doubles the actor learning rate to 2×10⁻⁶, keeps the KL schedule unchanged, and trains its learned value baseline with a 1×10⁻⁶ learning rate under an MSE loss scaled by 0.5. DPO is run for 4 epochs at β = 0.1 with a constant 1×10⁻⁵ learning rate and a batch size of 128 sequences. All runs employ automatic mixed precision and gradient accumulation to emulate two sequences per GPU. |
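The evaluation row above mentions the soft-Brier score and expected calibration error (ECE). The paper's exact soft-Brier variant is not reproduced in the quoted text, so the sketch below shows the standard Brier score and a common equal-width-bin ECE estimator; the 10-bin choice and function names are assumptions, not the authors' implementation.

```python
import numpy as np

def brier_score(probs, outcomes):
    """Mean squared error between a forecast probability and the binary outcome."""
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    return float(np.mean((probs - outcomes) ** 2))

def expected_calibration_error(probs, outcomes, n_bins=10):
    """Bin forecasts by confidence, then average |empirical accuracy - mean
    confidence| per bin, weighted by the fraction of forecasts in the bin."""
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Final bin is closed on the right so probability 1.0 is counted.
        if hi < 1.0:
            mask = (probs >= lo) & (probs < hi)
        else:
            mask = (probs >= lo) & (probs <= hi)
        if mask.any():
            ece += mask.mean() * abs(outcomes[mask].mean() - probs[mask].mean())
    return float(ece)
```

A perfectly sharp, correct forecaster scores 0.0 on both metrics; an uninformative forecaster emitting 0.5 everywhere scores 0.25 on Brier regardless of outcomes.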
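The setup row notes that Modified-GRPO "drops the σ division in the advantage to isolate the effect of normalisation." A minimal sketch of that one-line difference, assuming standard group-relative advantage estimation over the G roll-outs of a single prompt (the function name and `eps` stabiliser are illustrative, not from the paper):

```python
import numpy as np

def grpo_advantages(rewards, normalize_std=True, eps=1e-8):
    """Group-relative advantages for the G roll-outs of one prompt.

    GRPO centres each reward on the group mean and divides by the group
    standard deviation; the paper's Modified-GRPO keeps the centring but
    drops the sigma division (normalize_std=False).
    """
    r = np.asarray(rewards, dtype=float)
    adv = r - r.mean()                 # centre on the group mean
    if normalize_std:
        adv = adv / (r.std() + eps)    # GRPO's sigma normalisation
    return adv
```

With binary outcome rewards the σ division rescales a group like [1, 0, 1, 0] from ±0.5 to ±1.0 advantages, which is why dropping it changes the effective per-group gradient scale.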