Outcome-based Reinforcement Learning to Predict the Future

Authors: Benjamin Turtel, Danny Franklin, Kris Skotheim, Luke Hewitt, Philipp Schoenegger

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | On the evaluation side, we collect a novel dataset of 1,265 questions on Polymarket, with accompanying news headlines up to a given prediction date, taking several measures to avoid temporal leakage (a common issue when backtesting the accuracy of forecasting models; see Section 2.1). We assess accuracy with the soft-Brier score and calibration with expected calibration error (ECE). Finally, we quantify economic value by converting each forecast into a set of hypothetical trades and comparing realised profits with those of the frontier reasoning model o1 as a benchmark.
Researcher Affiliation | Collaboration | Benjamin Turtel, Danny Franklin, Kris Skotheim (Lightning Rod Labs); Luke Hewitt (Stanford University); Philipp Schoenegger (London School of Economics and Political Science)
Pseudocode | No | The paper describes the GRPO, Modified-GRPO, and ReMax algorithms using mathematical equations and descriptive text, but provides no explicit pseudocode or algorithm blocks.
Open Source Code | No | The paper refers to the 'Lightning Rod Labs proprietary Foresight Learning framework' and uses an 'open-weight Qwen 2.5-14B base checkpoint', but provides no link or statement open-sourcing its own RL pipeline or model implementations.
Open Datasets | Yes | The full dataset (including questions, relevant news headlines to aid prediction, model reasoning traces and predictions, and the true final resolution) is available here. Dataset available at: https://huggingface.co/datasets/LightningRodLabs/outcome-rl-test-dataset
Dataset Splits | Yes | Overall, we find that on the 1,265-question hold-out set, a seven-run ReMax ensemble attains a Brier score of 0.190 [0.178, 0.203] and an ECE of 0.062... We begin by collecting an ordered training dataset of 10,000 training questions from Polymarket... Finally, for a second set of experiments, we generate 100,000 additional training questions... We reserved the same held-out test set of 1,265 questions for all models.
Hardware Specification | Yes | All experiments ran on a single 8-GPU node. The GRPO (10k), ReMax, Modified-GRPO, DPO, and large-scale GRPO-100k runs each used eight NVIDIA H100 GPUs for approximately three days.
Software Dependencies | No | The paper mentions optimiser details such as 'AdamW (β₁ = 0.9, β₂ = 0.999, ϵ = 1 × 10⁻⁸, no weight decay)' and 'bfloat16 precision', but does not provide version numbers for software dependencies such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | Across algorithms we keep the optimisation scaffold identical: AdamW (β₁ = 0.9, β₂ = 0.999, ϵ = 1 × 10⁻⁸, no weight decay), bfloat16 precision, global grad-norm clip of 1.0, and an entropy bonus coefficient of 0.001. We modify only the levers each method cares about. GRPO uses an actor learning rate of 1 × 10⁻⁶, an initial KL penalty of 0.005, PPO ratio-clip ϵ = 0.20, and G = 4 roll-outs per prompt; Modified-GRPO is identical except that it drops the σ division in the advantage to isolate the effect of normalisation. ReMax doubles the actor learning rate to 2 × 10⁻⁶, keeps the KL schedule unchanged, and trains its learned value baseline with a learning rate of 1 × 10⁻⁶ under an MSE loss scaled by 0.5. DPO is run for 4 epochs at β = 0.1 with a constant 1 × 10⁻⁵ learning rate and a batch size of 128 sequences. All runs employ automatic mixed precision and gradient accumulation to emulate two sequences per GPU.
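The evaluation metrics quoted above (soft-Brier score and ECE) can be sketched in a few lines. This is a minimal illustration assuming the standard binary Brier score and an equal-width-bin ECE; the paper's exact soft-Brier definition is not reproduced here and may differ.

```python
def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and binary outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

def expected_calibration_error(probs, outcomes, n_bins=10):
    """Equal-width-bin ECE: |empirical frequency - mean confidence| per bin,
    weighted by the fraction of forecasts that fall in the bin."""
    n = len(probs)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Last bin is closed on the right so that p == 1.0 is counted.
        members = [(p, y) for p, y in zip(probs, outcomes)
                   if lo <= p < hi or (b == n_bins - 1 and p == hi)]
        if members:
            conf = sum(p for p, _ in members) / len(members)
            acc = sum(y for _, y in members) / len(members)
            ece += (len(members) / n) * abs(acc - conf)
    return ece
```

A perfectly calibrated, perfectly confident forecaster scores 0.0 on both metrics; an uninformative constant forecast of 0.5 scores a Brier of 0.25.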
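The experiment-setup row notes that Modified-GRPO differs from GRPO only in dropping the σ division in the advantage. That one-lever difference can be illustrated with a hypothetical helper (`group_relative_advantages` is not code from the paper, just a sketch of the group-relative advantage computation):

```python
def group_relative_advantages(rewards, eps=1e-8, divide_by_std=True):
    """Advantage of each roll-out relative to its group of G samples:
    subtract the group mean reward; plain GRPO additionally divides by
    the group's reward std, which Modified-GRPO omits."""
    mean = sum(rewards) / len(rewards)
    centred = [r - mean for r in rewards]
    if not divide_by_std:
        return centred  # Modified-GRPO: mean-centred only
    var = sum(c ** 2 for c in centred) / len(rewards)
    return [c / (var ** 0.5 + eps) for c in centred]
```

With G = 4 roll-outs per prompt as in the paper, the σ division rescales advantages per group, so dropping it changes the effective step size on low-variance groups; isolating that effect is the stated purpose of the Modified-GRPO ablation.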