BrierBench
Mar 2026
Can language models predict the future?
Not literally, of course. But prediction markets offer a compelling test: thousands of real-world questions with binary outcomes, crowd-calibrated probabilities, and — crucially — answers that haven’t happened yet. A model cannot have memorised something that hasn’t occurred.
ForecastBench exploits this. Every two weeks it samples 500 questions from Manifold, Polymarket, Metaculus, and INFER, runs ~25 models, and scores them against ground truth. The primary metric is the Brier score, (f − o)², where f is the forecast probability and o is the realised outcome (0 or 1). Lower is better; always answering 50/50 scores 0.25.
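As a sanity check on the metric, here is a minimal Brier implementation in numpy (my own sketch, not the benchmark's scoring code):

```python
import numpy as np

def brier(forecasts, outcomes):
    """Mean Brier score: mean squared error between probability and 0/1 outcome."""
    forecasts = np.asarray(forecasts, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    return float(np.mean((forecasts - outcomes) ** 2))

# Always answering 50/50 scores 0.25, whatever the outcomes turn out to be:
print(brier([0.5, 0.5, 0.5, 0.5], [0, 1, 1, 0]))  # 0.25
```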
I wanted to know two things. First, whether the benchmark is reproducible — can an independent pipeline, using OpenRouter as a unified API gateway, match the official leaderboard? Second, how do frontier models actually differ in their forecasting behaviour?
Replicating the baseline
Using Gemini-2.5-Flash across all 24 historical question sets (18,484 questions, 22,829 matched forecasts), the raw Brier scores match closely:
| Metric | Ours | Leaderboard | Delta |
|---|---|---|---|
| Dataset Brier | 0.185 | 0.170 | +0.015 |
| Market Brier | 0.132 | 0.146 | -0.014 |
| Overall Brier | 0.181 | 0.158 | +0.023 |
Cohen’s d = 0.12 — a negligible effect size. The replication works.
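Cohen's d here is the standard pooled-standard-deviation effect size between two samples of per-question Brier scores (that pairing is my assumption about how it was computed):

```python
import numpy as np

def cohens_d(a, b):
    """Effect size: difference in means divided by the pooled standard deviation."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    na, nb = len(a), len(b)
    pooled = np.sqrt(((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1))
                     / (na + nb - 2))
    return float((a.mean() - b.mean()) / pooled)
```

By the usual rule of thumb, |d| below 0.2 counts as negligible.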
Raw Brier score comparison for Gemini-2.5-Flash: our replication vs ForecastBench leaderboard, with 95% bootstrap confidence intervals.
One important caveat: we forecast retroactively (calling models in March 2026 on questions from 2024–2026). The model could have indirect knowledge of some outcomes. That the scores still match suggests retroactive knowledge has minimal impact on aggregate accuracy.
Per-question-set Brier scores showing temporal variation. Recent sets trend higher as fewer questions have resolved, biasing toward longer-horizon forecasts.
Forecasting personalities
The more interesting finding came from benchmarking three frontier models against live market prices from all four prediction platforms (247/250 questions matched, 98.8% coverage).
| Model | Mean \|delta\| | Median \|delta\| | Bias | Higher/Lower |
|---|---|---|---|---|
| GPT-5.4 | 0.220 | 0.154 | -0.036 | 100/113 |
| Gemini-3.1-Pro | 0.224 | 0.140 | -0.083 | 72/142 |
| Grok-4.20-beta | 0.274 | 0.200 | +0.061 | 118/97 |
Each model has a distinct personality:
GPT-5.4 is a consensus tracker. Its forecasts cluster near market prices, with the smallest median deviation. It rarely strays far from the crowd.
Gemini-3.1-Pro is conservative. It systematically predicts lower probabilities than the market — a negative bias of -0.083 means it’s consistently more sceptical.
Grok-4.20-beta is the contrarian. The highest mean deviation, a positive bias, and a fondness for extreme probabilities (92%, 95%). It is willing to disagree with the crowd.
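The columns in the table above reduce to a few lines over paired (model, market) probabilities; a sketch, with delta defined as model minus market:

```python
import numpy as np

def deviation_stats(model_p, market_p):
    """Summarise how far a model's forecasts sit from live market prices."""
    delta = np.asarray(model_p, float) - np.asarray(market_p, float)
    return {
        "mean_abs": float(np.mean(np.abs(delta))),    # average distance from the crowd
        "median_abs": float(np.median(np.abs(delta))),
        "bias": float(np.mean(delta)),                # signed: positive = above market
        "higher": int(np.sum(delta > 0)),             # forecasts above the market price
        "lower": int(np.sum(delta < 0)),
    }
```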
Model forecast vs live market price for three frontier models. The diagonal represents perfect agreement with the crowd.
Whether contrarianism helps depends on whether the crowd is well-calibrated. On ForecastBench, the superforecaster median achieves a Brier Index of 70.8 — substantially better than any LLM (best: 64.2). Extreme disagreement with markets is, on average, wrong.
The pipeline
The full replication runs all 24 question sets for ~$10 (Gemini-2.5-Flash) in under an hour at 100 concurrent API calls. Deterministic disk caching means re-runs are free.
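Deterministic caching just means the cache key is a pure function of the request; a minimal sketch (file layout and key fields are assumptions, not the pipeline's actual scheme):

```python
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path("forecast_cache")  # hypothetical location

def cache_key(model: str, question_id: str, prompt: str) -> str:
    """Same (model, question, prompt) always hashes to the same file name."""
    raw = json.dumps([model, question_id, prompt])
    return hashlib.sha256(raw.encode()).hexdigest()

def cached_forecast(model, question_id, prompt, call_api):
    path = CACHE_DIR / f"{cache_key(model, question_id, prompt)}.json"
    if path.exists():                     # re-runs read from disk, not the API
        return json.loads(path.read_text())
    result = call_api(model, prompt)      # pay for each unique request exactly once
    CACHE_DIR.mkdir(exist_ok=True)
    path.write_text(json.dumps(result))
    return result
```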
Market prices come from four platforms via heterogeneous methods: Manifold and Polymarket have public REST APIs; Metaculus hides its aggregated probabilities behind JavaScript (requiring browser automation); INFER needs authenticated scraping.
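One way to keep four heterogeneous fetch methods behind a single interface is a small registry; a sketch in which all names and signatures are hypothetical, not the pipeline's actual code:

```python
from typing import Callable, Dict

PriceFetcher = Callable[[str], float]
FETCHERS: Dict[str, PriceFetcher] = {}

def register(platform: str):
    """Map a platform name to the function that knows how to fetch its prices."""
    def wrap(fn: PriceFetcher) -> PriceFetcher:
        FETCHERS[platform] = fn
        return fn
    return wrap

@register("manifold")
def fetch_manifold(question_id: str) -> float:
    raise NotImplementedError  # plain HTTP GET against the public REST API

@register("metaculus")
def fetch_metaculus(question_id: str) -> float:
    raise NotImplementedError  # browser automation to render the JS-only page

def market_price(platform: str, question_id: str) -> float:
    if platform not in FETCHERS:
        raise ValueError(f"no fetcher registered for {platform!r}")
    return FETCHERS[platform](question_id)
```

The caller never cares whether a price arrived via REST, a headless browser, or an authenticated scrape.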
The difficulty-adjusted Brier Index — the leaderboard’s primary metric — uses a two-way fixed effects regression across all models and questions. You cannot compute it from a single model’s forecasts, which is the main limitation of any single-model replication.
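The idea can be sketched as a dummy-variable regression, brier ≈ model effect + question-difficulty effect, where the question effects absorb difficulty. This is illustrative only; the leaderboard's exact estimator may differ:

```python
import numpy as np

def fixed_effects_scores(brier, model_idx, question_idx, n_models, n_questions):
    """Two-way fixed effects: per-model effects with question difficulty partialled out."""
    n = len(brier)
    # Design matrix: intercept + model dummies + question dummies
    # (one dummy dropped per group to avoid collinearity)
    X = np.zeros((n, 1 + (n_models - 1) + (n_questions - 1)))
    X[:, 0] = 1.0
    for i in range(n):
        if model_idx[i] > 0:
            X[i, model_idx[i]] = 1.0
        if question_idx[i] > 0:
            X[i, n_models - 1 + question_idx[i]] = 1.0
    coef, *_ = np.linalg.lstsq(X, np.asarray(brier, float), rcond=None)
    model_effects = np.concatenate([[0.0], coef[1:n_models]])  # baseline model = 0
    return model_effects - model_effects.mean()  # centre for comparability
```

With only one model's forecasts, the model dummies are constant and the regression cannot separate model skill from question difficulty, which is exactly why the Index needs the full panel.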
References
- Karger, E., et al. (2025). “ForecastBench: A Dynamic Benchmark of AI Forecasting Capabilities.”
- Halawi, D., et al. (2024). “Approaching Human-Level Forecasting with Language Models.” arXiv:2402.18563.