
BrierBench

Mar 2026

Can language models predict the future?

Not literally, of course. But prediction markets offer a compelling test: thousands of real-world questions with binary outcomes, crowd-calibrated probabilities, and — crucially — answers that haven’t happened yet. A model cannot have memorised something that hasn’t occurred.

ForecastBench exploits this. Every two weeks it samples 500 questions from Manifold, Polymarket, Metaculus, and INFER, runs ~25 models, and scores them against ground truth. The primary metric is the Brier score: (f − o)^2, where f is the forecast and o is the outcome. Lower is better; 0.25 means you always said 50/50.
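In code, the per-question score is a one-liner (a minimal sketch, not ForecastBench's actual scoring code):

```python
def brier(forecast: float, outcome: int) -> float:
    """Brier score for one binary question: (f - o)^2, lower is better.
    `forecast` is a probability in [0, 1]; `outcome` is 0 or 1."""
    return (forecast - outcome) ** 2

# Always answering 50/50 scores 0.25 whichever way the question resolves;
# a confident correct forecast scores near 0, a confident wrong one near 1.
```

A benchmark score is then just the mean of this over all resolved questions.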

I wanted to know two things. First, whether the benchmark is reproducible — can an independent pipeline, using OpenRouter as a unified API gateway, match the official leaderboard? Second, how do frontier models actually differ in their forecasting behaviour?

Replicating the baseline

Using Gemini-2.5-Flash across all 24 historical question sets (18,484 questions, 22,829 matched forecasts), the raw Brier scores match closely:

Metric           Ours     Leaderboard   Delta
Dataset Brier    0.185    0.170         +0.015
Market Brier     0.132    0.146         -0.014
Overall Brier    0.181    0.158         +0.023

Cohen’s d = 0.12 — a negligible effect size. The replication works.
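For readers who want to check the effect-size claim: Cohen's d compares the two sets of per-question Brier scores using a pooled standard deviation. A sketch (the arrays here are illustrative, not the real score vectors):

```python
import numpy as np

def cohens_d(a: np.ndarray, b: np.ndarray) -> float:
    """Cohen's d between two samples, using the pooled standard deviation.
    |d| < 0.2 is conventionally read as a negligible effect."""
    na, nb = len(a), len(b)
    pooled_sd = np.sqrt(
        ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    )
    return (a.mean() - b.mean()) / pooled_sd
```

Applied to the two per-question Brier vectors, a d of 0.12 falls well under the 0.2 "small effect" threshold.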

[Figure: Raw Brier score comparison for Gemini-2.5-Flash, our replication vs the ForecastBench leaderboard, with 95% bootstrap confidence intervals.]

One important caveat: we forecast retroactively (calling models in March 2026 on questions from 2024–2026). The model could have indirect knowledge of some outcomes. That the scores still match suggests retroactive knowledge has minimal impact on aggregate accuracy.

[Figure: Per-question-set Brier scores over time. Recent sets trend higher because fewer of their questions have resolved, biasing the sample toward longer-horizon forecasts.]

Forecasting personalities

The more interesting finding came from benchmarking three frontier models against live market prices from all four prediction platforms (247/250 questions matched, 98.8% coverage).

Model            Mean |delta|   Median |delta|   Bias     Higher/Lower
GPT-5.4          0.220          0.154            -0.036   100/113
Gemini-3.1-Pro   0.224          0.140            -0.083   72/142
Grok-4.20-beta   0.274          0.200            +0.061   118/97
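The columns above reduce to a few summary statistics over (model forecast, market price) pairs. A minimal sketch of how they can be computed (the helper and its input format are my own, not the benchmark's code):

```python
import statistics

def deviation_stats(pairs):
    """Summarise model forecasts against market prices.
    `pairs` is a list of (model_forecast, market_price) tuples in [0, 1]."""
    deltas = [f - m for f, m in pairs]
    return {
        "mean_abs_delta": statistics.mean(abs(d) for d in deltas),
        "median_abs_delta": statistics.median(abs(d) for d in deltas),
        "bias": statistics.mean(deltas),     # positive = model above market
        "higher": sum(d > 0 for d in deltas),  # questions where model > market
        "lower": sum(d < 0 for d in deltas),   # questions where model < market
    }
```

Bias keeps the sign of each deviation, so systematic scepticism (Gemini) and systematic boldness (Grok) show up even when their absolute deviations are similar.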

Each model has a distinct personality:

GPT-5.4 is a consensus tracker. Its forecasts cluster near market prices, with the smallest median deviation. It rarely strays far from the crowd.

Gemini-3.1-Pro is conservative. It systematically predicts lower probabilities than the market — a negative bias of -0.083 means it’s consistently more sceptical.

Grok-4.20-beta is the contrarian. The highest mean deviation, a positive bias, and a fondness for extreme probabilities (92%, 95%). It is willing to disagree with the crowd.

[Figure: Model forecast vs live market price for the three frontier models. The diagonal represents perfect agreement with the crowd.]

Whether contrarianism helps depends on whether the crowd is well-calibrated. On ForecastBench, the superforecaster median achieves a Brier Index of 70.8 — substantially better than any LLM (best: 64.2). Extreme disagreement with markets is, on average, wrong.

The pipeline

The full replication runs all 24 question sets for ~$10 (Gemini-2.5-Flash) in under an hour at 100 concurrent API calls. Deterministic disk caching means re-runs are free.
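Deterministic caching is what makes re-runs free: each (model, prompt) pair hashes to a file on disk, and a hit skips the API entirely. A sketch of the idea, assuming a hypothetical `call_fn` that wraps the OpenRouter request:

```python
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path(".cache")  # hypothetical cache location

def cached_call(model: str, prompt: str, call_fn):
    """Disk cache keyed on a hash of (model, prompt).
    First call hits the API via `call_fn`; identical re-runs read from disk."""
    CACHE_DIR.mkdir(exist_ok=True)
    key = hashlib.sha256(json.dumps([model, prompt]).encode()).hexdigest()
    path = CACHE_DIR / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())
    result = call_fn(model, prompt)
    path.write_text(json.dumps(result))
    return result
```

Because the key is a pure function of the inputs, the cache is safe under 100-way concurrency as long as writes of the same key are idempotent, which JSON-serialised responses are.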

Market prices come from four platforms via heterogeneous methods: Manifold and Polymarket have public REST APIs; Metaculus hides its aggregated probabilities behind JavaScript (requiring browser automation); INFER needs authenticated scraping.
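Whatever the retrieval method, the four payloads end up normalised into a single probability per question. A sketch of that normalisation layer; the field names below are illustrative placeholders, not the platforms' actual response schemas:

```python
def extract_probability(platform: str, payload: dict) -> float:
    """Map a platform-specific response to one probability in [0, 1].
    Field names are assumptions for illustration, not real API schemas."""
    if platform == "manifold":
        return float(payload["probability"])
    if platform == "polymarket":
        # e.g. an outcome-price list as strings; take the YES price
        return float(payload["outcomePrices"][0])
    if platform == "metaculus":
        # scraped from the rendered page rather than a JSON API
        return float(payload["community_prediction"])
    if platform == "infer":
        return float(payload["crowd_forecast"])
    raise ValueError(f"unknown platform: {platform}")
```

Funnelling all four sources through one function keeps the scoring code oblivious to how ugly the retrieval was.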

The difficulty-adjusted Brier Index — the leaderboard’s primary metric — uses a two-way fixed effects regression across all models and questions. You cannot compute it from a single model’s forecasts, which is the main limitation of any single-model replication.
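To make that limitation concrete: the two-way fixed effects model decomposes each observed score as brier[m, q] ≈ skill[m] + difficulty[q], which only identifies both effects when many models answer the same questions. A least-squares sketch of the idea (my own simplification; the official Brier Index adds further normalisation):

```python
import numpy as np

def fixed_effects(briers: np.ndarray):
    """Decompose a (models x questions) Brier matrix into model-skill and
    question-difficulty effects via least squares:
        brier[m, q] ~= skill[m] + difficulty[q]
    The first question's difficulty is pinned to 0 to avoid collinearity."""
    n_m, n_q = briers.shape
    rows, y = [], []
    for m in range(n_m):
        for q in range(n_q):
            x = np.zeros(n_m + n_q - 1)
            x[m] = 1.0              # model dummy
            if q > 0:
                x[n_m + q - 1] = 1.0  # question dummy (first one dropped)
            rows.append(x)
            y.append(briers[m, q])
    coef, *_ = np.linalg.lstsq(np.array(rows), np.array(y), rcond=None)
    skill = coef[:n_m]
    difficulty = np.concatenate([[0.0], coef[n_m:]])
    return skill, difficulty
```

With a single model, skill and difficulty are perfectly confounded: any score can be blamed on either term, so the regression has nothing to separate them with.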
