Why a 70% Pick Should Win 70% of the Time
A betting model can pick the right side again and again and still lose you money. The reason is calibration: whether the probabilities it prints are honest. Here's what calibration is, how to measure it, why it decides whether a number is safe to stake, and how we audit our own in public.
Accuracy and calibration are not the same thing
Two models can both go 55-45 picking winners and be worlds apart. The first says every pick is a coin-flip-plus, around 55%, and those picks win 55% of the time. The second calls everything 80% and wins the same 55%. Same accuracy. The first is calibrated; the second is lying to you about how sure it is.
Accuracy asks "did it pick the right side?" Calibration asks "when it said 70%, did those happen 70% of the time?" You can ace one and fail the other. A model that ranks teams perfectly but exaggerates every margin will look sharp on a results page and quietly destroy a bankroll, because the part you actually bet on — the probability — is wrong.
Why calibration is the number you stake on
Every responsible staking method, from full Kelly to the quarter-Kelly most sane bettors use, takes the model's probability as a direct input. Bet size scales with how far the model's number sits above the probability implied by the market's price. So the probability isn't decoration; it's the throttle on your bankroll.
Feed that formula an honest 58% and you bet a sensible amount. Feed it an overconfident 70% on the same game and you bet far more than the edge justifies. Do that across a season and the overconfident model loses money on picks that were often correct — it simply bet too much on each one. This is why a calibrated 58% is worth more than an overconfident 70%: not because it wins more, but because you size it right.
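To make the sizing argument concrete, here is a minimal sketch of Kelly sizing in Python. The 1.91 decimal price is a hypothetical stand-in for a standard −110 line, and the function names are illustrative, not taken from any library.

```python
def kelly_fraction(p: float, decimal_odds: float) -> float:
    """Full-Kelly fraction of bankroll for win probability p at decimal odds."""
    b = decimal_odds - 1.0        # net profit per unit staked on a win
    f = (b * p - (1.0 - p)) / b   # classic Kelly formula: (b*p - q) / b
    return max(f, 0.0)            # no edge means no bet


def quarter_kelly(p: float, decimal_odds: float) -> float:
    """Quarter-Kelly: stake a quarter of the full-Kelly fraction."""
    return 0.25 * kelly_fraction(p, decimal_odds)


odds = 1.91  # hypothetical price, roughly -110 in American odds
honest = quarter_kelly(0.58, odds)
overconfident = quarter_kelly(0.70, odds)

print(f"honest 58%:        {honest:.1%} of bankroll")
print(f"overconfident 70%: {overconfident:.1%} of bankroll")
```

At that price, quarter-Kelly stakes about 3% of bankroll on the honest 58% and about 9% on the overconfident 70%: roughly three times the money on the same game.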
The calibration gap: one number, honestly reported
To measure calibration you bucket every resolved pick by what the model predicted, then compare the average prediction in each bucket to what actually happened. The calibration gap is the actual hit rate minus the average prediction, in percentage points.
Negative means overconfident (the picks won less than promised). Positive means underconfident (the model was hedging). A gap within roughly ±2 points counts as well calibrated. Here's the shape of a real, imperfect model: decent in the middle, overconfident at the top.
| Predicted bucket | Avg predicted | Actual hit rate | Calibration gap |
|---|---|---|---|
| 50–55% | 52.6% | 48.3% | −4.3 pt |
| 60–65% | 63.0% | 63.5% | +0.5 pt |
| 70–75% | 71.9% | 45.5% | −26.4 pt |
Read the last row. Picks the model loved at ~72% won less than half the time. That bucket is where an overconfident model does its damage, because those are exactly the picks it stakes the most on. The fix isn't more features — it's confidence, pulled back to match reality.
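The bucketing behind a table like that one can be sketched in a few lines of Python. This is an illustration under stated assumptions, not our production code: the `picks` list is made-up data, the 5-point bucket width is a choice, and `calibration_gaps` is a hypothetical helper name.

```python
from collections import defaultdict


def calibration_gaps(picks, bucket_width=0.05):
    """picks: iterable of (predicted_probability, won) pairs.

    Returns {bucket_lower_bound: (n, avg_predicted, hit_rate, gap)},
    where gap = hit_rate - avg_predicted (negative means overconfident).
    """
    buckets = defaultdict(list)
    for p, won in picks:
        # Floor each prediction to its bucket's lower bound; the tiny
        # epsilon guards against float error at exact boundaries like 0.70.
        lower = int(p / bucket_width + 1e-9) * bucket_width
        buckets[round(lower, 2)].append((p, 1 if won else 0))

    report = {}
    for lower in sorted(buckets):
        rows = buckets[lower]
        avg_pred = sum(p for p, _ in rows) / len(rows)
        hit_rate = sum(w for _, w in rows) / len(rows)
        report[lower] = (len(rows), avg_pred, hit_rate, hit_rate - avg_pred)
    return report


# Made-up resolved picks, purely for illustration.
picks = [
    (0.72, True), (0.71, False), (0.73, False), (0.74, False),
    (0.62, True), (0.63, True), (0.64, False),
]
report = calibration_gaps(picks)
for lower, (n, avg_pred, hit_rate, gap) in report.items():
    print(f"{lower:.2f}+  n={n}  predicted={avg_pred:.1%}  "
          f"actual={hit_rate:.1%}  gap={gap * 100:+.1f} pt")
```

With only a handful of picks per bucket the gap is mostly noise, so in practice you would also want to look at the count `n` before reading anything into a row.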
The reliability curve and the Brier score
Plot predicted probability on the horizontal axis and actual hit rate on the vertical axis and you get a reliability curve. A perfectly calibrated model traces the diagonal. Points above the line mean underconfident (the picks won more often than predicted); points below mean overconfident. One glance tells you whether to trust the model's numbers.
If you want a single summary number, the Brier score is the standard: the average squared error between each prediction and its outcome (1 for a win, 0 for a loss). Lower is better. A forecast of 50% on every pick scores exactly 0.25; genuinely sharp forecasting lives in the 0.18–0.21 range. Brier rewards being both right and honest about your confidence, which is exactly the pair of traits you want.
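The Brier score is simple enough to compute by hand. As a small sketch, the snippet below scores the two 55-45 models from the opening example, plus a flat 50% forecast for reference. The `brier_score` helper is an illustrative name, not a library function.

```python
def brier_score(picks):
    """Mean squared error between predicted probability and outcome (1 or 0)."""
    picks = list(picks)
    return sum((p - (1.0 if won else 0.0)) ** 2 for p, won in picks) / len(picks)


record = [True] * 55 + [False] * 45  # the 55-45 record from the earlier example

calibrated = brier_score((0.55, won) for won in record)     # says 55%, wins 55%
overconfident = brier_score((0.80, won) for won in record)  # says 80%, wins 55%
coin_flip = brier_score((0.50, won) for won in record)      # says 50% every time

print(f"calibrated 55%:    {calibrated:.4f}")
print(f"overconfident 80%: {overconfident:.4f}")
print(f"flat 50%:          {coin_flip:.4f}")
```

Same picks, same accuracy, but the overconfident model scores 0.31 against 0.2475 for the calibrated one. It is worse than a flat 50% forecast, because the score punishes the confidence it didn't earn.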
How you fix a model that's overconfident
The instinct is to add data. The fix is almost always the opposite — shrink the confidence toward 50% by the amount the history says it was overstated:
- A per-sport bias offset — if a sport's favorites have run, say, 11 points hot, pull every probability in that sport toward 50% to match.
- Platt scaling — refit a sigmoid on resolved results so the model's raw output maps onto the rate that actually occurred.
- Per-band shrink — target the specific confidence tier that keeps over-promising (that 70–75% row above) instead of flattening everything.
- Benching — when a market's measured gap is catastrophic and no amount of shrink recovers it, stop staking it until the gap comes back. Chasing a broken market with bigger corrections is how models dig deeper holes.
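The first three corrections can be sketched in plain Python. Treat this as a rough illustration under assumptions, not a description of our pipeline: the history is made up, the function names are hypothetical, and the Platt-style fit here takes the log-odds of the model's probability as its input score and uses bare gradient descent where a real system would likely use a library's logistic regression.

```python
import math


def offset_toward_half(p, offset):
    """Move p toward 0.5 by `offset` (probability units), never crossing 0.5."""
    if p >= 0.5:
        return max(0.5, p - offset)
    return min(0.5, p + offset)


def per_band_offset(p, lo, hi, offset):
    """Apply the offset only to predictions inside the band [lo, hi)."""
    return offset_toward_half(p, offset) if lo <= p < hi else p


def _logit(p, eps=1e-6):
    p = min(max(p, eps), 1.0 - eps)  # keep log() finite at 0 and 1
    return math.log(p / (1.0 - p))


def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_platt(picks, steps=5000, lr=0.1):
    """Fit p_cal = sigmoid(a * logit(p) + b) by gradient descent on log loss."""
    data = [(_logit(p), 1.0 if won else 0.0) for p, won in picks]
    a, b = 1.0, 0.0  # start from the identity mapping
    n = len(data)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in data:
            err = _sigmoid(a * x + b) - y  # gradient of log loss w.r.t. the logit
            grad_a += err * x
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def apply_platt(p, a, b):
    """Map a raw probability through the fitted sigmoid."""
    return _sigmoid(a * _logit(p) + b)


# Made-up history: picks called at 80% won 55 of 100, picks at 60% won 52 of 100.
history = [(0.80, i < 55) for i in range(100)] + [(0.60, i < 52) for i in range(100)]
a, b = fit_platt(history)

print(f"bias offset, 11 pt:  0.72 -> {offset_toward_half(0.72, 0.11):.2f}")
print(f"band offset, 70-75%: 0.72 -> {per_band_offset(0.72, 0.70, 0.75, 0.10):.2f}")
print(f"Platt:               0.80 -> {apply_platt(0.80, a, b):.2f}")
```

The design trade-off is bluntness against data hunger. A flat offset needs very little history but treats every pick alike, while the fitted sigmoid adapts to each confidence level and needs enough resolved picks to be trusted.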
How we audit our own calibration
None of this is theoretical for us. Our model health page publishes the reliability curve, the per-bucket calibration gap, and the Brier score across every resolved pick — wins and losses both. When a tier runs overconfident, we flag it in plain sight rather than burying it; when a market's gap goes catastrophic, we bench it and say so. The point of putting the calibration in the open is simple: a probability you can't audit is a probability you shouldn't stake.