Field Guide · Reading a Model

Why a 70% Pick Should Win 70% of the Time

A betting model can pick the right side again and again and still lose you money. The reason is calibration: whether the probabilities it prints are honest. Here's what calibration is, how to measure it, why it decides whether a number is safe to stake, and how we audit our own in public.

By Lakeshore Edge · 8 min read

TL;DR: A calibrated model's 70% picks win about 70% of the time. Calibration is separate from accuracy: a model can pick winners well and still be chronically overconfident. It matters because stake sizing takes the probability literally, so an overstated 70% gets over-staked and bleeds even when the picks are "right." Measure it with the calibration gap (actual hit rate minus average predicted) and the Brier score, fix it by shrinking toward 50%, and when a market is hopeless, bench it.

Accuracy and calibration are not the same thing

Two models can both go 55-45 picking winners and be worlds apart. The first says every pick is a coin-flip-plus, around 55%, and those picks win 55% of the time. The second calls everything 80% and wins the same 55%. Same accuracy. The first is calibrated; the second is lying to you about how sure it is.

Accuracy asks "did it pick the right side?" Calibration asks "when it said 70%, did those happen 70% of the time?" You can ace one and fail the other. A model that ranks teams perfectly but exaggerates every margin will look sharp on a results page and quietly destroy a bankroll, because the part you actually bet on — the probability — is wrong.
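
To make the two 55-45 models concrete, here is a minimal Python sketch. The function names and the 100-pick record are illustrative assumptions, and the gap is measured as actual hit rate minus average predicted probability.

```python
def accuracy(outcomes):
    """Share of picks that won (1 = win, 0 = loss)."""
    return sum(outcomes) / len(outcomes)


def calibration_gap(predictions, outcomes):
    """Actual hit rate minus average predicted probability."""
    hit_rate = sum(outcomes) / len(outcomes)
    avg_predicted = sum(predictions) / len(predictions)
    return hit_rate - avg_predicted


# Hypothetical record: both models made the same 100 picks and went 55-45.
outcomes = [1] * 55 + [0] * 45
model_a = [0.55] * 100  # says "about 55%" every time
model_b = [0.80] * 100  # says "80%" every time

print("accuracy (both models):", accuracy(outcomes))
print("gap, model A:", round(calibration_gap(model_a, outcomes), 3))
print("gap, model B:", round(calibration_gap(model_b, outcomes), 3))
```

Both models score 0.55 on accuracy. The gap is roughly zero for the first and roughly −0.25 for the second, which is the 25 points of confidence it never earned.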

Why calibration is the number you stake on

Every responsible staking method, from full Kelly to the quarter-Kelly most sane bettors use, takes the model's probability as a direct input. Bet size scales with how far the model's number sits above the market's price. So the probability isn't decoration — it's the throttle on your bankroll.

stake ∝ ( model probability − market price )

Feed that formula an honest 58% and you bet a sensible amount. Feed it an overconfident 70% on the same game and you bet far more than the edge justifies. Do that across a season and the overconfident model loses money on picks that were often correct — it simply bet too much on each one. This is why a calibrated 58% is worth more than an overconfident 70%: not because it wins more, but because you size it right.
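
Here is a rough sketch of how that plays out under fractional Kelly. The 55% market price is a made-up example, the function name is ours, and the formula assumes the price is a clean implied probability with no bookmaker margin.

```python
def kelly_fraction(model_prob, market_price, multiplier=0.25):
    """Fraction of bankroll to stake on one pick.

    market_price is the implied probability of the bet (1 / decimal odds),
    assumed here to carry no bookmaker margin. multiplier=0.25 is quarter-Kelly.
    """
    if not 0.0 < market_price < 1.0:
        raise ValueError("market_price must be strictly between 0 and 1")
    edge = model_prob - market_price
    if edge <= 0:
        return 0.0  # no edge over the price, no bet
    # Full Kelly at implied price q is (p - q) / (1 - q); scale it down.
    return multiplier * edge / (1.0 - market_price)


# Same game, same hypothetical 55% market price, two different model numbers.
honest = kelly_fraction(0.58, 0.55)
overconfident = kelly_fraction(0.70, 0.55)

print(f"honest 58%:        {honest:.1%} of bankroll")         # about 1.7%
print(f"overconfident 70%: {overconfident:.1%} of bankroll")  # about 8.3%
```

In this example the overconfident number stakes five times as much on the same game. If the true probability is 58%, four-fifths of that stake is backed by edge that doesn't exist.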

The calibration gap: one number, honestly reported

To measure calibration you bucket every resolved pick by what the model predicted, then compare the average prediction in each bucket to what actually happened. The difference is the calibration gap.

calibration gap = actual hit rate − average predicted

Negative means overconfident (the picks won less often than promised). Positive means underconfident (the model was hedging). A gap within roughly ±2 points counts as well calibrated. Here's the shape of a real, imperfect model, decent in the middle and overconfident at the top:

| Predicted bucket | Avg predicted | Actual hit rate | Calibration gap |
| --- | --- | --- | --- |
| 50–55% | 52.6% | 48.3% | −4.3 pt |
| 60–65% | 63.0% | 63.5% | +0.5 pt |
| 70–75% | 71.9% | 45.5% | −26.4 pt |

Read the last row. Picks the model loved at ~72% won less than half the time. That bucket is where an overconfident model does its damage, because those are exactly the picks it stakes the most on. The fix isn't more features — it's confidence, pulled back to match reality.
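
If you want to build a table like this from your own pick history, the bucketing can be sketched in a few lines of Python. The function name, the 5-point bucket width, and the toy picks below are assumptions for illustration, not our production code or real results.

```python
from collections import defaultdict


def calibration_table(predictions, outcomes, width=0.05):
    """Bucket resolved picks by predicted probability and report each bucket's gap.

    predictions: model probabilities between 0 and 1
    outcomes:    1 for a win, 0 for a loss, in the same order
    """
    buckets = defaultdict(list)
    for p, won in zip(predictions, outcomes):
        # round() guards against float noise such as 0.7 / 0.05 = 13.999...
        index = int(round(p / width, 9))
        buckets[index].append((p, won))

    rows = []
    for index in sorted(buckets):
        picks = buckets[index]
        avg_predicted = sum(p for p, _ in picks) / len(picks)
        hit_rate = sum(won for _, won in picks) / len(picks)
        rows.append({
            "bucket_low": index * width,
            "n": len(picks),
            "avg_predicted": avg_predicted,
            "hit_rate": hit_rate,
            "gap": hit_rate - avg_predicted,  # negative = overconfident
        })
    return rows


# Toy data only: two picks near 52%, four picks at 72%.
toy_predictions = [0.52, 0.53, 0.72, 0.72, 0.72, 0.72]
toy_outcomes = [1, 0, 1, 0, 0, 0]
rows = calibration_table(toy_predictions, toy_outcomes)

for row in rows:
    print(f"{row['bucket_low']:.0%}+  n={row['n']}  gap={row['gap'] * 100:+.1f} pt")
```

A handful of picks per bucket tells you nothing. The gap only becomes meaningful once each bucket holds enough resolved picks that a few lucky or unlucky results can't swing it.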

The honest tell: A model with positive closing-line value but a negative calibration gap is finding real edges and overstating their size. That's a calibration problem, not an edge problem, and it's fixable without touching the part that works.

The reliability curve and the Brier score

Plot predicted probability on the horizontal axis and actual hit rate on the vertical axis and you get a reliability curve. A perfectly calibrated model traces the diagonal. Points above the line mean underconfident (the picks won more often than predicted); points below mean overconfident. One glance tells you whether to trust the model's numbers.

If you want a single summary number, the Brier score is the standard: the average squared error between each prediction and its outcome (1 for a win, 0 for a loss). Lower is better. Calling every game 50% scores exactly 0.25; genuinely sharp forecasting lives in the 0.18–0.21 range. Brier rewards being both right and honest about your confidence, which is exactly the pair of traits you want.
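
The definition is short enough to write out directly. This is a minimal sketch with a function name of our choosing and a hypothetical 55-45 record, the same shape as the two models compared earlier.

```python
def brier_score(predictions, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    if not predictions or len(predictions) != len(outcomes):
        raise ValueError("need equal-length, non-empty inputs")
    squared_errors = [(p - o) ** 2 for p, o in zip(predictions, outcomes)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical record: 100 picks, 55 wins and 45 losses.
record = [1] * 55 + [0] * 45

coin_flip = brier_score([0.50] * 100, record)        # 0.25
honest_55 = brier_score([0.55] * 100, record)        # about 0.2475
overconfident = brier_score([0.80] * 100, record)    # about 0.31

print(round(coin_flip, 4), round(honest_55, 4), round(overconfident, 4))
```

Same picks, same winners, yet the model that shouted 80% scores worse than a coin flip, because Brier charges for every point of unearned confidence.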

How you fix a model that's overconfident

The instinct is to add data. The fix is almost always the opposite: shrink the confidence toward 50% by the amount the history says it was overstated.
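
One simple way to do that is linear shrinkage, sketched below. Treat it as an illustration under assumptions rather than a description of our exact method: the function names, the way the shrink factor is estimated from a single bucket, and the 70%/60% example are all ours for the sake of the sketch.

```python
def shrink_factor(avg_predicted, hit_rate):
    """Share of a bucket's stated confidence (distance above 50%) that history supports.

    Clamped to [0, 1]: 1 means keep the number as is, 0 means treat it as a coin flip.
    """
    stated = avg_predicted - 0.5
    if stated <= 0:
        raise ValueError("this sketch expects buckets predicted above 50%")
    return max(0.0, min(1.0, (hit_rate - 0.5) / stated))


def shrink_toward_half(raw_prob, k):
    """Pull a raw probability toward 50% by the factor k."""
    return 0.5 + k * (raw_prob - 0.5)


# Hypothetical bucket: the model said 70% on average, the picks won 60%.
k = shrink_factor(0.70, 0.60)
print(round(k, 3), round(shrink_toward_half(0.70, k), 3))  # 0.5 0.6

# The 70-75% row from the table above: predicted 71.9%, actual 45.5%.
k_top = shrink_factor(0.719, 0.455)
print(k_top, shrink_toward_half(0.72, k_top))  # 0.0 0.5
```

The second case is worth a look. When a bucket wins less than half the time, the factor bottoms out at zero and every pick in it collapses to 50%, which leaves nothing to stake. That is the arithmetic behind benching a market.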

How we audit our own calibration

None of this is theoretical for us. Our model health page publishes the reliability curve, the per-bucket calibration gap, and the Brier score across every resolved pick — wins and losses both. When a tier runs overconfident, we flag it in plain sight rather than burying it; when a market's gap goes catastrophic, we bench it and say so. The point of putting the calibration in the open is simple: a probability you can't audit is a probability you shouldn't stake.

See our calibration, live
The model page shows the reliability curve, every confidence bucket's gap, the Brier score, and the markets we've benched — updated as games settle.
Sports betting carries real financial risk. Past performance does not guarantee future results. This article is educational and is not betting advice. Bet responsibly and only with money you can afford to lose. If gambling is causing harm, visit ncpgambling.org or call 1-800-GAMBLER.