The CIA was "Probably" Right
Apr 2026
In 1951, the CIA produced an intelligence estimate that a Soviet invasion of Yugoslavia was a “serious possibility.” Some time later, the chairman of the State Department’s Policy Planning Staff asked Sherman Kent, one of the estimate’s authors, what odds that phrase was meant to convey.
Kent told him 65% in favour of an attack. The chairman went pale. He’d read “serious possibility” as meaning something much lower. So had his colleagues.
Alarmed, Kent polled every member of the Board of National Estimates — the people who had actually written the estimate and approved that wording. Each had a different number in mind. The lowest was 20%. The highest was 80%.
Kent spent the next decade trying to fix this. In 1964 he published a now-declassified paper, “Words of Estimative Probability,” proposing a standardised scale: “almost certain” should mean 87–99%, “probable” should mean 63–87%, and so on down the line. It was sensible, precise, and the CIA did not adopt it.
But the problem didn’t go away just because nobody wanted to solve it. In the 1970s, researchers at Decisions and Designs, Inc. decided to test Kent’s concern empirically. They rounded up 23 NATO military intelligence officers1 and gave them sentences like “It is highly likely that…” — each built around one of 16 probability phrases — and asked them to write down the percentage they understood each phrase to mean.
The researchers then did something fateful. They plotted all 23 responses for each of 16 phrases, overlaid Kent’s proposed ranges as shaded bars, and published the whole thing as a single chart in their 1977 Handbook for Decision Analysis.
Barclay et al. (1977). The dots are individual officer responses; the shaded bars are Kent’s proposed ranges. Note the crosshatching that makes dots hard to count.
That chart went everywhere. Richards Heuer’s Psychology of Intelligence Analysis, the CIA’s foundational tradecraft text. Pherson and Pherson’s Critical Thinking for Strategic Intelligence, a standard university textbook. Risk management handbooks. Defence Department training materials. Hundreds of blog posts, conference slides, and LinkedIn thought pieces. If you’ve ever encountered the argument that verbal probability language is dangerously ambiguous, you have almost certainly seen some version of this chart.
Two charts walk into a bar
I noticed the problem by accident. I was comparing the 1977 Barclay original with a 2013 redrawing from the Pherson textbook. Same data, supposedly. Same 23 officers, same 16 phrases, same dots on the same axis.
Except “We Believe” looked completely different.
So I did something apparently radical, which is that I checked. I digitised three published versions of the chart — Barclay (1977), Heuer (1999), and Pherson (2013) — extracting every dot and recording its position.2
“We Believe” is the most dramatic case, but the problems are everywhere. The number of dots per statement varies from 16 to 23 across the three versions. It should always be 23. That’s how many officers took the survey. But the dots are drawn on top of a shading pattern used for Kent’s proposed ranges, and depending on who was squinting at the graphic, different dots got lost or invented.
The audit nobody read
I was feeling fairly pleased with myself until I discovered that Edmund Conrow, a risk management specialist, had noticed the same thing a decade earlier. In 2010 he presented an audit of the chart.3
Here is what Conrow found:
- The raw data is gone. Nobody published the actual numbers the 23 officers wrote down, and the original records have been lost. Conrow contacted two of the co-authors of the 1977 report and asked them to look. They couldn’t find anything. Every version of this chart that exists is someone’s attempt to read dots off a graphic.
- The Kent bars are wrong. Conrow checked the shaded ranges against Kent’s actual 1964 tables. Ten of twelve bars had one or both bounds incorrect. Errors ranged up to 25 percentage points. One bar — “highly unlikely” — shouldn’t exist at all, because Kent never proposed a range for that phrase. Someone in 1977 apparently made it up.
Each republication then introduced its own new problems.4
So was Kent actually right?
The chart is junk. But the claim it’s used to support — that people disagree about what probability words mean — might still be true.
A broken thermometer doesn’t mean the room isn’t hot.
To find out, you need clean data. I surveyed 99 English-speaking respondents with undergraduate or higher education, each shown 25 probability phrases in randomised order.5 The original 16 from Barclay plus nine more, including “frequent,” “realistic possibility,” “rare,” and “occasional.” No scenario context. Bare phrases and a slider.
I wasn’t the only one keen on new data: a Reddit user called Zonination had independently run a survey through /r/samplesize in 2015, with 46 respondents and 17 phrases.6
The headline: Kent was right. He’s been right for fifty years.
Median estimates (%) by dataset:

| Phrase | NATO officers, 1970s | Zonination, 2015 | Hails, 20267 |
|---|---|---|---|
| Almost Certainly | ~85 | ~90 | ~90 |
| Probable | ~70 | ~70 | ~70 |
| Likely | ~75 | ~70 | ~75 |
| We Believe | ~70 | ~70 | ~70 |
| About Even | ~50 | ~50 | ~50 |
| Probably Not | ~20 | ~30 | ~20 |
| Unlikely | ~15 | ~20 | ~15 |
“Probable” has meant about 70% for half a century. NATO officers, Zonination’s respondents, and mine all converge on roughly the same number. The means are remarkably stable. Whatever people think “probable” means, they’ve been thinking it consistently since the Cold War.
But the variation Kent worried about hasn’t gone away either.
“About Even” is easy. Everyone puts it at 50%, give or take a couple of points. “Coin Flip” — which I included as a control — lands at 50% with almost no spread, because it is, in fact, a coin flip, and people know what coins do.
“We Believe” is a different story. Put ten people in a room and ask what it means; the gap between the most confident and the least is about 40 percentage points. That’s the width of an entire Kent category. It’s as though you asked ten people how far London is from Edinburgh and got answers ranging from Brighton to Inverness.
“Realistic Possibility” — a phrase that appears routinely in modern risk assessments — is just as bad. “Highly Unlikely” is worse. The spread is so wide that one reader’s “highly unlikely” is another reader’s “about even.”
There’s a pattern in the mess. Phrases with a mathematical anchor (“about even,” “better than even”) produce tight consensus — everyone clusters within a few points. Phrases that are evaluative or subjective (“we believe,” “realistic possibility,” “we doubt”) scatter across 30 to 40 points. And the low end of the probability scale is worse than the high end. “Highly Likely” produces a spread you could cover with a hand on the number line. “Highly Unlikely” produces a spread you’d need an arm.
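The consensus-versus-scatter contrast above is just a spread statistic computed per phrase. Here is a minimal sketch of that computation; the response lists are invented for illustration (the per-respondent survey data isn’t reproduced in this post):

```python
import statistics

# Hypothetical slider responses (percent), invented purely to illustrate
# the computation -- not the survey's actual data.
responses = {
    "about even": [49, 50, 50, 51, 50, 52, 48, 50, 50, 51],
    "realistic possibility": [20, 35, 60, 45, 25, 55, 40, 30, 65, 50],
}

def spread(values):
    """Return (median, interquartile range, standard deviation)."""
    q1, q2, q3 = statistics.quantiles(values, n=4)  # three quartile cut points
    return q2, q3 - q1, statistics.stdev(values)

for phrase, vals in responses.items():
    med, iqr, sd = spread(vals)
    print(f"{phrase:>22}: median={med:.0f}  IQR={iqr:.1f}  SD={sd:.1f}")
```

With data shaped like this, a mathematically anchored phrase produces a single-digit IQR while an evaluative one spans tens of points, which is the pattern described above.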
This is precisely the wrong way round for intelligence analysis, where the consequential assessments are usually about low-probability, high-impact events. “It is highly unlikely that the adversary will deploy this capability” — that’s a sentence where you really want everyone in the room to agree on what “highly unlikely” means. They don’t. One reader hears 5%. Another hears 30%. That’s the difference between ignoring a threat and scrambling a response.
What broke
The story of this chart is, in miniature, the story of how evidence degrades in transmission.
Kent identified a real problem in 1951. Researchers produced real data in the 1970s. But the data was published as a graphic, not a table. The graphic was hard to read. It combined two unrelated studies without explanation.8 Then people copied it — not because they were careless, but because that’s what you do with a useful chart. You put it in your textbook. You redraw it for your slides. You cite the person who cited the person who cited the original.
Each copy introduced small errors. Dots were lost in the shading. Labels were misspelled. Kent ranges drifted from their correct values. Nobody checked because nobody needed to — the chart had been cited hundreds of times, it appeared in authoritative sources, and it supported a conclusion that was, as it happens, correct.
The conclusion is correct. That’s what makes this more than a gotcha. Five decades and four independent datasets confirm that verbal probability language is genuinely ambiguous. Kent’s proposed fix — pair your words with numbers — still works. Several intelligence agencies now mandate it. The UK’s Defence Intelligence, the US Director of National Intelligence, and parts of NATO all require analysts to attach numerical ranges to their verbal assessments.
But most of the world hasn’t caught up. Corporate risk registers still say “unlikely” without specifying whether that means 5% or 30%. Medical consent forms say side effects are “rare” without noting that “rare” has a standard deviation of 11.3 percentage points in my data. Climate reports say outcomes have a “realistic possibility” without acknowledging that this phrase produces more disagreement among readers than almost any other in the probability lexicon.
The chart everyone cites to make this point is broken. But the point stands, and the evidence for it is now stronger than it has ever been — it comes from datasets most people haven’t seen.
Pair your words with numbers. It’s what Kent said in 1964. It’s still the answer.
If you want to see where you land, the survey is still open. Twenty-five phrases, a slider, three minutes.
Footnotes

1. Ranging from Squadron Leader to Lieutenant General, people whose literal job was reading probability language in intelligence reports.
2. Dot coordinates were extracted from scanned images of each publication. NATO officer medians are estimated from the Barclay digitisation and should be treated as approximate given the grid distortion documented by Conrow (2010). The Zonination 2015 data is from their publicly available GitHub repository (n=46).
3. Edmund Conrow, “Evaluation of Subjective Probability Statements,” AIAA Space 2010 Conference, doi:10.2514/6.2010-8739.
4. Kerzner’s project management textbook — one of the most widely used in the field — mislabelled “almost certainly” as “almost likely.” Heuer added an “about even” row with a Kent range but no survey data, silently blending two studies. Hillson reported mode estimates derived from Boehm, who got them from a Defence Department manual, who got them from Barclay. By the time this game of telephone was over, five of eight mode estimates were wrong.
5. Recruited via Prolific (n=99, English-speaking, undergraduate education or higher). Phrases were presented in randomised order as bare words with a slider response; respondents who failed control questions or produced incoherent/low-effort responses were filtered out. Three control items (coin flip, birthday, lightning) were included but excluded from analysis.
6. The resulting joyplot visualisation won an Information is Beautiful award, which means it has now joined the long chain of reproductions of Kent’s original idea.
7. Medians rounded to the nearest 5. With interquartile ranges of 10–20 points for most phrases, reporting to the nearest integer would be false precision.
8. It also contained grid distortions in the shading, which caused different readers to lose or invent dots when manually transcribing the data.