Scoring standard

Methodology

AI Prophecy Index tracks public AI forecasts against later evidence. It is an independent editorial ledger, not a model, a market forecast, or an official record from the people being tracked.

Explore predictions Read privacy & legal

Revision controlled · 15 source-passage checks

Status model

Resolved outcomes are separated from still-open and too-early claims.

Evidence rule

Source coverage is disclosure, not a guarantee of truth.

Predictions tracked

100%

Resolved hit rate

12/12 resolved; 44 unresolved

Resolved claims

2026-06-04

Last reviewed

Scope: what predictions qualify

The tracker includes public, attributable AI forecasts that can be connected to a date, source, forecaster, and later evidence. A claim is easier to score when it has a concrete time horizon, observable outcome, or falsifiable threshold.

Broad worldview claims, recommendations, and risk arguments may appear only when they are part of the public forecasting record. They are not scored as resolved unless the public evidence is specific enough.

Status model: how outcomes are judged

Confirmed

The public record substantially matches the specific claim being scored.

Incorrect

The claim missed a clear outcome, material condition, or stated timing window.

Still unfolding

The forecast horizon is in progress and relevant evidence is accumulating, but not enough exists yet to resolve the claim.

Too early to score

The horizon has not arrived, or the public record is not specific enough to score.

Resolved hit rate is calculated as confirmed predictions divided by confirmed plus incorrect predictions. Still-unfolding and too-early-to-score claims remain visible in the ledger but are excluded from that rate.
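
The calculation above can be sketched in a few lines. This is a minimal illustration, not the site's actual code: the function name and the lowercase status strings are assumptions, and the split of the 44 unresolved claims between the two unresolved statuses is arbitrary because this page does not state it.

```python
from collections import Counter


def resolved_hit_rate(statuses):
    """Return confirmed / (confirmed + incorrect), or None if nothing is resolved."""
    counts = Counter(statuses)
    resolved = counts["confirmed"] + counts["incorrect"]
    if resolved == 0:
        return None  # no resolved claims, so the rate is undefined
    return counts["confirmed"] / resolved


# Summary figures from this page: 12 confirmed, 0 incorrect, 44 unresolved.
# The 20/24 split of unresolved claims is an arbitrary placeholder.
ledger = ["confirmed"] * 12 + ["still_unfolding"] * 20 + ["too_early"] * 24

rate = resolved_hit_rate(ledger)
print(f"{rate:.0%}")  # 100%
```

Unresolved statuses never enter the numerator or the denominator, so adding or removing them leaves the rate unchanged.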

Claim confidence: how direct is the original forecast?

Claim confidence describes how directly a public statement is phrased as a forecast in the source material. It is not a probability estimate and does not mean the forecast is likely to be correct.

High-confidence entries are usually explicit forecasts or thresholds. Medium-confidence entries are more interpretive but still traceable to the public record. Implied entries should be read with extra caution because the forecast framing is inferred from broader public claims.

Resolution strength: how strong is the evidence?

Confirmed and incorrect statuses are paired with a score-strength label. This separates the outcome from how literally the public evidence matches the forecast. It prevents aggregate or proxy-supported confirmations from looking identical to direct confirmations.

Direct evidence

The public evidence directly matches the scored condition or timing window.

Aggregate evidence

The public evidence confirms the broad threshold through aggregate or combined measures, while a narrower reading remains disclosed.

Partial match

The public evidence confirms a material part of the claim while leaving a narrower clause unresolved or contested.

Proxy-supported

The public evidence supports the claim through a close proxy rather than the exact condition named in the forecast.

Current resolved record: 12 resolved predictions; every resolved prediction has a score-strength label and rationale.
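
The rule that every resolved prediction carries a score-strength label and a rationale could be checked mechanically. The sketch below is illustrative only: the `Decision` class, its field names, and the `missing_strength` helper are assumptions, and the sample rows reuse three entries from the public decision table on this page plus one deliberately incomplete hypothetical row.

```python
from dataclasses import dataclass
from typing import Optional

STRENGTHS = {"Direct evidence", "Aggregate evidence", "Partial match", "Proxy-supported"}
RESOLVED_STATUSES = {"Confirmed", "Incorrect"}


@dataclass
class Decision:
    prediction_id: str
    status: str
    strength: Optional[str] = None
    rationale: str = ""


def missing_strength(decisions):
    """Return IDs of resolved decisions that lack a valid strength label or a rationale."""
    return [
        d.prediction_id
        for d in decisions
        if d.status in RESOLVED_STATUSES
        and (d.strength not in STRENGTHS or not d.rationale.strip())
    ]


decisions = [
    Decision("cs-3", "Confirmed", "Direct evidence", "Multiple public examples."),
    Decision("cs-10", "Confirmed", "Aggregate evidence", "Broad investment pressure."),
    Decision("la-8", "Confirmed", "Partial match", "Model-weight theft less documented."),
    Decision("xx-0", "Confirmed"),          # hypothetical row with no label or rationale
    Decision("xx-1", "Too early to score"),  # hypothetical unresolved row, not checked
]

print(missing_strength(decisions))  # ['xx-0']
```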

Evidence standard

Each prediction card carries source material, references, a last-reviewed date, and an AI-assisted assessment. Primary sources, original forecasts, official releases, technical reports, and public benchmark records are preferred where available.

Reputable reporting and institutional summaries are used when they are the clearest public evidence available.

Statuses can change as new evidence arrives. The ledger is reviewed periodically, not in real time. Each score is an editorial judgement about the public record as of the listed review date.

Evidence map

Resolved cards use a structured evidence map: each evidence item states what part of the score it supports, how directly it supports it, and which numbered references on the card were used. This is separate from the narrative assessment so readers can inspect the support path rather than only trusting the summary.

12/12

Resolved cards mapped

Evidence items

Missing maps

Direct evidence

The linked reference directly supports the evidence sentence.

Aggregate evidence

The linked references support the evidence sentence through combined or aggregate indicators.

Partial support

The linked references support part of the evidence sentence while leaving a narrower clause qualified.

Proxy support

The linked references support a close proxy rather than the exact condition named in the forecast.

Context signal

The linked references provide surrounding context that affects how the score should be read.

This improves auditability but does not replace external source-passage verification. The cited passages themselves should still be spot-checked before release, especially where the support type is aggregate, partial, proxy, or context.
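
One way to picture an evidence-map entry is as a small record holding the three things described above: what part of the score it supports, how directly, and which numbered references were used. The sketch below is an assumption about shape only; the class name, field names, lowercase support-type strings, and reference numbers are hypothetical, and the support type shown for cs-3 is assumed rather than stated on this page.

```python
from dataclasses import dataclass, field

SUPPORT_TYPES = ("direct", "aggregate", "partial", "proxy", "context")


@dataclass
class EvidenceItem:
    prediction_id: str
    item_number: int
    supports: str        # which part of the score this item supports
    support_type: str    # one of SUPPORT_TYPES
    reference_numbers: list = field(default_factory=list)

    def __post_init__(self):
        if self.support_type not in SUPPORT_TYPES:
            raise ValueError(f"unknown support type: {self.support_type}")


def spot_check_queue(items):
    """Put indirect support types first; the stable sort keeps the original order within each group."""
    return sorted(items, key=lambda item: item.support_type == "direct")


items = [
    EvidenceItem("cs-3", 1, "AI systems performing research tasks", "direct", [1, 2]),
    EvidenceItem("la-8", 1, "espionage-prosecution portion of the claim", "partial", [1]),
    EvidenceItem("cs-13", 2, "context for the dual-use pattern", "context", [3]),
]

for item in spot_check_queue(items):
    print(item.prediction_id, item.item_number, item.support_type)
# la-8 1 partial
# cs-13 2 context
# cs-3 1 direct
```

Every item still lands in the queue; the ordering only reflects the extra attention given to aggregate, partial, proxy, and context support.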

Source passage audit

Source-passage verification is the manual review layer above the structured evidence map. Each resolved evidence item has one audit record keyed to the prediction ID, evidence-item number, checked date, verdict, and numbered references that were verified or access-limited.

15/15

Evidence items checked

Fully verified

Access-limited

Verified

The checked public pages support the evidence item without known automated-access gaps.

Verified with access limits

At least one attached reference supports the evidence item, but one or more attached pages could not be cleanly checked in this pass.

Needs revision

The cited passage path does not yet support the evidence item strongly enough for release.

Blocked

The evidence item cannot be verified from the currently attached source path.

This audit checks whether attached references support the public evidence item. It does not imply endorsement by the forecasters, exhaustively re-score unresolved claims, replace legal review, or prove that paywalled and bot-blocked pages are independently accessible to every reader.

Source passage verification records for resolved evidence items
Prediction	Evidence	Verdict	Checked	Passage note
cs-3	#1	Verified	2026-06-04	Sakana, DeepMind, and OpenAI source pages support AI-authored or AI-assisted research outputs in papers, algorithm discovery, and wet-lab protocol optimization.
cs-7	#1	Verified	2026-06-04	Boston Dynamics and Figure support early commercial humanoid deployment and manufacturing scale-up, while DeepMind supports frontier embodied-reasoning progress.
cs-8	#1	Verified	2026-06-04	TSMC capacity reporting and the CRS CHIPS Act overview support the hardware-fabrication bottleneck and state-subsidy portions of the evidence sentence.
cs-10	#1	Verified	2026-06-04	Gartner, CIO, Atlantic Council, Microsoft, and TechCrunch pages support large-scale AI spending, sovereign initiatives, national AI investment pressure, and DeepSeek V4 competitive context. The score remains aggregate because these sources support competitive pressure rather than the literal 10x-1000x economic-growth clause.
cs-13	#1	Verified	2026-06-04	Microsoft Research and Arc Institute pages verify generative biological-design capability and biosecurity-screening concerns. This verifies biological dual-use capability and screening risk without relying on the prior access-limited MIT Technology Review article.
cs-13	#2	Verified	2026-06-04	Anthropic, red.anthropic.com, Hacker News, and Axios pages support autonomous vulnerability discovery, gated Mythos access, wider Glasswing vulnerability findings, and government access disputes. This is context evidence for the same dual-use pattern, not direct biological-pathogen evidence.
la-3	#1	Verified	2026-06-04	Alphabet and Amazon public earnings releases support large cloud and AI-driven revenue scale, while pure AI revenue is not separately broken out. Keep the score strength aggregate because the cited public disclosures do not isolate pure AI revenue.
la-4	#1	Verified	2026-06-04	DCD, OpenAI, and TechCrunch pages support large Stargate GPU/site expansion and additional cash or compute commitments crossing the threshold on aggregate-capital terms. This verifies the aggregate-capital reading, not a single disclosed training run budget.
la-7	#1	Verified	2026-06-04	Gartner supports global AI spending above $1T, and Tom's Hardware supports the $725B hyperscaler capex estimate for 2026. The threshold is supported by broad spending forecasts; the capex-specific source is supporting context for the narrower infrastructure reading.
la-8	#1	Verified	2026-06-04	DOJ and FDD pages support the Chinese AI-related economic-espionage prosecution and theft-of-confidential-AI-technology portion of the claim. This is partial support; it does not prove a broad all-out campaign or full model-weight exfiltration.
la-8	#2	Verified	2026-06-04	TechCrunch supports the DeepSeek V4 preview release and its competitive frontier-model positioning. This context item remains weaker than the espionage item and should stay separate from direct claim support.
la-11	#1	Verified	2026-06-04	CDAO, DefenseScoop, and NSA pages support frontier AI contracts, GenAI.mil deployment, and intelligence-community AI-security infrastructure.
la-11	#2	Verified	2026-06-04	Axios pages support Mythos access as a multi-agency national-security issue involving NSA, CISA, Treasury, and White House negotiation.
la-15	#1	Verified	2026-06-04	Anthropic Glasswing, Epoch GPQA Diamond, and Artificial Analysis pages support frontier-model performance on graduate-level academic benchmarks.
ac-5	#1	Verified	2026-06-04	The 80,000 Hours transcript directly supports the no-quantitative-commitments claim, and OpenAI's policy paper supplies accessible qualitative-containment context rather than a binding quantitative commitment. This verifies the public claim as stated by Cotra; it is not a full legal or private-policy audit of every lab.
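
The table above can be tallied to confirm the headline figures. The sketch below copies only the prediction ID and evidence-item number from each row; the dictionary layout and variable names are assumptions made for illustration.

```python
from collections import Counter

CHECKED = "2026-06-04"

# (prediction ID, evidence-item number) for each row in the table above.
ITEMS = [
    ("cs-3", 1), ("cs-7", 1), ("cs-8", 1), ("cs-10", 1), ("cs-13", 1), ("cs-13", 2),
    ("la-3", 1), ("la-4", 1), ("la-7", 1), ("la-8", 1), ("la-8", 2),
    ("la-11", 1), ("la-11", 2), ("la-15", 1), ("ac-5", 1),
]

audit = {}
for key in ITEMS:
    if key in audit:
        raise ValueError(f"duplicate audit record for {key}")
    # Every row in the table carries the same verdict and checked date.
    audit[key] = {"verdict": "Verified", "checked": CHECKED}

verdicts = Counter(record["verdict"] for record in audit.values())
predictions = {prediction_id for prediction_id, _ in audit}

print(len(audit), dict(verdicts), len(predictions))
# 15 {'Verified': 15} 12
```

The 15 audit records cover 12 distinct predictions because cs-13, la-8, and la-11 each have two evidence items.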

Source coverage

Source coverage is shown on each prediction card. A linked source is the original forecast material, an outcome reference, or a context reference attached directly to that card. Coverage is not a truth guarantee; it is a disclosure of how much public material is attached to the assessment.

56/56

Cards with a linked source

Multi-source reviews

Single-source reviews

Multi-source review means the card links the original forecast material and at least one additional review reference. Single-source review means the forecast source is linked, but separate outcome/context references are not listed on that card. Missing linked sources are treated as repair issues by the source-health audit.
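
The two review types can be expressed as a simple classification. This is a sketch under stated assumptions: the function name, argument names, and the "needs repair" label are hypothetical, and treating a card that has review references but no forecast source as a repair issue is an assumption, since the definitions above do not cover that case.

```python
def review_type(forecast_sources, review_references):
    """Classify a card by the sources linked on it."""
    if not forecast_sources:
        # No original forecast material linked: flagged for repair (assumed handling).
        return "needs repair"
    if review_references:
        # Forecast material plus at least one outcome or context reference.
        return "multi-source"
    return "single-source"


# The URLs below are placeholders, not real card data.
print(review_type(["https://example.com/forecast"], ["https://example.com/outcome"]))
print(review_type(["https://example.com/forecast"], []))
print(review_type([], []))
# multi-source
# single-source
# needs repair
```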

Revision log

Material scoring changes should be logged with the review date, affected claim, new status, previous status, and source that changed the record. This release has one public baseline entry:

2026-06-04 · Current scoring baseline
Current prediction statuses, evidence notes, source links, methodology page, and source-passage checks reflect the active public baseline for this static release.
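
A revision-log entry with the five fields listed above might look like the sketch below. The class, field names, and output format are assumptions, the example entry is entirely hypothetical (no such change has been logged), and rejecting entries whose status did not change is an assumed reading of "material scoring changes".

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class RevisionEntry:
    review_date: date
    prediction_id: str
    new_status: str
    previous_status: str
    source: str

    def __post_init__(self):
        # Assumption: only an actual status change counts as a material scoring change.
        if self.new_status == self.previous_status:
            raise ValueError("status did not change")


def format_entry(entry):
    """Render one log line in the 'date · description' style used on this page."""
    return (
        f"{entry.review_date.isoformat()} · {entry.prediction_id}: "
        f"{entry.previous_status} -> {entry.new_status} ({entry.source})"
    )


# Hypothetical entry for illustration only.
example = RevisionEntry(
    review_date=date(2026, 6, 4),
    prediction_id="xx-1",
    new_status="Confirmed",
    previous_status="Still unfolding",
    source="https://example.com/report",
)
print(format_entry(example))
# 2026-06-04 · xx-1: Still unfolding -> Confirmed (https://example.com/report)
```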

Decision records

Public scoring decisions

Resolved predictions have public decision records that list the current status, score strength, review date, and reviewer note. This is not a private audit trail; it is part of how the public ledger explains its scored outcomes.

Public scoring decisions for resolved predictions
Prediction	Status	Strength	Reviewed	Decision note
cs-3	Confirmed	Direct evidence	2026-04-01	Multiple public examples show AI systems performing material research tasks before full AGI.
cs-7	Confirmed	Direct evidence	2026-04-30	Public robotics deployment and production data show the physical-automation bottleneck described in the claim.
cs-8	Confirmed	Direct evidence	2026-04-01	Public chip-supply, fab-capacity, and subsidy evidence identify fabrication capacity as a binding near-term scaling constraint.
cs-10	Confirmed	Aggregate evidence	2026-04-30	Broad national investment pressure is confirmed; the forecast's 10x-1000x growth-differential mechanism is not yet realized.
cs-13	Confirmed	Direct evidence	2026-04-10	Public biosecurity and cyber-security examples show advanced models materially improving dangerous dual-use capability.
la-3	Confirmed	Aggregate evidence	2026-04-01	Combined cloud and AI-driven revenue clearly exceeds the threshold, while pure AI revenue is not separately broken out by each company.
la-4	Confirmed	Aggregate evidence	2026-04-30	The capital threshold has been crossed, but a single-cluster reading remains narrower than the evidence.
la-7	Confirmed	Aggregate evidence	2026-04-30	Broad AI spending has crossed the threshold, while narrower infrastructure-only measures are described separately.
la-8	Confirmed	Partial match	2026-04-30	Evidence strongly supports AI-lab infiltration and research-secret theft; full frontier model-weight theft remains less directly documented.
la-11	Confirmed	Direct evidence	2026-04-30	Public defence, intelligence, and White House actions show national-security institutions actively engaging frontier AI development.
la-15	Confirmed	Direct evidence	2026-04-30	Public benchmark records show frontier models exceeding graduate-level and professional-academic baselines named in the claim.
ac-5	Confirmed	Direct evidence	2026-04-10	The public lab-policy record still lacks quantitative crunch-time resource commitments.

Limitations

This index is not a complete database of every AI forecast, and it does not claim to be neutral in the way a formal forecasting tournament would be. It is a transparent editorial tracking project with source links and visible scoring notes.

The tracked forecasters are not affiliated with this site. AI-assisted summaries and assessments may contain mistakes, omissions, or bias. Do not rely on this site for investment, legal, policy, or other consequential decisions without independent verification.

Corrections

Corrections and challenges

The best correction request identifies the prediction, the disputed status or evidence claim, and a public source that changes the scoring record. Corrections can be sent to scott@scotthazlitt.ai.

Include

Prediction ID or copied claim text.
The disputed status, evidence sentence, source link, or omission.
A public source URL and the specific passage or data point it changes.
Whether the request is a factual correction, source addition, status challenge, or wording concern.

Review states

Logged: the request has been received and attached to a card.
Reviewing: source material is being checked against the scoring standard.
Changed: a score, source, or evidence note was revised and added to the revision log.
No change: the challenge was reviewed, but the public record did not alter the score.
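
The four review states can be modelled as a small state machine. The state names come from the list above; the allowed transitions (Logged to Reviewing, then Reviewing to either Changed or No change) are an assumption inferred from the descriptions, and the `advance` helper is hypothetical.

```python
from enum import Enum


class ReviewState(Enum):
    LOGGED = "Logged"
    REVIEWING = "Reviewing"
    CHANGED = "Changed"
    NO_CHANGE = "No change"


# Assumed transitions; the page lists the states but not the order between them.
ALLOWED = {
    ReviewState.LOGGED: {ReviewState.REVIEWING},
    ReviewState.REVIEWING: {ReviewState.CHANGED, ReviewState.NO_CHANGE},
    ReviewState.CHANGED: set(),
    ReviewState.NO_CHANGE: set(),
}


def advance(current, target):
    """Return the target state if the move is allowed, otherwise raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target


state = ReviewState.LOGGED
state = advance(state, ReviewState.REVIEWING)
state = advance(state, ReviewState.NO_CHANGE)
print(state.value)  # No change
```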
