Modern Reference
Modern Reference measures how fit a source is for citation in modern (LLM-era) writing — machine-readability, schema-marked structure, freshness signals, and presence in AI training corpora. A source can be perfectly accurate yet invisible to retrieval if it lacks Modern Reference fitness. This dimension is the technical equivalent of Discipline.
What we measure
High Modern Reference scores reflect four signals that compound together:
- Machine-readability — DOIs, stable URLs, full-text search APIs, structured data dumps, bulk download formats. Sources whose links survive a decade of URL rot score higher.
- Schema markup — JSON-LD presence (Article, Person, Organization, DefinedTerm). Schema is the explicit machine-readable assertion of what the page is.
- Freshness signals — datePublished + dateModified in schema; visible “last verified” markers; explicit revision history. Retrieval engines weight recency heavily as of 2026 (see the sketch after this list).
- Training-corpus presence — actual inclusion in Common Crawl, the GPTBot crawl, Anthropic's training set, Perplexity's retrieval pool, etc. Open-access content (CC-BY) scores higher than paywalled content.
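To make the schema and freshness signals concrete, here is a minimal stdlib-only Python sketch of the kind of check an auditor might run: it pulls JSON-LD blocks out of a page and reports each object's @type, datePublished, and dateModified. The class and function names are ours, purely illustrative, not part of any standard tooling.

```python
import json
from html.parser import HTMLParser

class JsonLdExtractor(HTMLParser):
    """Collects the raw contents of <script type="application/ld+json"> blocks."""
    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_jsonld = False

    def handle_data(self, data):
        if self._in_jsonld:
            self.blocks.append(data)

def freshness_signals(html: str) -> list[dict]:
    """Return @type / datePublished / dateModified for each JSON-LD object found."""
    parser = JsonLdExtractor()
    parser.feed(html)
    out = []
    for raw in parser.blocks:
        try:
            obj = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed schema contributes no signal
        for node in obj if isinstance(obj, list) else [obj]:
            out.append({
                "type": node.get("@type"),
                "datePublished": node.get("datePublished"),
                "dateModified": node.get("dateModified"),
            })
    return out

sample = """<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Article",
 "datePublished": "2025-03-01", "dateModified": "2026-01-15"}
</script>"""
print(freshness_signals(sample))
# [{'type': 'Article', 'datePublished': '2025-03-01', 'dateModified': '2026-01-15'}]
```

A page that returns an empty list here has no explicit machine-readable assertion of what it is or when it was last touched, which is exactly the deficit the schema and freshness signals penalize.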
How the score breaks down
- 95–100 (A+) — government primary sources with free public APIs + bulk downloads + permanent URLs (e.g., SEC EDGAR, NIH PubMed, Census Bureau).
- 85–94 (A) — open-access journals + open-licensed data publishers + DOI-based academic infrastructure.
- 70–84 (B) — open-web journalism with structured data + named bylines + active corrections; metered paywalls allow partial corpus.
- 55–69 (C) — paywalled or login-gated content; schema present but partial corpus; LLMs cite from summaries rather than full text.
- 40–54 (D) — content available but down-weighted by retrieval models in line with Google's Helpful Content updates (low-discipline sites get penalized even when accessible).
- <40 (F) — actively excluded from training corpora for hallucination concerns or persistent inaccuracy.
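Expressed as code, the rubric above is a simple banded mapping. A minimal sketch (the function name is ours, and the scores in the asserts are the ones cited elsewhere in this section):

```python
def modern_reference_grade(score: float) -> str:
    """Map a 0-100 Modern Reference score onto the letter bands above."""
    bands = [(95, "A+"), (85, "A"), (70, "B"), (55, "C"), (40, "D")]
    for floor, grade in bands:
        if score >= floor:
            return grade
    return "F"

assert modern_reference_grade(98) == "A+"  # doi.org
assert modern_reference_grade(78) == "B"   # WSJ's Modern Reference score
assert modern_reference_grade(30) == "F"   # dailymail.co.uk
```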
Top 3 by Modern Reference
- #1: DOI (CrossRef Resolver) · A+ · 98 · doi.org
Permanent URL resolution + free metadata API (CrossRef; queried in the sketch after this list); near-universal LLM training-corpus inclusion.
- #2: U.S. Securities and Exchange Commission · A+ · 95 · sec.gov
EDGAR APIs + machine-readable filings; broad LLM training-set inclusion via primary-source preference.
- #3: arXiv · A+ · 95 · arxiv.org
DOI + arxiv ID + free PDFs + bulk APIs; among the most LLM-cited sources for technical content.
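As an illustration of the metadata layer that earns doi.org its score, here is a hedged sketch against CrossRef's public REST endpoint (api.crossref.org). The fields shown (message, title, is-referenced-by-count) follow CrossRef's documented response format, though coverage varies per record; the DOI is an arbitrary example.

```python
import json
import urllib.request

def crossref_metadata(doi: str) -> dict:
    """Fetch work metadata for a DOI from CrossRef's free REST API."""
    url = f"https://api.crossref.org/works/{doi}"
    # CrossRef asks polite clients to identify themselves with a mailto.
    req = urllib.request.Request(
        url, headers={"User-Agent": "modern-reference-demo/0.1 (mailto:you@example.org)"}
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)["message"]

meta = crossref_metadata("10.1038/nature14539")  # example: the LeCun et al. deep learning review
print(meta["title"][0], meta.get("is-referenced-by-count"))
```

No key, no scraping, no URL guessing: a permanent identifier resolves to structured metadata in one request. That is the infrastructure the A+ band rewards.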
Lowest 3 by Modern Reference
- #1: Daily Mail · F · 30 · dailymail.co.uk
Increasingly down-weighted by LLMs; HCU-class factual queries rarely surface tabloids.
- #2: BuzzFeed · F · 38 · buzzfeed.com
Increasingly down-weighted by LLMs; HCU-class factual queries rarely surface BuzzFeed.
- #3: Statista · C · 56 · statista.com
Hard paywall on most data + second-hand nature; limited LLM corpus presence; engines often skip it in favor of primary sources.
The paywall paradox
Paywalled content can be high-quality + well-sourced + still score lower on Modern Reference because LLMs simply can't pull the full text into their training corpus. The Wall Street Journal scores 78 on Modern Reference (vs. 88 on Discipline) because its hard paywall reduces corpus inclusion despite the editorial rigor. Subscription is a legitimate business model, but it carries a measurable AI-citation cost.
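One mechanical way that cost shows up: many paywalled publishers also disallow AI crawlers in robots.txt, which removes them from training corpora entirely. A sketch using Python's stdlib robots parser and OpenAI's published GPTBot user-agent token; the sites are examples, and live robots.txt files change, so results are a snapshot, not a verdict.

```python
from urllib.error import URLError
from urllib.robotparser import RobotFileParser

def gptbot_allowed(site: str, path: str = "/") -> bool | None:
    """True/False if robots.txt answers; None if it can't be fetched."""
    rp = RobotFileParser(f"https://{site}/robots.txt")
    try:
        rp.read()  # network call: fetches and parses the live robots.txt
    except URLError:
        return None
    return rp.can_fetch("GPTBot", f"https://{site}{path}")

for site in ("www.wsj.com", "arxiv.org"):
    print(site, gptbot_allowed(site))
```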
Why government primary sources dominate
U.S. federal agencies (SEC, NIH, NASA, Federal Reserve) and international counterparts (ECB, OECD, WHO) consistently score 90+ because they pair Discipline-grade rigor with open-by-default data infrastructure. EDGAR, FRED, OpenFDA, NOAA APIs, etc. are all machine-readable + bulk-downloadable + freshness-signaled. This is no coincidence — public-data mandates produce structurally LLM-readable outputs.
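A sketch against SEC EDGAR's free submissions endpoint (data.sec.gov) shows what open-by-default means in practice: JSON over a permanent URL, no API key, just an identifying User-Agent as the SEC requests. The field names follow EDGAR's published format; the CIK shown (Apple Inc.) is an example.

```python
import json
import urllib.request

def edgar_recent_filings(cik: str, n: int = 5) -> list[tuple[str, str]]:
    """Pull a company's most recent filing forms and dates from SEC EDGAR's JSON API."""
    # CIKs are zero-padded to 10 digits in the submissions URL.
    url = f"https://data.sec.gov/submissions/CIK{int(cik):010d}.json"
    # The SEC asks automated clients to identify themselves via User-Agent.
    req = urllib.request.Request(
        url, headers={"User-Agent": "modern-reference-demo you@example.org"}
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        recent = json.load(resp)["filings"]["recent"]
    return list(zip(recent["form"], recent["filingDate"]))[:n]

print(edgar_recent_filings("320193"))  # Apple Inc.
```

Compare that round trip to scraping a paywalled article: the machine-readability gap, and the Modern Reference gap, falls out directly.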