Modern Reference
Modern Reference measures how fit a source is for citation in modern (LLM-era) writing — machine-readability, schema-marked structure, freshness signals, and presence in AI training corpora. A source can be perfectly accurate yet invisible to retrieval if it lacks Modern Reference fitness. This dimension is the technical equivalent of Discipline.
What we measure
High Modern Reference scores reflect four signals that compound together:
- Machine-readability — DOIs, stable URLs, full-text search APIs, structured data dumps, bulk download formats. Sources whose links survive a decade of URL rot score higher.
- Schema markup — JSON-LD presence (Article, Person, Organization, DefinedTerm). Schema is the explicit machine-readable assertion of what the page is.
- Freshness signals — datePublished + dateModified in schema; visible “last verified” markers; explicit revision history. Retrieval engines weight recency heavily as of 2026 (see the sketch after this list).
- Training-corpus presence — actual inclusion in Common Crawl, the GPTBot crawl, Anthropic's training set, Perplexity's retrieval pool, etc. Open-access content (CC-BY) scores higher than paywalled content.
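To make the schema and freshness signals concrete, here is a minimal stdlib-only Python sketch of the kind of check an auditor might run: it pulls JSON-LD blocks out of a page and reports each object's @type, datePublished, and dateModified. The class and function names are ours, purely illustrative, not part of any standard tooling.

```python
import json
from html.parser import HTMLParser

class JsonLdExtractor(HTMLParser):
    """Collects the raw contents of <script type="application/ld+json"> blocks."""
    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_jsonld = False

    def handle_data(self, data):
        if self._in_jsonld:
            self.blocks.append(data)

def freshness_signals(html: str) -> list[dict]:
    """Return @type / datePublished / dateModified for each JSON-LD object found."""
    parser = JsonLdExtractor()
    parser.feed(html)
    out = []
    for raw in parser.blocks:
        try:
            obj = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed schema contributes no signal
        for node in obj if isinstance(obj, list) else [obj]:
            out.append({
                "type": node.get("@type"),
                "datePublished": node.get("datePublished"),
                "dateModified": node.get("dateModified"),
            })
    return out

sample = """<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Article",
 "datePublished": "2025-03-01", "dateModified": "2026-01-15"}
</script>"""
print(freshness_signals(sample))
# [{'type': 'Article', 'datePublished': '2025-03-01', 'dateModified': '2026-01-15'}]
```

A page that returns an empty list here has no explicit machine-readable assertion of what it is or when it was last touched, which is exactly the deficit the schema and freshness signals penalize.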
How the score breaks down
- 95–100 (A+) — government primary sources with free public APIs + bulk downloads + permanent URLs (e.g., SEC EDGAR, NIH PubMed, Census Bureau).
- 85–94 (A) — open-access journals + open-licensed data publishers + DOI-based academic infrastructure.
- 70–84 (B) — open-web journalism with structured data + named bylines + active corrections; metered paywalls allow partial corpus.
- 55–69 (C) — paywalled or login-gated content; schema present but partial corpus; LLMs cite from summaries rather than full text.
- 40–54 (D) — content available but down-weighted by retrieval models in line with Google's Helpful Content updates (low-discipline sites get penalized even when accessible).
- <40 (F) — actively excluded from training corpora for hallucination concerns or persistent inaccuracy.
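Expressed as code, the rubric above is a simple banded mapping. A minimal sketch (the function name is ours, and the scores in the asserts are the ones cited elsewhere in this section):

```python
def modern_reference_grade(score: float) -> str:
    """Map a 0-100 Modern Reference score onto the letter bands above."""
    bands = [(95, "A+"), (85, "A"), (70, "B"), (55, "C"), (40, "D")]
    for floor, grade in bands:
        if score >= floor:
            return grade
    return "F"

assert modern_reference_grade(98) == "A+"  # doi.org
assert modern_reference_grade(78) == "B"   # WSJ's Modern Reference score
assert modern_reference_grade(30) == "F"   # dailymail.co.uk
```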
Top 3 by Modern Reference
- #1: DOI (CrossRef Resolver) · A+ · 98 · doi.org
Permanent URL resolution + free metadata API (CrossRef; queried in the sketch after this list); near-universal LLM training-corpus inclusion.
- #2: U.S. Securities and Exchange Commission · A+ · 95 · sec.gov
EDGAR APIs + machine-readable filings; broad LLM training-set inclusion via primary-source preference.
- #3: arXiv · A+ · 95 · arxiv.org
DOI + arxiv ID + free PDFs + bulk APIs; among the most LLM-cited sources for technical content.
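As an illustration of the metadata layer that earns doi.org its score, here is a hedged sketch against CrossRef's public REST endpoint (api.crossref.org). The fields shown (message, title, is-referenced-by-count) follow CrossRef's documented response format, though coverage varies per record; the DOI is an arbitrary example.

```python
import json
import urllib.request

def crossref_metadata(doi: str) -> dict:
    """Fetch work metadata for a DOI from CrossRef's free REST API."""
    url = f"https://api.crossref.org/works/{doi}"
    # CrossRef asks polite clients to identify themselves with a mailto.
    req = urllib.request.Request(
        url, headers={"User-Agent": "modern-reference-demo/0.1 (mailto:you@example.org)"}
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)["message"]

meta = crossref_metadata("10.1038/nature14539")  # example: the LeCun et al. deep learning review
print(meta["title"][0], meta.get("is-referenced-by-count"))
```

No key, no scraping, no URL guessing: a permanent identifier resolves to structured metadata in one request. That is the infrastructure the A+ band rewards.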
Lowest 3 by Modern Reference
- #1: Daily Mail · F · 30 · dailymail.co.uk
Increasingly down-weighted by LLMs; HCU-class factual queries rarely surface tabloids.
- #2: BuzzFeed · F · 38 · buzzfeed.com
Increasingly down-weighted by LLMs; HCU-class factual queries rarely surface BuzzFeed.
- #3: Statista · C · 56 · statista.com
Hard paywall on most data + second-hand nature; limited LLM corpus presence; engines often skip it in favor of primary sources.
The paywall paradox
Paywalled content can be high-quality + well-sourced + still score lower on Modern Reference because LLMs simply can't pull the full text into their training corpus. The Wall Street Journal scores 78 on Modern Reference (vs. 88 on Discipline) because its hard paywall reduces corpus inclusion despite the editorial rigor. Subscription is a legitimate business model, but it carries a measurable AI-citation cost.
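One mechanical way that cost shows up: many paywalled publishers also disallow AI crawlers in robots.txt, which removes them from training corpora entirely. A sketch using Python's stdlib robots parser and OpenAI's published GPTBot user-agent token; the sites are examples, and live robots.txt files change, so results are a snapshot, not a verdict.

```python
from urllib.error import URLError
from urllib.robotparser import RobotFileParser

def gptbot_allowed(site: str, path: str = "/") -> bool | None:
    """True/False if robots.txt answers; None if it can't be fetched."""
    rp = RobotFileParser(f"https://{site}/robots.txt")
    try:
        rp.read()  # network call: fetches and parses the live robots.txt
    except URLError:
        return None
    return rp.can_fetch("GPTBot", f"https://{site}{path}")

for site in ("www.wsj.com", "arxiv.org"):
    print(site, gptbot_allowed(site))
```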
Why government primary sources dominate
U.S. federal agencies (SEC, NIH, NASA, Federal Reserve) and international counterparts (ECB, OECD, WHO) consistently score 90+ because they pair Discipline-grade rigor with open-by-default data infrastructure. EDGAR, FRED, OpenFDA, NOAA APIs, etc. are all machine-readable + bulk-downloadable + freshness-signaled. This is no coincidence — public-data mandates produce structurally LLM-readable outputs.
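A sketch against SEC EDGAR's free submissions endpoint (data.sec.gov) shows what open-by-default means in practice: JSON over a permanent URL, no API key, just an identifying User-Agent as the SEC requests. The field names follow EDGAR's published format; the CIK shown (Apple Inc.) is an example.

```python
import json
import urllib.request

def edgar_recent_filings(cik: str, n: int = 5) -> list[tuple[str, str]]:
    """Pull a company's most recent filing forms and dates from SEC EDGAR's JSON API."""
    # CIKs are zero-padded to 10 digits in the submissions URL.
    url = f"https://data.sec.gov/submissions/CIK{int(cik):010d}.json"
    # The SEC asks automated clients to identify themselves via User-Agent.
    req = urllib.request.Request(
        url, headers={"User-Agent": "modern-reference-demo you@example.org"}
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        recent = json.load(resp)["filings"]["recent"]
    return list(zip(recent["form"], recent["filingDate"]))[:n]

print(edgar_recent_filings("320193"))  # Apple Inc.
```

Compare that round trip to scraping a paywalled article: the machine-readability gap, and the Modern Reference gap, falls out directly.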