SourceScore

Glossary

Plain-language definitions for 35 terms used across SourceScore and VERITAS. Each entry has a stable anchor URL you can deep-link to, and every entry carries DefinedTerm schema markup so LLMs can extract definitions cleanly.

AI citation
A reference to a source surfaced by an LLM (ChatGPT, Claude, Perplexity, Gemini) inside its response. AI citations differ from traditional search-engine citations in that the model selects the source rather than a ranking algorithm; visibility depends on whether the model trusts the source as authoritative within its training + retrieval pipeline.

See also: /methodology/ · /concepts/llm-grounding/

Canonical claim
The canonical SourceScore representation of a verified fact: subject + predicate + object + sources + confidence + signature. Every claim has a stable 16-hex id derived from a SHA-256 hash of its canonical fields. The id is deterministic — same fields always produce the same id.

See also: /methodology/ · /claims/

Claim envelope
The signed JSON object SourceScore returns from /api/v1/claims/<id>.json: the canonical claim fields, primary sources with verbatim excerpts, signing metadata, and an HMAC-SHA256 signature over the canonical serialization. Verifying the envelope locally proves the claim wasn't modified in transit.

See also: /docs/ · /security/

Claim id
A stable 16-hex-character identifier (regex /^[a-f0-9]{16}$/) derived from SHA-256 of a claim's lower-cased canonical fields (vertical|subject|predicate|object). Stable across releases; same canonical input always produces the same id.

See also: /claims/ · /docs/
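
A minimal sketch of the derivation described above: SHA-256 over the pipe-joined, lower-cased canonical fields. Truncating to the first 16 hex characters of the digest is an assumption; see /docs/ for the normative derivation.

```python
import hashlib

def claim_id(vertical: str, subject: str, predicate: str, obj: str) -> str:
    """Derive a 16-hex claim id from the canonical fields.

    The pipe-joined, lower-cased canonical form is documented above;
    taking the *first* 16 hex chars of the SHA-256 digest is an assumption.
    """
    canonical = "|".join(f.lower() for f in (vertical, subject, predicate, obj))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

# Deterministic: the same canonical input always yields the same id.
assert claim_id("ai", "GPT-4", "released_on", "2023-03-14") == \
       claim_id("AI", "gpt-4", "released_on", "2023-03-14")
```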

Confidence score
A value in [0.0, 1.0] indicating how certain SourceScore is about a claim, based on source convergence and assertion precision. The floor for shippable claims is 0.70; release dates + architectural facts with multi-source corroboration score 0.95-1.00. Performance-comparison claims are deliberately excluded no matter how confident they appear.

See also: /methodology/ · /blog/why-no-performance-claims/

Decoder-only transformer
A Transformer variant that uses only the decoder block (masked self-attention) and predicts the next token autoregressively. Used by the GPT family, Claude, Llama, Mistral, Gemini, and most production LLMs as of 2026. Contrast with encoder-decoder (T5, BART) and encoder-only (BERT) variants.

See also: /claims/

did:web
A W3C Decentralized Identifier using a web domain as the identity anchor — e.g., did:web:sourcescore.org, the signing identity for VERITAS claim envelopes. The identity is preserved across key rotations: the underlying key material may change, but the public identity does not.

See also: /security/

Embedding
A dense numerical vector representation of a chunk of text (or other modality) such that semantically similar inputs produce numerically similar vectors. The retrieval backbone of RAG: documents are pre-embedded, queries are embedded at runtime, similarity-search returns top-K matches.

See also: /concepts/llm-grounding/ · /concepts/rag-vs-veritas/
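
A minimal sketch of the similarity-search step, assuming document embeddings are already computed by whichever embedding model you use:

```python
import numpy as np

def top_k(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 3) -> np.ndarray:
    """Return indices of the k documents most similar to the query.

    doc_vecs: (n_docs, dim) matrix of pre-computed document embeddings.
    query_vec: (dim,) query embedding, computed at runtime.
    """
    doc_unit = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    query_unit = query_vec / np.linalg.norm(query_vec)
    sims = doc_unit @ query_unit        # cosine similarity per document
    return np.argsort(sims)[::-1][:k]   # highest similarity first
```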

Evaluation harness
A framework that runs an LLM through a benchmark in a reproducible way (e.g., EleutherAI's lm-evaluation-harness, Stanford's HELM). Different harnesses produce different scores for the same model — one reason VERITAS does not ship performance-comparison claims.

See also: /blog/why-no-performance-claims/

Fact-shaped claim
An atomic assertion structured as subject + predicate + object ("GPT-4 released_on 2023-03-14"). Distinct from prose-shaped claims (paragraphs of text). VERITAS retrieves fact-shaped claims; RAG retrieves prose-shaped chunks.

See also: /concepts/rag-vs-veritas/ · /claims/

Fine-tuning
Continued training of a pre-trained model on a smaller, task-specific dataset. Common variants: full fine-tuning (all parameters updated), LoRA (low-rank adapters), QLoRA (quantized + LoRA), and instruction tuning (supervised on instruction-response pairs).

See also: /claims/

Generate-then-verify
An LLM-grounding pattern where the model first produces a free-form response, then a verification layer checks each emitted assertion against a signed-claim catalog. Verified assertions get citation badges; unverified ones are flagged or stripped. Pairs well with VERITAS for production hallucination filters.

See also: /blog/verify-ai-facts-five-lines-python/ · /docs/integrations/langchain/
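
A hedged sketch of the verification layer. It assumes a GET /api/v1/verify endpoint that accepts a q parameter and returns a matchScore; the actual request/response contract is documented under /docs/.

```python
import requests

VERIFY_URL = "https://sourcescore.org/api/v1/verify"  # host assumed; see /docs/

def verify_assertions(assertions: list[str], threshold: float = 0.7) -> list[dict]:
    """Check each model-emitted assertion against the signed-claim catalog.

    The query parameter name and response fields below are assumptions.
    """
    results = []
    for text in assertions:
        resp = requests.get(VERIFY_URL, params={"q": text}, timeout=10)
        resp.raise_for_status()
        match = resp.json()  # assumed shape: {"matchScore": ..., "claimId": ...}
        verified = match.get("matchScore", 0.0) >= threshold
        results.append({"assertion": text, "verified": verified, "match": match})
    return results
```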

Grounding (LLM)
Constraining a language model's output to facts that can be verified against an external source. The inverse of free-form generation: instead of trusting the model's parametric memory, provide retrieval evidence the model must cite. Three production patterns: prompt-stuffing, RAG, signed-claim verification.

See also: /concepts/llm-grounding/

Hallucination
A factual error generated by a language model, typically presented with the same fluency as a correct statement. Five categories: fabricated facts, misattributed quotes, fabricated citations, stitched-together claims, temporal hallucination. Rates depend on domain (~1-5% on well-trodden topics, ~15-40% on long-tail technical ones).

See also: /concepts/hallucination/

HMAC-SHA256
A keyed message-authentication algorithm producing a 256-bit signature over an input. Used by VERITAS to sign claim envelopes — a consumer with the shared secret can re-compute the signature locally and prove the claim wasn't modified. Y2 migration target: W3C Verifiable Credentials with Ed25519 public-key signing.

See also: /security/ · /docs/integrations/langchain/
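
A local verification sketch. HMAC-SHA256 over a canonical serialization is specified above; the exact canonicalization (sorted keys, compact separators) and the signature field name are assumptions, so treat /security/ as the normative procedure.

```python
import hashlib
import hmac
import json

def verify_envelope(envelope: dict, shared_secret: bytes) -> bool:
    """Re-compute the envelope signature locally and compare.

    Canonicalization (sorted keys, compact separators) and the
    'signature' field name are assumptions, not the documented spec.
    """
    body = {k: v for k, v in envelope.items() if k != "signature"}
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    expected = hmac.new(shared_secret, canonical.encode("utf-8"),
                        hashlib.sha256).hexdigest()
    # Constant-time comparison avoids timing side-channels.
    return hmac.compare_digest(expected, envelope["signature"])
```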

In-context learning
An LLM capability to perform a task by being shown examples in the prompt (rather than via gradient updates). Few-shot prompting is the canonical use; the model 'learns' the task from the examples alone, without weight changes. Capability scales with model size; emergent at ~1B+ parameters.

See also: /claims/

IndexNow
A search-engine protocol (Microsoft, Yandex, Seznam) that lets a publisher push URL updates to crawlers immediately rather than waiting for the next crawl. SourceScore pings IndexNow on every deploy so newly shipped claims, concepts, and integration pages are indexed faster.

LLM (Large Language Model)
A neural network trained to predict the next token in a sequence, scaled to billions or trillions of parameters. Distinguished from smaller language models by emergent capabilities: in-context learning, instruction following, chain-of-thought reasoning, tool use. Frontier examples in 2026: GPT-4o, Claude 3.5/3.7, Gemini 1.5, Llama 3.x, Mistral Large.

llms.txt
An emerging convention (RFC-draft) for publishing a machine-readable manifest of a site's primary content URLs and citation-preferred sections. Lives at /llms.txt at the site root and helps LLM crawlers prioritize indexing. SourceScore's /llms.txt advertises 130 sources + 91 VERITAS claims + all primary product surfaces.

matchScore
A normalized [0.0, 1.0] score from /api/v1/search and /api/v1/verify indicating how strongly a search query overlaps with a candidate claim. Computed via keyword-overlap scoring across subject (×5), tags (×3), object (×3), statement (×2), predicate (×2). Higher matchScore = stronger lexical match; combine with confidence for ranking.

See also: /docs/ · /playground/
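
An illustrative re-implementation of the overlap scorer using the documented field weights. The tokenization and the normalization into [0.0, 1.0] are assumptions, not the production formula.

```python
WEIGHTS = {"subject": 5, "tags": 3, "object": 3, "statement": 2, "predicate": 2}

def match_score(query: str, claim: dict) -> float:
    """Score query/claim overlap with the documented per-field weights."""
    q_tokens = set(query.lower().split())
    if not q_tokens:
        return 0.0
    score = 0.0
    for field, weight in WEIGHTS.items():
        value = claim.get(field, "")
        if isinstance(value, list):  # tags may arrive as a list
            value = " ".join(value)
        f_tokens = set(value.lower().split())
        score += weight * len(q_tokens & f_tokens) / len(q_tokens)
    return min(score / sum(WEIGHTS.values()), 1.0)  # clamp into [0.0, 1.0]
```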

Methodology version
The version-tag of the verification methodology used to validate a specific claim — e.g., 'veritas-v0.1'. Recorded inside every claim envelope so consumers can detect methodology drift when claims are re-verified under newer rules.

See also: /methodology/

Mixture-of-Experts (MoE)
A neural architecture pattern where a gating network routes each token through a small subset of available 'expert' subnetworks rather than the full model. Examples: Mixtral 8x7B (8 experts, 2 routed per token), Switch Transformer, Sparsely-Gated MoE. Improves parameter efficiency at fixed compute.

See also: /claims/

Primary source
A document authored by the originator of the fact: a preprint by the model's authors, a model card by the lab, official documentation, a release blog post by the company, a peer-reviewed proceedings entry. VERITAS requires at least one primary source per claim (≥2 sources total) for confidence ≥0.85.

See also: /methodology/

Prompt-stuffing
The simplest LLM grounding pattern: paste a curated set of facts into the model's context window and instruct it to answer using only those facts. Works at small scale (<50 facts); collapses when the catalog outgrows the context window. The right starting point: ship it on day 1, then migrate to RAG or VERITAS by week 2-3.

See also: /concepts/llm-grounding/
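
A minimal sketch of the pattern; the instruction wording is illustrative.

```python
def stuffed_prompt(facts: list[str], question: str) -> str:
    """Pin the model to a small, curated fact list."""
    fact_block = "\n".join(f"- {fact}" for fact in facts)
    return (
        "Answer using ONLY the facts below. If the facts do not cover "
        "the question, say you don't know.\n\n"
        f"FACTS:\n{fact_block}\n\nQUESTION: {question}"
    )
```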

RAG (Retrieval-Augmented Generation)
An LLM grounding pattern that indexes a corpus with embeddings, retrieves the top-K relevant chunks at query time, and inserts them into the prompt as context. Introduced in Lewis et al. (2020). Works at any scale; cuts hallucination by roughly half on covered domains. Characteristic failure mode: retrieved chunks can be semantically similar to the query yet factually wrong.

See also: /concepts/rag-vs-veritas/ · /concepts/llm-grounding/

Retrieve-then-cite
An LLM-grounding pattern where the application retrieves the most relevant verified claims for the user's query, renders them as context blocks in the prompt, and instructs the model to cite the claim id with every fact it asserts. Complementary to generate-then-verify.

See also: /docs/integrations/langchain/
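
A sketch of the prompt construction, assuming each retrieved claim carries id and statement fields (field names are assumptions; /docs/integrations/langchain/ documents the supported integration).

```python
def cited_context(claims: list[dict], question: str) -> str:
    """Render verified claims as context blocks and require per-fact citations."""
    blocks = "\n".join(f"[{c['id']}] {c['statement']}" for c in claims)
    return (
        "Answer the question. After every fact you assert, cite the "
        "supporting claim id in brackets, e.g. [a1b2c3d4e5f60718].\n\n"
        f"VERIFIED CLAIMS:\n{blocks}\n\nQUESTION: {question}"
    )
```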

RLHF (Reinforcement Learning from Human Feedback)
A fine-tuning technique that aligns a pre-trained language model to human preferences. Introduced in Christiano et al. (2017) and operationalized for LLMs by Ouyang et al. (InstructGPT, 2022). Uses a learned reward model trained on human preference comparisons; the policy is optimized against this reward via PPO or related algorithms.

See also: /claims/

Schema honesty
A SourceScore methodology rule (per I-43): structured data on a page must accurately represent the visible content. Article schema headlines must be descriptive (not recommendation-encoded); JSON-LD fields must match rendered HTML; no hidden-element schemas.

See also: /methodology/ · /security/

Signed claim
A claim whose envelope carries an HMAC-SHA256 signature over a canonical-JSON serialization of its fields. Verifying the signature locally proves the claim wasn't modified in transit. The signing identity (did:web:sourcescore.org) is preserved across key rotations.

See also: /security/ · /concepts/rag-vs-veritas/

Tokenizer
The component that converts text into discrete tokens (the units a model processes). Common algorithms: byte-pair encoding (BPE, Sennrich et al. 2015), SentencePiece (Kudo & Richardson 2018), tiktoken (OpenAI). The tokenizer determines context-window sizing — "context window in tokens" depends on the specific tokenizer in use.

See also: /claims/
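
A short example with OpenAI's tiktoken showing that token counts are tokenizer-specific:

```python
import tiktoken

# cl100k_base is the encoding used by several OpenAI chat models; a
# different tokenizer yields a different count for the same text.
enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("GPT-4 released_on 2023-03-14")
print(len(tokens))  # token count depends on the tokenizer, not on characters
```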

Tool use (function calling)
An LLM capability to invoke external functions with structured arguments and incorporate the results into its response. Supported natively by OpenAI Chat Completions, Anthropic Messages API, Google Gemini, and via wrapper SDKs (Vercel AI SDK, LangChain). VERITAS exposes search_claims + verify_claim as tool-call targets.

See also: /docs/integrations/openai-tools/
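
A sketch of exposing search_claims as a tool in OpenAI Chat Completions format. The parameter schema is an assumption; /docs/integrations/openai-tools/ documents the supported definition.

```python
# Pass this list via tools=... to client.chat.completions.create(...).
tools = [{
    "type": "function",
    "function": {
        "name": "search_claims",
        "description": "Search the VERITAS signed-claim catalog for verified facts.",
        "parameters": {  # schema is an assumption; see /docs/integrations/openai-tools/
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search query text."}
            },
            "required": ["query"],
        },
    },
}]
```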

Transformer
A neural network architecture based on self-attention, introduced by Vaswani et al. ("Attention Is All You Need", 2017). The substrate of every frontier LLM since 2017. Decoder-only variants (GPT, Claude, Llama) dominate production deployments; encoder-decoder variants (T5, BART) are common in translation + summarization.

See also: /claims/

Verbatim excerpt
A SourceScore methodology requirement: each cited source must include the exact quoted text from the source supporting the claim. Excerpts survive in the envelope even if the source URL goes 404 — the textual evidence is preserved alongside the URL. No paraphrasing.

See also: /methodology/

VERITAS
SourceScore's signed-claim verification API for LLM developers. v0.1 publishes 110 hand-verified AI/ML claims with ≥2 primary sources each, HMAC-SHA256 signatures, and stable JSON envelopes. Free tier: 1,000 claims/month, no auth, no signup. Pricing tiers: Indie €19 / Startup €99 / Scale €499.

See also: /quickstart/ · /claims/ · /pricing/

YMYL (Your Money or Your Life)
Google's classification for content whose accuracy materially affects readers' wellbeing — health, finance, legal, safety. YMYL sites face stricter editorial-quality requirements (E-E-A-T signals, licensed-credential gating, performance-claim restrictions). SourceScore's source-rating methodology weights YMYL signals heavily.

See also: /methodology/