SourceScore

Tutorial · 2026-05-16

Verifying AI-generated facts in 5 lines of Python

Drop SourceScore VERITAS into your LLM pipeline as a post-generation check. Every claim the model emits gets a confidence score + canonical citation before the user sees it.

The problem

You wired up GPT-4 or Claude to answer questions about AI/ML research. Demo is great. Then a user asks "when was the Transformer architecture introduced and by whom?" and the model invents a plausible-but-wrong attribution. You catch it this time. You won't catch it the next thousand times.

The standard fix is RAG: retrieve relevant context, stuff it into the prompt, hope the model uses it. That works ~70% of the time. The remaining ~30% is exactly where the model drifts off the retrieved chunks, because the chunks themselves are noisy and unverified.

The 5-line fix

A different approach: let the model answer freely, then verify each assertion against a catalog of signed, sourced claims. Anything the catalog confirms gets a citation badge. Anything it doesn't confirm gets flagged.

import requests

def verify(claim: str, threshold: float = 0.85):
    r = requests.post("https://sourcescore.org/api/v1/verify",
        json={"claim": claim, "minConfidence": threshold}, timeout=8)
    return r.json().get("bestMatch")  # None if no high-confidence match

That's the whole client. Five lines including the import. Drop it in front of every fact your LLM emits and you have a working hallucination filter for the AI/ML domain.
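
A quick way to sanity-check it. The shape of bestMatch (an object with "id" and "confidence" keys) is inferred from how the chain code below consumes it, not from published API docs:

match = verify("The Transformer architecture was introduced in 2017.")
if match:
    print(f"verified: {match['id']} (confidence {match['confidence']:.2f})")
else:
    print("no high-confidence match")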

Wire it into a chain

Here's the same function inside a generate-then-verify loop:

from openai import OpenAI

client = OpenAI()

def answer_with_citations(question: str) -> str:
    # Step 1 — model generates one fact per line
    raw = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Answer with one fact per line:\n{question}"}],
        temperature=0,
    ).choices[0].message.content

    # Step 2 — verify each line, render with badges
    out = []
    for line in raw.strip().split("\n"):
        if not line.strip():
            continue
        best = verify(line)
        if best:
            badge = f"✅ [{best['id']}] (confidence {best['confidence']:.2f})"
            url   = f"https://sourcescore.org/claims/{best['id']}/"
            out.append(f"{line.strip()} {badge}\n  → {url}")
        else:
            out.append(f"{line.strip()} ⚠️ unverified")
    return "\n".join(out)

print(answer_with_citations("When was the Transformer architecture introduced and by whom?"))

Sample output:

The Transformer architecture was introduced in 2017. ✅ [abc123...] (confidence 1.00)
  → https://sourcescore.org/claims/abc123.../
It was introduced by Vaswani et al. in "Attention Is All You Need". ✅ [abc123...] (confidence 1.00)
  → https://sourcescore.org/claims/abc123.../

What you get

  • Hallucination filter. Anything unverified is visually flagged before the user sees it. Your UI can strip unverified lines entirely if your domain demands strictness (see the sketch after this list).
  • Free citation badges. Every verified fact ships with a canonical URL the user can click for full provenance — primary sources, signing, last-verified date.
  • Cost transparency. One call per assertion, ~80ms p95. The free tier covers 1,000 calls/month. You know exactly what verification costs you.
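
For the strict mode mentioned in the first bullet, a minimal sketch; strict_filter is a hypothetical helper built on verify() above, not part of the API:

def strict_filter(facts: list[str]) -> list[str]:
    # Hypothetical strict mode: keep only facts the catalog confirms,
    # drop everything unverified instead of flagging it.
    return [fact for fact in facts if verify(fact) is not None]

One verify() call per fact, so the cost profile is the same as the badge path.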

Scope honesty

VERITAS today is bounded to AI/ML research — 102 hand-verified claims across foundational papers, model releases, organizations, and datasets. If your chain asks about "the capital of France" we return no match and your code falls through to whatever retrieval you'd use anyway.
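
In code, that fall-through is one branch. rag_answer here is a stand-in for whatever retrieval you already run, not something VERITAS provides:

def verified_or_fallback(claim: str) -> str:
    # Out-of-domain claims come back with no match; hand those to
    # your existing retrieval path instead of flagging them.
    best = verify(claim)
    if best:
        return f"{claim} ✅ [{best['id']}]"
    return rag_answer(claim)  # stand-in for your existing RAG path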

Catalog expansion is gated by our methodology: every claim must have at least two primary sources with verbatim excerpts, and performance comparisons are excluded (benchmark numbers vary by prompt format, version, and shot count; too much surface for "actually that's not quite right" pushback). New verticals ship in year two.
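
For intuition, here's a guess at what a catalog entry satisfying that bar might look like. The field names are illustrative and the real schema may differ:

claim_record = {
    "id": "abc123...",
    "statement": "The Transformer architecture was introduced in 2017.",
    "sources": [
        # methodology requires at least two primary sources,
        # each with a verbatim excerpt
        {"url": "https://arxiv.org/abs/1706.03762", "excerpt": "..."},
        {"url": "...", "excerpt": "..."},
    ],
    "last_verified": "2026-05-01",  # illustrative date
}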

One question I get a lot

"Why not just put all 102 claims in the prompt as context?"

You can, and for a Day 1 demo you should. The reasons to pull via the API instead:

  1. The catalog grows past what fits in a prompt context window within a quarter.
  2. Retrieval ranks claims by relevance to the actual question — you're not paying tokens for the 95 irrelevant claims.
  3. The signed envelope path lets you re-verify integrity locally, which is meaningful for high-stakes deployments where you need to prove the claim wasn't modified.

Start with the prompt-stuff pattern. Move to the API when you outgrow it (typically in week 2-3). The migration takes under 30 minutes; the 5-line client above is the whole client.
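
For completeness, a minimal sketch of that Day 1 pattern. It assumes you've saved the catalog to a local file (claims_snapshot.txt is a placeholder name; the post doesn't specify an export format) and reuses the client from the chain example:

def answer_from_stuffed_prompt(question: str) -> str:
    # Day-1 pattern: inline the entire catalog as context. Works at
    # 102 claims; stops scaling once the catalog outgrows the context
    # window (reason 1 above).
    with open("claims_snapshot.txt") as f:
        claims = f.read()
    prompt = f"Use only these verified claims as context:\n{claims}\n\nQuestion: {question}"
    return client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    ).choices[0].message.content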