SourceScore

Blog · Architecture · 2026-05-17

Multi-LLM grounding in 2026 — build once, deploy across OpenAI, Anthropic, Google, and open-weight

Single-provider lock-in is fragile in 2026. Pricing shifts, capability changes, and outages all argue for portability. Here's the architecture pattern that keeps your grounding layer LLM-agnostic — same verification, citation, and source-quality across every provider.

The single-provider problem

You picked OpenAI in 2023. Two years later: Anthropic shipped a better reasoning model for half the price, Google's Gemini handles your specific document type more cleanly, and DeepSeek-V3 is 10× cheaper than GPT-4o on equivalent tasks. Your customer asks why response quality varies week-to-week. Your CFO asks why the API line in the budget keeps doubling.

Single-provider lock-in hurt roughly 30-50% of buyers in the 2024-2025 wave: a model was deprecated, pricing shifted, or capabilities lagged. The teams that came through these shifts intact had built a portable grounding layer from day one. This post shows the architecture.

Why 2026 makes multi-LLM mandatory

  • Pricing variance is 5-10x. DeepSeek-V3 input tokens are ~$0.27/M. GPT-4o input tokens are ~$2.50/M. For high-volume RAG retrieval contexts, the input-token cost dominates. Switching models can cut bills in half.
  • Capability gaps shift quarterly. Claude 3.7 Sonnet (Feb 2025) was the best coding model for ~3 months; o3 (announced Dec 2024) was the best reasoning model; Gemini 1.5 Pro had the best long-context handling. No single model wins on all dimensions.
  • Outages happen. Major provider outages of 1-4 hours per quarter are routine. A multi-LLM stack with automatic failover keeps you up while single-provider competitors go dark.
  • Regulatory + data-residency rules. Different providers have different jurisdictional footprints + compliance certifications. EU customers often need Anthropic-EU or self-hosted Llama; US gov needs FedRAMP-certified options.
  • Open-weight quality crossed the line. Llama 3.1 405B, DeepSeek-V3, Qwen 2.5 are production-quality. Self-host or run via Fireworks / Together AI for cost + privacy.
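
To put numbers on the pricing bullet, here is a back-of-the-envelope calculation using the two input-token prices quoted above. The monthly volume of 2 billion input tokens is a made-up workload for illustration, and output-token costs are ignored, so treat it as a rough sketch rather than a billing model.

```python
# Input-token prices quoted in the list above, in USD per million tokens.
PRICE_PER_M = {"deepseek-v3": 0.27, "gpt-4o": 2.50}


def monthly_input_cost(model: str, input_tokens: int) -> float:
    """Input-token spend in USD; output tokens are ignored in this sketch."""
    return PRICE_PER_M[model] * input_tokens / 1_000_000


# Hypothetical workload: 2 billion input tokens per month.
TOKENS = 2_000_000_000

gpt4o = monthly_input_cost("gpt-4o", TOKENS)
deepseek = monthly_input_cost("deepseek-v3", TOKENS)
ratio = gpt4o / deepseek

print(f"gpt-4o: ${gpt4o:,.0f}  deepseek-v3: ${deepseek:,.0f}  ratio: {ratio:.1f}x")
```

At these list prices the same workload costs about $5,000 on GPT-4o and about $540 on DeepSeek-V3, a gap of roughly 9x that sits inside the 5-10x range above.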

The architecture pattern

Three layers, each provider-agnostic:

  1. Router — selects which LLM provider to call per request based on rules (task type, cost budget, latency SLA, user tier).
  2. Adapter — normalizes input + output across provider APIs. OpenAI tools, Anthropic tool-use, Google function-calling, Llama function-calling all have different request shapes; the adapter hides this.
  3. Grounding layer — verification + citation applied to LLM output regardless of which provider produced it. Same claim catalog, same signatures, same canonical URLs.
# Python: multi-LLM grounding skeleton
import os
import re
from typing import Optional

import google.generativeai as genai
import requests
from anthropic import Anthropic
from openai import OpenAI


class LlmRouter:
    def __init__(self):
        # OpenAI() and Anthropic() read OPENAI_API_KEY / ANTHROPIC_API_KEY
        # from the environment. The GEMINI_API_KEY variable name is an
        # assumption of this sketch; use whatever your deployment sets.
        self.openai = OpenAI()
        self.anthropic = Anthropic()
        genai.configure(api_key=os.environ["GEMINI_API_KEY"])
        self.gemini = genai.GenerativeModel("gemini-2.5-pro")

    def call(self, task: str, provider: Optional[str] = None) -> str:
        # Router: pick provider based on task + budget + SLA
        provider = provider or self._pick_provider(task)
        if provider == "openai":
            r = self.openai.chat.completions.create(
                model="gpt-4o-mini",
                messages=[{"role": "user", "content": task}],
                temperature=0,
            )
            return r.choices[0].message.content
        elif provider == "anthropic":
            r = self.anthropic.messages.create(
                model="claude-sonnet-4-5-20250929",
                max_tokens=1024,
                messages=[{"role": "user", "content": task}],
            )
            return r.content[0].text
        elif provider == "gemini":
            r = self.gemini.generate_content(task)
            return r.text
        raise ValueError(f"unknown provider: {provider}")

    def _pick_provider(self, task: str) -> str:
        # Toy heuristic: a real router uses cost + quality data
        if "code" in task.lower():
            return "anthropic"
        if "long" in task.lower():
            return "gemini"
        return "openai"


def extract_assertions(response: str) -> list[str]:
    """Placeholder claim splitter (an assumption of this sketch, not part
    of VERITAS): treats each sentence as one claim. Swap in a real
    claim extractor for production use."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", response) if s.strip()]


def ground(response: str) -> dict:
    """Verify factual assertions via SourceScore VERITAS regardless
    of which LLM produced the response. Same call, same citations."""
    assertions = extract_assertions(response)  # split into atomic claims
    verified = []
    for a in assertions:
        r = requests.post(
            "https://sourcescore.org/api/v1/verify",
            json={"claim": a, "minConfidence": 0.85},
            timeout=8,
        )
        r.raise_for_status()  # surface HTTP errors instead of parsing an error body
        verified.append(r.json().get("bestMatch"))  # None if the key is absent
    return {"response": response, "citations": verified}


# Usage (makes live API calls; needs all three provider keys set)
router = LlmRouter()
raw = router.call("When was Llama 3.1 released?", provider="anthropic")
grounded = ground(raw)

The grounding layer is provider-agnostic — it sees only the response text. Switch the router's provider per request (cost, capability, latency, failover) without changing the verification logic.

Adapter libraries that work as of 2026

  • Vercel AI SDK — TypeScript-first; standardizes OpenAI, Anthropic, Google, Mistral, Cohere, Replicate. Stream support.
  • DSPy — Python; signature programming + provider abstraction.
  • LangChain — broad provider support; sometimes heavy abstraction.
  • Instructor — structured output across providers via Pydantic models.
  • LiteLLM — proxy-style adapter; drop-in replacement for OpenAI client across 100+ providers.
  • OpenRouter — hosted-proxy; single API key for 200+ models. Good for cost-optimized routing.
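
For a sense of what these libraries hide, a hand-rolled adapter for tool definitions looks roughly like the sketch below. `ToolSpec`, the three `to_*` functions, and the `lookup_claim` tool are hypothetical names for illustration. The output shapes follow each provider's documented tool or function-calling format at the time of writing, so check them against the current API references before relying on them.

```python
from dataclasses import dataclass, field


@dataclass
class ToolSpec:
    """Provider-neutral tool description (hypothetical type for this sketch)."""
    name: str
    description: str
    parameters: dict = field(default_factory=dict)  # a JSON Schema object


def to_openai(tool: ToolSpec) -> dict:
    # OpenAI Chat Completions nests the schema under a "function" key.
    return {
        "type": "function",
        "function": {
            "name": tool.name,
            "description": tool.description,
            "parameters": tool.parameters,
        },
    }


def to_anthropic(tool: ToolSpec) -> dict:
    # Anthropic Messages uses a flat object and calls the schema "input_schema".
    return {
        "name": tool.name,
        "description": tool.description,
        "input_schema": tool.parameters,
    }


def to_gemini(tool: ToolSpec) -> dict:
    # Gemini groups declarations in a list (snake_case in the Python SDK's
    # dict form; the REST API spells it functionDeclarations).
    return {
        "function_declarations": [
            {
                "name": tool.name,
                "description": tool.description,
                "parameters": tool.parameters,
            }
        ]
    }


lookup = ToolSpec(
    name="lookup_claim",
    description="Look up a claim in the shared catalog.",
    parameters={
        "type": "object",
        "properties": {"claim": {"type": "string"}},
        "required": ["claim"],
    },
)

for convert in (to_openai, to_anthropic, to_gemini):
    print(convert.__name__, "->", sorted(convert(lookup).keys()))
```

Business logic defines each tool once as a `ToolSpec`; only the adapter knows which shape a given provider expects.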

Routing rules that work in production

  1. Task type → model. Code-gen → Claude Sonnet 4.5 or Codestral. Long-context document analysis → Gemini 1.5 Pro. Reasoning → o3 or Claude 3.7 with extended thinking. Cheap classification → Gemini Flash or Mistral Small 3.
  2. User tier → cost-budget. Free-tier users → cheapest acceptable model (DeepSeek-V3, Mistral Small, Gemini Flash). Paid-tier → premium (Claude Sonnet 4.5, GPT-4o). Enterprise → premium + self-hosted fallback.
  3. Latency SLA → streaming + sub-second models. When <500ms first-token matters, use sub-second providers (Fireworks-hosted Llama, OpenAI gpt-4o-mini, Cohere Command R).
  4. Outage → automatic failover. Wrap each provider call in a try/except chain: Anthropic down → try OpenAI → try Gemini → try self-hosted Llama. Don't return errors to the user; degrade gracefully.
  5. Regulatory zone → compliant provider. EU customer → Anthropic-EU or self-hosted. US government → FedRAMP-certified. China → Doubao or Hunyuan.
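
Rules 1, 2, and 4 can be combined in a few lines. The sketch below is one possible shape, not a reference implementation: `pick_chain`, `call_with_failover`, the provider keys, and the tier names are all hypothetical, and the providers are plain callables so the logic can be exercised without network access.

```python
from typing import Callable

# A provider is any callable that takes a prompt and returns text.
Provider = Callable[[str], str]


class AllProvidersFailed(RuntimeError):
    """Raised when every provider in the chain errored."""


def pick_chain(task_type: str, tier: str) -> list[str]:
    """Rules 1-2: order providers by task type, then adjust for user tier."""
    chain = {
        "code": ["anthropic", "openai", "gemini"],
        "long_context": ["gemini", "anthropic", "openai"],
    }.get(task_type, ["openai", "anthropic", "gemini"])
    if tier == "free":
        # Free tier tries the cheap model first; premium is only a fallback.
        chain = ["deepseek"] + chain
    if tier == "enterprise":
        # Enterprise gets a self-hosted fallback at the end of the chain.
        chain = chain + ["self_hosted_llama"]
    return chain


def call_with_failover(
    prompt: str, chain: list[str], providers: dict[str, Provider]
) -> tuple[str, str]:
    """Rule 4: try each provider in order; return (provider_name, text)."""
    errors = []
    for name in chain:
        try:
            return name, providers[name](prompt)
        except Exception as exc:  # broad on purpose: any failure triggers failover
            errors.append(f"{name}: {exc!r}")
    raise AllProvidersFailed("; ".join(errors))


def down(prompt: str) -> str:
    """Simulates a provider outage."""
    raise TimeoutError("simulated outage")


providers = {
    "anthropic": down,
    "openai": lambda p: f"openai answered: {p}",
    "gemini": lambda p: f"gemini answered: {p}",
}

used, text = call_with_failover(
    "Refactor this function", pick_chain("code", "paid"), providers
)
print(used, "|", text)
```

In the usage example Anthropic is first in the chain for a code task, times out, and the call falls through to OpenAI. A provider name missing from `providers` is treated the same way as an outage, so a chain can list backends that are not configured in every environment.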

Why grounding-layer portability matters

The temptation is to use a provider's built-in grounding feature: Anthropic Citations API, OpenAI's new search-grounded responses, Google's built-in retrieval. Each is excellent within its provider.

But when you switch providers (cost, capability, outage), the grounding layer disappears with the provider. Your users see citations on Monday and none on Tuesday because you failed over to a different model. That's a trust collapse.

A provider-agnostic grounding layer (like SourceScore VERITAS, or a self-built RAG pipeline against a shared knowledge store) survives provider switches. The citations on Monday and Tuesday are identical — same claim IDs, same signatures, same canonical URLs — regardless of which LLM produced the raw response. Users don't see your infrastructure churn.

Provider-locked grounding still has its place

Anthropic Citations API and OpenAI's search-grounded responses excel at user-uploaded-doc citations within a single provider. The clean pattern: use both.

  • User-supplied document RAG → use the provider's native citation API
  • Shared knowledge base + cross-provider facts (model release dates, paper authorship, organizational facts) → use a portable grounding layer like VERITAS

See VERITAS vs Anthropic Citations API for the full head-to-head.

Getting started

  1. Pick an adapter library that fits your stack (Vercel AI SDK for TS, DSPy or LiteLLM for Python).
  2. Wrap your LLM calls behind the adapter. No business logic should hit provider SDKs directly.
  3. Add a router with at minimum 2 providers + automatic failover.
  4. Add a grounding layer that doesn't care which provider produced the response. Run the same 5-min quickstart regardless of backend.
  5. Test the failover by killing each provider in turn and confirming citations still render.
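
Step 5 can be automated. The sketch below is a minimal harness under stated assumptions: `answer`, `fake_ground`, and `kill_each_in_turn` are hypothetical helpers, the providers are stand-in callables, and `fake_ground` replaces the real grounding call so the check runs offline. In a real test you would inject your actual router and grounding function.

```python
from typing import Callable


def answer(prompt: str, providers: dict[str, Callable[[str], str]], ground) -> dict:
    """Try providers in insertion order, then ground whatever text comes back."""
    for name, call in providers.items():
        try:
            text = call(prompt)
        except Exception:
            continue  # provider is down; move on to the next one
        return {"provider": name, **ground(text)}
    raise RuntimeError("all providers failed")


def fake_ground(text: str) -> dict:
    # Stand-in for the real grounding call: one placeholder citation per response.
    return {"response": text, "citations": [{"id": "demo-claim", "claim": text}]}


def dead(prompt: str) -> str:
    """Simulates a provider outage."""
    raise ConnectionError("simulated outage")


def kill_each_in_turn(healthy: dict[str, Callable[[str], str]]) -> dict:
    """Return {killed_provider: result} with exactly one provider down per run."""
    results = {}
    for victim in healthy:
        providers = {n: (dead if n == victim else c) for n, c in healthy.items()}
        results[victim] = answer("When was Llama 3.1 released?", providers, fake_ground)
    return results


healthy = {
    name: (lambda p, name=name: f"answer from {name}")
    for name in ("anthropic", "openai", "gemini")
}

results = kill_each_in_turn(healthy)
for victim, result in results.items():
    print(f"{victim} down -> served by {result['provider']}, "
          f"{len(result['citations'])} citation(s)")
```

The property to assert is that every run still returns citations, whichever provider was taken down.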
