Multimodal AI — vision, audio, video, and cross-modal models
Definition, the 4 modality classes (vision-language, text-to-image, text-to-video, text-to-audio), a 2021-2025 timeline (CLIP → DALL·E → GPT-4V → Sora → SAM 2 → Pixtral → Veo 2), 6 production patterns, 7 failure modes, and how multimodal verification differs from text-only fact-checking.
What is multimodal AI?
Multimodal AI is any system that processes or generates more than one modality of data — typically text, images, audio, or video. The defining capability is cross-modal reasoning: understanding the relationship between modalities, not just handling them in parallel.
A model that takes a photo of a receipt and extracts the total amount is multimodal (image → text). A model that generates a movie scene from a paragraph is multimodal (text → video). A pipeline that does OCR on a PDF then summarizes it is not truly multimodal: the summarizer only ever sees the extracted text, so it is two single-modality stages glued together.
The 4 modality classes
- Vision-Language Models (VLMs) — image + text → text. Examples: GPT-4 Vision (OpenAI 2023-09), Claude 3 (Anthropic 2024-03), Pixtral 12B (Mistral 2024-09), Apple Intelligence (Apple 2024-10), LLaVA, Llama 3.2 Vision.
- Text-to-Image — text → image. Examples: DALL·E 3 (OpenAI 2023-10), Stable Diffusion 3 (Stability AI 2024-02), Midjourney (public beta 2022-07), Flux (Black Forest Labs 2024-08), Imagen 3 (Google DeepMind 2024).
- Text-to-Video — text → video. Examples: Sora (OpenAI 2024-02), Veo 2 (Google DeepMind 2024-12), Runway ML, Pika, Kling AI.
- Text-to-Audio — text → audio. Examples: Suno v4 (music, Suno 2024-11), ElevenLabs (voice, 2022-), AudioLM (Google 2022), Stable Audio (Stability AI 2023).
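Some models cross multiple classes. GPT-4o (2024-05) accepts image + audio + text and emits text + audio. Gemini 1.5 Pro (Google 2024-02) accepts text, image, audio, and video as input but emits text only, so it spans the input modalities rather than all four classes. These omnimodal models are the 2024-2025 frontier direction.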
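One way to make the class boundaries concrete is to treat each class as an input/output signature and check which signatures a model satisfies. The sketch below is illustrative only: `ModelSpec`, `CLASS_SIGNATURES`, and `classes_covered` are hypothetical names, and the GPT-4o modalities are taken from the paragraph above.

```python
from dataclasses import dataclass

# Input/output signatures of the four classes listed above.
CLASS_SIGNATURES = {
    "vision-language": (frozenset({"image", "text"}), frozenset({"text"})),
    "text-to-image": (frozenset({"text"}), frozenset({"image"})),
    "text-to-video": (frozenset({"text"}), frozenset({"video"})),
    "text-to-audio": (frozenset({"text"}), frozenset({"audio"})),
}


@dataclass(frozen=True)
class ModelSpec:
    """Hypothetical description of what a model accepts and emits."""
    name: str
    inputs: frozenset
    outputs: frozenset


def classes_covered(spec: ModelSpec) -> list[str]:
    """Return every class whose required inputs and outputs the model supports."""
    return [
        label
        for label, (needs_in, needs_out) in CLASS_SIGNATURES.items()
        if needs_in <= spec.inputs and needs_out <= spec.outputs
    ]


gpt4o = ModelSpec(
    name="GPT-4o",
    inputs=frozenset({"text", "image", "audio"}),
    outputs=frozenset({"text", "audio"}),
)
print(classes_covered(gpt4o))  # ['vision-language', 'text-to-audio']
```

A model that only reads many modalities but emits text will match a single class here, which is why input coverage alone does not make a model span all four.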
Timeline — 2021 to 2025
- 2021-01 · CLIP + DALL·E 1 (OpenAI): CLIP's contrastive dual-encoder pattern becomes an architectural ancestor of many later multimodal systems.
- 2022-04 · DALL·E 2 (OpenAI) — diffusion + CLIP-guided generation; first widely-usable photorealistic text-to-image.
- 2022-07 · Midjourney public beta — commoditized text-to-image for general users.
- 2022-08 · Stable Diffusion v1.4 (Stability AI, with CompVis and Runway): first public open-weight release; ignites the multimodal open-source ecosystem.
- 2022-09 · Whisper (OpenAI) — robust multilingual speech-to-text; foundational audio modality.
- 2023-03 · GPT-4 (OpenAI): image input announced at launch; vision rollout to ChatGPT users announced 2023-09-25.
- 2023-07 · SDXL (Stability AI) — quality jump on Stable Diffusion line.
- 2023-10 · DALL·E 3 (OpenAI) — better prompt-following + native ChatGPT integration.
- 2024-02 · Sora announced (OpenAI) — 60s video from text, watermark debate.
- 2024-02 · Stable Diffusion 3 (Stability AI) — Diffusion Transformer (DiT) architecture.
- 2024-05 · GPT-4o (OpenAI) — first production omnimodal at consumer scale.
- 2024-07 · SAM 2 (Meta AI) — real-time video segmentation; vision foundation model.
- 2024-08 · Flux (Black Forest Labs): new open-weight contender for the image-generation lead, built by former Stability AI researchers.
- 2024-09 · Pixtral 12B (Mistral) — first European open-weight multimodal at Apache 2.0.
- 2024-11 · Suno v4 (Suno) — best-in-class text-to-music quality.
- 2024-12 · Veo 2 (Google DeepMind) — 4K text-to-video; physics + cinematography upgrade.
- 2025-02 · Claude 3.7 Sonnet (Anthropic) + Grok 3 (xAI) — multimodal reasoning models with hybrid extended-thinking.
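The contrastive dual-encoder pattern from the 2021-01 entry can be illustrated with a toy loss. This is a minimal sketch of a symmetric contrastive (InfoNCE-style) objective, not CLIP's actual training code: the embeddings are hand-written vectors standing in for encoder outputs, and the temperature of 0.07 is an illustrative assumption.

```python
import math


def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric loss: image i should match text i, and text i should match image i."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    # Cosine similarity of every image with every text, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


images = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss(images, [[1.0, 0.0], [0.0, 1.0]])
shuffled = contrastive_loss(images, [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)  # True
```

Matched image-text pairs drive the loss toward zero, while swapped pairs are penalized heavily; training two encoders against this objective is what places both modalities in one shared embedding space.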
6 production patterns
- Document understanding — VLM extracts structured data from PDFs, receipts, screenshots, scanned forms. Replaces OCR + LLM stack.
- Visual search + retrieval — CLIP-style embeddings index a product catalog or asset library; user queries by photo or text.
- Generative design — text-to-image for marketing assets, product mockups, concept art. Output often refined via inpainting or controlled by depth/pose conditioning.
- Video summarization + chaptering — VLM samples frames, transcribes audio, produces structured chapter markers + summary.
- Accessibility — alt-text generation for images, audio descriptions for video, sign-language translation. High-leverage, often regulatory.
- Robotics + embodied agents — VLM grounds perception (camera) to natural-language instructions. Examples: Google RT-2, Tesla Optimus embodiment.
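The visual search pattern above reduces to nearest-neighbor lookup over embeddings. The following is a minimal sketch, assuming the vectors have already been produced by some CLIP-style encoder; the catalog entries and three-dimensional vectors are made up for illustration, and a production system would use an approximate nearest-neighbor index rather than a linear scan.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query_vec, index, top_k=2):
    """Rank catalog items by similarity to the query embedding."""
    scored = [(cosine(query_vec, vec), item_id) for item_id, vec in index.items()]
    scored.sort(reverse=True)
    return [item_id for _, item_id in scored[:top_k]]


# Hypothetical catalog: item id -> embedding of the product photo.
catalog = {
    "red-sneaker": [0.9, 0.1, 0.0],
    "blue-sneaker": [0.1, 0.9, 0.0],
    "red-handbag": [0.7, 0.0, 0.7],
}

# The query embedding could come from a photo or from a text prompt,
# because both encoders map into the same space.
query = [1.0, 0.0, 0.1]
print(search(query, catalog))  # ['red-sneaker', 'red-handbag']
```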
7 failure modes
- Hallucinated objects. VLM describes a cat in an image that contains no cat. More likely on low-resolution or ambiguous images.
- OCR errors on stylized text. VLM reads decorative fonts, handwriting, or low-contrast text incorrectly without admitting uncertainty.
- Counting failures. "How many people are in this photo?" remains surprisingly hard for VLMs as of 2024-2025.
- Spatial reasoning. "Which object is left of the chair?" Left/right confusion is common; models often answer in image-frame coordinates even when the question implies the viewpoint of an object in the scene.
- Text-image misalignment in generation. Text-to-image misses specific details (color of object 2, number of fingers, exact text in image). Worse on complex prompts.
- Watermark + provenance gaps. Generated media is hard to distinguish from real media, and social platforms struggle to detect it. The C2PA standard is emerging but not universal.
- Modality leakage in evals. Models evaluated only on text tasks can pass while their vision capability has silently degraded. Run multimodal eval suites (MMMU, MathVista, BLINK) regularly.
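The advice to run multimodal eval suites regularly can be enforced with a simple regression gate. This is a sketch under stated assumptions: `find_regressions` is a hypothetical helper, the one-point tolerance is arbitrary, and the scores below are invented placeholders, not published results.

```python
def find_regressions(baseline, current, tolerance=1.0):
    """Flag benchmarks that dropped by more than `tolerance` points or were not run."""
    regressions = {}
    for bench, old_score in baseline.items():
        new_score = current.get(bench)
        if new_score is None:
            # A skipped multimodal suite is exactly how degradation goes unnoticed.
            regressions[bench] = "not run"
        elif old_score - new_score > tolerance:
            regressions[bench] = f"{old_score:.1f} -> {new_score:.1f}"
    return regressions


# Placeholder scores for illustration only.
baseline = {"MMMU": 60.0, "MathVista": 55.0, "BLINK": 50.0}
current = {"MMMU": 59.6, "MathVista": 49.0}

print(find_regressions(baseline, current))
# {'MathVista': '55.0 -> 49.0', 'BLINK': 'not run'}
```

Treating a missing suite as a failure, rather than silently skipping it, is the design choice that addresses the leakage described above.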
Multimodal verification vs text-only
SourceScore VERITAS today covers text-only AI/ML claims (model release dates, paper authorship, parameter counts, benchmark scores). Multimodal verification is a different problem:
- Image provenance: "Was this image actually generated by this model?" — requires watermarking + perceptual hashing + reverse-search. C2PA spec emerging.
- Visual claim verification: "Does this graph in the AI's response actually show a decreasing trend?" Requires re-rendering the data and comparing the result against the claim.
- Audio + video deepfake detection: separate research field (microexpression analysis, audio artifact detection). Not part of VERITAS scope.
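Perceptual hashing, mentioned under image provenance, can be illustrated with the simplest variant, an average hash. This is a toy sketch, not a provenance tool: it assumes the image has already been converted to grayscale and downscaled to a small grid (real pipelines would do that with an imaging library), and the 4x4 pixel values are made up.

```python
def average_hash(pixels):
    """Hash a small grayscale grid: 1 where a pixel is brighter than the mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return [1 if p > mean else 0 for p in flat]


def hamming(a, b):
    """Number of positions where two equal-length hashes differ."""
    return sum(x != y for x, y in zip(a, b))


original = [
    [10, 200, 10, 200],
    [200, 10, 200, 10],
    [10, 200, 10, 200],
    [200, 10, 200, 10],
]
# A uniformly brightened copy keeps the same light/dark structure.
brightened = [[p + 10 for p in row] for row in original]
# An inverted copy has the opposite structure.
inverted = [[255 - p for p in row] for row in original]

print(hamming(average_hash(original), average_hash(brightened)))  # 0
print(hamming(average_hash(original), average_hash(inverted)))    # 16
```

A small Hamming distance suggests two files are versions of the same picture even after re-encoding or brightness changes, which is why such hashes complement watermarks and reverse search rather than replace them.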
For now, multimodal fact-checking pipelines should: (1) use VERITAS for the textual facts the multimodal output references (e.g., "this model was released on..."), (2) use dedicated provenance tools (C2PA, Content Credentials) for the media itself. Expanding VERITAS to multimodal verification is scoped for year 2 or later (Y2+).
Related
- Hallucination — the failure mode multimodal models share with text models
- Fine-tuning — applies equally to VLMs (LoRA on Pixtral, LLaVA fine-tunes)
- Embeddings — CLIP-style embeddings power multimodal retrieval
- Topic hub: Multimodal AI — catalog of multimodal-related claims
- Use case: Content moderation — pre-publish gate for AI-generated outputs including multimodal