Voice and audio AI — speech, music, and conversational platforms

Voice + audio is the next-most-important modality after text

Vision-language models grabbed the 2023-2024 headlines, but voice and audio are quietly becoming a dominant interaction surface. ChatGPT's voice mode (2023-09), the OpenAI Realtime API (2024-10), Anthropic's voice features, ElevenLabs Conversational AI (2024-11), and the wider music-generation wave (Suno, Udio, Stable Audio) have made audio a first-class AI modality. Smart speakers, earbuds, and handsets push interaction toward speech rather than text.

The audio AI stack has three layers

Layer 1: ASR (speech-to-text). OpenAI Whisper (open-weight, foundational), most recently Whisper large-v3 (2023-11).

Layer 2: TTS (text-to-speech). ElevenLabs (founded 2022), OpenAI tts-1, and others.

Layer 3: Audio generation. Suno v3/v4 (music, multi-minute coherent tracks), Stable Audio 2.0 (Stability AI, 2024-04, long-form music), AudioLM (Google, 2022, foundational).

Hume AI (founded 2021) adds emotion recognition over speech. ElevenLabs Conversational AI (2024-11) combines the ASR and TTS layers with an LLM in a single voice-agent API.

Why voice agents are the 2025 frontier

Production voice agents need ASR, an LLM, and TTS in a single low-latency loop (under 1 second from the user's voice to the first response audio). Achieving sub-second latency requires careful integration: model parallelism, streaming output, and voice-activity detection. The platforms that ship this integrated experience (ElevenLabs Conversational AI, OpenAI Realtime API, Anthropic voice features, Vapi, Retell) are the 2025 voice-agent leaders.
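To make the sub-second target concrete, here is a minimal sketch of two ideas from the paragraph above: a time-to-first-audio budget and sentence-level streaming from the LLM into TTS. The stage names, the millisecond figures, and the helper names (`StageLatency`, `time_to_first_audio`, `sentence_chunks`) are illustrative assumptions, not measurements or APIs from any of the platforms listed. The sketch also assumes the stages stream into each other, so only each stage's time to first output counts toward the budget.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator

SENTENCE_END = (".", "!", "?")


@dataclass
class StageLatency:
    """Hypothetical time a stage needs before it emits its first output."""
    name: str
    first_output_ms: int


def time_to_first_audio(stages: Iterable[StageLatency]) -> int:
    """Sum each stage's time to first output (assumes the stages stream and overlap)."""
    return sum(stage.first_output_ms for stage in stages)


def sentence_chunks(tokens: Iterable[str]) -> Iterator[str]:
    """Group streamed LLM tokens into sentences so TTS can start on the first one."""
    buffer = []
    for token in tokens:
        buffer.append(token)
        if token.rstrip().endswith(SENTENCE_END):
            yield "".join(buffer).strip()
            buffer = []
    if buffer:  # flush any trailing partial sentence
        yield "".join(buffer).strip()


# Illustrative numbers only; real values depend on models, network, and hardware.
budget = [
    StageLatency("vad_endpointing", 200),
    StageLatency("asr_final_transcript", 150),
    StageLatency("llm_first_sentence", 350),
    StageLatency("tts_first_audio_chunk", 200),
]
total_ms = time_to_first_audio(budget)
within_budget = total_ms < 1000

# Simulated token stream standing in for a streaming LLM response.
tokens = ["Sure", ".", " The", " store", " opens", " at", " nine", ".", " Anything", " else", "?"]
chunks = list(sentence_chunks(tokens))
```

In this sketch, TTS can begin synthesizing the first chunk while the LLM is still producing later sentences, which is why streaming output matters more to perceived latency than total generation time.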

Why this catalog matters for verification

Voice-AI assistants confidently emit hallucinated facts the same way text LLMs do, but users can't easily fact-check while listening. That makes a verification layer (SourceScore VERITAS) doubly important in a voice context: render the LLM response as text server-side, verify the facts before TTS, and synthesize only verified content. In addition, the voice, music, and audio claims in this hub are themselves the kind of facts a voice assistant might be asked about, so the catalog can ground future voice agents answering questions about voice-AI history.
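The verify-before-TTS flow described above could look roughly like the following sketch. It does not use the actual SourceScore VERITAS interface, which is not shown here. Instead, `verify` and `synthesize` are hypothetical callables, and `toy_verify` and `toy_synthesize` are stand-ins for a real verification service and a real TTS API.

```python
import re
from typing import Callable, List, Tuple


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter; a production system would use a proper tokenizer."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def gate_before_tts(
    response_text: str,
    verify: Callable[[str], bool],
    synthesize: Callable[[str], bytes],
) -> Tuple[List[bytes], List[str]]:
    """Synthesize only sentences the verifier accepts; return audio plus rejected text."""
    audio_chunks, rejected = [], []
    for sentence in split_sentences(response_text):
        if verify(sentence):
            audio_chunks.append(synthesize(sentence))
        else:
            rejected.append(sentence)
    return audio_chunks, rejected


# Hypothetical stand-ins: a real deployment would call a verification
# service and a TTS API here instead of these toy functions.
KNOWN_FACTS = {"Whisper was released in 2022."}


def toy_verify(sentence: str) -> bool:
    return sentence in KNOWN_FACTS


def toy_synthesize(sentence: str) -> bytes:
    return sentence.encode("utf-8")  # placeholder for real audio bytes


audio, rejected = gate_before_tts(
    "Whisper was released in 2022. Whisper was trained on the Moon.",
    toy_verify,
    toy_synthesize,
)
```

Gating per sentence is one possible design choice: it lets verified sentences reach the listener without waiting for the whole response, at the cost of possibly dropping a sentence that later ones depend on.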

Defined terms (6)

ASR (Automatic Speech Recognition)

Converting audio of human speech to text. State-of-the-art systems: OpenAI Whisper, Google USM, Amazon Transcribe, ElevenLabs Speech-to-Text.

TTS (Text-to-Speech)

Converting text to natural-sounding speech audio. Leaders: ElevenLabs, OpenAI tts-1, Microsoft Azure Speech, Google WaveNet.

Voice agent

An end-to-end conversational AI that handles voice input + voice output in a single low-latency loop. Combines ASR + LLM + TTS.

Whisper

OpenAI's open-weight ASR model (2022-09), trained on 680k hours of multilingual audio. Foundational to most production speech-to-text systems today. Whisper large-v3 (2023-11) is the current generation.

Suno

Music generation startup (founded 2023). Suno v3 (2024-03) + v4 (2024-11) generate coherent multi-minute songs with lyrics + instruments + vocals from text prompts.

ElevenLabs

Voice AI company (founded 2022) leading in voice cloning + TTS. ElevenLabs Conversational AI (2024-11) is their end-to-end voice agent platform.

All claims in this topic (9)
