Topic hub · 9 claims
Voice and audio AI — speech, music, and conversational platforms
Speech recognition (Whisper), text-to-speech (ElevenLabs), music generation (Suno, Stable Audio), voice-emotion (Hume), and end-to-end voice agents (ElevenLabs Conversational AI). Verified release dates + capabilities + license context.
Voice + audio is the next-most-important modality after text
Vision-language models grabbed 2023-2024 headlines, but voice + audio is quietly becoming the dominant interaction surface. ChatGPT's voice mode (2023-09), OpenAI Realtime API (2024-10), Anthropic's voice features, ElevenLabs Conversational AI (2024-11), and the wider Suno + Udio + Stable Audio music-generation wave have made audio a first-class AI modality. Voice-only smart speakers + earbuds + handsets push interaction toward speech, not text.
The audio AI stack has three layers
Layer 1: ASR (speech-to-text) — OpenAI Whisper (open-weight, foundational), Whisper large-v3 (2023-11). Layer 2: TTS (text-to-speech) — ElevenLabs (founded 2022), OpenAI tts-1, Anthropic, etc. Layer 3: Audio generation — Suno v3/v4 (music, 3-minute coherent tracks), Stable Audio 2.0 (Stability AI 2024-04 long-form music), AudioLM (Google 2022, foundational). Hume AI (founded 2021) adds emotion-recognition over speech. ElevenLabs Conversational AI (2024-11) combines all three layers into a single voice-agent API.
Why voice agents are the 2025 frontier
Production voice agents need ASR + LLM + TTS in single low-latency loop (<1 second from user voice to first response audio). Achieving sub-second latency requires careful integration — model parallelism, streaming output, voice-activity detection. The platforms that ship this integrated experience (ElevenLabs Conversational AI, OpenAI Realtime API, Anthropic voice features, Vapi, Retell) are the 2025 voice-agent leaders.
Why this catalog matters for verification
Voice-AI assistants confidently emit hallucinated facts the same way text LLMs do — but users can't easily fact-check while listening. The verification layer (SourceScore VERITAS) is doubly important in voice context: text-render the LLM response server-side, verify facts before TTS, only synthesize verified content. Plus, the voice + music + audio claims in this hub are themselves the kind of facts a voice-assistant might be asked about — the catalog grounds future voice agents asking about voice-AI history.
Defined terms (6)
- ASR (Automatic Speech Recognition)
- Converting audio of human speech to text. State-of-the-art systems: OpenAI Whisper, Google USM, Amazon Transcribe, ElevenLabs Speech-to-Text.
- TTS (Text-to-Speech)
- Converting text to natural-sounding speech audio. Leaders: ElevenLabs, OpenAI tts-1, Microsoft Azure Speech, Google WaveNet.
- Voice agent
- An end-to-end conversational AI that handles voice input + voice output in a single low-latency loop. Combines ASR + LLM + TTS.
- Whisper
- OpenAI's open-weight ASR model (2022-09), trained on 680k hours of multilingual audio. Foundational to most production speech-to-text systems today. Whisper large-v3 (2023-11) is current generation.
- Suno
- Music generation startup (founded 2023). Suno v3 (2024-03) + v4 (2024-11) generate coherent multi-minute songs with lyrics + instruments + vocals from text prompts.
- ElevenLabs
- Voice AI company (founded 2022) leading in voice cloning + TTS. ElevenLabs Conversational AI (2024-11) is their end-to-end voice agent platform.
All claims in this topic (9)
- ElevenLabs Conversational AI·publicly released on 2024-11-19 by ElevenLabs — voice-to-voice agent platform combining ASR + LLM + TTS in single API(1.00 · 2 sources)
- Stability AI Stable Audio 2.0·publicly released on 2024-04-03 by Stability AI — Stable Audio 2.0 long-form music + sound effects from text prompts(1.00 · 2 sources)
- Suno v3·publicly released on 2024-03-21 by Suno — improved music generation (full 2-minute songs, lyrics + instrumentation)(1.00 · 2 sources)
- Suno v4·publicly released on 2024-11-19 by Suno — improved music generation, longer outputs, audio remastering(1.00 · 2 sources)
- Whisper·released on 2022-09-21(1.00 · 2 sources)
- Whisper large-v3·publicly released on 2023-11-06 by OpenAI(1.00 · 2 sources)
- ElevenLabs·founded in 2022 — AI voice synthesis platform(0.95 · 2 sources)
- Hume AI·founded in 2021 — empathic voice AI company(0.95 · 2 sources)
- Suno AI·founded in 2023 — AI music generation(0.95 · 2 sources)