AI digest: Voice gets real, agents get organised
Voice AI finally starts listening to how we actually talk, while OpenAI ships a less hallucination-prone model and everyone builds agent frameworks.
Voice tech is having a moment. Two new models show the industry moving past robotic speech towards something that might actually sound human.
Inworld’s voice model listens to how you speak
Inworld AI launched Realtime TTS-2, a voice model that conditions on full audio context rather than just text transcripts. This is clever architecture. Most TTS systems read words but miss the music of conversation - the pauses, emphasis, and rhythm that make speech feel natural.
Mistral tackles the expressivity problem
Mistral’s Voxtral TTS combines autoregressive and flow-matching architectures to solve voice cloning’s biggest weakness - maintaining emotional consistency. The hybrid approach is promising, though we’ll need to hear it in action to judge whether it actually closes that expressivity gap.
OpenAI ships GPT-5.5 Instant with fewer hallucinations
GPT-5.5 Instant is now ChatGPT’s default model, with OpenAI claiming 52.5% fewer hallucinations on sensitive topics like medicine and law. OpenAI also added “memory sources” so you can see which stored context influences a response. Smart move - transparency builds trust, especially when the model is making claims about your health.
Agent frameworks go modular
A Python tutorial shows how to build modular agent systems with dynamic tool routing. The OS-like approach to agent capabilities makes sense - reusable skills with proper schemas and central registries. We’re seeing patterns emerge for how to structure these systems properly.
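To make the registry idea concrete, here is a minimal sketch in plain Python. It is not the tutorial's code: the names Skill, SkillRegistry and dispatch are hypothetical, and the "schema" is just a mapping of parameter names to Python types rather than a full JSON Schema.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class Skill:
    """A reusable capability with a minimal parameter schema (hypothetical shape)."""
    name: str
    description: str
    schema: Dict[str, type]      # parameter name -> expected Python type
    handler: Callable[..., Any]


class SkillRegistry:
    """Central registry: skills are registered once and routed to by name."""

    def __init__(self) -> None:
        self._skills: Dict[str, Skill] = {}

    def register(self, skill: Skill) -> None:
        if skill.name in self._skills:
            raise ValueError(f"skill already registered: {skill.name}")
        self._skills[skill.name] = skill

    def dispatch(self, name: str, **kwargs: Any) -> Any:
        """Route a call to the named skill after checking it against the schema."""
        if name not in self._skills:
            raise KeyError(f"unknown skill: {name}")
        skill = self._skills[name]
        missing = set(skill.schema) - set(kwargs)
        extra = set(kwargs) - set(skill.schema)
        if missing or extra:
            raise TypeError(
                f"bad arguments: missing={sorted(missing)}, extra={sorted(extra)}"
            )
        for key, expected in skill.schema.items():
            if not isinstance(kwargs[key], expected):
                raise TypeError(f"{key} must be {expected.__name__}")
        return skill.handler(**kwargs)


# Two toy skills, purely for illustration.
registry = SkillRegistry()
registry.register(Skill("add", "Add two integers", {"a": int, "b": int},
                        lambda a, b: a + b))
registry.register(Skill("shout", "Upper-case a string", {"text": str},
                        lambda text: text.upper()))
```

In a real agent the skill name and arguments would come from the model's tool-call output, so a call such as registry.dispatch("add", a=2, b=3) is the point where the schema check catches malformed requests before any handler runs.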
Google adds webhooks to Gemini API
Google’s new webhook system eliminates polling for long-running jobs like batch processing and video generation. About time - event-driven architectures are table stakes for production AI workflows.
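For a sense of what the receiving end of an event-driven setup looks like, here is a rough sketch using only Python's standard library. The payload fields (job_id, status) are assumptions made for illustration, not Gemini's documented webhook format, and a production receiver would also need to verify that the request really came from the provider.

```python
import json
from http.server import BaseHTTPRequestHandler
from typing import Dict

# Hypothetical payload shape, NOT Gemini's documented format:
# {"job_id": "...", "status": "succeeded" or "failed", ...}
completed_jobs: Dict[str, dict] = {}


def handle_event(raw_body: bytes) -> int:
    """Record a job-completion event and return the HTTP status to send back."""
    try:
        event = json.loads(raw_body)
    except ValueError:  # covers invalid JSON and undecodable bytes
        return 400
    if not isinstance(event, dict) or "job_id" not in event or "status" not in event:
        return 400
    completed_jobs[event["job_id"]] = event
    return 200


class WebhookHandler(BaseHTTPRequestHandler):
    """Receives POSTed events, so the client never has to poll for job status."""

    def do_POST(self) -> None:
        length = int(self.headers.get("Content-Length", 0))
        status = handle_event(self.rfile.read(length))
        self.send_response(status)
        self.end_headers()
```

To try it locally, something like HTTPServer(("", 8080), WebhookHandler).serve_forever() (HTTPServer also lives in http.server) would start listening. The parsing logic sits in handle_event so it can be tested without opening a socket.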