AI digest: Voice wars heat up
Google and others push hard on real-time voice AI while practical deployment stories emerge.
Big week for voice AI as the majors rush to catch up with each other’s demos.
Google’s Gemini 3.1 Flash Live promises proper voice conversations
Google released Gemini 3.1 Flash Live, which offers low-latency multimodal voice interactions through the Gemini Live API. The model handles audio, video, and tool use natively rather than stitching separate models together. This feels like Google’s answer to OpenAI’s Advanced Voice Mode, though we’ll see how the latency holds up in practice.
Mistral joins the voice party with Voxtral
Mistral dropped Voxtral TTS, its first text-to-speech model, which can clone a voice from just three seconds of audio and supports nine languages. It’s open-weight, which is refreshing given how locked down most voice cloning tech has been. The three-second cloning claim is bold, and slightly terrifying for obvious reasons.
Cohere enters speech recognition properly
Cohere launched Transcribe, its automatic speech recognition model aimed squarely at enterprise use. The company is positioning it as a direct competitor to proprietary ASR APIs and claims better accuracy. Smart move given how fragmented the enterprise speech market still is, though Cohere is late to a crowded party.
Someone actually made reasoning models practical
A nice coding tutorial shows how to run Qwen3.5 reasoning models with Claude-style thinking using GGUF files and 4-bit quantisation. You can switch between a 27B model and a lightweight 2B version with a single flag. This is the kind of practical deployment work that matters for getting these models into real use.
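To make the single-flag switch and the appeal of 4-bit quantisation concrete, here is a minimal Python sketch. It is not the tutorial’s code: the `--small` flag, the file names, and the helper functions are hypothetical, and the size estimate counts weights only (no KV cache, activations, or quantisation metadata).

```python
import argparse

# Hypothetical registry: the file names are placeholders, not real release artefacts.
MODELS = {
    "27b": {"gguf": "qwen3.5-27b-q4.gguf", "params_billion": 27.0},
    "2b": {"gguf": "qwen3.5-2b-q4.gguf", "params_billion": 2.0},
}


def approx_weight_gb(params_billion: float, bits_per_weight: float = 4.0) -> float:
    """Rough weight-only footprint in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


def pick_model(argv=None) -> dict:
    """Choose a model config from one boolean flag (hypothetical name: --small)."""
    parser = argparse.ArgumentParser(description="Pick a quantised model by size.")
    parser.add_argument(
        "--small",
        action="store_true",
        help="use the 2B model instead of the default 27B",
    )
    args = parser.parse_args(argv)

    key = "2b" if args.small else "27b"
    config = dict(MODELS[key])  # copy so the registry itself is never mutated
    config["name"] = key
    config["approx_weight_gb"] = approx_weight_gb(config["params_billion"])
    return config
```

Calling `pick_model([])` selects the 27B entry and `pick_model(["--small"])` selects the 2B one. Under these assumptions the 4-bit weights come to roughly 13.5 GB and 1 GB respectively, which is why the smaller model is the one you reach for on a laptop.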