AI digest: Voice wars heat up

Big week for voice AI as the majors rush to catch up with each other’s demos.

Google’s Gemini 3.1 Flash Live promises proper voice conversations

Google released Gemini 3.1 Flash Live with low-latency multimodal voice interactions through the Gemini Live API. The model processes audio, video, and tool use natively rather than stitching separate models together. This feels like Google’s answer to OpenAI’s Advanced Voice Mode, though we’ll see how the latency actually holds up in practice.

Mistral joins the voice party with Voxtral

Mistral dropped Voxtral TTS, their first text-to-speech model, which can clone a voice from just three seconds of audio and supports nine languages. It’s open-weight, which is refreshing given how locked down most voice cloning tech has been. The three-second cloning claim is bold but also slightly terrifying for obvious reasons.

Cohere enters speech recognition properly

Cohere launched Transcribe, their automatic speech recognition model aimed squarely at enterprise use. They’re positioning it as a direct competitor to proprietary ASR APIs and claiming better accuracy. Smart move given how fragmented the enterprise speech market still is, though they’re late to a crowded party.

Someone actually made reasoning models practical

A nice coding tutorial emerged showing how to run Qwen3.5 reasoning models with Claude-style thinking using GGUF and 4-bit quantisation. You can switch between a 27B model and a lightweight 2B version with a single flag. This is the kind of practical deployment work that actually matters for getting these models into real use.
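
For a feel of what the single-flag switch could look like, here is a minimal sketch. It is not the tutorial's code: the `--small` flag name, the `ModelSpec` type and both file paths are made up for illustration, and the actual loading step (for example with a GGUF runtime such as llama-cpp-python) is left out.

```python
import argparse
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    label: str  # human-readable size label
    path: str   # local GGUF file (hypothetical placeholder path)


# Assumed file names for illustration only; the tutorial's real files may differ.
LARGE = ModelSpec(label="27B", path="models/qwen-27b-q4.gguf")
SMALL = ModelSpec(label="2B", path="models/qwen-2b-q4.gguf")


def resolve(argv):
    """Parse command-line style arguments and return the model to load."""
    parser = argparse.ArgumentParser(description="Pick a quantised model by size")
    # --small is a made-up flag name; without it, the 27B model is selected.
    parser.add_argument("--small", action="store_true", help="use the 2B model")
    args = parser.parse_args(argv)
    return SMALL if args.small else LARGE


# Show which model each invocation would pick.
print(resolve([]).label)
print(resolve(["--small"]).label)
```

Keeping the choice in one function means the rest of a script only ever sees a path, so swapping sizes never touches the inference code.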
