AI digest: infrastructure gets real
Big tech scrambles to manage AI costs while mobile deployment gets serious optimisation.
The honeymoon period is over. Companies are hitting AI cost walls and looking for ways around them, while mobile deployment finally gets the tools it needs.
Google pays SpaceX nearly a billion monthly for compute
Google is paying SpaceX $920 million per month for compute power, roughly $11 billion a year at that rate. A Google rep blamed “unexpected demand” for recently launched AI products. Either that’s a massive miscalculation, or AI usage is growing faster than anyone anticipated.
Gemma 4 gets proper mobile optimisation
Google DeepMind released Gemma 4 QAT (quantisation-aware training) checkpoints with Q4_0 quantisation and a new mobile format that cuts on-device memory use. Finally, someone’s taking edge deployment seriously instead of just shrinking cloud models. The mobile format is the interesting bit here.
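For readers who have not met 4-bit block quantisation, here is a minimal sketch of the general idea behind Q4_0-style formats: a block of weights shares one scale, and each weight is stored as a 4-bit code. This is a simplified illustration, not Gemma's or any library's actual implementation. The function names are hypothetical, the block size of 32 is an assumption, and real formats also pack two codes per byte and store the scale in half precision.

```python
BLOCK_SIZE = 32  # assumption: 32 weights share a single scale


def quantise_block(weights):
    """Map a block of floats to one scale plus 4-bit codes in 0..15."""
    # Take the weight with the largest magnitude, keeping its sign.
    extreme = max(weights, key=abs)
    # Dividing by -8 makes that extreme weight land exactly on code 0.
    scale = extreme / -8.0
    if scale == 0.0:
        # All-zero block: code 8 decodes back to 0.0.
        return 0.0, [8] * len(weights)
    # Shift by 8 so codes are unsigned, round to nearest, clamp to 4 bits.
    codes = [min(15, max(0, int(w / scale + 8.5))) for w in weights]
    return scale, codes


def dequantise_block(scale, codes):
    """Recover approximate floats from the scale and the 4-bit codes."""
    return [(code - 8) * scale for code in codes]


if __name__ == "__main__":
    block = [((i * 37) % 64 - 32) / 32.0 for i in range(BLOCK_SIZE)]
    scale, codes = quantise_block(block)
    restored = dequantise_block(scale, codes)
    worst = max(abs(a - b) for a, b in zip(block, restored))
    print(f"scale={scale}, worst error={worst}")
```

The memory saving comes from the storage maths: under these assumptions, 32 four-bit codes plus a 16-bit scale is 18 bytes per block, against 64 bytes for the same weights in half precision.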
NVIDIA tackles inference startup times
NVIDIA released Dynamo Snapshot, a CRIU-based system for fast startup of AI inference workers on Kubernetes. It checkpoints and restores vLLM workers instead of cold starting them. Smart move when every second spent waiting on a cold start costs real money.
Perplexity builds hybrid inference routing
Perplexity announced a hybrid local-server inference orchestrator that automatically routes tasks between on-device and cloud models. This is the logical next step after everyone realised running everything in the cloud is expensive. If the routing is good, cheap tasks stay on-device and only the hard ones incur cloud costs.
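To make the routing idea concrete, here is a minimal sketch of how a local-versus-cloud decision could look. It is not Perplexity's design, which the announcement does not detail. The `Task` fields, the `route` function and the 512-word threshold are all hypothetical choices for illustration.

```python
from dataclasses import dataclass

# Hypothetical threshold: prompts longer than this go to the cloud.
LOCAL_MAX_WORDS = 512


@dataclass(frozen=True)
class Task:
    """A hypothetical inference request with a few routing hints."""
    prompt: str
    needs_long_context: bool = False
    needs_tools: bool = False


def route(task, local_available=True):
    """Return "local" or "cloud" for a task using simple, explicit rules."""
    if not local_available:
        # No on-device model loaded, so the cloud is the only option.
        return "cloud"
    if task.needs_long_context or task.needs_tools:
        # Assumption: the small on-device model cannot handle these.
        return "cloud"
    if len(task.prompt.split()) > LOCAL_MAX_WORDS:
        return "cloud"
    # Short, simple requests stay on-device and cost nothing in cloud spend.
    return "local"


if __name__ == "__main__":
    print(route(Task("Summarise this paragraph in one line.")))
    print(route(Task("Plan a trip and book it.", needs_tools=True)))
```

A production router would presumably weigh latency, battery and model confidence as well, but even rules this crude show where the savings come from: every request answered locally is one that never reaches a metered endpoint.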
Token costs finally hit reality
TechCrunch reports on AI’s runaway costs, with one source noting the shift from “tokenmaxxing and go fast” to “we need guardrails, how do we control this?” The party’s over and finance teams are asking awkward questions.