Thoughts

Memory compression is just bandwidth denial pretending to be breakthrough

Google's TurboQuant and the rush to compress KV caches are treating symptoms whilst ignoring the real problem.

We’re watching the industry celebrate memory compression breakthroughs like they’ve solved fundamental scaling issues. Google’s TurboQuant promises 6x memory reduction with zero accuracy loss. Everyone’s building paged attention systems and quantisation frameworks. But this is just bandwidth theatre dressed up as innovation.

The compression trap

Memory compression is what happens when you’ve already lost the architecture war. We’re squeezing KV caches because we built transformer inference around the assumption that memory is infinite and bandwidth is free. Neither was ever true. Now we’re spending engineering cycles optimising around a broken premise instead of questioning why we need gigabytes of cached attention states in the first place.

The real tell is that all these compression schemes are “data-oblivious”. They’re generic solutions that don’t understand what the model is actually doing with that memory. We’re applying gzip-level thinking to attention mechanisms that could be fundamentally rethought.

Infrastructure as destiny

This compression obsession reveals something deeper about how we’re scaling AI systems. Instead of designing models that use memory efficiently, we’re building increasingly complex infrastructure to make wasteful models run faster. It’s like adding more lanes to a motorway instead of questioning whether everyone needs to drive to the same place at the same time.

The companies winning this game won’t be the ones with the best compression algorithms. They’ll be the ones who realised memory bandwidth was the constraint and designed around it from day one.