#inference-scaling
2 posts tagged inference-scaling.
Thoughts
Single checkpoints just turned model deployment into a version control nightmare
Embedding multiple model sizes in one checkpoint sounds clever until you realise you've just created the Git submodules of AI.
Memory compression is just bandwidth denial pretending to be breakthrough
Google's TurboQuant and the rush to compress KV caches are treating symptoms whilst ignoring the real problem.