TriAttention | SOFT CAT .ai

Long-chain reasoning models like DeepSeek-R1 can generate tens of thousands of tokens before solving a complex problem. Each token must be stored in the KV cache, which quickly becomes the bottleneck for inference speed and memory usage.

TriAttention, developed by researchers from MIT, NVIDIA, and Zhejiang University, compresses this cache without sacrificing attention quality. The method achieves 2.5× higher throughput compared to full attention whilst maintaining equivalent performance.

This isn’t just another approximation technique that trades quality for speed. The approach specifically targets the memory wall that hits when models work through extended reasoning chains. For anyone running long-context inference at scale, this could meaningfully reduce compute costs without degrading output quality.

The timing is particularly relevant given the recent focus on reasoning models that generate extensive internal monologues. We’re watching for implementation details and whether this approach generalises across different model architectures.

No interactive tool for this one yet. Browse all tools for more.