#benchmarks

4 posts tagged benchmarks.

Ran both models through the same set of coding and reasoning tasks. Results were closer than expected.

Moonshot AI's Kimi k1.5 quietly dropped, and it's genuinely impressive on long reasoning tasks.

When models hit 70% on complex coding benchmarks, we're not optimising development anymore; we're just watching machines do our jobs.

We're measuring agent performance like it's deterministic software when the whole point is emergent behaviour.