#benchmarks
4 posts tagged benchmarks.
News & Updates
DeepSeek R1 vs Claude 3.5: a head-to-head on real tasks
Ran both models through the same set of coding and reasoning tasks. Results were closer than expected.
Testing Kimi k1.5: the reasoning model nobody's talking about
Moonshot AI's Kimi k1.5 quietly dropped, and it's genuinely impressive on long reasoning tasks.
Thoughts
Agentic coding models just made software engineering a spectator sport
When models hit 70% on complex coding benchmarks, we're not optimising development anymore; we're just watching machines do our jobs.
Agent benchmarks are just unit tests for unpredictable systems
We're measuring agent performance like it's deterministic software when the whole point is emergent behaviour.