#testing

5 posts tagged testing.

Thoughts

We're measuring agent performance like it's deterministic software when the whole point is emergent behaviour.

While everyone obsesses over model capabilities, we're shipping AI systems with testing practices from 2015.

While researchers obsess over benchmarks, the real breakthroughs are happening in production environments where models meet reality.

Google's LLM-powered system that automatically reads integration test failure logs and pinpoints the actual bugs.

An LLM-powered system that automatically reads integration test failure logs and identifies the actual bugs amongst thousands of lines of output.