1 post tagged evaluation.
We're measuring agent performance as if agents were deterministic software, when the whole point is emergent behaviour.