AI digest: Agents go autonomous
Karpathy releases autonomous ML research tools whilst AI models tackle real production challenges.
Autonomous AI agents are moving from hype to hands-on reality. This week brought practical tools and sobering reality checks.
Karpathy drops autoresearch for solo GPU experiments
Andrej Karpathy open-sourced autoresearch, a 630-line Python tool that lets AI agents run their own machine learning experiments on single GPUs. It’s a stripped-down version of his nanochat training core, designed for autonomous iteration. This feels like the kind of tool that’ll quietly spawn dozens of interesting projects over the next few months.
Claude finds 100+ Firefox bugs humans missed
Anthropic’s Claude AI uncovered over 100 security vulnerabilities in Firefox, including issues that decades of human testing hadn’t caught. Mozilla’s partnership with Anthropic shows AI security auditing moving beyond proof-of-concept into proper production use. Catching flaws that human experts missed suggests we’re hitting a genuine capability threshold.
AI agent benchmarks miss 92% of actual work
Researchers found that AI agent benchmarks obsess over coding whilst ignoring 92% of the US labour market. Most agent development focuses on programming tasks, leaving vast swathes of real-world work unmeasured. This explains why agent demos look impressive but feel disconnected from what most people actually do for work.
Feature bloat breaks models in production
A new study shows that excessive features make regression models fragile in production, even when they improve accuracy. Every additional feature is another dependency that can fail. Worth remembering as we pile more data into AI systems and wonder why they break in unexpected ways.
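The intuition is easy to sketch. A toy model (not from the study): treat each feature as an independent upstream pipeline that is healthy some fraction of the time; a prediction can only be served if every feature arrives, so availability shrinks multiplicatively as features pile up. The numbers below are illustrative assumptions, not figures from the paper.

```python
def availability(n_features: int, p_feature_up: float) -> float:
    """Probability that all n independent feature pipelines are
    healthy at inference time (illustrative assumption: each pipeline
    is up with probability p_feature_up, independently)."""
    return p_feature_up ** n_features

# With each feature pipeline healthy 99% of the time:
small = availability(5, 0.99)     # ~0.95 — lean model, rarely blocked
bloated = availability(50, 0.99)  # ~0.61 — bloated model fails ~4x as often
```

Even if the 50-feature model scores slightly better offline, in this toy setup it fails to serve a prediction nearly 40% of the time, which is the fragility-versus-accuracy trade the study points at.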