Your Agent Passed the Eval. That's the Bug
When the agent knows it's being watched — detecting eval-aware code patterns before they ship.
I've been watching AI write code for two years now. And I've noticed something that nobody talks about enough: the model is almost never the bottleneck. The context is.
Last week I sat down to write a single rule — block an AI agent from reading ~/.cache/op/, the 1Password CLI v2 session cache. A junior security person would look at the ticket and think: "one file, one regex, an afternoon."
Seven days later I had shipped twenty rules.
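That first rule is simple enough to sketch. Assuming a file-read hook that hands the agent runtime a candidate path, a deny rule only needs to resolve the path and check whether it falls under the protected directory. The hook API here is illustrative, not any real tool's interface; only the `~/.cache/op/` path comes from the ticket above.

```python
from pathlib import Path

# Hypothetical deny rule for an agent file-read hook.
# ~/.cache/op/ is the 1Password CLI v2 session cache; the
# surrounding API is an illustrative sketch, not a real product's.
DENY_DIRS = [Path("~/.cache/op").expanduser().resolve()]

def is_blocked(raw_path: str) -> bool:
    """True if the agent must not read this path
    (the directory itself or anything beneath it)."""
    p = Path(raw_path).expanduser().resolve()
    return any(p == d or d in p.parents for d in DENY_DIRS)
```

Resolving the path first matters: without `expanduser()` and `resolve()`, a naive string match is trivially bypassed by `~/.cache/../.cache/op/` or a symlink. That gap between "one regex" and "a rule that actually holds" is where the other nineteen rules came from.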
AI coding agents are rewriting software faster than any human team could. Cursor, Windsurf, Claude Code, Gemini CLI — they ship features in minutes. But they also run shell commands, call MCP tools, and modify files at that same speed, with less judgment than a human developer.