Correct code isn’t enough
Thoughts on coding agents, after letting them write all my code for months
Since the summer of 2025, close to 100% of my code has been written by various coding agents. I want to share some of my experience, and the lessons I have learned as I have become better at using them.
Wasteman’s note: I will refer to coding agents interchangeably with “codex” in this post, as it’s the only coding agent I use today.
A harness is not enough
When I first started using agents I was skeptical of the quality of the code they would output, so I was pretty meticulous about reviewing code line by line and slowly adding more rules to make sure codex didn't repeat mistakes. The problem I was solving here was "can I get the agent to write correct code?", facilitated by this harness of rules and good tests. Over time the harness became really effective, and I could reliably get codex to one-shot tasks.
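To make the idea of a "harness" concrete: one piece of it can be regression tests that encode past agent mistakes so they can't be reintroduced. A minimal sketch below; the `normalize_email` helper and the duplicate-email bug are hypothetical examples, not taken from my actual repo.

```python
# Hypothetical guardrail test: it pins down behavior the agent once got wrong,
# so a future one-shot change cannot silently regress it.

def normalize_email(raw: str) -> str:
    """Canonicalize an email before storing or comparing it (the rule the agent must follow)."""
    return raw.strip().lower()

def test_emails_are_canonicalized_before_comparison():
    # Imagined past bug: "Bob@X.com" and "  bob@x.com " were treated as two users.
    # The test asserts the normalized forms compare equal.
    assert normalize_email("Bob@X.com") == normalize_email("  bob@x.com "), \
        "emails must compare equal after normalization"

test_emails_are_canonicalized_before_comparison()
```

Rules files tell the agent what to do; tests like this catch it when it doesn't listen.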
I started to trust codex enough that I would barely review the code, performing only light UX validation before pushing it through. At some point, however, the quality of codex's output started to degrade when I wanted to make changes, especially to something fundamental about my app. Because I had ignored the code for so long, I was not well equipped to debug through all the slop that had accumulated.
We have all had the experience of a manager who knows nothing about your system, constantly asks stupid questions about why things take so long, and wonders, "shouldn't that task be simple?". And I realized I had become that manager, except my employees were a bunch of codex agents. I didn't understand the code or the system well enough to ask the right questions and guide codex to write the code the way I wanted it to.
Wasteman’s note: Candidly, I also became this type of manager during my short stint in management, so I’m not surprised it happened again with non-human agents. Probably more evidence of why I should stay an individual contributor.
The harness I built solved the narrow problem of whether codex could write “correct” code exactly as specified in my unit tests and rules. But I was left with two new problems:
1. Does codex have enough context to infer what the definition of correctness is, given a new task?
2. Does it have context on why we made decisions in the past, and does it have any intuition for what the “right” thing to do is in our domain?
Context is more than code, and it’s more than documentation
One of the ways I tried to solve this context problem was keeping in-repo documentation about why we did things, with the hope that codex would read through it and avoid common pitfalls. It did improve performance for a time, but no matter how rigorous I was about adding decision logs and documentation in code, I eventually ran into the same problems again.
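For reference, a decision-log entry in this spirit might look like the following. The format and every specific (the feed feature, the numbers, the rejected alternative) are invented for illustration, not from my repo.

```
## 2025-08-14 — Why we denormalize the activity feed

Decision: store a precomputed feed row per follower instead of joining at read time.
Why: read latency spiked past ~500ms at high follower counts; fan-out on write keeps reads cheap.
Rejected: materialized views (went stale under our write volume).
Gotcha: deleting a post must also delete its fan-out rows, or the feed shows ghosts.
```

The "why" and "gotcha" lines are the part codex most needs, and also the part it most reliably forgets.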
The two fundamental problems I observed with this approach are:
1. The more context documentation you provide, the more likely codex is to lose it when it compacts. So even if the information is there, codex may not follow it, because it forgot.
2. Human language is fundamentally a lossy form of communication.
I think (1) can be solved with smarter models, but (2) is a fundamental problem when working with models that can’t continually learn (by continual learning I mean the actual weights of the model changing, not just a KV cache in memory). When we speak and write, we aren’t just regurgitating saved thoughts from our brains. Language is just the interface into our minds, where a much deeper representation of our knowledge exists. And there is so much knowledge in our brains that we either don’t know how to express it, or there are no words to describe it.
From Peter Naur’s essay Programming as Theory Building:
A main claim of the Theory Building View of programming is that an essential part of any program, the theory of it, is something that could not conceivably be expressed, but is inextricably bound to human beings
And I think this is what I am getting at when I say that context is more than code and documentation. Humans can hold knowledge in a deep representation that cannot be mapped 1:1 into human language. Peter Naur would say humans can “build a theory” of the system, which you cannot simply codify in markdown files. Until AI labs give us a model that can continually learn, this will be a problem.
Wasteman’s note: I don’t agree with Naur’s statement that this is inextricably bound to human beings, but I think he is correct that you need an agent with a deeper understanding of the system than the code and documents can provide.
Nothing replaces understanding the code
It’s easy to get enamored with using coding agents in an imperative way: “Build me X,” without any strong opinions on how codex should build it. But at some point the code becomes complex enough that codex cannot efficiently solve the problem on its own with its limited context window. This is where your understanding of the system is essential to getting the most out of your agent.
The better you understand the code, the better questions you can ask, and the better intuition you have about whether codex is solving the problem the right way. This is precisely why so many of us have observed that senior engineers are the ones who have benefited most from AI: they know enough to ask the right questions, and they have years of intuition to guide the agent.
You have to be the source of context
In my view, the missing piece for a truly autonomous coding agent is an agent that can continually learn and carry context forward to new subagents completing subtasks. With today’s models, we have to be that agent. We hold deep representations of the systems we have built, compacted far more intelligently than a KV cache of tokens. So we have to be the ones injecting our opinions and “taste”: which problems are the right ones to solve, and what the right ways to solve them are.

