"Vibe Coding"

December 09, 2025 — 3 Minute Read

I’ve been using LLMs a lot lately, and in the last few weeks I have been intentionally exploring their utility in a software engineering context. My conclusion is that “Vibe Coding” is actually the exact opposite of the correct way to use LLMs:

LLMs thrive on context: the more precise and verbose the prompt, the better job they do. They’re also always going to reference additional context that you didn’t include, just by virtue of how they’re trained - your instructions are only overlaid on top of existing probabilities, which come entirely from material you didn’t ask for. So it makes sense that large amounts of context are helpful, even required, for getting good results.

Which brings us to “Vibe Coding” - the idea here is that verbose product specs are dead because you can just fiddle your way into something that works rather than thinking through exactly how it should work first. The LLM will trial and error with you until you get the result you’re looking for.

This is a bizarre thing to use LLMs for, as they will invariably just make a bad version of something that already exists unless you are quite precise with your instructions. But beyond that, the culture of letting the LLM do the initial experimentation and then handing the “hard work” to humans leaves engineers with nothing but messes to clean up. And often, those messes could have been avoided or circumvented by writing a good spec first.

Ironically, the LLM could then do a much, much better job reviewing, adjusting, writing tests for, and otherwise truly pairing on a problem if you had written the spec to begin with and made a best effort at writing the first version yourself. This also gives it clear style references and test examples to follow, allowing it to genuinely help instead of being relegated to churning out throwaway code that over-promises to a product manager and leaves a pile of technical debt in its wake.

I suspect that these shortcomings, in precisely the use cases that vibe coding boosters push hardest, are just the tip of the iceberg. A human can be told “don’t abstract things in this part of the codebase” or “we have a style guide for readability that includes X”, and generally they only need to be told once. In my short experiment I have encountered many things that would be easy to tell a junior engineer one time and never have to worry about again, but getting an LLM to “remember” them would require a tedious list of dos and don’ts in the LLM context file, along with infinitely subtle variations as it finds new and creative ways to write code that matches the token probabilities instead of code that matches your internal culture.
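To make that concrete, here is a rough and entirely hypothetical sketch of the kind of context file I mean - the rules and paths are invented for illustration, not taken from any real project:

```
# Project conventions (excerpt)
- Don't introduce new abstractions under services/billing/; keep that code flat.
- Follow the readability style guide: early returns, no nested ternaries.
- Never add a dependency without calling it out in the PR description.
- Tests live next to the code they cover, named *_test.py.
- Don't reformat files you didn't otherwise change.
```

Every one of these is the kind of thing you mention to a new teammate once; with an LLM, each one has to be written down and then defended against creative reinterpretation.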

For me, one of the bright spots of the latest round of models is that in a sufficiently tested codebase, they can do a decent job of writing tests around functionality - I often find this helpful for counter-factual tests, when I’m trying to fix something and I need a test that will fail once I fix it. I can then manually update the test to verify that I changed what I expected to change. LLM-written tests are often extremely pedantic and corner-case oriented, which is very helpful when changing large systems. The long-term value of these tests is certainly up for debate, but at least for the exercise of “I know it doesn’t work now, but how do I know I didn’t regress with a fix”, LLMs are looking like a powerful tool for existence proofs today.
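A minimal sketch of what I mean by a counter-factual test, using a made-up function and bug rather than anything from a real codebase:

```python
# Hypothetical bug: normalize_email lowercases the whole address, but the
# planned fix will only lowercase the domain part.
def normalize_email(address: str) -> str:
    return address.lower()


def test_pins_current_lowercasing_behavior():
    # Counter-factual test: it deliberately asserts today's buggy output.
    # It passes now, fails the moment the fix lands, and then gets updated
    # to assert the corrected behavior - proof that the fix changed exactly
    # what I expected it to change.
    assert normalize_email("John.Doe@Example.COM") == "john.doe@example.com"
```

Run under pytest, this passes against the buggy implementation and fails against the fixed one, which is exactly the signal I’m after before updating the assertion.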

However, I will caveat that even with very high context, I have had little success asking agentic LLMs to fix bugs or write tests for unknown states. That is probably about what you should expect: the model is just predicting what the code would look like once the bug was fixed, and unlike a human, the “context” it has carries absolutely no sense of the most important parts of your application’s UX - it’s just a pedantic list of commands and explanations.

I think this last note is very helpful for setting expectations about when LLMs can be useful in a coding context, and it once again highlights the terminological damage that OpenAI and others have done when it comes to helping us use these new tools. Under no circumstances should you think of these tools as “thinking” or as “helpers”, but rather as “very accurate predictors” - if you give them a really nicely shaped problem and have them predict what fits inside, they’re incredibly powerful, and they will certainly have a meaningful impact on my life. But the moment I ask one to do something that isn’t obvious and constrained, it (predictably) generates reams of trash.