Minding the Gap: Thoughts on LLMs, Abstraction, and Complexity
Ruminations on the impact of Generative AI on the field of software engineering.

Opining on the demise of software development at the hands of Generative AI is a sure way to drive traffic. I understand the impulse; it does feel like something big is happening in our little corner of the world. However, I don’t think I’ve yet read an account of the phenomenon that has convinced me the writer has thought beyond the most surface-level activities of software development. Yes, LLMs can generate useful code, and yes, that is valuable even if that’s all they are ever capable of, but stopping at “can write code” is an incomplete assessment.
I’ll be honest: like many software engineers, I’m still grappling with what these changes will mean for our industry and our craft. This post is, as much as anything, my way of trying to process some important considerations I haven’t seen widely discussed. So I apologize up front if some of these ideas don’t feel fully baked or some of the consequences are left as an exercise for you to think about. In truth, I think anyone who claims with confidence to understand how AI will change the job of a software engineer is either lying to themselves or trying to sell you something.
I also want to issue a disclaimer that some of the arguments I’m about to make will sound a little like an old man yelling at clouds, but I want to stress that this is not at all my intention! Change is inevitable, especially in technology. Change isn’t what I’m worried about. What I’m worried about is that we know so little about how to make good software, and how to do it repeatedly, that there is a real risk that in the rush to adopt a fantastic productivity-boosting new tool, one which carries the potential to radically change our relationship to code, we will inadvertently lose something essential along the way.
One of the problems I see is that, after only 60 or so years of software development as a vocation, I don’t think we know what its essential parts truly are. I’ve written about how I think we have gotten this wrong in some important ways, because we still have a lot left to learn about the construction of software systems. The development of software is a process unlike nearly any other kind of creative or constructive endeavor that humans engage in, and with so little experience to support us, we would be wise not to assume we have figured it all out yet.
Having said that, there is no doubt that using tools like Cursor or Windsurf feels like having some kind of cheat code for dealing with the tedious parts of software development. “Vibe Coding” creates a level of playfulness in development that usually only exists in fleeting moments of the process, when you are deep in flow. We should pay attention to that feeling! Why does using these tools feel so different? Much of what I have read, and my own experience has been similar, describes the feeling of moving at previously unachievable speed, with a magical connection between the speed of your thoughts and the speed of the output. You can go from basic idea to running system in minutes, when it may have taken you hours-to-days, or even longer, before. That is a remarkable gain, and I don’t think it should be so quickly dismissed as “fancy autocomplete”.
As an aside, I also think it is wrong to denigrate those who are coming into this career path now and making heavy use of GenAI tooling to enhance their existing workflows. For one thing, using AI to accelerate some tasks, like sophisticated autocomplete, is quite reasonable and a categorically different use case than letting AI design whole systems from a few prompts. Second, we would do well to remember that in the 2000s, the Java ecosystem aggressively invested in autocomplete tooling for IDEs. This inspired many old-time C and C++ engineers to make fun of the Java community, because using autocomplete was seen as a gimmick or a crutch that kept you from being a “real programmer”. Some things never change, sigh.
Back to the experience: the first time I watched a demo of Cursor I was really impressed, but that feeling quickly gave way to a question: why do we need to generate so much code to solve what seem like fairly basic problems?
This is not exactly a novel observation. For nearly as long as high-level programming languages have existed, software engineers have complained about the amount of boilerplate in their code or redundant levels of abstraction that slow them down. Still, there was something about watching several thousand lines of Java, config, CSS, and SQL being generated from a four-line prompt that made the scale of complexity somehow more visceral. The closest analog I can think of is that it felt like joining a legacy project and getting exposed to the code base for the first time. It’s the feeling of being overwhelmed in such a way that you just have to accept you won’t be able to quickly understand how any of it works - except this code base had existed for less than 10 minutes.
Feeling adrift in a code base that is less than an hour old is a new kind of feeling.
By contrast, when you write each line by hand, you can’t help but develop a deep, intuitive feeling for the entire codebase. It’s the kind of familiarity I have with the rooms in my house. I don’t have to think about where the couch is in the living room because I know exactly where it is; I can dodge it even when all the lights are out. Likewise, in a codebase I wrote the old-fashioned way, the login code is just something I know.
To be clear, I am not making a moral judgement here. Hand-planing every two-by-four when framing a house will connect you deeply to the project, but it will also take you orders of magnitude longer, and it is unlikely to dramatically improve the quality of the build compared to purchasing cut lumber. It’s possible that the future will require us to become comfortable with less intellectual control over the code itself. It remains to be seen just how much we depend on that intellectual control without realizing it.
Unease aside, what have I learned? Well, after a short time experimenting with these tools I have two observations that feel consequential. The first is that complexity appears to be conserved; the second is that obfuscation is not abstraction.
Conservation of Complexity
First, GenAI tools do not reduce overall system complexity, but they are good at masking it. All of the poor, leaky abstractions, cryptic and bloated configuration languages, and crufty dependency management still exist. All that has changed is that we are now faster at creating it, and we don’t see most of it happen. The implication is that our systems will be able to grow larger, faster, but that we will be no better at reasoning about them, and will certainly be less competent at maintaining them. Furthermore, with machines that can churn through boilerplate, the pressure to reduce the accidental complexity in our tooling and libraries will also diminish.
Relying on existing languages, libraries, and frameworks means that GenAI code will also be prone to the same subtle bugs, security flaws, and privacy issues as human-written code, only now they will be introduced much faster, and with less oversight. Even high quality code review practices are insufficient to guarantee quality (experience at Google has shown that only intentional platform-level design can ensure quality attributes are maintained). Accelerating the growth of a system built atop shaky foundations does nothing to make it more performant, more reliable, or more maintainable. It only ensures there is “more” of it.
Of course, I expect companies to appear promising to use AI to make all of this complexity tractable by summarizing or explaining a code base, but I think hard problems still stand in our way. First, there is strikingly limited evidence that AI can accurately explain why or how it makes design decisions, and that lack of trust will be amplified when the output of AI cannot be verified by humans. Second, and more fundamentally, we lack a consistent theory to guide the development of reliable, stable, and compact abstractions that compose well across our systems. Without such abstractions, we lack a common language to describe the properties of a system in a repeatable way. Any effort to explain or summarize a moderately sized system without shared abstraction semantics will be lossy. You can test this idea for yourself: how much information can you throw out of your database schema before you lose critical semantics? Do you trust AI to determine what’s critical? I think we are going to find that complexity resists compression, and that means efforts to simplify-by-summary will lead to critical failures.
However, let’s say we press on with this experiment and we find ourselves in a situation where we are reliant on AI to write code, and reliant on it to understand the code that has been written. This is a world in which we will have a truly new relationship to our technology. At scale, it may become difficult-to-impossible for humans to understand how our technical systems work. You may say “our systems are already that large and no one understands all of it” and I would agree, but what will make AI-driven system development so much more challenging is that systems of enormous scale will appear almost overnight. Today, very large systems take months-to-years to evolve, and teams of humans are intimately involved in the construction of every line. Furthermore, we have developed 60 years’ worth of practice managing software teams so as to ensure on-going development and maintenance is possible. We aren’t always great at it, but at least we understand, in theory, how to manage technical systems using human ones. In an AI-driven context, even line-level programmers will have to get comfortable with a level of delegation, and therefore isolation from the execution of their intent, that is usually reserved for leaders of an organization of thousands of people.
Obfuscation != Abstraction
My second observation begins with a common argument I have heard in favor of using LLMs as programming tools. The argument goes something like this:
Using an LLM is just raising the abstraction level of the input language. The code we are used to writing will eventually be abstracted away because we will be able to talk to the machine conversationally and it will produce the system to our specification. This is not that different from the way high-level languages abstracted away assembly or machine code.
I think this gets the impact of LLMs as programming tools wrong. LLMs do not raise the level of abstraction, at least not in any conventional way. At best, LLMs obscure complexity, but that is not the same as building an abstraction.
Before I explain how this analysis gets it wrong, it’s worth asking why abstraction matters in the first place. Abstraction, in the computer science sense, is the practice of not just hiding complexity, but insulating the world from ever knowing it exists. Good abstractions, and this is certainly true of good programming languages, allow you to operate exclusively in terms of the provided interface. The value of a good abstraction is that it allows you to think about, and operate on, more sophisticated objects without ever having to think about how they work - including when they break! Abstractions are essential to large system development because they allow a human mind to develop a mental model of system behavior without having to simulate the execution of thousands of lines of code.
It’s worth noting that while abstraction is essential to software development, we (the industry) are not as good as we need to be at teaching people how to do it well, in part because I don’t think anyone has figured out how to do it reliably.
When one sets out to raise the level of abstraction in a system or a programming language, it usually involves creating a more expressive “language” (forgive the term overloading) which allows a user of the abstraction to more compactly and directly express intent. For example, using a high-level programming language like Ruby or Python allows a software engineer to express software concepts in a manner that is much more comfortable for humans accustomed to using natural language. Specifically, it allows us to use subject, verb, object constructions that are a very human way of describing how two “things” relate to each other; for example, authService.login(user) is clear enough that even a non-programmer can explain the action being performed. The same program written in assembly language, which operates at a much lower abstraction level, not only requires many more instructions to achieve the same goal, but the instructions are far more limited, and it is much more tedious to express and understand the higher level relationships being described.
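To make the contrast concrete, here is a minimal sketch in Python (the AuthService and its internals are invented for illustration, not taken from any real system). The point is simply that the caller expresses intent in a single readable line, while the mechanics stay hidden behind the interface:

```python
# A hypothetical, minimal illustration of operating at two abstraction levels.
# The caller's line reads almost like a sentence: "the auth service logs in the user."

class User:
    def __init__(self, name: str, password: str):
        self.name = name
        self.password = password

class AuthService:
    def __init__(self, password_store: dict):
        # (A real system would hash passwords; this sketch is only about abstraction.)
        self.password_store = password_store
        self.sessions = set()

    def login(self, user: User) -> bool:
        # The caller never has to see these steps: lookup, comparison, session bookkeeping.
        stored = self.password_store.get(user.name)
        if stored is None or stored != user.password:
            return False
        self.sessions.add(user.name)
        return True

auth_service = AuthService(password_store={"ada": "hunter2"})
user = User("ada", "hunter2")

# High-level intent, expressed as subject, verb, object:
assert auth_service.login(user)
```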
Modern programming languages strike a really useful balance between precision and expressiveness that makes it possible to describe sophisticated intent succinctly, and in a human readable way. I don’t believe programming languages are the only, or even best way to express technical intent, but we have a lot of practice using them, and whatever replaces them should be better in most dimensions for us to consider adoption.
To that end, I’ve heard the argument made that with a pure natural language interface we have achieved the ultimate high-level language, and that in so doing, the only limitation to what we can create is what we can describe in a chat window. When watching a demo of Cursor or Windsurf, that certainly seems true, but I think that argument overlooks a real challenge. The relationship between a high-level programming language and assembly code is a carefully constructed one. High-level programming languages are designed to be systematically transformed into a target representation in the lower-level language. The transformation between high-level and low-level code is a mostly deterministic mapping from a more expressive system to a less expressive one. Prompt-driven programming, on the other hand, has almost none of these properties.
It’s all just a big misunderstanding
Developing software with an LLM takes relatively simple natural language prompts and turns them into working systems. However, creating a large software system entails the specification of thousands of requirements for design, configuration, and implementation. For example, which design patterns to use; what level of concurrency to use; how to configure API calls; how to set up a test environment; how to prioritize tradeoffs between privacy and safety; how to gracefully fail - all of these considerations, and so many more require choices to be made.
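As a small illustration of how many decisions hide inside even trivial code, consider this hypothetical sketch of a single API call (the endpoint, retry counts, and timeouts are all invented); every constant and branch below is a choice that someone, or something, had to make:

```python
import time
import requests  # assumes the requests library is available

def fetch_profile(user_id: str) -> dict:
    """Fetch a user profile over HTTP, with retries (all choices below are illustrative)."""
    # Even this tiny function encodes design decisions a short prompt rarely spells out:
    #   - 3 attempts (why not 2, or 10?)
    #   - a 2-second timeout (right for this network? for this dependency?)
    #   - exponential backoff (versus fixed delay, versus jitter)
    #   - retrying every request error (should a 404 really be retried?)
    for attempt in range(3):
        try:
            resp = requests.get(
                f"https://api.example.com/users/{user_id}",  # hypothetical endpoint
                timeout=2.0,
            )
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException:
            if attempt == 2:
                raise
            time.sleep(2 ** attempt)
```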
In fact, if you attempted to use natural language to document every design decision in a medium-sized system, the resulting text would be far larger than the code used to express the same thing. That’s because natural language is not a concise higher-level abstraction for software development. Yes, a natural language like English is more complex and can express ideas that are far beyond the realm of any software system to implement, but it is exactly that breadth, and even more so the lack of precision, which make it so challenging to use English to describe the design and implementation of a software system.
Systems designed using only natural language will import all the unintended misunderstandings inherent in normal human communication, while still running on very deterministic hardware that we rely on to perform very deterministic tasks. Using a language with nearly unlimited descriptive capacity to create a precise specification for a machine with millions of custom parts seems like it is destined to be frustrating, if not dangerous. To see how this might go wrong, grab someone on your team and both of you try to describe the architecture of your most critical system using no diagrams, only natural language. How closely related are your descriptions? I won’t hold my breath, because there are an infinite number of ways to describe, emphasize, or express the purpose of millions of lines of code.
Alas, building the system may not even be half the problem. Once it’s running you will have a much more difficult time proving it performs all the behaviors you want, and none that you don’t. Sadly, LLMs have yet to offer much improvement to the state of the art in software testing. Most demos I have seen focus on rapidly generating unit tests, and while doing so will help ensure a class of issues is found, there is also a substantial risk of drowning a project in thousands of change-detector tests. In addition, as we gain speed and confidence in the development of individual services, we will begin finding more issues between services. Today we solve these kinds of problems using integration tests. However, creating and maintaining integration testing environments is a poorly understood problem at scale, involving truly hard problems like test data, load generation, and more. Even our best examples of integration testing tend to be quite slow and flaky, not the kind of thing that lends itself to a fast edit/refresh loop. A breakthrough in integration testing techniques will be required if we want LLMs to take over a meaningful level of responsibility for larger system development.
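To make “change-detector test” concrete, here is a small, invented Python example (the ReportService and its collaborators are hypothetical). The first test pins the implementation’s internal call sequence; the second pins the behavior a user would actually notice:

```python
import unittest
from unittest.mock import MagicMock

class ReportService:
    """Hypothetical service, invented for illustration: fetch rows, then render them."""
    def __init__(self, store, formatter):
        self.store = store
        self.formatter = formatter

    def build_report(self, user_id: str) -> str:
        rows = self.store.fetch_rows(user_id)
        return self.formatter.render(rows)

class ChangeDetectorStyle(unittest.TestCase):
    # Brittle: it restates the implementation call-by-call, so any refactor of the
    # internals breaks it even when the report itself is unchanged.
    def test_calls_collaborators_in_order(self):
        store, formatter = MagicMock(), MagicMock()
        ReportService(store, formatter).build_report("u1")
        store.fetch_rows.assert_called_once_with("u1")
        formatter.render.assert_called_once_with(store.fetch_rows.return_value)

class BehavioralStyle(unittest.TestCase):
    # Pins the observable outcome instead: given this data, the report contains it.
    def test_report_contains_fetched_data(self):
        class FakeStore:
            def fetch_rows(self, user_id):
                return [("2024-01-01", 42)]

        class PlainFormatter:
            def render(self, rows):
                return "\n".join(f"{day}: {value}" for day, value in rows)

        report = ReportService(FakeStore(), PlainFormatter()).build_report("u1")
        self.assertIn("2024-01-01: 42", report)

if __name__ == "__main__":
    unittest.main()
```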
LLMs, as deployed today, are not the right tool for large system construction, at least not on their own. We have some work to do on our own to build a bridge between the capabilities of LLMs and the incredible complexity we have created. Perhaps the availability of these new tools will force us to confront the rickety foundations we have built for ourselves.
How much does this matter?
Now for the hard question: should we be worried about this? Does it actually matter if LLMs aren’t raising the level of abstraction by some academic definition? After all, a lot of the complexity we face is self-imposed and toilsome to deal with. It would be really nice not to suffer from decision fatigue so often!
Perhaps unsurprisingly, I think that without some intervention there are real risks. Fundamentally, it seems bad that we may completely alter the methods of software production in a way that leaves all the complexity intact but does not improve our ability to describe or reason about the outcomes, because the primary focus is on throughput of code delivery.
If we would like to avoid creating a civilizational-scale house of cards in the form of billions of lines of AI generated code, I think there are a few ideas we should consider. I admit upfront that these ideas are probably not enough, and not yet as deeply considered as I’d like them to be, in part because we just haven’t had enough time with these new kinds of tools to know what will work. However, I do think I am beginning to understand the properties a good solution will have.
Get the complexity of our house in order
This is a bit of a longshot, but we need to consider, as an industry, whether the endless variety of solutions to slightly different problems is really creating more value, or just adding incremental noise to the cacophony of choice that awaits every software engineer trying to figure out how to build a new system. We should work collectively to try to raise the abstraction level of the frameworks, libraries, and tools we work with, rather than make N more versions of binary bundling tools. At the moment it seems unlikely that LLMs (representing roughly the internet-average opinion) are any better than we are at separating the essential from the accidental complexity of our frameworks, libraries, and platforms, and they would benefit from more expert guidance in the form of reduced optionality.
Guidance could come in the form of industry-wide standardization efforts at every layer in the stack. Specifically, I think developing new standard protocols or APIs that formalize popular framework abstractions into ratified standards would support the creation of new higher-order building blocks that don’t just obscure complexity but render it redundant. In my opinion, we have squeezed all the useful innovation out of a few areas of our stack, like frontend frameworks; adopting the best ideas as standards will allow us to build a level of reliability and conformance into LLM-generated systems. Of course, this will also require adjusting the cultural incentives we have developed for celebrating the continuous reinvention of the wheel. If done well, new libraries and frameworks will be valued for their ability to remove complexity rather than multiply it. LLMs creating systems based on these new building blocks will have a set of primitives that leave less room for misinterpretation and errors.
Failing that (and I suspect we will), I predict organizations whose CTOs, senior engineers, and other technical leaders can maintain discipline over their technical choices will have an advantage. Any force that can constrain the output space of GenAI will make managing the systems it generates more tractable. It won’t be enough just to limit technical choices like frameworks. Highly opinionated forms of governance, from API patterns to the minute details of formatting in a file, will ensure a much more predictable, and therefore understandable, output.
Develop a robust system design language
Another approach to consider is the development of techniques for system-level design - like a programming language but for higher-order system-level components. I know, I know, UML has been promising this for at least 30 years, but this may actually be the time!
I’ve seen a couple versions of this argument on the internet, and it feels fairly intuitive to me. LLMs, and GenAI generally, represent a new way to interface with a machine, so the idea of requiring a new method of input seems reasonable. I think this could be an interesting opportunity to rethink how we capture our ideas and intent. What becomes possible if we have higher-order communication mechanisms like diagrams, video, or audio? To take advantage of person-millennia of useful software practice, any such “language” would need to preserve the critical aspects of our current toolsets, like merge-ability, diff-ability, static analysis, etc. That probably rules out the more exotic input formats for now, but it’s interesting to think about what an “audio design” even means.
Regardless of the medium, in order to allow the machines to do what they do best - work fast - we need a language that focuses on outcomes rather than mechanical transformations. It should allow a developer to intuitively describe the architecture, properties, invariants and other design concepts that ensure the quality attributes we want are built in by design and require less direct human oversight. You can see glimmers of this approach in Cursor’s rule files, which work as a stacked set of prompts that are passed to every invocation of the underlying LLM in order to constrain its output. At present they are still pretty crude, but the pattern of developing a constraint language is clearly emerging.
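As a sketch of the general pattern, and emphatically not Cursor’s actual mechanism, here is roughly what “a stacked set of constraints prepended to every model invocation” might look like; the rules, the call_model function, and its interface are all assumptions made for illustration:

```python
# Illustrative sketch of a constraint stack prepended to every model call.
# The rules, the call_model() function, and its interface are hypothetical,
# not the API of any real tool.

PROJECT_RULES = [
    "All services expose gRPC APIs defined in /proto; never invent ad-hoc JSON endpoints.",
    "Use the repository's standard error type; do not introduce new exception hierarchies.",
    "Every public function gets a test in the same package, table-driven where possible.",
]

def call_model(prompt: str) -> str:
    """Placeholder for whatever LLM backend is in use."""
    raise NotImplementedError("wire up your model client here")

def constrained_generate(task: str, rules: list = PROJECT_RULES) -> str:
    # Every invocation carries the same stack of constraints, so the output space
    # shrinks from "anything the internet has seen" toward "what this codebase allows".
    preamble = "\n".join(f"- {rule}" for rule in rules)
    return call_model(f"Follow these project rules:\n{preamble}\n\nTask: {task}")
```

Primitive as this is, even such a thin layer narrows the model’s output toward the conventions a team has already agreed on.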
While it would be nice if LLMs were reliable enough to produce consistent output for a given input, I suspect we will also need to work out exactly how to test and validate that an LLM-generated system meets the definitional spec and does so without any extraneous parts. That implies we will need a testing strategy, including new techniques and frameworks, to verify that the generated system meets the input specification.
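One direction, assuming the specification can be expressed as properties, is to lean on property-based testing. This sketch uses the Hypothesis library against a stand-in for generated code; the properties, not the implementation, carry the spec:

```python
# A sketch of spec-as-properties using the Hypothesis library. The function under
# test stands in for LLM-generated code; the properties stand in for the input
# specification we want to hold no matter how the code was written.
from collections import Counter

from hypothesis import given, strategies as st

def generated_sort(items: list) -> list:
    """Stand-in for generated code; here it simply defers to sorted()."""
    return sorted(items)

@given(st.lists(st.integers()))
def test_output_is_ordered(items):
    result = generated_sort(items)
    assert all(a <= b for a, b in zip(result, result[1:]))

@given(st.lists(st.integers()))
def test_output_is_a_permutation_of_the_input(items):
    # No elements invented, none dropped: one guard against "extraneous parts".
    assert Counter(generated_sort(items)) == Counter(items)
```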
In the early days of LLM excitement I had many conversations about shifting the focus of engineers from writing feature code to writing unit and integration tests, as it seemed the most natural path to developing a language based on constraints. Unfortunately, in practice, many software engineers have been burnt out on testing, and the idea of it becoming the sole focus of the job made many I spoke with uneasy. Whether that unease will hold over time, I’m not sure, but it does suggest that a shift from “create” to “validate” represents a meaningful change in the nature of the work, and that will certainly have consequences for who wants to do the work and how we train new practitioners.
New tools to support integrated simulation, testing, and observability
In a future where system internals become more opaque, we should consider mechanisms other than direct code inspection that allow us to regain intellectual control. One possible approach would be to develop runtimes and frameworks built for deep simulation: a kind of simulation that would allow you to run one or more microservice-sized binaries on a test bench that simulates every aspect of IO, performance, data, and dependencies, everything necessary to convince each service that it is running in production. The ability to put a complete large system, including all dependencies and simulated data, into a production-like hermetic environment where aspects of performance, security, and capacity can be tweaked, replayed, and observed will give us the power to evaluate the complete behavior of a system well before it is deployed.
Imagine being able to rapidly assess the performance of your service endpoints in the presence of a 30% network bandwidth loss, or fuzz 100% of your service endpoints automatically and continuously. How would that change the way you work?
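None of this exists today, as far as I know, but as a thought experiment, the interface to such a test bench might look something like this (every class, service name, and trace file here is invented):

```python
# A purely hypothetical sketch of a deep-simulation test bench API. None of these
# classes exist; the point is the shape of the interaction, not a real framework.
from dataclasses import dataclass

@dataclass
class NetworkProfile:
    bandwidth_loss: float = 0.0   # fraction of nominal bandwidth removed
    added_latency_ms: int = 0

class SimulatedEnvironment:
    """Pretend-production: fake IO, fake dependencies, replayable traffic."""
    def __init__(self, services: list, network: NetworkProfile):
        self.services = services
        self.network = network

    def replay_traffic(self, recording: str) -> dict:
        # In a real framework this would replay captured production traffic against
        # the sandboxed services and collect latency and error metrics.
        return {"p99_latency_ms": None, "error_rate": None, "recording": recording}

# Example: how does checkout behave if the network loses 30% of its bandwidth?
env = SimulatedEnvironment(
    services=["checkout", "inventory", "payments"],      # hypothetical services
    network=NetworkProfile(bandwidth_loss=0.30, added_latency_ms=50),
)
report = env.replay_traffic("black-friday-2024.trace")   # hypothetical recording
print(report)
```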
Such an environment may also be well suited to channeling the strength of LLMs to propose exhaustive lists of interesting questions to ask. I can imagine co-developing simulation-based tests to run against a system to validate expected properties and interactively making adjustments to the code to achieve my desired outcome.
Today, this level of control is only possible in highly specialized environments or trivially small systems. Pushing it further will require rethinking a lot about the way we design our foundational runtimes. However, if it were successful, we would have a powerful tool to help us regain an understanding of system behavior in a way that wouldn’t require us to “trust” an LLM.
The Future is Uncomfortably Exciting?
At least once a week I find myself reflecting on the idea that it has barely been 60 years since programming computers became a job, and yet in that time so much has happened. Now we are marveling at how computers are getting closer to programming themselves. I’m also reminded that the success of digital technology means it sits in more corners of our lives than ever before, and the level of responsibility for getting it right has never been higher.
There is no question the world of software engineering has changed forever. And, as with any such change, only hindsight will reveal just how significant the impact will be. Yet it’s important to remember that change is amoral. We, who comprise the industry, will give this change meaning for ourselves and our society. It is regrettable that we don’t have more time to develop a better set of theories to guide us. Change rarely happens when you are prepared, and this is no different. What happens next will be up to us as we explore what’s possible, share what we’ve learned, propose new ways of operating, and work to define the future. I hope that we are able to articulate a positive vision of creation, with less complexity and more confidence in our work. In the race to apply GenAI to the development of software, I hope we don’t sacrifice the essential human role of understanding and responsibility for the world we create for ourselves.
Special thanks to Tom Manshreck, Titus Winters, Jonathan Schuster, and George Fairbanks for slogging through the initial draft and helping me refine my thinking.
I don't do too much social media, but you can find me on LinkedIn.