Thursday, April 18, 2024

Optimizations

“Premature optimization is the root of all evil” -- Donald Knuth

Code generally implements a series of steps for the computer to follow. I am using a slightly broader definition than just an ‘algorithm’ or ‘heuristic’, which are usually defined as mappings between input and output. It is widened to include any sort of code that interacts with one or more endpoints.

We’ll talk about three general versions of this code. The first does the steps in an obvious way. The second does the steps but adds unnecessary extra ones as well. And the third does the steps in a non-intuitive way that is faster. We can call these normal, excessive, and optimized.

Most times when you see people “optimize code” they are actually just taking excessive code and replacing it with normal code. That is, they are not optimizing it, really they just aren’t doing the useless work anymore.

If you take excessive code and fix it, you are not doing premature optimization, you’re just coding it properly. The excessive version was a mistake; it was wasting resources unnecessarily. Not doing that anymore is not really optimization.

If you have good coding habits, for the most part, you will write normal code most of the time. But it takes a lot of practice to master. And it comes from changing how you see the code and how you construct it.

Sometimes normal code is not fast enough. You will need to optimize it. Most serious optimizations come from dropping down a complexity class. That is, you start with O(n^2) and bring it down to O(n log n) or O(n). Sorting, for example, starts at O(n^2) with the obvious approaches and gets down to O(n log n). All of these types of optimizations involve visualizing the code from a very non-intuitive viewpoint and using that view to leverage some information that circumvents the normal, intuitive route. These are the hardcore optimizations. The ones that we are warned not to try right away.
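A tiny illustration of leveraging hidden information, sketched in Python (the functions here are invented for the example): if the data happens to be sorted, a membership check can exploit that ordering and drop from O(n) to O(log n). The binary search route is the non-intuitive one; it works only because the ordering gives it information the linear scan ignores.

```python
import bisect

def contains_linear(items, target):
    # The obvious route: inspect every element, O(n).
    for x in items:
        if x == target:
            return True
    return False

def contains_sorted(sorted_items, target):
    # The non-intuitive route: exploit the ordering, O(log n).
    i = bisect.bisect_left(sorted_items, target)
    return i < len(sorted_items) and sorted_items[i] == target

data = list(range(0, 1000, 2))  # already sorted
assert contains_linear(data, 998) == contains_sorted(data, 998) == True
assert contains_linear(data, 999) == contains_sorted(data, 999) == False
```

Both return the same answers; only the route differs, and only one of them survives the data getting huge.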

It is easy while trying to optimize code to break it instead. It is also easy to make it a whole lot slower. Some optimizations, like adding in caching, seem deceptively easy, but doing them incorrectly causes all sorts of unwelcome bugs.

Making tradeoffs like space-for-time are sort of optimizations. They may appear to alter the performance, but that can be misleading. My favorite example is matching elements between two sets. The obvious way to code it is with two for loops: you take each member of the first set and compare it to each member of the second. But you can swap space for time. In that case, you pass through the first set and hash it, then pass through the second set and check whether each member is in the hash. If the respective sizes are m and n, the obvious algorithm is O(n*m) where the hashed version is O(n+m). The extra hash table shifts the operation from being multiplicative to additive. But if you scale that up to large enough data, the management of the hash table and its memory could eat back most of those gains. It’s also worth noting that the gain is bounded: set m to be another n and you’ve gone from O(n^2) to O(n), quadratic down to linear, but no further.
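The two versions above can be sketched directly (a minimal Python illustration, using lists so the loops stay explicit):

```python
def common_obvious(first, second):
    # Two for loops: compare every member of the first collection to
    # every member of the second, O(n*m).
    found = []
    for a in first:
        for b in second:
            if a == b:
                found.append(a)
    return found

def common_hashed(first, second):
    # Swap space for time: one pass to build the hash (a Python set),
    # one pass to probe it, O(n+m) plus the cost of the table itself.
    lookup = set(first)
    return [b for b in second if b in lookup]

first = [1, 3, 5, 7, 9]
second = [2, 3, 5, 8]
assert sorted(common_obvious(first, second)) == [3, 5]
assert sorted(common_hashed(first, second)) == [3, 5]
```

Same answers, different shapes of work: the first multiplies the two sizes, the second adds them and pays for the table.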

The real takeaway though is to learn to code just the instructions that are necessary to complete the work. You so often see code that is doing all sorts of unnecessary stuff, mostly because the author does not know how to structure it better or understand what happens underneath. You also see code that does and undoes various fiddling over and over again as the data moves through the system. Deciding on a canonical representation and diligently sticking to that can avoid a lot of that waste.

Debloating code is not optimizing it. Sure it makes it run faster and with fewer resources, but it is simply removing what should have not been there in the first place. We need to teach coding in a better way so that programmers learn how to write stuff correctly the first time. Premature optimizations, though, are still the root of all evil. You need to get your code working first before you start messing with logarithmic reductions. They can be a bit mind-bending at times.

Thursday, April 11, 2024

Scope

One of the keys to getting good quality out of software development is to control the scope of each line of code carefully.

This connection isn’t particularly intuitive, but it is strong and useful.

We can loosely define the scope of any piece of code as the percentage of other lines of code in the system that ‘might’ be affected by a change to it.

In the simplest case, if you comment out the initialization of the connection to a database, all other lines of code that do things with that database will no longer work correctly. They will error out. So, the scope of the initialization is that large chunk of code that relies on or messes with the data in the database, and any code that depends on that code. For most systems this is a huge amount of code.

Way back, in the very early days, people realized that global variables were bad. Once you declare a variable as global, any other line of code can access it, so the scope is effectively 100%. If you are debugging, and the global variable changes unexpectedly, you have to go through every other line of code that could possibly have changed it at the wrong time to fully assess and understand the bug. In a sizable program that would be a crazy amount of time. So, we came to the conclusion long ago that globals, while convenient, were also really bad. And that is a pure scope issue. We also figured out that it was true for flow-of-control, like goto statements. As it is true for function calls too, we can pretty safely assume it is true in one way or another for all code and data in the system.

Lots of paradigms center around reducing the scope in the code. You encapsulate variables in Object Oriented programming; you make them immutable in Functional Programming. These are both ways of tightening down the scope. All the modifiers like public and private do that too. So do the mechanisms for including code from other files, and any sort of package or module name. Things like interfaces also put restrictions on what can be called when. The most significant scope reduction comes from strongly typed languages, as they will not let you do the wrong thing on the wrong data type at the wrong time.
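A minimal sketch of the difference in Python (the names are invented for the example): the global version exposes its state to every line in the program, so every line is a suspect when it changes unexpectedly; the local version confines the state to one function, so a change there can only reach callers through the return value.

```python
# Wide scope: any line in the program can read or write this.
total = 0

def add_global(x):
    global total
    total += x        # side effect visible everywhere

# Tight scope: the accumulator lives and dies inside the function.
def add_local(values):
    total = 0         # shadows nothing outside this function
    for x in values:
        total += x
    return total      # the only way the result escapes

assert add_local([1, 2, 3]) == 6
```

Debugging the first requires reasoning about the whole program; debugging the second requires reasoning about one function.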

So, we’ve known for a long time that reducing the scope of as much code as much as you can is very important, but why?

Oddly it has nothing to do with the initial coding. Reducing scope while coding makes coding more complicated. You have to think carefully about the reduction and remember a lot of other little related details. It will slow down the coding. It is a pain. It is friction. But doing it properly is always worth it.

The reason we want to do this is debugging and bug fixes.

If you have spent the time to tighten down the scope, and there is a bug in and around that line of code, then when you change it, you can figure out exactly what effect the change will have on the other lines of code.

Going back to the global example, if the variable is local and scoped tightly to a loop, then the only code that can be affected by a change is within the loop itself. It may change the final results of the loop computations, but if you are fixing it, that is probably desirable.

If inside of the loop you referenced a global, in a multi-threaded environment, you will never really know what your change did, what other side effects happened, and whether or not you have really fixed the bug or just got lost while trying to fix it. The bug could be what you see in the code, or it could be elsewhere; the behavior is not deterministic. Unlimited scope is a bad thing.

A well-scoped program means that you can be very sure of the impact that any code change you make is going to have. Certainty is a huge plus while coding, particularly in a high-stress environment.

There is a bug; it needs to be fixed correctly, right away. Making a bunch of failed attempts to fix it will only diminish the trust people around you have in your ability to get it all working. Lack of trust tends to both make the environment more stressful and force people to discount what you are saying. It is pretty awful.

There were various movements in the past that said if you did “X” you would no longer get any bugs. I won’t go into specifics, but any technique to help reduce bugs is good, but no technique will ever get rid of all bugs. It is impossible. They will always occur, we are human after all, and we will always have to deal with them.

Testing part of a big program is not the same as fully testing the entire program, and fully testing an entire program is always so much work that it is extremely rare that we even attempt to do it. In an ancient post, I said that testing was like playing a game of Battleship with a limited set of pegs: if you use them wisely, more of the bugs will be gone, but some will always remain.

This means that for every system, with all its lines of code, there will come a day when there is at least one serious bug that escaped and is now causing big problems. Always.

When you tighten the scope, while you have spent longer in coding, you will get absolutely massive reductions in the impacts of these bugs coming to light. The bug will pop up, you will be able to look at your readable code and get an idea of why it occurred, then formulate a change for which you are absolutely certain of the total impact. You make the change, push it out, and everything goes according to plan.

But that is if and only if you tightened the scope properly. If you didn’t then any sort of change you make is entirely relying on blind luck, which as you will find, tends to fail just when you need it the most.

Cutting down on the chaos of bug fixing has a longer-term effect. If some bugs made it to production, and the handling of them was a mess, then it eats away at any time needed to continue development. This forces the programmers to take shortcuts, and these shortcuts tend to go bad and cause more bugs.

Before you know it, the code is a huge scrambled mess, everybody is angry, and the bugs just keep coming, only faster now. Getting caught in this cycle will pull the quality down into the mud like hyper-gravity. Each slip-up in handling the issues eats more and more time and causes more stress, which fuels more shortcuts, and suddenly you are caught up in this with no easy way out.

It’s why coming out of the gate really fast with coding generally fails as a strategy for building stuff. You're trying to pound out as much code as quickly as you can, but you are ignoring issues like scope and readability to get faster. That seems to work initially, but once the code goes into QA or actual usage, the whole thing blows up rather badly in your face, and the hasty quality of the initial code leads to it degenerating further into an icky ball of mud.

The alternative is to come out really slowly. Put a lot of effort into readability and scope on the lowest, most fundamental parts of the system. Wire it really tightly. Everyone will be nervous that the project is not proceeding fast enough, but you need to ignore that. If the foundations are really good, and you’ve been careful with the coding, then as you get higher you can get a bit sloppier. Those upper-level bugs tend to have less intrinsic scope.

Having lots of code will never make a project better. Having really good code will. Getting to really good code is slow and boring, but it will mitigate a great deal of the ugliness that would have come later, so it is always worth it.

Learn to control the scope and spend time to make that a habit. Resist the panic, and just make sure that the things you coded do what they are supposed to do in any and all circumstances. If you want to save more time, do a lot of reuse, as much as you can get in. And don’t forget to keep the whole thing really readable, otherwise it is just an obfuscated mess.

Thursday, April 4, 2024

Expression

The idea is to express the instructions to the computer that you’ve crafted in a succinct but entirely verifiable way.

If the expression is huge, the size itself will cripple your ability to verify that the instructions are correct.

If the expression is shrunk with cryptic syntax, maybe when you write it you will remember how it works, but as time goes by that knowledge fades and it will cripple your ability to verify that it is correct.

If the expression is fragmented all over the place, the lack of locality will cripple your ability to verify that it is correct.

Spaghetti code or scrambled structure is the same. Same with globals, bad names, poor formatting, etc. You can’t just look at it and mostly know that it will do what it needs to do. Obviously, this type of visual verification saves a huge amount of time debugging but it also tends to prevent a lot of mistakes in the first place.
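A small illustration of the syntax point (all names here are hypothetical): the same instructions written two ways, one shrunk cryptic and one that can be visually verified at a glance.

```python
# Shrunk with cryptic syntax: correct today, opaque in six months.
def f(d):
    return {k: v for k, v in d.items() if v and k[0] != "_"}

# The same instructions, expressed so they can be verified by reading.
def visible_settings(settings):
    visible = {}
    for name, value in settings.items():
        is_internal = name.startswith("_")
        if value and not is_internal:
            visible[name] = value
    return visible

d = {"_debug": True, "theme": "dark", "sound": ""}
assert f(d) == visible_settings(d) == {"theme": "dark"}
```

Both behave identically; only the second one lets you just look at it and mostly know that it will do what it needs to do.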

Hiding the way things work is usually not a problem with a small amount of code. It's the small size that makes it absorbable, so you can verify it. But as the size grows, little mistakes have much larger consequences. A badly written medium-sized system is tricky to debug, but for large and huge systems it verges on impossible. Small mistakes in code organization can eat through big chunks of time. Splatter coding techniques may seem fun, but they are a guaranteed recipe for poor quality.

Getting the right degree of readability in code takes a careful balancing of all aspects of expression: naming, logic, structure, and a lot of other issues. If you see good code, it isn’t always obvious how much work went into rearranging it to make it simple and clear, but it certainly wasn’t just chucked out in a few minutes. The authors spent quite a bit of effort on clarity and readability. They paid close attention to the details; in that way, it is not dissimilar to writing big articles or books. Careful editing is very important.

The quality of code is closely related to the diligence and care applied by the author. You think clearly about how to really solve the problem, then you code as cleanly as you can for each of the moving parts, and then you relentlessly edit it over and over again until it is ready to go. That is the recipe for good code.

Thursday, March 28, 2024

Over Complicated

I’ve seen many, many variations of programmers reacting to what they believe is over-complexity.

A common one is if they are working on a massive system, with a tonne of rules and administration. They feel like a little cog. They can’t do what they want, the way they want to do it. If they went rogue their work would harm the others.

Having a little place in a large project isn’t always fun. So people rail about complexity, but they mean the whole overall complexity of the project, not just specific parts of the code. That is, the standards, conventions, and processes are complex. Sometimes they single out little pieces, but usually it's really the whole thing that is bugging them.

The key problem here isn’t complexity. It is that a lot of people working together need serious coordination. If it's a single-person project or even a team of three, then sure the standards can be dynamic. And inconsistencies, while annoying, aren’t often fatal in small codebases. But when it’s hundreds of people who all have to be in sync, that takes effort. Complexity. It’s overhead, but absolutely necessary. Even a small deviation from the right path costs a lot of time and money. Coding for one-person throw-away projects is way different than coding for huge multi-team efforts. It’s a rather wide spectrum.

I’ve also seen programmers upset by layering. When some programmers read code, they really want to see everything all the way down to the lowest level. They find that reading code that has lots of underlying function calls annoys them, I guess because they feel they have to read all of those functions first. The irony is that most code interacts with frameworks or calls lots of libraries, so it is all heavily layered these days one way or the other.

Good layering picks primitives and self-descriptive names so that you don’t have to look underneath. That it is hiding code, i.e. encapsulating complexity, is actually its strength. When you read higher-level code, you can just trust that the functions do what they say they do. If they are used all over the system, then the reuse means they are even more reliable.
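A tiny sketch of what good layering looks like (the functions here are invented for the illustration): the top layer reads as a sentence, and the self-descriptive names mean you don’t have to look underneath to trust it.

```python
# Small, well-named primitives that can be trusted without reading them
# every time (all names hypothetical).
def strip_whitespace(fields):
    return [f.strip() for f in fields]

def drop_empty(fields):
    return [f for f in fields if f]

def parse_line(line):
    # The top layer tells the whole story: split, clean, keep what's real.
    return drop_empty(strip_whitespace(line.split(",")))

assert parse_line(" a , , b ,") == ["a", "b"]
```

The inlined alternative, one big loop doing all three jobs at once, would force every reader to re-derive that story from scratch.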

But still, you’ll have a pretty nicely layered piece of work and there will always be somebody that complains that it is too complicated. Too many functions; too many layers. They want to mix everything together into a giant, mostly unreadable, mega-function that is optimized for single-stepping with a debugger. Write once, read never. Then they might code super fast but only because they keep writing the same code over and over again. Not really mastery, just speed.

I’ve seen a lot of programmers choke on the enormous complexity of the problem domain itself. I guess they are intimidated enough by learning all of the technical parts that they really don’t want to understand how the system itself is being used as a solution in the domain. This leads to a noticeable lack of empathy for the users, and to stuff that is awkward. The features are there, but essentially unusable.

Sometimes they ignore reality and completely drop it out of the underlying data model. Then they throw patches everywhere on top to fake it. Sometimes they ignore the state of the art and craft crude algorithms that don’t work very well. There are lots of variations on this.

The complexity that they are upset about is the problem domain itself. It is what it is, and often for any sort of domain if you look inside of it there are all sorts of crazy historical and counter-intuitive hiccups. It is messy. But it is also reality, and any solution that doesn’t accept that will likely create more problems than it fixes. Overly simple solutions are often worse than no solution.

You sometimes see application programmers reacting to systems programming like this too. They don’t want to refactor their code to put in an appropriate write-through cache, for example; instead, they just fill up a local hash table (map, dictionary) with a lot of junk and hope for the best. Coordination, locking, and any sort of synchronization is glossed over as just too slow or too hard to understand. The very worst case is when their stuff mostly works, except for the occasional Heisenbug that never, ever gets fixed. Integrity isn’t a well-understood concept either. Sometimes the system crashes nicely, but sometimes it gets corrupted. Oops.
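A hedged sketch of the refactoring being avoided (the class and backing store are invented for illustration): a write-through cache pushes every write through both the store and the cache, under a lock, so the two can never disagree, unlike the fill-a-dict-and-hope approach.

```python
import threading

class WriteThroughCache:
    # A minimal write-through cache sketch: writes go to the source of
    # truth first, then the cache, all under one lock for coherence.
    def __init__(self, store):
        self._store = store          # the source of truth, e.g. a database
        self._cache = {}
        self._lock = threading.Lock()

    def put(self, key, value):
        with self._lock:
            self._store[key] = value  # write through to the store first
            self._cache[key] = value  # then keep the cached copy coherent

    def get(self, key):
        with self._lock:
            if key not in self._cache:
                self._cache[key] = self._store[key]
            return self._cache[key]

store = {}
cache = WriteThroughCache(store)
cache.put("user:1", "alice")
assert cache.get("user:1") == "alice"
assert store["user:1"] == "alice"   # store and cache stay in sync
```

The lock is the part the naive version glosses over; without it, concurrent readers and writers produce exactly the Heisenbugs described above.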

Pretty much any time a programmer doesn’t want to investigate or dig deeper, the reason they give is over-complexity. It’s the one-size-fits-all answer for everything, including burnout.

Sometimes over-complexity is real. Horrifically scrambled spaghetti code written by someone who was completely lost, or crazy obfuscated names written by someone who just didn’t care. A scrambled heavy architecture that goes way too far. But sometimes, the problem is that the code is far too simple to solve stuff correctly and it is just spinning off grief all over the place; it needs to get replaced with something that is actually more complicated but that better matches the real complexity of the problems.

You can usually tell the difference. If a programmer says something is over-complicated, but cannot list out any specifics about why, then it is probably a feeling, not an observation. If they understand why it is too complex, then they also understand how to remove that complexity. You would see it tangled there, caught between the other necessary stuff. So, they would be able to fix the issue and have a precise sense of the time difference between refactoring and rewriting. If they don’t have that clarity, then it is just a feeling that things might be made simpler, which is often incorrect. On the outside, everything seems simpler than on the inside. The complexity we have trouble wrangling is always that inside complexity.

Thursday, March 21, 2024

Mangled Complexity

There is something hard to do.

Some of the people involved are having trouble wrapping their heads around the problem.

They get some parts of their understanding wrong. In small, subtle ways, but still wrong.

Then they base the solution on their understanding.

Their misunderstanding causes a clump of complexity. It is not accidental; they deliberately chose to solve the problem in a specific way. It is not really artificial, as the solution itself isn’t piling on complexity; instead, it comes from a misunderstanding of the problem space, and thus, in a way, from the problem itself.

This is mangled complexity. The misunderstanding causes a hiccup, and some of the complexity on top is mangled.

Mangled complexity is extraordinarily hard to get rid of. It is usually tied to a person, their agenda, and the way they are going about performing their role. Often one person gets it wrong, then ropes in a lot of others who share the same mistake, so it starts to become institutionalized. Everybody insists that the mistake is correct, and everybody is incentivized to continue to insist that the mistake is correct.

Sometimes even when you can finally dispel the mistake, people don’t want to fix the issue as they fear it is too much effort. So, it gets locked into the bottom of all sorts of other issues.

We are building a house of cards when we choose to ignore things we find are wrong. A delay caused by unmangling complexity is a massive amount of time saved.