Saturday, December 30, 2017

Immutability

All too often software development practices are driven by fads.

What starts out as a good idea gets watered down to achieve popularity and then is touted as a cure-all or silver bullet. It may initially sound appealing, but eventually, enough evidence accrues that it is, as expected, far from perfect. At that point the negatives can overwhelm the positives, resulting in a predictable backlash, so everyone rushes off to the next big thing. We’ve been through this cycle many, many times over the last few decades.

That is unfortunate, in that technologies, practices, idioms, and ideas are always a mixture of both good and bad. Rather than continually searching for the next new, easy, thoughtless, perfect approach, we should accept that both sides of the coin exist and focus on enhancing our work. That is, no approach is completely good or completely bad, just applicable or not to a given situation. New isn’t any better than old. Old isn’t intrinsically worse. Technologies and ideas should be judged on their own merits, not by what is considered popular.

It’s not about right or wrong, it is about finding a delicate balance that improves the underlying quality at a reasonable cost. Because in the end, it's what we’ve built that really matters, not the technologies or processes we used.

Sometimes, a technique like immutability is really strong and will greatly enhance the code, making it easier and more likely to be correct. But it needs to be used in moderation. It will fix some problems, but it can easily make others far worse.

Recently, I’ve run into a bunch of situations where immutability was abused, but still, the coders were insisting that it was helping. Once people are committed, it isn’t unusual for them to turn a blind eye to the flaws; to stick to a bad path, even in the face of contradictory evidence.

The idea of immutability became popular through the family of functional programming languages. There are good reasons for this. In general, spaghetti code and globals cause unknown and unintentional side-effects that are brutal to track down. Bug fixing becomes long, arduous and dangerous. The core improvement is that if there are no side-effects and the scope is tightly controlled, it is far easier to reason about the behavior of subparts of the code. Thus, we can predict the impact of any change.

That means that the underlying code can no longer be allowed to modify any of its incoming data. It's not just about global variables; it also covers any internal data and the inputs to functions. To enforce immutability, once some data has been created, anywhere, it becomes untouchable. It can’t be changed. If the data is wrong later, then there should be only one place in the system where it was constructed, thus only one place in the system where it should be fixed. If a whole lot of data has that attribute, then when it is buggy you can reason quickly about how to repair it, with a great deal of confidence that the fix won’t cascade into other bugs.
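
As a minimal sketch of how that is enforced in practice (a hypothetical Java value object, not taken from any of the projects discussed below): every field is set exactly once in the constructor, and any “change” produces a brand-new instance.

```java
// A minimal immutable value object (hypothetical example).
public final class Price {
    private final String symbol; // set once in the constructor...
    private final long cents;    // ...and never modified afterwards

    public Price(String symbol, long cents) {
        this.symbol = symbol;
        this.cents = cents;
    }

    // "Changing" the data means constructing a new instance, so every
    // Price ever handed out remains exactly as it was created.
    public Price withCents(long newCents) {
        return new Price(symbol, newCents);
    }

    public String symbol() { return symbol; }
    public long cents() { return cents; }
}
```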

While that, for specific types of code, is really great and really strong, it is not, as some believe, a general attribute. The most obvious flaw is that there could be dozens of scattered locations where the immutable data is possibly created, diminishing the scoping benefit. You might still have to spend a lot of time tracking down where a specific instance was created incorrectly. You’ve protected the downstream effects but still have an upstream problem.

So, immutability does not magically confer better programmatic qualities; it can even make things worse. There are rather obvious places in a system where it makes the code slower, more resource-intensive, more complex and far less readable. There is a much darker side to immutability as well.

Performance is the obvious first issue.

It was quite notable in Java that its immutable strings were severely hurting its performance. Code often has to do considerable string fiddling and quite often this is distributed in many locations throughout the program as the final string gets built up by many different computations.

Minimizing this fiddling is a great way to get a micro-boost in speed, but no matter how clever the code, all of that fiddling never really goes away. Quite obviously, if strings are immutable, then each fiddle temporarily doubles the string, costing both memory and the CPU time spent copying and cleaning up.

Thus, pretty quickly, the Java language descended into having special string buffers that were mutable (StringBuffer, and later StringBuilder), and eventually a compiler optimization that quietly substitutes them for string concatenation. The takeaway is that making data immutable when it needs to change frequently is extraordinarily costly. For data that doesn’t change often, it is fine. For data that changes a lot, it is a bad idea if there are any time or resource constraints.
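
A rough sketch of the difference (a hypothetical snippet; the actual costs depend on the JVM): concatenating immutable Strings in a loop copies everything accumulated so far on every pass, while a mutable StringBuilder appends in place.

```java
public class Concat {
    public static void main(String[] args) {
        // Stand-in for fragments produced by many different computations.
        String[] parts = {"alpha", "beta", "gamma", "delta"};

        // Immutable strings: each += allocates a brand-new String and copies
        // the whole accumulated prefix, roughly O(n^2) work for n parts.
        String slow = "";
        for (String p : parts) {
            slow += p;
        }

        // Mutable buffer: appends in place, with one copy at the very end.
        StringBuilder buf = new StringBuilder();
        for (String p : parts) {
            buf.append(p);
        }
        String fast = buf.toString();

        System.out.println(slow.equals(fast)); // true: same result, less work
    }
}
```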

Performance isn’t nearly as critical as it used to be, but it still matters enough that resources shouldn’t just be pissed away. Most systems, most of the time, could benefit from better performance.

In a similar manner to the Java strings, there was a project that decided that all of the incoming read-only data it was collecting should be stored immutably in the database. It was not event data, just fairly simple static data that could change externally.

The belief was that this should provide both safety and a ‘history’ of any incoming changes, so basically they believed it was a means of making the ETL easier (only using inserts) and getting better auditing (keeping every change).

In general, strong decompositions of an underlying problem make its behavior easier and safer to reason about and tend toward fewer side-effects, while intermixing two distinctly different problems causes unintentional complexity and makes it harder to work through the corner cases. Decomposition is about separating the pieces, not combining them. We’ve known that for a long time, in that a tangle of badly nested if statements is usually described as spaghetti code: hard to reason about and expensive to fix. The pieces need to be separated to be usable.

So it’s not surprising that this persistent, immutable collection of data required overly complex gymnastics in the query language just to be read. Each select had to pick through a much larger set of different but essentially redundant records, every time the data was accessed. The extra storage is huge (since there are extra management costs like secondary indices), the performance is awful, and the semantics of querying the data are convoluted and hopelessly unreadable. They could have added another stage to copy the data into a second schema, but besides being resource-intensive and duplicating a lot of data, that would also add lag time to the processing. Hardly better.

If the data were nicely arranged in at least a third-normal-form schema and the data flow were idempotent (correctly using inserts or updates as needed), then simple logging or an audit table would have easily solved the secondary tracking problem, with the extra benefit that the history could be kept for a far shorter timeframe than the rest of the data. That easily meets all of the constraints and is a far more manageable approach to solving both problems. Immutability, in this case, goes against the grain of far older and more elegant means of handling the requirements.
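
A sketch of what that idempotent flow might look like (hypothetical table and column names, and a PostgreSQL-style upsert; other databases spell this differently): one current row per key, with changes recorded in a separate, prunable audit table.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Hypothetical loader: the main table keeps exactly one current row per
// key, while every change is also appended to a small audit table that
// can be pruned on a much shorter retention schedule.
public class Loader {
    void load(Connection db, String key, String value) throws SQLException {
        // Idempotent insert-or-update on the current data.
        try (PreparedStatement ps = db.prepareStatement(
                "INSERT INTO asset (asset_key, val) VALUES (?, ?) " +
                "ON CONFLICT (asset_key) DO UPDATE SET val = EXCLUDED.val")) {
            ps.setString(1, key);
            ps.setString(2, value);
            ps.executeUpdate();
        }
        // Separate audit trail: the history problem stays out of the main
        // schema, so ordinary queries never have to pick through it.
        try (PreparedStatement ps = db.prepareStatement(
                "INSERT INTO asset_audit (asset_key, val, changed_at) " +
                "VALUES (?, ?, CURRENT_TIMESTAMP)")) {
            ps.setString(1, key);
            ps.setString(2, value);
            ps.executeUpdate();
        }
    }
}
```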

Since immutability is often sold as a general safety mechanism, it is not uncommon for people to leverage that into misunderstandings as well. For instance, one example involved a complex immutable data structure that spanned two completely different contexts: memory and persistence. The programmers assumed that the benefits of immutability in one context automatically extended to the other.

A small subset of the immutable structure was held in memory, with the full version persisted. That, on its own, is intrinsically safe. You can build up new structures on top of it in memory and then persist them later when completed.

The flaw is to assume, however, that any and all ‘operations’ on this arrangement preserve that safety just because the data is immutable. If you copy the immutable structure in memory but do not mirror that copy within the other context, then the new structure being immutable does not mean it will stay in sync and not get corrupted. There aren’t two things to synchronize anymore; now there are three, and that is the problem.

Even if the in-memory copy is never touched, the copy operation introduces a disconnect between the three instances that amounts to the data equivalent of a race condition. That is, the copies are not independent of the original or the persisted version; mutable or immutable, that ‘dependence’ dominates every other attribute. So a delete on the persistent side is not automatically reflected in the copy, nor is a change to the original. It is very easy to throw the three instances out of whack; the individual immutability of each is irrelevant. To preserve its safety, the immutability would have to span all instances of the data at all structural levels, but that is far too resource-intensive to be viable. Thus partial immutability causes far more problems, and is harder to reason about, than just making the whole construct mutable.
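
A small sketch of that drift (a hypothetical example; the lists stand in for the persisted version, the in-memory original, and the extra copy): each instance is individually immutable, yet nothing keeps the logical data in agreement.

```java
import java.util.List;

public class Drift {
    public static void main(String[] args) {
        // Three logically-identical immutable instances: the persisted
        // version, the in-memory working structure, and an extra copy.
        List<String> persisted = List.of("a", "b", "c");
        List<String> original  = List.of("a", "b", "c");
        List<String> copy      = List.of("a", "b", "c");

        // Later, "b" is deleted on the persistent side. Since the
        // instances are immutable, the delete shows up as a replacement.
        persisted = List.of("a", "c");

        // Each instance is still perfectly immutable, but the logical data
        // now disagrees: nothing propagated the delete to the other two.
        System.out.println(persisted.equals(original)); // false
        System.out.println(persisted.equals(copy));     // false
    }
}
```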

Immutability does, however, have some very strong and important usages. Quite obviously, for a functional program in the appropriate language, being able to reason correctly about its behavior is huge. To get any sort of programming quality, at bare minimum, you have to be able to quickly correct the code after it's initially written, and we’ve yet to find an automated means to ensure that code is bug-free with respect to its usage in the real world (it can still be proven mathematically correct, though that is a bit different).

In object-oriented languages, using a design pattern like flyweight for a fixed, finite set of data can be a huge performance boost with a fairly small increase in complexity. Immutable objects also help with concurrency and locking. As well, for very strict programming problems, compile-time errors on modifying immutable static data are great, and runtime exceptions on touching immutable data help a lot in debugging.
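
For example, a minimal flyweight sketch (hypothetical names): a small, fixed set of immutable objects is created at most once and shared everywhere, which is safe across threads precisely because the instances never change.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical flyweight: one shared, immutable instance per currency
// code, instead of a fresh allocation for every record that uses one.
public final class CurrencyCode {
    private static final Map<String, CurrencyCode> POOL = new ConcurrentHashMap<>();

    private final String code;

    private CurrencyCode(String code) { this.code = code; }

    // Returns the shared instance for a code, creating it at most once.
    // Sharing is safe across threads because instances are immutable.
    public static CurrencyCode of(String code) {
        return POOL.computeIfAbsent(code, CurrencyCode::new);
    }

    public String code() { return code; }
}
```

With this arrangement, CurrencyCode.of("USD") == CurrencyCode.of("USD") holds, so the program carries exactly one instance per code no matter how many records reference it.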

Immutability has its place in our toolboxes, but it just needs to be understood that it is not the first approach you should choose and that it never comes for free. It’s an advanced approach that should be fully understood before utilizing it.

We really need to stop searching for silver bullets. Programming requires knowledge and thought, so any promise of what are essentially thoughtless approaches to quickly encode an answer to a sub-problem is almost always dangerous.

In general, if a programmer doesn’t have long running experience or extensively deep knowledge then they should err on the side of caution (which will obviously be less efficient) with a focus on consistency.

Half-understanding a deep and complex approach will always result in a mess. You need to seek out deep knowledge and/or experience first; decades of experience have accumulated, so trying to reinvent it on your own in a short time frame is futile, and so is trying to absorb it from a casual blog post on a popular trick. The best thing to do is to ensure that the non-routine parts of systems are staffed with experience, and that programmers wishing to push the envelope of their skills seek out knowledge and mentors first, not just flail at it hoping to rediscover what is already known.

Routine programming itself is fairly easy and straightforward, but not if you approach it as rocket science. It is really easy to make it harder than it needs to be.
