Sunday, October 12, 2008

The Structure of Elegance

Computer programming, for many coders, is essentially creating a series of instructions that need to be executed. The order of execution is unimportant, so long as all of the instructions have been completed. This instruction-oriented perspective generally implies that the breakup of the actual instructions themselves into methods or functions is more or less arbitrary. So long as enough instructions are executed, these programmers don't seem to mind the structure or order. It is a common way of seeing code, but it has its own inherent problems.

Over the years, I've found that the more randomized the code base, the harder it is to make it work properly. Random and brute force code both have the annoying attribute that you cannot easily tell visually whether or not the code is correct, or even close to correct. On the other hand, a well-balanced, well-structured program not only looks cleaner, but if all of the pieces are in the right place, the imperfections are obvious. Bad code stands out.

Yes, you can see the bugs caused by the inconsistencies in construction. They become obvious blights on an otherwise clean canvas. You should be able to read the code and have some idea of what it is doing, and whether or not it will work correctly.

Clearly, being able to visually detect inconsistencies in code is a critical aspect of achieving high quality. Testing is hit or miss, with never enough resources; getting it right at a lower level is far more effective.

Since we're frequently digging into the code, it is easier to correct obvious problems as they are found, rather than allow them to build up into bigger issues.

If code is just some mysterious mess until it's running in a debugger, then new code gets tossed in haphazardly, causing a toxic buildup. Relying on a debugger is a dangerous practice because you're only walking through one specific path at a time, which makes it easy and likely that the corner cases will have a significant number of bugs.

While paradigms such as object-oriented were intended to discourage programmers from creating spaghetti code, they can't actually stop it from happening. Logic spread randomly through a messy series of objects, even if they have plausible real-world sounding names, is no better than a random series of functions and procedures. There must be structure to the code, or else the code is a mess.

Nothing in code should ever be arbitrary or random. Ever.

The way to avoid these types of problems is by effectively normalizing the code. Relational database theory has a similar concept, whereby a set of rules is applied to a schema, in increasing order, to make it more and more orderly. The most orderly version, known as 4th normal form (or possibly 5th, I always get that confused), is considered to be the correct one to use in most general circumstances. Certainly, even if the schema has been denormalized for performance, most good data architects are still well aware of the equivalent fourth normal form version of their database. They know what it is, before they choose to violate it.

The process of normalization applies an increasingly strict rule-base to an existing structure, to force it into some generalized simplifications. You can't arbitrarily simplify everything:

http://theprogrammersparadox.blogspot.com/2007/12/nature-of-simple.html

but these rules of schema normalization take into account the necessary variables to bring down the schema's redundancy and overall complexity. There are no doubt trade-offs made, but they are fairer than just leaving it up to chance.

Code, too, can be modified with a simple set of rules until it is in a cleaner, more normalized version. Simplifications, particularly when there are subjective elements, are never straightforward, but within reason the purpose of applying rules to a code base is to amplify the readability without incurring a considerable expense in performance. This brings the major variables down into something more tangible.

This isn't a particularly new idea; refactoring has been around for ages, but I often think that many people aren't applying it effectively because they have no idea what the code should become. It's a series of small transformations based on localized issues, but that still leaves it all rather arbitrary. When, where and why, at a global level, do these things help the code, and when are they actually making it worse?


FROM EXPERIENCE

For this posting, which is going to be very long, I want to go through my own perspective on code: in particular, how I see its internal structure and why I think it can be normalized. It's a long, and often painful argument, but without people understanding the foundations I have no really good way of just boiling this down into nice little bits of advice. Sorry.

In one of my very early development experiences I was lucky enough to work with a medium sized code base that had been heavily edited by a lot of very determined programmers. The results were as close to elegant as I've ever been able to see in any real production code.

Everything had its place, and there was a place for everything; it all fit nicely, it was obvious and everything was right where you would expect it. If you closed your eyes and guessed where a specific set of instructions would have been placed, you'd find that they were almost always exactly where they should be. It was, in its own technical way, incredibly beautiful.

Fixing and extending a good normalized code base is a pleasant task; hacking away at some pile of muck is not. There are fewer bugs, changes are obvious and extending the original code is actually fun. Because I had a good experience early on, I was always really sensitive to the difference between a disorganized mess and something more elegant. And, more importantly, I was always aware of just how little can often differentiate the two.

The biggest problem has always been trying to explain this to other programmers. I can go on forever about attributes and properties of code, but to most people that just doesn't stick. My "obsession" with arrangement seems counter-productive to some, but only until they've learned for themselves that in getting it right, we don't have to needlessly keep pounding on the same code hoping that it will work better each time. Good code is easy to make work, poor code is not. Good code is always less work.

To some, the concept of elegance may not make sense, but to someone who's seen it, it is crazy not to build things this way. I'm not into putting any extra work into my projects; I've just learned that keeping the code clean, simple, consistent and elegant is actually the least amount of effort. If we keep up with discipline, the workload decreases. It's that real understanding of how much time gets wasted with sloppy code and quick stupid changes that drives me forward, nothing else. The shortest path to a good long-term product is via elegance. I know this to be true from multiple different working experiences.

To get the full sense of normalization, you need to understand the context, so I'll start off this discussion with a few abstract perspectives on code. Weird, yes, but entirely necessary later when understanding my (poor) attempts at normalization rules.


DECOMPOSITION

Software is a long sequence of instructions assembled for computer hardware to execute. It has a beginning, and at some point* it has an end. Over any given instance of the lifespan of a piece of software, the instructions executed are a finite list. The list may change from run to run, but it still is fixed.

* everything ends, but more specifically, best practices for computer operations should involve rebooting the machine on a fixed schedule. Theoretical models of computing like the Turing machine have an infinite tape, but that just complicates matters unnecessarily in this case.

You can see the instructions from the hardware perspective, say as the micro-code or assembler that is executing, but it is just as easy and convenient to see them in terms of a higher-level language, one that supports the modern notions of functions, scope, conditionals and looping constructs. For this discussion, any of the functional or procedural languages will do (they do embed specific paradigms into their mechanics, but not enough to change their underlying nature).

In the crudest sense, if we wanted to create a new software program, we could just create a program with each and every instruction included, in its proper order. Yes, there would be a huge amount of redundancy, in fact most of the program would be redundant repetitive tasks happening over and over again.

For a program using over an hour's worth of CPU time, there would be a massively large number of instructions. It would be insane to attempt to sit down and type all of those instructions into a computer. Even with a totally impossible 100% accuracy, it would take years and years to complete the work. Clearly that's impossible.

While it's one big set of instructions, most software interacts with some other control mechanism, be it a user, some hardware, or some other software. In that way we can partition the whole as just smaller sets of instructions triggered by individual entry-points into the list. Each subset performs some discrete piece of functionality and then returns back to one or more controlling loops.

In these many subsets of the instructions there are a huge number of repeating patterns of various sizes. Patterns that repeat quickly, over and over again, and longer-running patterns that play out through similar instructions for huge sections of the code. Patterns within patterns.

So we can really see software as a smaller set of lists mapped back to specific functional actions. Lists that are driven by functionality. This perspective helps to break down the big problem into smaller ones, but it's still not really that useful.


DIRECTED WITH NO CYCLES

The idea of having lots of these smaller lists does not make it easy to picture or build complex software. We need a better viewpoint for assembling the functionality.

The list-of-instructions view of software may be interesting from a conceptual point of view, but it really does not match how we build the code. To save time and energy, and to make problems less likely, we have to take these huge lists and mentally break them down into a large number of sublists that we call names like functions, procedures or methods. The difference between the three is not important for this particular essay, so I'll refer to each smaller block of instructions as just a function.

We continuously deconstruct the bigger lists into many, many smaller ones, primarily to make the problem easier to handle. Once the functions are small enough, they become readily implementable.

A typical program consists of thousands and thousands of functions, broken down into collections based on underlying functionality and/or data. We group these functions together with various concepts like libraries, packages, modules, etc.

Often, at an even higher level referred to as the architecture, we collect the libraries, packages, modules, etc. into larger parts, called things like engines, subsystems, components, etc. In this way we start building up more complicated structural pieces from the little pieces that we have just torn off the main list. At each level it's just a specific term attached to a sub-list of some size.

Mostly we start by looking down on the problem, then decompose it into little pieces, and start building it up again. These layers of abstraction help us to encapsulate the massive complexity of the system into a small finite number of discrete components that should all work together nicely. Software in total is too complex, so we must continually break it into pieces.


FUNCTIONAL PATHWAYS

Functions are a common visual representation for us. We work with them, but we've also become completely used to seeing them in other circumstances like stack traces. When an unhandled error occurs, most modern languages dump out a stack trace, a list of the currently executing functions at the time of the error.

This is a useful debugging device, but there is also more happening here. A stack trace is a specific pathway in the software: a collection of functions executed at a specific time, in a specific order. While we may see this as a time slice leading to an erroneous condition, the truth is that you could create a stack trace for each and every instruction in the program. Why would you do this? If you took all of the possible stack traces, treating them as paths, you could assemble a much larger data-structure that shows the complete run-time linkages within your program. You'll get one big massive execution graph.

Nice, but it's still not leading anywhere. A graph is a rather hard structure to deal with. In its purest definition it is just an unordered collection of vertices and edges. There's lots of theory and algorithms to deal with them, but life would be easier if we continue to simplify.

We can flatten the expressibility somewhat with the realization that any cycles in the graph are caused by recursion. Function A calls function B, which calls function A again. It is interesting to know where and when the design is recursive, but not a necessary bit of knowledge for handling normalizations. Thus we can drop the recursions, by simply truncating any path at the first sign of a repeated element.

This leaves us with a simpler structure, generally known as a dag, which stands for directed acyclic graph. What's nice about this structure is that it pretty much looks like a tree where some of the children have been repeated in different locations; a tree where many different parents can point to the same underlying children. Thousands of functions point to utility functions like string append, for example. There is a lot of overlap.
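
To make that concrete, here is a small, hedged sketch (my own, in Java, with made-up function names, not anything from the post) of assembling stack traces into that kind of structure, truncating each path at the first repeated function so recursion never introduces a cycle:

    import java.util.*;

    // Hypothetical sketch: assemble stack-trace paths into an acyclic call structure.
    public class ExecutionDag {
        // caller -> the set of callees observed directly beneath it
        private final Map<String, Set<String>> edges = new HashMap<>();

        // Add one stack trace, ordered from the entry-point down to the deepest call.
        public void addTrace(List<String> path) {
            Set<String> seen = new HashSet<>();
            String caller = null;
            for (String function : path) {
                if (!seen.add(function)) {
                    break;                      // recursion: truncate at the first repeat
                }
                if (caller != null) {
                    edges.computeIfAbsent(caller, k -> new HashSet<>()).add(function);
                }
                caller = function;
            }
        }

        public Set<String> callees(String function) {
            return edges.getOrDefault(function, Collections.emptySet());
        }

        public static void main(String[] args) {
            ExecutionDag dag = new ExecutionDag();
            dag.addTrace(Arrays.asList("main", "loadReport", "parseRow", "appendString"));
            dag.addTrace(Arrays.asList("main", "saveReport", "formatRow", "appendString"));
            System.out.println(dag.callees("main"));   // loadReport and saveReport, in some order
        }
    }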

Just to keep life mildly simpler, for the rest of this post I'll talk about the execution graph as a tree. When you see the word "tree", think dag, although I prefer the earlier term because it fits in a bit more with my concerns.

In this discussion I'm not really concerned with recursion, or the fact that the same function pops up in multiple different places in the same tree. They may have some impact on a higher perspective, but that really shouldn't make a difference here. Because of that, we can just choose to see the whole thing as one big execution tree of functions.


THE IMPORTANCE OF TREES

Sometimes, if you get things framed with the right perspective, understanding comes more naturally. In this case, if we can see all software programs as just big trees of functions, we can make some very interesting statements about their arrangement and structure.

In most large programs, in the first few levels of each tree there is often some control looping construct that the programmer has no influence over. Beyond that, at specific entry points in the system, a programmer can start attaching code in specific sub-trees. Simple programs might have a small number of entry-points, while complex ones might have hundreds.

Seeing a big complex program as a massive tree of functions is probably more detail than most people can handle, so we need to focus in on the details instead. We're not particularly interested in the whole of the program, as much as we are interested in specific sub-trees of the program, and often just within limited ranges (depths) for those trees. What we are most interested in is two things: the relative level of similar functions, and the sub-tree scope of any accessed data.

But we'll have to digress for the moment.


TYPES OF CODE

There are, as it were, only a small number of things that you are actually doing with your code.

Some code is basically a single long running algorithm that follows a particular set of logic to achieve a result, basically a specific connected series of instructions. In some cases, a large collection of algorithms has been stitched together conceptually in something like an engine, all co-operating with each other. More complex, but basically the same as a single algorithm.

Some code is just glue. We are tying together disassociated parts of the system, at either a very high level like a GUI interface, or a low level like an asynchronous callback. Another huge amount of glue in most systems is just taking an internal data model and allowing it to be persistent. Glue is really just a mapping between two orthogonal interfaces.

The final common type of code is those sets of reusable primitives intended to be used over and over again. Common routines are here, but so is all of the explicit data handling that forms some internal model of the data that is accessed by other parts of the system. Not always, but the bulk of many large complex systems is the composition/parsing/traversal code that wraps the main data-types. We spend a lot of resources converting the persistent form into something more flexible, applying some simple operators, and then repackaging it for long-term storage again.

Thus we have: algorithms, glue and primitives forming our most basic types of code.

Algorithms are easy to deal with, in that you really want to get the entire algorithm all into one big function. Splitting it over a lot of little functions, even if it matches some paradigm like object-oriented generally makes it significantly harder to debug. The biggest most important attribute of an algorithm is that it works. Usually it forms some anchor for the functionality, and it's often subject to permutations on input, making testing all the more critical.

A big function that handles the algorithm simplifies any of the issues, so it's worth violating paradigms like object-oriented in order to maintain the oneness of the algorithm. Of course, the design of a full engine, particularly if it has lots of co-operating algorithms, is considerably more difficult, as the programmer is forced to balance distributing the logic for cleanliness against making it more complex. Realistically, it often takes several attempts to find good trade-offs for complex engines; experience pushes the developers to accept having to do way more refactoring on that type of code than is normal.

Glue code is just ugly by nature, and usually uglier in languages that don't make static initial declarations an easy process. Code that sits between any two arbitrary interfaces is inherently ugly by definition, and there is little, other than comments, to help. Glue is glue, and it is increasingly common in our code bases, the side-effect of having lots of underlying libraries to call. The best result is that the glue itself is encapsulated and not allowed to leak out across the rest of the design. More about that later.

So mostly, the heart and soul of our systems are the models and primitive functions we build around the fair amount of data that needs to be manipulated. We spend a great deal of effort in modern systems copying the data back and forth between a persistent representation and the runtime one. We generally build systems by implementing some internal model of the data we want to manipulate and then map it forwards and backwards to the other parts of the system. Forwards to the interfaces and GUI. Backwards to the database and persistence.

For all of the complexities of modern software, there really isn't all that much happening under the hood. Sure there is a lot of copying the data around, combining it together and then parsing it again. Moving it from this block of functions, over to another one, and then back. Often there are tangles of if/else statements blocking out endless features, strange sub-loops, and scary error handling. And of course the GUI is inherently ugly, but so is the persistence handling, both parts of the system that quickly degenerate into silliness.

Early spaghetti code was such because it had no inherent structure. Concepts like abstract data-types (ADT) came along, giving us ways to create structure out of modeling the data in the system. We moved more of our code base into being nice well structured primitives. Object-orientation is just a language based implementation of that philosophy. In each case, the structure of the code is actually driven by the structure of the data. Sometimes it gets confused, and often it is not implemented that way, but that's the core of the ideas behind these paradigms.
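
As a hedged illustration of that data-driven structure (a hypothetical example of mine, not code from the post), a tiny abstract data-type gathers everything the rest of the system is allowed to do with one kind of data into one place:

    // Hypothetical abstract data-type: the structure of the code follows the data.
    public final class DateRange {
        private final long startMillis;
        private final long endMillis;

        public DateRange(long startMillis, long endMillis) {
            if (endMillis < startMillis) {
                throw new IllegalArgumentException("end before start");
            }
            this.startMillis = startMillis;
            this.endMillis = endMillis;
        }

        // The complete set of operations on this data lives here, nowhere else.
        public boolean contains(long millis) {
            return millis >= startMillis && millis <= endMillis;
        }

        public boolean overlaps(DateRange other) {
            return startMillis <= other.endMillis && other.startMillis <= endMillis;
        }

        public long lengthMillis() {
            return endMillis - startMillis;
        }
    }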

This, I think, is important to understand because it means that, inherently, the way we have been pushing ourselves to structure our code has always been indirectly driven by the actual structure of the data that we are choosing to manipulate. Granted, this often gets lost in modern dogma, but once we get back to understanding our execution trees and the scope of data within them, this data-oriented approach makes far more sense. Basing the system around the way data is transformed is a simpler perspective than basing it on the millions of steps needed to complete those transformations.


BALANCED TREES

Returning to our overall perspective, we can see every program as a series of entry-points into various sub-trees of functions. If we want the cleanest, most simplified system, then we can apply various rules at this level to move the instructions and/or functions around to achieve the cleanest, most balanced version of the sub-trees possible. The benefit of all this effort should be to reduce the system to a simple enough state that a larger share of deficiencies become obvious, visually detectable coding problems.

Two of the key properties of the tree are balance and symmetry. Balance not only refers to the width and height of the tree being optimized; it also implies that any two similar sub-trees are in balance with each other: roughly the same height, width and depth, with the arguments to the functions at the head of each sub-tree being nearly or exactly the same.

The first big property, balance, means that co-aligned primitives should sit at the same level together. All of the similar sub-trees always start together on the same level. All of the primitives are balanced if they form sub-trees of approximately the same level and size. The level and depth of all similar functions should be in balance with each other.

For any instruction in a sub-tree, if there is a symmetrical instruction, it too should be in balance. For instance, an 'open' at a specific level should also have a 'close' at that level. The open/close pair should bound a block of code, visually, even if that means they exist by themselves.

This property of symmetry is important because its absence is easily noticeable. It is a great way to spot code that is out of place. If all the functions have a starting instruction, and an ending one, then any function missing one or both is a problem. When we cannot use the computer to enforce this type of consistency, such as in aspect-oriented programming, we must do so visibly.
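
Here's a trivial, hypothetical Java sketch of that kind of visible symmetry: the open and the close sit at the same level in the same function, bounding the block, even though the real work is pushed down a level:

    import java.io.*;

    public class ReportWriter {
        // The open and the close bound the block at the same level; their symmetry
        // makes a missing close easy to spot just by reading the function.
        public void writeReport(File file, Iterable<String> lines) throws IOException {
            Writer out = new OutputStreamWriter(new FileOutputStream(file), "UTF-8");
            try {
                writeLines(out, lines);      // all of the detail is pushed down a level
            } finally {
                out.close();
            }
        }

        private void writeLines(Writer out, Iterable<String> lines) throws IOException {
            for (String line : lines) {
                out.write(line);
                out.write('\n');
            }
        }
    }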

If the same underlying code is being used in multiple places at various different levels, then that is an indicator of a problem. The underlying code and data should fit neatly into the puzzle. The more the structure is graph-like, the messier the architecture is. If all of the calls of a specific function are on the same tree level, then the use of that function is well-balanced.


MIXED PRIMITIVES AND OTHER FAUX PAS

One very common structural problem is to create a set of primitives from one interface paradigm and mix them with another set from another paradigm, providing multiple redundant interfaces to the same underlying code. This common problem, which you'll often see in popular Java libraries for example, is caused by some assumption that more is better, or that the library would be more beneficial if it was more flexible. Bad idea. Two overlapping primitive sets just expand out the complexity for no real benefit.

A complete primitive set forms a closed loop, with just one non-overlapping operation per primitive. Simple examples are add/delete/modify or insert/update/delete, or even add/subtract/multiply/divide. What is crucial here is that all other operations can be expressed as a combination of primitives, and that the total set spans all of the possible functionality. There is one and only one way to do everything with a balanced set of primitives; if there are two ways to accomplish the same goal, then one or more of the operators are overlapping.
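
As a rough example (a hypothetical interface of my own, not from any real library), a closed, non-overlapping primitive set for a simple lookup table might look like this:

    // Hypothetical example of a closed, non-overlapping primitive set: every other
    // operation can be written in terms of these, and none of them duplicates another.
    public interface KeyValueStore {
        void put(String key, String value);   // add or replace
        String get(String key);               // null only as the out-of-band "absent" signal
        void remove(String key);
        Iterable<String> keys();
        // Deliberately no putIfAbsent, getOrDefault, copyInto, merge, etc. -- those
        // are compositions of the primitives above, not primitives themselves.
    }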

It is far better to create two separate, clean implementations, one for each primitive set, than to mix the two together. Mixing just opens the door to potentially dangerous corner-case problems caused by badly combining the calls. Why waste time working out all of the weird interactions, especially if they aren't necessary or shouldn't be used in that way? Why give programmers the means to write increasingly convoluted steps, just because they misunderstood how to work with each individual primitive set? It's the type of wasted effort that we should have learned to avoid by now.

Null handling is another common problem, although not necessarily a structural one. Programmers overuse nulls, but their purpose and point are very explicit. For instance, there should be no difference between an empty container and a null. Why distinguish between them? Having to test if a container is null, and then again if it is empty, is useless code. Just never allow null containers, and use the one and only condition as the indicator. Structurally, empty containers overlap with nulls in virtually all usages. Nulls as an out-of-band signal for a condition are sometimes necessary, but not if their meaning is fake or artificial. Too many programs are poor tangled webs of over-extended null handling.
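
A small sketch of that rule, assuming a hypothetical lookup class: never hand back a null container, so the caller has one and only one condition to test:

    import java.util.*;

    public class OrderLookup {
        private final Map<String, List<String>> itemsByOrder = new HashMap<>();

        // Returns an empty list rather than null, so the caller has one and only
        // one condition to care about: "is it empty?"
        public List<String> itemsFor(String orderId) {
            List<String> items = itemsByOrder.get(orderId);
            return (items == null) ? Collections.<String>emptyList() : items;
        }

        public static void main(String[] args) {
            OrderLookup lookup = new OrderLookup();
            for (String item : lookup.itemsFor("missing-order")) {
                System.out.println(item);   // never a NullPointerException, just no output
            }
        }
    }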

Exception handling, another overused language feature, was intended to clean up specific low-level handling code and build a better highway for systems to pass up significant errors. Often, though, programmers go beyond that low-level and high-level usage and start indiscriminately using it everywhere. Syntax constructs like try/catch form secondary execution paths through the system. One nice execution path is visually verifiable, but overlap a lot of little, randomly placed ones and you quickly swamp the usefulness of the syntax.

Programming is often about restraint, self-discipline and reductionism. Exception handling is one case where it is wise to get rid of as many handler blocks as possible. For low-level external error handling, and for high-level handling, try/catch blocks are extremely helpful, but used anywhere else they should be eyed with suspicion.
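
A minimal, hypothetical Java sketch of that restraint: one try/catch at the high level where a whole unit of work is reported, careful cleanup at the low-level external boundary, and nothing but clean declarations in between:

    import java.io.*;
    import java.util.*;

    public class ImportJob {
        // High-level handler: the one place where the failure of a whole unit of work is reported.
        public void runAll(String[] paths) {
            for (String path : paths) {
                try {
                    importFile(path);
                } catch (IOException e) {
                    System.err.println("skipping " + path + ": " + e.getMessage());
                }
            }
        }

        // The middle of the call tree just declares the exception and stays clean.
        private void importFile(String path) throws IOException {
            for (String line : readLines(path)) {
                process(line);
            }
        }

        // Low-level external boundary: open and close the resource here, and let any
        // significant error travel straight up the "highway" to the high-level handler.
        private List<String> readLines(String path) throws IOException {
            BufferedReader in = new BufferedReader(new FileReader(path));
            try {
                List<String> lines = new ArrayList<String>();
                String line;
                while ((line = in.readLine()) != null) {
                    lines.add(line);
                }
                return lines;
            } finally {
                in.close();
            }
        }

        private void process(String line) {
            // placeholder for the real per-line work
        }
    }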

All three of these issues are really just instances of programmers adding an extra level of complexity to their instructions to over-compensate for the overall lack of structure. Wasted nulls and excessive try/catch blocks are very noticeable blights in elegant code, but just fit into the background noise in messy code.

Once the code has been balanced to some degree, it is far easier to see what can be easily deleted, because it is serving no real functional purpose.


DATA AND CODE

Looking at programs as sub-trees of functions allows us to give great consideration to the program's overall structure without getting too lost in the details. But the code by itself will not fully normalize a program into something elegant.

Programs are always composed of two distinct, and often conflicting, things: code and data. The sub-trees lay out a structural framework, but we also need to understand how the data access is distributed throughout the overall structure.

If we look at all of the data in the system, we can see that it is a relatively small, discrete collection of data-structures, which essentially contain all of the data-types in the system. That is, for any given system, the amount of data used in it is both limited and finite*. You could create a small fixed list of the major entities.

* even when programmers support dynamic data representations, they often do so in very static ways, defeating the full power of their dynamic code. It's a safe bet to ignore dynamic code, or at least to contain it all in a set of fixed 'dynamic' data-types (thus making it limited and finite).

This notion is extremely helpful because we can start looking at the scope of all of the data, in terms of the trees in the system. A well-balanced, normalized bit of code will encapsulate specific data structures within specific sub-trees. The data is hidden from any code outside of that tree. This is information hiding, and encapsulation (if the code is buried there too).

We want, at each functional level, the concepts, information and ideas below that sub-tree to be a small, consistent set. As we descend further down into the tree, we want tighter and tighter scoping of the data. The data and code in a given fixed set of primitive string utilities, for instance, would, underneath it all, refer to just strings and specific manipulations. As the data sinks lower into the tree, the understanding of the data becomes more and more general. Explicit parameters at a high level are a hash table below that, and then just strings and keys below that. Thus the language we use in the code to describe variables, function names, parameters, etc. all matches the level and scope of the data. As the level gets deeper, the terminology gets more general.
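
For instance, a tiny hypothetical sketch of that layering in Java: explicit, domain-specific vocabulary at the top of the sub-tree, a generic key/default lookup underneath it, and plain strings and keys at the bottom:

    import java.util.Properties;

    // Hypothetical layering: the names get more general as the data sinks lower in the tree.
    public class ServerSettings {
        private final Properties raw = new Properties();   // bottom: just strings and keys

        // Top of the sub-tree: explicit, domain-specific vocabulary.
        public int connectionTimeoutSeconds() {
            return intValue("connection.timeout", 30);
        }

        public int maxRetries() {
            return intValue("max.retries", 3);
        }

        // Middle: generic key/default handling, no domain words at all.
        private int intValue(String key, int defaultValue) {
            String text = raw.getProperty(key);
            return (text == null) ? defaultValue : Integer.parseInt(text.trim());
        }
    }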

For a normalized model, the collection of sub-trees that make up the interface all encapsulate the scope of the underlying data. Any data that gets beyond that scope "leaks" into other parts of the system; it is global, or effectively global. And these problems with the data are more common than expected.


STATE AND ITS EFFECTS

A global variable is one that is accessible from any location in the entire tree. We've known for a long time that globals are considered dangerous, because they allow multiple access points in different parts of the system to quickly fall out of sync even with simple changes. A reckless change can be followed by a long and painful hunt for the culprit. Because of this, we actively try to avoid globals.

What we know is true and a big problem for the whole tree is also true and a big problem for any given sub-tree within the system. For any sub-tree, any commonly shared location of data is the "state" of that sub-tree. Sometimes we don't see it as such, but if multiple locations within the tree access the same variable, then it is essentially a global. Scoped, a bit, but still global.

We've known for a long time that state is bad. State hugely increases the likelihood of errors, and makes it very hard to test to see if the software is working or not. State problems may require weird compound testing methods to re-create, so they are very expensive both in terms of development and testing. Most non-obvious bugs* are a result of state problems of some type.

* OK, threading bugs in Java, and hanging pointers in C and C++ are probably way more popular, but these were "features" added to the languages to keep programmers employed.

Implicit state is still state, and is far more dangerous because the programmers are generally unaware of the problem. Any sort of data that is not explicitly passed in and/or out of a function is some type of state. That means that any and every side-effect is an implicit state of some kind. Any state that is changed in many different places in the program, even if the change goes through a covering function call, is a de facto global variable, with all of its inherent problems and weaknesses. Any and all of those locations are dangerous.
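
For example (a hypothetical sketch of mine, not from the post), a value updated through a covering setter from many places is still a de facto global; passing it explicitly keeps the dependency visible at every call site:

    // Hypothetical sketch of implicit state versus explicit data flow.
    public class Discounts {

        // De facto global: any code anywhere can change it, and nothing in a caller's
        // signature warns you that the result depends on it.
        private static double currentRate = 0.0;

        public static void setCurrentRate(double rate) { currentRate = rate; }

        public static double priceWithHiddenState(double base) {
            return base * (1.0 - currentRate);
        }

        // Explicit version: the same calculation, but the state is passed in,
        // so the dependency is visible everywhere in the call tree.
        public static double priceWithExplicitRate(double base, double rate) {
            return base * (1.0 - rate);
        }
    }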

Stateless code was a great idea and a best-practice goal for a while, but that went horribly wrong with paradigms like object-oriented. Objects are inherently state-happy, and in that way they often hide other, more hideous state problems from unsuspecting programmers. An instance of an object can be scattered across an execution graph like a splatter-paint artwork. One object instance can easily be acting as a bad global variable for any other. It can get very ugly, very quickly.

Without careful consideration of these structural relationships, old problems that we banished for good reasons in the past can easily creep back into our code bases. Worse, they can be effectively hidden from most developers. Toss in a fine helping of threads, and it is easy to understand why so many popular applications occasionally, and often quietly, cease to behave correctly. And why they do so only in a tiny fraction of their runs. Seemingly random, non-deterministic problems, lurking in the background, wasting lots and lots of time and effort.

These realizations are one of the primary reasons why it is important to sometimes change perspective on a problem. Hidden, yet inherent flaws in one viewpoint become far more obvious and understandable from another.


WASTED RESOURCES

If you trace out the data in many systems, you will find that it progresses through the code, jump by jump, in a series of copies. Sometimes buffer copies, sometimes it is being parsed, sometimes it is being reassembled. The path of any given piece of data through a system always involves lots of copies. Modern languages and paradigms have made this problem worse, causing this type of bloat to increase rapidly.

From a tree perspective, what is happening is that the data is scoped within a large series of different sub-trees. As it leaves one sub-tree, it is copied into the next one. In this sense, we can see each copy as an implicit violation of the sub-tree encapsulation. More to the point, if a large chunk of data is copied into a sub-tree only to facilitate some small set of manipulations, then that specific code could easily be moved to a more appropriate tree.
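
A hedged example with made-up names: rather than copying a sub-tree's data out just to run one small manipulation, the manipulation can move into the sub-tree that already owns the data:

    import java.util.*;

    // Hypothetical sketch: keep the manipulation with the data instead of copying the data out.
    public class AccountLedger {
        private final List<Long> amountsInCents = new ArrayList<Long>();

        public void add(long amountInCents) {
            amountsInCents.add(amountInCents);
        }

        // Instead of exposing a defensive copy of the whole list so a caller can total it,
        // the small manipulation lives here, inside the sub-tree that owns the data.
        public long totalInCents() {
            long total = 0;
            for (long amount : amountsInCents) {
                total += amount;
            }
            return total;
        }
    }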

In that sense, the smaller the trees and the fewer of them that hold the data, the more encapsulated it is. Watching how the data flows throughout the system gives a good indication as to a better working structure.


ARCHITECTURAL LINES

At an even higher level, we can see the architecture as how the big major sub-trees in the system are laid out with respect to each other. Balance and symmetry apply here as well as anywhere else.

To get a real architectural line between two pieces of code, they both need to have entirely separate sub-trees and data. Overlap in either crosses the line.

Encapsulation is burying all of the messy details, code and data, of something behind a small subset of sub-trees. They act as the interface that hides all of the other detail. Decomposing a problem properly makes it easier to build a real workable solution, not just one that is close to workable. We need to encapsulate the details in order to manage the complexity of the project and actually get it done.

More importantly, libraries, modules, etc. should be organized around their underlying data, not around their algorithmic code. That principle makes it really easy to see a library as the data-containment functionality for a specific data type in the program. The algorithm handling and the data handling should be separated.

As a related note, many user libraries and packages combine mashes of algorithms and data-handling that are inconsistent and unbalanced. Clearly defining the structure around well-balanced decompositions would make most libraries considerably easier to use. We need a movement toward simple components that wrap complete access to a specific data structure or algorithm. That would make the choice of using a specific library really a decision about supporting a new type of data, and it would also cut down on new versions and upgrades.

The spasmodic and arbitrary blend of data and functionality whipped up into most modern libraries forces a constant cycle of updating, if for no other reason than to try to get some of the contained functionality into a more complete state. For many libraries, this dynamic upgrade path is not necessary, but simply a by-product of disorganization and bad partitioning. This is a clear example of why better-normalized libraries would significantly cut down on development effort.


REFACTORING

Knowing what a good structure is doesn't help unless there is some easy and simple way to get any program there. Refactoring provides the micro-normalization rules that allow a programmer to start with anything and make it more orderly. Of course, simple consistency is also critical in making it all hold together.

You can see all of the refactoring algorithms as just ways of pushing and pulling the code up and down between the different levels of the tree. In this sense we can balance the functions, then balance their usage, then balance the data, etc. Think of it like the "rotation" operations in a weight-balanced binary tree.
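
As a rough sketch of what pushing the code down a level looks like (my own hypothetical example, not one of the catalogued refactorings), here is the same behaviour before and after one such transformation:

    // Hypothetical before/after: the behaviour is identical, only the level changes.
    public class Greeting {

        // Before: the formatting detail sits at the same level as the decision.
        public String greetFlat(String name, boolean formal) {
            if (formal) {
                return "Dear " + name.trim() + ",";
            }
            return "Hi " + name.trim() + "!";
        }

        // After: the detail is pushed one level down; the top level reads as intent.
        public String greet(String name, boolean formal) {
            return formal ? formalGreeting(name) : casualGreeting(name);
        }

        private String formalGreeting(String name) { return "Dear " + name.trim() + ","; }

        private String casualGreeting(String name) { return "Hi " + name.trim() + "!"; }
    }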

It is possible to take any working program and, after applying a very long series of non-volatile refactorings, arrive back at another working program. Refactoring doesn't have to interfere with the functioning of the code; in fact, it is far better to pass through with a large series of non-altering changes first, before moving on to expanding the code base to add in new functionality.

Not all refactorings done this way will be non-destructive, because by definition some of them will actually be removing bugs from the system. The changes in behavior are often ultimately good, but there can be unexpected dependencies tied to the buggy code. Under these types of circumstances, it's best to temporarily duplicate the code, with a new clean version alongside the existing broken one. That makes it possible to reassemble all of the pieces first and do some comparison testing to ensure that none of the behavior has changed, before moving on to deleting the dependencies on the broken code. Getting the code back to working order quickly finds obvious problems and keeps the development work moving forward in a series of small, independent, discrete steps.
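
A minimal sketch of that temporary duplication, assuming a hypothetical calculation being cleaned up: both versions live side by side, and a simple comparison harness shows exactly where their behaviour differs before the broken one is deleted:

    // Hypothetical transition step: the clean version runs alongside the old one
    // until comparison testing shows where their behaviour differs on real inputs.
    public class ShippingCost {

        // Existing (buggy) version, kept temporarily so nothing depending on its
        // quirks breaks silently while the pieces are being reassembled.
        public long oldCostInCents(long weightGrams) {
            return (weightGrams / 100) * 25;          // truncates: loses the last partial 100g
        }

        // Clean replacement.
        public long newCostInCents(long weightGrams) {
            long units = (weightGrams + 99) / 100;    // rounds the partial unit up
            return units * 25;
        }

        // Simple comparison harness; the differences are the behaviour changes to review.
        public static void main(String[] args) {
            ShippingCost cost = new ShippingCost();
            long[] samples = {0, 50, 100, 150, 1000};
            for (long grams : samples) {
                long oldCost = cost.oldCostInCents(grams);
                long newCost = cost.newCostInCents(grams);
                System.out.println(grams + "g: old=" + oldCost + " new=" + newCost
                        + (oldCost == newCost ? "" : "   <-- differs"));
            }
        }
    }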

Normalized code that has been refactored, then retested (lightly), sets a strong base for extending the system to encompass the next level of functionality. Without this type of discipline, the code simply degenerates into some hideous onion-nightmare, a sad and embarrassing state of affairs that is entirely unnecessary. Each time the code degenerates, it becomes more of a work magnet, drawing in masses of wasted time debugging stupid problems and working around fixable issues. Anybody working on that type of code base knows that pretty quickly more effort goes into badly patching sloppy problems than into new development. A sad, and absolutely avoidable, state.


FINAL SUMMARY

I wasn't really specific in producing a finite set of "forms" for normalizing code. But if you see it as a structural problem, then the rules themselves are less important; they are simply an easier way to transform one structure into a better one. The final structure is what's key.

Someday, I'm sure someone will come along with a clearer set of rules. Something that can easily fit onto the back of refactoring, that makes it easily understood at the higher level.

We know the code is normalized by the fact that the final structure we create is an easy-to-read one. We've simplified the execution graph. The code maps to the structure, which maps back to the code again. A messy graph is usually messy code.

Be careful in applying this knowledge, for as I said in "The Nature of Simple", human-based simplifications are not the same as machine ones. We are somewhat flawed, and as such our normalizations will be too. We don't want things that are truly universally simplified, just ones that are 'simpler' to us.

If that's true, then why bother? Like the database, a good developer knows what is normal form for their code, even if they don't strictly follow it. There are exceptions, but you cannot understand when they are OK, if you do not grasp the complete picture. Breaking the rules without understanding them just pushes back the success onto luck. Relying on luck fails often enough.

Don't forget that this work, extra as it may be, isn't to be done for fun, or because it's right. It is to be done to make it easier to move the code base forward to the next version. It is to be done to clean up the old messes, and to make way for a better version. It is to be done to save time, and to allow us to leverage our coding abilities better, instead of our ability to continuously hit "next" in a debugger. It's neither arbitrary nor extra, simply work that needs to be completed to ensure that the overall health of the project gets better from month to month, not worse.

6 comments:

  1. Nice writeup...

    One possible typo: "heart and sole" is probably meant to be "heart and soul".

  2. Hi Mike,

    Thanks for finding the typo. Freudian slip? I guess I was hungry when I was writing :-) There's something fishy about that, isn't there?


    Paul.

  3. I feel that the "time structure" should be considered. The topic of time or runtime behavior often appears between the lines in this post. I think this aspect of software is often overlooked. Software methods often focus mainly on "architecture" or "structure". They often produce more or less elaborate boxes-and-arrows diagrams, and very few timing diagrams. To be honest, I always felt that those big boxes-and-arrows diagrams are far from telling everything about how a system works. It's a bit like trying to understand how the human body works with only PET-scan or X-ray photographs.
    I can think of one case where the focus moves from the "static" structural point of view to the "dynamic" time-related point of view: communication protocol design.

    I don't think that, as the post suggests, you can look at the code just as if it were a painting and tell whether it is wrong (OK, I exaggerate a bit). I rather think that what one actually does when reading code is the mental reconstruction or simulation of the sequence of actions taken by the program in some particular case of interest.

    Putting more focus on the runtime aspect of software already helps: user stories are one example. Maybe there's more to be discovered.

  4. Hi Astrobe,

    Thanks for the comment. Quietly, underpinning the tree perspective is actually a time-based one. When you've normalized the coding structure, indirectly you've also made it far easier to envision what it looks like when it is running. For instance if you had a list of simple instructions, just by looking at it, you can tell if it's going to work or not.

    You're right that many programmers learn to simulate the code in their heads (I do), but that only works if it isn't spaghetti. Messy code is inordinately harder to work through. But that's exactly why clean orderly code is often easy to debug visually. While working through it, you stumble across something wrong or hard to interpret, so you fix it.

    In many ways some object-oriented designs just make that more complicated because they tend towards run-time circumstances that don't easily match the structural ones. Personally I'd say that it was actually a bad attribute, unless for that increase in complexity you ended up buying something way better. Extra complexity with no obvious benefit is never a good trade-off.

    Communication protocol design is often best visualized as a state machine. In that case, the easiest implementations are as just big algorithms, often only a series of discrete states and something to walk through them. A fancy OO version can be hard to debug, but just a collection of states (and a diagram) isn't so bad.

    Paul.

  5. Marvellous post.

    I was looking forward to seeing your concise set of rules for balancing the code, but, as you said, "the rules themselves are less important."

    Surely you could tease out a shortlist of the rules! Don't wait for, "Someday, I'm sure someone will come along with a clearer set of rules."

    I practise a version of your subtree idea unfailingly:
    http://www.edmundkirwan.com/fracdoc/frac-toc.html

    Which then led to:
    http://www.edmundkirwan.com/pub/

    Write those rules!

    Ed Kirwan.

  6. Hi Ed,

    Thanks for the comments. I've got another post almost done, but I'll revisit this subject after that. I was thinking of digging into visualization and compaction, so I could also get back into trying to define the rules.

    I do get a bit worried, because these types of rules will always be based on trade-offs. Strict applicability will often undo some of the normalization effects. Humans simplify irrationally (sometimes), so even with a fixed set of rules there should be times when they are violated to make the code more readable (same as database schemas). The rules are good to know, but they are still just guidelines for average issues.

    Paul.

