Sunday, November 22, 2015

Containers, Collections and Null

A variable in a computer program holds exactly one member of a finite set of possibilities. This datum symbolically represents something in the physical world, although it is also possible that it represents something from an imaginary reality.

The usefulness of a variable is that it can be displayed at the appropriate moment to allow a person, or something mechanical, to make a decision. Beyond that, it is just being saved under the belief that it will be used for this purpose some day or that it will be an input into some other variable’s moment.

A variable can also have with it an associated null flag. It is not actually ‘part’ of that variable, but rather an independent boolean variable that sheds light on the original contents. A friend of mine used to refer to null values as ‘out-of-band’ signals; meaning that they sit outside of the set of possible values. This is important in that they cannot be confused with another member of the set, what he used to call ‘in-band’ signals or often ‘magic numbers’.

There is generally a lot of confusion surrounding the use of nulls. People ascribe all sorts of program-specific meanings to them, which act as implicit data in their program. In this way, they overload the out-of-band signal to represent a custom state relative to their own data. This is a bad idea. Overloading any meaning in a variable guarantees some sort of future problem since it is subjective and easily forgotten. It is far better to ascribe a tighter definition.

A variable is collected or it is not; computers are perfectly binary in that regard. If it is possible for a variable to not be collected, the only reason for this is that the variable is ‘optional’ otherwise the program will not continue with its processing. Obviously bad data should not be allowed to propagate throughout a system or get stored in the database. If there are different, and useful, excuses for not having specified a variable right away, then these are independent variables themselves. With that in mind, a null only ever means that the variable is optional and was not collected, nothing else should be inferred from it.

Then for the sake of modelling, for all variables, one only needs to know if the data is mandatory -- it must be collected -- or it is nullable. Getting a null in a mandatory field is obviously an error, which needs to invoke some error handling. Getting a null in an optional field is fine. If a variable is nullable ‘anywhere’ in the system then it should be nullable ‘everywhere’ in the system (it is part of the data model). If there is functionality dependent on an optional variable, then it should only ever execute when that variable is present, but should not be an error if it is missing. If the underlying data is optional, then any derived data or functionality built on top of it is optional as well. Optionality propagates upwards.

With that perspective, handling nulls for a variable is easy and consistent. If it doesn’t make sense that the data could be missing then don’t make it nullable. It also helps to understand where to reinforce the data model if the data is a bit trickier than normal. For instance, partially collected data is ‘bad’ data until all of the mandatory values have been specified. So it should be partitioned or marked appropriately until it is ready.

The strength of a computer is not that it can remember a single variable, but rather that it can remember a huge number of them. And they can be interrelated. What really adds to the knowledge is the structure of these cross-variable relationships. There can be a huge number of instances of the same variable and/or there can be a huge number of relationships to other types of variables.

The larger and more complex the structure, the more information that can be gleaned out of this collected data. The history of software has always been about trying to cope with increasingly larger and more complex data structures even if that hasn’t be explicitly stated as the goal. In a sense, we don’t just want data, we want a whole model of something (often including a time dimension) that we can use to make deeper, more precise decisions.

The complex structural relationships between variables are represented within a program as ‘containers’. A simple container, say a structure in C or an simple object in Java, is just a static group of related variables. These can be moved throughout the program as if they were a single variable, insuring that they are all appropriately kept together. In this arrangement, the name of the variable in many programming languages is a compile-time attribute. That is, the programmer refers to the name when writing the code, but no such name exists in the system at runtime. Some languages provide introspection, allowing the programmer to retrieve their name and use it at runtime.

Null handling for simple containers is similar to null handling for individual variables. Null means that ‘all’ of the data in the structure is optional. In some languages however, there can be confusion on how to handle mandatory data. With a simple variable, the language itself can be used to insist that it always exists, but with a pointer or reference to a container that check needs to be explicitly in the code. It can come in the form of asserts or some explicit branches that jump to error handling. The code should not continue with functionality that relies on mandatory data if it is missing.

The next level of sophistication allows for a runtime based ‘named’ variable. That is, both the name and the variable contents are passed around as variables themselves, together in a container. Frequently this container is implemented as a hashtable (sometimes called a dictionary), although in some cases ‘order’ is required so there can also be an underlying linked-list. This is quite useful for reusing code for functionally on similar data with only some minor processing attached to a few specific variables. Then it mostly leaves the bulk of the code to manipulate the data without having to really understand it, making it strongly reusable. This works well for communications, printing, displaying stuff in widgets, etc. Any part of the code whose data processing isn’t explicitly dependent on the explicit meaning of the underlying data, although sometimes there needs to be categories (often data types) of behaviour. Learning to utilize this paradigm can cut out a huge number of redundant lines of code in most common programs.

Null handling for containers of named variables is slightly different, in that the absence of a particular named pair is identical to the name existing with a null value. Given this overlap, it is usually best to not add empty, optional data into the container. This is also reflected symmetrically by not passing along values without a name either. This type of structuring means that processing such a container is simplified in that each pair is either mandatory, or it is optionally included. If a pair is missing, then it was optional. To enforce mandatory variables, again there needs to be some trivial code that interrupts processing.

Containers get more interesting when they allow multiple instances of the same data, such as an array, list or tree. Large groups of collected data shed brighter light on the behaviour of their individual datum, thus providing deeper details. For these ‘collections’, they can be ordered or unordered although the latter is really a figment of the programmer’s imagination in that everything on a computer has an intrinsic order, it is just sometimes ignored.

Ordering can be based on any variables within the data although often it is misunderstood; the classic case of that being tables in a relational database returning their default order based on their primary index construction, thus leading to the extremely common bug of the SELECT statement order changing unexpectedly when the tables grow large or rows are deleted. Programmers don’t see this potentially chaotic reordering when testing with small datasets, so they make bad assumptions about what really caused the visible ordering.

One frequent confusion that occurs is with null handling for collections, in that an empty collection or a null reference can be interpreted as the same thing. In most cases, it is really optional data has not been collected, so it doesn’t make sense to support this redundancy. It is more appropriate to handle the ‘zero’ items condition as an optional collection, and the null reference itself as a programming error. This is supported quite elegantly by having any code that returns a collection to always allocate an empty one. This can then be mirrored by any function that needs a collection as well, in that it can assume that null will not be passed in, just empty collections. This reduces bugs caused by inconsistent collection handling, but it does mean that every branch of any new code should be executed at least once in testing to catch any unwanted nulls. It doesn’t however mean that every permutation of the logic needs to be tested, just the empty case, so the minimum test cases are fairly small. This isn’t extra work in that the minimum reasonable practice for any testing is always that no ‘untested’ lines of code should ever be deployed. That’s just asking for trouble, and if it happens it is a process problem, not a coding one.

This null handling philosophy prevents the messy and time wasting inefficiencies of never knowing which direction an underlying programmer is going to choose. We’ll call it ‘normalized collection handling’. It does however require wrapping any questionable, underlying calls from other code just in case it doesn’t behave this way, but wrapping all third-party library calls was always considered a strong best practice, right up until it was forgotten.

Some programmers may not like it because they believe that it is more resource intensive. Passing around a lot of empty collections will definitely use extra memory. Getting the count out of a collection is probably more CPU than just checking for a null (but less if you have to do both which is usually the case). This is true, but because an empty collection eats up the base management costs, it also means that the resource usage during processing, at least from the lowest level, is considerably more consistent. Programs that fluctuate with large resource usage extremes are far more volatile, which makes them far more difficult for operations. That is, if whenever you run a program, its total resource usage is predictable and relative to load, then it becomes really easy to allocate the right hardware. If it swings to extremes, it becomes far more likely to exceed its allocation. Stable-resource systems are way easier to manage. A little extra memory then is a fair price to pay for simpler and more stable code.

We can generalize all of the above by realizing that null handling differs between static and dynamic variables. Adding nulls is extra work in the static case, while enforcing mandatory requirements is extra work in the dynamic case. Static data is easier to code, but dynamic data is both flexible and more reusable. In that sense if we are going to just hardcode a great deal of static data, it is best if the default is mandatory and optional data is the least frequent special case. The exact opposite is true with dynamic data. Most things should be optional, to avoid filling the code with mandatory checks. This behavioural flip flop causes a great deal of confusion because people want a one-size-fits-all approach.

A nice attribute about this whole perspective of data is that it simplifies lots of stuff. We only have containers of unnamed or named variables. Some containers are collections. This is actually an old philosophy, in that it shows up in languages like AWK that have ‘associated arrays’ where the array index can be an integer or a string. The former is a traditional array, while the latter is a hashtable, but they are treated identically. In fact, in AWK, I think it just cheats and makes everything a hashtable, converting any other data type to a string key. This makes it a rather nice example of an abstraction smoothing away the special cases, for the convenience of both the users of the language and the original programmers.

We have to, of course, extend this to allow for containers of a mix between variables and sub-containers, and do that in a fully recursive manner so that a variable can be a reference to a container or collection itself. This does open the door to having cycles, but in most cases they are trivial to detect and easily managed. Going a little farther, the ‘name’ of the variable or key of the hashtable itself can be a recursive container of structures as well, although eventually it needs to resolve down to something that is both comparable and will generate a constant hashing code (it can't change over time). Mixed all together, these techniques give us a fully expressible means of symbolically representing any model in spacetime, and thus can be increasingly used to make even more complex decisions.

We should try not to get lost in the recursive nature of what we need to express, and it is paradigms like ADTs and Object Oriented programming that are really helpful in this regard. A sophisticated program will have an extraordinarily complex data model with many, many different depths to it, but we can assemble and work on these one container at a time. That is, if the model is a list of queues of trees with stacks and matrices scattered about, we don’t have to understand the entire structure as a ‘whole’ in order to build and maintain the code. We can decompose it into its underlying components, data-structures and/or objects, and insure that each one works as expected. We can then work on, and test, each sub-relationship again independently. If we know all the structural relationships are correct, and we know the underlying handling is correct, then we can infer that the overall model is correct.

While that satisfies the ability to implement the code, it says nothing about how we might decide on a complex model to a real world solution. If there is art left in programming, much of it comes directly from this issue. What’s most obvious is that the construction of sophisticated models needs to be prototyped long before the time is spent to implement them, so that its suitability can be quickly rearranged as needed. The other point is that because of the intrinsic complexity, this type of modelling needs to be built from the bottom-up, while still being able to understand the top-down imposed context. Whiteboards, creativity and a lot of discussions seem to be the best tools for arriving at decent results. Realistically this is actually more of an analysis problem then a coding one. The structure of the data drives the behaviour, the code just needs to be aware of what it is supposed to be, and handle it in a consistent manner.

Some programmers would rather avoid this whole brand of complexity and stick to masses of independent static variables, but it is inevitable that larger, more complex, structural relationships will become increasingly common as the user’s requirements for their software gradually get more sophisticated. A good example of this being the non-trivial data-structures that underpin spreadsheets. They have only gained in popularity since they were invented. Their complexity unleashed power-users to solve programmatic issues for themselves that were unimaginable before this technology existed. That higher level of sophistication is sorely needed for many other domain based problems, but currently they are too often encoded statically. We’ve been making very painfully slow progress in this area, but it is progress and it will continue as users increasingly learn what is really possible with software.

Thinking of the possible variable arrangements as structural relationships between containers makes it fairly straightforward to understand how to represent increasingly complex real world data. With tighter definitions, we can avoid the disorganization brought on by vagueness, which will also cut down on bugs. Software is always founded on really simple concepts, the definition of a Turing machine for example, but we have built on top of this a lot of artificial complexity due to needless variations. Being a little stricter with our understanding allows us to return to minimal complexity, thus letting us focus our efforts on achieving both sophistication and predictability. If we want better software, we have to detach from the sloppiness of the past.