Sunday, November 19, 2017

Bombproof Data Entry

The title of this post is from a magazine article that I barely remember reading back in the mid-80s. Although I’ve forgotten much of what it said, its underlying points have stuck with me all of these decades.


We can think of any program as having an ‘inside’ and an ‘outside’. The inside of the program is any and all code that a development team has control of, code that they can actively change. The outside is everything else. That includes users, other systems, libraries, the OS, shared resources, databases, etc. It even includes code from related internal teams that is effectively unchangeable by the team in question. Most of the code utilized in large systems is really outside of the program; often there are effectively billions of lines, written by thousands of coders, that could get executed.


The idea is that any data coming from the outside needs to be checked first, and rejected if it is not valid. Bad data should never circulate inside a program.


Each datum within a program is finite. That is, for any given variable, there are only a finite number of different possible values that it can hold. For a data type like floats, there may be a mass of different possibilities, but it is still finite. Sets of data, like a string of characters, can essentially be infinite in size, but in practice they are usually bounded by external constraints. In that sense, although we can collect a huge depth of permutations, the breadth of the data is usually limited. We can use that attribute to our advantage.


In most programming languages, for a variable like an integer, we allow any value between the minimum and the maximum, and there are usually some out-of-band signals possible as well, like NULL or Overflow. However, most times when we use an integer, only a tiny fraction of this set is acceptable. We might, for example, only want to collect an integer between one and ten. So what we really want to do as the data comes in from the outside is to reject all other possibilities. We only want a variable to hold our tiny little subset, e.g. ‘int1..10’. The tightest possible constraint.


Now, it would be nice if the programming language would allow us to explicitly declare this data type variation and to nicely reject any outside data appropriately. A few languages, such as Ada and Pascal, do offer range subtypes, but I believe that in most mainstream languages we have to do this ourselves at runtime. Then, for example, the user would enter a number into a widget in the GUI, but before we accept that data we would call a function on it and possibly trigger a set of Validation Errors (errors on data should never be singular, they should always be sets) if our exact constraints are not met.
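As a minimal sketch of that runtime check, assuming Python and a hypothetical function name (neither comes from the original article), the ‘int1..10’ constraint might be enforced at the boundary like this. The function always hands back a list of errors, even when there is only one, so the calling code never has to treat a single error as a special case.

```python
from typing import Optional


def validate_int_1_to_10(raw: str) -> tuple[Optional[int], list[str]]:
    """Check outside input against the 'int1..10' constraint.

    Returns (value, errors). The value is None whenever errors is non-empty.
    """
    errors: list[str] = []
    text = raw.strip()
    if not text:
        errors.append("a value is required")
        return None, errors
    try:
        value = int(text)
    except ValueError:
        errors.append(f"{text!r} is not a whole number")
        return None, errors
    if value < 1:
        errors.append(f"{value} is below the minimum of 1")
    if value > 10:
        errors.append(f"{value} is above the maximum of 10")
    return (value if not errors else None), errors
```

Only a value that comes back with an empty error list would be allowed to circulate inside the program.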


Internally, we could pass around that variable ad nauseam, without having to ever check or copy it. Since it made it inside, it is now safe and exactly what is expected. If we needed to persist this data, we wouldn’t need to recheck it on the way to the database, but if the underlying schema wasn’t tightly in sync, we would need to check it on the way back in.


Overall, this gives us a pretty simple paradigm for both structuring all validation code, and for minimizing any data fiddling. However, dealing with only independent integers is easy. In practice, keeping to this philosophy is a bit more challenging, and sometimes requires deep contemplation and trade-offs.


The first hiccup comes from variables that need discontinuous ranges. That’s not too hard; we just need to think of them as concatenated, such as ‘int1..10,20..40’, and we can even allow for overlaps like ‘int1..10,5..7,35,40..45’. Overlaps are not efficient, but not really problematic.
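A rough sketch of concatenated ranges, assuming Python and assuming every range in the fake notation is written with ‘..’ and separated by commas. The helper names are hypothetical. Overlapping entries are simply tested in turn, which is wasteful but harmless.

```python
def parse_ranges(spec: str) -> list[tuple[int, int]]:
    """Turn '1..10,20..40,35' into [(1, 10), (20, 40), (35, 35)]."""
    ranges = []
    for part in spec.split(","):
        low, sep, high = part.partition("..")
        # A bare number such as '35' is treated as the range 35..35.
        ranges.append((int(low), int(high) if sep else int(low)))
    return ranges


def in_ranges(value: int, ranges: list[tuple[int, int]]) -> bool:
    """True if the value falls inside at least one inclusive range."""
    return any(low <= value <= high for low, high in ranges)
```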


Of course for floating point, we get the standard mathematical range issues of open/closed, which might look like ‘float[0..1)’, noting of course that my fake notation now clashes with the standard array notation in most languages and that that ambiguity would need to be properly addressed.
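One way the open/closed distinction might be captured, sketched in Python with a hypothetical Interval type. The defaults model ‘float[0..1)’ style ranges: closed at the low end and open at the high end. Rejecting NaN is an extra assumption on my part, since NaN is an out-of-band value that no range comparison can sensibly accept.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    low: float
    high: float
    low_closed: bool = True    # '[' in the notation
    high_closed: bool = False  # ')' in the notation

    def contains(self, x: float) -> bool:
        if math.isnan(x):
            return False  # NaN is never a member of any interval
        above = x >= self.low if self.low_closed else x > self.low
        below = x <= self.high if self.high_closed else x < self.high
        return above and below
```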


Strings might seem difficult as well, but if we take them to be containers of characters and realize that regular expressions do mostly articulate their constraints, we get data types like ‘string*ab?’ which rather nicely restrict their usage. Extending that, we can quite easily apply wildcards and set theory to any sort of container of underlying types. In Data Modeling, I also discussed internal structural relationships such as trees, which can be described and enforced as well. That then mixes with what I wrote in Containers, Collections and Null, so that we can specify variable inter-dependencies and external structure. That’s almost the full breadth of a static model.
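For strings, a small sketch in Python might pair a regular expression with a length bound. The function name and the example pattern in the usage note are hypothetical, and `re.fullmatch` is used so that the whole string has to satisfy the pattern, not just a prefix.

```python
import re


def validate_string(raw: str, pattern: str, max_length: int) -> list[str]:
    """Return every constraint violation found in raw (empty list if valid)."""
    errors = []
    if len(raw) > max_length:
        errors.append(f"longer than {max_length} characters")
    if re.fullmatch(pattern, raw) is None:
        errors.append(f"does not match the pattern {pattern!r}")
    return errors
```

With a pattern such as `[a-z]{2}\d{2}` and a limit of ten characters, an input that is both too long and the wrong shape would come back with two errors rather than just the first one found.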


In that way, the user might play with a set of widgets or canvas controls to construct some complicated data model which can be passed through validation and flagged with all of the constraint violations. With that strength of data entry, the rest of the program is nearly trivial. You just have to persist the data, then get it back into the widgets later when requested.


A long time ago, programmers tended towards hardcoding the size of their structures. Often that meant unexpected problems from running into these constraints, which usually involved having to quickly re-deploy the code. Over the decades, we’ve shifted far away from those types of limitations and shouldn’t go backwards. Still, the practical reality of most variable sized data in a system is that there are external bounds that should be in place, and these are now often neglected.


For example, in a system that collects users’ names, if a key feature is to always produce high-quality documentation, such as direct client marketing materials, the need for the presentation to never be broken dictates the effective number of characters that can be used. You can’t, for example, have a first name that is 3000 characters long, since it would wrap over multiple lines in the output and look horrible. That type of unspecified data usage constraint is mostly ignored these days but commonly exists within most data models.


Even if the presentation is effectively unconstrained, resources are generally not. Allowing a user to have a few different sets of preferences is a nice feature, but if enough users abuse that privilege then the system could potentially be crippled in performance.


Any and all resources need to be bounded, but modifying those bounds should be easily configurable. We want both sides of this coin.
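To make both sides of that coin concrete, here is a hedged sketch in Python. The limit names and default numbers are invented for illustration, and JSON is just one assumed configuration format. Every bound has a default, and any of them can be overridden without touching the code.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Limits:
    # Hypothetical defaults; real values depend on the system's outputs.
    max_first_name_chars: int = 40
    max_preference_sets: int = 5


def load_limits(config_text: str) -> Limits:
    """Build Limits from JSON overrides; unknown keys raise TypeError."""
    overrides = json.loads(config_text) if config_text.strip() else {}
    return Limits(**overrides)


def check_first_name(name: str, limits: Limits) -> list[str]:
    if len(name) > limits.max_first_name_chars:
        return [f"first name exceeds {limits.max_first_name_chars} characters"]
    return []
```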


In an Object Oriented language, we might choose to implement these ideas by creating a new object for every unique underlying constrained type. We would also need an object for every unique Container and every unique Collection. All of them would essentially refuse to instantiate with bad data, returning a set of error messages. The upper-level code would simply leverage the low-level checks, building them up until all of the incoming data was finally processed. Thus validation and error handling become intertwined.
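A minimal sketch of this in Python, with hypothetical class names. Since a Python constructor cannot return errors, the sketch assumes an exception that carries the whole list of messages. The upper-level object builds its parts one at a time and merges their errors, so a caller sees every problem at once.

```python
class ValidationErrors(Exception):
    """Carries a list of messages, never just one."""

    def __init__(self, errors: list[str]):
        super().__init__("; ".join(errors))
        self.errors = list(errors)


class Rating:
    """Hypothetical constrained type: an integer in 1..10."""

    def __init__(self, raw: str):
        try:
            value = int(raw)
        except (TypeError, ValueError):
            raise ValidationErrors([f"{raw!r} is not a whole number"])
        if not 1 <= value <= 10:
            raise ValidationErrors([f"{value} is outside 1..10"])
        self.value = value


class Review:
    """Upper-level type that leverages the low-level checks."""

    def __init__(self, quality: str, service: str):
        errors: list[str] = []
        fields = {}
        for name, raw in (("quality", quality), ("service", service)):
            try:
                fields[name] = Rating(raw)
            except ValidationErrors as exc:
                errors.extend(f"{name}: {msg}" for msg in exc.errors)
        if errors:
            raise ValidationErrors(errors)  # refuse to instantiate
        self.quality = fields["quality"]
        self.service = fields["service"]
```

Once a Review exists, the rest of the code can pass it around without rechecking either rating.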


Now, this might seem like a huge number of objects, but once each one was completed it would be highly leverageable and extremely trustworthy. Each time the system is extended, the work would actually get faster, since the increasing majority of the system’s finite data would already have been both crafted and tested. Building up a system in this manner is initially slower, but the growing reductions in bugs, testing, coding, etc. ultimately win the day.


It is also possible to refactor an existing system into this approach, rather mindlessly, by gradually seeking out any primitives or language objects and slowly replacing them. This type of non-destructive refactoring isn’t particularly enjoyable, but it's safe work to do when you don’t feel like thinking, and it is well worth it as the system grows.


It is also quite possible to apply this idea to other programming paradigms; in a sense this is really just a tighter variation on the standard data structure ideas. As well as primitive functions to access and modify the data, there is also a means to validate and return zero or more errors. Processing only moves forward when the error count is zero.


Now one big hiccup I haven’t yet mentioned is cross-variable type constraints. The most common example is that a user selecting Canada for country should then be able to pick from a list of Provinces, but if they select the USA, it should be States. We don’t want them to have to select from the rather huge, combined set of all sub-country breakdowns; that would be awful. The secondary dependent variable effectively changes its enumerated data type based on the primary variable. For practical reasons, I am going to assume that all such inter-type relationships are invertible (mostly to avoid confusing the users). So we can cope with these types of constraints by making the higher-level types polymorphic down to some common base type. This might happen as a union in some programming languages or an underlying abstract object in others. Then the primary validation would have to check the variable, then switch on its associated sub-type validation for every secondary variable. This can get a bit complex to generalize in practice, but it is handled by essentially moving up to the idea of composite-variables that then know their sub-type inter-relationship mappings. In our case above, Location would be aware of Country and Province/State constraints, with each subtype knowing its own enumerations.
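The Location composite might be sketched in Python roughly as follows. The mapping and function name are hypothetical, and the enumerations are deliberately truncated to a few entries each. The primary variable is checked first, and its value then selects which enumeration the secondary variable is validated against.

```python
# Abbreviated, illustrative enumerations; a real system would hold full lists.
SUBDIVISIONS = {
    "Canada": {"label": "Province",
               "values": {"Ontario", "Quebec", "British Columbia"}},
    "USA": {"label": "State",
            "values": {"New York", "California", "Texas"}},
}


def validate_location(country: str, subdivision: str) -> list[str]:
    """Validate Country, then switch to its own sub-type enumeration."""
    entry = SUBDIVISIONS.get(country)
    if entry is None:
        # Without a valid primary there is nothing to check the secondary against.
        return [f"unknown country {country!r}"]
    if subdivision not in entry["values"]:
        return [f"{subdivision!r} is not a {entry['label']} of {country}"]
    return []
```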


So far everything I’ve talked about is with respect to static data models. This type of system has extremely rigid constraints on what it will and will not accept. That actually covers the bulk of most modern systems, but it is not very satisfying since we really need to build smarter and more adaptable technologies. The real world isn’t static so our systems to model and interact with it shouldn’t be either.


Fortunately, if we see these validations as code, and in this case easily computable code, then we can ‘nounify’ that code into data itself. That is, all of the rules to constrain the variables can be moved around as variables themselves. All we need to do is allow the outside to have the power to request extensions to our data model. We can do that by explicitly asking for the specific data type, the structural changes and some reference or naming information. Outside, users or code can explicitly or even implicitly supply this information, which is passed around internally. When it does need to go external again, say to the database, essentially the inside code follows the exact same behavior in requesting a model extension. Any system that supports this basic dynamic model protocol will just be seamlessly interconnected. The sophistication of all of the parts will grow as the underlying data models are enhanced by users (with the obvious caveat that there needs to be some configurable approach to backfilling missing data as well).
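A very small sketch of rules-as-data, assuming Python and an invented spec format (plain dictionaries with a "type" key, which could equally arrive as JSON from the outside). The names extend_model, validate and CHECKERS are hypothetical. The point is only that the model is a value that can be extended at runtime, while a fixed set of interpreters applies whatever rules it currently holds.

```python
import re


def _check_int(spec: dict, raw: str) -> list[str]:
    try:
        value = int(raw)
    except ValueError:
        return [f"{raw!r} is not a whole number"]
    if any(low <= value <= high for low, high in spec["ranges"]):
        return []
    return [f"{value} is outside the allowed ranges {spec['ranges']}"]


def _check_string(spec: dict, raw: str) -> list[str]:
    errors = []
    if len(raw) > spec["max_length"]:
        errors.append(f"longer than {spec['max_length']} characters")
    if re.fullmatch(spec["pattern"], raw) is None:
        errors.append(f"does not match {spec['pattern']!r}")
    return errors


# One interpreter per supported constraint type.
CHECKERS = {"int": _check_int, "string": _check_string}


def extend_model(model: dict, name: str, spec: dict) -> None:
    """Accept a request to add a named, constrained field to the model."""
    if spec.get("type") not in CHECKERS:
        raise ValueError(f"unsupported constraint type {spec.get('type')!r}")
    model[name] = spec


def validate(model: dict, name: str, raw: str) -> list[str]:
    spec = model.get(name)
    if spec is None:
        return [f"unknown field {name!r}"]
    return CHECKERS[spec["type"]](spec, raw)
```

A request such as `extend_model(model, "rating", {"type": "int", "ranges": [[1, 10], [20, 40]]})` adds a new constrained variable without any new validation code being written. The extension request itself is outside data too, so a fuller version would validate the spec more thoroughly than this sketch does.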


We sort of do this right now, when we drive domain data from persistence. Using Country as an example again, many systems will store this type of enumerated set in a domain table in the database and initialize it at startup. More complex systems might even have an event mechanism to notify the runtime code to re-initialize. These types of system features are useful both for constraining the initial implementation and for being able to adjust it on-the-fly. Rather than do this on an ad hoc basis though, it would be great if we could just apply it to the whole data model, to be used as needed.
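As an illustration of the domain table pattern, here is a sketch in Python using the standard sqlite3 module. The table name `country`, its `name` column and the CountryDomain class are all assumptions made for the example. The `reload` method stands in for whatever the event mechanism would call when the table changes.

```python
import sqlite3


class CountryDomain:
    """Enumerated set driven by a (hypothetical) 'country' domain table."""

    def __init__(self, conn: sqlite3.Connection):
        self._conn = conn
        self._values: frozenset = frozenset()
        self.reload()  # initialize at startup

    def reload(self) -> None:
        """Re-read the table; call this when notified of a change."""
        rows = self._conn.execute("SELECT name FROM country").fetchall()
        self._values = frozenset(row[0] for row in rows)

    def validate(self, raw: str) -> list[str]:
        if raw in self._values:
            return []
        return [f"{raw!r} is not a known country"]
```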


Right now we spend an inordinate amount of effort hacking in partial checks and balances in random locations to inconsistently deal with runtime data quality. We can massively reduce that amount of work, making it easier and faster to code, by sticking to a fairly simple model of validation. It also has the nifty side-effect that it will reduce both bloat and CPU usage. Obviously, we’ve known this since at least the mid-80s; it just keeps getting lost and partially reinvented.