Saturday, December 9, 2017

Sophistication

Computers aren’t smart, but they are great at remembering things and can be both quick and precise. Those qualities can be quite helpful in a fast-paced modern world. What we’d like is for the computer to take care of our problems, while deferring to us on how or when this work should be executed. What we’d like, then, is for our software running on those computers to be ‘sophisticated’ enough to make our lives easier.

A simple example of this is a todo list. Let’s say that this particular version can schedule conference calls. It sends out a time and medium to a list of participants, then collects back their responses. If enough or the right people agree, then the meeting is scheduled. That’s somewhat helpful, but the program could go further. If it is tied to the phone used in the teleconference, it could detect that the current discussion is getting close to going over time and will impact the schedule of any following meetings. At that point, it could discreetly signal the user and inquire if any of the following meetings should be rescheduled. A quite smart version of this might even negotiate with all of the future participants to reoptimize the new schedule to meet most of their demands all on its own.

That type of program would allow you to lean on it in the way that you might rely on a trusted human secretary. They often understand enough of the working context to be able to take some of the cognitive load off their boss, for specific issues. The program's scope is to understand all of the interactions and how they intertwine and to ensure that any rescheduling meets most of the constraints. In other words, it’s not just a brain-dead todo app; it doesn’t just blissfully kick off timers and display dialogues with buttons; instead, it has a sophisticated model of how you are communicating with other people and the ability to rearrange these interactions if one or more of those meetings exceed some threshold. So it’s not really intelligent, but it is far more sophisticated than just an egg timer.

It might seem that it would be a complicated program to write, but that upper-level logic isn’t hard. If the current meeting is running late, prompt the user and possibly reschedule. It just needs to know the current meeting status, have a concept of late, be able to interact with the user and then reschedule the other meetings.

A programmer might easily get lost in the details that would be necessary to craft this logic. It might, for example, be an app running on a phone that schedules a lot of different mediums, like conference calls, physical meetings, etc. It would need timers to check the meeting status, an interface to prompt the user and a lot of code to deal with the intrinsic complexities of quickly rescheduling, including plenty of data to deal with conflicts, unavailable contacts, etc.

It is certainly a non-trivial piece of code, and to my knowledge, although parts of it probably appear spread across lots of different software products, none of them put it all together in one place. As one giant piece of code, it would be quite the tangle of timers, events, widgets, protocols, resources, contacts, business logic, conditional blocks of code and plenty of loops. It would be so complicated that it would be nearly impossible to get right and brutal to test. Still, conceptually it’s not rocket science, even if it is a lot of individual moving parts. It’s just sophisticated.

So how would we go about building something this sophisticated? Obviously, as one huge list of instructions, it would be impossible. But clearly, we can describe the higher level logic quite easily. So what we need to do is to make it easy to encode that logic as described:

If the current meeting is running late, prompt the user and possibly reschedule.

If that is all there is to the logic, then the problem is simple. We basically have four ‘parts’ that need to be tied together in a higher level block of code. The current meeting is just some data with respect to time. Running late is a periodic event check. Prompting the user is a simple interface and rescheduling, while tricky, is a manageable algorithm. Each of these components wraps code and data. This is where encapsulation becomes critical. For each part, if it is a well-defined black box that really encapsulates all of the underlying issues, then we are able to build that sophistication on top. If not, then we’ll get lost in the details.
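
To make that concrete, here is a rough sketch in Python of what that upper-level block might look like. Every name in it (check_schedule, is_running_late, meetings_after, reschedule and so on) is a hypothetical stand-in for one of those four encapsulated parts, not a real API; the point is only how thin the top layer can be when the parts below it are well encapsulated.

    # A rough sketch of the upper-level logic; every name here is a
    # hypothetical stand-in for one of the four encapsulated parts.

    def check_schedule(current_meeting, calendar, ui):
        """If the current meeting is running late, prompt the user and
        possibly reschedule the meetings that follow."""
        if not current_meeting.is_running_late():   # the periodic 'running late' check
            return
        upcoming = calendar.meetings_after(current_meeting)
        if ui.confirm(f"This meeting is running over. Reschedule {len(upcoming)} upcoming meeting(s)?"):
            calendar.reschedule(upcoming)           # the tricky, but contained, algorithm

If any of the underlying details leak upward into this little block, it bloats and the sophistication is lost.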

Some of the underlying data is shared, so the data captured and its structure need to span these different components. These components are not independent. The current meeting, for example, overlaps with the rescheduling, but in a way that both rely on the underlying list of contacts, meeting schedule and communications medium. That implies that under those two components there are at least three more. Having both higher level components anchor on the same underlying code and data ensures that they will interoperate correctly; thus we need to acknowledge these dependencies and even leverage them as reusable sub-components.

And of course, we want each of these upper-level components to be as polymorphic as possible. That is, we want a general concept of medium, without having to worry about whether that means a teleconference, video conference or a physical meeting. We don’t need to care, so we shouldn’t care.
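
A small, hypothetical sketch of that polymorphism: the scheduling logic only ever sees the general Medium, and the concrete kinds of meeting plug in underneath it.

    from abc import ABC, abstractmethod

    class Medium(ABC):
        """The general concept of a meeting medium; the upper-level
        scheduling code only ever talks to this interface."""
        @abstractmethod
        def invite(self, participants, time):
            ...

    class ConferenceCall(Medium):
        def invite(self, participants, time):
            pass  # send dial-in details, bridge numbers, etc.

    class PhysicalMeeting(Medium):
        def invite(self, participants, time):
            pass  # book a room, send the location, etc.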

So, in order to achieve sophistication, we clearly need encapsulation and polymorphism. These three are all tied together, and the last two fall directly under the banner of abstraction. This makes a great deal of sense in that for large programs to build up the necessary high-level capabilities you have to be both tightly organized and able to generalize at that higher level.

Without organization and generalization, the code will grow so complex that it will eventually become unmanageable by humans. There is a real physical limit to our ability to visualize larger and larger contexts. If we fail to understand the behavior of enough code, then we can no longer predict what it will do, which essentially is the definition of a fragile and buggy system.

It is worth noting here that we quite easily get to a reasonable decomposition by going at the problem from the top down, but we are actually just building up well-encapsulated components from the bottom up. This contradiction often scares software developers away from wanting to utilize concepts like abstraction, encapsulation, polymorphism, generalization, etc. in their efforts, but without them, they are really limited in what they can achieve.

Software that isn’t sophisticated is really just a fancy means of displaying collected data. Initially, that was useful, but as we collect so much more data from so many sources, having that distributed across dozens of different systems and interfaces becomes increasingly useless. It shifts the burden of stitching it all back together over to the user, and that too will eventually exceed their capacity. So all of that extra effort to collect that data is heavily diminished. It doesn’t solve problems, rather it just shifts them around.

What the next generation of software needs to do is to acknowledge these failings, and start leveraging the basic qualities of computers to be able to really help the users with sophisticated programs. But to get there, we can’t just pound out brute force code on a keyboard. Our means of designing and building this new generation of software has to become sophisticated as well. We have to employ higher-level approaches to designing and building systems. We need to change the way we approach the work.

Sunday, November 19, 2017

Bombproof Data Entry

The title of this post is from a magazine article that I barely remember reading back in the mid-80s. Although I’ve forgotten much of what it said, its underlying points have stuck with me all of these decades.


We can think of any program as having an ‘inside’ and an ‘outside’. The inside of the program is any and all code that a development team has control of; that they can actively change. The outside is everything else. That includes users, other systems, libraries, the OS, shared resources, databases, etc. It even includes code from related internal teams that is effectively unchangeable by the team in question. Most of the code utilized in large systems is really outside of the program; often there are effectively billions of lines, written by thousands of coders, that could get executed.


The idea is that any data coming from the outside needs to be checked first, and rejected if it is not valid. Bad data should never circulate inside a program.


Each datum within a program is finite. That is, for any given variable, there are only a finite number of different possible values that it can hold. For a data type like floats, there may be a mass of different possibilities, but it is still finite. Any set of data, like a string of characters, can essentially be infinite in size, but practically it is usually bounded by external constraints. In that sense, although we can collect a huge depth of permutations, the breadth of the data is usually limited. We can use that attribute to our advantage.


In all programming languages, for a variable like an integer, we allow any value between the minimum and the maximum, and there are usually some out-of-band signals possible as well, like NULL or Overflow. However, most times when we use an integer, only a tiny fraction of this set is acceptable. We might, for example, only want to collect an integer between one and ten. So what we really want to do as the data comes in from the outside is to reject all other possibilities. We only want a variable to hold our tiny little subset, e.g. ‘int1..10’. The tightest possible constraint.


Now, it would be nice if the programming language would allow us to explicitly declare this data type variation and to nicely reject any outside data appropriately, but I believe for all modern languages we have to do this ourselves at runtime. Then, for example, the user would enter a number into a widget in the GUI, but before we accept that data we would call a function on it and possibly trigger a set of Validation Errors (errors on data should never be singular; they should always be sets) if our exact constraints are not met.
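
As a minimal sketch in Python, assuming a hypothetical RangedInt type and a widget that hands us raw text, the runtime check might look something like this; the important part is that validation always answers with a set of errors, never just the first one.

    class RangedInt:
        """A hypothetical 'int1..10'-style type: only values inside the
        declared subset are ever allowed inside the program."""
        def __init__(self, low, high):
            self.low, self.high = low, high

        def validate(self, raw):
            """Return a set of validation errors; an empty set means the value is acceptable."""
            errors = set()
            if raw is None:
                errors.add("value is missing")
                return errors
            try:
                value = int(raw)
            except (TypeError, ValueError):
                errors.add(f"'{raw}' is not an integer")
                return errors
            if not (self.low <= value <= self.high):
                errors.add(f"{value} is outside {self.low}..{self.high}")
            return errors

    # At the boundary: reject outside data before it circulates inside.
    rating = RangedInt(1, 10)
    errors = rating.validate("12")   # e.g. raw text from a GUI widget
    if errors:
        print(errors)                # a real UI would flag the offending widget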


Internally, we could pass around that variable ad nauseam, without having to ever check or copy it. Since it made it inside, it is now safe and exactly what is expected. If we needed to persist this data, we wouldn’t need to recheck it on the way to the database, but if the underlying schema wasn’t tightly in sync, we would need to check it on the way back in.


Overall, this gives us a pretty simple paradigm for both structuring all validation code, and for minimizing any data fiddling. However, dealing with only independent integers is easy. In practice, keeping to this philosophy is a bit more challenging, and sometimes requires deep contemplation and trade-offs.


The first hiccup comes from variables that need discontinuous ranges. That’s not too hard; we just need to think of them as concatenated, such as ‘int1..10,20..40’, and we can even allow for overlaps like ‘int1..10,5..7,35,40..45’. Overlaps are not efficient, but not really problematic.


Of course for floating point, we get the standard mathematical range issues of open/closed, which might look like ‘float[0..1)’, noting of course that my fake notation now clashes with the standard array notation in most languages and that that ambiguity would need to be properly addressed.
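
Both variations fit the same shape. In this sketch (again, with invented names) a constraint is just a list of intervals, each of which may be open or closed at either end, so ‘int1..10,20..40’ and ‘float[0..1)’ are handled by the same little mechanism.

    # A sketch of discontinuous and open/closed ranges; each constraint
    # is simply a list of intervals.

    class Interval:
        def __init__(self, low, high, low_open=False, high_open=False):
            self.low, self.high = low, high
            self.low_open, self.high_open = low_open, high_open

        def contains(self, value):
            above = value > self.low if self.low_open else value >= self.low
            below = value < self.high if self.high_open else value <= self.high
            return above and below

    class RangedNumber:
        """Accepts a value only if it falls inside at least one interval;
        overlapping intervals are redundant but harmless."""
        def __init__(self, *intervals):
            self.intervals = intervals

        def validate(self, value):
            if any(i.contains(value) for i in self.intervals):
                return set()
            return {f"{value} is outside the allowed ranges"}

    small_or_medium = RangedNumber(Interval(1, 10), Interval(20, 40))   # 'int1..10,20..40'
    unit = RangedNumber(Interval(0.0, 1.0, high_open=True))             # 'float[0..1)'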


Strings might seem difficult as well, but if we take them to be containers of characters and realize that regular expressions do mostly articulate their constraints, we get data types like ‘string*ab?’ which rather nicely restrict their usage. Extending that, we can quite easily apply wildcards and set theory to any sort of container of underlying types. In Data Modeling, I also discussed internal structural relationships such as trees, which can be described and enforced as well. That then mixes with what I wrote in Containers, Collections and Null, so that we can specify variable inter-dependencies and external structure. That’s almost the full breadth of a static model.
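
A hedged sketch of that, using a regular expression as the constraint; the ConstrainedString type here is invented for illustration, and the optional maximum length anticipates the external bounds discussed further down.

    import re

    class ConstrainedString:
        """A string whose contents must match a regular expression,
        in the spirit of a 'string*ab?'-style constraint."""
        def __init__(self, pattern, max_length=None):
            self.pattern = re.compile(pattern)
            self.max_length = max_length

        def validate(self, raw):
            errors = set()
            if self.max_length is not None and len(raw) > self.max_length:
                errors.add(f"longer than {self.max_length} characters")
            if not self.pattern.fullmatch(raw):
                errors.add(f"'{raw}' does not match the expected format")
            return errors

    # e.g. a code of two to four lowercase letters
    code = ConstrainedString(r"[a-z]{2,4}")
    print(code.validate("ab"))     # set() -> acceptable
    print(code.validate("AB12"))   # one error in the returned set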


In that way, the user might play with a set of widgets or canvas controls to construct some complicated data model which can be passed through validation and flagged with all of the constraint violations. With that strength of data entry, the rest of the program is nearly trivial. You just have to persist the data, then get it back into the widgets later when requested.


A long time ago, programmers tended towards hardcoding the size of their structures. Often that meant unexpected problems from running into these constraints, which usually involved having to quickly re-deploy the code. Over the decades, we’ve shifted far away from those types of limitations and shouldn’t go backwards. Still, the practical reality for most variable-sized data in a system is that there are external bounds that should be in place but are now often neglected.


For example, in a system that collects users’ names, if a key feature is to always produce high-quality documentation, such as direct client marketing materials, the need for the presentation to never be broken dictates the effective number of characters that can be used. You can’t, for example, have a first name that is 3000 characters long, since it would wrap over multiple lines in the output and look horrible. That type of unspecified data usage constraint is mostly ignored these days but commonly exists within most data models.


Even if the presentation is effectively unconstrained, resources are generally not. Allowing a user to have a few different sets of preferences is a nice feature, but if enough users abuse that privilege then the system could potentially be crippled in performance.


Any and all resources need to be bounded, but modifying those bounds should be easily configurable. We want both sides of this coin.


In an Object Oriented language, we might choose to implement these ideas by creating a new object for every unique underlying constrained type. We would also need an object for every unique Container and every unique Collection. All of them would essentially refuse to instantiate with bad data, returning a set of error messages. The upper-level code would simply leverage the low-level checks, building them up until all of the incoming data was finally processed. Thus validation and error handling become intertwined.
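
One way that might look in Python, using an invented Percentage type as the example; since a constructor cannot literally return a set of errors, the closest analogue is to refuse to complete and raise an exception that carries the full set of problems.

    class ValidationError(Exception):
        """Carries the full set of problems found, never just the first one."""
        def __init__(self, errors):
            super().__init__("; ".join(sorted(errors)))
            self.errors = errors

    class Percentage:
        """One object per constrained underlying type; it can never exist
        while holding bad data."""
        def __init__(self, raw):
            errors = set()
            try:
                value = float(raw)
            except (TypeError, ValueError):
                errors.add(f"'{raw}' is not a number")
            else:
                if not (0.0 <= value <= 100.0):
                    errors.add(f"{value} is outside 0..100")
            if errors:
                raise ValidationError(errors)   # refuse to instantiate
            self.value = value

    # Upper-level code simply leans on the low-level checks.
    try:
        discount = Percentage("150")
    except ValidationError as e:
        print(e.errors)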


Now, this might seem like a huge number of objects, but once each one was completed it would be highly leverageable and extremely trustworthy. Each time the system is extended, the work would actually get faster, since the increasing majority of the system’s finite data would already have been both crafted and tested. Building up a system in this manner is initially slower, but the growing reductions in bugs, testing, coding, etc. ultimately win the day.


It is also possible to refactor an existing system into this approach, rather mindlessly, by gradually seeking out any primitives or language objects and slowly replacing them. This type of non-destructive refactoring isn’t particularly enjoyable, but it's safe work to do when you don’t feel like thinking, and it is well worth it as the system grows.


It is also quite possible to apply this idea to other programming paradigms; in a sense, this is really just a tighter variation on the standard data structure ideas. As well as primitive functions to access and modify the data, there is also a means to validate and return zero or more errors. Processing only moves forward when the error count is zero.
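
In a plain procedural style, that could be as simple as a pair of functions per data structure, as in this hypothetical sketch:

    def validate_order(order):
        """Return zero or more errors for a plain dict-based order."""
        errors = []
        if order.get("quantity", 0) <= 0:
            errors.append("quantity must be positive")
        if not order.get("item"):
            errors.append("item is missing")
        return errors

    def process_order(order):
        errors = validate_order(order)
        if errors:                 # processing only moves forward on zero errors
            return errors
        # ... safe to persist, ship, etc.
        return []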


Now one big hiccup I haven’t yet mentioned is cross-variable type constraints. The most common example is that a user selecting Canada as their country should then be able to pick from a list of Provinces, but if they select the USA, it should be States. We don’t want them to have to select from the rather huge, combined set of all sub-country breakdowns; that would be awful. The secondary dependent variable effectively changes its enumerated data type based on the primary variable. For practical reasons, I am going to assume that all such inter-type relationships are invertible (mostly to avoid confusing the users). So we can cope with these types of constraints by making the higher-level types polymorphic down to some common base type. This might happen as a union in some programming languages or an underlying abstract object in others. Then the primary validation would have to check the variable, then switch on its associated sub-type validation for every secondary variable. This can get a bit complex to generalize in practice, but it is handled by essentially moving up to the idea of composite variables that then know their sub-type inter-relationship mappings. In our case above, Location would be aware of Country and Province/State constraints, with each subtype knowing their own enumerations.
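
A sketch of that composite variable, with a deliberately truncated set of countries and subdivisions used purely for illustration:

    # The primary variable (country) selects the enumeration that the
    # secondary variable (subdivision) must be validated against.
    SUBDIVISIONS = {
        "Canada": {"Ontario", "Quebec", "Alberta"},
        "USA": {"New York", "California", "Texas"},
    }

    class Location:
        """Knows the inter-relationship between Country and Province/State."""
        def __init__(self, country, subdivision):
            errors = self.validate(country, subdivision)
            if errors:
                raise ValueError(errors)
            self.country, self.subdivision = country, subdivision

        @staticmethod
        def validate(country, subdivision):
            errors = set()
            allowed = SUBDIVISIONS.get(country)
            if allowed is None:
                errors.add(f"unknown country '{country}'")
            elif subdivision not in allowed:
                errors.add(f"'{subdivision}' is not a valid subdivision of {country}")
            return errors

    print(Location.validate("Canada", "Ontario"))   # set()
    print(Location.validate("USA", "Ontario"))      # flags the mismatch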


So far everything I’ve talked about is with respect to static data models. This type of system has extremely rigid constraints on what it will and will not accept. That actually covers the bulk of most modern systems, but it is not very satisfying since we really need to build smarter and more adaptable technologies. The real world isn’t static so our systems to model and interact with it shouldn’t be either.


Fortunately, if we see these validations as code, and in this case easily computable code, then we can ‘nounify’ that code into data itself. That is, all of the rules to constrain the variables can be moved around as variables themselves. All we then need to do is allow the outside to have the power to request extensions to our data model. We can do that by explicitly asking for the specific data type, the structural changes and some reference or naming information. Outside, users or code can explicitly or even implicitly supply this information, which is passed around internally. When it does need to go external again, say to the database, essentially the inside code follows the exact same behavior in requesting a model extension. For any system that supports this basic dynamic model protocol, it will just be seamlessly interconnected. The sophistication of all of the parts will grow as the underlying data models are enhanced by users (with the obvious caveat that there needs to be some configurable approach to backfilling missing data as well).
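
A hedged sketch of that ‘nounification’: the constraints live in an ordinary data structure, so extending the model means adding an entry rather than writing new validation code. The rule vocabulary here is invented purely for illustration.

    # Constraints held as data; the model can be extended at runtime by
    # adding entries, rather than by deploying new validation code.
    MODEL = {
        "age":     {"type": "int",  "min": 0, "max": 150},
        "country": {"type": "enum", "values": {"Canada", "USA"}},
    }

    def validate(field, raw, model=MODEL):
        rule = model.get(field)
        if rule is None:
            return {f"unknown field '{field}'"}
        errors = set()
        if rule["type"] == "int":
            try:
                value = int(raw)
            except (TypeError, ValueError):
                return {f"'{raw}' is not an integer"}
            if not (rule["min"] <= value <= rule["max"]):
                errors.add(f"{value} is outside {rule['min']}..{rule['max']}")
        elif rule["type"] == "enum":
            if raw not in rule["values"]:
                errors.add(f"'{raw}' is not one of {sorted(rule['values'])}")
        return errors

    # An 'extension request' from outside just adds another rule.
    MODEL["rating"] = {"type": "int", "min": 1, "max": 5}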


We sort of do this right now, when we drive domain data from persistence. Using Country as an example again, many systems will store this type of enumerated set in a domain table in the database and initialize it at startup. More complex systems might even have an event mechanism to notify the runtime code to re-initialize. These types of system features are useful both for constraining the initial implementation and for being able to adjust it on the fly. Rather than do this on an ad hoc basis though, it would be great if we could just apply it to the whole data model, to be used as needed.


Right now we spend an inordinate amount of effort hacking in partial checks and balances in random locations to inconsistently deal with runtime data quality. We can massively reduce that amount of work, making it easier and faster to code, by sticking to a fairly simple model of validation. It also has the nifty side-effect that it will reduce both bloat and CPU usage. Obviously, we’ve known this since at least the mid-80s, it just keeps getting lost and partially reinvented.

Saturday, October 28, 2017

Freedom

A big part of programming culture is the illusion of freedom. We pride ourselves on how we can tackle modern problems by creatively building new and innovative stuff. When we build for other programmers, we offer a mass of possible options because we don’t want to be the ones who restrict their freedoms.


We generally believe that absolutely every piece of code is malleable and that releasing code quickly is preferable over stewing on it for a long time. In the last decade, we have also added in that coding should be fun, as a fast-paced, dynamic response to the ever-changing and unknowable needs of the users. Although most of us are aware that the full range of programming knowledge far exceeds the capacity of our brains, we still express overconfidence in our abilities while contradictorily treating a lot of the underlying mechanics as magic.


Freedom is one of our deepest base principles. It appears to have its origin in the accelerated growth of the industry and the fact that every 5 to 10 years some brand new, fresh, underlying technology springs to life. Thus, most programmers believe, the past holds very little knowledge that can help with these new, emerging technologies.


That, of course, is what we are told to believe and it has been passed down from generations of earlier programmers. Rather than fading, as the industry ages, it has been strengthening over the last few decades.


However, if we examine programming closely, it does not match that delusion. The freedoms that we proclaim and try so hard to protect are the very freedoms that keep us from moving forward. Although we’ve seen countless attempts to reverse this slide, as the industry grows most new generations of programmers cling to these beliefs and we actively ignore or push out those that don’t.


Quite obviously, if we want a program to even be mildly ‘readable’, we don’t actually have the freedom to name the variables anything we want. They are bound to the domain; being inconsistent or misusing terminology breeds confusion. It's really quite evident that if a piece of data flows through a single code base, it should always have the same name everywhere, but each individual programmer doesn’t want to spend their time finding the names of existing things, so they call it whatever they want. Any additional programmers then have to cope with multiple shifting names for the exact same thing, thus making it hard to quickly understand the flow.


So there isn’t really any freedom to name variables ‘whatever’. If the code is going to be readable then names need to be correct, and in a large system, those names may have been picked years or even decades earlier. We’ve always talked a lot about following standards and conventions, but even cursory inspections of most large code bases show that it usually fails in practice.


We love the freedom to structure the code in any way we choose. To creatively come up with a new twist on structuring the solution. But that only really exists at the beginning of a large project. In order to organize and build most systems, as well as handle errors properly, there aren’t really many degrees of freedom. There needs to be a high-level structure that binds similar code together in an organized way. If that doesn’t exist, then it’s just a random pile of loosely associated code, with some ugly spaghetti interdependencies. Freely tossing code anywhere will just devalue it, often to the point of hopelessness. Putting it with other, similar pieces of code will make it easy to find and correct or enhance later.


We have a similar issue with error handling. It is treated as the last part of the effort, to be sloppily bolted on top. The focus is just on getting the main logic flow to run. But the constraints of adding proper error handling have a deep and meaningful impact on the structure of the underlying code. To handle errors properly, you can’t just keep patching the code at whatever level, whenever you discover a new bug. There needs to be a philosophy or structure that is thought out in advance of writing the code. Without that, the error handling is just arbitrary, fragile and prone to not working when it is most needed. There is little freedom in properly handling errors; the underlying technology, the language and the constraints of the domain mostly dictate how the code must be structured if it is going to be reliable.


All underlying technologies have their own inherent strengths and weaknesses. If you build on top of badly applied or weak components, they set the maximum quality of your system. If you, for instance, just toss some randomish denormalized tables into a jumble in an RDBMS, no matter how nice the code that sits on top, the system is basically an untrustworthy, unfixable mess. If you fix the schema problems, all of the other code has to change. Everything. And it will take at least twice as long to understand and correct it as it would to just throw it away and start over. The value of code that ‘should’ be thrown away is near zero or can even be negative.


So we really don’t have the freedom to just pick any old technology and to use it in any old, creative, way we choose. That these technologies exist and can or should be used, bounds their usage and ultimately the results as well. Not every tech stack is usable for every system and picking one imposes severe, mostly unchangeable limitations. If you try to build a system for millions of users on top of a framework that can only really handle hundreds of them, it will fail. Reliably.


It might seem like we have the freedom to craft new algorithms, but even here it is exceptionally limited as well. If, for example, you decide to roll your own algorithm from scratch, you are basically taking a huge step backward, maybe decades. Any existing implementations may not be perfect, but generally, the authors have based their work on previous work, so that knowledge percolates throughout the code. If they didn’t do that research then you might be able to do better, but you’ll still need to at least know and understand more than the original author. In an oversimplified example, one can read up on sorting algorithms and implement a new variant or just rely on the one in their language’s library. Trying to work it out from first principles, however, will take years.


To get something reasonable, there isn’t much latitude, thus no freedom. The programmers that have specialized in implementing algorithms have built up their skills by having deep knowledge of possibly hundreds of algorithms. To keep up, you have to find that knowledge, absorb it and then be able to practically apply it in the specific circumstance. Most programmers aren’t willing to do enough legwork; thus what they create is crude. At least one step backward. So, practically, they aren’t free to craft their own versions; they are chained by time and by their own capabilities.


When building for other programmers it is fun to add in a whole boatload of cool options for them to play with, but that is only really useful if changing any subset of options is reliable. If the bulk of the permutations have not been tested and changing them is very likely to trigger bugs, then the options themselves are useless. Just wasted make-work. Software should always default to the most reasonable, expected options and the code that is the most useful fully encapsulates the underlying processing.


So, options are not really that useful, in most cases. In fact, in the name of sanity, most projects should mostly build on the closest deployments to the default. Reconfiguring the stuff to be extra fancy increases the odds of running into really bad problems later. You can’t trust that the underlying technology will hold any invariants over time, so you just can’t rely on them and it should be a very big and mostly unwanted decision to make any changes away from the default. Unfortunately, the freedom to fiddle with the configurations is tied to causing seriously bad long-term problems. Experience makes one extraordinarily suspicious of moving away from the default unless your back is up against the wall.


As far as interfaces go, the best ones are the ones that the user expects. A stupidly clever new alternative on a date selection widget, for example, is unlikely to be appreciated by the users, even if it does add some new little novel capability. That break from expectations generally does not offset a small increase in functionality. Thus, most of the time, the best interface is the one that is most consistent with the behaviors of the other software that the user uses. People want to get their work done, not to be confused by some new ‘odd’ way of doing things. In time, they do adjust, but for them to actually ‘like’ these new features, they have to be comfortably close to the norm. Thus, for most systems, most of the time, the user’s expectations trump almost all interface design freedoms. You can build something innovative, but don’t be surprised if the users complain and gradually over time they push hard to bring it back to the norm. Only at the beginning of some new technological shift is there any real freedom, and these are also really constrained by previous technologies and by human-based design issues.


If we dig deeply into software development, it becomes clear that it isn’t really as free as we like to believe. Most experienced programmers who have worked on the same code bases for years have seen that over time the development tends to be refined, and it is driven back towards ‘best practices’ or at least established practices. Believing there is freedom and choosing some eclectic means of handling the problems is usually perceived as ‘technical debt’. So ironically, we write quite often about these constraints, but then act and proceed like they do not exist. When we pursue false freedoms, we negatively affect the results of the work. That is, if you randomly name all of the variables in a system, you are just shooting yourself in the foot. It is a self-inflicted injury.


Instead, we need to accept that for the bulk of our code, we aren’t free to creatively whack it out however we please. If we want good, professional, quality code, we have to follow fairly strict guidelines, and do plenty of research, in order to ensure that result.


Yes, that does indeed make coding boring. And most of the code that programmers write should be boring. Coding is not and never will be the exciting creative part of building large systems. It’s usually just the hard, tedious, pedantic part of bringing a software system to life. It’s the ‘working’ part of the job.


But that is not to say that building software is boring. Rather, it is that the dynamic, creative parts of the process come from exploring and analyzing the domain, and from working around those boundaries in the design. Design and analysis, for most people, are the interesting parts. We intrinsically know this, because we’ve so often seen attempts to inject them directly into coding. Collectively, most programmers want all three parts to remain blurred together so that they don’t just end up as coding monkeys. My take is that we desperately need to separate them, but software developers should be trained and able to span all three. That is, you might start out coding from boring specs, learn enough to handle properly designing stuff and in time when you’ve got some experience in a given domain, go out and actually meet and talk to the users. One can start as a programmer, but they should grow into being a software developer as time progresses. Also, one should never confuse the two. If you promote a programmer to run a large project, they don’t necessarily have all of the necessary design and analysis skills, no matter how long they have been coding. Missing skills spell disaster.

Freedom is a great myth and selling point for new people entering the industry, but it really is an illusion. It is sometimes there, at very early stages, with new technologies, but even then it is often false. If we can finally get over its lack of existence, and stop trying to make a game out of the process, we can move forward on trying to collect and enhance our already massive amount of knowledge into something that is more reliably applied. That is, we can stop flailing at the keyboards and start building better, more trustworthy systems that actually do make our lives better, instead of just slowly driving everyone nuts with this continuing madness and sloppy results we currently deliver.

Sunday, October 15, 2017

Efficiency and Quality

The strongest process for building software would be to focus on getting the job done efficiently while trying to maximize the quality of the software.

The two objectives are not independent, although it is easy to miss their connection. In its simplest terms, efficiency would be the least amount of work to get the software into production. Quality then might seem to be able to range freely.

However, at one end of the spectrum, if the quality is really bad, we do know that it sucks in tonnes of effort to just patch it up enough to barely work. So, really awful quality obviously kills efficiency.

At the other end though, it most often seems like quality rests on an exponential scale. That is, getting the software to be 1% better requires something like 10x more work. At some point, perfection is likely asymptotic, so dumping mass amounts of effort into just getting trivial increases in quality also doesn’t seem to be particularly efficient either.

Are they separated in the middle?

That too is unlikely in that quality seems to be an indirect byproduct of organization. That is, if the team is really well-organized, then their work will probably produce a rather consistent quality. If they are disorganized, then at times the quality will be poor, which drags down the quality of the whole. Disorganization seems to breed inefficiency.

From practical experience, mostly I’ve seen that rampant inefficiency results in consistently poor quality. They go hand in hand. A development team not doing the right things at the right time will not accidentally produce the right product. And they certainly won’t do it efficiently.

How do you increase both?

One of the key problems with our industry is the assumption that ‘code’ is the only vital ingredient in software. That, if only there was more ‘code’, everything would be fixed. Code, however, is just a manifestation of the collected understanding of the problem to solve. It’s the final product, but certainly not the only place where quality and efficiency are noticeable.

Software has 5 distinct stages: it goes from analysis to design, then into coding and testing and finally it is deployed. All five stages must function correctly for the effort to be efficient. Thus, even if the coding was fabulous, the software might still be crap.

Little or hasty analysis usually results in unexpected surprises. These not only eat into efficiency but also force short-cuts which drag down quality. If you don’t know what you are supposed to build or what data it will hold then you’ll end up just flailing around in circles hoping to get it right.

Designing a system is mostly organization. A good design lays out strong lines between the pieces and ensures that each line of code is located properly. That helps in construction and in testing. It’s easier to test well-written code, and it’s inordinately cheaper to test code if you properly know the precise impact of any of the changes.

On top of that you need to rely on the right technologies to satisfy any engineering constraints. If you pick the wrong technologies then no amount of effort will ever meet the specifications. That’s a key aspect of design.

If the operational requirements never made it into the analysis and design, the costs of supporting the system will be extraordinary, which will eventually eat away at the resources available for the other stages. A badly performing system will rapidly drag down its own development.

Realistically, a focus on efficiency, across all of the stages, is the best way to ensure maximum quality. There are other factors, of course, but unless the software development process is smoothly humming along, they aren’t going to make a significant impact. You have to get the base software production flow correct first, then focus on refining it to produce high-quality output. It’s not just about coding. Generating more bad code is never going to help.

Sunday, October 1, 2017

Rainy Days

When first we practice to code, we do, of course, worry most about branching between different instructions and repeating similar blocks of work over and over again.

In time, we move on to longer and more complex manipulations.

Once those start to build up, we work to cobble them together. Continually building up larger and larger arrangements, targeting bigger features within the software.

That’s nice and all, but since we’ve usually only focused on getting the code to run, it really is only reliable on ‘sunny’ days. Days when there are no clouds, no rain, no wind, etc. They are warm and pleasant. When everything just works.

Code, however, needs to withstand all of the elements; it runs in a cruel, cruel world where plenty of unexpected things occur at regular frequencies. Storms come often, without warning.

Rainy day coding is a rather different problem to solve. It means expecting the worst and planning for it. It also means building the code in a way that its rainy day behavior is both predictable and easily explainable.

For instance, any and all resources might be temporarily unavailable. And they might go down in any number of permutations. What happens then should be easily explainable, and it should match that explanation. The code should also ensure that during the outages no data is lost, no user uninformed, no questions unanswered. It needs to be trustworthy.
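
As a small, hypothetical sketch in Python of that posture: a write that fails because the resource is down is never dropped, the user is told what is pending, and the work is retried later. All of the names here are invented.

    import time

    pending = []  # records we could not persist yet; nothing is ever dropped

    def save_with_retry(record, store, notify, attempts=3, delay=2.0):
        """Try to persist a record; if the resource is down, keep the data,
        tell the user what happened, and retry later."""
        for attempt in range(1, attempts + 1):
            try:
                store.save(record)              # the outside resource
                return True
            except ConnectionError:
                notify(f"Storage unavailable (attempt {attempt}); your data is safe and queued.")
                time.sleep(delay * attempt)     # back off a little more each time
        pending.append(record)                  # hold it for a later retry, never discard it
        return False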

Any data coming from the outside might be wrong. Any events might occur in weird order. Any underlying technology might suddenly behave erratically. Any piece of hardware might fail. Anything at all could happen...

If that sounds like a lot of work, it is! It is at least an order of magnitude harder than sunny day coding. It involves a lot of digging to fully understand rainy days, or tornadoes, or hurricanes or even earthquakes. You can’t list out the correct instructions for a rainy day if you’ve never thought about or tried to understand them.

As software eats the world, it must live up to the cruelties that exist out there. When it doesn’t, it might be temporarily more convenient on the sunny days, but it can actually spawn off more work than it saves when it rains. The overall effect can be quite negative. We were gradually learning how to write robust systems, but as the underlying complexity spiraled out of control these skills have diminished. Too many sunny days and too little patience have driven us to rely on fragile systems while deliberately ignoring the consequences. If we want to get the most out of computers, we’ll have to change this...

Sunday, September 17, 2017

Decisions

We can model large endeavors as a series of decisions. Ultimately, their success relies on getting work completed, but the underlying effort cannot even be started until all of the preceding decisions are made. The work can be physical or it can be intellectual or it can even be creative.

If there are decisions that can be postponed, then we will adopt the convention that they refer to separate, but related pieces of work. They can occur serially with the later piece of work relying on some of the earlier decisions as well as the new ones. Some decisions are based on what is known up-front, while others can’t be made until an earlier dependent bit of work is completed.

For now, we’ll concentrate on a single piece of work that is dependent on a series of decisions. Later we can discuss parallel series and how they intertwine.

Decisions are rarely ever ‘right’ or ‘wrong’, so we need some other metric for them. In our case, we will use ‘quality’. We will take the decision relative to its current context and then talk about ‘better’ or ‘worse’ in terms of quality. A better decision will direct the work closer to the target of the endeavor, while a worse one will stray farther away. We’ll accept decisions as being a point in a continuous set and we can bound them between 0.0 and 100.0 for convenience. This allows us to adequately map them back to the grayness of the real world.

We can take the quality of any given decision in the series as being relative to the decision before it. That is, even if there are 3 ‘worse’ decisions in a row, the 4th one can be ‘better’. It is tied to the others, but it is made in a sub-context and only has a limited range of outcomes.

We could model this in a demented way as a continuous tree whose leaves fall onto the final continuous range of quality for the endeavor itself. So, if the goal is to build a small piece of software, at the end it has a specific quality that is made up from the quality of its subparts, which are directly driven by the decisions made to get each of them constructed. Some subparts will more heavily weight the results, but all of it contributes to the quality. With software, there is also a timeliness dimension, which in many cases would outweigh the code itself. A very late project could be deemed a complete failure.

To keep things clean, we can take each decision itself to be about one and only one degree of variability. If, for some underlying complex choice, there are many non-independent variables, then representing this as a set of decisions leaves room to understand that the one or more decision makers may not have realized the interdependence. Thus the collection of decisions together may not have been rational on the whole, even if all of the individual decisions were rational. In this sense, we need to define ‘rational’ as being full consideration of all things necessary or known relative to the current series of decisions. That is, one may make a rational decision in the middle of a rather irrational endeavor.

Any decision at any moment would be predicated both on an understanding of the past and the future. For the past, we accrue a great deal of knowledge about the world. All of it for a given subject would be its depth. An ‘outside’ oversimplification of it would be shallow. The overall understanding of the past would then be about the individual’s or group’s depth that they have in any knowledge necessary to make the decision. Less depth would rely on luck to get better quality. More depth would obviously decrease the amount of luck necessary.

Looking towards the future is a bit trickier. We can’t know the future, but we can know the current trajectory. That is, if uninterrupted, what happened in the past will continue. When interrupted, we assume that the interruption comes from a specific event. That event may have occurred in the past, so we have some indication that it is, say, a once-in-a-year event. Or once-in-a-decade. Obviously, if the decision makers have been around an area for a long time, then they will have experienced most of the lower frequency events, thus they can account for them. A person with only a year’s experience will get caught by surprise at a once-in-a-decade event. Most people will get caught by surprise at a once-in-a-century event. Getting caught by surprise is getting unlucky. In that sense, a decision that is low quality because of luck may indicate a lack of past or future understanding or both. Someone lucky may look like they possess more knowledge of both than they actually do.

If we have two parallel series of decisions, the best case is that they may be independent. Choices made on one side will have no effect on the work on the other side. But it is also possible that the decisions collide. One decision will interfere with another and thus cause some work to be done incorrectly or become unnecessary. These types of collisions can often be the by-product of disorganization. The decision makers on both sides are unaware of their overlap because the choice to pursue parallel effort was not deliberate.

This shows that some decisions are implicitly made by willingly or accidentally choosing to ignore specific aspects of the problem. So, if everyone focuses on a secondary issue instead of what is really important, that in itself is an implicit decision. If the choices are made by people without the prerequisite knowledge or experience, that too is an implicit decision.

Within this model then, even for a small endeavor, we can see that there are a huge number of implicit and explicit decisions and that in many cases it is likely that there are many more implicit ones made than explicit. If the people are inexperienced, then we expect the ratio to have a really high multiplier and that the quality of the outcome then relies more heavily on luck. If there is a great deal of experience, and the choices made are explicit, then the final quality is more reflective of the underlying abilities.

Now all decisions are made relative to their given context and we can categorize these across the range of being ‘strategic’ or ‘tactical’. Strategic decisions are either environmental or directional. That is, at the higher level someone has to set up the ‘game’ and point it in a direction. Then as we get nearer to the actual work, the choices become more tactical. They get grounded in the detail, become quite fine-grained and although they can have severe long-term implications, they are about getting things accomplished right now. Given any work, there are an endless number of tiny decisions that must be made by the worker, based on their current situation, the tactics and hopefully the strategic direction. So, with this, we get some notion of decisions having a scope and ultimately an impact. Some very poor choices by a low-level core programmer on a huge project, for instance, can have ramifications that literally last for decades, and that degrade any ability to make larger strategic decisions.

In that sense, for a large endeavor, the series of decisions made, both large and small, accumulate together to contribute to the final quality. For long running software projects, each recurring set of decisions for each release builds up not only a context, but also boundaries that intrinsically limit the best and worst possible outcomes. Thus a development project 10 years in the making is not going to radically shift direction since it is weighted down by all of its past decisions. Turning large endeavors slows as they build up more choices.

In terms of experience, it is important for everyone involved at the various levels to understand whether they have the knowledge and experience to make a particular decision, and whether they are the right person at the right level to make it. An upper-level strategic choice to set some weird programming convention, for example, is probably not appropriate if the management does not understand the consequences. A lower-level choice to shove in some new technology that is not in line with the overall direction is equally dubious. As the work progresses, bad decisions at any level will reverberate throughout the endeavor, generally reducing quality but also the possible effectiveness of any upcoming decisions. In this way, one strong quality of a ‘highly effective’ team is that the right decisions get made at the right level by the right people.

The converse is also true. In a project that has derailed, a careful study of the accumulated decisions can lead one back through a series of bad choices. That can be traced way back, far enough, to find earlier decisions that should have taken better options.

It’s extraordinarily hard to choose to reverse a choice made years ago, but if it is understood that the other outcomes will never be positive enough, it can be easier to weigh all of the future options correctly. While this is understandable, in practice we rarely see people with the context, knowledge, tolerance and experience to safely make these types of radical decisions. More often, they just stick to the same gradually decaying trajectory or rely entirely on pure luck to reset the direction. Thus the status quo is preserved or ‘change’ is pursued without any real sense of whether it might be actually worse. We tend to ping-pong between these extremities.

This model then seems to help understand why complexity just grows and why it is so hard to tame. At some point, things can become so complex that the knowledge and experience needed to fix them is well beyond the capacity of any single human. If many of the prior decisions were increasingly poor, then any sort of radical change will likely just make things worse. Unless one can objectively analyze both the complexity and the decisions leading to it, they cannot return to a limited set of decisions that need to be reversed properly, and so at some point, the amount of luck necessary will exceed that of winning a lottery ticket. In that sense, we have seen that arbitrary change and/or oversimplifications generally make things worse.

Once we accept that the decisions are the banks of the river that the work flows through, we can orient ourselves into trying to architect better outcomes. Given an endeavor proceeding really poorly, we might need to examine the faults and reset the processes, environment or decision makers appropriately to redirect the flow of work in a more productive direction. This doesn’t mean just focusing on the workers or just focusing on the management. All of these levels intertwine over time, so unwinding it enough to improve it is quite challenging. Still, it is likely that if we start with the little problems, and work them back upwards while tracking how the context grows, at some point we should be able to identify significant reversible poor decisions. If we revisit those, with enough knowledge and experience, we should be able to identify better choices. This gives us a means of taking feedback from the outcomes and reapplying it to the process so we can have some confidence that the effects of the change will be positive. That is, we really shouldn’t be relying on luck to make changes, but not doing so is far less than trivial and requires deeper knowledge.

Sunday, September 10, 2017

Some Rules

Big projects can be confusing, and people rarely have the time or energy to think deeply about their full intertwined complexity. Not thinking enough is often the start of serious problems.

Programmers would prefer to concentrate on one area at a time, blindly following established rules for the other areas. Most of our existing rules for software are unfortunately detached from each other or are far too inflexible, so I created yet another set:

  1. The work needs to get used
  2. There is never enough time to do everything properly
  3. Never lie to any users
  4. The code needs to be as readable as possible
  5. The system needs to be fully organized and encapsulated
  6. Don’t be clever, be smart
  7. Avoid redundancy, it is a waste of time

The work needs to get used


At the end of the day, stuff needs to get done. No software project will survive unless it actually produces usable software. Software is expensive, someone is paying for it. It needs backers and they need constant reassurance that the project is progressing. Fail at that, nothing else matters.

There is never enough time to do everything properly


Time is the #1 enemy of software projects. It’s lack of time that forces shortcuts which build up and spiral the technical debt out of control. Thus, using the extremely limited time wisely is crucial to surviving. All obvious make-work is a waste of time, so each bit of work that gets done needs to have a well-understood use; it can’t just be “because”. Each bit of code needs to do what it needs to do. Each piece of documentation needs to be read. All analysis and design needs to flow into helping to create actual code. The tests need to have the potential to find bugs.

Still, even without enough time, some parts of the system require more intensive focus, particularly if they are deeply embedded. Corners can be cut in some places, but not everywhere. Time lost for non-essential issues is time that is not available for this type of fundamental work. Building on a bad foundation will fail.

Never lie to any users


Lying causes misinformation, misinformation causes confusion, confusion causes a waste of time and resources, lack of resources causes panic and short-cuts which results in a disaster. It all starts with deviating from the truth. All of the code, data, interfaces, discussions, documentation, etc. should be honest with all users (including other programmers and operations people); get this wrong and a sea of chaos erupts. It is that simple. If someone is going to rely on the work, they need it to be truthful.

The code needs to be as readable as possible


The code in most systems is the only documentation that is actually trustworthy. In a huge system, most knowledge is long gone from people's memory, the documentation has gone stale, so if the code isn't readable that knowledge is basically forgotten and it has become dangerous. If you depend on the unknown and blind luck, expect severe problems. If, however, you can actually read the code quickly, life is a whole lot easier.

The system needs to be fully organized and encapsulated


If the organization of the code, configuration, documentation or data is a mess, you can never find the right parts to read or change when you need to. So it all needs organization, and as the project grows, the organizational schemes need to grow with it and they need organization too, which means that one must keep stacking more and higher levels of organization on top. It is never ending and it is part of the ongoing work that cannot be ignored. In huge systems, there is a lot of stuff that needs organization.

Some of the intrinsic complexity can be mitigated by creating black boxes; encapsulating sub-parts of the system. If these boxes are clean and well-thought out, they can be used easily at whatever level is necessary and the underlying details can be temporarily ignored. That makes development faster and builds up ‘reusable’ pieces which leverage the previous work, which saves time; and of course, not having time is the big problem. It takes plenty of thought and hard work to encapsulate, but it isn’t optional once the system gets large enough. Ignoring or procrastinating on this issue only makes the fundamental problems worse.

Don’t be clever, be smart


Languages, technologies and tools often allow for cute or clever hacks. That is fine for a quick proof of concept, but it is never industrial strength. Clever is the death of readability, organization and it is indirectly a subtle form of lying, so it causes lots of grief and eats lots of time. If something clever is completely and totally unavoidable then it must be well-documented and totally encapsulated. But it should be initially viewed as stupendously dangerous and wrong. Every other option should be tried first.

Avoid redundancy, it is a huge waste of time


It is easy to stop thinking and just type the same stuff in over and over again. Redundancy is convenient. But it is ultimately a massive waste of time. Most projects have far more time than they realize, but they waste it foolishly. Redundancy makes the system fragile and it reduces its quality. It generates more testing. It may sometimes seem like the easiest approach, but it isn’t. If you apply brute force to just pound out the ugliest code possible, while it is initially faster it is ultimately destructive.

Overall

There really is no finite, well-ordered set of one-size-fits-all rules for software development, but this set, in this order, will likely bypass most of the common problems we keep seeing over the decades. Still, software projects cannot be run without having to think deeply about the interrelated issues and to constantly correct for the ever-changing context. To keep a big project moving forward efficiently requires a lot of special skills and a large knowledge base of why projects fail. The only real existing way to gain these skills is to get mentored on a really successful project. At depth, many aspects of developing big systems are counter-intuitive to outsiders armed with oversimplifications. That is, with no strong, prior industry experience, most people will misunderstand where they need to focus their attention. If you are working hard on the wrong area of the project, the neglected areas are likely to fail.

Oftentimes, given an unexpectedly crazy context, many of the rules have to be temporarily dropped, but they should never be forgotten and all attempts should be made to return the project to a well-controlled, proactive state. Just accepting the madness and assuming that it should be that way is the road to failure.