Tuesday, July 6, 2010

Syntactic Noise

When programming, we usually know in some simplistic terms the behavior we expect from the computer. We can easily express this notion in some vague non-standard pseudo-code-like notation, such as:

for all bonds in the index
    calculate the yield
    add the weighted sum to the total
end
divide total by the number of bonds

This is, in a sense, the essence of the instructions that we want to perform, in a manner that is somewhat ambiguous to the computer, but clear enough that we can understand it.

By the time we’ve coded this in a format precise enough for the computer to correctly understand, the meaning has been obscured by a lot of strange symbols, odd terms and inelegant syntax:

public double calcIndex()
{
   double weightedSum = 0.0;
   int bondCount = 0;
   for (int i=0;i
       status = yieldCalc(bondInfo, Calculate.YIELD,
           true, bondFacts);

       if (status != true) {
           throw new 
               YieldCalculationException(getCalcError());
       }
       weightedSum += calcWeight(bondFacts, Weight.NORMAL);
       bondCount++;
   }
   return weightedSum / bondCount;
}

This is a considerably longer and more complex representation of the steps we want the computer to perform.

In the above contrived example, with the exception of the error handling and the flags for the calculation options, the code retains a close similarity to the pseudo-code. Still, even as close as it is, the added characters for the blocks, the function calls and the operators leave the original intent of the code somewhat obscured.
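To make the comparison concrete, here is a rough sketch of how the same loop might look with the status flag and calculation options pushed out of sight. The names `IndexSketch`, `Bond` and `weightedYield` are invented for this illustration, and the arithmetic is only a placeholder rather than a real yield calculation:

```java
import java.util.List;

class IndexSketch {

    // Hypothetical stand-in for whatever the real system knows about a bond.
    static final class Bond {
        final double yield;
        final double weight;

        Bond(double yield, double weight) {
            this.yield = yield;
            this.weight = weight;
        }
    }

    // Mirrors the pseudo-code: for all bonds in the index, add the weighted
    // value to the total, then divide the total by the number of bonds.
    static double calcIndex(List<Bond> index) {
        if (index.isEmpty()) {
            throw new IllegalArgumentException("index has no bonds");
        }
        double total = 0.0;
        for (Bond bond : index) {
            total += weightedYield(bond);
        }
        return total / index.size();
    }

    // Placeholder arithmetic (an assumption). A real version would call the
    // yield library here and deal with its options and status flag.
    static double weightedYield(Bond bond) {
        return bond.yield * bond.weight;
    }

    public static void main(String[] args) {
        List<Bond> index = List.of(new Bond(4.0, 0.5), new Bond(6.0, 1.0));
        System.out.println(calcIndex(index)); // prints 4.0
    }
}
```

The body of `calcIndex` now reads almost line for line like the pseudo-code, and the messy details have not gone away; they have just moved behind a single method name.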

Programmers learn to see through this additional syntactic noise in their own code, but it becomes a significant factor in making their work less readable to others.

Now, my above example is pretty light in this regard. Most computer languages allow users to express and compact their code using an unlimited amount of noise. The ultimate example is the Obfuscated C contest, where entries often make extreme use of the language’s preprocessor, cpp.

Used well, cpp can allow programmers to encapsulate syntactic noise into a simple macro that gets expanded before the code is compiled. Used poorly, macros can contribute to strange bugs and misleading error messages.

For example, a decade (or so) ago, I used something like:

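/* Declares 'last' at function scope so that RETURN can see it. */
#define BEGIN(function) \
    const char last[] = #function; \
    TRACE("Entered", __LINE__, __FILE__, #function)

/* No trailing semicolon, so RETURN(x); behaves as a single statement. */
#define RETURN(value) do {\
    TRACE("Exited", last);\
    return value;\
    } while(0)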

in C for each function call, so that the code could be easily profiled and each function call used the same consistent methods for handling its tracing. (The original was quite a bit more complex; I’ve forgotten most of what was there.)

The cost was the requirement for all code to look like:

int functionCall(int x, int y) {
   BEGIN(functionCall);
       ...
       ...
   RETURN(output);
}

But for the minor extra work and discipline, the result was the ability to easily and quietly encapsulate a lot of really powerful tracing ‘scaffolding’ into the code as needed. More importantly the macros hide a lot of the ugly syntactic noise such as language constants like __FILE__.

In a language like Java, the programmer doesn’t have a powerful tool like a preprocessor, but they still have lots of options. Developers could use an external preprocessor like m4, but it would likely interfere with the cushy IDEs that most programmers have become addicted to. Still, there are always ways to increase or decrease the noise in the code:

return new String[] { new Data().modifyOnce().andAgain().toString(),
"value", ( x == 2 ? new NameString() : null )};

While the above is a completely contrived example -- hopefully people aren’t doing horrible things like this -- it shows that it is fairly easy to “build up” very noisy statements and blocks. However:

String modified = twiceModified();
String name = nameString(x);

return stringArray(modified, "value", name);

accomplishes the same result, but with the addition of having to create three methods to encapsulate some of the logic (you can guess what they look like). The noise in the initial example is completely unnecessary. What we want to capture are the three values that are returned; how they are generated is a problem that can be encapsulated elsewhere in the code.
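For anyone who would rather not guess, one possible shape for those three methods is sketched below. Everything here is an assumption made for illustration: the `Data` class is a made-up stand-in, `threeValues` is just a wrapper so the fragment compiles, and `NameString` is replaced by a plain `String` so that the result fits in a `String[]`:

```java
class NoiseSketch {

    // Hypothetical stand-in for the Data class used in the noisy example.
    static final class Data {
        private final StringBuilder text = new StringBuilder("data");

        Data modifyOnce() { text.append("-once"); return this; }
        Data andAgain()   { text.append("-again"); return this; }

        @Override
        public String toString() { return text.toString(); }
    }

    // Hides the chained calls behind a single name.
    static String twiceModified() {
        return new Data().modifyOnce().andAgain().toString();
    }

    // Hides the conditional; a plain String stands in for NameString here.
    static String nameString(int x) {
        return (x == 2) ? "name" : null;
    }

    // Hides the array-construction syntax.
    static String[] stringArray(String... values) {
        return values;
    }

    // The quiet version from the text, wrapped in a method so it compiles.
    static String[] threeValues(int x) {
        String modified = twiceModified();
        String name = nameString(x);

        return stringArray(modified, "value", name);
    }

    public static void main(String[] args) {
        System.out.println(String.join(", ", threeValues(2)));
    }
}
```

Each helper is tiny, but together they let the calling code say only what it returns, not how each piece is built.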

Along with excess syntactic noise there are a couple of other related issues.

First, while try/catch blocks can be useful, they actually form a secondary flow of logic in and on top of the base code, thus they generate confusion. Recently a friend was convinced that the JVM was acting up (objects were apparently only getting ‘partially initialized’), but it was actually just a combination of sloppy error handling and threading problems.
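One way to keep that secondary flow from tangling with the main one is to confine it to a single small method. The sketch below assumes a library call that reports failure through a status flag, as in the bond example; `yieldCalc`, `yieldOf` and the exception class are hypothetical stand-ins, and the arithmetic is a placeholder:

```java
class ErrorBindingSketch {

    // Hypothetical unchecked exception, named after the one in calcIndex().
    static final class YieldCalculationException extends RuntimeException {
        YieldCalculationException(String message) {
            super(message);
        }
    }

    // Hypothetical stand-in for a library call that signals failure with a
    // flag and writes its answer into an output array.
    static boolean yieldCalc(double price, double[] result) {
        if (price <= 0.0) {
            return false;
        }
        result[0] = 100.0 / price; // placeholder arithmetic, not a real yield
        return true;
    }

    // The only place that knows about the status flag: it converts the
    // library's error convention into an exception.
    static double yieldOf(double price) {
        double[] result = new double[1];
        if (!yieldCalc(price, result)) {
            throw new YieldCalculationException("yield failed for price " + price);
        }
        return result[0];
    }

    // The main flow contains no error handling at all.
    static double averageYield(double[] prices) {
        double total = 0.0;
        for (double price : prices) {
            total += yieldOf(price);
        }
        return total / prices.length;
    }

    public static void main(String[] args) {
        System.out.println(averageYield(new double[] { 50.0, 100.0 }));
    }
}
```

The error path still exists, but a reader following `averageYield` sees only the base logic and can look into `yieldOf` when the failure behaviour matters.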

Naming, too, is a big factor in making the code readable. Data are usually nouns, and methods are usually verbs. The shortest possible “full name” at the given level of abstraction should match either the business domain, or the technical one. If you don’t immediately know what it is called, it’s time for some research, not just a random guess.

And finally, method calls are essentially free. OK, they are not, but unless you’re coding against some ultra-super extreme specs, they’re not going to make any significant difference, so use them wherever and whenever possible. Big fat dense code is extremely hard to extend, which increases the likelihood that the next coder is going to trash your work. You don’t want that, so to preserve your labors, make it easy to extend the code.

Syntactic noise is one of the easiest problems to remove from a code base, but it seems to also be one of the more common ones. Programmers get used to ugly syntax and just assume that a) it is impossible to remove and b) everyone else will tolerate it. However, with a bit of thinking and a small amount of refactoring it can be quickly reduced, and often even nearly eliminated.

Good, readable code blocks have always been able to minimize the noise, thus allowing the intent of the code to shine through. As for other programmers, a few may tolerate or even contribute to the mess, but most will seize the opportunity to send it to The Daily WTF and then re-write it from scratch. Great programmers can do more than just get their code to work; they can also build a foundation to allow their efforts to be extended.

11 comments:

  1. Leonard Brünings, July 7, 2010 at 8:42 PM

    Ok we don't have preprocessors in Java, but I'm sure you've heard of AOP. You can do most of the Tracing stuff with it.

    And I don't fully follow your argument that reducing syntactic noise with stuff like preprocessor directives makes it easier for another developer to read your code, because he needs to know what happens in them.

    P.S. the for loop in calcIndex doesn't make sense, maybe code formatting screwed up?
    And you can write: "if (status != true)" as "if (!status)"

  2. Very noisy .....
    Thank you for the post..

  3. Hi Leonard,

    Thanks for the comments. Aspects are great, although I've never had a chance to use them in a project, but they are more or less what I was doing with the defines (although it was manual). There are often horizontal redundancies which can be eliminated (as opposed to vertical, like a hierarchy).

    I haven't played with AOP because, although I think it would be useful, there hasn't been a strong enough reason yet to add it into the technologies that I am using. If it were built into a language, I'd be very happy.

    When you're reading someone else's code, the first thing you are interested in is what is happening at the 'higher' level. Basically the pseudo code. The details, while important, can be nicely encapsulated away from you until you are forced to deal with them. That is, if the original programmer solved "the problem", then you don't have to re-solve it until you're forced to, and only when extending the code. In my example with the cpp defines, none of the other programmers need to know or care what's in the BEGIN define. They can extend the code perfectly well, while only focusing on the bits they do need to understand (the main algorithm). That's the strength of encapsulation. If it's solved, and it doesn't need to be extended, then you can ignore it for the time being. It's just less to worry about.

    You are right about the loop, I didn't notice (but it explains some last minute formatting problems I was having). The less-than symbol wasn't escaped properly and got eaten. Blogger is great for text, but I kind of wish I was working with wordpress sometimes, it handles the code and math way better. I'll see if I can't fix it later, thanks for letting me know :-)

    Paul.

  4. Hi Drupal Toronto,

    Thanks for the comment :-) I saw your web site. Did the recent recession have a big effect? Are you seeing an increased demand for iPhone apps? Just curious :-)


    Paul.

  5. You may be interested in Concept Programming:

    http://xlr.sourceforge.net/concept/top.html

    I believe it is still a work in progress, both on the theory side (concept programming) and on the implementation side (XLR language).

    One issue you mentioned and that the author doesn't seem to address is error handling.

    People are inclined to consider it as "noise", and obviously Exceptions were designed and invented to reduce this kind of noise.

    My opinion is different: error handling is an important part of the application logic that one shouldn't try to "hide under carpet".

    In my daily job, it is important to know exactly what the program does when a network connection fails or when it couldn't parse something. I program embedded devices, so I generally can't afford that my programs misbehave completely even when something is dead wrong.

    I also think it is an opportunity to make more clever programs. I believe that writing the algorithm in pseudocode, where everything is ideally right, is generally the easiest part; handling the error conditions correctly so that the program helps the user when something goes wrong, is so hard that one may wonder if we shouldn't start from that.

  6. Hi Astrobe,

    I loved the introduction to Concept Programming, it's right in sync with my views. The short examples of the XL language however, seemed to remind me of basic.

    You're right, sometimes we want explicit error handling, sometimes we don't. Sometimes we want a compressed syntax, sometimes we want something more like COBOL.

    What I think would be cool is for a language to exist that allows programmers to write tailored programming languages for other programmers. A sort of meta-language that is used to create DSLs. So if you're writing something that needs heavy error handling, the syntax and semantics make it easy to see and verify this. If it's just a basic app with redundant GUI code, then a lot more is buried implicitly in the DSL.

    Of course, to work it would have to be powerful enough to allow the meta-programmer to create most other existing programming languages. Also, it would allow people to switch between meta-implementations. That is, the program could be a mix of C, Java, SQL and awk (and a small amount of binding noise).

    I've seen a few similar ideas (like COLA and LINQ) but nothing like a meta-language.

    What I always figured would work as well is for everybody to implement the logic within their own intuitive formal system, but somehow the computer could translate that for different people. We see this with some style issues. There are lots of programs to 'pretty-print' the code into different formats. In theory, the repository could be neutral, and the code reformatted as it was extracted. Doing that at a higher level with naming, structure and normalization would be pretty revolutionary. Code isn't ambiguous like natural language, so it wouldn't have the same translation issues.

    Paul.

  7. "The short examples of the XL language however, seemed to remind me of basic."

    The author acknowledges a strong influence by Ada.

    "What I always figured would work as well is for everybody to implement the logic within their own intuitive formal system, but somehow the computer could translate that for different people"

    A similar solution has been proposed recently:

    http://arxiv.org/abs/1005.1213

    I don't know what to think about this idea. On one hand, it is strange to compute different views of the same program when programming is slowly becoming a social activity ( pair programming, public CVS,...). I think that multiple views might incur communication between programmers.
    On the other hand, one-size-fits-all begins to fail, generally, for big values of "all". It is quite normal then to have custom views.

    My feeling on DSLs is quite similar: on one hand, you may customize the language to fit exactly your needs; on the other hand, it might make it harder for others to maintain your code because you ended up creating your own language.

    I think that we should learn to make trade-offs: a language should be flexible enough in order to allow a clear expression in most cases, but not enough to let someone turn it into a different language.

  8. Why not this code instead?
    - Error handling is in the getWeight method.
    - Uses for-each.
    - Avoids an unnecessary variable to get the list length.

    double lSum = 0.0;

    for (Bond lBond : lBondList) {
        lSum += lBond.getWeight();
    }

    return lSum / lBondList.size();

  9. Hi Foudres,

    Simple answer: because you presented code that is way better :-) I wanted something that was noisy so I deliberately picked messier ways of coding. I picked a style that was a few decades older, giving it that 1980's K&R C feel (overlaid on Java code). I didn't want to get lost in a complicated example, so I needed to amp up the noise, but in a way that you might see it in practice.

    I did try to capture the fact that in most bond calculation programs, the yield calculation is a separate specialized library call, with lots of strange options. Basically a complex engine used in many places in a big system. Similarly, there is a many-to-one relationship between the yield calculations (facts) and the bond info. Because yield calculators are so hard to write, they're often abstracted to run over a range of dates, not just a single one.

    I prefer to bury error handling deep, so it isn't visible at most layers, but sometimes with a library call you have to bind one error paradigm to another one. Wrapping it in a little method would be far more correct than what I did.

    Interestingly enough, my example didn't even use i. In real practice, I'd follow what you did with the list size, unless the algorithm specifically included some way to ignore some individual bonds. Bonds aren't always priced and/or they can drift into junk status, so just because one is in the index, doesn't mean that it will be used.

    Overall, I think this is a good example of how the messiness of the domain, and an ugly coding style can combine to take what is basically a simple loop, and make it way harder. I didn't really construct the example like this on purpose, but in retrospect it matches well to the types of problems we see constantly in industrial strength code. If it wasn't for the details, it might actually be elegant :-)

    Paul.

  10. Hi Astrobe,

    "On one hand, it is strange to compute different views of the same program when programming is slowly becoming a social activity ( pair programming, public CVS,...)."

    I've been flip flopping on this for the last few decades :-)

    Somedays I see no reason why I can't take a team of twenty well trained programmers and get them to output a big, neat, consistent, elegant solution.

    Other days I see it this way:

    http://theprogrammersparadox.blogspot.com/2010/05/personality-effect.html

    Slowly over the years, I am leaning more and more to the side that programmers can't work together, that the best works have always come from very small teams that were luckily in sync.

    If that really is the case, what we need to do for a massive project is to leverage everybody's work as much as possible. Since a computer can't tell the difference, code style, naming, etc. are all just inherent human issues. We should be able to unwind it, normalize it, and then map it back to some personalized naming/style conventions. Code isn't ambiguous, so it should be symbolically manipulatable. If I can work on a view that I am comfortable with, and everyone else can too, then we can share more work, and thus leverage it better.

    Funny enough, we use the tool that could make this possible all day, but we rarely take full advantage of its capabilities.

    Paul.

  11. I would write the final fragment as simply:

    return stringArray(twiceModified(), "value", nameString(x));

    It reduces the cognitive load of two temporaries at the cost of two sets of parens.

