Tuesday, September 20, 2022

defensive coding

 

One of the things that happens to people in testing positions is that every now and then we get to say "I told you so", usually about a bug report that was filed, closed as "won't fix / not a bug / not important", and then came back to bite us in the rear. While there's always the basic joy of being right (and more importantly, of other people being wrong), over the years I've learned to see those cases as professional failures instead of sources of joy. After all, I saw the problem in advance and knew that it was a problem, so maybe I could have done something differently to actually get it fixed. Maybe I could have presented the problem differently, talked to other people who could advocate for it better than me, collected more evidence, or perhaps it was only a matter of being more persistent in asking for it to be fixed. In other cases there was nothing I could do at the time, since the reason for not addressing the problem was rooted in the organizational culture, which I can now start pushing to change. Either way, saying "I told you so" is not the professional thing to do.

Last week I had just such a case - something didn't work for a customer, and upon further inspection it turned out that it had never worked since deployment. Before we got to difficult debugging, we went over the short checklist of usual problems, something quite equivalent to your ISP's tech support asking you to reboot your router when you call. In our case, this list consisted of one thing - checking the configuration file on the server. With two relevant entries, it was a rather short glance - the authentication token looked fine and the destination URL pointed to the correct base path. So far, so good.
Then, by sheer luck, something stirred my memory and I noticed that the URL had been typed with scheme + FQDN, you know - the way URLs are usually formatted. I recalled that when I worked on that feature there was something odd about having the scheme provided in the file. A short trip to our bug tracking system, and indeed my memory was correct - if we provide the scheme (for those less fluent in this specific terminology, that's the https:// part), it won't work, as the client consuming this configuration adds the scheme itself, and https://https://any.domain will fail. To make it that much more fun, there is no way to understand that from the logs. The ticket was logged last December (about 9 months ago!), and the discussion around it included some acceptable reasons for not fixing it. A case could have been made for a fix anyway, but back then it would have been a harder battle to win. It's not that any fix would have been difficult - the team configuring the server could add a regular expression validation to their tool, the server could reject the config or remove the scheme, the client could do the same and log a meaningful error, and all of us could have been monitoring this feature once it was deployed. We could even change the name of the parameter so that instead of "...URL" it would be "...DOMAIN" and reduce the chance of errors. For almost all the steps that we could have taken but did not, there's a common reason: optimistic coding.
Optimistic coding is a state of mind where we assume that everything is going to be fine - the API is to be used only internally, so everyone will know what they should be doing. And if someone makes a mistake? Well, it's their problem to fix.
What we should be doing (and I intend to use this incident as leverage to push towards such behavior) is to create our software with a slightly more paranoid approach. Software is created to be used and operated by human beings, and human beings make mistakes. We need to assume that people operating, configuring and debugging the system will act very stupidly, not because they are stupid but because they are doing something else, under time pressure and with a lot of distractions around. Most likely, at least some of those people will be our future selves. If we keep that in mind we can adopt a "nothing gets past me" approach - any mistake that can be detected in a given phase should be dealt with right there: fix it if possible, return (or report) an error if a fix is not possible, and almost never let a problem pass from one component to the next.
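To make "fix it if possible, report an error if not" a bit more concrete, here is a minimal sketch in Python of what the consuming side could do with that configuration value. It is not our actual code: the names `normalize_destination` and `ConfigError` are made up for illustration, and I'm assuming the value is expected to be a bare host name with an optional base path, that only https is supported, and that ports are not allowed.

```python
import logging
import re

log = logging.getLogger("config")

# Hypothetical rules: a leading "<scheme>://" and a plain DNS-style host name.
_SCHEME = re.compile(r"([A-Za-z][A-Za-z0-9+.-]*)://")
_HOSTNAME = re.compile(
    r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?"
    r"(?:\.[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?)*"
)


class ConfigError(ValueError):
    """Raised when a configuration value cannot be used or safely repaired."""


def normalize_destination(raw: str) -> str:
    """Return the destination as 'host[/base/path]', or raise ConfigError."""
    value = raw.strip()

    # Fixable mistake: someone typed the value the way URLs usually look.
    match = _SCHEME.match(value)
    if match:
        scheme = match.group(1).lower()
        if scheme != "https":
            raise ConfigError(
                f"destination {raw!r} uses scheme {scheme!r}; "
                "only https is supported and the scheme should be omitted"
            )
        log.warning("destination %r includes a scheme; stripping it", raw)
        value = value[match.end():]

    # Not fixable: whatever is left must start with a valid host name.
    host, _, _path = value.partition("/")
    if not _HOSTNAME.fullmatch(host):
        raise ConfigError(
            f"destination {raw!r} does not contain a valid host name "
            f"(got {host!r}); expected something like 'any.domain/base/path'"
        )
    return value
```

With this in place, `https://any.domain` is quietly repaired to `any.domain` with a warning in the log, while `https://https://any.domain` or an empty value fails loudly with a message that names the offending entry, instead of being passed along to the next component.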
