Checking for Correctness

“There are many angles in which you can fall, but only one angle in which you can stand straight.” – G.K. Chesterton

Chesterton wasn’t talking about computing environments, but I think as technologists we have to think across disciplines. I like Chesterton’s statement because it strikes at the heart of the futility of enumerating failure modes, no matter what type of failure they are. There is a lesson in this for building monitoring systems, and I’ve learned this lesson the hard way.

About 15 years ago I was a brash young system administrator. I shared an on-call rotation with four other sysadmins, and we ran UNIX systems for a large logistics and distribution company. Warehouses ran all night long while all of us preferred to be home sleeping. The problem was that whoever was on call rarely got to sleep through the night. I was tired and sleep deprived, and I finally made a rule for myself that 4 hours of sleep was the minimum I was going to get before making it into the office. We had to change the way we were doing things.

We started going deeper in our corrective actions. One of the things I expanded was the set of monitoring checks. Every failure meant that we deployed a new check that would find that failure so that we could correct it before an outage occurred. We started looking for error messages in logs, we checked load averages, we checked network connectivity. We were running more and more checks, and every failure meant a new check to chase that new failure. Even as the number of checks ballooned, we still got most of our issues reported by users and not by the monitoring system.

The insanity had to stop. We were putting so much load on the monitoring system that it was looking like we would need another one. Then the most obvious and clear realization hit me. Our monitoring philosophy was fundamentally wrong. We were chasing failures and we weren’t sure what the normal non-failure mode was. We needed to make a change and focus on correctness.

A great example of monitoring a web page for correctness is checking its return code. HTTP 404 is a missing page; 500 is an internal server error. I could spend days listing the possible return codes for the server and trying to handle each error. The problem is that I’ll never know whether I captured every error code, including any that might arrive with the next version of the software. What I do know is that HTTP 200 is a “Correct” return code. If I monitor for a 200 response and treat all other codes as errors, then I’ll capture all the possible failure codes.
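Here is a minimal sketch of that idea in shell, assuming curl is installed. The function names is_correct_status and check_url, the 10-second timeout, and the example URL are my own placeholders for illustration, not part of any particular monitoring system.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only for the one status we treat as correct.
is_correct_status() {
    [ "$1" = "200" ]
}

# Hypothetical helper: fetch a URL and report anything other than 200 as a failure.
check_url() {
    # -s: silent, -o /dev/null: discard the body, -w: print only the status code.
    # curl prints 000 when no HTTP response was received at all,
    # so timeouts and refused connections are flagged too.
    status=$(curl -s -o /dev/null --max-time 10 -w '%{http_code}' "$1")
    if is_correct_status "$status"; then
        echo "OK $1"
    else
        echo "FAILED $1 (status: $status)"
        return 1
    fi
}

# Example usage (placeholder URL):
# check_url "https://example.com/"
```

Notice that the check never names 404 or 500. Any status it has not been told is correct is reported as a failure, including codes that do not exist yet.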

A decade later in my career I encountered exactly the same type of problem. We were working with storage systems and needed to find and replace failing disks. We discovered that our monitoring wasn’t flagging all the failed disks. As soon as I saw the code I knew what the problem was. Here is a statement similar to what was in the code:


# Flags only the one failure state we already knew about.
if [ "$diskState" = "Degraded" ]; then
    echo "Disk Failed"
fi

The statement only captured degraded disks. That was the most common failure, but the check failed to flag disks in other failed states, such as missing or offline. We quickly rewrote the check to be:


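# Flags anything that is not known to be correct.
# The quotes matter: an empty or unset state must count as a failure too.
if [ "$diskState" != "Healthy" ]; then
    echo "Disk Failed"
fi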

You can see that “not healthy” could capture a far broader set of failures than simply checking for degraded behavior.
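To make the difference concrete, here is a small sketch that runs the "not healthy" test over a handful of sample states. The function name disk_status and the sample values are assumptions for illustration. The real state strings depend on the storage system being queried.

```shell
#!/bin/sh
# Hypothetical helper: a disk passes only if its state is exactly "Healthy".
disk_status() {
    if [ "$1" = "Healthy" ]; then
        echo "OK"
    else
        echo "Disk Failed"
    fi
}

# Assumed sample states, including an empty one for a disk that reported nothing.
for state in "Healthy" "Degraded" "Missing" "Offline" ""; do
    echo "state='$state': $(disk_status "$state")"
done
```

Only the first entry is reported as OK. Every other state, including the empty one that nobody thought to list, is flagged without the check having to know about it in advance.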

When we search for things that are broken, we’ll only find the forms of brokenness we already know about. When we start searching for things that aren’t correct, we’ll start seeing new forms of broken and new forms of correct. Our understanding of what our systems are doing will evolve, and we’ll be better able to evolve the system because we know it better. I’ve discovered all kinds of faults searching for things that weren’t correct. Some of them were benign, and others were harbingers of future cataclysmic failures. Either way, I’ll never again look for failures, only for things that aren’t correct.