When implementing a monitoring system or, more precisely, the visualization part of a monitoring system, it’s very important to keep in mind several concepts about the human mind.
The same image means different things in different contexts. Check this image out:
Have you noticed that B and 13 are pretty much the same image? But when reading left-to-right and top-to-bottom, it's quite easy for us to interpret them differently, right? You have no doubt that you're reading an A-B-C sequence when reading left-to-right and a 12-13-14 sequence when reading top-to-bottom.
Same goes for patterns. Consider a graph (1):
Here we see server memory rising pretty fast, with noticeable drops. And another graph (2):
Here we see data points that stay in pretty much the same range.
Without good context, it’s hard to tell which one indicates an abnormality. If the first one is JVM memory, you can be fairly certain that things are going well and that garbage collection is cleaning up unneeded resources. On the other hand, if it’s website response times that climb and then suddenly drop, you may want to check your application code.
Or, for example, this graph:
Here we see some reasonable distribution, but also some spikes.
If it’s your backend server’s processor time, it may be fine (the server gets something new or heavy to do every once in a while, and otherwise stays in its normal operating mode). The same pattern on a web server may be considered critical and require investigation, for example, on the load balancer side.
Given a different context, we interpret some behavior as normal and other behavior as abnormal, as with the processor times above. Also, different number ranges for different classes of numbers yield different conclusions.
For example, when speaking of TCP packet loss and retransmits, it’s important to see the actual number. It’s definitely not a binary metric (retransmits happen all the time), but often, especially if the numbers are aggregated (over 10, 100, or 1000 machines), it’s hard to estimate whether the current retransmit frequency is large or small.
Here, several things help:
When you aggregate exceptions from your application servers, after a certain number of occurrences you have less interest in the precise number. Also, different exceptions, as they affect different parts of your application, may call for very different reactions.
If your cache cluster becomes unreachable, you get a rather large number of cache-miss exceptions. The website gets slower, but is still functional. On the other hand, when there’s a bug in application logic that doesn’t allow a user to log in or see their dashboard, or the app can’t connect to the database server at all, you may not see as many of these exceptions in the database. Nevertheless, they are obviously more critical and important.
For that reason, it’s important to be able to prioritize certain exception types in real time, to escalate issues immediately, and to have sane escalation defaults.
For example, a new exception occurs:
Here, under “escalate”, everyone should have their own logic: ticket in your favorite system -> email notification -> SMS -> call -> team call sounds generic enough, but YMMV.
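To make the prioritization idea a bit more concrete, here is a minimal sketch in Python. The exception names, the priority numbers, and the helper functions are all made up for illustration; only the order of the escalation chain comes from the text above. Unknown exception types fall back to a default priority, which is one possible reading of "sane escalation defaults".

```python
from collections import Counter

# Escalation chain in the order suggested above.
ESCALATION_STEPS = ["ticket", "email", "sms", "call", "team call"]

# Hypothetical priorities per exception type; higher means more critical.
DEFAULT_PRIORITY = 1
PRIORITIES = {
    "CacheMissError": 0,            # noisy, but the site still works
    "LoginFailedError": 3,          # users cannot log in
    "DatabaseUnreachableError": 4,  # the app cannot reach the database
}

def escalation_step(exception_type, priorities=PRIORITIES):
    """Return how far along the chain an exception type is escalated."""
    priority = priorities.get(exception_type, DEFAULT_PRIORITY)
    # Clamp the priority so it always maps to a valid step.
    index = min(max(priority, 0), len(ESCALATION_STEPS) - 1)
    return ESCALATION_STEPS[index]

def triage(exception_types, priorities=PRIORITIES):
    """Group occurrences by type and order them by priority, not by volume."""
    counts = Counter(exception_types)
    rows = [(name, count, escalation_step(name, priorities))
            for name, count in counts.items()]
    return sorted(rows,
                  key=lambda row: priorities.get(row[0], DEFAULT_PRIORITY),
                  reverse=True)

events = (["CacheMissError"] * 500
          + ["DatabaseUnreachableError"] * 2
          + ["NewError"])
for name, count, step in triage(events):
    print(name, count, step)
```

Note that the two database errors end up at the top of the list even though the cache misses outnumber them by far, which is the point: the count alone does not decide the reaction.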
the accurate intuitions of experts are better explained by the effects of prolonged practice than by heuristics.
skill and heuristics are alternative sources of intuitive judgments and choices.
These are quotes from Daniel Kahneman's book, "Thinking, Fast and Slow".
A heuristic is a mental shortcut that eases the cognitive load of making a decision.
A user of your monitoring solution may be an experienced DevOps engineer or just a beginner. Of course, for an experienced person it’s often enough to see rough sketches of graphs generated by Graphite or RRDTool. For a less experienced person, you may want to provide visualizations or facts that make decision making easier.
One such heuristic is the horizon graph. It helps you increase data density while preserving resolution. Without prior knowledge of horizon graphs, they may look quite confusing:
The same graph originally looked like this:
But learning how they work:
Changes your understanding. Now you see that a darker blue indicates a higher number than a lighter blue, and a darker red represents a negative number whose absolute value is higher than that of a lighter red area. Now, looking at a graph, you can visually compare the amounts of different colors and understand where your values lie.
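If it helps, the banding behind a horizon graph can be sketched in a few lines of Python. This is only an illustration of the general technique, not the code of any particular charting library; the function name and the `band_height` and `num_bands` parameters are assumptions. Each value is cut into equal-height bands, negative values are mirrored, and all bands are then drawn on top of each other in the same narrow strip, each in its own shade of blue or red.

```python
def horizon_bands(value, band_height, num_bands=3):
    """Split one value into per-band fill heights plus its sign.

    Band 0 covers magnitudes from 0 to band_height, band 1 the next
    band_height, and so on. Magnitudes past the last band are clipped.
    """
    sign = -1 if value < 0 else 1
    magnitude = abs(value)
    fills = [
        min(max(magnitude - i * band_height, 0.0), band_height)
        for i in range(num_bands)
    ]
    return sign, fills

# A value of 25 with bands of height 10 fills two bands completely
# and the third one halfway.
print(horizon_bands(25, 10))
# A negative value is mirrored: same banding, but sign -1 (drawn in red).
print(horizon_bands(-7, 10))
```

Because every band is at most `band_height` tall, the chart needs only a fraction of the original height, which is where the extra data density comes from.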
Seeing a pattern may be a difficult task. Marking abnormalities, or values that are particularly interesting for the observer, with different colors helps a lot, too. It's sometimes called Threshold Encoding:
Here, temperatures above 55°F are marked in red. It's immediately obvious to an observer which values are above the threshold and which are below.
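In code, threshold encoding boils down to picking a color per data point. Here is a small Python sketch, assuming the 55°F threshold from the example; the sample readings, the color names, and the function itself are made up for illustration.

```python
THRESHOLD_F = 55.0  # threshold from the example above, in °F

def threshold_colors(values, threshold=THRESHOLD_F,
                     above="red", below="steelblue"):
    """Pick one color per point: `above` for values strictly over the threshold."""
    return [above if v > threshold else below for v in values]

temperatures = [48.2, 53.9, 55.0, 57.4, 61.0]  # made-up sample readings
colors = threshold_colors(temperatures)
print(list(zip(temperatures, colors)))
```

The resulting list can be handed to whatever plotting tool you use as the per-point color. Whether a value exactly on the threshold counts as "above" is a choice you have to make; this sketch treats it as below.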
Yet another thing that helps a lot is a grid. Consider the following graph:
The line is clear, points are visible. But it's still hard to guess an exact value, right? Everything changes as soon as we bring the grid in:
We can now see where values cross the X axis, and it becomes more obvious which quadrants the values lie in.
If you find these things useful, in the next post I'm going to cover different principles, such as:
Published on Jan 6 2013