Today is another rant day or, to put it politely, a clarification that needs to be made.
As you probably know by now, I’m an infra/Ops guy, so monitoring has always been one of our core interests and tools.
There are many tools out there: some dating back to the pre-cloud era, some brand new and cloud-oriented, some focused on the application, some on the infrastructure. With some tuning, you can always find the right one for you.
But beware of a very common, fundamental misunderstanding: monitoring is not alerting, and vice versa.
Let me explain a bit. Monitoring is the action of gathering information about the value of a probe. A probe can measure anything, from CPU load to an application return code. Monitoring then stores this data and gives you the ability to graph, query, display, or export it.
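To make that definition concrete, here is a minimal Python sketch of a probe that gathers values, stores them with timestamps, and lets you query the history later. The `Probe` class and its names are illustrative, not taken from any particular tool.

```python
import time
from collections import deque

class Probe:
    """A monitoring probe: gather a value, store it, query it later."""

    def __init__(self, name, gather, history=1000):
        self.name = name
        self.gather = gather              # callable returning the current value
        self.samples = deque(maxlen=history)  # bounded in-memory storage

    def poll(self):
        """Gather one sample and store it with a timestamp."""
        value = self.gather()
        self.samples.append((time.time(), value))
        return value

    def query(self, since=0.0):
        """Return (timestamp, value) pairs newer than `since` — the raw
        material for graphing, analytics, or export."""
        return [(ts, v) for ts, v in self.samples if ts >= since]

# Example: a probe that "measures" CPU load (stubbed here with a constant).
cpu = Probe("cpu_load", gather=lambda: 0.42)
cpu.poll()
print(cpu.query())  # the stored samples, ready to graph or export
```

Note that nothing here alerts: gathering and storing is the whole job of monitoring.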
Alerting is one of the possible actions taken when a probe reaches a defined value. The alert can be an email sent to your Ops team when a certain CPU reaches 80%, or a notification on your iPhone when your spouse gets within 50 m of your home.
Of course, most tools have both abilities, but that does not mean you need to mix them and set up alerting for every probe you have in place.
My regular use case is a cloud-based IoT solution, where we manage the cloud infrastructure backing the IoT devices and application. In that case, we would usually have a minimum of two alerting probes: the number of live connected devices, and the response time of the cloud infrastructure (based on an application scenario).
And that would be it for alerting, in a perfect world. Yes, we would have many statistics and probes gathering information about the state of the cloud components (web applications, databases, load balancers, etc.). These would make nice, pretty graphs and provide data for analytics. But in the end, who cares if the CPU on one instance of the web app reaches 80%? As long as the response time is still acceptable and there is no significant variation in the number of connected devices, everything is fine.
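The two-probe policy above can be sketched as follows. The threshold values (500 ms, 5% drift) and the function name are made-up examples for illustration, not recommendations from any tool.

```python
def check_alerts(connected_devices, response_time_ms,
                 baseline_devices, max_response_ms=500, max_drift=0.05):
    """Sketch of a minimal alerting policy: only two probes can page anyone.
    All thresholds here are hypothetical examples."""
    alerts = []
    # Probe 1: number of live connected devices, compared to a baseline.
    drift = abs(connected_devices - baseline_devices) / baseline_devices
    if drift > max_drift:
        alerts.append(f"device count drifted {drift:.0%} from baseline")
    # Probe 2: end-to-end response time of the application scenario.
    if response_time_ms > max_response_ms:
        alerts.append(f"response time {response_time_ms} ms over {max_response_ms} ms")
    return alerts

# CPU at 80% on one web instance? No alert: we only page on these two probes.
print(check_alerts(connected_devices=9_800, response_time_ms=320,
                   baseline_devices=10_000))  # []
```

Everything else (CPU, memory, queue depths) stays as monitoring data you consult once one of these two fires.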
When one of the alerting probes starts blinking, then you dig into the other probes and statistics to figure out what is going on.
About the solution
There are so many tools available for alerting and monitoring that there cannot be a one-size-fits-all answer.
Some tools are focused on gathering data and alerting, but not really on the graphing/monitoring part (like Sensu, or some basic Nagios setups), and some are good at both (Nagios+Centreon, New Relic). Some are mostly application-oriented (Application Insights, New Relic), while others are focused on infrastructure, or even hardware (HPE SIM, for example).
I have worked with many, and they all have their strengths and weaknesses. I will not use this blog to promote one over another, but if you’re interested in discussing the subject, drop me a tweet or an email!
The key thing here is to keep your alerting to a minimum, so that your support team can work in a decluttered environment and react quickly when an alert is triggered, rather than drowning in fake alarms, false warnings, and “this is red but it’s normal, don’t worry” 🙂
Note: the idea for this post comes from a colleague of mine, and the second screenshot from a tool another colleague wrote.
3 Replies to “Monitoring and alerting”
Nice article, but I do not totally agree with your analysis. What about capacity planning? Let’s say your cloud application finds huge success (the Internet buzz effect). Your CPU is constantly climbing, up to 80%. Not a concern for real-time performance, but what about tomorrow? Next week?
IMHO a good alert is based on a threshold multiplied by time.
I agree when you say nobody cares that the CPU is at 80%. But what if the CPU is +10% each day? Don’t you want some extra time to troubleshoot before downtime? 😉
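The commenter’s “threshold multiplied by time” idea could be sketched like this: alert on sustained growth rather than on an absolute value. The 10% figure mirrors the comment; the function name and the day-over-day windowing are my own assumptions, not from any specific tool.

```python
def trend_alert(daily_cpu, growth_threshold=0.10):
    """Fire when a metric grows steadily day over day, even if its
    absolute value is still acceptable. Purely illustrative."""
    if len(daily_cpu) < 2:
        return False
    # Relative growth between each pair of consecutive daily samples.
    growths = [(b - a) / a for a, b in zip(daily_cpu, daily_cpu[1:])]
    # Fire only when *every* day-over-day growth exceeds the threshold,
    # so a single noisy spike does not page anyone.
    return all(g >= growth_threshold for g in growths)

print(trend_alert([0.40, 0.45, 0.51, 0.58]))  # True: ~+12% per day, sustained
print(trend_alert([0.78, 0.80, 0.79, 0.80]))  # False: flat around 80%
```

This is the capacity-planning view: the second series sits near 80% but is stable, while the first is lower yet trending toward trouble.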
You’re right. Actually, I did not cover the ITIL processes outside of daily incident management.
Regarding capacity planning, the analysis would of course be based on monitoring probes and their trends.
Your comment gives me food for thought, and I’ll probably extend that into a full article later on.
(BTW, sorry for the delay in approving, it seems the notifications were misplaced 🙂 )