Where To Start With Infrastructure Monitoring

Recently I spent some time revisiting our monitoring system. It needed a little TLC, and some of the staff weren’t clear on exactly how it works and does its magic. As a follow-up, I thought it might be useful to write a little about monitoring. I mean, what’s the point anyway?

Monitoring has evolved over the years, especially with cloud computing and more resilient infrastructures. The tools have also progressed, and I think it’s pretty clear that anyone deploying a serious monitoring system has long since abandoned the old days of MRTG (http://oss.oetiker.ch/mrtg/) and mon (https://mon.wiki.kernel.org/index.php/Main_Page). Even the infamous and all-powerful Nagios is falling by the wayside. On the other end of the spectrum is software like SMARTS (http://www.emc.com/it-management/smarts/index.htm), formerly System Management Arts.

A good monitoring tool does at least the following:

  • Trends collected data (collection history)
  • Applies thresholds to data
  • Sends notifications based on those thresholds
  • Displays information in a meaningful way
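
To make those four pieces concrete, here is a minimal sketch in Python. It is not how any particular monitoring product works; the `Check` class, the `cpu_percent` metric, the threshold of 90, and the sample values are all made up for illustration, and the "notification" is just a string appended to a list where a real tool would send an email or a page.

```python
from collections import deque
from statistics import mean


class Check:
    """One monitored metric: keeps history, applies a threshold, notifies."""

    def __init__(self, name, threshold, history_size=5):
        self.name = name
        self.threshold = threshold
        # Collection history: only the most recent samples are kept.
        self.history = deque(maxlen=history_size)
        # Stand-in for real delivery (email, pager, chat).
        self.notifications = []

    def record(self, value):
        """Store a sample and notify if it exceeds the threshold."""
        self.history.append(value)
        if value > self.threshold:
            self.notifications.append(
                f"ALERT {self.name}: {value} > {self.threshold}"
            )

    def summary(self):
        """Present the collected data in a readable one-line form."""
        return (
            f"{self.name}: last={self.history[-1]} "
            f"avg={mean(self.history):.1f} max={max(self.history)}"
        )


# Hypothetical samples for a hypothetical metric.
cpu = Check("cpu_percent", threshold=90)
for sample in [42, 55, 97, 61]:
    cpu.record(sample)

print(cpu.summary())
print(cpu.notifications)
```

Only the sample of 97 crosses the threshold, so only one notification is queued. Everything a real tool adds (agents, polling intervals, graphs) is layered on top of this same loop of collect, compare, notify, and display.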

That’s really the nuts and bolts of it. After that, things get much more in-depth. For example, how is the data collected, and what escalation rules can be applied when sending notifications? What about correlating the data and setting dependencies? The feature list goes on and on.
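
Escalation rules and dependencies are easier to picture with a small example. The sketch below is only one simple way to model them, under assumptions of my own: the tier timings, the contact names, the host names, and both helper functions are hypothetical, and real tools offer far richer options.

```python
def escalation_contact(minutes_unacknowledged, rules):
    """Return the contact for the highest escalation tier reached so far.

    rules is a list of (minutes_after_alert, contact) pairs.
    """
    contact = None
    for after_minutes, who in sorted(rules):
        if minutes_unacknowledged >= after_minutes:
            contact = who
    return contact


def alerts_to_send(down, depends_on):
    """Suppress alerts for hosts whose parent device is also down."""
    return sorted(h for h in down if depends_on.get(h) not in down)


# Hypothetical escalation tiers and topology.
rules = [(0, "on-call engineer"), (15, "team lead"), (60, "ops manager")]
depends_on = {"web1": "switch-a", "web2": "switch-a", "db1": "switch-b"}
down = {"switch-a", "web1", "web2", "db1"}

print(escalation_contact(20, rules))
print(alerts_to_send(down, depends_on))
```

Here an alert that has gone unacknowledged for 20 minutes has reached the second tier, and the two web servers are suppressed because the switch they sit behind is already down. That suppression is the point of setting dependencies: one alert for the root cause instead of a flood for everything behind it.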

Once the data is collected and made useful (graphs, Excel, whatever), it opens the door to things outside of monitoring, such as planning, troubleshooting, better SLAs, etc.

So if you’re planning on doing a performance monitoring project, think about what you want and a little bit about how you might get there. What makes a tool do performance and monitoring in one package? Explore what others have and how they have leveraged it to improve their SLAs, planning, troubleshooting, etc. Finally, it would also be worth considering what software has been used in conjunction with monitoring software to leverage it even further.

I noticed our system works well and is now a mature deployment. Our challenges now revolve around making sure people really know how to leverage the data, and around continuously documenting and improving the system.