For many of us starting out in this area, our concept of monitoring consisted of top, some Apache, MySQL and application log files and perhaps an external ping service that told us when our web site was unavailable. Anything beyond that generally ran into the commercial product realm. We were scared off of monitoring by these old monolithic products that required huge licensing fees and armies of professional services people. Thankfully, times have changed.
And our application footprint has grown. No longer are we just deploying web servers and databases. Our application stack starts with our automated testing framework and runs through continuous integration and continuous deployment. Jenkins, Travis, Puppet/Chef, etc ... they're all critical. It also includes our deployment partners ... that army of SaaS applications we use to make our life easier. Any SaaS solution worth its salt has a status API available for tracking availability. Our monitoring needs are now wide and diverse.
My first exposure to the next generation of monitoring tools came with the awesome Etsy post "Measure Anything, Measure Everything."
The concept of Measure Everything wasn't new to me. I'd been working on StackTach for OpenStack around the same time and understood the value of getting a visual representation of the internals of an application. Even from my old management days we used to say "you can't manage what you can't measure." I lived this with my Google Analytics experiences from running various web sites and my software development management interests were aiming towards Six Sigma techniques over the hand-wavey agile methods. Essentially, numbers are good. But this was giving us a way to apply those same measurement techniques to running software. It was a lens into the black box. Could the days of parsing log files be over?
The first generation of these new monitoring tools included Zenoss, Nagios, RRDtool, Cacti, Munin and Ganglia, to name a few. They were built out of necessity and often have some really nasty warts that people just hate. The latest generation of tools has learned from those mistakes.
The Etsy tool chain started with statsd and graphite. This introduced me to the concept of using UDP packets for instrumenting running applications ... which was pretty brilliant. For those unfamiliar with statsd and graphite, here's the flow:
- your application wants to measure something, so it sends a UDP packet to the statsd service. UDP packets are lossy and unreliable but fast for large amounts of data. Many real-time video services send their data over UDP for the same reason.
- statsd is a node.js in-memory data aggregator (it accumulates received data and every so often sends it to graphite)
- graphite is a django app that archives received data and gives a funky web interface for presenting and querying the data.
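To make the first step of that flow concrete, here is a minimal sketch of an application emitting metrics using nothing but a standard socket. It uses the plain-text `name:value|type` line format that statsd accepts (`c` for counters, `ms` for timers) and statsd's conventional UDP port 8125. The host, the metric names and the helper names `format_metric` and `send_metric` are assumptions made up for illustration.

```python
import socket

STATSD_HOST = "127.0.0.1"  # assumption: statsd is running on the local machine
STATSD_PORT = 8125         # statsd's conventional default UDP port


def format_metric(name, value, metric_type):
    """Build one statsd line, e.g. b'app.logins:1|c' or b'app.render:320|ms'."""
    return f"{name}:{value}|{metric_type}".encode("ascii")


def send_metric(name, value, metric_type="c", host=STATSD_HOST, port=STATSD_PORT):
    """Fire-and-forget send: instrumentation must never break the application."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(format_metric(name, value, metric_type), (host, port))
    except OSError:
        pass  # a lost metric is acceptable; a failed request is not
    finally:
        sock.close()


if __name__ == "__main__":
    send_metric("app.logins", 1)              # count one login
    send_metric("app.render_time", 320, "ms")  # record a 320 ms timing
```

Nothing here waits for a reply, so if nobody is listening on the other end the packets simply disappear and the application carries on.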
There are a number of cool things happening here:
- adding statsd integration to an existing application is very easy. No special libraries are needed, and sockets are available in nearly all languages.
- since statsd uses UDP there is very little risk of the production application crashing if statsd fails. The packets just get lost.
- since statsd is in-memory, it can process a lot of data very quickly. But rather than take on the task of archiving and disk access, it simply forwards the results to something that can do it better.
- graphite has a simple REST interface, which makes it easy for technical product managers to create their own dashboards and status reports.
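The "accumulate in memory, then forward" idea from the list above can be sketched in a few lines. This is not statsd's actual implementation (statsd is a node.js service and also handles timers, gauges and sample rates); it is a simplified Python illustration that only sums counters. The class name `CounterAggregator` is hypothetical. The flush output follows Graphite's plaintext line format of `path value timestamp`.

```python
import time
from collections import defaultdict


class CounterAggregator:
    """Accumulates statsd-style counter lines in memory between flushes."""

    def __init__(self):
        self.counters = defaultdict(float)

    def handle_packet(self, packet):
        """Parse one line such as b'app.logins:1|c'; malformed input is dropped."""
        try:
            name, rest = packet.decode("ascii").strip().split(":", 1)
            # Simplification: any sample-rate suffix after the type is ignored.
            value, metric_type = rest.split("|")[:2]
            if metric_type == "c":
                self.counters[name] += float(value)
        except (ValueError, UnicodeDecodeError):
            pass  # bad packets are ignored rather than raised

    def flush(self, now=None):
        """Return Graphite plaintext lines ('path value timestamp') and reset."""
        timestamp = int(now if now is not None else time.time())
        lines = [
            f"{name} {total} {timestamp}"
            for name, total in sorted(self.counters.items())
        ]
        self.counters.clear()
        return lines


if __name__ == "__main__":
    agg = CounterAggregator()
    for pkt in (b"app.logins:1|c", b"app.logins:2|c", b"not a metric"):
        agg.handle_packet(pkt)
    print(agg.flush())
```

In a real deployment the flushed lines would be written to the archiving service on a timer. The aggregator itself never touches disk, which is what keeps it fast.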
Side note: if your application is written in python and you want to experiment with this stuff without touching your existing code base, have a look at the Tach application. This monkey-patches your python application and sends the output to statsd or graphite directly. Pretty cool. Although it was originally written for use with OpenStack, it can work with anything.
But the real insight here is a set of atomic, well focused tools that could be put together to create a monitoring stack. The tool chest of the devops team just expanded.
As our experience with statsd and graphite grew within the company, we also saw where the monitoring stack failed. A UDP-based approach won't work for billing or auditing. For these scenarios you need to have a reliable transport for events. In OpenStack we publish event notifications to AMQP queues for consumption by various other tools. These are important events, often with large payloads. When the StackTach application is unavailable these queues can grow very quickly, and we don't want to drop events. This is manageable for something like OpenStack Compute, but other applications like Storage produce an incredible amount of data across a wide range of servers. Using a notification-based system would be difficult. Instead we needed to look at syslog-based archiving and processing solutions. The new monitoring stack offers tools like LogStash and, in the OpenStack case, Slogging.
Then there is the post-processing. To add value to the raw events we often need to apply other functions to the data, such as time series averaging. This can be tricky. We need to wait for all the collected data to arrive before we can start the post-processing. We may need to ensure proper ordering. Historically this would be done with cron jobs and batch processing, but the new monitoring stack includes tools like Riemann which can do this post-processing inline.
It seems evident that Nagios isn't going anywhere any time soon, but there are some other tools offering alternatives, such as Shinken and Sensu.
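To show why waiting and ordering matter, here is a small sketch of inline time series averaging. It is plain Python, not Riemann's API, and every name and default in it (`WindowedAverager`, `window_seconds`, `allowed_lateness`) is an assumption chosen for illustration. Events are grouped by their own timestamp rather than arrival time, and a window is only emitted once a grace period for stragglers has passed.

```python
from collections import defaultdict


class WindowedAverager:
    """Averages values per fixed window, keyed by event time, not arrival time."""

    def __init__(self, window_seconds=10, allowed_lateness=5):
        self.window = window_seconds
        self.lateness = allowed_lateness
        self.buckets = defaultdict(list)  # window start -> collected values
        self.latest_seen = 0              # newest event time observed so far
        self.closed_before = 0            # windows ending at or before this are final

    def add(self, event_time, value):
        """Record an event; return (window_start, average) for windows that just closed."""
        start = event_time - (event_time % self.window)
        if start + self.window <= self.closed_before:
            return []  # too late: that window was already emitted, so drop the event
        self.buckets[start].append(value)
        self.latest_seen = max(self.latest_seen, event_time)
        return self._close_ready_windows()

    def _close_ready_windows(self):
        # A window closes once the newest event is past its end plus the grace period.
        cutoff = self.latest_seen - self.lateness
        ready = sorted(s for s in self.buckets if s + self.window <= cutoff)
        results = []
        for s in ready:
            values = self.buckets.pop(s)
            results.append((s, sum(values) / len(values)))
            self.closed_before = max(self.closed_before, s + self.window)
        return results


if __name__ == "__main__":
    avg = WindowedAverager()
    for t, v in [(1, 10.0), (12, 40.0), (8, 20.0), (16, 60.0)]:
        for window_start, mean in avg.add(t, v):
            print(window_start, mean)
```

The grace period is the trade-off: a longer one tolerates more reordering but delays every result, while a shorter one gives faster answers at the cost of dropping late events. A batch job sidesteps the question by waiting for everything, which is exactly the delay inline processing is trying to avoid.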
Recently our team has been working on bringing what we've learned with StackTach to the OpenStack-blessed monitoring solution called Ceilometer. Without standing back and looking at the larger monitoring community it would have been very easy to want to recreate an entire monitoring stack on our own. But now it's clear that we can focus on the minimal set of missing functionality and augment that from an already powerful set of tools. This is a very attractive proposition for one simple reason ... the project has an end in sight. There are lots of fun problems out there to tackle and knowing you don't have to reinvent the wheel is very compelling.
There is a cost though. The monitoring stack today consists of a variety of tools all written in different languages and each with different care and feeding instructions. One could argue that the workload on operations will only increase by mixing and matching. My knee-jerk reaction is to agree, but I know that the greater win is to get familiar with all of these new tools. In production, these monitoring tools need monitoring as well. So we may have to monitor Java, Ruby, Python and C# VMs running bytecode from a potential variety of languages.
If this all seems too daunting, perhaps the hosted offerings are a better choice for you. For nearly every open source offering there are hosted offerings. Look at loggly, papertrail, pagerduty, librato, datadog, hostedgraphite, boundary, new relic, etc.
This brings me back to Monitorama. The Monitorama conference had a format that worked very well for me for the following reasons:
1. It was only two days long.
2. Day One focused on hearing about "the state of the art" from industry leaders.
3. Day Two was tactical with the tools and included a hackathon which let you understand where the real-world pain lived in each of these components.
4. It was small enough so you could actually talk to people and have meaningful conversations.
The Day One talks made it clear that "Alert Fatigue" (a term borrowed from the medical industry) is a big problem. Too many alerts hitting our inbox. Some are important, most are noise. There are people working on it, but it's perhaps the biggest source of angst for operations currently.
Side story: for the hackathon I started work on a tool that allowed members of the company to track external events that might affect production. Things like sales events, big holidays, new customer deployments or internal events such as new code deployments, hardware upgrades, etc. The idea was to have these events show up on the spikes in the dashboard graphs so we could say "That spike was due to Foo and that ravine was due to Blah." I made some good progress for the day and then one of the other attendees showed me his side project Anthracite, which does all this and more. The author was sitting in the room next to me. What are the odds?
For a while I was getting disillusioned with this space because I saw it dominated by commercial solutions, or thought the problem was so big it would take a lifetime of work to build as open source. But now I see there are viable open source components and there is enough of the stack available that we can focus on some of the smaller missing pieces. Also, there is a smart community out there facing the exact same problems and actively working on solutions. There is a light at the end of the tunnel.
I may not attend http://monitorama.eu/ ... but I'll definitely be at the US one next year. But for now, I've got some products to learn.
(ps> if you made it this far, I wrote a little about monitoring from an OpenStack perspective here)



