We're close to releasing a big new feature for Hosted Graphite - the ability to define alerts that notify you when your metric data indicates something might be wrong with your infrastructure. You can choose to be notified via email, webhooks, PagerDuty, Slack, and now, HipChat. The alerting feature is in beta right now, but many teams are already using it. If you'd like early access to try it out, get in touch with our support team and we'll flip the switch for you.
When our friends at Atlassian asked if we'd be interested in building an add-on for HipChat, we weren't immediately sold on the idea. However, once we saw that we could embed parts of the Hosted Graphite experience inside HipChat, we paid closer attention!
We thought the new alerting feature would be a great place to start. Most of the notification methods are relatively straightforward - when something goes wrong, we notify you. When it's fixed, we notify you.
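The notify-on-change behaviour described above can be sketched as a small state-transition check. This is only an illustration, not how Hosted Graphite implements it: the `transition_message` function, the two state names, and the message wording are all assumptions made for the example.

```python
from typing import Optional

# Hypothetical state names, chosen for this sketch only.
HEALTHY = "healthy"
UNHEALTHY = "unhealthy"


def transition_message(alert_name: str, previous: str, current: str) -> Optional[str]:
    """Return a message only when the alert changes state, otherwise None."""
    if previous == current:
        # No change, so nothing to notify about.
        return None
    if current == UNHEALTHY:
        return f"ALERT: {alert_name} is unhealthy"
    return f"RECOVERED: {alert_name} is healthy again"
```

Notifying only on transitions, rather than on every evaluation, is one way to keep a chat room from filling up with repeated messages about the same problem.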
Nothing ground-breaking there, but with the HipChat Connect API we were able to offer a much richer experience by embedding parts of the Hosted Graphite product experience right into the HipChat interface. This is pretty powerful, and something that other chat tools don't offer.
Notifications
First, the basics. When an alert goes off, we post a notification to your HipChat room:
This is a good start - there's a thumbnail of a graph, a link to the full size graph, the metric name related to the alert, and the conditions that caused it to fire. This "card" view lets us pack a lot of information into a small space, making the notification as useful as possible and giving the signal-to-noise ratio a welcome boost.
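To make the contents of the card concrete, here is a rough sketch of the information it carries. The `AlertCard` class, its field names, and the sample values are made up for illustration; this is not the HipChat Connect card schema or the payload we actually send.

```python
from dataclasses import dataclass


@dataclass
class AlertCard:
    """Hypothetical model of the data shown in an alert notification card."""
    metric: str          # metric name related to the alert
    condition: str       # the condition that caused the alert to fire
    thumbnail_url: str   # small graph image shown inline
    graph_url: str       # link to the full-size graph

    def fallback_text(self) -> str:
        """One-line plain-text summary for clients that cannot render a card."""
        return f"{self.metric}: {self.condition} ({self.graph_url})"


# Example values are invented; example.com stands in for a real graph URL.
card = AlertCard(
    metric="servers.web01.cpu.load",
    condition="above 5.0 for 10 minutes",
    thumbnail_url="https://example.com/graphs/thumb.png",
    graph_url="https://example.com/graphs/full.png",
)
print(card.fallback_text())
```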
Keeping up with the chaos
One of the challenges of ChatOps is keeping everyone on the same page during a chaotic incident, without everyone having to read every single thing that's said and done in the chat room. We found that the HipChat "glance" feature provides an excellent way to communicate real-time, high-level status information to everyone in the room by embedding a small widget into the right-hand panel that sits next to the conversation. The picture on the right shows what it looks like when everything is healthy.
Unhealthy looks like this:
Context switching
Context switches are expensive - flipping between multiple tools always adds overhead, so the more you can do in a single tool, the less time you'll waste. Seeing the overall state of the infrastructure at a glance is useful, but we wanted more. If there's something wrong, you need to know which things are wrong before you can act. If you click the "glance" inside HipChat, the sidebar changes to a filterable list of all your alerts, showing which ones are unhealthy and what the alert conditions are.
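As a rough model of that filterable list, the sketch below filters alerts by name and sorts unhealthy ones to the top. The `Alert` class and `filter_alerts` function are hypothetical names invented for this example, and the ordering is an assumption rather than a description of the add-on's actual behaviour.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Alert:
    """Hypothetical alert record: a name, its condition, and current health."""
    name: str
    condition: str
    healthy: bool


def filter_alerts(alerts: List[Alert], query: str = "",
                  unhealthy_only: bool = False) -> List[Alert]:
    """Return alerts whose name contains `query`, unhealthy ones first."""
    needle = query.strip().lower()
    matches = [
        a for a in alerts
        if needle in a.name.lower() and not (unhealthy_only and a.healthy)
    ]
    # False sorts before True, so unhealthy alerts come first, then by name.
    return sorted(matches, key=lambda a: (a.healthy, a.name))
```

For example, calling `filter_alerts(alerts, "web")` would keep only alerts with "web" in their name, with any unhealthy ones listed before the healthy ones.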
With other tools you'd have to switch to another browser tab and navigate the UI to get a high-level view of the state of your infrastructure, and that's assuming you're already logged in. Having the information right next to the relevant discussion is powerful, and it's available for your entire team, which adds up to a lot of saved time.
Acting quickly
Now that you know what's alerting, the next step is usually to check one of your dashboards to get the full picture, look for correlations, gauge the magnitude of the problem, and so on. For that, you can jump right from HipChat to a Hosted Graphite dashboard. For teams with a large number of dashboards, there's a quick filtering box.
These dashboard links include a Hosted Graphite access key, so everyone in the room gets one-click read-only access as quickly as possible to help them diagnose the problem. They'll need to log in to make any changes, of course.
Summary
Using our HipChat add-on is the richest way to take advantage of our new alerting feature. It keeps your team informed, improves the signal-to-noise ratio in your chat rooms, reduces context switching, and lowers time-to-resolution, which directly impacts your customers.
Want to try it out? You can install the Hosted Graphite for HipChat integration in the Atlassian Marketplace.