Status change alert

Jun 21, 2012 at 6:45 PM

An alert when a status changes from bad to good.  Right now, I have email notifications set to only send failures (which would be common).  A service starts failing, I get emails.  When it starts working again, I'd like a single email that "hey, I'm back up", and then quiet again.

Jun 22, 2012 at 7:30 AM
Edited Jun 22, 2012 at 7:31 AM

I understand a constant stream of alerts could be a bad thing - if something starts failing, continually sending failure messages can get annoying and is counter-productive. Conversely, if something starts working again you don't want a persistent stream of alerts telling you everything is ok!

So - my thoughts are this. Presently there is a flag, found on some but not all healthchecks, called "PublishOnlyIfFailure" - it's a boolean and the idea was to help shape when a check issued alerts. I propose to make an enum, "AlertMode", that will allow the Wolfpack infrastructure to take care of alert shaping for each check; all the check would have to do is tell Wolfpack what mode it is configured for - eg: "FailureOnly", "SuccessOnly".

I would also introduce some new modes to support this...

  • "Default" - sends an alert every time the check runs regardless of result
  • "FailureOnly" - sends an alert for every result failure
  • "SuccessOnly" - sends an alert for every result success
  • "StateChange" - only raise an alert when a check flips its result state, eg: failed -> success, success -> failure.
  • "StateChangeFailureNag" - as per "StateChange" but it will continue to send failure alerts if it remains in the failed state.
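
A minimal sketch of how the modes listed above could reduce to one decision, assuming each result boils down to pass/fail and the infrastructure remembers the previous result of the check. The mode names come from the list; `AlertShaper`, `ShouldAlert`, and the handling of a check's very first result are illustrative assumptions, not existing Wolfpack code.

```csharp
using System;

// Hypothetical enum mirroring the modes proposed above.
public enum AlertMode
{
    Default,
    FailureOnly,
    SuccessOnly,
    StateChange,
    StateChangeFailureNag
}

public static class AlertShaper
{
    // previous is null when the check has not reported a result yet.
    // Assumption: a first result counts as a state change only if it is a failure.
    public static bool ShouldAlert(AlertMode mode, bool? previous, bool current)
    {
        bool changed = previous.HasValue ? previous.Value != current : !current;

        switch (mode)
        {
            case AlertMode.Default: return true;             // every run
            case AlertMode.FailureOnly: return !current;     // every failure
            case AlertMode.SuccessOnly: return current;      // every success
            case AlertMode.StateChange: return changed;      // flips only
            case AlertMode.StateChangeFailureNag:
                return changed || !current;                  // flips, plus repeated failures
            default: throw new ArgumentOutOfRangeException("mode");
        }
    }
}
```

With this shape, the "I'm back up" email from the original request is just `StateChange` seeing a failed -> success flip; a second success in a row produces nothing.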

Final thought is to make this infrastructure plugin based/customisable, so if you had some whacky alert shaping logic you could roll your own and drop it in. My initial thoughts are that the existing Result Publisher Filters feature could be used as a base for this.

Ok, so I'll open this one up - any thoughts on how this should work?

Jun 22, 2012 at 7:48 PM

This is something I have been wanting but have not had time to really give attention to. I like all the states you have listed, and the ability to do a custom plugin when you've got something weird is perfect.

It would also be nice to have a delay state, meaning that once the state changes the check could switch to a different schedule. That way you still get some feedback over time but you don't get crushed by some check that is running every minute or worse. Then it would go back to the normal schedule once the check was good again.

I like this approach because you will know if, say, a site starts to waver and then bang, your server drops. If you only get alerts on state change, you may not know that it's not just down but on fire.
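
To make the "delay state" idea concrete, here is a rough sketch of a schedule that picks the next run interval from the last result. `AdaptiveSchedule` and `NextInterval` are made-up names for illustration only; nothing like this is assumed to exist in Wolfpack.

```csharp
using System;

// Hypothetical helper: one interval while healthy, a slower one while failing.
public sealed class AdaptiveSchedule
{
    private readonly TimeSpan _normal;
    private readonly TimeSpan _whileFailing;

    public AdaptiveSchedule(TimeSpan normal, TimeSpan whileFailing)
    {
        _normal = normal;
        _whileFailing = whileFailing;
    }

    // Returns the wait before the next run; falls back to the normal
    // interval as soon as the check succeeds again.
    public TimeSpan NextInterval(bool lastRunSucceeded)
    {
        return lastRunSucceeded ? _normal : _whileFailing;
    }
}
```

For example, `new AdaptiveSchedule(TimeSpan.FromMinutes(1), TimeSpan.FromMinutes(15))` would keep a failing check reporting every 15 minutes instead of every minute.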

Jun 25, 2012 at 1:01 PM

Ok, I like the "change schedule" thing too - like you say once something fails there is no need for a torrent of failures to come storming in from a high frequency failing check.

So - my approach is to modify the Agent class - currently it acts as a message hub - health check messages are received by the Agent, then it augments the data (with agent metadata) then forwards the messages onto the publishers.

My idea is to formalise the role and make a proper "Message Hub" component that will track the alert history of each component and shape the alerts based on each check's "mode".
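
As a rough illustration of the hub idea, the sketch below keeps the last result per check and forwards a message only when the state flips. It assumes results are plain pass/fail and that publishers can be modelled as a single callback; `MessageHubSketch` and `Receive` are invented names, and a real hub would pick the shaping rule from each check's mode rather than hard-coding one.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical hub: remembers the last result of every check by name.
public sealed class MessageHubSketch
{
    private readonly Dictionary<string, bool> _lastResult = new Dictionary<string, bool>();
    private readonly Action<string, bool> _publish;

    public MessageHubSketch(Action<string, bool> publish)
    {
        _publish = publish;
    }

    // Records the result and forwards it only on a state change.
    public void Receive(string checkName, bool succeeded)
    {
        bool previous;
        bool seenBefore = _lastResult.TryGetValue(checkName, out previous);
        _lastResult[checkName] = succeeded;

        // Assumption: a first-ever result is forwarded only if it is a failure.
        bool flipped = seenBefore ? previous != succeeded : !succeeded;
        if (flipped)
        {
            _publish(checkName, succeeded);
        }
    }
}
```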

Work has started - should have something up and running soon.

Jun 25, 2012 at 11:07 PM

Very cool, looking forward to using the new feature.

Jun 27, 2012 at 11:26 PM

ok, all done and code checked into repo. I have modified one check "WmiProcessRunningCheck" so far, its config has a new property "NotificationMode", values are,

SuccessOnly
FailureOnly
StateChange
StateChangeNagFail


Have a play - the StateChangeNagFail one also supports slowing down of failure alerts as more alerts are generated. I'll release it soon once it's had some decent testing.

Cheers,

James

Jun 27, 2012 at 11:38 PM

...and if you want to trace how it works then stick a breakpoint in the new Core\NotificationHub component - this will take you to the action!

Jun 28, 2012 at 9:34 PM

James, I pulled the code and compiled. Now I am looking at the Filters.Notifications. I also looked at the class and config for the WmiProcessRunningCheck. I see that there is a new config setting and that the value would be one of the 4 above. I don't really have the time now to work out the details of setting this up from the code. Can you give me a specific config that shows how this could be used to produce a check that would have an adjusted interval once it has entered the failed state?

Thanks, John

Jun 28, 2012 at 9:53 PM
Edited Jul 9, 2012 at 2:43 PM

ok, so the filter that fits this best is "StateChangeNagFail" - its characteristics are....

  • It will send an alert if it flips state (fail -> success, success -> fail)
  • If it remains in a failed state, the filter will start to reduce the frequency of the alerts as the number of failures grows....

Core\Filters\Notification\StateChangeNagFailNotificationFilter.cs...

Look at the constructor...

        public StateChangeNagFailNotificationFilter()
            : this(new KeyValuePair<int, int>(0, 1),
            new KeyValuePair<int, int>(3, 3),
            new KeyValuePair<int, int>(5, 10),
            new KeyValuePair<int, int>(10, 60))

The list of KeyValuePairs represents the attempt band/minutes separation....so the above is...

0 - 2 attempts => alert every 1 minute

3 - 4 attempts => alert every 3 minutes

5 - 9 attempts => ...every 10 mins

10+ attempts => ...every 60 mins

So, for example, on the 11th attempt, which could be 1 minute after the 10th attempt, no alert would occur as the last one was not > 60 mins ago.
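
The band lookup described above can be sketched roughly as follows. This is not the filter's actual code: `NagSchedule`, `MinutesBetweenAlerts`, and `ShouldAlert` are illustrative names, the attempt count is assumed to be non-negative, and the strict "more than N minutes" comparison is taken from the wording above and may not match the real implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class NagSchedule
{
    // Bands copied from the constructor above:
    // Key = attempt count where the band starts, Value = minutes between alerts.
    public static readonly KeyValuePair<int, int>[] DefaultBands =
    {
        new KeyValuePair<int, int>(0, 1),
        new KeyValuePair<int, int>(3, 3),
        new KeyValuePair<int, int>(5, 10),
        new KeyValuePair<int, int>(10, 60)
    };

    // Picks the band with the highest start that does not exceed the attempt count.
    public static int MinutesBetweenAlerts(int attempts, IEnumerable<KeyValuePair<int, int>> bands)
    {
        return bands.Where(b => b.Key <= attempts)
                    .OrderByDescending(b => b.Key)
                    .First()
                    .Value;
    }

    // True when the last alert is older than the gap required by the current band.
    public static bool ShouldAlert(int attempts, DateTime lastAlertUtc, DateTime nowUtc,
                                   IEnumerable<KeyValuePair<int, int>> bands)
    {
        TimeSpan gap = TimeSpan.FromMinutes(MinutesBetweenAlerts(attempts, bands));
        return nowUtc - lastAlertUtc > gap;
    }
}
```

With the default bands, attempt 11 falls in the 10+ band, so a failure arriving one minute after the previous alert is suppressed, while one arriving 61 minutes later goes out.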

To change the schedule, inherit from StateChangeNagFailNotificationFilter and pass a different set of KeyValuePairs to the constructor. Just return a unique "Mode" and set the "NotificationMode" property on the check to the same value...bingo, your custom filter is running with the bespoke schedule.
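
A sketch of what such a subclass might look like. To keep it self-contained it derives from a stand-in base class rather than the real StateChangeNagFailNotificationFilter, whose exact signatures are not shown in this thread: the `params` constructor, the virtual string `Mode` property, the `Bands` property, the "SlowNag" mode name, and the band values are all assumptions for illustration.

```csharp
using System.Collections.Generic;

// Stand-in for the real filter base class; its shape here is assumed.
public class NagFailFilterStandIn
{
    public IList<KeyValuePair<int, int>> Bands { get; private set; }

    public NagFailFilterStandIn(params KeyValuePair<int, int>[] bands)
    {
        Bands = bands;
    }

    public virtual string Mode
    {
        get { return "StateChangeNagFail"; }
    }
}

// Hypothetical custom filter with a slower, bespoke schedule.
public class SlowNagNotificationFilter : NagFailFilterStandIn
{
    public SlowNagNotificationFilter()
        : base(new KeyValuePair<int, int>(0, 5),    // 0 - 4 attempts: every 5 mins
               new KeyValuePair<int, int>(5, 30),   // 5 - 19 attempts: every 30 mins
               new KeyValuePair<int, int>(20, 240)) // 20+ attempts: every 4 hours
    {
    }

    // Unique mode name; the check's "NotificationMode" would be set to this value.
    public override string Mode
    {
        get { return "SlowNag"; }
    }
}
```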

I'm keen to get feedback on the default schedule (KeyValuePairs) I have set up - is it the right shape for a "default"?