Problems setting up Agent-Agent publish

Sep 16, 2014 at 9:07 PM
Hey, loving v3 so far. I think it's going to be very helpful in my systems monitoring.

I'm trying to set up a web service publisher to build a hierarchy of agents. However, I'm getting an odd error from deep in Castle Windsor when I add my new publisher config. Adding a custom publisher and a SQL publisher went great, perfect textbook examples. But this publisher has thrown me for a loop...

It looks like:

Topshelf.Hosts.ConsoleRunHost Error: 0 : An exception occurred, System.InvalidCastException: At least one element in the
source array could not be cast down to the destination array type.

The entire publisher config looks like this:

{
  "Name": "KKDEVWolfpackPublisher",
  "Description": "This activity is used to publish notifications to another Wolfpack instance via the Web REST Api.",
  "InterfaceType": null,
  "PluginType": "Wolfpack.Core.WebServices.Publisher.WebServicePublisher, Wolfpack.Core.WebServices",
  "ConfigurationType": "Wolfpack.Core.WebServices.Interfaces.Entities.WebServicePublisherConfig, Wolfpack.Core.WebServices.Interfaces",
  "Data": "{\"BaseFolder\":\"_outbox\",\"SendIntervalInSeconds\":10,\"UserAgent\":\"\",\"BaseUrl\":\"http://kkdeveloper:802/\",\"ApiKey\":\"\",\"FriendlyId\":\"KKDEVWolfpackPublisher\",\"Enabled\":true}",
  "Tags": [
    "Activity",
    "WebService",
    "Publisher"
  ],
  "Link": null,
  "RequiredProperties": {
    "Name": "KKDEVWolfpackPublisher"
  }
}
Sep 16, 2014 at 9:47 PM
Oh jeez, how confused I am. My problem was that this needs to be in the Config\Activities folder, not the Config\Publishers folder!
Coordinator
Sep 16, 2014 at 10:37 PM
Hi - welcome on board!

How are you configuring the web components? If you go through the UI it should work OK - if you are going through the UI and it's not working then I'll take a look. Each type of component has a dedicated loader, so 'crossing the streams' will definitely cause problems!

V3.1 has a mini overhaul of the config to simplify things - I'll also be adding docs about the config discovery features.

I appreciate any and all feedback, and welcome contributions of custom plugins that might benefit the community - if you write anything and want to release it, I'd be happy to help package it and assist.

Cheers,
James
Sep 16, 2014 at 11:04 PM
Thanks James!

I used the UI after the fact, which is how I found I'd chosen the wrong location. I am trying to build deployable solutions for my company, so I can use the UI but afterwards will want to copy the file into a project that gets deployed by Octopus.

If I make any plugins that are pretty generic I will consider releasing them.

I have the agent-agent publishing working now. I'm done for the day, but two issues have come up. There is a bug somewhere when the notification is serialized: the GeneratedOnUtc field was sent as UTC, but gets set to local time. I'll find it tomorrow.

The other thing is that the stale message exception causes the notification to never be consumed, since the web service leaks the exception to the client. The client then keeps retrying, but the notification never gets "less stale".

-jerry
Sep 17, 2014 at 1:50 PM
What I found is that MessageStalenessCheckStep.Execute is not accounting for GeneratedOnUtc being deserialized as local time. The string form of the serialized time is local time plus a timezone offset. For example:

2014-09-16 17:15:03,208 [6] INFO Wolfpack [(null)] - Received Notification (HealthCheck) 3ee44382-de4e-4904-b1ce-1a45c4180eec
{
"Id": "3ee44382-de4e-4904-b1ce-1a45c4180eec",
"EventType": "HealthCheck",
"SiteId": "JALBRO",
"AgentId": "Agent1",
"CheckId": "IsNotepadRunning",
"Message": "There are 0 instances of process 'notepad.exe' on localhost",
"CriticalFailure": false,
"CriticalFailureDetails": null,
"Result": false,
"ResultCount": 0.0,
"DisplayUnit": null,
"GeneratedOnUtc": "2014-09-16T17:14:58.7417862-04:00",
"ReceivedOnUtc": "2014-09-16T21:15:03.2085245Z",

...
It may be arguable that DateTime.UtcNow.Subtract should check the DateTimeKind of the operands, but it appears to not do so. GeneratedOnUtc is effectively correct, but it has a Kind of DateTimeKind.Local.
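To illustrate, here is a minimal, self-contained sketch (not Wolfpack code): Subtract compares raw tick values and ignores DateTimeKind, so a Kind=Local timestamp looks older (or newer) by the machine's UTC offset.

    // Minimal demo: DateTime.Subtract ignores DateTimeKind and compares raw ticks.
    using System;

    class DateTimeKindDemo
    {
        static void Main()
        {
            var utcNow = DateTime.UtcNow;   // Kind = Utc
            var localNow = DateTime.Now;    // the same instant, Kind = Local

            // Prints roughly the machine's UTC offset in minutes (e.g. ~240 on a UTC-4 box),
            // even though both values represent "now".
            Console.WriteLine(utcNow.Subtract(localNow).TotalMinutes);

            // Converting to UTC first yields the true difference: ~0 minutes.
            Console.WriteLine(utcNow.Subtract(localNow.ToUniversalTime()).TotalMinutes);
        }
    }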

My recommended fix is simply to change MessageStalenessCheckStep.Execute to use GeneratedOnUtc.ToUniversalTime() in the subtraction. This then yields the correct delta-T:
    public override void Execute(WebServiceReceiverContext context)
    {
        // converting to UTC first means the comparison ignores the Local Kind set by deserialization
        if (DateTime.UtcNow.Subtract(context.Notification.GeneratedOnUtc.ToUniversalTime()).TotalMinutes > _config.MaxAgeInMinutes)
        {
            // ... existing stale-message handling unchanged ...
Another way to correct this is to update NotificationModule.cs to convert GeneratedOnUtc to UTC (which also fixes its Kind)...
        Post["/notify"] = request =>
        {
            var message = this.Bind<NotificationEvent>();
            // normalise to UTC so the Kind is Utc for all downstream checks
            message.GeneratedOnUtc = message.GeneratedOnUtc.ToUniversalTime();
            message.State = MessageStateTypes.Delivered;
            message.ReceivedOnUtc = DateTime.UtcNow;
In fact, doing both works, since once GeneratedOnUtc has its Kind set to Utc the ToUniversalTime() call in the staleness check becomes a no-op.


-jerry
Coordinator
Sep 17, 2014 at 2:19 PM
Cool - thanks for the feedback on the agent-agent mechanism.

I'll add the utc fix into the next build.

As for the stale message "workflow", I'll take a look and revise - as you say, it will never get less stale and will continue to fail and retry ad infinitum. I'll probably change it to consume the message but park it somewhere, "dead-letter" style - something like that (suggestions welcome).
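For what it's worth, a rough sketch of the "consume but park" shape - none of these names are existing Wolfpack types except NotificationEvent, the folder and config values are invented, and Newtonsoft.Json is assumed for serialisation:

    using System;
    using System.IO;
    using Newtonsoft.Json;

    // Hypothetical step: accept fresh notifications, but "dead-letter" stale ones
    // to disk instead of throwing back to the publishing agent.
    public class StaleMessageDeadLetterStep
    {
        private readonly int _maxAgeInMinutes;      // e.g. the existing MaxAgeInMinutes config value
        private readonly string _deadLetterFolder;  // e.g. "_deadletter"

        public StaleMessageDeadLetterStep(int maxAgeInMinutes, string deadLetterFolder)
        {
            _maxAgeInMinutes = maxAgeInMinutes;
            _deadLetterFolder = deadLetterFolder;
        }

        // Returns true if the notification should continue through the pipeline.
        public bool TryAccept(NotificationEvent message)
        {
            var age = DateTime.UtcNow.Subtract(message.GeneratedOnUtc.ToUniversalTime());
            if (age.TotalMinutes <= _maxAgeInMinutes)
                return true;

            // Stale: acknowledge it (so the client stops retrying) but park it
            // on disk for later inspection rather than processing it.
            Directory.CreateDirectory(_deadLetterFolder);
            var path = Path.Combine(_deadLetterFolder, message.Id + ".json");
            File.WriteAllText(path, JsonConvert.SerializeObject(message));
            return false;
        }
    }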

I'd be keen to hear any suggestions for improving the configuration/helpers/plugins for managing deployment of a large number of agents - I confess I've never had to deploy or manage large numbers of agents, but I'm aware it could be a pain and I'd be looking to add backlog items/features to smooth this over.

Thanks,
James
Sep 17, 2014 at 3:13 PM
I'll provide info and feedback as I go. My planned scenario is about 50 physical sites; each site has a main server and about 15 to 20 other machines. Each machine will run Wolfpack with no publishers and just feed up to the site server. The site server has its own set of health checks and a SQL publisher, and feeds up to a central corporate server running Wolfpack. That will publish to SQL and (filtered) to an Alert Manager server. The Alert Manager server provides a REST service to accept alerts and generates email and text message alerts.

By the way, believe it or not, I worked on the monitoring team in Microsoft Office 365. I was quite busy developing certain aspects of the Active Monitoring (or Managed Availability) systems.

One concept a wee bit different from what Wolfpack has is Responders (instead of publishers). You could attach a chain of responders to a health check. The basic pattern: the first responder (if the error is less than, say, half an hour old) might restart a service to attempt to correct the failure. The second responder would only run if the error was still present after, say, 1 hour, and would reboot the server. The third-level responder would only run if the error had been there for 1.5 hours and would page the on-call engineer.
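To make the pattern concrete, a purely illustrative sketch - none of these types exist in Wolfpack, and NotificationEvent is only borrowed here as the failure payload; it just shows time-thresholded responders attached to one check:

    using System;
    using System.Collections.Generic;
    using System.Linq;

    // Illustrative only: a responder fires once a check has been failing longer
    // than its escalation threshold.
    public interface IResponder
    {
        TimeSpan EscalateAfter { get; }
        void Respond(NotificationEvent failure);
    }

    public class ResponderChain
    {
        private readonly List<IResponder> _responders;

        public ResponderChain(IEnumerable<IResponder> responders)
        {
            // e.g. restart-service at 0 min, reboot-server at 60 min, page-on-call at 90 min
            _responders = responders.OrderBy(r => r.EscalateAfter).ToList();
        }

        public void Handle(NotificationEvent failure, DateTime firstFailureUtc)
        {
            var failingFor = DateTime.UtcNow - firstFailureUtc;

            // Run the most severe responder whose threshold has been reached;
            // as the failure persists, later calls escalate to the next level.
            var responder = _responders.LastOrDefault(r => failingFor >= r.EscalateAfter);
            if (responder != null)
                responder.Respond(failure);
        }
    }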

The other change that evolved from Exchange 2010 to Exchange 2013 was that monitoring shifted from looking for a vast array of failure indicators (e.g. event log entries) to a more black-box "is the service working for the customer" point of view. For example, send an email and see if it ends up in a target test mailbox.

There is a LOT more to it though. See http://technet.microsoft.com/en-us/library/dn482056.aspx
Coordinator
Sep 17, 2014 at 3:51 PM
Awesome stuff!! The responders concept makes a lot of sense - I've had a feeling that "publisher/publishing" was a restricted viewpoint for handling a notification, as I've created several components (in the build analytics plugins) that are publishers but don't actually act as publishers... so I've been aware that I'm bending things a little. Now you mention "responders" it fits much better, particularly as it also provides a solution to the escalation problem.

I'll certainly look at the notification hub/internals to accommodate something like "chaining responders" once I get the next release out. I quite like the idea of making the configuration more notification centric rather than setting things up globally.

Thanks for the info/feedback - much appreciated!
Sep 18, 2014 at 5:47 PM
Maybe we should start a new thread to expand on this topic. Anyway, this may clarify what Exchange is doing. http://blogs.technet.com/b/exchange/archive/2013/07/16/managed-availability-monitors.aspx

Basically,
  1. Probes run some test and write ProbeResults.
  2. Monitors check ProbeResults, then write MonitorResults. A MonitorResult reports not merely healthy/unhealthy; there are different levels of unhealthy, depending upon how long the monitor has been in an unhealthy state.
  3. Responders check MonitorResults, and if the MonitorResult's unhealthy state matches the target state of the responder (and the responder is not throttled) the responder executes - see the rough sketch below.
In some ways Exchange Managed Availability is overly complex, and I don't think using the Crimson event log channels as pseudo-databases was a great decision, but overall it's a well-thought-out and effective system.
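A rough, hypothetical sketch of that probe -> monitor -> responder flow (not Exchange or Wolfpack code; all names and thresholds are invented), just to show how the stages communicate through result records:

    using System;
    using System.Collections.Generic;
    using System.Linq;

    // 1. Probes run a test and write ProbeResults.
    public class ProbeResult
    {
        public string CheckId;
        public bool Passed;
        public DateTime TimestampUtc;
    }

    public enum HealthState { Healthy, Degraded, Unhealthy, Unrecoverable }

    // 2. Monitors turn ProbeResults into a MonitorResult whose severity grows
    //    the longer the check has been failing.
    public class MonitorResult
    {
        public string CheckId;
        public HealthState State;
        public TimeSpan UnhealthyFor;
    }

    public class Monitor
    {
        // Assumes at least one ProbeResult for the check.
        public MonitorResult Evaluate(IList<ProbeResult> results)
        {
            var lastPass = results.Where(r => r.Passed)
                                  .Select(r => r.TimestampUtc)
                                  .DefaultIfEmpty(DateTime.MinValue)
                                  .Max();
            var unhealthyFor = DateTime.UtcNow - lastPass;

            var state = unhealthyFor < TimeSpan.FromMinutes(30) ? HealthState.Healthy
                      : unhealthyFor < TimeSpan.FromMinutes(60) ? HealthState.Degraded
                      : unhealthyFor < TimeSpan.FromMinutes(90) ? HealthState.Unhealthy
                      : HealthState.Unrecoverable;

            return new MonitorResult { CheckId = results[0].CheckId, State = state, UnhealthyFor = unhealthyFor };
        }
    }

    // 3. Responders act only when the MonitorResult matches their target state
    //    (throttling omitted for brevity).
    public class Responder
    {
        public HealthState TargetState;
        public Action<MonitorResult> Action;

        public void MaybeRespond(MonitorResult result)
        {
            if (result.State == TargetState && Action != null)
                Action(result);
        }
    }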