Monday, April 30, 2012

Partial Blackout - Stop Failed / Modify Failed

Recently, we had a major network maintenance window. I knew that all servers would lose their connection completely at some point, so to avoid being flooded by Enterprise Manager mails, I put all targets into blackout (EM version: 11.1.0.1).

When everything was done, I stopped the blackout from the EM console. What I got was:

Partial Blackout - Stop Failed.

One agent was reported as unavailable, so the end of the blackout could not be propagated to it. But when I checked the agent's state on the server, everything was fine: "emctl status agent" showed OK, "emctl upload" worked without problems, and even resecuring the agent was successful.

So I hoped that it might only be a problem with the oracle_emd (agent) target. I clicked "Edit" in the console, changed the blackout from "Full" to "Selected Targets", and unchecked everything except the agent targets. What I got now was:

Partial Blackout - Modify Failed

That did not help either.

I then realized that during our maintenance, some failover instances had been relocated to another host for load balancing reasons, but their respective oracle_database targets were still assigned to the node that showed as unavailable. It seemed the relocation script had not worked for the EM targets for some reason. Well, at least those targets should be monitorable again once moved to their current, available nodes, so I relocated them using emcli:

./emcli relocate_targets -target_name=[database target name] -target_type=oracle_database -src_agent=[unavailable host]:3872 -dest_agent=[current available host]:3872 -copy_from_src

Once I had done that for all misplaced targets, the blackout on the "unavailable" host could be stopped without a problem.
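With more than a handful of misplaced targets, repeating that command by hand gets tedious. Below is a minimal sketch of how it could be wrapped in a small shell loop. It only reuses the relocate_targets options shown above; the host names, the target names, and the helpers relocate_cmd and relocate_all are hypothetical placeholders of mine, not part of EM. By default it runs in dry-run mode and only prints the commands, so they can be reviewed before anything is changed.

```shell
#!/bin/sh
# Sketch: relocate several oracle_database targets from one agent to another.
# All values below are hypothetical placeholders; replace them with your own.
SRC_AGENT="${SRC_AGENT:-oldhost.example.com:3872}"   # agent shown as unavailable
DEST_AGENT="${DEST_AGENT:-newhost.example.com:3872}" # agent on the current node
EMCLI="${EMCLI:-./emcli}"                            # path to the emcli client
DRY_RUN="${DRY_RUN:-1}"                              # 1 = print only, 0 = execute

# Print the emcli command line for a single target name.
relocate_cmd() {
    printf '%s relocate_targets -target_name=%s -target_type=oracle_database -src_agent=%s -dest_agent=%s -copy_from_src\n' \
        "$EMCLI" "$1" "$SRC_AGENT" "$DEST_AGENT"
}

# Relocate (or, in dry-run mode, just print) every target passed as argument.
relocate_all() {
    for target in "$@"; do
        if [ "$DRY_RUN" = "1" ]; then
            relocate_cmd "$target"
        else
            # Stop at the first failure so a broken relocation is noticed.
            "$EMCLI" relocate_targets \
                -target_name="$target" \
                -target_type=oracle_database \
                -src_agent="$SRC_AGENT" \
                -dest_agent="$DEST_AGENT" \
                -copy_from_src || return 1
        fi
    done
}

# Example (hypothetical target names):
#   relocate_all DB1 DB2
```

Calling relocate_all DB1 DB2 with the defaults prints one command per target. Setting DRY_RUN=0 actually runs them, assuming the emcli client is already set up and logged in.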

Lessons learned:
  1. "Agent unavailable" does not necessarily mean that the agent is really unavailable.
  2. The blackout system in EM can be a mess. This is not the first problem I have had with it (zombie blackouts from the past that simply activate again are more frequent), just the one that took me the longest to diagnose and that had the least helpful error messages.
  3. Always check that your targets are associated with the correct monitoring agent. Otherwise you can get weird EM behavior without meaningful error messages.
