Sane patch schedule for Windows 2003 cluster

We run a coarse-grained compute cluster of 75 Win2k3 nodes at work. The cluster is behind a mountain of firewalls and resides in its own VLAN. Jobs of all sizes and types run on the cluster, and all of the executables are custom-made.

(ed: additional notes on our executables) The jobs range from 30 seconds to 7 days in duration, and may contain one executable or 2000 sub-jobs (of short duration). Obviously we are trying to avoid the situation where IT schedules a reboot during a 7-day production job.

We have scheduling software which accommodates all of the normal tasks for a coarse-grained cluster, and we can control which machines are active for submission, etc. If WSUS were in some way scriptable (or the client could state its availability for shutdown), we could coordinate the two systems and help out.

Currently, the patch schedule is the Sunday after Patch Tuesday, regardless of what is running on the cluster. We have to ask for an exemption every time we want to delay patching a machine for a long-running production job. Basically, while our group is responsible for the machines, we have little control over IT's patch schedule.

  1. Is patching monthly with MS's schedule sane for a production Windows cluster?
  2. Are there software hooks in WSUS where we could say, "please don't reboot just yet"?

Solution 1:

1. Is patching monthly with MS's schedule sane for a production Windows cluster?

Yes. However, a cluster should not have any downtime associated with a patch, as it should fail the jobs over to another node. I would NOT patch the entire cluster at the same time (that would be insane).

2. Are there software hooks in WSUS where we could say, "please don't reboot just yet"?

There is no way for end users to stop a WSUS update or reboot, but it sounds to me like you have a real communication problem between your group and the IT group. That said, you should be able to lose one node at a time with little impact on production.

Solution 2:

By using Config Mgr to manage the deployment of updates, you can stop the servers from rebooting. Updates are applied (but might not take effect until a reboot), and IT will have reports showing which servers are pending a reboot. They can easily give you this list, and I expect that you can hand-schedule the reboots of particular nodes without too much interruption. IT can also have a failsafe deployment (with forced reboots) and a long deadline, so that the updates and reboots are ultimately forced should you fail to keep your side of the bargain!

For the standard update deployments, IT (and you) will probably want a very short deadline on a totally silent (non-rebooting) deployment, plus a slightly longer deadline on a deployment that isn't silent, so you will see a notification if you log in to the server. Neither of these deployments should force the reboot.

You might still come across the situation where something fails because a library or other code component was updated while not in use, and then gets used before a reboot has made the rest of the updates take effect.

This is an efficient way to get what you and IT want and each of you has some visibility of what is going on. The reporting of which servers are in what state according to the deployments is really useful for both of you as well.
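
To make the hand-scheduling idea a bit more concrete, here is a minimal Python sketch of how you might combine IT's pending-reboot list with what your scheduler knows about running jobs. Everything in it is hypothetical: the `Node` record, the `pick_reboot_batch` function and the node names are made up for illustration, and you would have to fill the two inputs from IT's report and from your own scheduling software.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    pending_reboot: bool  # taken from IT's pending-reboot report (assumed input)
    running_jobs: int     # taken from the cluster scheduler (assumed input)


def pick_reboot_batch(nodes, max_offline=1):
    """Return names of nodes that are patched, idle, and safe to reboot now.

    At most `max_offline` names are returned, so a single pass never takes
    more than that many nodes out of the cluster.
    """
    candidates = [n for n in nodes if n.pending_reboot and n.running_jobs == 0]
    candidates.sort(key=lambda n: n.name)  # deterministic order
    return [n.name for n in candidates[:max_offline]]


if __name__ == "__main__":
    cluster = [
        Node("node01", pending_reboot=True, running_jobs=0),
        Node("node02", pending_reboot=True, running_jobs=3),
        Node("node03", pending_reboot=False, running_jobs=0),
        Node("node04", pending_reboot=True, running_jobs=0),
    ]
    print(pick_reboot_batch(cluster, max_offline=1))
```

In practice you would first mark the chosen node as inactive for submission in the scheduler, reboot it, and re-enable it once it is back, then repeat until the pending list is empty.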

Solution 3:

Sounds like you're getting a lot of 'talk to the hand' attitude from your IT Department. You need to sit them down (or bribe them with beer?), explain your situation, and see if they can do something like create a downstream WSUS server with manual patch approvals.

The settings for WSUS are all set by Group Policy, applied in Active Directory at the domain or OU level. If the servers are on the corporate domain with no separate OU, then they get what everyone else is getting, which doesn't sound like it's appropriate.
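
For reference, those Group Policy settings end up as registry values on each client under `HKLM\SOFTWARE\Policies\Microsoft\Windows\WindowsUpdate\AU`, with `AUOptions` and `NoAutoRebootWithLoggedOnUsers` being the two most relevant to reboots. The rough Python sketch below only interprets a dictionary of such values so you can reason about what a separate OU policy might give you. The `describe_au_policy` helper is hypothetical, it does not read the registry itself, and it ignores deadlines set on the WSUS side, which as far as I know can still force an install regardless of these values.

```python
# Meaning of the AUOptions policy value (Automatic Updates configuration).
AU_OPTIONS = {
    2: "notify before download",
    3: "download automatically, notify before install",
    4: "download automatically, install on schedule",
}


def describe_au_policy(values):
    """Summarize what a set of Automatic Updates policy values implies.

    `values` is a plain dict of registry value names to integers; this
    hypothetical helper does not touch the registry.
    """
    option = values.get("AUOptions")
    mode = AU_OPTIONS.get(option, "unknown or not configured")
    if option is None:
        reboot = "cannot tell, policy not configured"
    elif option != 4:
        reboot = "no scheduled install, so no scheduled automatic restart"
    elif values.get("NoAutoRebootWithLoggedOnUsers") == 1:
        reboot = "automatic restart suppressed while a user is logged on"
    else:
        reboot = "automatic restart possible after the scheduled install"
    return {"mode": mode, "reboot": reboot}


if __name__ == "__main__":
    print(describe_au_policy({"AUOptions": 4}))
    print(describe_au_policy({"AUOptions": 3}))
```

Note that `NoAutoRebootWithLoggedOnUsers` only helps while someone is actually logged on, so for headless compute nodes a policy that notifies before installing is probably the more useful thing to ask IT for.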

If you can't solve the issue with your IT department, then remove the computers from the domain?