How can I tell SGE to stop assigning work to a compute node?

I want to mark a node (or set of nodes) as "offline" in the sense that I want Sun Grid Engine to stop assigning new work to them. This would be for some kind of maintenance work on the nodes themselves. The nodes should finish whatever work they've been assigned, and then just go into some kind of idle ("offline") state. I've been hunting through the qconf documentation, but I can't find this use case in any howto.


Solution 1:

Searching about has led me to the qmod utility. I've done a simple test of

qmod -d QUEUENAME.q@MACHINENAME

and this seems to work, although I haven't yet tried it with jobs running. The qstat output changes to show that the queue instance on that node is disabled: a "d" flag appears in its state.

qmod -e QUEUENAME.q@MACHINENAME

will enable the machine again.

On our cluster, the machines are named worker-##-##, where the two numbers are the rack number and the rank number. We run only one master queue, called "all.q", and the machines are listed with a ".local" suffix in the qstat output. So the above command ends up being

qmod -d all.q@worker-09-09.local

to take the machine at rack 9, rank 9 out of the queueing rotation.
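
For a set of nodes, the same command can be wrapped in a small loop. This is only a rough sketch, not something I've run against a busy cluster: the set_nodes function and the QMOD and QUEUE variables are names I made up for illustration, and the host names are assumed to follow the worker-##-## pattern above. By default it only prints the qmod commands (a dry run); set QMOD=qmod to actually run them.

```shell
#!/bin/sh
# Sketch: disable or enable one queue's instances on several hosts.
# QMOD defaults to "echo qmod", so the commands are printed, not run.
# Set QMOD=qmod in the environment to execute them for real.
QMOD="${QMOD:-echo qmod}"
# Queue name; "all.q" matches the single-queue setup described above.
QUEUE="${QUEUE:-all.q}"

# set_nodes ACTION HOST...
#   ACTION is -d (disable) or -e (enable), passed straight to qmod.
set_nodes() {
    action="$1"
    shift
    for host in "$@"; do
        # $QMOD is left unquoted on purpose so "echo qmod" splits into two words.
        $QMOD "$action" "${QUEUE}@${host}"
    done
}

# Hypothetical host names, for illustration only.
set_nodes -d worker-09-09.local worker-09-10.local
```

Once the maintenance is done, calling set_nodes -e with the same host list should put the machines back into rotation.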