Failed line of communication… rebooting for morons?
I work at a large corporation where we use many legacy systems. To name a few: HP-UX 10.20, Windows 2000, VMEbus systems, systems designed 30+ years ago that do not communicate via TCP/IP, and more.
Throughout the work week we are constantly plagued with these legacy systems losing communication with one another. Usually, rebooting a system to try to restore communication is the last resort. It has become a common belief that rebooting a system is just a “fix-all” for ignorant co-workers. I was wondering whether there is ever validity to rebooting a system (legacy or not) to restore a failed line of communication.
I realize renewing IP addresses in Windows should effectively restore network communication, but is there the possibility of a deeper problem in the underlying operating system, something that could become corrupt and need a reboot? A failed socket that times out, doesn’t close, or maybe doesn’t try to reconnect?
It seems to me rebooting would be a viable resolution when you have such a complex network of mismatched systems. But (at least at my workplace) when a system is rebooted and everything magically starts working again, it’s always a “coincidence”, never a solution. Thoughts?
The answer is "it depends".
Rebooting can fix issues or make it easier to detect issues by providing better logging or easily observed problems. (Hmmm... rebooting shouldn't take 10 minutes)
Resorting to reboots as a standard troubleshooting technique is a bad practice, however. Somebody needs to understand how things are disconnected so you can triage, isolate the failed components and start troubleshooting the problem.
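As a rough illustration of the "isolate the failed component" step, here is a minimal Python sketch that classifies a TCP connection attempt instead of just reporting "it's down". It only applies to the systems that actually speak TCP/IP, and the function name `probe_tcp` and the result labels are my own, not anything from a standard tool.

```python
import socket


def probe_tcp(host, port, timeout=3.0):
    """Try one TCP connection and classify the outcome.

    Returns one of: "open", "refused", "timeout", "error".
    The labels are illustrative, not a standard.
    """
    try:
        # create_connection resolves the host and connects with a timeout.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on that port:
        # this points at the service, not the network path.
        return "refused"
    except socket.timeout:
        # No answer at all: host down, packets filtered, or a hung stack.
        return "timeout"
    except OSError:
        # Anything else (no route, name resolution failure, ...).
        return "error"
```

A "refused" result suggests restarting the service is worth trying before anything bigger, while a "timeout" from several different clients points further down the stack.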
I hate to say it, but it may be useful to look at something like ITIL, particularly incident and problem management. It may help you or your management reorganize your support system to actually function in a rational way.
Yeah, "restart and call me if it still doesn't work" is often the first line of troubleshooting for sysadmins or helpdesk staff who are out of ideas. I'll cop to using this as well, but telling someone to reboot a server is a completely different exercise than having a user reboot their workstation, depending of course on what the server is used for.
I hate to give this advice, but speaking pragmatically, sometimes for true legacy systems that you're not at liberty to replace, if a reboot fixes the problem then it's better to just do it as needed and work towards justifying an upgrade than to extend downtime unnecessarily.
My thought on educating people is to take the least intrusive path first.
As you said, rebooting should be the LAST option.
So least intrusive would be more like:
- Restarting the communications service
- Restarting the application service
- Restarting the communications layer of the application (if one exists)
- Etc.
This applies to more than just old systems; it applies to any troubleshooting. One day one of those systems will not come back up.
Cycling through the different parts of a system might also let you find what is actually causing the failure, and it gives you a faster fix since an entire reboot is not needed.
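To make the "least intrusive first" idea concrete, here is a small Python sketch of an escalation ladder. Everything in it is hypothetical: `escalate`, the step names, and the health check are placeholders you would wire up to whatever restart commands and connectivity tests your own systems have.

```python
from typing import Callable, List, Optional, Tuple

# A step is a human-readable name plus a function that performs it.
Step = Tuple[str, Callable[[], None]]


def escalate(steps: List[Step],
             is_healthy: Callable[[], bool],
             log: Callable[[str], None] = print) -> Optional[str]:
    """Run steps from least to most intrusive until communication is back.

    Returns None if nothing was wrong to begin with, otherwise the name
    of the step after which the health check first passed.
    """
    if is_healthy():
        return None
    for name, action in steps:
        log(f"trying: {name}")
        action()
        if is_healthy():
            # Recording which step worked is the useful part: it tells
            # you where the fault actually lives.
            log(f"restored by: {name}")
            return name
    raise RuntimeError("no step restored communication")
```

With the reboot placed last in the list, it still gets used when needed, but you also end up with a record of how often the cheaper steps were enough.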
With fail-over clusters (I use RedHat Cluster) rebooting is a good thing for a few reasons:
It's part of the high availability protocol as "STONITH" (Shoot The Other Node In The Head), whereby an unresponsive host is forcibly disconnected/rebooted. You had better make sure it's properly set up and that the node is going to reboot in working order. When something goes wrong, you can find yourself rebooting machines several times over unless the problem is obvious.
The system is optimized around a node going down, but it's not very good -- in fact it sucks -- at figuring out that a node is merely misbehaving. Having a service relocate to another node takes a few seconds. If the current node is misbehaving, pulling the plug on it is the surest and fastest way to trigger that relocation; otherwise the cluster could try to do things too nicely and wait for an ACK that will never come.
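The reasoning behind fencing can be sketched in a few lines of Python. This is not how Red Hat Cluster implements it; `HeartbeatMonitor`, its timeout, and the `fence` callback are all made up for illustration. The point is only that a node that goes silent past a deadline gets fenced outright rather than being asked politely and waited on.

```python
import time


class HeartbeatMonitor:
    """Fence any node whose last heartbeat is older than timeout_s."""

    def __init__(self, timeout_s, fence, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.fence = fence        # callback, e.g. cut power to the node
        self.clock = clock        # injectable so it can be tested
        self.last_seen = {}
        self.fenced = set()

    def heartbeat(self, node):
        # A node that reports in again is considered back in service.
        self.last_seen[node] = self.clock()
        self.fenced.discard(node)

    def check(self):
        """Fence every silent node once; return the newly fenced nodes."""
        now = self.clock()
        newly_fenced = []
        for node, seen in self.last_seen.items():
            if node not in self.fenced and now - seen > self.timeout_s:
                self.fence(node)
                self.fenced.add(node)
                newly_fenced.append(node)
        return newly_fenced
```

Note that nothing here waits for the silent node to acknowledge anything, which is exactly why a hard power-off is preferred over a graceful shutdown request.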
Because your question spans multiple operating systems, there cannot be a single correct answer.
I can say this for Windows 2000 systems: I've run thousands of them, and can only recall a handful of cases where communication had failed AND the system was not completely hung. Often a simple disable/re-enable of the incommunicado NIC would solve this, followed by a driver update and/or replacement of the NIC with something less cheesy.
(IOW, I've only seen it with old buggy drivers and/or off-brand NICs.)