Diagnose Network Faults
Disclaimer: I am a developer, not a system admin, please be gentle.
Where I work, we are having a lot of intermittent network problems. Sometimes DNS fails but the servers can still be reached by IP address; at other times access by IP fails too. As far as we can tell, nothing has been changed on the servers, firewalls, managed switches, etc. Frustratingly, the faults don't affect all users all of the time, but as far as we can tell every user has had problems at some point.
- The servers are not reporting any faults.
- The physical network seems fine (it's a small site).
- The firewalls aren't reporting anything out of the ordinary.
- The managed switches have passwords that are only stored in the sysadmin's head (an issue, we know!)
Our in-house sysadmin is unavailable at the moment, so it's left to the developers to try and figure something out.
So, given that I have almost no clue, where do I start?
Update
I've tried the tracert/ping combo and it looks like it's an internal issue. The external stuff seems to be fairly consistent, but the internal bits are proving to be flaky.
Run a traceroute to an internet site you know will be up, e.g. google.com. Then run a constant ping against three targets: your router, your router's default gateway, and google.com.
This should at least tell you whether you're losing any packets along the way, and whether the problem is with your internet connection or your internal network.
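If you'd rather not babysit three ping windows, here is a rough Python sketch of the same idea. It is only an illustration, not a tested tool: the addresses in TARGETS are placeholders you would need to replace, the helper names (ping_once, loss_percent, locate_fault) and the 5% threshold are my own inventions, and it just shells out to the system ping command. The ping flags differ between platforms, so check them against your OS before trusting the numbers.

```python
import platform
import subprocess
import time

# Placeholder addresses (assumptions): replace with your own router,
# the router's default gateway, and an external host you trust to be up.
TARGETS = {
    "router": "192.168.1.1",
    "gateway": "203.0.113.1",
    "external": "google.com",
}


def ping_once(host, timeout_s=2):
    """Send one ping via the system ping command; True if it exits with 0."""
    if platform.system() == "Windows":
        # -n count, -w timeout in milliseconds
        cmd = ["ping", "-n", "1", "-w", str(timeout_s * 1000), host]
    else:
        # -c count, -W timeout (seconds on Linux; other Unixes may differ)
        cmd = ["ping", "-c", "1", "-W", str(timeout_s), host]
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s + 3,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0


def loss_percent(outcomes):
    """Percentage of failed pings in a list of True/False outcomes."""
    if not outcomes:
        return 0.0
    return 100.0 * outcomes.count(False) / len(outcomes)


def locate_fault(loss, threshold=5.0):
    """Return the nearest target whose loss exceeds the threshold, or None.

    Loss at the router points at the internal network; loss that first
    appears at the gateway or the external host points further out.
    """
    for name in ("router", "gateway", "external"):
        if loss[name] > threshold:
            return name
    return None


def monitor(rounds=60, interval_s=1.0):
    """Ping every target once per round and return loss percent per target."""
    history = {name: [] for name in TARGETS}
    for _ in range(rounds):
        for name, host in TARGETS.items():
            history[name].append(ping_once(host))
        time.sleep(interval_s)
    return {name: loss_percent(outcomes) for name, outcomes in history.items()}


if __name__ == "__main__":
    loss = monitor()
    for name, pct in loss.items():
        print(f"{name}: {pct:.1f}% loss")
    print("first lossy hop:", locate_fault(loss))
```

Because the faults are intermittent, leave it running for a good while (a larger rounds value) from more than one machine, and compare which hop starts dropping first.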
After that, post back with what you find.
It sounds like something is dropping connections somewhere.
The best advice, though, would be to track down your sysadmin; that's why he/she is there...
It kinda sounds like you've got either a bad interface on a switch/server, or a rogue traffic source on the network. Without the ability to capture some spanned traffic or see interface stats, actually tracking either of those down would be nigh impossible. Have you added any new devices lately? Especially, in my personal order of suspicious devices: network devices, servers attached to more than one network, printers.
However, a lone sysadmin that has gone on vacation and left the shop with no visibility into the network is a very bad situation. Some things to discuss once he/she returns:
- monitoring - there are numerous free/OSS monitoring solutions for everything from per port statistics (Cacti) to in-depth monitoring of services (Nagios). It sounds like you need both.
- documentation - if you have only one person qualified to administer the network, then that person must document, document, document! In addition, it must be in a medium that is easily accessible even if the network is down! This includes securely storing the passwords, even if it's hardcopy stored in a safe, so that the company does not suffer even if the sysadmin gets run over by the black bus.
- notification - once you've implemented a decent monitoring solution, you must decide on an escalation plan so that you're not sending notifications to only one person.
I was the sole network administrator for a multi-million dollar company for over 7 years (I have minions now =) and on-call 24/7/365 for pretty much that entire time and can say, pretty definitively, that if you've made yourself the only person that can do a certain thing, you can rest assured that you will be called whenever that thing needs doing.
The one thing you can 100% rely on is that whatever can break when you're the only one who can fix it is the thing that is absolutely guaranteed to break when you leave for vacation.