Failure rate statistics for Cisco switches
We have a network with a core of old and trusty Cisco switches. I have a gut feeling that they could go down at any time, and that would mean a crisis, because this is the kind of infrastructure that was built entirely out of single points of failure. I know, you'd think some redundancy would have slipped in somewhere, but no.
I'm looking for statistical data showing the average life span of network hardware. I need hard data to strengthen my argument.
If it helps, the oldest Cisco switches are from the 3500XL family. Several of them died within a short interval a few years ago.
Not a direct answer, but:
Do they sound like they are failing? This may seem silly, but older switches might have fans that sound really bad, and that is a strong argument right there: "Come listen to this."
Compromise: a Cold Spare
Another way to approach this would be to argue that even if the downtime doesn't justify full redundancy to them, the price of a single switch sitting on the rack as a cold spare might. That way, if one does fail, your recovery time will be much shorter.
Joel Spolsky mentioned this in one of the Stack Overflow podcasts, something like "Recovery time is more important than how often it goes down." The argument, as I remember it, was that downtime isn't a big deal if you are back up in a couple of minutes, but it is if you aren't back up for half a day. A smart way to look at it, in my opinion.
So your new argument might be that since switches are not that expensive, it is cost-effective for the business to keep at least one cold spare, because it can make the difference between being down for minutes and being down for a whole day.
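To put that argument in numbers, here is a minimal back-of-the-envelope sketch in Python. Every figure in it (outage cost per hour, failure rate, spare price, recovery times) is an invented placeholder to show the shape of the calculation, not data about your network:

    # Rough comparison of expected outage cost with and without a cold spare.
    # All numbers are made-up placeholders -- substitute your own.

    HOURLY_OUTAGE_COST = 1000.0  # assumed cost of the network being down, $/hour
    FAILURES_PER_YEAR = 0.5      # assumed failure rate of the aging core switch
    SPARE_PRICE = 800.0          # assumed price of one spare switch

    MTTR_NO_SPARE = 12.0         # hours: half a day to source and set up a replacement
    MTTR_COLD_SPARE = 0.25       # hours: minutes to swap in a pre-configured spare

    def annual_outage_cost(mttr_hours):
        # expected yearly cost = failures/year * hours per failure * $/hour
        return FAILURES_PER_YEAR * mttr_hours * HOURLY_OUTAGE_COST

    no_spare = annual_outage_cost(MTTR_NO_SPARE)
    with_spare = annual_outage_cost(MTTR_COLD_SPARE) + SPARE_PRICE

    print("without a spare: $%.0f/year expected" % no_spare)
    print("with a cold spare: $%.0f first year, spare included" % with_spare)

With these placeholder numbers the spare pays for itself well within the first failure; the point is that the spare's price is a one-off cost while the long-recovery cost recurs with every failure.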
Also, if you win your argument this way, be sure to pre-configure the cold spare ;-)
First, I'll say that adding redundancy to a network makes it substantially more complex, and if you don't know what you're doing, it may not make the network any more reliable. Sometimes getting lucky is a good plan. It's not the best plan, but to the person writing the checks, it may seem like the best way to go.
I'll make the assumption that you've got a whole shaky pile of these things, not just a couple.
As an extension of Kyle's suggestion of getting a spare, what about upgrading a few of the devices (putting the replacements into a VRRP/HSRP-style redundancy config) and then setting aside the ones pulled from service as spares for the others still in service?
Also, they do have backups of the configs for all their devices, right? That's priority #1. You can, in a pinch, substitute one L2/L3 device for another, but only if you know what the old thing was doing just before it failed.
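If those backups don't exist yet, even a small script can collect them. The sketch below is one way to do it, assuming the switches accept SSH and the netmiko library is installed; the host addresses and credentials are placeholders, and very old gear like a 3500XL may only speak telnet (netmiko's "cisco_ios_telnet" device type):

    # Pull running configs from a list of switches and save them to disk.
    # Assumes netmiko (pip install netmiko); hosts/credentials are placeholders.
    from netmiko import ConnectHandler

    SWITCHES = ["10.0.0.1", "10.0.0.2"]  # placeholder management addresses

    for host in SWITCHES:
        conn = ConnectHandler(
            device_type="cisco_ios",     # use "cisco_ios_telnet" for telnet-only gear
            host=host,
            username="backup",           # placeholder credentials
            password="secret",
        )
        config = conn.send_command("show running-config")
        conn.disconnect()
        with open(host + ".cfg", "w") as f:
            f.write(config)
        print("saved config for", host)

Run something like that from cron and you'll always know what the old thing was doing just before it failed.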